diff options
Diffstat (limited to 'glep-0044.rst')
-rw-r--r-- | glep-0044.rst | 327 |
1 files changed, 327 insertions, 0 deletions
diff --git a/glep-0044.rst b/glep-0044.rst new file mode 100644 index 0000000..17c4867 --- /dev/null +++ b/glep-0044.rst @@ -0,0 +1,327 @@ +GLEP: 44 +Title: Manifest2 format +Version: $Revision$ +Last-Modified: $Date$ +Author: Marius Mauch <genone@gentoo.org>, +Status: Final +Type: Standards Track +Content-Type: text/x-rst +Created: 04-Dec-2005 +Post-History: 06-Dec-2005, 23-Jan-2006, 3-Sep-2006 + + +Abstract +======== + +This GLEP proposes a new format for the Portage Manifest and digest file system +by unifying both filetypes into one to improve functional and non-functional +aspects of the Portage Tree. + + +Motivation +========== + +Please see [#reorg-thread]_ for a general overview. +The main long term goals of this proposal are to: + +- Remove the tiny digest files from the tree. They are a major annoyance as on a + typical configuration they waste a lot of disk space and the simple transmission + of the names for all digest files during a ``emerge --sync`` needs a substantial + amount of bandwidth. +- Reduce redundancy when multiple hash functions are used +- Remove potential for checksum collisions if a file is recorded in more than one + digest file +- Difference between filetypes for a more flexible verification system + + +Specification +============= + +The new Manifest format would change the existing format in the following ways: + +- Addition of a filetype specifier, currently planned are + + * ``AUX`` for files directly used by ebuilds (e.g. patches or initscripts), + located in the ``files/`` subdirectory + + * ``EBUILD`` for all ebuilds + + * ``MISC`` for files not directly used by ebuilds like ``ChangeLog`` or + ``metadata.xml`` files + + * ``DIST`` for release tarballs recorded in the ``SRC_URI`` variable of an ebuild, + these were previously recorded in the digest files + + Future portage improvements might extend this list (for example with types + relevant for eclasses or profiles) + +- Only have one line per file listing all information instead of one line per + file and checksum type + +- Remove the separated digest-* files in the ``files/`` subdirectory + +Each line in the new format has the following format: + +:: + + <filetype> <filename> <filesize> <chksumtype1> <chksum1> ... <chksumtypen> <chksumn> + + +However theses entries will be stored in the existing Manifest files. + +An `actual example`__ for a (pure) Manifest2 file.. + +.. __: glep-0044-extras/manifest2-example.txt + + +Compatibility Entries +--------------------- + +To maintain compatibility with existing portage versions a transition period after +is the introduction of the Manifest2 format is required during which portage +will not only have to be capable of using existing Manifest and digest files but +also generate them in addition to the new entries. +Fortunately this can be accomplished by simply mixing old and new style entries +in one file for the Manifest files, existing portage versions will simply ignore +the new style entries. For the digest files there are no new entries to care +about. + +Scope +----- + +It is important to note that this proposal only deals with a change of the +format of the digest and Manifest system. + +It does not expand the scope of it to cover eclasses, profiles or anything +else not already covered by the Manifest system, it also doesn't affect +the Manifest signing efforts in any way (though the implementations of both +might be coupled). + +Also while multiple hash functions will become standard with the proposed +implementation they are not a specific feature of this format [#multi-hash-thread]_. + +Number of hashes +---------------- + +While using multiple hashes for each file is a major feature of this proposal +we have to make sure that the number of hashes listed is limited to avoid +an explosion of the Manifest size that would revert the main benefit of this proposal +(reducing tree size). Therefore the number of hashes that will be generated +will be limited to three different hash functions. For compatibility though we +have to rely on at least one hash function to always be present, this proposal +suggest to use SHA1 for this purpose (as it is supposed to be more secure than MD5 +and currently only SHA1 and MD5 are directly available in python, also MD5 doesn't +have any benefit in terms of compatibility). + +Rationale +========= + +The main goals of the proposal have been listed in the `Motivation`_, here now +the explanation why they are improvements and how the proposed format will +accomplish them. + +Removal of digest files +----------------------- + +Normal users that don't use a "tuned" filesystem for the portage tree are +wasting several dozen to a few hundred megabytes of disk space with the current +system, largely caused by the digest files. +This is due to the filesystem overhead present in most filesystems that +have a standard blocksize of four kilobytes while most digest files are under +one kilobyte in size, so this results in approximately a waste of three kilobytes +per digest file (likely even more). At the time of this writing the tree contains +roughly 22.000 digest files, so the overall waste caused by digest files is +estimated at about 70-100 megabytes. +Furthermore it is assumed that this will also reduce the disk space wasted by +the Manifest files as they now contain more content, but this hasn't been +verified yet. + +By unifying the digest files with the Manifest these tiny files are eliminated +(in the long run), reducing the apparent tree size by about 20%, benefiting +both users and the Gentoo infrastructure. + +Reducing redundancy +------------------- + +When multiple hashes are used with the current system +both the filename and filesize are repeated for every checksum type used as each +checksum is standalone. However this doesn't add any functionality and is +therefore useless, so the new format removes this redundancy. +This is a theoretical improvement at this moment as only one hash function is in +use, but expected to change soon (see [#multi-hash-thread]_). + +Removal of checksum collisions +------------------------------ + +The current system theoretically allows for a ``DIST`` type file to be recorded +in multiple digest files with different sizes and/or checksums. In such a case +one version of a package would report a checksum violation while another one +would not. This could create confusion and uncertainty among users. +So far this case hasn't been observed, but it can't be ruled out with the +existing system. +As the new format lists each file exactly once this would be no longer possible. + +Flexible verification system +---------------------------- + +Right now portage verifies the checksum of every file listed in the Manifest +before using any file of the package and all ``DIST`` files of an ebuild +before using that ebuild. This is unnecessary in many cases: + +- During the "depend" phase (when the ebuild metadata is generated) only + files of type ``EBUILD`` are used, so verifying the other types isn't + necessary. Theoretically it is possible for an ebuild to include other + files like those of type ``AUX`` at this phase, but that would be a + major QA violation and should never occur, so it can be ignored here. + It is also not a security concern as the ebuild is verified before parsing + it, so each manipulation would show up. + +- Generally files of type ``MISC`` don't need to be verified as they are + only used in very specific situations, aren't executed (just parsed at most) + and don't affect the package build process. + +- Files of type ``DIST`` only need to be verified directly after fetching and + before unpacking them (which often will be one step), not every time their + associated ebuild is used. + + +Backwards Compatibility +======================= + +Switching the Manifest system is a task that will need a long transition period +like most changes affecting both portage and the tree. In this case the +implementation will be rolled out in several phases: + +1. Add support for verification of Manifest2 entries in portage + +2. Enable generation of Manifest2 entries in addition to the current system + +3. Ignore digests during ``emerge --sync`` to get the size-benefit clientside. + This step may be omitted if the following steps are expected to follow soon. + +4. Disable generation of entries for the current system + +5. Remove all traces of the current system from the tree (serverside) + +Each step has its own issues. While 1) and 2) can be implemented without any +compatibility problems all later steps have a major impact: + +- Step 3) can only be implemented when the whole tree is Manifest2 ready + (ideally speaking, practically the requirement will be more like 95% coverage + with the expectation that for the remaining 5% either bugs will be filed after + step 3) is completed or they'll be updated at step 5). + +- Steps 4) and 5) will render all portage versions without Manifest2 support + basically useless (users would have to regenerate the digest and Manifest + for each package before being able to merge it), so this requires a almost + 100% coverage of the userbase with Manifest2 capable portage versions + (with step 1) completely implemented). + +Another problem is that some steps affect different targets: + +- Steps 1) and 3) target portage versions used by users + +- Steps 2) and 4) target portage versions used by devs + +- Step 5) targets the portage tree on the cvs server + +While it is relatively easy to get all devs to use a new portage version this is +practically impossible with users as some don't update their systems regularly. +While six months are probably sufficient to reach a 95% coverage one year is +estimated to reach an almost-complete coverage. All times are relative to the +stable-marking of a compatible portage version. + +No timeframe for implementation is presented here as it is highly dependent +on the completion of each step. + +In summary it can be said that while a full conversion will take over a year +to be completed due to compatibility issues mentioned above some benefits of the +system can selectively be used as soon as step 2) is completed. + + +Other problems +============== + +Impacts on infrastructure +------------------------- + +While one long term goal of this proposal is to reduce the size of the tree +and therefore make life for the Gentoo Infrastructure easier this will only +take effect once the implementation is rolled out completely. In the meantime +however it will increase the tree size due to keeping checksums in both formats. +It's not possible to give a usable estimate on the degree of the increase as +it depends on many variables such as the exact implementation timeframe, +propagation of Manifest2 capable portage versions among devs or the update +rate of the tree. It has been suggested that Manifest files that are not gpg +signed could be mass converted in one step, this could certainly help but only +to some degree (according to a recent research [#gpg-numbers]_ about 40% of +all Manifests in the tree are signed, but this number hasn't been verified). + + +Reference Implementation +======================== + +A patch for a prototype implementation of Manifest2 verification and partial +generation has been posted at [#manifest2-patch]_, it will be reworked before +being considered for inclusion in portage. However it shows that adding support +for verification is quite simple, but generation is a bit tricky and will +therefore be implemented later. + + +Options +======= + +Some things have been considered for this GLEP but aren't part of the proposal +yet for various reasons: + +- timestamp field: the author has considered adding a timestamp field for + each entry to list the time the entry was created. However so far no practical + use for such a feature has been found. + +- convert size field into checksum: Another idea was to treat the size field + like any other checksum. But so far no real benefit (other than a slightly + more modular implementation) for this has been seen while it has several + drawbacks: For once, unlike checksums, the size field is definitely required + for all ``DIST`` files, also it would slightly increase the length of + each entry by adding a ``SIZE`` keyword. + +- removal of the ``MISC`` type: It has been suggested to completely drop + entries of type ``MISC``. This would result in a minor space reduction + (its rather unlikely to free any blocks) but completely remove the ability + to check these files for integrity. While they don't influence portage + or packages directly they can contain viable information for users, so + the author has the opinion that at least the option for integrity checks + should be kept. + +Credits +======= + +Thanks to the following persons for their input on or related to this GLEP +(even though they might not have known it): +Ned Ludd (solar), Brian Harring (ferringb), Jason Stubbs (jstubbs), +Robin H. Johnson (robbat2), Aron Griffis (agriffis) + +Also thanks to Nicholas Jones (carpaski) to make the current Manifest system +resistent enough to be able to handle this change without too many transition +problems. + +References +========== + +.. [#reorg-thread] http://thread.gmane.org/gmane.linux.gentoo.devel/21920 + +.. [#multi-hash-thread] http://thread.gmane.org/gmane.linux.gentoo.devel/33434 + +.. [#gpg-numbers] gentoo-core mailing list, topic "Gentoo key signing practices + and official Gentoo keyring", Message-ID <20051117075838.GB15734@curie-int.vc.shawcable.net> + +.. [#manifest2-patch] http://thread.gmane.org/gmane.linux.gentoo.portage.devel/1374 + +.. [#manifest2-example] http://www.gentoo.org/proj/en/glep/glep-0044-extras/manifest2-example + +Copyright +========= + +This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 +Unported License. To view a copy of this license, visit +http://creativecommons.org/licenses/by-sa/3.0/. |