---
GLEP: 25
Title: Distfile Patching Support
Author: Brian Harring <ferringb@gentoo.org>
Type: Standards Track
Status: Deferred
Version: 1
Created: 2004-03-06
Last-Modified: 2014-01-17
Post-History: 2004-04-04, 2004-11-11
Content-Type: text/x-rst
---

Abstract
========

The intention of this GLEP is to propose the creation of patching support for
Portage, and iron out the implementation details.

Status
======

Timed out


Motivation
==========

Reduce the bandwidth load placed on our mirrors by decreasing the number of
bytes transferred when upgrading between versions.  A side benefit is a
significantly smaller download requirement for users lacking broadband.

Binary patches vs GNUDiff patches
=================================

Most people are familiar with diff patches (unified diff, for example)- this
glep specifically proposes the use of an actual binary differencer.  The
reason is that diff patches are line based- change a single character in a
line, and the whole line must be included in the patch.  Binary differencers
work at the byte level and encode just the changed byte.  In that respect
binary patches are often much more efficient than diff patches.

Further, a unified patch can be reversed because the diff includes **both**
the original line and the modified line.  The author isn't aware of any
binary differencer that is able to create patches that can be reversed-
basically they're unidirectional: a generated patch can be used either to
upgrade or to downgrade the version, not both.  The plus side of this
limitation is a significantly decreased patch size.

The choice of binary patches over diff patches pretty much comes down to the
fact they're smaller- for example, a kdelibs binary patch for 3.1.4->3.1.5 is
75kb, while the equivalent diff patch is 123kb and is unable to result in a
correct md5 [1]_.

Currently, this glep is proposing only the usage of binary patches- that's not
to say (with a fair amount of work) it couldn't be extended to support
standard diffs.

Rationale
=========

The difference between source releases typically isn't very large, especially
for minor releases.  As an example, kdelibs-3.1.4.tar.bz2 is 10.53 MB, and
kdelibs-3.1.5.tar.bz2 is 10.54 MB.  A bzip2'ed patch between those versions is
75.6 kb [2]_, less than 1% the size of 3.1.5's tbz2.

Specification
=============

Quite a few sections of gentoo are affected- mirroring, the portage tree, and
portage itself.

Additions to the tree
---------------------

For adding patch info into the tree, this glep proposes a global patch list
(stored in profiles as patches.global), and individual patch lists stored in
relevant package directories (named patches).  Using the kernel packages as an
example, a global list of patches enables us to create a patch once, add an
entry, and have all kernel packages benefit from that single entry.  Both
patches.global, and individual package patch files share the same format:

::

	MD5 md5-value patch-url size MD5 md5-value ref-file size UMD5 md5-value new-file size

Anyone who knows the digest file layout will recognize this: each group is
essentially chksum type, value, filename, size.  The UMD5 chksum type is just
the md5/size of the uncompressed file- so if the UMD5 were for a bzip2
compressed file, it would be the md5 value/size of the uncompressed file.
An example:

::

	MD5 ccd5411b3558326cbce0306fcae32e26 http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.patch.bz2 75687 MD5 82c265de78d53c7060a09c5cb1a78942 kdelibs-3.1.4.tar.bz2 10537433 UMD5 0b1908a51e739c07ff5a88e189d2f7a9 kdelibs-3.1.5.tar.bz2 48056320

In the above example, the md5sum of
http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.patch.bz2 is
calculated, compared to the stored value, and then the file size is checked.
The one difference is the UMD5 checksum type- the md5 value and the size are
specific to the *uncompressed* file.  For cases where the patch resides on
one of our mirrors, the patch filename alone would be sufficient.

Finally, note that this is a unidirectional patch- using the above example,
kdelibs-3.1.4-3.1.5 can **only** be used to upgrade from 3.1.4 to 3.1.5, not
in reverse (originally explained in `Binary patches vs GNUDiff patches`_).
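
As an illustration of the entry format described above, a minimal parsing
sketch follows.  The names ``FileRef``, ``PatchEntry`` and
``parse_patch_entry`` are hypothetical and not part of portage, and the sketch
assumes that no field contains whitespace.

.. code-block:: python

    from collections import namedtuple

    # Hypothetical record types, for illustration only.
    FileRef = namedtuple("FileRef", "chksum_type md5 name size")
    PatchEntry = namedtuple("PatchEntry", "patch ref new")


    def parse_patch_entry(line):
        """Split one patch list line into three (type, md5, name, size) groups."""
        fields = line.split()
        if len(fields) != 12:
            raise ValueError("expected 12 fields, got %d" % len(fields))
        groups = []
        for i in range(0, 12, 4):
            chksum_type, md5, name, size = fields[i:i + 4]
            if chksum_type not in ("MD5", "UMD5"):
                raise ValueError("unknown checksum type: %s" % chksum_type)
            groups.append(FileRef(chksum_type, md5, name, int(size)))
        # Order follows the format line: patch, reference file, new file.
        return PatchEntry(*groups)


    EXAMPLE = (
        "MD5 ccd5411b3558326cbce0306fcae32e26 "
        "http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.patch.bz2 75687 "
        "MD5 82c265de78d53c7060a09c5cb1a78942 kdelibs-3.1.4.tar.bz2 10537433 "
        "UMD5 0b1908a51e739c07ff5a88e189d2f7a9 kdelibs-3.1.5.tar.bz2 48056320"
    )

    entry = parse_patch_entry(EXAMPLE)

Here ``entry.ref`` describes the file the patch applies to, and ``entry.new``
carries the UMD5 value and size of the uncompressed result.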

Portage Implementation
----------------------

This glep proposes that patching support be (at this stage) optional-
specifically, enabled via FEATURES="patching".

Fetching
''''''''

When patching is enabled, the global patch list and the package's patch list
are read.  From there, portage determines which files could be used as a base
for patching to the desired file, and whether it's actually worth patching
(it isn't when the target file is smaller than the sum of the patches
needed).  Any patches to be used are fetched and md5 verified.
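
One possible way to make that decision is sketched below.  This is only an
illustration under stated assumptions: the function names are hypothetical,
patch entries are reduced to ``(patch_name, ref_file, new_file, patch_size)``
tuples, and ``target_size`` is assumed to be the compressed download size
taken from the digest file.  The search is a standard shortest-path
(Dijkstra) walk, which this glep does not mandate.

.. code-block:: python

    import heapq


    def cheapest_patch_chain(patches, local_files, target):
        """Return (total_size, [patch names]) for the cheapest chain, or None.

        Patches are unidirectional: each leads from ref_file to new_file only.
        """
        edges = {}
        for name, ref, new, size in patches:
            edges.setdefault(ref, []).append((new, name, size))
        # Dijkstra's algorithm, starting from every file already in distfiles.
        heap = [(0, f, []) for f in sorted(local_files)]
        heapq.heapify(heap)
        seen = set()
        while heap:
            cost, current, chain = heapq.heappop(heap)
            if current == target:
                return cost, chain
            if current in seen:
                continue
            seen.add(current)
            for new, name, size in edges.get(current, []):
                if new not in seen:
                    heapq.heappush(heap, (cost + size, new, chain + [name]))
        return None


    def worth_patching(patches, local_files, target, target_size):
        """Return the patch chain only if it is smaller than the full download."""
        found = cheapest_patch_chain(patches, local_files, target)
        if found is None or found[0] >= target_size:
            return None
        return found[1]

A return value of ``None`` means the full file should be fetched as usual.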

Reconstruction
''''''''''''''

Upon fetching and md5 verification of patch(es), the desired file is
reconstructed.  Assuming reconstruction didn't return any errors, the target
file has its uncompressed md5sum calculated and verified, then is recompressed
and the compressed md5sum calculated.  At this point, if the compressed md5
matches the md5 stored in the tree, then portage transfers the file into
distfiles, and continues on its merry way.

If the compressed md5 is different from the tree's value, then the (proposed)
md5 database is updated with new compressed md5.  Details of this database
(and the issue it addresses) follow.
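
The verification flow described in this section might look roughly like the
following sketch.  It assumes a bzip2 compressed target, operates on in-memory
bytes for brevity, and uses a plain ``set`` of ``(md5, size)`` pairs as a
stand-in for the proposed database; ``check_reconstruction`` is a hypothetical
name.

.. code-block:: python

    import bz2
    import hashlib


    def md5_and_size(data):
        """Return the (hex md5, length) pair used throughout the patch lists."""
        return hashlib.md5(data).hexdigest(), len(data)


    def check_reconstruction(raw, umd5, usize, tree_md5, tree_size, alt_pairs):
        """Verify reconstructed (uncompressed) data, then recompress it.

        Raises ValueError if the uncompressed md5/size do not match.  If the
        recompressed md5/size differ from the tree's values, the new pair is
        recorded in alt_pairs.  Returns the compressed bytes.
        """
        if md5_and_size(raw) != (umd5, usize):
            raise ValueError("uncompressed md5/size mismatch")
        compressed = bz2.compress(raw, 9)
        pair = md5_and_size(compressed)
        if pair != (tree_md5, tree_size):
            alt_pairs.add(pair)
        return compressed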

Compressed MD5sums: 
'''''''''''''''''''

There will be instances where a file is reconstructed perfectly, recompressed,
and the recompressed md5sum differs from what is stored in the tree- the
problem is that the md5sum of a compressed file is inherently tied to the
compressor version/options used to compress the original source.  

=====================
The Problem in Detail
=====================

A good example of this problem is related to bzip2 versions used for
compression.  Between bzip2 0.9x and bzip2 1.x, there was a subtle change in
the compressor resulting in slightly better compression- the end result
being a different file, i.e. a different md5sum.  Assuming compressor versions
are the same, there also is the issue of what compression level the target
source was originally compressed at- was it compressed with -9, -8 or -7?
That's just a sampling of the various original settings that must be accounted
for, and that's limited to gzip/bzip2; other compressors will add to the
number of variables to be accounted for to produce an exact recreation of the
compressed md5sum.
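
The compression level part of the problem is easy to demonstrate.  The sketch
below uses made-up sample data and Python's ``bz2`` module, and compresses the
same bytes at two levels; it does not reproduce the bzip2 0.9x versus 1.x
difference mentioned above.

.. code-block:: python

    import bz2
    import hashlib

    # Made-up stand-in for an uncompressed source tarball.
    data = b"example source tarball contents\n" * 4096

    level9 = bz2.compress(data, 9)
    level1 = bz2.compress(data, 1)

    # Both streams decompress to identical content...
    same_content = bz2.decompress(level9) == bz2.decompress(level1) == data
    # ...yet the compressed files, and so their md5sums, differ.  The bzip2
    # stream header alone already records the level ("BZh9" versus "BZh1").
    same_md5 = hashlib.md5(level9).digest() == hashlib.md5(level1).digest()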

Tracking the compressor version and options originally used isn't really a
valid option- assuming all options were accounted for, clients would still be
required to have multiple versions of the same compressor installed just for
the sake of recreating a compressed md5sum *even though* the uncompressed
source's md5 has already been verified.

=====================
The Proposed Solution
=====================

The creation of a clientside flatfile/db of valid alternate md5/size pairs
would enable portage to handle perfectly reconstructed files, that have a
different md5sum due to compression differences.  The proposed format is thus:

::

	MD5 md5sum orig-file size MD5 md5sum [ optional new-name ] size

Example:

::

	MD5 984146931906a7d53300b29f58f6a899 OOo_1.0.3_source.tar.bz2 165475319 MD5 0733dd85ed44d88d1eabed704d579721 165444187

An alternate md5/size pair for a file would be added **only** when the
uncompressed source's md5/size has been verified, yet upon recompression the
md5 differs.  For cleansing of older md5/size pairs from this db, a utility
would be required- the author suggests the addition of a distfiles-cleaning
utility to portage, with the ability to also cleanse old md5/size pairs when
the file the pair was created for no longer exists in distfiles.
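
A minimal sketch of reading such an entry and consulting it follows.  The
function names are hypothetical, and the sketch assumes that when the optional
new-name is omitted the alternate pair applies to the original filename.

.. code-block:: python

    def parse_alt_entry(line):
        """Parse 'MD5 md5 orig-file size MD5 md5 [new-name] size'.

        Returns two (name, md5, size) tuples: the tree's original entry and
        the accepted alternate.
        """
        fields = line.split()
        if len(fields) not in (7, 8) or fields[0] != "MD5" or fields[4] != "MD5":
            raise ValueError("malformed alternate md5 entry")
        orig = (fields[2], fields[1], int(fields[3]))
        new_name = fields[6] if len(fields) == 8 else fields[2]
        alt = (new_name, fields[5], int(fields[-1]))
        return orig, alt


    def is_acceptable(name, md5, size, tree_entry, alt_entries):
        """Accept a distfile matching the tree's values or a recorded alternate."""
        if (name, md5, size) == tree_entry:
            return True
        return any(orig == tree_entry and alt == (name, md5, size)
                   for orig, alt in alt_entries)


    EXAMPLE = (
        "MD5 984146931906a7d53300b29f58f6a899 OOo_1.0.3_source.tar.bz2 165475319 "
        "MD5 0733dd85ed44d88d1eabed704d579721 165444187"
    )

    orig, alt = parse_alt_entry(EXAMPLE)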

Where to store the database is debatable- /etc/portage or /var/cache/edb are
definite options.

The reasoning for allowing for an optional new-name is that it provides needed
functionality should anyone attempt to extend portage to allow for clients to
change the compression used for a source (eg, recompress all gzip files as
bzip2).  Granted, no such code or attempt has been made, but nothing is lost
by leaving the option open should the request/attempt be made.

A potential gotcha of adding this support is that in environments where the
distfiles directory is shared out to multiple systems, this db must be shared
also.



Distfile Mirror Additions
-------------------------

One issue of contention is where these files will actually be stored.  As of
the writing of this glep, a full distfiles mirror is roughly 40 gb- a rough
estimate by the author places the space requirements for patches for each
version at a total of around 4gb.  Note this isn't even remotely a hard
figure yet, and a better figure is currently being worked out.

Regardless of the exact space figure, finding a place to store the patches
will be problematic.  Expansion of the required mirror space (essentially just
swallowing the patches storage requirement) is unlikely, since it was one of
the main arguments against the now defunct glep9 attempt [3]_.  A couple of
ideas that have been put forth to handle the additional space requirements are
as follows:

1)	Identification of mirrors willing to handle the extra space requirements-
essentially create an additional patch mirror tier.

2)	Mirroring only a patch for certain package versions, rather than full
source.  Using kdelibs-3.1.5 as an example, only the patch would be mirrored
(rather than the full 10.53 MB source).  The downside to this approach is that
a user who is downloading kdelibs for the first time would either need to pull
it from the original SRC_URI (placing the burden onto the upstream mirror), or
pull the 3.1.4 version and the patch- pulling 63k more than if they had just
pulled the full version.  The kdelibs 3.1.4/3.1.5 example is something of an
optimal case- not all versions will have such minuscule patches.

3)	A variation on the idea above, essentially mirroring only the patch for
the oldest version(s) of a package; eg, kdelibs currently has versions 3.05,
3.1.5, 3.2.0, and 3.2.1- the mirrors would only carry a patch for 3.05, not
full source (think RESTRICT="fetch").  One plus to this is that patches to
downgrade in version are smaller than the patches to upgrade in version- there
are exceptions to this, but they're hard to find.  The major downsides to this
approach are A) a user would have to sync up to get the patchlists for that
version, and B) a set of patches to go backwards in version would have to be
created (see `Binary patches vs GNUDiff patches`_).

Of the options listed above, the first is the easiest, although the second
could be made to work.  Feedback and any possible alternatives would be
greatly appreciated.

Patch Creation
--------------

Maintenance of patch lists and the actual patch creation ought to be managed
by a high level script- essentially a dev says "I want a patch between this
version, and that version: make it so", and the script churns away
creating/updating the patch list and generating the patch locally.  The
utility next uploads the new patch to /space/distfiles-local on dev.gentoo.org
(unless it's not a locally generated patch), and repoman is used to
commit the updated patch list.

What would be preferable (although possibly wishful thinking) is if hardware
could be co-opted for automatic patch generation, rather than forcing it upon
the devs- something akin to how files are pulled onto the mirror automatically
for new ebuilds.

The initial bulk of patches will be generated by the author, to ease the
transition and offer patches for people to test out.

Backwards Compatibility
=======================

As noted in `The Proposed Solution`_, a system using patching and sharing out
its distfiles must share out its alternate md5 db.  Any system that uses the
distfiles share must support the alternate md5 db also.  If this is considered
enough of an issue, it is conceivable to place reconstructed sources with an
alternate md5 into a subdirectory of distdir- portage only looks within
distdir, unwilling to descend into subdirectories.

Also note that `Distfile Mirror Additions`_ may introduce further backwards
compatibility issues, depending on what solution is accepted.

Reference Implementation
========================

TODO

References
==========
.. [1] http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.{patch,diff}.bz2.
.. [2] kdelibs-3.1.4-3.1.5.patch.bz2, switching format patch, created via diffball-0.4_pre4 (diffball is available at http://sourceforge.net/projects/diffball) 
       Bzip2 -9 compressed, the patch is 75,687 bytes, uncompressed it is 337,649 bytes.  The patch is available at http://dev.gentoo.org/~ferringb/kdelibs-3.1.4-3.1.5.patch.bz2 for those curious.
.. [3] Glep9, 'Gentoo Package Update System' 
	(https://www.gentoo.org/glep/glep-0009.html)

Copyright
=========

This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
Unported License.  To view a copy of this license, visit
https://creativecommons.org/licenses/by-sa/3.0/.