summaryrefslogtreecommitdiff
path: root/xdelta1/xdelta-0.13.README
diff options
context:
space:
mode:
authordotdotisdead <dotdotisdead@a3eca27d-f21b-0410-9b4a-6511e771f64e>2006-09-30 04:47:47 +0000
committerdotdotisdead <dotdotisdead@a3eca27d-f21b-0410-9b4a-6511e771f64e>2006-09-30 04:47:47 +0000
commitad85653ca73c8126de516b9a4294e8f08577c00d (patch)
tree79fb4d644ccf6a4fe8dac146b801a21d63537b23 /xdelta1/xdelta-0.13.README
parent5a7c245871879325d7b05c06e0b2011203986ee8 (diff)
import 1.1.3
Diffstat (limited to 'xdelta1/xdelta-0.13.README')
-rwxr-xr-xxdelta1/xdelta-0.13.README311
1 files changed, 311 insertions, 0 deletions
diff --git a/xdelta1/xdelta-0.13.README b/xdelta1/xdelta-0.13.README
new file mode 100755
index 0000000..f928f66
--- /dev/null
+++ b/xdelta1/xdelta-0.13.README
@@ -0,0 +1,311 @@
1
2I have placed version 0.13 of XDELTA, new delta generator and
3prototype RCS-replacement program at:
4
5 ftp://www.xcf.berkeley.edu/pub/jmacd/xdelta-0.13.tar.gz
6
7======
8
9Hereafter, the algorithm and program are described. A delta algorithm
10operates on two files, a FROM file and a TO file. It computes enough
11informnation to reconstruct the TO file from the FROM file. So the
12delta and it's inverse, patch, work as follows:
13
14P = delta (FROM, TO)
15TO = patch (FROM, P)
16
17The delta generator portion of this program is a delta algorithm which
18searches for substring matches between the files and then outputs
19instructions to reconstruct the new file from the old file. It
20produces a set of copy/insert instructions that tell how to
21reconstruct the file as a sequence of copies from the FROM file and
22inserts from the delta itself. In this regard, the program is much
23closer to a compression program than to a diff program. However, the
24delta is not "compressed", in that the delta's entropy H(P) will be
25very similar to the entropy of the portions of the TO file not found
26within the FROM file. The delta will compress just as well as the TO
27file will. This is a fundamentally different method of computing
28deltas than in the traditional "diff" program. The diff program and
29it's variants use a least-common-subsequence (LCS) algorithm to find a
30list of inserts and deletes that will modify the FROM file into the TO
31file. LCS is more expensive to compute and is sometimes more useful,
32especially to the human reader. Since LCS is a fairly expensive
33algorithm, diff programs usually divide the input files into
34newline-separated "atoms" before computing a delta. This is a fine
35approximation for text files, but not binary files.
36
37I propose this delta generator as a replacement for "diff" in
38applications where the insert/delete delta is not required. Since the
39copy/insert delta is easier to compute, it does not have to make the
40reduction in input size by breaking the file into lines. The delta
41generator works well on binary files.
42
43The next question is whether it is actually desirable to compute
44deltas between binary files. The answer is certainly yes, even though
45some binary file formats will not have a great degree of similarity
46between versions which were generated between minor modifications to
47their sources. First, I have evidence that some file formats (notably
48FrameMaker documents) work very well. Machine-dependant object files
49and executables don't work very well, but it is still worth doing.
50The reason it is still worthwhile is that compression takes longer
51than finding these deltas, so any space savings the delta generator
52produces it very likely to reduce the total archival time. Even if
53the delta generator saves no space, the total time well end up less
54than twice the time of compression. I will include some measurements
55at the end of this writing.
56
57With the delta generator as a building block, I have written a
58replacement for RCS. It does not have all the user-level features
59someone would want in RCS, but these things only get in the way when
60the code is used by higher level version control programs such as
61PRCS. For now, the prototype supports the following actions:
62
63register create an empty version file
64checkin create a new version
65checkout extract an old version
66
67The program departs from the RCS model in that it does not branch it's
68versions. The reason for doing this is to solve several problems with
69RCS (aside from it's use of diff which is solved by using my delta
70generator).
71
721. Development very often ends up off the "main trunk" of the RCS
73file, which causes an O(N) extract, where N deltas must be applied to
74the head of the trunk before getting the desired file. Once this
75occurs, things get worse and worse. Of course, it is possible to
76remedy the situation by checking it in once on the trunk, but that
77defeats the purpose of branching, because you lose the information
78about branch of development you are on. Further, the notion of
79branching individual files conflicts with the model presented by
80higher level version control programs such as PRCS, which would rather
81not have to go to extra trouble to insure that development does not
82end up off the trunk.
83
842. Deltas are never used more than once in a particular version
85history, even if they contain the same data as another delta.
86
87Conceptually, to get rid of RCS-like branching and retain the "deltas
88are computed between a file and its parent", the xdelta version
89control library computes a reverse delta between the new file and the
90entire version file. The most recently checked in version is always
91the "head", in RCS terms. It is always the quickest and easiest to
92extract. For a repository with N versions, version I requires the
93application of (N-I) deltas, regardless of how the set of versions are
94related.
95
96Each delta contains a set of instructions and all data which was
97originally part of a file version but missing in the FROM file. This
98data is represented without compression or modification. As a result,
99each delta may be used as an input for finding future deltas just as a
100regular file may be.
101
102Several equations describe the process, first some notation:
103
104 V0 is the version file with 1 version
105 V1 is the version file with 2 versions,
106 ...
107 VN is the version file with N+1 versions
108
109 F0 is the 1st file checked in, upon checkin V0 is produced
110 F1 is the 2nd file checked in, upon checkin V1 is produced
111 ...
112 FN is the Nth file checked in, upon checkin VN is produced
113
114 D0 is the 1st delta produced, upon checkin of F1
115 D1 is the 2nd delta produced, upon checkin of F2
116 ...
117 DN is the Nth delta produced, upon checkin of FN+1
118
119Checkin goes as follows:
120
121 V0 <- F0
122
123 D0 <- delta (F1, F0)
124 V1 <- F1 + D0
125
126 D1 <- delta (F2 + D0, F1)
127 V2 <- F2 + D1 + D0
128
129 D2 <- delta (F3 + D1 + D0, F2)
130 V3 <- F3 + D2 + D1 + D0
131
132an so on. To checkout:
133
134 F3 is immediate
135 F2 <- patch (F3 + D1 + D0, D2)
136 F1 <- patch (F2 + D0, D1)
137 F0 <- patch (F1, D0)
138
139Now I will describe why this is just as good as branching, even though
140the branches do not affect which files are used for computing deltas,
141and why it is just as good as the RCS model.
142
143Suppose the we have a sequence of checkins with RCS version numbers
144and their sequence numbers:
145
146F0: 1.1
147F1: 1.2
148F2: 1.3
149F3: 1.1.1.1
150F4: 1.1.1.2
151F5: 1.1.1.3
152F6: 1.4
153F7: 1.5
154
155Of interest are the 4th checkin, from 1.3 to 1.1.1.1, and the 7th
156checkin, from 1.1.1.3 to 1.4. I will now use the '-' sign to denote
157the taking of a delta and '+' for patch. Thus, delta(A,B) is
158equivalent to B-A, and patch(A,xdelta(A,B)) is equivalent to A+(B-A),
159or B, as it should be. In the above sequence of checkins, these
160deltas are produced:
161
162D0=F0-F1
163D1=F1-F2-D0
164D2=F2-F3-D1-D0
165D3=F3-F4-D2-D1-D0
166D4=F4-F5-D3-D2-D1-D0
167D5=F5-F6-D4-D3-D2-D1-D0
168D6=F6-F7-D5-D4-D3-D2-D1-D0
169
170The checkin of F3 produces D2. The concern is whether any deletions
171between 1.1 and 1.3 have to be inserted into D2. They have not,
172because any deletions between 1.1 and 1.3 appear in D0 and D1. Thus,
173using D0, D1, and F3 as an input for computing D2 achieves the desired
174results: the deletion is not reflected twice in separate deltas.
175
176The properties of this version format differ from the RCS format. It
177ensures that recently checked-in versions are quick to check out, no
178matter where in the "tree" they occur. One potential problem with
179this formulation is that according to the formulae above, the ammount
180of memory and computation required for checking in each new version
181grows with the number of previous versions. To prevent this it only
182uses up to some limiting value (currently set at 10) of previous
183deltas in computing new deltas. This means that deletions on a branch
184over 10 checkins old will be duplicated on a checkin. Another
185property is that to delete a version all older versions must first be
186deleted or off-line processing may be done to reconstruct the delta
187log with the version removed (an expensive but not time-critical
188operation). It also lends itself to keeping windowed histories, such
189as "only keep the last 20 revisions, anything earlier than that may be
190thrown away".
191
192---
193
194The current prototype implementation is complete. The program,
195'xdelta', has a command syntax similar to PRCS:
196
197xdelta COMMAND [OPTIONS] ARG1 ARG2 ...
198
199There are 5 commands:
200
201 register Create a versionfile
202 checkin Checkin a new version
203 checkout Checkout a version (latest or -r VERSION)
204 info Show information on a version (all or -r VERSION)
205 delta Produce a delta from ARG1 to ARG2 producing ARG3
206 patch Patch file ARG1 with ARG2 producing ARG3
207
208The -r option specifies a version (the latest version is always the
209default). Versions are sequentially numbered. The file itself is a
210GNU DBM database which stores the deltas and latest version. The file
211format is subject to change as I look for storage methods that are
212portable to Windows (otherwise, the delta generator has already been
213tested on Windows).
214
215The DELTA and PATCH commands are unrelated to the rest, and simply
216serve to generate deltas (and could be used as replacements for "diff"
217and "patch"). All operations are verified with MD5 checksums, which
218are saved for each version in the version file and the FROM and TO
219files when generating and applying deltas.
220
221---
222
223Some timing comparisons (remember all times include MD5 computations)
224on a 300MHz Alpha:
225
226Two kernels:
227
228 7875136 /genvmunix
229 6677320 /vmunix
230
231And the times to compress and delta them:
232
233 time(s) size(bytes)
234
235xdelta delta /genvmunix /vmunix patch: 10.1 4338029
236gzip --best /genvmunix: 34.0 3413963
237gzip --best /vmunix: 63.7 2501766
238gzip --best patch: 51.9 1531163
239
240(I'm not sure why there's such a variation in gzip times--it was
241repeatable.)
242
243xdelta produces a patch which is 4338029 bytes, approximately 2.6
244megabytes smaller than its TO file. The total space required for
245saving the two kernels compressed is 5915729 bytes, produced in 97.7
246seconds. The total space for storing /genvmunix and the delta
247compressed is 4945126 bytes, produced in 96.0 seconds. Using the
248delta saved approximately one megabyte and two seconds.
249
250---
251
252All the files in PRCS 1.0.0 and 1.2.0 concatenated (files ordered
253lexigraphically):
254
255 1678837 /tmp/1.0.0
256 2021819 /tmp/1.2.0
257
258 time(s) space(bytes)
259
260xdelta delta /tmp/1.0.0 /tmp/1.2.0 /tmp/patch: 1.5 695893
261gdiff -a --rcs /tmp/1.0.0 /tmp/1.2.0 /tmp/patch: 3.7 1524705
262
263Even though both files were completely text, gdiff missed a lot of
264similar regions because a line had been partially modified. Xdelta
265performs better in both dimensions.
266
267---
268
269All the .c and .h files in GNU gcc 2.7.2 and 2.7.2.3 concatenated (files
270ordered lexigraphically). These files are highly similar:
271
272 15590954 /tmp/2.7.2
273 15599644 /tmp/2.7.2.3
274 time(s) space(bytes)
275
276xdelta delta /tmp/2.7.2 /tmp/2.7.2.3 /tmp/patch: 7.3 11922
277gdiff -a --rcs /tmp/2.7.2 /tmp/2.7.2.3: 4.8 19477
278
279Here xdelta takes a about 50% longer and produces a slightly smaller
280diff. I think the improved performance in the other two examples
281makes xdelta an all around winner.
282
283---
284
285A note on FrameMaker files, which I happen to know a fair bit about.
286Even making radical modifications to a Frame+SGML document leaves much
287of its content unchanged, due to the fact that the EDD and various
288catalogs do not change. When only minimal modifications are performed
289
290Take a 24 page document, chapter 15 of the SGML developers guide for
291version 5.5. I replaced 38 occurances of "document" with
292"abcdefghijklmnop" and added the source of this README from here up at
293the end, making it 33 pages.
294
295 250880 graphics.before.fm
296 272384 graphics.after.fm
297
298The document grew by about 22 kbytes. The patch 32780 bytes. The
299same diff generated by GNU diff is 101097 bytes. Clearly, Frame
300documents are very good for binary version control.
301
302---
303
304I also have evidence that the program works well on uncompressed image
305formats such as the GIMP's XCF file format.
306
307This program will be the basis of PRCS version 2, which will support
308client/server distributed version control.
309
310Please send any comments, questions, thoughts, or interest to me,
311jmacd@cs.berkeley.edu.