diff options
author | Josh MacDonald <josh.macdonald@gmail.com> | 2016-02-26 21:43:58 -0800 |
---|---|---|
committer | Josh MacDonald <josh.macdonald@gmail.com> | 2016-02-26 21:43:58 -0800 |
commit | d594fbe514e2a381541a510fe01041e546f56a67 (patch) | |
tree | 316072f34cf7dc94518ec14a8d955d57c20d65b7 /xdelta1/xdelta-0.13.README | |
parent | 81aebf78ae67c29f528088d65743643e5355e3d3 (diff) |
Remove xdelta1 from this repo
Diffstat (limited to 'xdelta1/xdelta-0.13.README')
-rw-r--r-- | xdelta1/xdelta-0.13.README | 311 |
1 files changed, 0 insertions, 311 deletions
diff --git a/xdelta1/xdelta-0.13.README b/xdelta1/xdelta-0.13.README deleted file mode 100644 index f928f66..0000000 --- a/xdelta1/xdelta-0.13.README +++ /dev/null | |||
@@ -1,311 +0,0 @@ | |||
1 | |||
2 | I have placed version 0.13 of XDELTA, new delta generator and | ||
3 | prototype RCS-replacement program at: | ||
4 | |||
5 | ftp://www.xcf.berkeley.edu/pub/jmacd/xdelta-0.13.tar.gz | ||
6 | |||
7 | ====== | ||
8 | |||
9 | Hereafter, the algorithm and program are described. A delta algorithm | ||
10 | operates on two files, a FROM file and a TO file. It computes enough | ||
11 | informnation to reconstruct the TO file from the FROM file. So the | ||
12 | delta and it's inverse, patch, work as follows: | ||
13 | |||
14 | P = delta (FROM, TO) | ||
15 | TO = patch (FROM, P) | ||
16 | |||
17 | The delta generator portion of this program is a delta algorithm which | ||
18 | searches for substring matches between the files and then outputs | ||
19 | instructions to reconstruct the new file from the old file. It | ||
20 | produces a set of copy/insert instructions that tell how to | ||
21 | reconstruct the file as a sequence of copies from the FROM file and | ||
22 | inserts from the delta itself. In this regard, the program is much | ||
23 | closer to a compression program than to a diff program. However, the | ||
24 | delta is not "compressed", in that the delta's entropy H(P) will be | ||
25 | very similar to the entropy of the portions of the TO file not found | ||
26 | within the FROM file. The delta will compress just as well as the TO | ||
27 | file will. This is a fundamentally different method of computing | ||
28 | deltas than in the traditional "diff" program. The diff program and | ||
29 | it's variants use a least-common-subsequence (LCS) algorithm to find a | ||
30 | list of inserts and deletes that will modify the FROM file into the TO | ||
31 | file. LCS is more expensive to compute and is sometimes more useful, | ||
32 | especially to the human reader. Since LCS is a fairly expensive | ||
33 | algorithm, diff programs usually divide the input files into | ||
34 | newline-separated "atoms" before computing a delta. This is a fine | ||
35 | approximation for text files, but not binary files. | ||
36 | |||
37 | I propose this delta generator as a replacement for "diff" in | ||
38 | applications where the insert/delete delta is not required. Since the | ||
39 | copy/insert delta is easier to compute, it does not have to make the | ||
40 | reduction in input size by breaking the file into lines. The delta | ||
41 | generator works well on binary files. | ||
42 | |||
43 | The next question is whether it is actually desirable to compute | ||
44 | deltas between binary files. The answer is certainly yes, even though | ||
45 | some binary file formats will not have a great degree of similarity | ||
46 | between versions which were generated between minor modifications to | ||
47 | their sources. First, I have evidence that some file formats (notably | ||
48 | FrameMaker documents) work very well. Machine-dependant object files | ||
49 | and executables don't work very well, but it is still worth doing. | ||
50 | The reason it is still worthwhile is that compression takes longer | ||
51 | than finding these deltas, so any space savings the delta generator | ||
52 | produces it very likely to reduce the total archival time. Even if | ||
53 | the delta generator saves no space, the total time well end up less | ||
54 | than twice the time of compression. I will include some measurements | ||
55 | at the end of this writing. | ||
56 | |||
57 | With the delta generator as a building block, I have written a | ||
58 | replacement for RCS. It does not have all the user-level features | ||
59 | someone would want in RCS, but these things only get in the way when | ||
60 | the code is used by higher level version control programs such as | ||
61 | PRCS. For now, the prototype supports the following actions: | ||
62 | |||
63 | register create an empty version file | ||
64 | checkin create a new version | ||
65 | checkout extract an old version | ||
66 | |||
67 | The program departs from the RCS model in that it does not branch it's | ||
68 | versions. The reason for doing this is to solve several problems with | ||
69 | RCS (aside from it's use of diff which is solved by using my delta | ||
70 | generator). | ||
71 | |||
72 | 1. Development very often ends up off the "main trunk" of the RCS | ||
73 | file, which causes an O(N) extract, where N deltas must be applied to | ||
74 | the head of the trunk before getting the desired file. Once this | ||
75 | occurs, things get worse and worse. Of course, it is possible to | ||
76 | remedy the situation by checking it in once on the trunk, but that | ||
77 | defeats the purpose of branching, because you lose the information | ||
78 | about branch of development you are on. Further, the notion of | ||
79 | branching individual files conflicts with the model presented by | ||
80 | higher level version control programs such as PRCS, which would rather | ||
81 | not have to go to extra trouble to insure that development does not | ||
82 | end up off the trunk. | ||
83 | |||
84 | 2. Deltas are never used more than once in a particular version | ||
85 | history, even if they contain the same data as another delta. | ||
86 | |||
87 | Conceptually, to get rid of RCS-like branching and retain the "deltas | ||
88 | are computed between a file and its parent", the xdelta version | ||
89 | control library computes a reverse delta between the new file and the | ||
90 | entire version file. The most recently checked in version is always | ||
91 | the "head", in RCS terms. It is always the quickest and easiest to | ||
92 | extract. For a repository with N versions, version I requires the | ||
93 | application of (N-I) deltas, regardless of how the set of versions are | ||
94 | related. | ||
95 | |||
96 | Each delta contains a set of instructions and all data which was | ||
97 | originally part of a file version but missing in the FROM file. This | ||
98 | data is represented without compression or modification. As a result, | ||
99 | each delta may be used as an input for finding future deltas just as a | ||
100 | regular file may be. | ||
101 | |||
102 | Several equations describe the process, first some notation: | ||
103 | |||
104 | V0 is the version file with 1 version | ||
105 | V1 is the version file with 2 versions, | ||
106 | ... | ||
107 | VN is the version file with N+1 versions | ||
108 | |||
109 | F0 is the 1st file checked in, upon checkin V0 is produced | ||
110 | F1 is the 2nd file checked in, upon checkin V1 is produced | ||
111 | ... | ||
112 | FN is the Nth file checked in, upon checkin VN is produced | ||
113 | |||
114 | D0 is the 1st delta produced, upon checkin of F1 | ||
115 | D1 is the 2nd delta produced, upon checkin of F2 | ||
116 | ... | ||
117 | DN is the Nth delta produced, upon checkin of FN+1 | ||
118 | |||
119 | Checkin goes as follows: | ||
120 | |||
121 | V0 <- F0 | ||
122 | |||
123 | D0 <- delta (F1, F0) | ||
124 | V1 <- F1 + D0 | ||
125 | |||
126 | D1 <- delta (F2 + D0, F1) | ||
127 | V2 <- F2 + D1 + D0 | ||
128 | |||
129 | D2 <- delta (F3 + D1 + D0, F2) | ||
130 | V3 <- F3 + D2 + D1 + D0 | ||
131 | |||
132 | an so on. To checkout: | ||
133 | |||
134 | F3 is immediate | ||
135 | F2 <- patch (F3 + D1 + D0, D2) | ||
136 | F1 <- patch (F2 + D0, D1) | ||
137 | F0 <- patch (F1, D0) | ||
138 | |||
139 | Now I will describe why this is just as good as branching, even though | ||
140 | the branches do not affect which files are used for computing deltas, | ||
141 | and why it is just as good as the RCS model. | ||
142 | |||
143 | Suppose the we have a sequence of checkins with RCS version numbers | ||
144 | and their sequence numbers: | ||
145 | |||
146 | F0: 1.1 | ||
147 | F1: 1.2 | ||
148 | F2: 1.3 | ||
149 | F3: 1.1.1.1 | ||
150 | F4: 1.1.1.2 | ||
151 | F5: 1.1.1.3 | ||
152 | F6: 1.4 | ||
153 | F7: 1.5 | ||
154 | |||
155 | Of interest are the 4th checkin, from 1.3 to 1.1.1.1, and the 7th | ||
156 | checkin, from 1.1.1.3 to 1.4. I will now use the '-' sign to denote | ||
157 | the taking of a delta and '+' for patch. Thus, delta(A,B) is | ||
158 | equivalent to B-A, and patch(A,xdelta(A,B)) is equivalent to A+(B-A), | ||
159 | or B, as it should be. In the above sequence of checkins, these | ||
160 | deltas are produced: | ||
161 | |||
162 | D0=F0-F1 | ||
163 | D1=F1-F2-D0 | ||
164 | D2=F2-F3-D1-D0 | ||
165 | D3=F3-F4-D2-D1-D0 | ||
166 | D4=F4-F5-D3-D2-D1-D0 | ||
167 | D5=F5-F6-D4-D3-D2-D1-D0 | ||
168 | D6=F6-F7-D5-D4-D3-D2-D1-D0 | ||
169 | |||
170 | The checkin of F3 produces D2. The concern is whether any deletions | ||
171 | between 1.1 and 1.3 have to be inserted into D2. They have not, | ||
172 | because any deletions between 1.1 and 1.3 appear in D0 and D1. Thus, | ||
173 | using D0, D1, and F3 as an input for computing D2 achieves the desired | ||
174 | results: the deletion is not reflected twice in separate deltas. | ||
175 | |||
176 | The properties of this version format differ from the RCS format. It | ||
177 | ensures that recently checked-in versions are quick to check out, no | ||
178 | matter where in the "tree" they occur. One potential problem with | ||
179 | this formulation is that according to the formulae above, the ammount | ||
180 | of memory and computation required for checking in each new version | ||
181 | grows with the number of previous versions. To prevent this it only | ||
182 | uses up to some limiting value (currently set at 10) of previous | ||
183 | deltas in computing new deltas. This means that deletions on a branch | ||
184 | over 10 checkins old will be duplicated on a checkin. Another | ||
185 | property is that to delete a version all older versions must first be | ||
186 | deleted or off-line processing may be done to reconstruct the delta | ||
187 | log with the version removed (an expensive but not time-critical | ||
188 | operation). It also lends itself to keeping windowed histories, such | ||
189 | as "only keep the last 20 revisions, anything earlier than that may be | ||
190 | thrown away". | ||
191 | |||
192 | --- | ||
193 | |||
194 | The current prototype implementation is complete. The program, | ||
195 | 'xdelta', has a command syntax similar to PRCS: | ||
196 | |||
197 | xdelta COMMAND [OPTIONS] ARG1 ARG2 ... | ||
198 | |||
199 | There are 5 commands: | ||
200 | |||
201 | register Create a versionfile | ||
202 | checkin Checkin a new version | ||
203 | checkout Checkout a version (latest or -r VERSION) | ||
204 | info Show information on a version (all or -r VERSION) | ||
205 | delta Produce a delta from ARG1 to ARG2 producing ARG3 | ||
206 | patch Patch file ARG1 with ARG2 producing ARG3 | ||
207 | |||
208 | The -r option specifies a version (the latest version is always the | ||
209 | default). Versions are sequentially numbered. The file itself is a | ||
210 | GNU DBM database which stores the deltas and latest version. The file | ||
211 | format is subject to change as I look for storage methods that are | ||
212 | portable to Windows (otherwise, the delta generator has already been | ||
213 | tested on Windows). | ||
214 | |||
215 | The DELTA and PATCH commands are unrelated to the rest, and simply | ||
216 | serve to generate deltas (and could be used as replacements for "diff" | ||
217 | and "patch"). All operations are verified with MD5 checksums, which | ||
218 | are saved for each version in the version file and the FROM and TO | ||
219 | files when generating and applying deltas. | ||
220 | |||
221 | --- | ||
222 | |||
223 | Some timing comparisons (remember all times include MD5 computations) | ||
224 | on a 300MHz Alpha: | ||
225 | |||
226 | Two kernels: | ||
227 | |||
228 | 7875136 /genvmunix | ||
229 | 6677320 /vmunix | ||
230 | |||
231 | And the times to compress and delta them: | ||
232 | |||
233 | time(s) size(bytes) | ||
234 | |||
235 | xdelta delta /genvmunix /vmunix patch: 10.1 4338029 | ||
236 | gzip --best /genvmunix: 34.0 3413963 | ||
237 | gzip --best /vmunix: 63.7 2501766 | ||
238 | gzip --best patch: 51.9 1531163 | ||
239 | |||
240 | (I'm not sure why there's such a variation in gzip times--it was | ||
241 | repeatable.) | ||
242 | |||
243 | xdelta produces a patch which is 4338029 bytes, approximately 2.6 | ||
244 | megabytes smaller than its TO file. The total space required for | ||
245 | saving the two kernels compressed is 5915729 bytes, produced in 97.7 | ||
246 | seconds. The total space for storing /genvmunix and the delta | ||
247 | compressed is 4945126 bytes, produced in 96.0 seconds. Using the | ||
248 | delta saved approximately one megabyte and two seconds. | ||
249 | |||
250 | --- | ||
251 | |||
252 | All the files in PRCS 1.0.0 and 1.2.0 concatenated (files ordered | ||
253 | lexigraphically): | ||
254 | |||
255 | 1678837 /tmp/1.0.0 | ||
256 | 2021819 /tmp/1.2.0 | ||
257 | |||
258 | time(s) space(bytes) | ||
259 | |||
260 | xdelta delta /tmp/1.0.0 /tmp/1.2.0 /tmp/patch: 1.5 695893 | ||
261 | gdiff -a --rcs /tmp/1.0.0 /tmp/1.2.0 /tmp/patch: 3.7 1524705 | ||
262 | |||
263 | Even though both files were completely text, gdiff missed a lot of | ||
264 | similar regions because a line had been partially modified. Xdelta | ||
265 | performs better in both dimensions. | ||
266 | |||
267 | --- | ||
268 | |||
269 | All the .c and .h files in GNU gcc 2.7.2 and 2.7.2.3 concatenated (files | ||
270 | ordered lexigraphically). These files are highly similar: | ||
271 | |||
272 | 15590954 /tmp/2.7.2 | ||
273 | 15599644 /tmp/2.7.2.3 | ||
274 | time(s) space(bytes) | ||
275 | |||
276 | xdelta delta /tmp/2.7.2 /tmp/2.7.2.3 /tmp/patch: 7.3 11922 | ||
277 | gdiff -a --rcs /tmp/2.7.2 /tmp/2.7.2.3: 4.8 19477 | ||
278 | |||
279 | Here xdelta takes a about 50% longer and produces a slightly smaller | ||
280 | diff. I think the improved performance in the other two examples | ||
281 | makes xdelta an all around winner. | ||
282 | |||
283 | --- | ||
284 | |||
285 | A note on FrameMaker files, which I happen to know a fair bit about. | ||
286 | Even making radical modifications to a Frame+SGML document leaves much | ||
287 | of its content unchanged, due to the fact that the EDD and various | ||
288 | catalogs do not change. When only minimal modifications are performed | ||
289 | |||
290 | Take a 24 page document, chapter 15 of the SGML developers guide for | ||
291 | version 5.5. I replaced 38 occurances of "document" with | ||
292 | "abcdefghijklmnop" and added the source of this README from here up at | ||
293 | the end, making it 33 pages. | ||
294 | |||
295 | 250880 graphics.before.fm | ||
296 | 272384 graphics.after.fm | ||
297 | |||
298 | The document grew by about 22 kbytes. The patch 32780 bytes. The | ||
299 | same diff generated by GNU diff is 101097 bytes. Clearly, Frame | ||
300 | documents are very good for binary version control. | ||
301 | |||
302 | --- | ||
303 | |||
304 | I also have evidence that the program works well on uncompressed image | ||
305 | formats such as the GIMP's XCF file format. | ||
306 | |||
307 | This program will be the basis of PRCS version 2, which will support | ||
308 | client/server distributed version control. | ||
309 | |||
310 | Please send any comments, questions, thoughts, or interest to me, | ||
311 | jmacd@cs.berkeley.edu. | ||