summaryrefslogtreecommitdiff
path: root/xdelta1/xdelta-0.13.README
blob: f928f667ab200891271fed37afe301659cc4b944 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311

I have placed version 0.13 of XDELTA, new delta generator and
prototype RCS-replacement program at:

  ftp://www.xcf.berkeley.edu/pub/jmacd/xdelta-0.13.tar.gz

======

Hereafter, the algorithm and program are described.  A delta algorithm
operates on two files, a FROM file and a TO file.  It computes enough
informnation to reconstruct the TO file from the FROM file.  So the
delta and it's inverse, patch, work as follows:

P = delta (FROM, TO)
TO = patch (FROM, P)

The delta generator portion of this program is a delta algorithm which
searches for substring matches between the files and then outputs
instructions to reconstruct the new file from the old file.  It
produces a set of copy/insert instructions that tell how to
reconstruct the file as a sequence of copies from the FROM file and
inserts from the delta itself.  In this regard, the program is much
closer to a compression program than to a diff program.  However, the
delta is not "compressed", in that the delta's entropy H(P) will be
very similar to the entropy of the portions of the TO file not found
within the FROM file.  The delta will compress just as well as the TO
file will.  This is a fundamentally different method of computing
deltas than in the traditional "diff" program.  The diff program and
it's variants use a least-common-subsequence (LCS) algorithm to find a
list of inserts and deletes that will modify the FROM file into the TO
file.  LCS is more expensive to compute and is sometimes more useful,
especially to the human reader.  Since LCS is a fairly expensive
algorithm, diff programs usually divide the input files into
newline-separated "atoms" before computing a delta.  This is a fine
approximation for text files, but not binary files.

I propose this delta generator as a replacement for "diff" in
applications where the insert/delete delta is not required.  Since the
copy/insert delta is easier to compute, it does not have to make the
reduction in input size by breaking the file into lines.  The delta
generator works well on binary files.

The next question is whether it is actually desirable to compute
deltas between binary files.  The answer is certainly yes, even though
some binary file formats will not have a great degree of similarity
between versions which were generated between minor modifications to
their sources.  First, I have evidence that some file formats (notably
FrameMaker documents) work very well.  Machine-dependant object files
and executables don't work very well, but it is still worth doing.
The reason it is still worthwhile is that compression takes longer
than finding these deltas, so any space savings the delta generator
produces it very likely to reduce the total archival time.  Even if
the delta generator saves no space, the total time well end up less
than twice the time of compression.  I will include some measurements
at the end of this writing.

With the delta generator as a building block, I have written a
replacement for RCS.  It does not have all the user-level features
someone would want in RCS, but these things only get in the way when
the code is used by higher level version control programs such as
PRCS.  For now, the prototype supports the following actions:

register		create an empty version file
checkin			create a new version
checkout		extract an old version

The program departs from the RCS model in that it does not branch it's
versions.  The reason for doing this is to solve several problems with
RCS (aside from it's use of diff which is solved by using my delta
generator).

1. Development very often ends up off the "main trunk" of the RCS
file, which causes an O(N) extract, where N deltas must be applied to
the head of the trunk before getting the desired file.  Once this
occurs, things get worse and worse.  Of course, it is possible to
remedy the situation by checking it in once on the trunk, but that
defeats the purpose of branching, because you lose the information
about branch of development you are on.  Further, the notion of
branching individual files conflicts with the model presented by
higher level version control programs such as PRCS, which would rather
not have to go to extra trouble to insure that development does not
end up off the trunk.

2. Deltas are never used more than once in a particular version
history, even if they contain the same data as another delta.

Conceptually, to get rid of RCS-like branching and retain the "deltas
are computed between a file and its parent", the xdelta version
control library computes a reverse delta between the new file and the
entire version file.  The most recently checked in version is always
the "head", in RCS terms.  It is always the quickest and easiest to
extract.  For a repository with N versions, version I requires the
application of (N-I) deltas, regardless of how the set of versions are
related.

Each delta contains a set of instructions and all data which was
originally part of a file version but missing in the FROM file.  This
data is represented without compression or modification.  As a result,
each delta may be used as an input for finding future deltas just as a
regular file may be.

Several equations describe the process, first some notation:

	V0 is the version file with 1 version
	V1 is the version file with 2 versions,
	...
	VN is the version file with N+1 versions

	F0 is the 1st file checked in, upon checkin V0 is produced
	F1 is the 2nd file checked in, upon checkin V1 is produced
	...
	FN is the Nth file checked in, upon checkin VN is produced

	D0 is the 1st delta produced, upon checkin of F1
	D1 is the 2nd delta produced, upon checkin of F2
	...
	DN is the Nth delta produced, upon checkin of FN+1

Checkin goes as follows:

	V0 <- F0

	D0 <- delta (F1, F0)
	V1 <- F1 + D0

	D1 <- delta (F2 + D0, F1)
	V2 <- F2 + D1 + D0

	D2 <- delta (F3 + D1 + D0, F2)
	V3 <- F3 + D2 + D1 + D0

an so on.  To checkout:

	F3 is immediate
	F2 <- patch (F3 + D1 + D0, D2)
	F1 <- patch (F2 + D0, D1)
	F0 <- patch (F1, D0)

Now I will describe why this is just as good as branching, even though
the branches do not affect which files are used for computing deltas,
and why it is just as good as the RCS model.

Suppose the we have a sequence of checkins with RCS version numbers
and their sequence numbers:

F0: 1.1
F1: 1.2
F2: 1.3
F3: 1.1.1.1
F4: 1.1.1.2
F5: 1.1.1.3
F6: 1.4
F7: 1.5

Of interest are the 4th checkin, from 1.3 to 1.1.1.1, and the 7th
checkin, from 1.1.1.3 to 1.4.  I will now use the '-' sign to denote
the taking of a delta and '+' for patch.  Thus, delta(A,B) is
equivalent to B-A, and patch(A,xdelta(A,B)) is equivalent to A+(B-A),
or B, as it should be.  In the above sequence of checkins, these
deltas are produced:

D0=F0-F1
D1=F1-F2-D0
D2=F2-F3-D1-D0
D3=F3-F4-D2-D1-D0
D4=F4-F5-D3-D2-D1-D0
D5=F5-F6-D4-D3-D2-D1-D0
D6=F6-F7-D5-D4-D3-D2-D1-D0

The checkin of F3 produces D2.  The concern is whether any deletions
between 1.1 and 1.3 have to be inserted into D2.  They have not,
because any deletions between 1.1 and 1.3 appear in D0 and D1.  Thus,
using D0, D1, and F3 as an input for computing D2 achieves the desired
results: the deletion is not reflected twice in separate deltas.

The properties of this version format differ from the RCS format.  It
ensures that recently checked-in versions are quick to check out, no
matter where in the "tree" they occur.  One potential problem with
this formulation is that according to the formulae above, the ammount
of memory and computation required for checking in each new version
grows with the number of previous versions.  To prevent this it only
uses up to some limiting value (currently set at 10) of previous
deltas in computing new deltas.  This means that deletions on a branch
over 10 checkins old will be duplicated on a checkin.  Another
property is that to delete a version all older versions must first be
deleted or off-line processing may be done to reconstruct the delta
log with the version removed (an expensive but not time-critical
operation).  It also lends itself to keeping windowed histories, such
as "only keep the last 20 revisions, anything earlier than that may be
thrown away".

---

The current prototype implementation is complete.  The program,
'xdelta', has a command syntax similar to PRCS:

xdelta COMMAND [OPTIONS] ARG1 ARG2 ...

There are 5 commands:

 register  Create a versionfile
 checkin   Checkin a new version
 checkout  Checkout a version (latest or -r VERSION)
 info      Show information on a version (all or -r VERSION)
 delta     Produce a delta from ARG1 to ARG2 producing ARG3
 patch     Patch file ARG1 with ARG2 producing ARG3

The -r option specifies a version (the latest version is always the
default).  Versions are sequentially numbered.  The file itself is a
GNU DBM database which stores the deltas and latest version.  The file
format is subject to change as I look for storage methods that are
portable to Windows (otherwise, the delta generator has already been
tested on Windows).

The DELTA and PATCH commands are unrelated to the rest, and simply
serve to generate deltas (and could be used as replacements for "diff"
and "patch").  All operations are verified with MD5 checksums, which
are saved for each version in the version file and the FROM and TO
files when generating and applying deltas.

---

Some timing comparisons (remember all times include MD5 computations)
on a 300MHz Alpha:

Two kernels:

   7875136 /genvmunix
   6677320 /vmunix

And the times to compress and delta them:

					time(s)	size(bytes)

xdelta delta /genvmunix /vmunix patch:	10.1	4338029
gzip --best /genvmunix:			34.0	3413963
gzip --best /vmunix:			63.7	2501766
gzip --best patch:			51.9	1531163

(I'm not sure why there's such a variation in gzip times--it was
repeatable.)

xdelta produces a patch which is 4338029 bytes, approximately 2.6
megabytes smaller than its TO file.  The total space required for
saving the two kernels compressed is 5915729 bytes, produced in 97.7
seconds.  The total space for storing /genvmunix and the delta
compressed is 4945126 bytes, produced in 96.0 seconds.  Using the
delta saved approximately one megabyte and two seconds.

---

All the files in PRCS 1.0.0 and 1.2.0 concatenated (files ordered
lexigraphically):

   1678837 /tmp/1.0.0
   2021819 /tmp/1.2.0

							time(s)	space(bytes)

xdelta delta /tmp/1.0.0 /tmp/1.2.0 /tmp/patch:		1.5	695893
gdiff -a --rcs /tmp/1.0.0 /tmp/1.2.0 /tmp/patch:	3.7	1524705

Even though both files were completely text, gdiff missed a lot of
similar regions because a line had been partially modified.  Xdelta
performs better in both dimensions.

---

All the .c and .h files in GNU gcc 2.7.2 and 2.7.2.3 concatenated (files
ordered lexigraphically).  These files are highly similar:

  15590954 /tmp/2.7.2
  15599644 /tmp/2.7.2.3
							time(s)	space(bytes)

xdelta delta /tmp/2.7.2 /tmp/2.7.2.3 /tmp/patch:	7.3	11922
gdiff -a --rcs /tmp/2.7.2 /tmp/2.7.2.3:			4.8	19477

Here xdelta takes a about 50% longer and produces a slightly smaller
diff.  I think the improved performance in the other two examples
makes xdelta an all around winner.

---

A note on FrameMaker files, which I happen to know a fair bit about.
Even making radical modifications to a Frame+SGML document leaves much
of its content unchanged, due to the fact that the EDD and various
catalogs do not change.  When only minimal modifications are performed

Take a 24 page document, chapter 15 of the SGML developers guide for
version 5.5. I replaced 38 occurances of "document" with
"abcdefghijklmnop" and added the source of this README from here up at
the end, making it 33 pages.

  250880 graphics.before.fm
  272384 graphics.after.fm

The document grew by about 22 kbytes.  The patch 32780 bytes.  The
same diff generated by GNU diff is 101097 bytes.  Clearly, Frame
documents are very good for binary version control.

---

I also have evidence that the program works well on uncompressed image
formats such as the GIMP's XCF file format.

This program will be the basis of PRCS version 2, which will support
client/server distributed version control.

Please send any comments, questions, thoughts, or interest to me,
jmacd@cs.berkeley.edu.