I have placed version 0.13 of XDELTA, a new delta generator and prototype RCS-replacement program, at:

    ftp://www.xcf.berkeley.edu/pub/jmacd/xdelta-0.13.tar.gz

======

Hereafter, the algorithm and program are described.

A delta algorithm operates on two files, a FROM file and a TO file. It computes enough information to reconstruct the TO file from the FROM file. So the delta and its inverse, patch, work as follows:

    P  = delta (FROM, TO)
    TO = patch (FROM, P)

The delta generator portion of this program is a delta algorithm which searches for substring matches between the files and then outputs instructions to reconstruct the new file from the old file. It produces a set of copy/insert instructions that tell how to reconstruct the file as a sequence of copies from the FROM file and inserts from the delta itself. In this regard, the program is much closer to a compression program than to a diff program. However, the delta is not "compressed": the delta's entropy H(P) will be very similar to the entropy of the portions of the TO file not found within the FROM file. The delta will compress just as well as the TO file will.

This is a fundamentally different method of computing deltas than the one used in the traditional "diff" program. The diff program and its variants use a longest-common-subsequence (LCS) algorithm to find a list of inserts and deletes that will modify the FROM file into the TO file. LCS is more expensive to compute and is sometimes more useful, especially to the human reader. Since LCS is a fairly expensive algorithm, diff programs usually divide the input files into newline-separated "atoms" before computing a delta. This is a fine approximation for text files, but not for binary files.

I propose this delta generator as a replacement for "diff" in applications where the insert/delete delta is not required. Since the copy/insert delta is easier to compute, it does not have to reduce the input size by breaking the file into lines, and the delta generator works well on binary files.

The next question is whether it is actually desirable to compute deltas between binary files. The answer is certainly yes, even though some binary file formats will not show a great degree of similarity between versions generated by minor modifications to their sources. First, I have evidence that some file formats (notably FrameMaker documents) work very well. Machine-dependent object files and executables don't work very well, but computing the delta is still worthwhile: compression takes longer than finding these deltas, so any space savings the delta generator produces is very likely to reduce the total archival time. Even if the delta generator saves no space, the total time will end up less than twice the time of compression alone. I will include some measurements at the end of this writing.
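To make the copy/insert model concrete, here is a minimal Python sketch of the patch direction. The instruction encoding is invented for illustration; it is not xdelta's actual delta format:

    # Sketch: reconstruct TO from FROM plus a list of copy/insert
    # instructions.  ('copy', offset, length) copies bytes out of FROM;
    # ('insert', data) appends literal data carried in the delta itself.
    def apply_delta(from_data, instructions):
        out = bytearray()
        for insn in instructions:
            if insn[0] == 'copy':
                _, offset, length = insn
                out += from_data[offset:offset + length]
            else:
                _, data = insn
                out += data
        return bytes(out)

    # TO differs from FROM in a single word; the delta carries only the
    # new word, and everything else is copied from FROM.
    assert apply_delta(b"the quick brown fox",
                       [('copy', 0, 4),         # "the "
                        ('insert', b"slow"),    # data stored in the delta
                        ('copy', 9, 10)]        # " brown fox"
                       ) == b"the slow brown fox"

The entropy claim above follows directly: the only literal data in the delta is what the insert instructions carry, namely the portions of the TO file not found in the FROM file.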
With the delta generator as a building block, I have written a replacement for RCS. It does not have all the user-level features someone would want in RCS, but those features only get in the way when the code is used by higher-level version control programs such as PRCS. For now, the prototype supports the following actions:

    register    create an empty version file
    checkin     create a new version
    checkout    extract an old version

The program departs from the RCS model in that it does not branch its versions. The reason for doing this is to solve several problems with RCS (aside from its use of diff, which is solved by using my delta generator):

1. Development very often ends up off the "main trunk" of the RCS file, which causes an O(N) extract, where N deltas must be applied to the head of the trunk before getting the desired file. Once this occurs, things get worse and worse. Of course, it is possible to remedy the situation by checking the file in once on the trunk, but that defeats the purpose of branching, because you lose the information about which branch of development you are on. Further, the notion of branching individual files conflicts with the model presented by higher-level version control programs such as PRCS, which would rather not have to go to extra trouble to ensure that development does not end up off the trunk.

2. In RCS, a delta is never used more than once in a particular version history, even if it contains the same data as another delta.

Conceptually, to get rid of RCS-like branching while retaining the rule that deltas are computed between a file and its parent, the xdelta version control library computes a reverse delta between the new file and the entire version file. The most recently checked-in version is always the "head", in RCS terms; it is always the quickest and easiest to extract. For a repository with N versions, version I requires the application of (N-I) deltas, regardless of how the versions are related.

Each delta contains a set of instructions and all data which was originally part of a file version but is missing in the FROM file. This data is represented without compression or modification. As a result, each delta may be used as an input for finding future deltas, just as a regular file may be.

Several equations describe the process. First, some notation:

    V0 is the version file with 1 version
    V1 is the version file with 2 versions
    ...
    VN is the version file with N+1 versions

    F0 is the 1st file checked in; upon checkin, V0 is produced
    F1 is the 2nd file checked in; upon checkin, V1 is produced
    ...
    FN is the (N+1)st file checked in; upon checkin, VN is produced

    D0 is the 1st delta produced, upon checkin of F1
    D1 is the 2nd delta produced, upon checkin of F2
    ...
    DN is the (N+1)st delta produced, upon checkin of F(N+1)

Checkin goes as follows:

    V0 <- F0
    D0 <- delta (F1, F0)
    V1 <- F1 + D0
    D1 <- delta (F2 + D0, F1)
    V2 <- F2 + D1 + D0
    D2 <- delta (F3 + D1 + D0, F2)
    V3 <- F3 + D2 + D1 + D0

and so on. To checkout:

    F3 is immediate
    F2 <- patch (F3 + D1 + D0, D2)
    F1 <- patch (F2 + D0, D1)
    F0 <- patch (F1, D0)
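To make the recurrence concrete, here is a short Python sketch of the checkin/checkout bookkeeping. The delta and patch stand-ins below are trivial (they just record the target file) so that the example runs end to end; the names and representation are mine, not the library's actual interfaces:

    # Stand-ins for the abstract operations defined at the top:
    #   P = delta (FROM, TO);  TO = patch (FROM, P)
    def delta(src, tgt):
        return tgt                        # trivial stand-in

    def patch(src, p):
        return p                          # trivial stand-in

    def concat(f, deltas):
        return f + b"".join(deltas)       # "F + D1 + D0" in the equations

    def checkin(store, new_file):
        # store is (head, [D(N-1), ..., D0]); head is version N
        head, deltas = store
        d = delta(concat(new_file, deltas), head)   # reverse delta
        return (new_file, [d] + deltas)             # new_file is the new head

    def checkout(store, i):
        # extracting version I costs exactly (N - I) patch applications
        head, deltas = store
        n = len(deltas)                   # head is version N = n
        f = head
        for k in range(n - 1, i - 1, -1):
            older = deltas[n - k:]        # D(k-1), ..., D0
            f = patch(concat(f, older), deltas[n - 1 - k])   # recover F(k)
        return f

    store = (b"F0", [])
    for f in (b"F1", b"F2", b"F3"):
        store = checkin(store, f)
    assert checkout(store, 3) == b"F3"    # the head is immediate
    assert checkout(store, 0) == b"F0"    # three patches applied

Note that checkin passes the existing deltas along with the new file as the delta source; this is what lets copy instructions reference data already stored in older deltas, as the branching discussion below explains.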
Now I will describe why this is just as good as branching, even though branches do not affect which files are used for computing deltas, and why it is just as good as the RCS model. Suppose we have a sequence of checkins with RCS version numbers and their sequence numbers:

    F0: 1.1
    F1: 1.2
    F2: 1.3
    F3: 1.1.1.1
    F4: 1.1.1.2
    F5: 1.1.1.3
    F6: 1.4
    F7: 1.5

Of interest are the 4th checkin, from 1.3 to 1.1.1.1, and the 7th checkin, from 1.1.1.3 to 1.4. I will now use the '-' sign to denote the taking of a delta and '+' for patch. Thus, delta(A,B) is equivalent to B-A, and patch(A,delta(A,B)) is equivalent to A+(B-A), or B, as it should be. In the above sequence of checkins, these deltas are produced:

    D0 = F0 - F1
    D1 = F1 - F2 - D0
    D2 = F2 - F3 - D1 - D0
    D3 = F3 - F4 - D2 - D1 - D0
    D4 = F4 - F5 - D3 - D2 - D1 - D0
    D5 = F5 - F6 - D4 - D3 - D2 - D1 - D0
    D6 = F6 - F7 - D5 - D4 - D3 - D2 - D1 - D0

The checkin of F3 produces D2. The concern is whether any deletions between 1.1 and 1.3 have to be inserted into D2. They do not, because any deletions between 1.1 and 1.3 already appear in D0 and D1. Thus, using D0, D1, and F3 as inputs for computing D2 achieves the desired result: the deletion is not reflected twice in separate deltas.

The properties of this version format differ from the RCS format. It ensures that recently checked-in versions are quick to check out, no matter where in the "tree" they occur.

One potential problem with this formulation is that, according to the formulae above, the amount of memory and computation required to check in each new version grows with the number of previous versions. To prevent this, the implementation uses only up to a limiting number (currently set at 10) of previous deltas when computing new deltas. This means that deletions on a branch more than 10 checkins old will be duplicated on a checkin.

Another property is that to delete a version, all older versions must first be deleted; alternatively, off-line processing may be done to reconstruct the delta log with the version removed (an expensive but not time-critical operation). The format also lends itself to keeping windowed histories, such as "only keep the last 20 revisions; anything earlier than that may be thrown away".

---

The current prototype implementation is complete. The program, 'xdelta', has a command syntax similar to PRCS:

    xdelta COMMAND [OPTIONS] ARG1 ARG2 ...

There are six commands:

    register    Create a version file
    checkin     Check in a new version
    checkout    Check out a version (the latest, or -r VERSION)
    info        Show information on a version (all, or -r VERSION)
    delta       Produce a delta from ARG1 to ARG2, producing ARG3
    patch       Patch file ARG1 with ARG2, producing ARG3

The -r option specifies a version (the latest version is always the default). Versions are sequentially numbered. The version file itself is a GNU DBM database which stores the deltas and the latest version. The file format is subject to change as I look for storage methods that are portable to Windows (the delta generator itself has already been tested on Windows).

The DELTA and PATCH commands are unrelated to the rest; they simply generate and apply deltas, and could be used as replacements for "diff" and "patch". All operations are verified with MD5 checksums, which are saved for each version in the version file, and for the FROM and TO files when generating and applying deltas.
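To make the synopsis concrete, here is the delta/patch round trip spelled out using the argument order given above (the file names are hypothetical):

    xdelta delta old.fm new.fm fm.patch       # P  = delta (FROM, TO)
    xdelta patch old.fm fm.patch new.fm.2     # TO = patch (FROM, P)

After the second command, new.fm.2 should be byte-for-byte identical to new.fm, and the saved MD5 checksums guard both steps.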
---

Some timing comparisons on a 300MHz Alpha (remember, all times include MD5 computation). Two kernels:

    7875136  /genvmunix
    6677320  /vmunix

And the times to compress and delta them:

                                              time(s)   size(bytes)
    xdelta delta /genvmunix /vmunix patch:     10.1       4338029
    gzip --best /genvmunix:                    34.0       3413963
    gzip --best /vmunix:                       63.7       2501766
    gzip --best patch:                         51.9       1531163

(I'm not sure why there's such a variation in gzip times; it was repeatable.)

xdelta produces a patch which is 4338029 bytes, approximately 2.3 megabytes smaller than its TO file (/vmunix). The total space required for saving the two kernels compressed is 5915729 bytes, produced in 97.7 seconds. The total space for storing /genvmunix and the delta, both compressed, is 4945126 bytes, produced in 96.0 seconds. Using the delta saved approximately one megabyte and two seconds.

---

All the files in PRCS 1.0.0 and 1.2.0 concatenated (files ordered lexicographically):

    1678837  /tmp/1.0.0
    2021819  /tmp/1.2.0

                                                        time(s)   space(bytes)
    xdelta delta /tmp/1.0.0 /tmp/1.2.0 /tmp/patch:        1.5        695893
    gdiff -a --rcs /tmp/1.0.0 /tmp/1.2.0 /tmp/patch:      3.7       1524705

Even though both files were entirely text, gdiff missed a lot of similar regions because lines had been partially modified. Xdelta performs better in both dimensions.

---

All the .c and .h files in GNU gcc 2.7.2 and 2.7.2.3 concatenated (files ordered lexicographically). These files are highly similar:

    15590954  /tmp/2.7.2
    15599644  /tmp/2.7.2.3

                                                        time(s)   space(bytes)
    xdelta delta /tmp/2.7.2 /tmp/2.7.2.3 /tmp/patch:      7.3         11922
    gdiff -a --rcs /tmp/2.7.2 /tmp/2.7.2.3:               4.8         19477

Here xdelta takes about 50% longer and produces a slightly smaller delta. I think the improved performance in the other two examples makes xdelta an all-around winner.

---

A note on FrameMaker files, which I happen to know a fair bit about. Even making radical modifications to a Frame+SGML document leaves much of its content unchanged, because the EDD and various catalogs do not change. When only minimal modifications are performed, even more of the file is untouched. Take a 24-page document, chapter 15 of the SGML developer's guide for version 5.5. I replaced 38 occurrences of "document" with "abcdefghijklmnop" and appended the source of this README from here up, making it 33 pages:

    250880  graphics.before.fm
    272384  graphics.after.fm

The document grew by about 22 kbytes; the patch is 32780 bytes. The same diff generated by GNU diff is 101097 bytes. Clearly, Frame documents are very well suited to binary version control.

---

I also have evidence that the program works well on uncompressed image formats such as the GIMP's XCF file format.

This program will be the basis of PRCS version 2, which will support client/server distributed version control.

Please send any comments, questions, thoughts, or interest to me, jmacd@cs.berkeley.edu.