From d594fbe514e2a381541a510fe01041e546f56a67 Mon Sep 17 00:00:00 2001 From: Josh MacDonald Date: Fri, 26 Feb 2016 21:43:58 -0800 Subject: Remove xdelta1 from this repo --- xdelta1/xdelta-0.13.README | 311 --------------------------------------------- 1 file changed, 311 deletions(-) delete mode 100644 xdelta1/xdelta-0.13.README (limited to 'xdelta1/xdelta-0.13.README') diff --git a/xdelta1/xdelta-0.13.README b/xdelta1/xdelta-0.13.README deleted file mode 100644 index f928f66..0000000 --- a/xdelta1/xdelta-0.13.README +++ /dev/null @@ -1,311 +0,0 @@ - -I have placed version 0.13 of XDELTA, new delta generator and -prototype RCS-replacement program at: - - ftp://www.xcf.berkeley.edu/pub/jmacd/xdelta-0.13.tar.gz - -====== - -Hereafter, the algorithm and program are described. A delta algorithm -operates on two files, a FROM file and a TO file. It computes enough -informnation to reconstruct the TO file from the FROM file. So the -delta and it's inverse, patch, work as follows: - -P = delta (FROM, TO) -TO = patch (FROM, P) - -The delta generator portion of this program is a delta algorithm which -searches for substring matches between the files and then outputs -instructions to reconstruct the new file from the old file. It -produces a set of copy/insert instructions that tell how to -reconstruct the file as a sequence of copies from the FROM file and -inserts from the delta itself. In this regard, the program is much -closer to a compression program than to a diff program. However, the -delta is not "compressed", in that the delta's entropy H(P) will be -very similar to the entropy of the portions of the TO file not found -within the FROM file. The delta will compress just as well as the TO -file will. This is a fundamentally different method of computing -deltas than in the traditional "diff" program. The diff program and -it's variants use a least-common-subsequence (LCS) algorithm to find a -list of inserts and deletes that will modify the FROM file into the TO -file. LCS is more expensive to compute and is sometimes more useful, -especially to the human reader. Since LCS is a fairly expensive -algorithm, diff programs usually divide the input files into -newline-separated "atoms" before computing a delta. This is a fine -approximation for text files, but not binary files. - -I propose this delta generator as a replacement for "diff" in -applications where the insert/delete delta is not required. Since the -copy/insert delta is easier to compute, it does not have to make the -reduction in input size by breaking the file into lines. The delta -generator works well on binary files. - -The next question is whether it is actually desirable to compute -deltas between binary files. The answer is certainly yes, even though -some binary file formats will not have a great degree of similarity -between versions which were generated between minor modifications to -their sources. First, I have evidence that some file formats (notably -FrameMaker documents) work very well. Machine-dependant object files -and executables don't work very well, but it is still worth doing. -The reason it is still worthwhile is that compression takes longer -than finding these deltas, so any space savings the delta generator -produces it very likely to reduce the total archival time. Even if -the delta generator saves no space, the total time well end up less -than twice the time of compression. I will include some measurements -at the end of this writing. - -With the delta generator as a building block, I have written a -replacement for RCS. It does not have all the user-level features -someone would want in RCS, but these things only get in the way when -the code is used by higher level version control programs such as -PRCS. For now, the prototype supports the following actions: - -register create an empty version file -checkin create a new version -checkout extract an old version - -The program departs from the RCS model in that it does not branch it's -versions. The reason for doing this is to solve several problems with -RCS (aside from it's use of diff which is solved by using my delta -generator). - -1. Development very often ends up off the "main trunk" of the RCS -file, which causes an O(N) extract, where N deltas must be applied to -the head of the trunk before getting the desired file. Once this -occurs, things get worse and worse. Of course, it is possible to -remedy the situation by checking it in once on the trunk, but that -defeats the purpose of branching, because you lose the information -about branch of development you are on. Further, the notion of -branching individual files conflicts with the model presented by -higher level version control programs such as PRCS, which would rather -not have to go to extra trouble to insure that development does not -end up off the trunk. - -2. Deltas are never used more than once in a particular version -history, even if they contain the same data as another delta. - -Conceptually, to get rid of RCS-like branching and retain the "deltas -are computed between a file and its parent", the xdelta version -control library computes a reverse delta between the new file and the -entire version file. The most recently checked in version is always -the "head", in RCS terms. It is always the quickest and easiest to -extract. For a repository with N versions, version I requires the -application of (N-I) deltas, regardless of how the set of versions are -related. - -Each delta contains a set of instructions and all data which was -originally part of a file version but missing in the FROM file. This -data is represented without compression or modification. As a result, -each delta may be used as an input for finding future deltas just as a -regular file may be. - -Several equations describe the process, first some notation: - - V0 is the version file with 1 version - V1 is the version file with 2 versions, - ... - VN is the version file with N+1 versions - - F0 is the 1st file checked in, upon checkin V0 is produced - F1 is the 2nd file checked in, upon checkin V1 is produced - ... - FN is the Nth file checked in, upon checkin VN is produced - - D0 is the 1st delta produced, upon checkin of F1 - D1 is the 2nd delta produced, upon checkin of F2 - ... - DN is the Nth delta produced, upon checkin of FN+1 - -Checkin goes as follows: - - V0 <- F0 - - D0 <- delta (F1, F0) - V1 <- F1 + D0 - - D1 <- delta (F2 + D0, F1) - V2 <- F2 + D1 + D0 - - D2 <- delta (F3 + D1 + D0, F2) - V3 <- F3 + D2 + D1 + D0 - -an so on. To checkout: - - F3 is immediate - F2 <- patch (F3 + D1 + D0, D2) - F1 <- patch (F2 + D0, D1) - F0 <- patch (F1, D0) - -Now I will describe why this is just as good as branching, even though -the branches do not affect which files are used for computing deltas, -and why it is just as good as the RCS model. - -Suppose the we have a sequence of checkins with RCS version numbers -and their sequence numbers: - -F0: 1.1 -F1: 1.2 -F2: 1.3 -F3: 1.1.1.1 -F4: 1.1.1.2 -F5: 1.1.1.3 -F6: 1.4 -F7: 1.5 - -Of interest are the 4th checkin, from 1.3 to 1.1.1.1, and the 7th -checkin, from 1.1.1.3 to 1.4. I will now use the '-' sign to denote -the taking of a delta and '+' for patch. Thus, delta(A,B) is -equivalent to B-A, and patch(A,xdelta(A,B)) is equivalent to A+(B-A), -or B, as it should be. In the above sequence of checkins, these -deltas are produced: - -D0=F0-F1 -D1=F1-F2-D0 -D2=F2-F3-D1-D0 -D3=F3-F4-D2-D1-D0 -D4=F4-F5-D3-D2-D1-D0 -D5=F5-F6-D4-D3-D2-D1-D0 -D6=F6-F7-D5-D4-D3-D2-D1-D0 - -The checkin of F3 produces D2. The concern is whether any deletions -between 1.1 and 1.3 have to be inserted into D2. They have not, -because any deletions between 1.1 and 1.3 appear in D0 and D1. Thus, -using D0, D1, and F3 as an input for computing D2 achieves the desired -results: the deletion is not reflected twice in separate deltas. - -The properties of this version format differ from the RCS format. It -ensures that recently checked-in versions are quick to check out, no -matter where in the "tree" they occur. One potential problem with -this formulation is that according to the formulae above, the ammount -of memory and computation required for checking in each new version -grows with the number of previous versions. To prevent this it only -uses up to some limiting value (currently set at 10) of previous -deltas in computing new deltas. This means that deletions on a branch -over 10 checkins old will be duplicated on a checkin. Another -property is that to delete a version all older versions must first be -deleted or off-line processing may be done to reconstruct the delta -log with the version removed (an expensive but not time-critical -operation). It also lends itself to keeping windowed histories, such -as "only keep the last 20 revisions, anything earlier than that may be -thrown away". - ---- - -The current prototype implementation is complete. The program, -'xdelta', has a command syntax similar to PRCS: - -xdelta COMMAND [OPTIONS] ARG1 ARG2 ... - -There are 5 commands: - - register Create a versionfile - checkin Checkin a new version - checkout Checkout a version (latest or -r VERSION) - info Show information on a version (all or -r VERSION) - delta Produce a delta from ARG1 to ARG2 producing ARG3 - patch Patch file ARG1 with ARG2 producing ARG3 - -The -r option specifies a version (the latest version is always the -default). Versions are sequentially numbered. The file itself is a -GNU DBM database which stores the deltas and latest version. The file -format is subject to change as I look for storage methods that are -portable to Windows (otherwise, the delta generator has already been -tested on Windows). - -The DELTA and PATCH commands are unrelated to the rest, and simply -serve to generate deltas (and could be used as replacements for "diff" -and "patch"). All operations are verified with MD5 checksums, which -are saved for each version in the version file and the FROM and TO -files when generating and applying deltas. - ---- - -Some timing comparisons (remember all times include MD5 computations) -on a 300MHz Alpha: - -Two kernels: - - 7875136 /genvmunix - 6677320 /vmunix - -And the times to compress and delta them: - - time(s) size(bytes) - -xdelta delta /genvmunix /vmunix patch: 10.1 4338029 -gzip --best /genvmunix: 34.0 3413963 -gzip --best /vmunix: 63.7 2501766 -gzip --best patch: 51.9 1531163 - -(I'm not sure why there's such a variation in gzip times--it was -repeatable.) - -xdelta produces a patch which is 4338029 bytes, approximately 2.6 -megabytes smaller than its TO file. The total space required for -saving the two kernels compressed is 5915729 bytes, produced in 97.7 -seconds. The total space for storing /genvmunix and the delta -compressed is 4945126 bytes, produced in 96.0 seconds. Using the -delta saved approximately one megabyte and two seconds. - ---- - -All the files in PRCS 1.0.0 and 1.2.0 concatenated (files ordered -lexigraphically): - - 1678837 /tmp/1.0.0 - 2021819 /tmp/1.2.0 - - time(s) space(bytes) - -xdelta delta /tmp/1.0.0 /tmp/1.2.0 /tmp/patch: 1.5 695893 -gdiff -a --rcs /tmp/1.0.0 /tmp/1.2.0 /tmp/patch: 3.7 1524705 - -Even though both files were completely text, gdiff missed a lot of -similar regions because a line had been partially modified. Xdelta -performs better in both dimensions. - ---- - -All the .c and .h files in GNU gcc 2.7.2 and 2.7.2.3 concatenated (files -ordered lexigraphically). These files are highly similar: - - 15590954 /tmp/2.7.2 - 15599644 /tmp/2.7.2.3 - time(s) space(bytes) - -xdelta delta /tmp/2.7.2 /tmp/2.7.2.3 /tmp/patch: 7.3 11922 -gdiff -a --rcs /tmp/2.7.2 /tmp/2.7.2.3: 4.8 19477 - -Here xdelta takes a about 50% longer and produces a slightly smaller -diff. I think the improved performance in the other two examples -makes xdelta an all around winner. - ---- - -A note on FrameMaker files, which I happen to know a fair bit about. -Even making radical modifications to a Frame+SGML document leaves much -of its content unchanged, due to the fact that the EDD and various -catalogs do not change. When only minimal modifications are performed - -Take a 24 page document, chapter 15 of the SGML developers guide for -version 5.5. I replaced 38 occurances of "document" with -"abcdefghijklmnop" and added the source of this README from here up at -the end, making it 33 pages. - - 250880 graphics.before.fm - 272384 graphics.after.fm - -The document grew by about 22 kbytes. The patch 32780 bytes. The -same diff generated by GNU diff is 101097 bytes. Clearly, Frame -documents are very good for binary version control. - ---- - -I also have evidence that the program works well on uncompressed image -formats such as the GIMP's XCF file format. - -This program will be the basis of PRCS version 2, which will support -client/server distributed version control. - -Please send any comments, questions, thoughts, or interest to me, -jmacd@cs.berkeley.edu. -- cgit v1.2.3