There seems to be a lot of confusion about zsync's custom embedded version of zlib. I had not documented the exact reason for the patches very well; so I have now committed an explanation of the changes to my own repo. As I may not make another release for a little while yet, I am posting the explanation here also.
I have started a discussion with the zlib maintainer about what sort of API changes could be made such that I could use the standard zlib, but so far no-one other than me understands the requirements and I'm not actually bothered about it (it's the Fedora people who are worked up about it). So, absent anyone else stepping in to do the work, it may take me a while to get to it.
Local changes to zlib used by zsync
These patches are designed to support two different modes of operation in zsync:
Changes to the deflate code: Compressing a file in a way that is optimised for zsync's block-based rsync algorithm ‒ starting a new zlib block for each 1024 byte (for example) block in the source file. cf http://zsync.moria.org.uk/paper/ch03s04.html . This is used by makegz.c in the zsync source.
Changes to the inflate code: Working with files compressed with the standard gzip(1). To enable people to get started with zsync, I want it to work with existing compressed content. To achieve optimal results with standard gzip files, I made zsync capable of starting decompression in the middle of a block. In these cases it downloads the block header, then skips forward to the part of the block that gives it the data that it wants. cf http://zsync.moria.org.uk/paper/ch03s02.html
Contrary to some internet discussion, the changes are not related to rsync's changes nor to rsync compatibility (zsync isn't compatible with rsync, whatever that would mean, nor do these changes relate in any way to the rsyncable gzip patch).
Changes to the deflate code
Essentially, I hijacked Z_PARTIAL_FLUSH to mean something new ‒ I want to start a new zlib block, but unlike Z_PARTIAL_FLUSH I don't need to emit a whole byte between blocks, so I took that out. Properly, this ought to be implemented by adding a Z_NEWBLOCKONLY_FLUSH or something like that, instead of repurposing an existing state.
(If this were the only issue preventing the use of a standard zlib, distros could change it to use Z_PARTIAL_FLUSH with only a slight loss of compression efficiency.)
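As a rough illustration of that fallback (this is not makegz.c, and it uses Python's binding to an unpatched zlib rather than zsync's embedded copy), the sketch below ends the current zlib block after every 1024 bytes of input using the stock Z_PARTIAL_FLUSH and compares the result with an unflushed stream. The function names and the sample data are made up for the example.

```python
import zlib

# Z_PARTIAL_FLUSH is 1 in zlib.h; fall back to that value in case the
# installed Python does not export the name.
Z_PARTIAL_FLUSH = getattr(zlib, "Z_PARTIAL_FLUSH", 1)


def deflate_per_block(data, blocksize=1024):
    """Raw-deflate data, ending the zlib block after every blocksize input bytes."""
    comp = zlib.compressobj(9, zlib.DEFLATED, -15)  # -15 selects raw deflate
    out = []
    for off in range(0, len(data), blocksize):
        out.append(comp.compress(data[off:off + blocksize]))
        # Stock behaviour: finish the current block and emit the flush marker.
        out.append(comp.flush(Z_PARTIAL_FLUSH))
    out.append(comp.flush(zlib.Z_FINISH))
    return b"".join(out)


def deflate_plain(data):
    """Raw-deflate data in one go, letting zlib choose its own block boundaries."""
    comp = zlib.compressobj(9, zlib.DEFLATED, -15)
    return comp.compress(data) + comp.flush(zlib.Z_FINISH)


# Hypothetical sample input, only here to make the comparison runnable.
sample = b"".join(b"zsync block-aligned sample line %d\n" % i for i in range(2000))
flushed = deflate_per_block(sample)
plain = deflate_plain(sample)
print(len(plain), len(flushed))
```

Both streams inflate back to the same data; the flushed one is somewhat larger because every 1024-byte boundary forces a new block plus the flush marker.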
Changes to the inflate code
zsync uses the rsync algorithm to construct a desired file from an (e.g.) older local version of the file, downloading any new or needed blocks from a server; the aim is to minimise the amount of data downloaded to construct the target file. It supports downloading those blocks from a gzipped version of the file on the server. If I want e.g. bytes 4096-8192 of the file from inside the gzipped file, I could download the whole zlib block containing the range 4096-8192, located using a map of the compressed file that is constructed beforehand and downloaded first (zsync 0.1.0 used this method). But zsync can do better (fewer bytes downloaded) than that, by downloading just the block header and then downloading the bytes within that compressed block that correspond to bytes 4096-8192 of the contained data.
To do that, I need: a) to be able to start inflating at the start of "any length/literal/eob code in any dynamic or fixed block, or at any stored byte in a stored block". That is what the additional function inflate_advance() and the export of updatewindow() allow me to do.
b) to make a map of the gzip file that lets me know what points I should start downloading at in order to inflate particular byte ranges of the contained content. To do this, I can decompress each byte range into a buffer of that size and then quiz zlib for the position in the stream; but I need to know that the position in the stream does correspond to the start of a code or the middle of a stored block (not, e.g., that we have just read a backref and the backref expands to span the boundary; in that case, I would need to know the position where the backref started, and the lib doesn't give me a way to find that out).
This is given by inflateSafePoint(), by the modification to cause the inflator to return to the caller at each code in a dynamic block (the LENDO change), and the implicit guarantee provided by using my own copy of the library that I know how the library behaves around internal states and stream position (I need a guarantee that the library won't read ahead more than it needs to, and I need to access certain member variables directly to get the bit position in the stream).
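To make the role of the map concrete, here is a minimal sketch of the simpler whole-block lookup (the zsync 0.1.0 style described above). The map layout is hypothetical: a sorted list of (uncompressed offset, compressed offset) pairs, one per zlib block, using byte offsets for simplicity. It is not the .zsync file format, and the real map has to track positions inside a block as well.

```python
from bisect import bisect_left, bisect_right


def compressed_range(block_map, compressed_len, start, end):
    """Return the compressed byte range covering uncompressed bytes [start, end).

    block_map is a hypothetical sorted list of (uncompressed_start,
    compressed_start) pairs, one entry per zlib block.
    """
    ustarts = [u for u, _ in block_map]
    first = bisect_right(ustarts, start) - 1  # block that contains `start`
    last = bisect_left(ustarts, end)          # first block starting at or after `end`
    cstart = block_map[first][1]
    cend = block_map[last][1] if last < len(block_map) else compressed_len
    return cstart, cend


# Made-up map: four zlib blocks in a 3400-byte compressed file.
example_map = [(0, 0), (3000, 900), (6000, 1800), (9000, 2600)]
print(compressed_range(example_map, 3400, 4096, 8192))  # → (900, 2600)
```

Here the requested range straddles two zlib blocks, so both are fetched whole; the finer-grained approach needs map entries at safe points inside the blocks too.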
I also removed inflate_fast as I did not want to spend the time working out if it was compatible with these changes.
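For readers who want to experiment with the general idea using an unpatched zlib, the sketch below shows a much more restricted analogue: it can only resume at byte-aligned Z_SYNC_FLUSH points, not at an arbitrary code inside a block, and it primes the inflate window with the preceding uncompressed data through the dictionary mechanism, which plays roughly the role that the exported updatewindow() plays in the patched library. It does not use inflate_advance() or inflateSafePoint(), and all function names here are invented for the example.

```python
import zlib

BLOCK = 1024
WINDOW = 32768  # largest back-reference distance that deflate allows


def deflate_with_offsets(data):
    """Raw-deflate data with a sync flush after each BLOCK bytes.

    Returns the stream and, for each input block, the compressed byte
    offset at which that block's data starts.
    """
    comp = zlib.compressobj(9, zlib.DEFLATED, -15)
    stream = bytearray()
    offsets = []
    for off in range(0, len(data), BLOCK):
        offsets.append(len(stream))
        stream += comp.compress(data[off:off + BLOCK])
        stream += comp.flush(zlib.Z_SYNC_FLUSH)  # byte-aligned block boundary
    stream += comp.flush(zlib.Z_FINISH)
    return bytes(stream), offsets


def inflate_from(stream, offset, context):
    """Inflate from a flush point, given the uncompressed data that precedes it."""
    window = context[-WINDOW:]
    if window:
        # For raw deflate the dictionary is installed as the initial window.
        d = zlib.decompressobj(-15, zdict=window)
    else:
        d = zlib.decompressobj(-15)
    return d.decompress(stream[offset:])


# Hypothetical sample data with plenty of cross-block repetition.
data = b"".join(b"line %d of some local file content\n" % i for i in range(800))
stream, offsets = deflate_with_offsets(data)
tail = inflate_from(stream, offsets[7], data[:7 * BLOCK])
```

In zsync's case the preceding data comes from blocks it already has, whether from the local file or from earlier downloads. Without that context, inflation would typically fail as soon as a back-reference pointed before the starting position.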
zsync 0.6.1 is now available from the download page. This fixes a few bugs that have been spotted in the previous version, plus a few minor feature changes:
- recompression support for gzip files made with zlib:gzio.c or gzip -n
- fix compilation on MacOS X
- allow HTTP redirects on the target file; not sure whether this is a good idea or not...
- fix unnecessary transfer of whole file where file is smaller than the context size (1x or 2x blocksize)
- use sequential_matches=1 when there is only one block; otherwise we're forced to transfer the whole file for files below 2kiB
- fix librcksum handling of zsync streams with sequential_matches == 1; it was giving false negatives when applying the rsync algorithm, resulting in poor use of local source data when sequential_matches == 1 (which didn't actually occur in any recent version of zsync)
zsync 0.6 is now available from the download page. This is mainly a maintenance release, fixing various minor bugs that people have noted over the 2 years since the last release. I have also gone through and tidied up the source code somewhat.
The only functional changes are:
- zsync now preserves the mtime on downloaded files (this requires an extra field in the .zsync, but this format change is entirely compatible with old clients);
- -q option replaces -s (but -s is retained temporarily as a synonym).
These make zsync align better with wget as a file download client.
The full changelog:
- fix out-of-bounds memory access when processing last block of non-compressed download (patch from Timothy Lee). Also fix an error handling fault for the same.
- fix "try a smaller blocksize" failures when running zsyncmake on huge compressed files on 32bit systems
- preserve mtime on downloaded files
- fix potential crash when re/deallocating checksum hash in librcksum (patch from Timothy Lee)
- explain status code errors better
- better URL handling
- add -q as a substitute for -s, as -q is more conventional (re wget). -q also suppresses the 'no relevant local data' warning now.
- fix some warnings
- code tidy-up and better commenting of what it is doing
- tidy up autoconf use
Version 0.6 is available from the download page, as are all previous versions and the bzr repository.
This fascinating article about HTTP performance caught my eye on Slashdot today. It parallels the W3C paper about HTTP, keep-alives and pipelining that I used as a reference while working on zsync. zsync uses HTTP 1.1 with keep-alive and pipelining, and certainly has terrible performance without it.
The most interesting thing is how the whole article is really about problems caused by not having pipelining enabled by default in web browsers. The recommendation ‒ run 4 different hostnames, as a way of raising the browser-based restriction (actually mandated by the HTTP RFC) on the number of connections in parallel to the same site from 2 to 8 ‒ is effectively saying that the system as-is is broken. The RFC limits to 2 connections at a time because there is a tragedy of the commons here: every individual user gains by opening as many parallel connections as they can, but this hurts overall network performance (HTTP keep-alives are designed specifically to avoid having one HTTP connection overhead per HTTP request). Having lots of connections reduces latency, but it creates a lot of TCP connection overhead on the network and for the server.
The article is up-front about the fact that the performance problems are all solved by enabling pipelining. The discussion of upstream bandwidth is interesting, and I hadn't considered the effect of asymmetric Internet connections in this light before. But the upstream bandwidth wouldn't affect the latency (except for the first request) if pipelining was used; zsync works fine on ADSL, because it uses pipelining to transmit the next request up while the current one is being received down. Upstream bandwidth could still affect latency, but far less than the effect the article shows for servers without pipelining; and making multiple connections would certainly do no better.
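To show what pipelining looks like on the wire, here is a minimal sketch that builds several HTTP/1.1 range requests back to back in one buffer, so that the later requests are already on their way while the first response is still arriving. The host name and path are placeholders, and this is not zsync's actual request code.

```python
def pipelined_range_requests(host, path, ranges):
    """Build one buffer holding several HTTP/1.1 GET requests back to back.

    ranges is a list of half-open (start, end) byte ranges. HTTP/1.1
    connections are persistent by default, so no Connection header is needed.
    """
    requests = []
    for start, end in ranges:
        requests.append(
            f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            f"Range: bytes={start}-{end - 1}\r\n"  # HTTP byte ranges are inclusive
            "\r\n"
        )
    return "".join(requests).encode("ascii")


# Placeholder host and path, purely for illustration.
buf = pipelined_range_requests("example.com", "/file.gz", [(0, 1024), (4096, 8192)])
print(buf.decode("ascii"))
```

A client would write the whole buffer to a single connection and then read the responses in the same order as the requests were sent.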
Since the owners of broken servers are free to disable HTTP 1.1 keep-alives, or just put a proxy in front, I second the idea of just turning on pipelining in the mainstream browsers. HTTP 1.1 already solved this problem. The only reason we aren't benefitting from the solution is that it is disabled due to broken servers; but they will never be fixed if the option is never turned on.
The main feature of this release is that large file support is now enabled on systems (Linux/i386 in particular) where it needs to be explicitly selected. As I do most of the development on Linux/x86_64 now, I was blissfully ignorant of this problem until Robert Lemmen forwarded a complaint about it.
I have also fixed some compilation problems on MacOS X and Solaris. There is also a substitute for getaddrinfo provided for systems that need it, which someone emailed me about.
Finally, I have made the source code repository for zsync available online. This, and the new release, are available from the download page.
I have had this sitting around for a while, so it is about time that I released it. No big changes in this release; I have tidied up the program output, so the program is more silent with -s now. I have also added HTTP basic authentication support ‒ this makes zsync usable in places where you can't have everybody accessing your downloads. Get it from the download page.
I have not seen this before: some of the Ubuntu people are experimenting with methods based on rsync/zsync for doing package updates (link).
In other news, I have frozen a copy of the zsync technical paper, for reference purposes. It has been ages since I updated it anyway, so I am keeping a copy as-is before I update it with some of my current ideas. It is important that I get a comparison in the paper with what I am calling structured patching systems: like the new differential Package list updating that Debian have implemented.
Just a minor update, fixing a few bugs. Download.
This is just a bugfix release. Someone noticed that zsync would be vulnerable to CAN-2005-2096, due to its use of zlib code. So the patch for that is now done.
This release also includes some HTTP protocol fixes, as there have been some complaints/observations on how zsync failed to correctly implement some details from the standard. I still have some of these to do, and the fixes in this release have not been heavily tested yet, so let me know if there are any problems.
Get it from the download page.
Nothing much has changed, but I thought it was about time for another release. The only significant change in this release is some fixes to the progress bar display. As zsync is quite stable now, and bug reports have dried up for the moment, I am declaring it to be a beta instead of an alpha now; I expect the file format and command-line interface to be fairly stable from now on.