Archives of files (e.g. .tar, .zip, .rpm) are commonly used to store and transmit data. For widely distributed files, even a small amount of compression adds up: the savings in bandwidth and storage are roughly N*M, where N is the number of bytes saved per copy and M is the number of copies transferred or stored.
E.g.: if one byte is saved from a file downloaded one million times, one million bytes of bandwidth and transfer time are saved. The one million receiving systems also save one byte each, for a total storage savings of one million and one bytes. In practice the savings per file are usually far more than one byte, and download counts for popular individual files really can reach millions (Firefox downloads are on the order of millions).
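The N*M estimate above can be stated as a trivial sketch (the function name here is just for illustration):

```python
# Toy illustration of the N*M savings estimate: N bytes saved per
# copy, M copies transferred/stored.
def total_savings(bytes_saved_per_copy: int, copies: int) -> int:
    """Total bandwidth (or storage) saved across all copies."""
    return bytes_saved_per_copy * copies

# One byte saved on a file downloaded one million times:
print(total_savings(1, 1_000_000))  # 1000000
```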
Compression usually happens only once, while decompression usually happens many times.
Prediction by Partial Matching (PPM) seems to have been one of the leading algorithms in recent years in terms of reduced archive sizes.
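The core "partial match" idea is to predict the next symbol from the longest previously seen context, escaping to shorter contexts when the long one is unknown. A toy sketch of that idea (an assumption for illustration; real PPM couples this with arithmetic coding and explicit escape probabilities):

```python
from collections import defaultdict

class ToyPPM:
    """Toy order-2 context model illustrating PPM's fallback idea."""

    def __init__(self, max_order=2):
        self.max_order = max_order
        # counts[context][symbol] -> number of occurrences
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, text):
        # Record each symbol under all context lengths 0..max_order.
        for i, ch in enumerate(text):
            for k in range(self.max_order + 1):
                if i >= k:
                    self.counts[text[i - k:i]][ch] += 1

    def predict(self, context):
        # Try the longest context first; "escape" to shorter contexts
        # when the longer one has never been seen.
        for k in range(self.max_order, -1, -1):
            ctx = context[-k:] if k else ""
            seen = self.counts.get(ctx)
            if seen:
                total = sum(seen.values())
                return {s: c / total for s, c in seen.items()}
        return {}

model = ToyPPM()
model.update("abracadabra")
print(model.predict("ab"))  # {'r': 1.0} -- 'r' always followed "ab"
```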
Context Tree Weighting (CTW) is another leading algorithm.
One of the fundamental problems with historical algorithms is the limits placed on memory use: sliding context windows, dictionary word sizes, dictionary entry structures, and other uses of memory are all restricted in size.
rzip shows the kind of difference that increasing the context window size can make.
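The benefit of a large window is catching data repeated far apart, which a small-window compressor (e.g. gzip's 32 KiB window) cannot match. A rough sketch of that long-range idea, using fixed-size chunk hashing rather than rzip's actual rolling-hash search (the function name is an assumption for illustration):

```python
import hashlib

def find_long_range_duplicates(data: bytes, chunk_size: int = 4096):
    """Find chunks that repeat anywhere in the input, however far apart."""
    seen = {}          # chunk digest -> offset of first occurrence
    duplicates = []    # (duplicate_offset, original_offset)
    for off in range(0, len(data) - chunk_size + 1, chunk_size):
        digest = hashlib.sha256(data[off:off + chunk_size]).digest()
        if digest in seen:
            duplicates.append((off, seen[digest]))
        else:
            seen[digest] = off
    return duplicates

# Two identical 4 KiB blocks separated by 1 MiB of non-repeating filler,
# far beyond gzip's 32 KiB window:
filler = b"".join(i.to_bytes(4, "big") for i in range(1 << 18))  # 1 MiB
blob = b"A" * 4096 + filler + b"A" * 4096
print(find_long_range_duplicates(blob))  # [(1052672, 0)]
```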
Not everything can be compressed by the same algorithm. There is a simple "proof by counting" (pigeonhole argument). Switching algorithms by storing a type tag doesn't bypass this, because the set of switchable algorithms plus the tag together constitute a single algorithm covered by the same proof.
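The counting argument can be checked directly: there are 2^n bit strings of length n but only 2^n - 1 strings strictly shorter, and a lossless compressor must be injective, so some length-n input cannot shrink. A minimal numeric sketch:

```python
# Pigeonhole counting argument against universal compression.
n = 8
inputs = 2 ** n                                  # 256 strings of length 8
shorter_outputs = sum(2 ** k for k in range(n))  # 255 strings shorter than 8
print(inputs, shorter_outputs)  # 256 255
# A lossless (injective) compressor cannot map all 256 inputs into
# only 255 shorter outputs, so at least one input does not shrink.
assert inputs > shorter_outputs
```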
Some versions of gzip compress less by default for the sake of easier rsync transfers. This restarting of compression blocks may also allow for increased parallelization, and thus faster compression and decompression.
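The block-restart idea can be sketched with independent zlib streams (an assumption for illustration, not gzip's actual rsyncable mechanism): because each block is self-contained, blocks can be compressed and decompressed in parallel, at the cost of losing cross-block matches.

```python
import zlib

def compress_blocks(data: bytes, block_size: int = 64 * 1024):
    """Compress each block as an independent zlib stream."""
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def decompress_blocks(blocks):
    # Each block is independent, so this loop could run in parallel.
    return b"".join(zlib.decompress(b) for b in blocks)

payload = b"the quick brown fox " * 10_000
blocks = compress_blocks(payload)
assert decompress_blocks(blocks) == payload
```

A change early in the input only perturbs its own block's compressed bytes, which is what makes rsync-style delta transfers cheaper.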
My regularly used e-mail address alias is listed in my resume.