Safety of the lzip format

Abstract

There is little information about the safety of compressed file formats. Users can't usually make an informed decision about what format to use because the designers of compressed formats rarely explain their design decisions or publish tests results (beyond speed or ratio benchmarks). This article explains the safety-related design decisions I made for lzip and measures the safety and accuracy of lzip's integrity checking. The results show that lzip achieves high safety and accuracy in the detection of errors while keeping a high availability. These features along with its simple design, its low overhead, and its full interoperability make of lzip an excellent choice for reliable transmission, distribution, and long-term archiving of important data.

Disclosure statement: The author is also author of the lzip format.

Acknowledgements: The author would like to thank Nissanka Gooneratne for the thousands of hours he has spent testing lzip and for his help in the improvement of lziprecover. Most of the source tarballs used to test lzip were downloaded from the Slackware repository and from the GNU FTP site.

This page is a work in progress. It will be updated from time to time with new data in order to refine the results. Please report any errors or inaccuracies to the author at the lzip mailing list lzip-bug@nongnu.org or in private at antonio@gnu.org.

1 Introduction

Users can't usually make an informed decision about what compressed format to use because the designers of compressed formats rarely explain their design decisions or publish tests results (beyond speed or ratio benchmarks). This article explains the safety-related design decisions I made for lzip and measures the safety and accuracy of lzip's integrity checking.

Since 2008 I have tested lzip with unzcrash and have published some results, for example here. This article presents the latest results. Thousands of millions of test decompressions performed since 2015 on files compressed with lzip 1.16 or newer are examined, and the interaction between LZMA compression and CRC32 in the detection of errors is analyzed.

The results show some of the advantages of lzip for long-term archiving in general and for distribution of source code in particular. (Source tarballs are meant to be archived forever as a way to compile and run a given version of some software in the future, for example to decode old data).

2 Why integrity checking is important

The main task of a decompressor is producing the correct decompressed data from valid compressed data. Other task as important as this one is warning the user when the correct decompressed data can't be produced (for example because the compressed data are corrupt). Failing to perform this second task is known as undetected corruption, and may cause irreparable loss of data. Because of this, the lzip format has been designed to deliver the decompressed data in all cases where this is possible, while making undetected corruption almost impossible. With lzip, false negatives are practically zero, while false positives are minimized and recoverable.

The recovery capabilities of lziprecover (the data recovery tool for the lzip format) are possible because of the accurate integrity checking of the lzip format. The lzip format improves the probability of recovering from corruption without any special measures beyond standard backup practices.

Unfortunately, integrity checking is sometimes neglected by designers of compressed formats in order to gain speed. For example, the format of the POSIX tool 'compress' does not include a check sequence, and integrity checking is optional in formats like xz or zstd. The use of such formats for short-term operations may be acceptable under some circumstances, but they are not adequate for long-term archiving.

3 Significance of these tests

Almost all the packages from the GNU project and the Slackware distribution (plus some files from other sources) have been tested for all possible bit flips and all possible 512-byte sector read errors in the LZMA stream. 2057 files in total. As far as I know, this is the largest controlled test ever run on a compressed format.

If we suppose the following conditions:

The files tested are single-member files with an average compressed size of 1 MiB and are decompressed at a rate of 10 files per second.
The files are read from a hard drive with a sector size of 4096 bytes and an uncorrectable error rate of one read error per 1e14 bits (1.25e13 bytes).

then we get that the 25_940 million test decompressions done to date on files with simulated errors are equivalent to 980 million years of continuous normal use of lzip decompressing files read from a drive. Or one million servers decompressing lzip files from hard drives for 980 years.

Note that the MTBF measured in the sections below is for continuous decompression of corrupt lzip files. The MTBF in real use is much longer than the age of the universe because of the multiplying effect of the low error rate of the drive.

These numbers seem large, but a format adequate for long-term archiving needs to be very safe and be extensively tested. 1000 million computers decompressing just 17 files per day, from drives about as reliable as the one described above, would find one corrupt file per minute. Corruption in downloaded files is much more frequent than that.

4 Accuracy of error detection is a tradeoff

A check sequence (for example a CRC) is not some pixie dust that magically detects corruption in the data. It is also data, and therefore it may also become corrupt, causing the unfortunate situation called "false positive", where the real data are intact but the decoder discards them as if they were corrupt.

"There can be safety tradeoffs with the addition of an error-detection scheme. As with almost all fault tolerance mechanisms, there is a tradeoff between availability and integrity. That is, techniques that increase integrity tend to reduce availability and vice versa. Employing error detection by adding a check sequence to a dataword increases integrity, but decreases availability. The decrease in availability happens through false-positive detections. These failures preclude the use of some data that otherwise would not have been rejected had it not been for the addition of error-detection coding". ([Koopman], p. 33).

As longer check sequences always increase the number of false positives, the optimal accuracy in the detection of errors is achieved by using the shortest check sequence that reduces the number of false negatives below the desired value. This value is frequently expressed as the mean time between false negatives (MTBF) for a given working rate. Once the MTBF reaches a large enough value (e.g., the 1e9 hours, about 114_079 years, set by FAA AC 25.1309-1A, page 15, for extremely improbable failure conditions), it does not make sense to continue increasing the size of the check sequence.

Lzip achieves optimal accuracy in the detection of errors by keeping the format small and simple, and this accuracy can't be improved by using a longer check sequence. The lzip format is formed by the compressed data, preceded by a header containing the parameters needed for decompression, and followed by a trailer containing integrity information. This design minimizes both overhead and false positives. The lzip format is free from any source of false positives beyond the header and trailer.

The probability of a bit flip in a lzip file resulting in a false positive is equal to 26 * number_of_members / file_size. One good thing about the false positives reported by lzip is that, because of its 3-factor integrity checking, in most cases they can be identified as false positives, and the data can be fully recovered without special tools.

4.1 Why CRC32 instead of CRC32-C or CRC64?

Because a CRC32 offers by itself a MTBF longer than 1e9 hours for a decompression rate of up to 4 corrupt files per hour, which is perfectly adequate for most uses. CRC32-C (Castagnoli) is slightly better for small uncompressed sizes than the CRC32 (IEEE 802.3 Ethernet) used by lzip, but for uncompressed sizes between 256 and 512 MiB, the CRC32 used by lzip is slightly better. Here "better" means "has a larger Hamming distance (HD)". (See [Koopman2]). In any case, the small differences in HD between CRC32 and CRC32-C are nullified by the error multiplication of decompression because the data errors not guaranteed to be detected by the CRC seem to contain a number of flipped bits much larger than the HD of both CRCs. A HD of 3 or 4 makes no difference when the error to be detected contains one hundred flipped bits.

CRC64 is overkill for lzip because it doubles the probability of a false positive compared with CRC32 without significantly reducing the (already insignificant in the case of lzip) probability of a false negative.

5 Interaction between compression and check sequence

Integrity checking in compressed files is different from other cases (like Ethernet packets) because the data that can become corrupted are the compressed data, but the data that are checked (the dataword) are the decompressed data. Decompression can cause error multiplication; even a single-bit error in the compressed data may produce any random number of errors in the decompressed data, or even modify the size of the decompressed data.

In the cases where the decompression causes error multiplication, the error model seen by the check sequence is one of unconstrained random data corruption. (Remember that the check sequence checks the integrity of the decompressed data). This means that the choice of error-detection code (CRC or hash) is largely irrelevant, and that the probability of an error being undetected by the check sequence (Pudc) is 1 / (2^n) for a check sequence of n bits. (See [Koopman], p. 5). But in the cases where an error does not cause error multiplication, a CRC is preferable to a hash of the same size because of the burst error detection capabilities of the CRC which guarantee the detection of small errors.

Decompression algorithms are usually able to detect some errors in the compressed data (for example a backreference to a point before the beginning of the data). Therefore, if the errors not detected by the decoder are not "special" for the check sequence (neither guaranteed to be caught nor guaranteed to be missed), then the total probability of an undetected error (Pud) is the product of the probability of the error being undetected by the decoder (Pudd) and the probability of the error being undetected by the check sequence (Pudc): Pud = Pudd * Pudc

It is also possible that a small error in the compressed data does not alter at all the decompressed data. Therefore, for maximum availability, only the decompressed data should be tested for errors. Testing the compressed data beyond what is needed to perform the decompression increases the number of false positives without usually reducing the number of undetected errors.

In any case, it is important to always check the integrity of the decompressed data because decompression is not a trivial task and the check not only guards against corruption in the compressed file, but also against memory errors, undetected bugs in the decompressor, etc.

5.1 Interaction between LZMA compression and CRC32

The LZMA data stream provides embedded error detection. Any distance larger than the dictionary size acts as a forbidden symbol, allowing the decoder to detect the approximate position of most errors, and leaving little work for the check sequence in the detection of errors.

The effects of the few errors not detected by the LZMA decoder on the decompressed output can be classified in four types: the alteration of a single byte, a burst error spawning four bytes or less, the replacement of several occurrences of a concrete byte value by a different concrete byte value (for example the replacement of some 'a' characters by 'c' characters), and random corruption. The first two types of errors are guaranteed to be detected by the CRC. The third type of error is guaranteed to be detected by the CRC except when the number of bytes replaced is a multiple of 2^32 - 1, which is impossible in files smaller than 4 GiB - 1 byte uncompressed and improbable in files larger than that. (The same value must be fed to a CRC32 2^32 - 1 times for the CRC value to repeat itself).

Because of the above, the real Pud of lzip is lower (better) than what a model of unconstrained random data corruption would predict.

6 Error models

Two error models have been used to measure the accuracy of lzip in the detection of errors. The first model consists of random bit flips in the compressed file. Just one bit flip per trial. The second model consists of zeroed 512-byte blocks, simulating a whole sector I/O error. Just one zeroed block per trial. The blocks are tried at every byte position because testing only blocks aligned to a 512-byte boundary would require a testing corpus too large. The first model is considered the most important because bit flips in the compressed data are the most difficult to detect by the LZMA decoder, and they happen even in the most expensive hardware [MSL].

Trial decompressions were performed using lziprecover. The commands used for the two models are:

Single-bit errors: lziprecover -U1 file.lz | lzip -9 -o file.lz.1nht.lz

Zeroed block errors: lziprecover -UB512 file.lz | lzip -9 -o file.lz.512-00d.lz

6.1 Estimated Pud of lzip for single-bit errors

23_059 million trial decompressions have been performed by running the command shown in the previous section on real lzip files. (See section Links to test results for listings of the files tested). All those errors were detected by the decoder except 597, which were then detected by the CRC. Of those, 127 were of types guaranteed to be caught by the CRC (11 single-byte errors, 8 bursts, and 108 replacements), leaving 470 events of random data corruption not guaranteed to be caught by the CRC. (See Interaction between LZMA compression and CRC32).

For the single-bit error model, the Pudd of a LZMA decoder like the one in lzip is estimated from the data above (excluding the errors guaranteed to be caught by the CRC) to be about 2.0382e-8. (It may be slightly different for other files). This additional detection capability is supposed to reduce the Pud by the same factor respect to the Pudc of the CRC32 alone (about 2.3283e-10). Therefore, the estimated Pud for a bit flip in a lzip file, based on the data above, is of about 2.0382e-8 * 2.3283e-10 = 4.745e-18.

After 16 years of testing, the MTBF of lzip can only be roughly estimated because not even one false negative has ever been observed. (And, if the estimate is accurate, it is not expected to ever find one in files larger than about 100 kB by testing). If one were to continuously decompress corrupt lzip files of about one megabyte in size (10 decompressions per second), each of them containing the kind of corruption most difficult to detect (one random bit flip), then a false negative would be expected to happen every (1 / Pud) / 10 / 31_556_952(s/year) = 667 million years.

Estimating lzip's Pud for single-bit errors on small files (a few kB compressed) is a pending task. Preliminary tests show that about 3 orders of magnitude more errors escape detection by the decoder because of the small size, but almost all of these errors seem to be of the kind that is guaranteed to be detected by the CRC.

6.2 Zeroed block errors

For the zeroed-block error model, the additional detection capability of a LZMA decoder is probably much larger than for a bit flip. A LZMA stream is a check sequence in itself, and large errors seem less probable to escape detection than small ones. In fact, I do not try to estimate the Pud of lzip for zeroed blocks because it is too small. The LZMA decoder detected the error in all the 2881 million trial decompressions run with a zeroed block, which suggests a lower bound for the MTBF longer than the age of the universe.

7 Links to test results

These are the listings of the results files produced by lziprecover for all the files tested to date. The files themselves are not published because they are many, some are large, and all are easily reproducible from the corresponding source file.

8 Conclusions

Lzip achieves optimal accuracy in the detection of errors; false negatives are reduced to an insignificant level, while false positives are minimized and recoverable. This allows lzip provide both high safety and high availability, which, along with its simple design, its low overhead, and its full interoperability, makes of lzip an excellent choice for reliable transmission, distribution, and long-term archiving of important data.

9 References

10 Glossary

^: Exponentiation symbol.
Accuracy: Freedom from mistakes. Is defined as 1 - inaccuracy.
Bit flip: Error that inverts a single bit value.
Burst Error: A set of bit errors contained within a span shorter than the length of the codeword. The burst length is the distance between the first and last bit errors. Zero or more of the intervening bits may be erroneous.
Check sequence: An error-detecting code value stored with associated data. A dataword plus a check sequence makes up a codeword.
Codeword: A dataword combined with a check sequence.
Cyclic Redundancy Check (CRC): A type of check sequence.
Dataword: The data being protected by a check sequence. A dataword can be any number of bits and is independent of the machine word size (e.g., a dataword could be 64 megabytes).
Failure rate: 1 False negatives per time unit (inverse of the MTBF). 2 False negatives per corrupt member decompressed (Pud).
False negative: Undetected error producing incorrect decompressed data.
False positive: Inability or refusal to decompress the data in a damaged compressed file even if the compressed data proper remain decompressible. (i.e., either the compressed data have not suffered any damage or the damage does not hinder decompression).
Inaccuracy: Ratio of error detection mistakes. Is defined as ( false_negatives + false_positives ) / total_cases.
Mean Time Between Failures (MTBF): Mean time between false negatives for a given decompression rate of corrupt members per time unit.
Overhead: Everything in the file except the compressed data proper. (Headers, check sequences, padding, etc).
Overkill: An unnecessary excess of whatever is needed to achieve a goal.
Pud: Probability of an undetected error (false negative). Pud is estimated as Pudd * Pudc.
Pudc: Probability of an error undetected by the check sequence.
Pudd: Probability of an error undetected by the decoder. Only errors not guaranteed to be caught by the check sequence (CRC) are counted when using Pudd to estimate Pud.
Year (Gregorian): 365.2425 days.

You are free to copy and distribute this article without limitation, but you are not allowed to modify it.

First published: 2020-12-27
Updated: 2024-12-13

This page does not use javascript.