This is the mail archive of the mailing list for the GCC project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Fw: xz instead of bzip2

TL;DR. The xz format has a long list of defects, and at least some of them are not minor. Please, don't promote such a defective format using it in GCC. Thanks. :-)

R0b0t1<>  wrote:

That article is rather interesting but unfortunately it does not
compare and contrast. It only lists facts of XZ but does so mostly in
a vacuum

From the abstract:

"The relevant weaknesses and design errors in the xz format are analyzed and, where applicable, compared with the corresponding behavior of the bzip2, gzip and lzip formats".
"(4) LZMA2 is unsafe and less efficient than the original LZMA".
"(7) error detection in xz is several times less accurate than in bzip2, gzip and lzip".

The body of the article contains statements like these:

"The xz format has more overhead than bzip2, gzip or lzip".
"Bzip2, gzip and lzip are free from these defects".
"The extensibility of bzip2 and lzip is better".
"Bzip2 is affected by this defect to a lesser extent".
"Gzip may be considered free from this defect".
"Lzip is free from this defect".

so I can't really tell whether or not those things are actually bad
without trusting the author.

There is not any need to trust the author in order to verify, for example, that using a CRC larger than the dataword is an error. All you need is some knowledge about the properties of CRCs.

My personal experience with the XZ format finds it better for
releases, as compression tends to take longer (but give a better
ratio) than decompression.

Lzip gives even a slightly better ratio. See

Section 2.7 is kind of interesting in that case, as it claims that
LZMA2 has a lower compression ceiling than LZMA. I don't know how far
typical compression strays from the ideal.

The main problem with LZMA2 is not that it compresses slightly less, but its unprotected critical fields and its bad interaction with the optional integrity checks of xz. See

2.9 is fairly minor but I do agree with it. It and the later sections
lead me to my next point: most of the discussion seems to be on the
failure of XZ's format to provide proper error checking. However in
practice I have not found anyone to rely on the error checking
provided by compressed data formats, and have had it suggested that
doing so is an exercise in futility. That most software projects
provide digests of their source archives seems to agree with this.

The cryptographically secure hashes provided by most software projects have nothing to do with error detection, but with preventing tampering. For example a MITM attack replacing gcc-8-20170604.tar.xz with malware.

The integrity checking provided by bzip2, gzip and lzip is perfectly fine, and keeping it is important because it not only guards against corruption in the compressed file, but also against memory errors, undetected bugs in the decompressor, etc.

Also, distributing signed tarballs is not the only use for a compressed format. Many people compress their own data and rely on the integrity checking provided by the format.

Trying to provide error checking within the archive itself falls prey
to the two general's problem

The two generals' problem is about bidirectional communication. It has nothing to do with integrity checking. If "lzip --test" suceeds, the file (message) has no errors. There is no need to communicate back to the ftp server that the download went well. Note that'_Problem does not question the integrity of the messages transmitted.

While I'm not sure this should reflect on the author, I am not able to
understand his conclusion in 2.10, 2.10.3, and 2.10.4. It doesn't seem
to be explained how is he detecting false positives (the given
formulas describe what he is doing with the false positive rate after
it is found).

The false positives considered in sections 2.10 and 2.10.x are not detected but calculated, and they just depend on the relative sizes of the dataword and the check sequence (CRC).

For example, in section 2.10.3 the 'Block Header' field has usually a size of 12 bytes. From these 12 bytes, 3 are padding and 4 are the CRC32. Therefore, 7 of every 12 bit flips affect the padding or the CRC, and only 5 of every 12 affect the header data proper. This means that 7 of every 12 reported errors (caused by a bit flip) are false positives (the data are intact).

The funny thing with xz is that the padding bytes are protected by the CRC. Xz treats the useless padding as if it were useful data. The result is that if the corruption affects the padding, the CRC won't match, and the error will appear to be a true positive (the data are intact but the decoder reports an error anyway).

Thus to me it seems entirely likely he might be interpreting things
backwards; CRC algorithms actually pass through a large number of
errors compared to digest algorithms, so figure 3 makes no sense.

Note that figure 3 makes sense because it does not plot the false negatives, but the inaccuracy, which is defined as:
  ( false_negatives + false_positives ) / total_cases.

About the difference between CRCs and hashes (digest algorithms), in section 2.10 you can read:

"Because of the error multiplication caused by decompression, the error model seen by the check sequence is one of unconstrained random data corruption. (Remember that the check sequence verifies the integrity of the decompressed data). This means that the choice of error-detection code (CRC or hash) is largely irrelevant, and that the probability of an error being undetected by the check sequence (Pudc) is 1/(2^n) for a check sequence of n bits. (See [Koopman], p. 5). Note that if some errors do not produce error multiplication, a CRC is then preferable to a hash of the same size because of the burst error detection capabilities of the CRC."

Mr. Fonzo, if you think this is worth the consideration of the GCC
developers, you might consider contacting the author of that web page
so that he is able to explain it.

I hope you find the above explanations satisfactory. But don't hesitate to ask if you have more doubts. :-)

Best regards,

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]