This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: Fw: xz instead of bzip2
- From: Antonio Diaz Diaz <antonio at gnu dot org>
- To: gcc at gcc dot gnu dot org
- Cc: Matias Fonzo <selk at dragora dot org>, R0b0t1 <r030t1 at gmail dot com>, Antonio Diaz Diaz <antonio at gnu dot org>
- Date: Tue, 06 Jun 2017 21:14:41 +0200
- Subject: Re: Fw: xz instead of bzip2
- Authentication-results: sourceware.org; auth=none
- References: <20170605201951.6353d41c@rafaela>
TL;DR: The xz format has a long list of defects, and at least some of
them are not minor. Please don't promote such a defective format by
using it in GCC. Thanks. :-)
> That article is rather interesting but unfortunately it does not
> compare and contrast. It only lists facts of XZ but does so mostly in
From the abstract:
"The relevant weaknesses and design errors in the xz format are analyzed
and, where applicable, compared with the corresponding behavior of the
bzip2, gzip and lzip formats".
"(4) LZMA2 is unsafe and less efficient than the original LZMA".
"(7) error detection in xz is several times less accurate than in bzip2,
gzip and lzip".
The body of the article contains statements like these:
"The xz format has more overhead than bzip2, gzip or lzip".
"Bzip2, gzip and lzip are free from these defects".
"The extensibility of bzip2 and lzip is better".
"Bzip2 is affected by this defect to a lesser extent".
"Gzip may be considered free from this defect".
"Lzip is free from this defect".
> so I can't really tell whether or not those things are actually bad
> without trusting the author.
There is no need to trust the author in order to verify, for example,
that using a CRC larger than the dataword is an error. All you need is
some knowledge of the properties of CRCs.
> My personal experience with the XZ format finds it better for
> releases, as compression tends to take longer (but give a better
> ratio) than decompression.
Lzip gives an even slightly better ratio.
> Section 2.7 is kind of interesting in that case, as it claims that
> LZMA2 has a lower compression ceiling than LZMA. I don't know how far
> typical compression strays from the ideal.
The main problem with LZMA2 is not that it compresses slightly less,
but that its critical fields are unprotected and interact badly with
the optional integrity checks of xz.
> 2.9 is fairly minor but I do agree with it. It and the later sections
> lead me to my next point: most of the discussion seems to be on the
> failure of XZ's format to provide proper error checking. However in
> practice I have not found anyone to rely on the error checking
> provided by compressed data formats, and have had it suggested that
> doing so is an exercise in futility. That most software projects
> provide digests of their source archives seems to agree with this.
The cryptographically secure hashes provided by most software projects
have nothing to do with error detection; their purpose is to prevent
tampering, for example a MITM attack replacing gcc-8-20170604.tar.xz
with malware.
The integrity checking provided by bzip2, gzip and lzip is perfectly
fine, and keeping it is important because it not only guards against
corruption in the compressed file, but also against memory errors,
undetected bugs in the decompressor, etc.
Also, distributing signed tarballs is not the only use for a compressed
format. Many people compress their own data and rely on the integrity
checking provided by the format.
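As a quick illustration of this kind of format-level integrity checking, here is a sketch (mine, not from the original mail) using Python's gzip module as a stand-in for any of the formats discussed; a single flipped bit in the compressed stream is caught either by the deflate decoder or by the stored CRC32 of the decompressed data:

```python
import gzip

original = b"Some data worth protecting." * 100
corrupted = bytearray(gzip.compress(original))

# Flip one bit in the middle of the compressed stream (deflate data).
corrupted[len(corrupted) // 2] ^= 0x01

try:
    gzip.decompress(bytes(corrupted))
    detected = False
except Exception:  # zlib.error or gzip.BadGzipFile on corruption
    detected = True

print("corruption detected:", detected)
```

The decoder either fails to parse the damaged deflate stream or decompresses to different data whose CRC32 no longer matches the one stored in the gzip trailer; either way the corruption is reported.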
> Trying to provide error checking within the archive itself falls prey
> to the two generals' problem
The two generals' problem is about bidirectional communication. It has
nothing to do with integrity checking. If "lzip --test" succeeds, the
file (message) has no errors. There is no need to communicate back to
the ftp server that the download went well. Note that
http://en.wikipedia.org/wiki/Two_Generals'_Problem does not question the
integrity of the messages transmitted.
> While I'm not sure this should reflect on the author, I am not able to
> understand his conclusion in 2.10, 2.10.3, and 2.10.4. It doesn't seem
> to be explained how he is detecting false positives (the given
> formulas describe what he is doing with the false positive rate after
> it is found).
The false positives considered in sections 2.10 and 2.10.x are not
detected but calculated, and they just depend on the relative sizes of
the dataword and the check sequence (CRC).
For example, in section 2.10.3 the 'Block Header' field usually has a
size of 12 bytes. Of these 12 bytes, 3 are padding and 4 are the CRC32.
Therefore, 7 of every 12 bit flips affect the padding or the CRC, and
only 5 of every 12 affect the header data proper. This means that 7 of
every 12 reported errors (caused by a bit flip) are false positives
(the data are intact).
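The proportions above can be restated directly; this small sketch just encodes the 12-byte 'Block Header' layout described in the article:

```python
# Layout of the 12-byte xz 'Block Header' assumed in section 2.10.3.
header_bytes = 5    # header data proper
padding_bytes = 3   # padding (nevertheless covered by the CRC in xz)
crc_bytes = 4       # CRC32 check sequence
total_bytes = header_bytes + padding_bytes + crc_bytes

# A uniformly random single-bit flip lands in each region in
# proportion to its size in bits (hence in bytes).
false_positive_rate = (padding_bytes + crc_bytes) / total_bytes
true_positive_rate = header_bytes / total_bytes

print(false_positive_rate)  # 7/12: reported error, but the data are intact
print(true_positive_rate)   # 5/12: the header data proper are corrupted
```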
The funny thing with xz is that the padding bytes are protected by the
CRC. Xz treats the useless padding as if it were useful data. The result
is that if the corruption affects the padding, the CRC won't match, and
the error will appear to be a true positive (the data are intact but the
decoder reports an error anyway).
> Thus to me it seems entirely likely he might be interpreting things
> backwards; CRC algorithms actually pass through a large number of
> errors compared to digest algorithms, so figure 3 makes no sense.
Note that figure 3 makes sense because it does not plot the false
negatives, but the inaccuracy, which is defined as:
( false_negatives + false_positives ) / total_cases.
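A minimal sketch of that definition (the function name and the sample numbers are illustrative, not taken from the article):

```python
def inaccuracy(false_negatives, false_positives, total_cases):
    """Inaccuracy as defined above: wrong verdicts over all cases."""
    return (false_negatives + false_positives) / total_cases

# Illustrative numbers: of 96 single-bit flips in a 12-byte block
# header, 56 hit the padding or the CRC (false positives) and none
# go undetected (no false negatives).
print(inaccuracy(false_negatives=0, false_positives=56, total_cases=96))
```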
About the difference between CRCs and hashes (digest algorithms), in
section 2.10 you can read:
"Because of the error multiplication caused by decompression, the error
model seen by the check sequence is one of unconstrained random data
corruption. (Remember that the check sequence verifies the integrity of
the decompressed data). This means that the choice of error-detection
code (CRC or hash) is largely irrelevant, and that the probability of an
error being undetected by the check sequence (Pudc) is 1/(2^n) for a
check sequence of n bits. (See [Koopman], p. 5). Note that if some
errors do not produce error multiplication, a CRC is then preferable to
a hash of the same size because of the burst error detection
capabilities of the CRC."
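The Pudc = 1/(2^n) figure for unconstrained random corruption is easy to check empirically. The sketch below (mine, not from the mail) models error multiplication by replacing the data with independent random bytes, and truncates CRC32 to 8 bits to stand in for an arbitrary n-bit check sequence; the undetected fraction comes out near 1/256 regardless of the code used:

```python
import random
import zlib

random.seed(0)
N_BITS = 8
MASK = (1 << N_BITS) - 1
TRIALS = 200_000
undetected = 0

for _ in range(TRIALS):
    original = random.randbytes(32)
    corrupted = random.randbytes(32)  # unconstrained random corruption
    if corrupted == original:
        continue  # not an error at all; skip
    # n-bit check sequence: CRC32 truncated to the low N_BITS bits.
    if zlib.crc32(original) & MASK == zlib.crc32(corrupted) & MASK:
        undetected += 1  # corruption slips past the check sequence

rate = undetected / TRIALS
print(rate)  # close to 1/256, i.e. 1/(2^n) for n = 8
```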
Mr. Fonzo, if you think this is worth the consideration of the GCC
developers, you might consider contacting the author of that web page
so that he is able to explain it.
I hope you find the above explanations satisfactory. But don't hesitate
to ask if you have more doubts. :-)