This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: Fw: xz instead of bzip2

From: Antonio Diaz Diaz <antonio at gnu dot org>
To: gcc at gcc dot gnu dot org
Cc: Matias Fonzo <selk at dragora dot org>, R0b0t1 <r030t1 at gmail dot com>, Antonio Diaz Diaz <antonio at gnu dot org>
Date: Tue, 06 Jun 2017 21:14:41 +0200
Subject: Re: Fw: xz instead of bzip2
Authentication-results: sourceware.org; auth=none
References: <20170605201951.6353d41c@rafaela>

TL;DR. The xz format has a long list of defects, and at least some ofthem are not minor. Please, don't promote such a defective format usingit in GCC. Thanks. :-)


R0b0t1<r030t1@gmail.com>  wrote:

     http://lzip.nongnu.org/xz_inadequate.html


That article is rather interesting but unfortunately it does not
compare and contrast. It only lists facts of XZ but does so mostly in
a vacuum


From the abstract:

"The relevant weaknesses and design errors in the xz format are analyzedand, where applicable, compared with the corresponding behavior of thebzip2, gzip and lzip formats".

"(4) LZMA2 is unsafe and less efficient than the original LZMA".

"(7) error detection in xz is several times less accurate than in bzip2,gzip and lzip".


The body of the article contains statements like these:

"The xz format has more overhead than bzip2, gzip or lzip".
"Bzip2, gzip and lzip are free from these defects".
"The extensibility of bzip2 and lzip is better".
"Bzip2 is affected by this defect to a lesser extent".
"Gzip may be considered free from this defect".
"Lzip is free from this defect".
...etc.

so I can't really tell whether or not those things are actually bad
without trusting the author.

There is not any need to trust the author in order to verify, forexample, that using a CRC larger than the dataword is an error. All youneed is some knowledge about the properties of CRCs.

My personal experience with the XZ format finds it better for
releases, as compression tends to take longer (but give a better
ratio) than decompression.

Lzip gives even a slightly better ratio. Seehttp://www.nongnu.org/lzip/lzip_benchmark.html#xz1

Section 2.7 is kind of interesting in that case, as it claims that
LZMA2 has a lower compression ceiling than LZMA. I don't know how far
typical compression strays from the ideal.

The main problem with LZMA2 is not that it compresses slightly less, butits unprotected critical fields and its bad interaction with theoptional integrity checks of xz. Seehttp://www.nongnu.org/lzip/lzip_benchmark.html#busybox

2.9 is fairly minor but I do agree with it. It and the later sections
lead me to my next point: most of the discussion seems to be on the
failure of XZ's format to provide proper error checking. However in
practice I have not found anyone to rely on the error checking
provided by compressed data formats, and have had it suggested that
doing so is an exercise in futility. That most software projects
provide digests of their source archives seems to agree with this.

The cryptographically secure hashes provided by most software projectshave nothing to do with error detection, but with preventing tampering.For example a MITM attack replacing gcc-8-20170604.tar.xz with malware.

The integrity checking provided by bzip2, gzip and lzip is perfectlyfine, and keeping it is important because it not only guards againstcorruption in the compressed file, but also against memory errors,undetected bugs in the decompressor, etc.

Also, distributing signed tarballs is not the only use for a compressedformat. Many people compress their own data and rely on the integritychecking provided by the format.

Trying to provide error checking within the archive itself falls prey
to the two general's problem

The two generals' problem is about bidirectional communication. It hasnothing to do with integrity checking. If "lzip --test" suceeds, thefile (message) has no errors. There is no need to communicate back tothe ftp server that the download went well. Note thathttp://en.wikipedia.org/wiki/Two_Generals'_Problem does not question theintegrity of the messages transmitted.

While I'm not sure this should reflect on the author, I am not able to
understand his conclusion in 2.10, 2.10.3, and 2.10.4. It doesn't seem
to be explained how is he detecting false positives (the given
formulas describe what he is doing with the false positive rate after
it is found).

The false positives considered in sections 2.10 and 2.10.x are notdetected but calculated, and they just depend on the relative sizes ofthe dataword and the check sequence (CRC).

For example, in section 2.10.3 the 'Block Header' field has usually asize of 12 bytes. From these 12 bytes, 3 are padding and 4 are theCRC32. Therefore, 7 of every 12 bit flips affect the padding or the CRC,and only 5 of every 12 affect the header data proper. This means that 7of every 12 reported errors (caused by a bit flip) are false positives(the data are intact).

The funny thing with xz is that the padding bytes are protected by theCRC. Xz treats the useless padding as if it were useful data. The resultis that if the corruption affects the padding, the CRC won't match, andthe error will appear to be a true positive (the data are intact but thedecoder reports an error anyway).

Thus to me it seems entirely likely he might be interpreting things
backwards; CRC algorithms actually pass through a large number of
errors compared to digest algorithms, so figure 3 makes no sense.

Note that figure 3 makes sense because it does not plot the falsenegatives, but the inaccuracy, which is defined as:

  ( false_negatives + false_positives ) / total_cases.

About the difference between CRCs and hashes (digest algorithms), insection 2.10 you can read:

"Because of the error multiplication caused by decompression, the errormodel seen by the check sequence is one of unconstrained random datacorruption. (Remember that the check sequence verifies the integrity ofthe decompressed data). This means that the choice of error-detectioncode (CRC or hash) is largely irrelevant, and that the probability of anerror being undetected by the check sequence (Pudc) is 1/(2^n) for acheck sequence of n bits. (See [Koopman], p. 5). Note that if someerrors do not produce error multiplication, a CRC is then preferable toa hash of the same size because of the burst error detectioncapabilities of the CRC."

Mr. Fonzo, if you think this is worth the consideration of the GCC
developers, you might consider contacting the author of that web page
so that he is able to explain it.

I hope you find the above explanations satisfactory. But don't hesitateto ask if you have more doubts. :-)



Best regards,
Antonio.

References:
- Fw: xz instead of bzip2
  - From: Matias Fonzo

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]