Bug 51017 - [11/12/13/14 Regression] GCC performance regression (vs. 4.4/4.5), PRE/LIM increase register pressure too much
Summary: [11/12/13/14 Regression] GCC performance regression (vs. 4.4/4.5), PRE/LIM increase register pressure too much
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.6.2
Importance: P2 normal
Target Milestone: 11.5
Assignee: Richard Biener
URL:
Keywords: missed-optimization
Depends on:
Blocks: 79704
 
Reported: 2011-11-08 00:43 UTC by Alexander Peslyak
Modified: 2023-07-07 10:29 UTC (History)
CC: 5 users

See Also:
Host:
Target:
Build:
Known to work: 4.3.4
Known to fail: 4.8.3, 4.9.2, 5.0
Last reconfirmed: 2015-02-09 00:00:00


Description Alexander Peslyak 2011-11-08 00:43:08 UTC
GCC 4.6 happens to produce approx. 25% slower code on at least x86_64 than 4.4 and 4.5 did for John the Ripper 1.7.8's bitslice DES implementation.  To reproduce, download http://download.openwall.net/pub/projects/john/1.7.8/john-1.7.8.tar.bz2 and build it with "make linux-x86-64" (will use SSE2 intrinsics), "make linux-x86-64-avx" (will use AVX instead), or "make generic" (won't use any intrinsics).  Then run "../run/john -te=1".  With GCC 4.4 and 4.5, the "Traditional DES" benchmark reports a speed of around 2500K c/s for the "linux-x86-64" (SSE2) build on a 2.33 GHz Core 2 (this is using one core).  With 4.6, this drops to about 1850K c/s.  Similar slowdown was observed for AVX on Core i7-2600K when going from GCC 4.5.x to 4.6.x.  And it is reproducible for the without-intrinsics code as well, although that's of less practical importance (the intrinsics are so much faster).  Similar slowdown with GCC 4.6 was reported by a Mac OS X user.  It was also spotted by Phoronix in their recently published C compiler benchmarks, but misinterpreted as a GCC vs. clang difference.

Adding "-Os" to OPT_INLINE in the Makefile partially corrects the performance (to something like 2000K c/s - still 20% slower than GCC 4.4/4.5's).  Applying the OpenMP patch from http://download.openwall.net/pub/projects/john/1.7.8/john-1.7.8-omp-des-4.diff.gz and then running with OMP_NUM_THREADS=1 (for a fair comparison) corrects the performance almost fully.  Keeping the patch applied, but removing -fopenmp still keeps the performance at a good level.  So it's some change made to the source code by this patch that mitigates the GCC regression.  Similar behavior is seen with current CVS version of John the Ripper, even though it has OpenMP support for DES heavily revised and integrated into the tree.
Comment 1 Alexander Peslyak 2011-11-08 00:47:49 UTC
(In reply to comment #0)
> [...] Similar behavior
> is seen with current CVS version of John the Ripper, even though it has OpenMP
> support for DES heavily revised and integrated into the tree.

I forgot to note that in the CVS version, I changed the default for non-OpenMP builds to use the supplied SSE2 assembly code, which hides this GCC issue for SSE2 non-OpenMP builds.  The C code may be re-enabled in x86-64.h, or alternatively an -avx or generic build may be used.  (Yes, -avx is still fully affected by the GCC regression even in the latest version of JtR code.)

But it is probably simpler to use the 1.7.8 release to reproduce this bug anyway.
Comment 2 Alexander Peslyak 2011-11-08 00:56:47 UTC
The affected code is in DES_bs_b.c: DES_bs_crypt_25().  (Sorry, I should have mentioned that right away.)
Comment 3 Andrew Pinski 2011-12-15 00:28:51 UTC
It might be interesting to get numbers for the trunk.  There have been some register allocator fixes which might have improved this.
Comment 4 Alexander Peslyak 2012-01-03 04:45:43 UTC
(In reply to comment #3)
> It might be interesting to get numbers for the trunk.  There have been some
> register allocator fixes which might have improved this.

I've just tested the gcc-4.7-20111231 snapshot vs. 4.6.2 release.  There's no improvement as it relates to this issue: I am getting the same poor performance (a lot worse than for 4.5).  This is for generating x86-64 code with SSE2 intrinsics, benchmarking the resulting code on a Core 2'ish CPU (I used Xeon E5420 this time).
Comment 5 Alexander Peslyak 2012-01-04 19:39:26 UTC
I wrote and ran some scripts to test many versions/snapshots of gcc.  It turns out that 4.6-20100703 (oldest 4.6 snapshot available for FTP) was already affected by this regression, whereas 4.5-20111229 and 4.4-20120103 are not affected (as expected).  Also, it turns out that there was a smaller regression at this same benchmark between 4.3 and 4.4.  That is, 4.3 produces the fastest code of all gcc versions I tested.  Here are some numbers:

4.3.5 20100502 - 2950K c/s, 28229 bytes
4.3.6 20110626 - 2950K c/s, 28229 bytes
4.4.5 20100504 - 2697K c/s, 29764 bytes
4.4.7 20120103 - 2691K c/s, 29316 bytes
4.5.1 20100603 - 2729K c/s, 29203 bytes
4.5.4 20111229 - 2710K c/s, 29203 bytes
4.6.0 20100703 - 2133K c/s, 29911 bytes
4.6.0 20100807 - 2119K c/s, 29940 bytes
4.6.0 20100904 - 2142K c/s, 29848 bytes
4.6.0 20101106 - 2124K c/s, 29848 bytes
4.6.0 20101204 - 2114K c/s, 29624 bytes
4.6.3 20111230 - 2116K c/s, 29624 bytes
4.7.0 20111231 - 2147K c/s, 29692 bytes

These are for JtR 1.7.9 with DES_BS_ASM set to 0 on line 157 of x86-64.h (to disable this version's workaround for this GCC 4.6 regression), built with "make linux-x86-64" and run on one core in a Xeon E5420 2.5 GHz (the system is otherwise idle).  The code sizes given are for .text of DES_bs_b.o (which contains three similar functions, of which one is in use by this benchmark - that is, the code size in the loop is about 10 KB).

As you can see, 4.3 generated code that was both significantly faster and a bit smaller than all other versions'.  In 4.4, the speed decreased by 8.5% and code size increased by 4.4%.  4.5 corrected this to a very limited extent - still 8% slower and 3.5% larger than 4.3's.  4.6 brought a huge performance drop and a slight code size increase.  4.7.0 20111231's code is still 27% slower than 4.3's.
Comment 6 Jakub Jelinek 2012-01-04 22:42:37 UTC
The big performance drop seems to be from r143756 to r143757, i.e. RA changes: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=143757
CCing Vladimir.
Comment 7 Alexander Peslyak 2012-01-04 23:00:24 UTC
(I ran the tests below and wrote this comment before seeing Jakub's.  Then I thought I'd post it anyway.)

Here are some numbers for gcc releases:

4.0.0 - 383K c/s, 71879 bytes (this old version of gcc generates function calls for SSE2 intrinsics)
4.1.0 - 2959K c/s, 28182 bytes
4.1.2 - 2964K c/s, 28365 bytes
4.2.0 - 2968K c/s, 28363 bytes
4.2.4 - 2971K c/s, 28382 bytes
4.3.0 - 2971K c/s, 28229 bytes
4.3.6 - 2959K c/s, 28229 bytes
4.4.0 - 2625K c/s, 29770 bytes
4.4.6 - 2695K c/s, 29316 bytes
4.5.0 - 2729K c/s, 29203 bytes
4.5.3 - 2716K c/s, 29203 bytes
4.6.0 - 2111K c/s, 29624 bytes
4.6.2 - 2123K c/s, 29624 bytes

So things were really good for versions 4.1.0 through 4.3.6, but started to get worse afterwards and got really bad with 4.6.

To be fair, things are very different for some other hash/cipher types
supported by JtR - e.g., for Blowfish-based hashing we went from 560 c/s for
4.1.0 to 700 c/s for 4.6.2.

<plug>JtR 1.7.9 and 1.7.9-jumbo include a benchmark comparison tool called relbench, which calculates geometric mean, median, and some other metrics for multiple individual outputs from a pair of JtR benchmark invocations (e.g., built with different versions of gcc).  In 1.7.9-jumbo-5, there are over 160 individual benchmark outputs (for different hashes/ciphers) and it may be built in a variety of ways (with/without explicit assembly code, with/without intrinsics etc.)  relbench combines those 160+ outputs into a nice summary showing overall speedup/slowdown and more.  It might be useful for testing of future gcc versions for potential performance regressions like this.</plug>
Comment 8 Andrew Pinski 2015-02-09 00:12:45 UTC
Can you try GCC 4.9?
Comment 9 Alexander Peslyak 2015-02-16 00:08:11 UTC
(In reply to Andrew Pinski from comment #8)
> Can you try GCC 4.9?

Yes.  Bad news: things mostly became even worse.  Same machine, same JtR version, same test script as in my previous comment:

4.9.2 - 1849K c/s, 28256 bytes

The code size is back to 4.1.0 to 4.3.6 levels (good), but the performance decreased by another 13% since 4.6.2 (and by 38% since it peaked with 4.3.0).  I ran this benchmark multiple times, and I also re-ran benchmarks with some previous gcc versions to make sure this isn't caused by some change in my environment - no, I am getting consistently poor results for 4.9.2, and the same results as before for other gcc versions.  I'll plan to test with some versions in the range 4.7.0 to 4.9.0 next.

(I also see some much smaller regressions with 4.9.2 for other hash types.)
Comment 10 Alexander Peslyak 2015-02-16 01:10:17 UTC
I decided to take a look at the generated code.  Compared to 4.6.2, GCC 4.9.2 started generating lots of xorps, orps, andps, andnps where it previously generated pxor, por, pand, pandn.  Changing those with:

sed -i 's/xorps/pxor/g; s/orps/por/g; s/andps/pand/g; s/andnps/pandn/g'

made no difference for performance on this machine (still 4.9.2's poor performance).

The next suspects were the varieties of MOV instructions.  In 4.9.2's generated code, there were 1319 movaps and 721 movups; in 4.6.2's, 1258 movaps and 465 movups.  Simply changing all movups to movaps in 4.9.2's original code with sed (with no other changes except this one), resulting in a total of 2040 movaps, brought the performance to levels similar to GCC 4.4 and 4.5's (better than 4.6's, but worse than 4.3's).  So movups appears to be the main culprit.  The same hack on 4.6.2's code brought its performance almost to 4.3's level (still 5% worse, though), and significantly above 4.9.2's (so there's still some other, smaller regression in 4.9.2).

Here are my new results:

4.1.0o - 2960K c/s, 28182 bytes, 1758 movaps, 0 movups
4.3.6o - 2956K c/s, 28229 bytes, 1755 movaps, 0 movups
4.4.6o - 2694K c/s, 29316 bytes, 1709 movaps, 7 movups
4.4.6h - 2714K c/s, 29316 bytes, 1716 movaps, 0 movups
4.5.3o - 2709K c/s, 29203 bytes, 1669 movaps, 0 movups
4.6.2o - 2121K c/s, 29624 bytes, 1258 movaps, 465 movups
4.6.2h - 2817K c/s, 29624 bytes, 1723 movaps, 0 movups
4.9.2o - 1852K c/s, 28256 bytes, 1319 movaps, 721 movups
4.9.2h - 2688K c/s, 28256 bytes, 2040 movaps, 0 movups

"o" means original, "h" means hacked generated assembly code (all movups changed to movaps).  (BTW, there were no movdqa/movdqu in any of these code versions.)

Now I am wondering to what extent this is a GCC issue and to what extent it might be my source code's fault, if GCC is somehow unsure that it can assume alignment.  Under what conditions should GCC in fact use movups?  Is it intentional that newer versions of GCC are more careful about this, resulting in worse performance?
Comment 11 Richard Biener 2015-02-16 10:51:35 UTC
As for movaps vs. movups: when movaps actually works, the choice shouldn't make any difference on modern architectures.  So I wonder if you could share the exact CPU type you are using?

We are putting quite heavy register pressure on this code by means of
partial redundancy elimination, so disabling PRE with -fno-tree-pre
might help (we still spill a lot).

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     103296 c/s real, 103296 c/s virtual
Only one salt:  100736 c/s real, 100736 c/s virtual

improves to

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     126848 c/s real, 126848 c/s virtual
Only one salt:  123008 c/s real, 123008 c/s virtual

with that for me (gcc 4.8, SSE2).  Which is close to what 4.5.3 gets for me:

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     128384 c/s real, 128384 c/s virtual
Only one salt:  124800 c/s real, 124800 c/s virtual

albeit that doesn't need -fno-tree-pre to fix things.

Note that we have to use movups because DES_bs_all is not aligned as seen
from DES_bs_b.c (it's defined in DES_bs.c and only there annotated with
CC_CACHE_ALIGN, not at the point of declaration in DES_bs.h).  So the
unaligned moves are the source's fault.  Annotating the declaration with
CC_CACHE_ALIGN produces the desired movaps instructions (with no effect on performance for me).
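
For illustration, the mismatch looks roughly like this (a minimal sketch; the
struct contents and the CC_CACHE_ALIGN expansion are assumptions, not the
actual JtR sources):

  /* DES_bs.h - the declaration DES_bs_b.c sees: no alignment attribute,
     so the compiler may not assume the object is 16-byte aligned. */
  struct DES_bs_combined {
      unsigned long long B[64][2];              /* hypothetical contents */
  };
  extern struct DES_bs_combined DES_bs_all;

  /* DES_bs.c - only the definition carries the alignment annotation
     (CC_CACHE_ALIGN assumed to expand to an aligned attribute). */
  #define CC_CACHE_ALIGN __attribute__ ((aligned (64)))
  CC_CACHE_ALIGN struct DES_bs_combined DES_bs_all;

  /* Fix: put CC_CACHE_ALIGN on the declaration in DES_bs.h as well, so
     every translation unit can assume the alignment. */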

As for the effect of PRE increasing register pressure, I think we do have some
duplicate bugs (but no good heuristic to fix anything).  LIM store motion can
have the very same issue.
Comment 12 Alexander Peslyak 2015-02-17 02:20:55 UTC
(In reply to Richard Biener from comment #11)
> I wonder if you could share the exact CPU type you are using?

This is on (dual) Xeon E5420 (using only one core for these benchmarks), but there was similar slowdown with GCC 4.6 on other Core 2'ish CPUs as well (such as desktop Core 2 Duo CPUs). You might not call these "modern".

> Note that we have to use movups because [...]

Thank you for looking into this. I still have a question, though: does this mean you're treating older GCC's behavior, where it dared to use movaps anyway, as a bug?

I was under the impression that with most SSE*/AVX* intrinsics (except for those explicitly defined to do unaligned loads/stores), natural alignment is assumed and is supposed to be provided by the programmer. Not only with GCC, but with compilers for x86(-64) in general. I thought this was part of the contract: I use intrinsics and I guarantee alignment. (Things would certainly not work for me, at least with older GCC, if I assumed the compiler would use unaligned loads whenever it was unsure of alignment.) Was I wrong, or has this changed (in GCC? or in some compiler-neutral specification?), or is GCC wrong in not assuming alignment now?
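
For reference, the aligned/unaligned distinction at the intrinsic level looks
roughly like this (a sketch, not taken from the JtR sources):

  #include <emmintrin.h>

  /* _mm_load_si128 asserts 16-byte alignment; _mm_loadu_si128 does not. */
  __m128i load_both(const __m128i *p)
  {
      __m128i a = _mm_load_si128(p);    /* may be emitted as movaps/movdqa */
      __m128i u = _mm_loadu_si128(p);   /* emitted as movups/movdqu */
      return _mm_xor_si128(a, u);
  }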

Is there a command-line option to ask GCC to assume alignment, like it did before?
Comment 13 Alexander Peslyak 2015-02-17 02:55:58 UTC
(In reply to Richard Biener from comment #11)
> We are putting quite heavy register-pressure on the thing by means of
> partial redundancy elimination, thus disabling PRE using -fno-tree-pre
> might help (we still spill a lot).

It looks like -fno-tree-pre or equivalent was implied in the options I was using, which were "-O2 -fomit-frame-pointer -Os -funroll-loops -finline-functions" - yes, with -Os added after -O2 when compiling this specific source file.  IIRC, this was experimentally derived as producing best performance with 4.6.x or older.  Adding -fno-tree-pre after all of these options merely changes the label names in the generated assembly code, while resulting in identical object files (and obviously no performance change).  Also, I now realize -Os was probably the reason why GCC preferred SSE "floating-point" bitwise ops and MOVs here, instead of SSE2's integer ones (they have longer encodings). Omitting -Os results in usage of the SSE2 instructions (both bitwise and MOVs), with correspondingly larger code. And yes, when I omit -Os, I do need to add -fno-tree-pre to regain roughly the same performance, and then to s/movdqu/movdqa/g to regain almost the full speed (movdqu is just as slow as movups on this CPU). I've just tested all of this with GCC 4.8.4 to possibly match yours (you mentioned you used 4.8). So I think you uncovered yet another performance regression I had already worked around with -Os.

FWIW, here are the generated assembly code sizes ("wc" output) with GCC 4.8.4:

-O2 -fomit-frame-pointer -Os -funroll-loops -finline-functions
  5870  17420 137636 1.s
-O2 -fomit-frame-pointer -Os -funroll-loops -finline-functions -fno-tree-pre
  5870  17420 137636 2.s
-O2 -fomit-frame-pointer -funroll-loops -finline-functions
  6814  20193 156837 a.s
-O2 -fomit-frame-pointer -funroll-loops -finline-functions -fno-tree-pre
  6028  17842 138284 b.s

As you can see, -fno-tree-pre reduces the size almost to the -Os level. (But the .text size would be significantly larger because of the SSE2 instruction encodings.  This is why I show the assembly code sizes for this comparison.)
Comment 14 Alexander Peslyak 2015-02-17 03:11:11 UTC
For completeness, here are the results for 4.7.x, 4.8.x, and 4.9.0:

4.7.0o - 2142K c/s, 29692 bytes, 1267 movaps, 465 movups
4.7.0h - 2823K c/s, 29692 bytes, 1732 movaps, 0 movups
4.7.4o - 2144K c/s, 29692 bytes, 1267 movaps, 465 movups
4.7.4h - 2827K c/s, 29692 bytes, 1732 movaps, 0 movups
4.8.0o - 1825K c/s, 27813 bytes, 1341 movaps, 721 movups
4.8.0h - 2792K c/s, 27813 bytes, 2062 movaps, 0 movups
4.8.4o - 1827K c/s, 27807 bytes, 1341 movaps, 721 movups
4.8.4h - 2786K c/s, 27807 bytes, 2062 movaps, 0 movups
4.9.0o - 1852K c/s, 28262 bytes, 1319 movaps, 721 movups
4.9.0h - 2685K c/s, 28262 bytes, 2040 movaps, 0 movups

4.8 produces the smallest code so far, but even with the aligned-loads hack it is still 6% slower than 4.3.

All of these are with "-O2 -fomit-frame-pointer -Os -funroll-loops -finline-functions", like similar results I had posted before.  Xeon E5420, x86_64.
Comment 15 Richard Biener 2015-02-17 09:25:39 UTC
(In reply to Alexander Peslyak from comment #12)
> (In reply to Richard Biener from comment #11)
> > I wonder if you could share the exact CPU type you are using?
> 
> This is on (dual) Xeon E5420 (using only one core for these benchmarks), but
> there was similar slowdown with GCC 4.6 on other Core 2'ish CPUs as well
> (such as desktop Core 2 Duo CPUs). You might not call these "modern".
> 
> > Note that we have to use movups because [...]
> 
> Thank you for looking into this. I still have a question, though: does this
> mean you're treating older GCC's behavior, where it dared to use movaps
> anyway, a bug?

If you used intrinsics for aligned loads then no.

> I was under impression that with most SSE*/AVX* intrinsics (except for those
> explicitly defined to do unaligned loads/stores) natural alignment is
> assumed and is supposed to be provided by the programmer. Not only with GCC,
> but with compilers for x86(-64) in general. I thought this was part of the
> contract: I use intrinsics and I guarantee alignment. (Things would
> certainly not work for me at least with older GCC if I assumed the compiler
> would use unaligned loads whenever it was unsure of alignment.) Was I wrong,
> or has this changed (in GCC? or in some compiler-neutral specification?), or
> is GCC wrong in not assuming alignment now?

GCC was changed to be more permissive to broken programs and also intrinsics
were changed to map to plain C code in some cases (thus they are not visible
as intrinsics to the compiler).

> Is there a command-line option to ask GCC to assume alignment, like it did
> before?

No.
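
The alternatives are source-level; a rough sketch (the object name and layout
are made up, and this is an illustration rather than the actual JtR fix):

  /* 1. Put the alignment on the declaration every translation unit sees: */
  struct state { unsigned long long v[64][2]; };        /* hypothetical */
  extern struct state DES_bs_all __attribute__ ((aligned (64)));

  /* 2. Or, from GCC 4.7 on, promise the alignment at the point of use: */
  static inline struct state *aligned_state(void)
  {
      return __builtin_assume_aligned(&DES_bs_all, 16);
  }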
Comment 16 Richard Biener 2015-02-17 09:27:10 UTC
(In reply to Alexander Peslyak from comment #14)
> For completeness, here are the results for 4.7.x, 4.8.x, and 4.9.0:
> 
> 4.7.0o - 2142K c/s, 29692 bytes, 1267 movaps, 465 movups
> 4.7.0h - 2823K c/s, 29692 bytes, 1732 movaps, 0 movups
> 4.7.4o - 2144K c/s, 29692 bytes, 1267 movaps, 465 movups
> 4.7.4h - 2827K c/s, 29692 bytes, 1732 movaps, 0 movups
> 4.8.0o - 1825K c/s, 27813 bytes, 1341 movaps, 721 movups
> 4.8.0h - 2792K c/s, 27813 bytes, 2062 movaps, 0 movups
> 4.8.4o - 1827K c/s, 27807 bytes, 1341 movaps, 721 movups
> 4.8.4h - 2786K c/s, 27807 bytes, 2062 movaps, 0 movups
> 4.9.0o - 1852K c/s, 28262 bytes, 1319 movaps, 721 movups
> 4.9.0h - 2685K c/s, 28262 bytes, 2040 movaps, 0 movups
> 
> 4.8 produces the smallest code so far, but even with the aligned loads hack
> is still 6% slower than 4.3.
> 
> All of these are with "-O2 -fomit-frame-pointer -Os -funroll-loops
> -finline-functions", like similar results I had posted before.  Xeon E5420,
> x86_64.

I'm completely confused now as to what the original regression was reported
against.  I thought it was the default options in the Makefile, -O2 -fomit-frame-pointer, which showed the regression, and that you found -Os would mitigate it somewhat (and I more specifically told you it is -fno-tree-pre that makes the actual difference).

So - what options give good results with old compilers but bad results with new compilers?
Comment 17 Alexander Peslyak 2015-02-18 00:03:37 UTC
(In reply to Richard Biener from comment #16)
> I'm completely confused now as to what the original regression was reported
> against.

I'm sorry, I should have re-read my original description of the regression before I wrote comment 13.  Together, these are indeed confusing.

> I thought it was the default options in the Makefile, -O2
> -fomit-frame-pointer, which showed the regression and you found -Os would
> mitigate it somewhat (and I more specifically told you it is -fno-tree-pre
> that makes the actual difference).

That's one of the regressions I mentioned in the original description.  Yes, you identified -fno-tree-pre as the component of -Os that makes the difference - Thank You!  However, I also mentioned in the original description that a bigger regression with 4.6+ vs. 4.5 and 4.4 remained despite -Os, and I had no similar workaround for it at the time (but enabling -fopenmp made it go away, perhaps due to changes to declarations in the source code in #ifdef _OPENMP blocks).  I think we can now say that this bigger 4.6+ regression was primarily caused by the unaligned load instructions.  So two regressions are figured out, and the remaining slowdown (not investigated yet) vs. 4.1 to 4.3 (which worked best) is only 6% to 10% in recent versions (9% in 4.9.2).

> So - what options give good results with old compilers but bad results with
> new compilers?

On CPUs where movups/movdqu are slower than their aligned counterparts (for addresses that happen to be aligned), any sane optimization options of 4.6+ give bad results as compared to pre-4.6 with the same options.  As you say, this can be fixed in the source code (and I most likely will fix it there), but I think many other programs may experience similar slowdowns, so maybe GCC should do something about this too.

Other than that, either -Os or -fno-tree-pre works around the second worst slowdown seen in 4.6+.

To avoid confusion, maybe this bug should focus on one of the three regressions?  Should we keep it for PRE only?

Should we create a new bug for the unnecessary and non-optional use of unaligned load instructions for source code like this, or is this considered the new intended behavior despite the major slowdown on such CPUs?  (Presumably not only for JtR.  I'd expect this to affect many programs.)

Should we also create a bug for investigating the remaining slowdown of 9% in 4.9.2 (vs. 4.1 to 4.3), or is it considered too minor to bother?

Thank you!
Comment 18 Alexander Peslyak 2015-02-18 01:25:23 UTC
(In reply to Richard Biener from comment #11)
> Note that we have to use movups because DES_bs_all is not aligned as seen
> from DES_bs_b.c (it's defined in DES_bs.c and only there annotated with
> CC_CACHE_ALIGN, not at the point of declaration in DES_bs.h).  So the
> unaligned moves are the sources fault.  Annotating that with CC_CACHE_ALIGN
> produces the desired movaps instructions

Confirmed also with GCC 4.9.2 on JtR 1.8.0's version of the code.

> (with no effect on performance for me).

... with the expected performance improvement for me.  I'll commit this fix.  Thanks again!
Comment 19 Alexander Peslyak 2015-02-18 03:19:52 UTC
(In reply to Alexander Peslyak from comment #17)
> Should we create a new bug for the unnecessary and non-optional use of
> unaligned load instructions for source code like this, or is this considered
> the new intended behavior despite the major slowdown on such CPUs? 
> (Presumably not only for JtR.  I'd expect this to affect many programs.)

Upon further analysis, I now think that this was my fault, and (presumably) not common in other programs.  What I had was differing definition vs. declaration, so a bug.  The lack of alignment specification in the declaration of the struct essentially told (newer) GCC not to assume alignment - to an extent greater than e.g. a pointer would.  As far as I can tell, GCC does not currently produce unaligned load instructions (so assumes that SSE* vectors are properly aligned) when all it has is a pointer coming from another object file.  I think that's the common scenario, whereas mine was uncommon (and incorrect).

So let's focus on PRE only.
Comment 20 Richard Biener 2015-02-18 10:32:31 UTC
(In reply to Alexander Peslyak from comment #19)
> (In reply to Alexander Peslyak from comment #17)
> > Should we create a new bug for the unnecessary and non-optional use of
> > unaligned load instructions for source code like this, or is this considered
> > the new intended behavior despite the major slowdown on such CPUs? 
> > (Presumably not only for JtR.  I'd expect this to affect many programs.)
> 
> Upon further analysis, I now think that this was my fault, and (presumably)
> not common in other programs.  What I had was differing definition vs.
> declaration, so a bug.  The lack of alignment specification in the
> declaration of the struct essentially told (newer) GCC not to assume
> alignment - to an extent greater than e.g. a pointer would.  As far as I can
> tell, GCC does not currently produce unaligned load instructions (so assumes
> that SSE* vectors are properly aligned) when all it has is a pointer coming
> from another object file.  I think that's the common scenario, whereas mine
> was uncommon (and incorrect).

Yes.  Note that we are trying to be more forgiving to users here and do not
exploit undefined behavior fully.

> So let's focus on PRE only.

Ok.  There are related bugreports for that I think.
Comment 21 Richard Biener 2015-02-18 11:09:34 UTC
We do already inhibit creating loop-carried dependencies of some kind, but only
when vectorization is enabled (because it can inhibit vectorization).  But we
still PRE invariant loads:

Replaced MEM[(vtype * {ref-all})&DES_bs_all + 20528B] with prephitmp_2898 in all uses of _1195 = MEM[(vtype * {ref-all})&DES_bs_all + 20528B] because we know
it's {0, 0} on entry.  Note that store motion doesn't apply here because
those stores are said to alias with the MEM[(vtype * {ref-all})k_2 + 848B]
kinds (iterating DES_bs_all.KS.v - unfortunately field-sensitive points-to analysis doesn't help here as the points-to result itself isn't field-sensitive).
Of course, without store motion applying, this kind of PRE is not really useful.
If store motion applied, it would create the same kind of problem, of course
(in this case up to 0x300(?) live registers).

One possible solution is to simply avoid this kind of "partial" store motion,
that is, converting

  for (;;)
    reg = MEM;
    MEM = fn(reg);

to

  reg = MEM;
  for (;;)
    reg = fn(reg);
    MEM = reg;

Of course, this is also a profitable transform by itself.  Thus the solution
might instead be to limit register pressure in some way by assigning costs
to individual transforms.  At least it seems to be too difficult for
the register allocator to rematerialize 'reg' from MEM (as it would also
need to perform sophisticated analysis to determine that, basically
undoing the PRE transform).
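
To make the register-pressure angle concrete, here is a compilable C-level
rendering of the transform sketched above with several memory cells (all
names are made up):

  typedef unsigned long long vtype;

  static vtype fn(vtype x) { return x * 0x9e3779b97f4a7c15ULL + 1; }

  /* Before PRE/store motion: each statement loads, computes and stores. */
  void loop_before(vtype B[64], int n)
  {
      int r;
      for (r = 0; r < n; r++) {
          B[0] = fn(B[0]);
          B[1] = fn(B[1]);
          /* ... dozens more, fully unrolled in the real DES code ... */
      }
  }

  /* After the transform above: loads hoisted, stores sunk, so all the
     values must stay live in registers across the loop - far more than
     x86-64 provides, hence the spilling. */
  void loop_after(vtype B[64], int n)
  {
      int r;
      vtype r0 = B[0], r1 = B[1];        /* imagine r0..r63 */
      for (r = 0; r < n; r++) {
          r0 = fn(r0);
          r1 = fn(r1);
      }
      B[0] = r0;
      B[1] = r1;
  }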
Comment 22 Richard Biener 2015-06-23 08:13:40 UTC
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
Comment 23 Jakub Jelinek 2015-06-26 19:56:14 UTC
GCC 4.9.3 has been released.
Comment 24 Richard Biener 2016-03-04 13:00:55 UTC
GCC 4.5 vs GCC 5 still shows GCC 4.5 is faster almost everywhere

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE      Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     3636K c/s real, 3636K c/s virtual             | Many salts:     3488K c/s real, 3488K c/s virtual
Only one salt:  3047K c/s real, 3047K c/s virtual             | Only one salt:  2896K c/s real, 2896K c/s virtual

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE      Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     127360 c/s real, 127360 c/s virtual           | Many salts:     108800 c/s real, 108800 c/s virtual
Only one salt:  124288 c/s real, 123057 c/s virtual           | Only one salt:  106112 c/s real, 106112 c/s virtual

Benchmarking: FreeBSD MD5 [32/64 X2]... DONE                    Benchmarking: FreeBSD MD5 [32/64 X2]... DONE
Raw:    15392 c/s real, 15392 c/s virtual                     | Raw:    15936 c/s real, 15936 c/s virtual

Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... DONE         Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... DONE
Raw:    900 c/s real, 900 c/s virtual                         | Raw:    892 c/s real, 892 c/s virtual

Benchmarking: Kerberos AFS DES [48/64 4K]... DONE               Benchmarking: Kerberos AFS DES [48/64 4K]... DONE
Short:  478208 c/s real, 473473 c/s virtual                   | Short:  476672 c/s real, 476672 c/s virtual
Long:   1470K c/s real, 1470K c/s virtual                     | Long:   1473K c/s real, 1473K c/s virtual

Benchmarking: LM DES [128/128 BS SSE2-16]... DONE               Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
Raw:    16977K c/s real, 16977K c/s virtual                   | Raw:    14971K c/s real, 14971K c/s virtual

Benchmarking: generic crypt(3) [?/64]... DONE                   Benchmarking: generic crypt(3) [?/64]... DONE
Many salts:     362784 c/s real, 362784 c/s virtual           | Many salts:     296352 c/s real, 296352 c/s virtual
Only one salt:  361728 c/s real, 361728 c/s virtual           | Only one salt:  292182 c/s real, 295104 c/s virtual

Benchmarking: dummy [N/A]... DONE                               Benchmarking: dummy [N/A]... DONE
Raw:    60157K c/s real, 60157K c/s virtual                   | Raw:    53849K c/s real, 53316K c/s virtual


GCC 5 vs. GCC 6 shows some progress (and some small regressions), but not for
BSDI DES.

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE      Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     3488K c/s real, 3488K c/s virtual             | Many salts:     3446K c/s real, 3446K c/s virtual
Only one salt:  2896K c/s real, 2896K c/s virtual             | Only one salt:  2895K c/s real, 2895K c/s virtual

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE      Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     108800 c/s real, 108800 c/s virtual           | Many salts:     104934 c/s real, 105984 c/s virtual
Only one salt:  106112 c/s real, 106112 c/s virtual           | Only one salt:  103040 c/s real, 103040 c/s virtual

Benchmarking: FreeBSD MD5 [32/64 X2]... DONE                    Benchmarking: FreeBSD MD5 [32/64 X2]... DONE
Raw:    15936 c/s real, 15936 c/s virtual                     | Raw:    15864 c/s real, 15864 c/s virtual

Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... DONE         Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... DONE
Raw:    892 c/s real, 892 c/s virtual                         | Raw:    916 c/s real, 916 c/s virtual

Benchmarking: Kerberos AFS DES [48/64 4K]... DONE               Benchmarking: Kerberos AFS DES [48/64 4K]... DONE
Short:  476672 c/s real, 476672 c/s virtual                   | Short:  471808 c/s real, 471808 c/s virtual
Long:   1473K c/s real, 1473K c/s virtual                     | Long:   1449K c/s real, 1449K c/s virtual

Benchmarking: LM DES [128/128 BS SSE2-16]... DONE               Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
Raw:    14971K c/s real, 14971K c/s virtual                   | Raw:    15917K c/s real, 15917K c/s virtual

Benchmarking: generic crypt(3) [?/64]... DONE                   Benchmarking: generic crypt(3) [?/64]... DONE
Many salts:     296352 c/s real, 296352 c/s virtual           | Many salts:     348096 c/s real, 348096 c/s virtual
Only one salt:  292182 c/s real, 295104 c/s virtual           | Only one salt:  347616 c/s real, 347616 c/s virtual

Benchmarking: dummy [N/A]... DONE                               Benchmarking: dummy [N/A]... DONE
Raw:    53849K c/s real, 53316K c/s virtual                   | Raw:    60114K c/s real, 60114K c/s virtual


Note that -fno-tree-pre no longer helps.  With GCC 5/6 most intrinsics use
a generic implementation and are thus transparent to the GIMPLE middle-end,
apart from __builtin_ia32_pandn128, which is used by _mm_andnot_si128.
What helps is -fno-tree-loop-im in addition to -fno-tree-pre, so the
underlying issue still seems to be register pressure, and it is not really
the loop-carried stuff we introduce but the excessive invariant motion.

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     118528 c/s real, 117354 c/s virtual
Only one salt:  114944 c/s real, 114944 c/s virtual

        movq    DES_bs_all+18632(%rip), %rdi
        movq    DES_bs_all+18624(%rip), %rcx
        movq    DES_bs_all+18712(%rip), %rbp
        movq    DES_bs_all+18696(%rip), %r9
        movq    DES_bs_all+18688(%rip), %r10
        movq    DES_bs_all+18680(%rip), %r11
        movq    %rdi, 624(%rsp)
        movq    %rcx, 616(%rsp)
        movq    %rbp, 320(%rsp)

etc. - of course it's quite stupid to load something and then spill it immediately...
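
To illustrate the excessive invariant motion in C terms (a sketch only; the
names and counts are made up):

  typedef unsigned long long vtype;
  extern vtype KS[768];                   /* hypothetical key schedule */

  /* Source shape: the KS[] loads are loop-invariant. */
  void rounds_before(vtype B[64], int n)
  {
      int r;
      for (r = 0; r < n; r++) {
          B[0] ^= KS[0];
          B[1] ^= KS[1];
          /* ... hundreds more constant-offset loads in the unrolled code ... */
      }
  }

  /* What aggressive invariant motion amounts to: every hoisted value must
     stay live across the loop, and with only 16 vector registers most of
     them are spilled straight back to the stack, as in the movq sequence
     quoted above. */
  void rounds_after(vtype B[64], int n)
  {
      int r;
      vtype k0 = KS[0], k1 = KS[1];       /* imagine hundreds of these */
      for (r = 0; r < n; r++) {
          B[0] ^= k0;
          B[1] ^= k1;
      }
  }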

We're also back to all unaligned loads/stores.
Comment 25 Richard Biener 2016-08-03 10:55:07 UTC
GCC 4.9 branch is being closed
Comment 26 Jakub Jelinek 2017-10-10 13:24:57 UTC
GCC 5 branch is being closed
Comment 27 Jakub Jelinek 2018-10-26 10:07:00 UTC
GCC 6 branch is being closed
Comment 28 Richard Biener 2019-11-14 07:54:13 UTC
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
Comment 29 Jakub Jelinek 2020-03-04 09:35:40 UTC
GCC 8.4.0 has been released, adjusting target milestone.
Comment 30 Jakub Jelinek 2021-05-14 09:46:32 UTC
GCC 8 branch is being closed.
Comment 31 Richard Biener 2021-06-01 08:05:32 UTC
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
Comment 32 Richard Biener 2022-05-27 09:34:29 UTC
GCC 9 branch is being closed
Comment 33 Jakub Jelinek 2022-06-28 10:30:10 UTC
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
Comment 34 Richard Biener 2023-07-07 10:29:38 UTC
GCC 10 branch is being closed.