Created attachment 21938
The attached .c file contains two functions, which (unless I screwed up) compute exactly the same (mathematical) function - they take an array of 8 bytes, permute its elements, and stuff them into a 64-bit integer, which is then returned. However, GCC generates very different code for each (on x86-64). There seem to be two missed optimization opportunities here:
1) I don't know *which* of the two code generation possibilities here is better, but it seems like GCC ought to know and ought to generate that code for both functions.
2) Could we be taking advantage of SSEn vector permute instructions here?
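For readers without the attachment to hand, here is a minimal sketch of the kind of pair being described. The permutation (swap adjacent bytes) and the function names are made up for illustration and are not taken from the attached file; the only point is that both functions compute the same result while being written very differently.

```c
#include <stdint.h>

/* Hypothetical permutation, NOT the one in the attachment:
   result byte i (counting from the least significant byte) is in[i ^ 1]. */

/* Variant 1: one explicit shift-and-or expression. */
uint64_t permute_shifts(const uint8_t in[8])
{
    return  (uint64_t)in[1]
         | ((uint64_t)in[0] <<  8)
         | ((uint64_t)in[3] << 16)
         | ((uint64_t)in[2] << 24)
         | ((uint64_t)in[5] << 32)
         | ((uint64_t)in[4] << 40)
         | ((uint64_t)in[7] << 48)
         | ((uint64_t)in[6] << 56);
}

/* Variant 2: the same mapping, driven by an index table and a loop. */
uint64_t permute_loop(const uint8_t in[8])
{
    static const uint8_t perm[8] = { 1, 0, 3, 2, 5, 4, 7, 6 };
    uint64_t r = 0;

    /* Walk from the most significant result byte down to the least. */
    for (int i = 7; i >= 0; i--)
        r = (r << 8) | in[perm[i]];
    return r;
}
```

Comparing the output of `gcc -O2 -S` for the two functions is the quickest way to see whether a given compiler version treats them alike.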
Created attachment 21939
assembly for test case with gcc 4.5.1 on x86-64
We do at least have some infrastructure for this kind of optimization
(the bswap pass), but it needs to be taught about memory sources/destinations
and SSE shuffles.
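To make the distinction concrete, here is a rough sketch of the two shapes involved. The function names are hypothetical, and exactly which forms the pass recognizes in a given GCC release is an assumption on my part: the first function is the register-to-register byte swap idiom that a bswap-style pass targets, the second builds the value byte by byte from memory, which is the memory-source case mentioned above.

```c
#include <stdint.h>

/* Register source: a full 64-bit byte reversal spelled out with masks
   and shifts. This is the shape a bswap-recognition pass is aimed at. */
uint64_t swap_manual(uint64_t x)
{
    return ((x & 0x00000000000000ffULL) << 56)
         | ((x & 0x000000000000ff00ULL) << 40)
         | ((x & 0x0000000000ff0000ULL) << 24)
         | ((x & 0x00000000ff000000ULL) <<  8)
         | ((x & 0x000000ff00000000ULL) >>  8)
         | ((x & 0x0000ff0000000000ULL) >> 24)
         | ((x & 0x00ff000000000000ULL) >> 40)
         | ((x & 0xff00000000000000ULL) >> 56);
}

/* Memory source: the bytes come from individual loads rather than from
   one value already in a register. in[0] ends up most significant. */
uint64_t load_reversed(const uint8_t in[8])
{
    return ((uint64_t)in[0] << 56)
         | ((uint64_t)in[1] << 48)
         | ((uint64_t)in[2] << 40)
         | ((uint64_t)in[3] << 32)
         | ((uint64_t)in[4] << 24)
         | ((uint64_t)in[5] << 16)
         | ((uint64_t)in[6] <<  8)
         |  (uint64_t)in[7];
}
```

On a little-endian target such as x86-64, the second function is equivalent to one 8-byte load followed by a byte swap, so ideally both would end up as a single bswap instruction.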