Consider this source code:

#include <stdint.h>

uint64_t ReadLittleEndian64(uint64_t word) {
  const unsigned char* const ptr = reinterpret_cast<const unsigned char*>(&word);
  return uint64_t{ptr[0]} | (uint64_t{ptr[1]} << 8) |
         (uint64_t{ptr[2]} << (8 * 2)) | (uint64_t{ptr[3]} << (8 * 3)) |
         (uint64_t{ptr[4]} << (8 * 4)) | (uint64_t{ptr[5]} << (8 * 5)) |
         (uint64_t{ptr[6]} << (8 * 6)) | (uint64_t{ptr[7]} << (8 * 7));
}

gcc-6.3 generates good machine code on x86-64 with -O2:

ReadLittleEndian64(unsigned long):
        mov     rax, rdi
        ret

gcc-7.1 leaves one byte out of the optimization:

ReadLittleEndian64(unsigned long):
        movabs  rax, 72057594037927935
        and     rax, rdi
        shr     rdi, 56
        sal     rdi, 56
        or      rax, rdi
        ret

Proof: https://gcc.godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAKxAEZSBnVAV2OUxAHIBSAJgGY8AO2QAbZlgDU3fgGEGBfEIIA6BDOzcADAEFtO5sIIA2ACwB9ApIBKmAIboAMoQKicQ/HaFmIh5WctJAHcSdABKaQB2ACF9SUk0IQVJZiS8YCFMdASEO2IAKgTUJKsABwJiaX4AETj4%2BuJMI0xiUsaCc2Q7BRlZROTUhnTM7ORcgo0IPmMQ4nCZWN149tYhFKMAgm4Y8uJuAFZorQPayOqo2Tr6698Niy2dioPo2hPt857eyQAOCO3LpbXG5%2BEz3bbRXbPXhvM5VS5ySQQb6SQq8MJ/SIAnRA4F3SzgyGHfgwj5yL5IlGSfjoi5XHG3fxgx57Q6mElw8nIwqmGn/OlAhmg/HM577dmfBEUwr7XmY/m4xnCiFPQ7GcVkyVcyTGWVYnH1QWbAkq6KRdXw2SIrWRdELfTvfScMKkURcfacUhCLhaD2oLiXXixQOSJisdjSAS0D0Eb1O50AaxA/H4KmTafTGeMLq4pg9Xs4PtIfs4HoYIC0pBjBadpDgsCQaAAtqU8G4yBQIE2W22QMBjPxSAAzVsEFpliAAI1jpCbjcwygA8kJRABPaf4RrIAh4ABumDL1fIykwrurztQ5TwxQPAFoF7xSyw2BxaM7T%2B7PdPiwAPb7GG9mJIwDIMg2qpoiuCECQEb8PQkiyKgzati0MGvtGsboqQibJqmGZ4cmWanrmn6HsWpZllWGG1jAiAoIh3YtOQlBdshxC9pE9DDqIo7EOOU6HrO84EEuq7rngm7bnuB6Fs0p4%2Buel7Xlwd4PowT7sHQb5unmX5cL%2B/6AcBoGRCotAQfgRCVHwsGkPB9GsTBaLoWeCZJim%2BH4dmnDEfmhZkYwFFVvJWmcKpvm%2BlwznBaQe68VeXqmEAA%3D%3D
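For reference, on a little-endian target the function is a byte-for-byte identity, which is why a single register move is the expected output. A minimal self-check of that (a hypothetical harness, not part of the report, assuming a little-endian host):

#include <cassert>
#include <cstdint>
#include <cstring>

int main() {
  // On a little-endian host, reading the bytes of `word` in memory order
  // and reassembling them LSB-first reproduces the value exactly, so
  // ReadLittleEndian64 should fold to a single move.
  uint64_t word = 0x0123456789abcdefULL;
  unsigned char bytes[8];
  std::memcpy(bytes, &word, 8);  // same bytes ReadLittleEndian64 reads via ptr[i]
  uint64_t ret = 0;
  for (int i = 0; i < 8; ++i)
    ret |= uint64_t{bytes[i]} << (8 * i);
  assert(ret == word);
}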
Started with r239778.
bswap was happy dealing with

  _2 = MEM[(const unsigned char *)&word];
  _3 = (uint64_t) _2;
  _4 = MEM[(const unsigned char *)&word + 1B];
  _5 = (uint64_t) _4;
  _6 = _5 << 8;
  _8 = MEM[(const unsigned char *)&word + 2B];
  _9 = (uint64_t) _8;
  _10 = _9 << 16;
  _32 = _6 | _10;
  _11 = _3 | _32;

etc., but has trouble with

  _21 = word_31(D) & 255;
  _1 = BIT_FIELD_REF <word_31(D), 8, 8>;
  _23 = (uint64_t) _1;
  _2 = _23 << 8;
  _4 = BIT_FIELD_REF <word_31(D), 8, 16>;
  _24 = (uint64_t) _4;
  _5 = _24 << 16;
  _32 = _2 | _5;
  _6 = _21 | _32;
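For readers of the dump: BIT_FIELD_REF <word_31(D), 8, 16> extracts the 8 bits of word at bit offset 16, so each triple above corresponds roughly to the following at the source level (an illustrative C++ equivalent for a little-endian target, not GCC internals):

#include <cstdint>

// Rough source-level picture of
//   _4  = BIT_FIELD_REF <word_31(D), 8, 16>;
//   _24 = (uint64_t) _4;
//   _5  = _24 << 16;
// i.e. extract byte 2, widen, and shift it back into place -- which has
// the same value as (word & 0xff0000).
uint64_t byte2_in_place(uint64_t word) {
  uint8_t b = static_cast<uint8_t>(word >> 16);  // BIT_FIELD_REF <word, 8, 16>
  return uint64_t{b} << 16;                      // widen and shift back
}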
I'll have a look at that tomorrow.
Created attachment 41749 [details]
gcc8-pr81396.patch

While the bswap pass finds out this is a nop reshuffling, it hits:

  /* Useless bit manipulation performed by code.  */
  if (!n->base_addr && n->n == cmpnop)
    return NULL;

and doesn't do anything (while in GCC 6 it was less optimized, so n->base_addr was non-NULL and the bswap pass did something with it).

Either we can do something in the bswap pass with it as done in this untested patch, or we could consider a match.pd optimization that turns:
  _3 = BIT_FIELD_REF <_7, 8, 16>;
  _32 = (typeof(_7)) _3;
  _4 = _32 << 16;
into:
  _4 = _7 & (((1ULL << 8) - 1) << 16);
if the shift amount matches the bit position; then the rest of match.pd would be able to figure it out, like it already optimizes at forwprop1 time e.g.:

typedef unsigned long long uint64_t;

uint64_t ReadLittleEndian64(uint64_t word) {
  uint64_t ret = 0;
  ret |= (word & 0xff);
  ret |= (((word & 0xff00) >> 8) << 8);
  ret |= (((word & 0xff0000) >> 16) << 16);
  ret |= (((word & 0xff000000U) >> 24) << 24);
  ret |= (((word & 0xff00000000ULL) >> 32) << 32);
  ret |= (((word & 0xff0000000000ULL) >> 40) << 40);
  ret |= (((word & 0xff000000000000ULL) >> 48) << 48);
  ret |= (((word & 0xff00000000000000ULL) >> 56) << 56);
  return ret;
}
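The proposed rewrite rests on a simple identity: extracting 8 bits at bit position k, widening, and shifting back by k yields the same value as masking with 0xff << k. A quick self-contained check of that identity (an illustrative test, not part of the patch):

#include <cassert>
#include <cstdint>

int main() {
  const uint64_t word = 0x0123456789abcdefULL;
  for (int k = 0; k < 64; k += 8) {
    // The "BIT_FIELD_REF + convert + shift" form ...
    uint64_t via_extract = uint64_t{static_cast<uint8_t>(word >> k)} << k;
    // ... versus the mask form the match.pd rule would produce.
    uint64_t via_mask = word & (uint64_t{0xff} << k);
    assert(via_extract == via_mask);
  }
}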
Or both this bswap change and the match.pd addition.
(In reply to Jakub Jelinek from comment #5)
> Or both this bswap change and the match.pd addition.

Doing both sounds good to me :-)
Author: jakub
Date: Mon Jul 17 08:14:16 2017
New Revision: 250257

URL: https://gcc.gnu.org/viewcvs?rev=250257&root=gcc&view=rev
Log:
	PR tree-optimization/81396
	* tree-ssa-math-opts.c (struct symbolic_number): Add n_ops field.
	(init_symbolic_number): Initialize it to 1.
	(perform_symbolic_merge): Add n_ops from both operands into the
	new n_ops.
	(find_bswap_or_nop): Don't consider n->n == cmpnop computations
	without base_addr as useless if they need more than one operation.
	(bswap_replace): Handle !bswap case for NULL base_addr.

	* gcc.dg/tree-ssa/pr81396.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr81396.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-math-opts.c
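In outline (paraphrasing the ChangeLog above, not quoting the committed code): symbolic_number now counts how many statements fed the recognized pattern, and a no-op reshuffle with no memory source is only dismissed as useless when it was already a single operation. A toy model of that decision:

#include <cassert>

// Toy model of the changed heuristic (illustrative only; the real logic
// lives in tree-ssa-math-opts.c).  A chain that computes a byte identity
// ("cmpnop") but has no memory source is only useless to rewrite when it
// consists of a single operation; longer chains get replaced by a move.
static bool worth_replacing_nop(bool has_base_addr, unsigned n_ops) {
  if (!has_base_addr && n_ops == 1)
    return false;  // the pre-existing "useless bit manipulation" bail-out
  return true;
}

int main() {
  assert(!worth_replacing_nop(false, 1));  // trivial identity: leave alone
  assert(worth_replacing_nop(false, 24));  // long chain as in this PR: rewrite
  assert(worth_replacing_nop(true, 1));    // memory source: handled as before
}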
Fixed for 8.1+. It is a new optimization, so I think it shouldn't be backported to 7.x.
Should we open a separate PR for the transformation you suggested in comment 4, or does that not seem useful enough now, or will it be part of bitfield gimple lowering when that lands?
(In reply to Jakub Jelinek from comment #4)
> Either we can do something in the bswap pass with it as done in this
> untested patch, or we could consider match.pd optimization for:
> _3 = BIT_FIELD_REF <_7, 8, 16>;
> _32 = (typeof(_7)) _3;
> _4 = _32 << 16;
> into:
> _4 = _7 & (((1ULL << 8) - 1) << 16);

I am implementing the above: while adding more detection of bit_insert I found that bswap does not understand bit_insert, plus we can optimize this earlier.
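For context, a rough source-level picture of what a GIMPLE bit_insert does for an 8-bit value (an illustrative sketch, not GCC internals), which is the form the bswap pass currently fails to recognize:

#include <cstdint>

// Illustrative model of BIT_INSERT_EXPR <dest, val, pos> for an 8-bit
// value: replace bits [pos, pos+8) of dest with val, leaving the rest.
uint64_t bit_insert8(uint64_t dest, uint8_t val, unsigned pos) {
  const uint64_t mask = uint64_t{0xff} << pos;
  return (dest & ~mask) | (uint64_t{val} << pos);
}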