Bug 81396 - [7/8 Regression] Optimization of reading Little-Endian 64-bit number with portable code has a regression
Summary: [7/8 Regression] Optimization of reading Little-Endian 64-bit number with portable code has a regression
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 7.1.0
Importance: P3 normal
Target Milestone: 7.2
Assignee: Jakub Jelinek
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-07-11 13:36 UTC by Marcin Kowalczyk
Modified: 2019-12-24 22:28 UTC
CC: 1 user

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-07-11 00:00:00


Attachments
gcc8-pr81396.patch (1.42 KB, patch)
2017-07-13 17:38 UTC, Jakub Jelinek

Description Marcin Kowalczyk 2017-07-11 13:36:22 UTC
Consider this source code:

#include <stdint.h>

uint64_t ReadLittleEndian64(uint64_t word) {
  const unsigned char* const ptr =
      reinterpret_cast<const unsigned char*>(&word);
  return uint64_t{ptr[0]} |
         (uint64_t{ptr[1]} << 8) |
         (uint64_t{ptr[2]} << (8 * 2)) |
         (uint64_t{ptr[3]} << (8 * 3)) |
         (uint64_t{ptr[4]} << (8 * 4)) |
         (uint64_t{ptr[5]} << (8 * 5)) |
         (uint64_t{ptr[6]} << (8 * 6)) |
         (uint64_t{ptr[7]} << (8 * 7));
}
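
On a little-endian target the function simply reassembles the bytes of its argument in place, so it should compile to a plain register move. A minimal runtime check of that identity (an illustrative addition, assuming the definition above is in the same translation unit and a little-endian host):

#include <cassert>
#include <cstring>

int main() {
  uint64_t v = 0x0807060504030201ULL;
  unsigned char bytes[sizeof v];
  std::memcpy(bytes, &v, sizeof v);
  // Little-endian layout: the least significant byte comes first in memory.
  assert(bytes[0] == 0x01 && bytes[7] == 0x08);
  // Reassembling the bytes therefore yields the original value unchanged.
  assert(ReadLittleEndian64(v) == v);
  return 0;
}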

gcc-6.3 generates good machine code on x86-64 with -O2:

ReadLittleEndian64(unsigned long):
        mov     rax, rdi
        ret

gcc-7.1 leaves out one byte from the optimization:

ReadLittleEndian64(unsigned long):
        movabs  rax, 72057594037927935
        and     rax, rdi
        shr     rdi, 56
        sal     rdi, 56
        or      rax, rdi
        ret

Proof: https://gcc.godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAKxAEZSBnVAV2OUxAHIBSAJgGY8AO2QAbZlgDU3fgGEGBfEIIA6BDOzcADAEFtO5sIIA2ACwB9ApIBKmAIboAMoQKicQ/HaFmIh5WctJAHcSdABKaQB2ACF9SUk0IQVJZiS8YCFMdASEO2IAKgTUJKsABwJiaX4AETj4%2BuJMI0xiUsaCc2Q7BRlZROTUhnTM7ORcgo0IPmMQ4nCZWN149tYhFKMAgm4Y8uJuAFZorQPayOqo2Tr6698Niy2dioPo2hPt897eyQAOCO3LpbXG5%2BEz3bbRXbPXhvM5VS5ySQQb6SQq8MJ/SIAnRA4F3SzgyGHfgwj5yL5IlGSfjoi5XHG3fxgx57Q6mElw8nIwqmGn/OlAhmg/HM577dmfBEUwr7XmY/m4xnCiFPQ7GcVkyVcyTGWVYnH1QWbAkq6KRdXw2SIrWRdELfTvfScMKkURcfacUhCLhaD2oLiXXixQOSJisdjSAS0D0Eb1O50AaxA/H4KmTafTGeMLq4pg9Xs4PtIfs4HoYIC0pBjBadpDgsCQaAAtqU8G4yBQIE2W22QMBjPxSAAzVsEFpliAAI1jpCbjcwygA8kJRABPaf4RrIAh4ABumDL1fIykwrurztQ5TwxQPAFoF7xSyw2BxaM7T%2B7PdPiwAPb7GG9mJIwDIMg2qpoiuCECQEb8PQkiyKgzati0MGvtGsboqQibJqmGZ4cmWanrmn6HsWpblpWGG1jAiAoIh3YtOQlBdshxC9pE9DDqIo7EOOU6HrO84EEuq7rngm7bnuB6Fs0p4%2Buel7Xlwd4PowT7sHQb5unmX5cL%2B/6AcBoGRCotAQfgRCVHwsGkPB9GsTBaLoWeCZJim%2BH4dmnDEfmhZkYwFFVvJWmcKpvm%2BlwznBaQe68VeXqmEAA%3D%3D
Comment 1 Jakub Jelinek 2017-07-11 15:59:21 UTC
Started with r239778.
Comment 2 Marc Glisse 2017-07-11 16:42:30 UTC
bswap was happy dealing with

  _2 = MEM[(const unsigned char *)&word];
  _3 = (uint64_t) _2;
  _4 = MEM[(const unsigned char *)&word + 1B];
  _5 = (uint64_t) _4;
  _6 = _5 << 8;
  _8 = MEM[(const unsigned char *)&word + 2B];
  _9 = (uint64_t) _8;
  _10 = _9 << 16;
  _32 = _6 | _10;
  _11 = _3 | _32;

etc., but it has trouble with

  _21 = word_31(D) & 255;
  _1 = BIT_FIELD_REF <word_31(D), 8, 8>;
  _23 = (uint64_t) _1;
  _2 = _23 << 8;
  _4 = BIT_FIELD_REF <word_31(D), 8, 16>;
  _24 = (uint64_t) _4;
  _5 = _24 << 16;
  _32 = _2 | _5;
  _6 = _21 | _32;
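
At the source level, the second form corresponds roughly to the following (an illustrative C++ sketch, not the exact GIMPLE semantics; reassemble is my own name). It makes it visible that the chain is still a no-op even though no memory load is left for bswap to anchor on:

#include <cstdint>

// BIT_FIELD_REF <word, 8, 8*k> extracts byte k of word; each byte is then
// widened and shifted back to bit position 8*k, so the whole OR chain
// reproduces word unchanged.
uint64_t reassemble(uint64_t word) {
  uint64_t result = word & 255;                                 // _21
  for (int k = 1; k < 8; ++k)                                   // _2, _5, ...
    result |= (uint64_t)(unsigned char)(word >> (8 * k)) << (8 * k);
  return result;
}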
Comment 3 Jakub Jelinek 2017-07-11 16:55:01 UTC
I'll have a look at that tomorrow.
Comment 4 Jakub Jelinek 2017-07-13 17:38:52 UTC
Created attachment 41749 [details]
gcc8-pr81396.patch

While the bswap pass finds out this is a nop reshuffling, it hits:
  /* Useless bit manipulation performed by code.  */
  if (!n->base_addr && n->n == cmpnop)
    return NULL;
and doesn't do anything (in GCC 6 the input was less optimized, so n->base_addr was non-NULL and the bswap pass did handle it).
Either we can do something about it in the bswap pass, as done in this untested patch, or we could consider a match.pd optimization for:
  _3 = BIT_FIELD_REF <_7, 8, 16>;
  _32 = (typeof(_7)) _3;
  _4 = _32 << 16;
into:
  _4 = _7 & (((1ULL << 8) - 1) << 16);
if the shift count matches the bit position; the rest of match.pd would then be able to figure it out, the way it already optimizes the following at forwprop1 time:
typedef unsigned long long uint64_t;
uint64_t ReadLittleEndian64(uint64_t word) {
  uint64_t ret = 0;
  ret |= (word & 0xff);
  ret |= (((word & 0xff00) >> 8) << 8);
  ret |= (((word & 0xff0000) >> 16) << 16);
  ret |= (((word & 0xff000000U) >> 24) << 24);
  ret |= (((word & 0xff00000000ULL) >> 32) << 32);
  ret |= (((word & 0xff0000000000ULL) >> 40) << 40);
  ret |= (((word & 0xff000000000000ULL) >> 48) << 48);
  ret |= (((word & 0xff00000000000000ULL) >> 56) << 56);
  return ret;
}
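
The identity behind the suggested match.pd transformation can be checked at the source level like this (an illustrative C++ sketch using the byte at bit position 16 from the example above; extract_widen_shift and mask_form are my names, not part of any patch):

#include <cstdint>
#include <cassert>

// Extract 8 bits at bit position 16, widen back to uint64_t, shift back.
uint64_t extract_widen_shift(uint64_t x) {
  uint64_t byte2 = (unsigned char)(x >> 16);  // like BIT_FIELD_REF <x, 8, 16>
  return byte2 << 16;
}

// The proposed replacement: a single mask.
uint64_t mask_form(uint64_t x) {
  return x & (((1ULL << 8) - 1) << 16);
}

int main() {
  const uint64_t tests[] = {0u, 0x123456789abcdef0ULL, ~0ULL};
  for (uint64_t x : tests)
    assert(extract_widen_shift(x) == mask_form(x));
  return 0;
}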
Comment 5 Jakub Jelinek 2017-07-13 17:39:16 UTC
Or both this bswap change and the match.pd addition.
Comment 6 Marc Glisse 2017-07-14 06:20:16 UTC
(In reply to Jakub Jelinek from comment #5)
> Or both this bswap change and the match.pd addition.

Doing both sounds good to me :-)
Comment 7 Jakub Jelinek 2017-07-17 08:14:49 UTC
Author: jakub
Date: Mon Jul 17 08:14:16 2017
New Revision: 250257

URL: https://gcc.gnu.org/viewcvs?rev=250257&root=gcc&view=rev
Log:
	PR tree-optimization/81396
	* tree-ssa-math-opts.c (struct symbolic_number): Add n_ops field.
	(init_symbolic_number): Initialize it to 1.
	(perform_symbolic_merge): Add n_ops from both operands into the new
	n_ops.
	(find_bswap_or_nop): Don't consider n->n == cmpnop computations
	without base_addr as useless if they need more than one operation.
	(bswap_replace): Handle !bswap case for NULL base_addr.

	* gcc.dg/tree-ssa/pr81396.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr81396.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-math-opts.c
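
In rough terms, the fix makes the pass count how many statements were merged into the symbolic number, and it only dismisses a no-op without a memory source when that no-op took a single statement. A conceptual sketch of the logic (my own simplification for illustration, not the actual code in tree-ssa-math-opts.c):

#include <cstdint>

// Simplified model of the pass's bookkeeping; names follow the ChangeLog,
// but the merge itself is reduced to a trivial OR for illustration.
struct symbolic_number_sketch {
  uint64_t n;    // symbolic byte layout
  int n_ops;     // number of statements merged so far, initialized to 1
};

symbolic_number_sketch merge(symbolic_number_sketch a, symbolic_number_sketch b) {
  // perform_symbolic_merge now adds n_ops from both operands.
  return { a.n | b.n, a.n_ops + b.n_ops };
}

bool useless_bit_manipulation(const symbolic_number_sketch &s,
                              bool has_base_addr, uint64_t cmpnop) {
  // Previously any identity without a base address was dismissed; now a
  // chain that needed more than one operation is still worth replacing
  // with a plain copy of the source value.
  return !has_base_addr && s.n == cmpnop && s.n_ops <= 1;
}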
Comment 8 Jakub Jelinek 2017-07-20 08:19:04 UTC
Fixed for 8.1+. It is a new optimization, so I think it shouldn't be backported to 7.x.
Comment 9 Marc Glisse 2017-07-20 09:41:41 UTC
Should we open a separate PR for the transformation you suggested in comment 4, or does that seem not useful enough now, or will it be part of bitfield GIMPLE lowering when that lands?
Comment 10 Andrew Pinski 2019-12-24 22:28:31 UTC
(In reply to Jakub Jelinek from comment #4)
> Either we can do something in the bswap pass with it as done in this
> untested patch, or we could consider match.pd optimization for:
>   _3 = BIT_FIELD_REF <_7, 8, 16>;
>   _32 = (typeof(_7)) _3;
>   _4 = _32 << 16;
> into:
>   _4 = _7 & (((1ULL << 8) - 1) << 16);

I am implementing the above because, while adding more bit_insert detection, I found that bswap does not understand bit_insert; plus we can optimize this earlier.