Bug 89811

Summary: uint32_t load is not recognized if shifts are done in a fixed-size loop
Product: gcc Reporter: Nikita Kniazev <nok.raven>
Component: tree-optimizationAssignee: Not yet assigned to anyone <unassigned>
Status: NEW ---    
Severity: enhancement CC: gabravier, goon.pri.low, segher
Priority: P3 Keywords: missed-optimization
Version: 9.0   
Target Milestone: ---   
Host: x86_64 Target: x86_64
Build: Known to work:
Known to fail: Last reconfirmed: 2021-09-04 00:00:00
Bug Depends on:    
Bug Blocks: 94094    

Description Nikita Kniazev 2019-03-24 18:31:15 UTC
I was expecting that fixed-size loop will be unrolled and the uint32_t load pattern is recognized, but it does not happen. Clang has no problems with this https://godbolt.org/z/8ES09V

#include <cstdint>
#include <cstring>

// recognized
std::uint32_t good(const unsigned char *p)
{
    std::uint32_t result = 0;
    result |= (static_cast<std::uint32_t>(p[0]) << 0);
    result |= (static_cast<std::uint32_t>(p[1]) << 8);
    result |= (static_cast<std::uint32_t>(p[2]) << 16);
    result |= (static_cast<std::uint32_t>(p[3]) << 24);
    return result;
}

// not recognized if done in a loop
std::uint32_t loop(const unsigned char *p)
{
  std::uint32_t result = 0;
  for (int i = 0; i < 4; ++i)
      result |= (static_cast<std::uint32_t>(p[i]) << (i * 8));

  return result;
}

// other variations are not recognized too
std::uint32_t bad(const unsigned char *p)
{
  std::uint32_t result = 0;
  //result <<= 8;
  result |= static_cast<std::uint32_t>(p[3]);
  result <<= 8;
  result |= static_cast<std::uint32_t>(p[2]);
  result <<= 8;
  result |= static_cast<std::uint32_t>(p[1]);
  result <<= 8;
  result |= static_cast<std::uint32_t>(p[0]);

  return result;
}

std::uint32_t loop2(const unsigned char *p)
{
  std::uint32_t result = 0;
  for (int i = 0; i < 4; ++i) {
      result <<= 8;
      result |= static_cast<std::uint32_t>(p[3 - i]);
  }

  return result;
}
Comment 1 Richard Biener 2019-03-25 09:30:38 UTC
Unrolling happens too late for bswap.
Comment 2 Segher Boessenkool 2019-04-22 12:37:55 UTC
On PowerPC, for "bad" we get

        addi 9,3,2
        lbz 0,1(3)
        lbz 3,0(3)
        lhbrx 10,0,9
        rlwimi 0,10,8,0,31-8
        rlwimi 3,0,8,0,31-8
        rldicl 3,3,0,32
        blr

(BE -m64); it managed to recognise the top two bytes as a byte-reverse load,
but not the lower two.

(And yup, "loop" uses no byte-reverse at all.)
Comment 3 Andrew Pinski 2021-09-04 21:33:06 UTC
*** Bug 94834 has been marked as a duplicate of this bug. ***