Bug 96573 - [10 Regression] Regression in optimization on x86-64 with -O3
Summary: [10 Regression] Regression in optimization on x86-64 with -O3
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 10.2.0
: P2 normal
Target Milestone: 10.4
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2020-08-11 17:45 UTC by Remi Andruccioli
Modified: 2021-04-08 12:02 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2020-08-11 00:00:00


Attachments
This file contains the source code of the function described in the report. (332 bytes, text/x-csrc)
2020-08-11 17:45 UTC, Remi Andruccioli
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Remi Andruccioli 2020-08-11 17:45:09 UTC
Created attachment 49046 [details]
This file contains the source code of the function described in the report.

Hello,

I'd like to describe here what seems to be a regression/missed optimization since GCC 10.

The host and target architecture is x86-64.
I'm using Fedora 32 with Linux 5.7, but I could reproduce it on many other Linux platforms.
I'm using GCC 10.2, but I could reproduce it with any GCC 10 minor version.


Here is a function, written in pure ANSI C99:

#include <stdlib.h>
#include <stdint.h>

void *
ReverseBytesOfPointer(void * const pointer)
{
  const size_t maxIndex = sizeof(pointer) - 1;
  const uint8_t * const oldPointerPointer = (uint8_t*)&pointer;
  void *newPointer;
  uint8_t * const newPointerPointer = (uint8_t *)&newPointer;
  uint8_t i;

  for (i = 0; i <= maxIndex; ++i) {
    newPointerPointer[maxIndex - i] = oldPointerPointer[i];
  }

  return newPointer;
}


What this function does is simply to reverse all the bytes of a pointer. It is written in pure C99 and is extremely portable, as it works from 16-bit to 64-bit machines. (I wrote it and use it for embedded development and I'm happy with it).

What makes it magical is that when compiled with -O3 (and -std=c99 -Wall -Werror -Wextra), GCC 9 (yes, 9) is clever enough to deduce the intent of this function and compiles it all as:

ReverseBytesOfPointer:
        mov     rax, rdi
        bswap   rax
        ret


However, since GCC 10, the magic seems to have disappeared. This is the ASM code that is generated now, with the exact same command line invocation:

ReverseBytesOfPointer:
        movq    %rdi, %rax
        movb    %dil, -1(%rsp)
        movzbl  %ah, %edx
        shrq    $56, %rax
        movb    %dl, -2(%rsp)
        movq    %rdi, %rdx
        shrq    $16, %rdx
        movb    %al, -8(%rsp)
        movb    %dl, -3(%rsp)
        movq    %rdi, %rdx
        shrq    $24, %rdx
        movb    %dl, -4(%rsp)
        movq    %rdi, %rdx
        shrq    $32, %rdx
        movb    %dl, -5(%rsp)
        movq    %rdi, %rdx
        shrq    $40, %rdx
        movb    %dl, -6(%rsp)
        movq    %rdi, %rdx
        shrq    $48, %rdx
        movb    %dl, -7(%rsp)
        movq    -8(%rsp), %rax
        ret


For your convenience, I include here a link to a snippet on compiler-explorer to show the comparison between versions 9.3 and 10.2. This snippet also includes a few unit tests:

https://godbolt.org/z/YG1KPf


Regards,

Remi Andruccioli
Comment 1 Jakub Jelinek 2020-08-11 18:22:21 UTC
Stopped being recognized as bswap with r274922.
That changed the IL before store-merging from
  _22 = MEM[(const uint8_t *)&pointer];
  MEM[(uint8_t *)&newPointer + 7B] = _22;
  _33 = MEM[(const uint8_t *)&pointer + 1B];
  MEM[(uint8_t *)&newPointer + 6B] = _33;
  _44 = MEM[(const uint8_t *)&pointer + 2B];
  MEM[(uint8_t *)&newPointer + 5B] = _44;
  _55 = MEM[(const uint8_t *)&pointer + 3B];
  MEM[(uint8_t *)&newPointer + 4B] = _55;
  _66 = MEM[(const uint8_t *)&pointer + 4B];
  MEM[(uint8_t *)&newPointer + 3B] = _66;
  _77 = MEM[(const uint8_t *)&pointer + 5B];
  MEM[(uint8_t *)&newPointer + 2B] = _77;
  _88 = MEM[(const uint8_t *)&pointer + 6B];
  MEM[(uint8_t *)&newPointer + 1B] = _88;
  _5 = MEM[(const uint8_t *)&pointer + 7B];
  MEM[(uint8_t *)&newPointer] = _5;
  _9 = newPointer;
  newPointer ={v} {CLOBBER};
  return _9;
to:
  _22 = BIT_FIELD_REF <pointer_36(D), 8, 0>;
  MEM[(uint8_t *)&newPointer + 7B] = _22;
  _33 = BIT_FIELD_REF <pointer_36(D), 8, 8>;
  MEM[(uint8_t *)&newPointer + 6B] = _33;
  _44 = BIT_FIELD_REF <pointer_36(D), 8, 16>;
  MEM[(uint8_t *)&newPointer + 5B] = _44;
  _55 = BIT_FIELD_REF <pointer_36(D), 8, 24>;
  MEM[(uint8_t *)&newPointer + 4B] = _55;
  _66 = BIT_FIELD_REF <pointer_36(D), 8, 32>;
  MEM[(uint8_t *)&newPointer + 3B] = _66;
  _77 = BIT_FIELD_REF <pointer_36(D), 8, 40>;
  MEM[(uint8_t *)&newPointer + 2B] = _77;
  _88 = BIT_FIELD_REF <pointer_36(D), 8, 48>;
  MEM[(uint8_t *)&newPointer + 1B] = _88;
  _5 = BIT_FIELD_REF <pointer_36(D), 8, 56>;
  MEM[(uint8_t *)&newPointer] = _5;
  _9 = newPointer;
  newPointer ={v} {CLOBBER};
  return _9;
store merging has code to handle this, but punts on non-integral types (POINTER_TYPE_P in this case).
Just changing:
--- gcc/gimple-ssa-store-merging.c.jj	2020-07-28 15:39:09.909757589 +0200
+++ gcc/gimple-ssa-store-merging.c	2020-08-11 20:19:19.293299266 +0200
@@ -333,7 +333,8 @@ init_symbolic_number (struct symbolic_nu
 {
   int size;
 
-  if (! INTEGRAL_TYPE_P (TREE_TYPE (src)))
+  if (! INTEGRAL_TYPE_P (TREE_TYPE (src))
+      && ! POINTER_TYPE_P (TREE_TYPE (src)))
     return false;
 
   n->base_addr = n->offset = n->alias_set = n->vuse = NULL_TREE;
seems to work, though I must say I don't find much rationale for byte-swapping pointers...
Comment 2 Remi Andruccioli 2020-08-12 15:42:05 UTC
Hello,

Thanks for your quick reply and investigations.

I agree this is quite an uncommon need. I use it when I have to deal with endianness and I/O. So I mainly use it when dumping content and variables from the memory to files on the disk in a constant endianness across all the different target machines, for later analysis. (same idea than what is done with the endianness of network)


Regards,

Remi Andruccioli
Comment 3 Richard Biener 2021-03-31 09:13:32 UTC
See also PR96135
Comment 4 CVS Commits 2021-04-01 08:52:14 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:5b9a65ecbeb22ef6dd3344baae97f85b645522e3

commit r11-7946-g5b9a65ecbeb22ef6dd3344baae97f85b645522e3
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Thu Apr 1 10:51:03 2021 +0200

    bswap: Handle bswapping of pointers [PR96573]
    
    In GCC8/9 we used to optimize this into a bswap, but we no longer do.
    Handling byteswapping of pointers is easy, all we need is to allow them,
    for the __builtin_bswap* we already use TYPE_PRECISION to determine
    the precision and we cast the operand and result to the correct type
    if they aren't uselessly convertible to what the builtin expects.
    
    2021-04-01  Jakub Jelinek  <jakub@redhat.com>
    
            PR tree-optimization/96573
            * gimple-ssa-store-merging.c (init_symbolic_number): Handle
            also pointer types.
    
            * gcc.dg/pr96573.c: New test.
Comment 5 Jakub Jelinek 2021-04-01 09:22:13 UTC
Fixed on the trunk so far.
Comment 6 Andrew Pinski 2021-04-04 19:42:28 UTC
This test fails on aarch64-linux-gnu:
FAIL: gcc.dg/pr96573.c scan-tree-dump optimized "__builtin_bswap"
Comment 7 CVS Commits 2021-04-06 10:46:33 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:bfeb36bd03c2168af263daa13370a20a96c42b5d

commit r11-8002-gbfeb36bd03c2168af263daa13370a20a96c42b5d
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Tue Apr 6 12:44:51 2021 +0200

    testsuite: Fix up pr96573.c on aarch64 [PR96573]
    
    On Thu, Apr 01, 2021 at 02:16:55PM +0100, Alex Coplan via Gcc-patches wrote:
    > FYI, I'm seeing the new test failing on aarch64:
    >
    > PASS: gcc.dg/pr96573.c (test for excess errors)
    > FAIL: gcc.dg/pr96573.c scan-tree-dump optimized "__builtin_bswap"
    
    The vectorizer in the aarch64 case manages to emit a VEC_PERM_EXPR instead
    (which is just as efficient).
    
    So, do we want to go for the following (and/or perhaps also restrict the test to
    a couple of targets where it works?  In my last distro build it failed only
    on aarch64-linux, while armv7hl-linux-gnueabi and
    {i686,x86_64,powerpc64le,s390x}-linux were fine)?
    
    2021-04-06  Jakub Jelinek  <jakub@redhat.com>
    
            PR tree-optimization/96573
            * gcc.dg/pr96573.c: Instead of __builtin_bswap accept also
            VEC_PERM_EXPR with bswapping permutation.
Comment 8 Richard Biener 2021-04-08 12:02:27 UTC
GCC 10.3 is being released, retargeting bugs to GCC 10.4.