Bug 102124 - [11/12 Regression] -ftree-loop-vectorize Causing Data To Lose Sign Bit on AARCH64 Platform
Summary: [11/12 Regression] -ftree-loop-vectorize Causing Data To Lose Sign Bit on AAR...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 11.2.1
: P2 normal
Target Milestone: 11.3
Assignee: Jakub Jelinek
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2021-08-30 08:07 UTC by Tomas Chang
Modified: 2025-02-24 00:25 UTC (History)
6 users (show)

See Also:
Host:
Target: aarch64
Build:
Known to work:
Known to fail:
Last reconfirmed: 2021-08-30 00:00:00


Attachments
Test program for gcc 11.2.1 on AARCH64 (4.20 KB, application/x-xz)
2021-08-30 08:07 UTC, Tomas Chang
Details
gcc12-pr102124.patch (1.25 KB, patch)
2021-08-30 10:15 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Tomas Chang 2021-08-30 08:07:16 UTC
Created attachment 51374 [details]
Test program for gcc 11.2.1 on AARCH64

Description of problem:
When I'm building libgcrypt 1.9.4 with GCC 11.2.1 on my AARCH64 box(armv8.2 cortex-a76), I'm using -O3 compile options. however, -O3 generated code failed to pass "basic" test of libgcrypt, it fails on "gcry_cipher_checktag" function.
After investigation, I found that, the problem occurs in buf_eq_const() function in file cipher/bufhelp.h of libgcrypt-1.9.4

 362 /* Constant-time compare of two buffers.  Returns 1 if buffers are equal,
 363    and 0 if buffers differ.  */
 364 static inline int
 365 buf_eq_const(const void *_a, const void *_b, size_t len)
 366 {
 367   const byte *a = _a;
 368   const byte *b = _b;
 369   int ab, ba;
 370   size_t i;
 371 
 372   /* Constant-time compare. */
 373   for (i = 0, ab = 0, ba = 0; i < len; i++)
 374     {
 375       /* If a[i] != b[i], either ab or ba will be negative. */
 376       ab |= a[i] - b[i];
 377       ba |= b[i] - a[i];
 378     }
 379 
 380   /* 'ab | ba' is negative when buffers are not equal. */
 382   return (ab | ba) >= 0;
 383 }

The calculation of 2 different array becomes >= 0 on the return value, however, it should be negative value.

After I change -O3 to -O2, this function works again.
Then I compile libgcrypt 1.9.4 with -O2 plus additional GCC options which are added by -O3 to locate the actual option that causing this issue, finally I found that, if "-ftree-loop-vectorize" is used to compile this code, the calculated result is a positive value, if removing "-ftree-loop-vectorize", the calculated result is negative.

Then I downgraded GCC to 10.x, -ftree-loop-vectorize won't cause such issue.
So I'm sure this is a GCC bug.

It seems that "-ftree-loop-vectorize" causing "|=" operation ignore the "sign bit" of "a[i] - b[i]" or "b[i] - a[i]".

I have summarized a test-case, which is attached


I have also compiled a cross-toolchain, using gcc git version (98e482761b083dbc35ae59704ee1eeb0b8eeb5d1), which is also gcc 11.2.1, this git version also has such issue.



Version-Release number of selected component (if applicable):
GCC 11.2.1

Additional Info:
GCC 11 for x86 / x86_64 doesn't have such issue.
GCC 10.x for aarch64 doesn't have such issue.

I have also submitted this bug to Fedora bugzilla
https://bugzilla.redhat.com/show_bug.cgi?id=1998964
Comment 1 Tomas Chang 2021-08-30 08:11:38 UTC
Test Program is as below:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef uint8_t byte;
int buf_eq_const(const void *_a, const void *_b, size_t len)
 {
   const byte *a = _a; 
   const byte *b = _b; 
   int ab, ba; 
   size_t i;
 
   /* Constant-time compare. */
   for (i = 0, ab = 0, ba = 0; i < len; i++)
     {
       /* If a[i] != b[i], either ab or ba will be negative. */
       ab |= a[i] - b[i];
       ba |= b[i] - a[i];
     }   
 
   /* 'ab | ba' is negative when buffers are not equal. */
   printf("(ab | ba) = %d(%08x), ab = %d(%08x), ba = %d(%08x)\n", (ab | ba), (ab | ba), ab, ab, ba, ba);
   return (ab | ba) >= 0;
 }

void print_array(const char *name, uint8_t *a, size_t len)
{
    printf("%s[%lu] = {", name, len);
    for (size_t i = 0; i < len; ++ i) {
        printf("\'%c\', ", a[i]);
    }
    printf("}\n");
}

int main(int argc, char *argv[])
{
    uint8_t a[32] = {'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a'};
    uint8_t b[32] = {'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a'};
    uint8_t c[32] = {'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b'};

    printf("Comparing array a and array b:\n");
    print_array("a", a, 16);
    print_array("b", b, 16);
    if (buf_eq_const(a, b, 16)) {
        printf("a == b\n");
    } else {
        printf("a != b\n");
    }

    printf("Comparing array a and array c:\n");
    print_array("a", a, 16);
    print_array("c", c, 16);
    if (buf_eq_const(a, c, 16)) {
        printf("a == c\n");
    } else {
        printf("a != c\n");
    }

    return 0;
}

Compile this file with the following command:
gcc -Wall -O2 -march=armv8.2-a+crypto+fp16+rcpc+dotprod -mtune=cortex-a76 -o $@ $<

Running Results:
./test-gcc-O2
Comparing array a and array b:
a[16] = {'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', }
b[16] = {'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', }
(ab | ba) = 0(00000000), ab = 0(00000000), ba = 0(00000000)
a == b
Comparing array a and array c:
a[16] = {'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', }
c[16] = {'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', }
(ab | ba) = -1(ffffffff), ab = -1(ffffffff), ba = 1(00000001)
a != c


Compile this file with the following command:
test-gcc-O2-tree-loop-vectorize: test-gcc.c
        gcc -Wall -O2 -march=armv8.2-a+crypto+fp16+rcpc+dotprod -mtune=cortex-a76 -ftree-loop-vectorize -o $@ $<

Running Results:
./test-gcc-O2-tree-loop-vectorize 
Comparing array a and array b:
a[16] = {'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', }
b[16] = {'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', }
(ab | ba) = 0(00000000), ab = 0(00000000), ba = 0(00000000)
a == b
Comparing array a and array c:
a[16] = {'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', }
c[16] = {'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', }
(ab | ba) = 65535(0000ffff), ab = 65535(0000ffff), ba = 1(00000001)
a == c
Comment 2 Jakub Jelinek 2021-08-30 08:32:46 UTC
Started with r11-5160-g9fc9573f9a5e9432e53c7de93985cfbd267f0309 .

Slightly reduced testcase:
int
foo (const unsigned char *a, const unsigned char *b, unsigned long len)
{
  int ab, ba; 
  unsigned long i;
  for (i = 0, ab = 0, ba = 0; i < len; i++)
    {
      ab |= a[i] - b[i];
      ba |= b[i] - a[i];
    }   
  return (ab | ba) >= 0;
}

int main ()
{
  unsigned char a[32] = {'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a'};
  unsigned char b[32] = {'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a'};
  unsigned char c[32] = {'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b'};
  if (!foo (a, b, 16))
    __builtin_abort ();
  if (foo (a, c, 16))
    __builtin_abort ();
  return 0;
}
Comment 3 Jakub Jelinek 2021-08-30 08:44:55 UTC
Before vectorization the loop body is:
  _1 = a_19(D) + i_29;
  _2 = *_1;
  _3 = (int) _2;
  _4 = b_20(D) + i_29;
  _5 = *_4;
  _6 = (int) _5;
  _7 = _3 - _6;
  ab_21 = _7 | ab_25;
  _10 = _6 - _3;
  ba_22 = _10 | ba_27;
where _2 and _5 are unsigned char and _3, _6, _7, _10 are int.
It is vectorized as
  vect_patt_54.17_61 = WIDEN_MINUS_LO_EXPR <vect__2.13_56, vect__5.16_17>;
  vect_patt_54.17_62 = WIDEN_MINUS_HI_EXPR <vect__2.13_56, vect__5.16_17>;
  vect_patt_53.18_63 = [vec_unpack_lo_expr] vect_patt_54.17_61;
  vect_patt_53.18_64 = [vec_unpack_hi_expr] vect_patt_54.17_61;
  vect_patt_53.18_65 = [vec_unpack_lo_expr] vect_patt_54.17_62;
  vect_patt_53.18_66 = [vec_unpack_hi_expr] vect_patt_54.17_62;
  _7 = _3 - _6;
  vect_ab_21.19_67 = vect_patt_53.18_63 | vect_ab_25.9_13;
  vect_ab_21.19_68 = vect_patt_53.18_64 | vect_ab_21.19_67;
  vect_ab_21.19_69 = vect_patt_53.18_65 | vect_ab_21.19_68;
  vect_ab_21.19_70 = vect_patt_53.18_66 | vect_ab_21.19_69;
  ab_21 = _7 | ab_25;
which means it is vectorized as if it was instead of (int) _2 - (int) _6
(int) (unsigned short) ((unsigned short) _2 - (unsigned short) _6).
Comment 4 Tomas Chang 2021-08-30 09:28:15 UTC
(In reply to Jakub Jelinek from comment #3)
> Before vectorization the loop body is:
>   _1 = a_19(D) + i_29;
>   _2 = *_1;
>   _3 = (int) _2;
>   _4 = b_20(D) + i_29;
>   _5 = *_4;
>   _6 = (int) _5;
>   _7 = _3 - _6;
>   ab_21 = _7 | ab_25;
>   _10 = _6 - _3;
>   ba_22 = _10 | ba_27;
> where _2 and _5 are unsigned char and _3, _6, _7, _10 are int.
> It is vectorized as
>   vect_patt_54.17_61 = WIDEN_MINUS_LO_EXPR <vect__2.13_56, vect__5.16_17>;
>   vect_patt_54.17_62 = WIDEN_MINUS_HI_EXPR <vect__2.13_56, vect__5.16_17>;
>   vect_patt_53.18_63 = [vec_unpack_lo_expr] vect_patt_54.17_61;
>   vect_patt_53.18_64 = [vec_unpack_hi_expr] vect_patt_54.17_61;
>   vect_patt_53.18_65 = [vec_unpack_lo_expr] vect_patt_54.17_62;
>   vect_patt_53.18_66 = [vec_unpack_hi_expr] vect_patt_54.17_62;
>   _7 = _3 - _6;
>   vect_ab_21.19_67 = vect_patt_53.18_63 | vect_ab_25.9_13;
>   vect_ab_21.19_68 = vect_patt_53.18_64 | vect_ab_21.19_67;
>   vect_ab_21.19_69 = vect_patt_53.18_65 | vect_ab_21.19_68;
>   vect_ab_21.19_70 = vect_patt_53.18_66 | vect_ab_21.19_69;
>   ab_21 = _7 | ab_25;
> which means it is vectorized as if it was instead of (int) _2 - (int) _6
> (int) (unsigned short) ((unsigned short) _2 - (unsigned short) _6).

Thank you very much for confirming this issue so quicky!
Hope this issue can be fixed soon!
Comment 5 Jakub Jelinek 2021-08-30 10:05:57 UTC
I think the bug is in vect_recog_widen_op_pattern.
When we have half_type unsigned, for PLUS_EXPR, MULT_EXPR (and supposedly LSHIFT_EXPR) it is ok that itype is twice the precision and same sign as half_type, say
unsigned char uc1 = 0xfe, uc2 = 0xff;
int t1 = uc1, t2 = uc2;
t1 + t2 (or t1 * t2) wants to perform the widening operation and then zero-extend to the result type.
But MINUS_EXPR seems to be special,
t1 - t2
is negative despite half_type being unsigned (and, even if type is unsigned, such as for
unsigned u1 = uc1, u2 = uc2;
u1 - u2
wants sign-extension from unsigned short (aka itype) to unsigned int (aka type) rather than zero-extension.
Comment 6 Jakub Jelinek 2021-08-30 10:15:01 UTC
Created attachment 51377 [details]
gcc12-pr102124.patch

Untested fix.
Comment 7 Tomas Chang 2021-09-01 04:19:39 UTC
(In reply to Jakub Jelinek from comment #6)
> Created attachment 51377 [details]
> gcc12-pr102124.patch
> 
> Untested fix.

After applying this patch on GCC 11.2.1 code base, I re-built GCC on my AARCH64 box (spending 36 hours) and tested. This issue is gone.

Thanks for fixing this bug so quickly!
Comment 8 GCC Commits 2021-09-01 11:40:59 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:bea07159d1d4c9a61c8f7097e9f88c2b206b1b2f

commit r12-3286-gbea07159d1d4c9a61c8f7097e9f88c2b206b1b2f
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Wed Sep 1 13:30:51 2021 +0200

    vectorizer: Fix up vectorization using WIDEN_MINUS_EXPR [PR102124]
    
    The following testcase is miscompiled on aarch64-linux at -O3 since the
    introduction of WIDEN_MINUS_EXPR.
    The problem is if the inner type (half_type) is unsigned and the result
    type in which the subtraction is performed (type) has precision more than
    twice as larger as the inner type's precision.
    For other widening operations like WIDEN_{PLUS,MULT}_EXPR, if half_type
    is unsigned, the addition/multiplication result in itype is also unsigned
    and needs to be zero-extended to type.
    But subtraction is special, even when half_type is unsigned, the subtraction
    behaves as signed (also regardless of whether the result type is signed or
    unsigned), 0xfeU - 0xffU is -1 or 0xffffffffU, not 0x0000ffff.
    
    I think it is better not to use mixed signedness of types in
    WIDEN_MINUS_EXPR (have unsigned vector of operands and signed result
    vector), so this patch instead adds another cast to make sure we always
    sign-extend the result from itype to type if type is wider than itype.
    
    2021-09-01  Jakub Jelinek  <jakub@redhat.com>
    
            PR tree-optimization/102124
            * tree-vect-patterns.c (vect_recog_widen_op_pattern): For ORIG_CODE
            MINUS_EXPR, if itype is unsigned with smaller precision than type,
            add an extra cast to signed variant of itype to ensure sign-extension.
    
            * gcc.dg/torture/pr102124.c: New test.
Comment 9 GCC Commits 2021-09-01 11:41:57 UTC
The releases/gcc-11 branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:051040f0642cfd002d31f655a70aef50e6f44d25

commit r11-8948-g051040f0642cfd002d31f655a70aef50e6f44d25
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Wed Sep 1 13:30:51 2021 +0200

    vectorizer: Fix up vectorization using WIDEN_MINUS_EXPR [PR102124]
    
    The following testcase is miscompiled on aarch64-linux at -O3 since the
    introduction of WIDEN_MINUS_EXPR.
    The problem is if the inner type (half_type) is unsigned and the result
    type in which the subtraction is performed (type) has precision more than
    twice as larger as the inner type's precision.
    For other widening operations like WIDEN_{PLUS,MULT}_EXPR, if half_type
    is unsigned, the addition/multiplication result in itype is also unsigned
    and needs to be zero-extended to type.
    But subtraction is special, even when half_type is unsigned, the subtraction
    behaves as signed (also regardless of whether the result type is signed or
    unsigned), 0xfeU - 0xffU is -1 or 0xffffffffU, not 0x0000ffff.
    
    I think it is better not to use mixed signedness of types in
    WIDEN_MINUS_EXPR (have unsigned vector of operands and signed result
    vector), so this patch instead adds another cast to make sure we always
    sign-extend the result from itype to type if type is wider than itype.
    
    2021-09-01  Jakub Jelinek  <jakub@redhat.com>
    
            PR tree-optimization/102124
            * tree-vect-patterns.c (vect_recog_widen_op_pattern): For ORIG_CODE
            MINUS_EXPR, if itype is unsigned with smaller precision than type,
            add an extra cast to signed variant of itype to ensure sign-extension.
    
            * gcc.dg/torture/pr102124.c: New test.
    
    (cherry picked from commit bea07159d1d4c9a61c8f7097e9f88c2b206b1b2f)
Comment 10 Jakub Jelinek 2021-09-01 11:48:00 UTC
Fixed for 11.3+ and 12.1+.
Comment 11 GCC Commits 2024-07-18 08:07:28 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:e217e7dbdc1040e7ee160796e9ca1ef12a0dd1cb

commit r15-2136-ge217e7dbdc1040e7ee160796e9ca1ef12a0dd1cb
Author: Sam James <sam@gentoo.org>
Date:   Thu Jul 18 10:00:17 2024 +0200

    testsuite: Add dg-do run to more tests
    
    All of these are for wrong-code bugs.  Confirmed to be used before but
    with no execution.
    
    2024-07-18  Sam James  <sam@gentoo.org>
    
            PR c++/53288
            PR c++/57437
            PR c/65345
            PR libstdc++/88101
            PR tree-optimization/96369
            PR tree-optimization/102124
            PR tree-optimization/108692
            * c-c++-common/pr96369.c: Add dg-do run directive.
            * gcc.dg/torture/pr102124.c: Ditto.
            * gcc.dg/pr108692.c: Ditto.
            * gcc.dg/atomic/pr65345-4.c: Ditto.
            * g++.dg/cpp0x/lambda/lambda-return1.C: Ditto.
            * g++.dg/init/lifetime4.C: Ditto.
            * g++.dg/torture/builtin-clear-padding-1.C: Ditto.
            * g++.dg/torture/builtin-clear-padding-2.C: Ditto.
            * g++.dg/torture/builtin-clear-padding-3.C: Ditto.
            * g++.dg/torture/builtin-clear-padding-4.C: Ditto.
            * g++.dg/torture/builtin-clear-padding-5.C: Ditto.