Created attachment 51374 [details] Test program for gcc 11.2.1 on AARCH64 Description of problem: When I'm building libgcrypt 1.9.4 with GCC 11.2.1 on my AARCH64 box(armv8.2 cortex-a76), I'm using -O3 compile options. however, -O3 generated code failed to pass "basic" test of libgcrypt, it fails on "gcry_cipher_checktag" function. After investigation, I found that, the problem occurs in buf_eq_const() function in file cipher/bufhelp.h of libgcrypt-1.9.4 362 /* Constant-time compare of two buffers. Returns 1 if buffers are equal, 363 and 0 if buffers differ. */ 364 static inline int 365 buf_eq_const(const void *_a, const void *_b, size_t len) 366 { 367 const byte *a = _a; 368 const byte *b = _b; 369 int ab, ba; 370 size_t i; 371 372 /* Constant-time compare. */ 373 for (i = 0, ab = 0, ba = 0; i < len; i++) 374 { 375 /* If a[i] != b[i], either ab or ba will be negative. */ 376 ab |= a[i] - b[i]; 377 ba |= b[i] - a[i]; 378 } 379 380 /* 'ab | ba' is negative when buffers are not equal. */ 382 return (ab | ba) >= 0; 383 } The calculation of 2 different array becomes >= 0 on the return value, however, it should be negative value. After I change -O3 to -O2, this function works again. Then I compile libgcrypt 1.9.4 with -O2 plus additional GCC options which are added by -O3 to locate the actual option that causing this issue, finally I found that, if "-ftree-loop-vectorize" is used to compile this code, the calculated result is a positive value, if removing "-ftree-loop-vectorize", the calculated result is negative. Then I downgraded GCC to 10.x, -ftree-loop-vectorize won't cause such issue. So I'm sure this is a GCC bug. It seems that "-ftree-loop-vectorize" causing "|=" operation ignore the "sign bit" of "a[i] - b[i]" or "b[i] - a[i]". I have summarized a test-case, which is attached I have also compiled a cross-toolchain, using gcc git version (98e482761b083dbc35ae59704ee1eeb0b8eeb5d1), which is also gcc 11.2.1, this git version also has such issue. Version-Release number of selected component (if applicable): GCC 11.2.1 Additional Info: GCC 11 for x86 / x86_64 doesn't have such issue. GCC 10.x for aarch64 doesn't have such issue. I have also submitted this bug to Fedora bugzilla https://bugzilla.redhat.com/show_bug.cgi?id=1998964
Test Program is as below: #include <stdio.h> #include <stdlib.h> #include <stdint.h> typedef uint8_t byte; int buf_eq_const(const void *_a, const void *_b, size_t len) { const byte *a = _a; const byte *b = _b; int ab, ba; size_t i; /* Constant-time compare. */ for (i = 0, ab = 0, ba = 0; i < len; i++) { /* If a[i] != b[i], either ab or ba will be negative. */ ab |= a[i] - b[i]; ba |= b[i] - a[i]; } /* 'ab | ba' is negative when buffers are not equal. */ printf("(ab | ba) = %d(%08x), ab = %d(%08x), ba = %d(%08x)\n", (ab | ba), (ab | ba), ab, ab, ba, ba); return (ab | ba) >= 0; } void print_array(const char *name, uint8_t *a, size_t len) { printf("%s[%lu] = {", name, len); for (size_t i = 0; i < len; ++ i) { printf("\'%c\', ", a[i]); } printf("}\n"); } int main(int argc, char *argv[]) { uint8_t a[32] = {'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a'}; uint8_t b[32] = {'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a'}; uint8_t c[32] = {'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b'}; printf("Comparing array a and array b:\n"); print_array("a", a, 16); print_array("b", b, 16); if (buf_eq_const(a, b, 16)) { printf("a == b\n"); } else { printf("a != b\n"); } printf("Comparing array a and array c:\n"); print_array("a", a, 16); print_array("c", c, 16); if (buf_eq_const(a, c, 16)) { printf("a == c\n"); } else { printf("a != c\n"); } return 0; } Compile this file with the following command: gcc -Wall -O2 -march=armv8.2-a+crypto+fp16+rcpc+dotprod -mtune=cortex-a76 -o $@ $< Running Results: ./test-gcc-O2 Comparing array a and array b: a[16] = {'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', } b[16] = {'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', } (ab | ba) = 0(00000000), ab = 0(00000000), ba = 0(00000000) a == b Comparing array a and array c: a[16] = {'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', } c[16] = {'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', } (ab | ba) = -1(ffffffff), ab = -1(ffffffff), ba = 1(00000001) a != c Compile this file with the following command: test-gcc-O2-tree-loop-vectorize: test-gcc.c gcc -Wall -O2 -march=armv8.2-a+crypto+fp16+rcpc+dotprod -mtune=cortex-a76 -ftree-loop-vectorize -o $@ $< Running Results: ./test-gcc-O2-tree-loop-vectorize Comparing array a and array b: a[16] = {'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', } b[16] = {'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', } (ab | ba) = 0(00000000), ab = 0(00000000), ba = 0(00000000) a == b Comparing array a and array c: a[16] = {'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', } c[16] = {'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', } (ab | ba) = 65535(0000ffff), ab = 65535(0000ffff), ba = 1(00000001) a == c
Started with r11-5160-g9fc9573f9a5e9432e53c7de93985cfbd267f0309 . Slightly reduced testcase: int foo (const unsigned char *a, const unsigned char *b, unsigned long len) { int ab, ba; unsigned long i; for (i = 0, ab = 0, ba = 0; i < len; i++) { ab |= a[i] - b[i]; ba |= b[i] - a[i]; } return (ab | ba) >= 0; } int main () { unsigned char a[32] = {'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a'}; unsigned char b[32] = {'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a'}; unsigned char c[32] = {'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b'}; if (!foo (a, b, 16)) __builtin_abort (); if (foo (a, c, 16)) __builtin_abort (); return 0; }
Before vectorization the loop body is: _1 = a_19(D) + i_29; _2 = *_1; _3 = (int) _2; _4 = b_20(D) + i_29; _5 = *_4; _6 = (int) _5; _7 = _3 - _6; ab_21 = _7 | ab_25; _10 = _6 - _3; ba_22 = _10 | ba_27; where _2 and _5 are unsigned char and _3, _6, _7, _10 are int. It is vectorized as vect_patt_54.17_61 = WIDEN_MINUS_LO_EXPR <vect__2.13_56, vect__5.16_17>; vect_patt_54.17_62 = WIDEN_MINUS_HI_EXPR <vect__2.13_56, vect__5.16_17>; vect_patt_53.18_63 = [vec_unpack_lo_expr] vect_patt_54.17_61; vect_patt_53.18_64 = [vec_unpack_hi_expr] vect_patt_54.17_61; vect_patt_53.18_65 = [vec_unpack_lo_expr] vect_patt_54.17_62; vect_patt_53.18_66 = [vec_unpack_hi_expr] vect_patt_54.17_62; _7 = _3 - _6; vect_ab_21.19_67 = vect_patt_53.18_63 | vect_ab_25.9_13; vect_ab_21.19_68 = vect_patt_53.18_64 | vect_ab_21.19_67; vect_ab_21.19_69 = vect_patt_53.18_65 | vect_ab_21.19_68; vect_ab_21.19_70 = vect_patt_53.18_66 | vect_ab_21.19_69; ab_21 = _7 | ab_25; which means it is vectorized as if it was instead of (int) _2 - (int) _6 (int) (unsigned short) ((unsigned short) _2 - (unsigned short) _6).
(In reply to Jakub Jelinek from comment #3) > Before vectorization the loop body is: > _1 = a_19(D) + i_29; > _2 = *_1; > _3 = (int) _2; > _4 = b_20(D) + i_29; > _5 = *_4; > _6 = (int) _5; > _7 = _3 - _6; > ab_21 = _7 | ab_25; > _10 = _6 - _3; > ba_22 = _10 | ba_27; > where _2 and _5 are unsigned char and _3, _6, _7, _10 are int. > It is vectorized as > vect_patt_54.17_61 = WIDEN_MINUS_LO_EXPR <vect__2.13_56, vect__5.16_17>; > vect_patt_54.17_62 = WIDEN_MINUS_HI_EXPR <vect__2.13_56, vect__5.16_17>; > vect_patt_53.18_63 = [vec_unpack_lo_expr] vect_patt_54.17_61; > vect_patt_53.18_64 = [vec_unpack_hi_expr] vect_patt_54.17_61; > vect_patt_53.18_65 = [vec_unpack_lo_expr] vect_patt_54.17_62; > vect_patt_53.18_66 = [vec_unpack_hi_expr] vect_patt_54.17_62; > _7 = _3 - _6; > vect_ab_21.19_67 = vect_patt_53.18_63 | vect_ab_25.9_13; > vect_ab_21.19_68 = vect_patt_53.18_64 | vect_ab_21.19_67; > vect_ab_21.19_69 = vect_patt_53.18_65 | vect_ab_21.19_68; > vect_ab_21.19_70 = vect_patt_53.18_66 | vect_ab_21.19_69; > ab_21 = _7 | ab_25; > which means it is vectorized as if it was instead of (int) _2 - (int) _6 > (int) (unsigned short) ((unsigned short) _2 - (unsigned short) _6). Thank you very much for confirming this issue so quicky! Hope this issue can be fixed soon!
I think the bug is in vect_recog_widen_op_pattern. When we have half_type unsigned, for PLUS_EXPR, MULT_EXPR (and supposedly LSHIFT_EXPR) it is ok that itype is twice the precision and same sign as half_type, say unsigned char uc1 = 0xfe, uc2 = 0xff; int t1 = uc1, t2 = uc2; t1 + t2 (or t1 * t2) wants to perform the widening operation and then zero-extend to the result type. But MINUS_EXPR seems to be special, t1 - t2 is negative despite half_type being unsigned (and, even if type is unsigned, such as for unsigned u1 = uc1, u2 = uc2; u1 - u2 wants sign-extension from unsigned short (aka itype) to unsigned int (aka type) rather than zero-extension.
Created attachment 51377 [details] gcc12-pr102124.patch Untested fix.
(In reply to Jakub Jelinek from comment #6) > Created attachment 51377 [details] > gcc12-pr102124.patch > > Untested fix. After applying this patch on GCC 11.2.1 code base, I re-built GCC on my AARCH64 box (spending 36 hours) and tested. This issue is gone. Thanks for fixing this bug so quickly!
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>: https://gcc.gnu.org/g:bea07159d1d4c9a61c8f7097e9f88c2b206b1b2f commit r12-3286-gbea07159d1d4c9a61c8f7097e9f88c2b206b1b2f Author: Jakub Jelinek <jakub@redhat.com> Date: Wed Sep 1 13:30:51 2021 +0200 vectorizer: Fix up vectorization using WIDEN_MINUS_EXPR [PR102124] The following testcase is miscompiled on aarch64-linux at -O3 since the introduction of WIDEN_MINUS_EXPR. The problem is if the inner type (half_type) is unsigned and the result type in which the subtraction is performed (type) has precision more than twice as larger as the inner type's precision. For other widening operations like WIDEN_{PLUS,MULT}_EXPR, if half_type is unsigned, the addition/multiplication result in itype is also unsigned and needs to be zero-extended to type. But subtraction is special, even when half_type is unsigned, the subtraction behaves as signed (also regardless of whether the result type is signed or unsigned), 0xfeU - 0xffU is -1 or 0xffffffffU, not 0x0000ffff. I think it is better not to use mixed signedness of types in WIDEN_MINUS_EXPR (have unsigned vector of operands and signed result vector), so this patch instead adds another cast to make sure we always sign-extend the result from itype to type if type is wider than itype. 2021-09-01 Jakub Jelinek <jakub@redhat.com> PR tree-optimization/102124 * tree-vect-patterns.c (vect_recog_widen_op_pattern): For ORIG_CODE MINUS_EXPR, if itype is unsigned with smaller precision than type, add an extra cast to signed variant of itype to ensure sign-extension. * gcc.dg/torture/pr102124.c: New test.
The releases/gcc-11 branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>: https://gcc.gnu.org/g:051040f0642cfd002d31f655a70aef50e6f44d25 commit r11-8948-g051040f0642cfd002d31f655a70aef50e6f44d25 Author: Jakub Jelinek <jakub@redhat.com> Date: Wed Sep 1 13:30:51 2021 +0200 vectorizer: Fix up vectorization using WIDEN_MINUS_EXPR [PR102124] The following testcase is miscompiled on aarch64-linux at -O3 since the introduction of WIDEN_MINUS_EXPR. The problem is if the inner type (half_type) is unsigned and the result type in which the subtraction is performed (type) has precision more than twice as larger as the inner type's precision. For other widening operations like WIDEN_{PLUS,MULT}_EXPR, if half_type is unsigned, the addition/multiplication result in itype is also unsigned and needs to be zero-extended to type. But subtraction is special, even when half_type is unsigned, the subtraction behaves as signed (also regardless of whether the result type is signed or unsigned), 0xfeU - 0xffU is -1 or 0xffffffffU, not 0x0000ffff. I think it is better not to use mixed signedness of types in WIDEN_MINUS_EXPR (have unsigned vector of operands and signed result vector), so this patch instead adds another cast to make sure we always sign-extend the result from itype to type if type is wider than itype. 2021-09-01 Jakub Jelinek <jakub@redhat.com> PR tree-optimization/102124 * tree-vect-patterns.c (vect_recog_widen_op_pattern): For ORIG_CODE MINUS_EXPR, if itype is unsigned with smaller precision than type, add an extra cast to signed variant of itype to ensure sign-extension. * gcc.dg/torture/pr102124.c: New test. (cherry picked from commit bea07159d1d4c9a61c8f7097e9f88c2b206b1b2f)
Fixed for 11.3+ and 12.1+.
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>: https://gcc.gnu.org/g:e217e7dbdc1040e7ee160796e9ca1ef12a0dd1cb commit r15-2136-ge217e7dbdc1040e7ee160796e9ca1ef12a0dd1cb Author: Sam James <sam@gentoo.org> Date: Thu Jul 18 10:00:17 2024 +0200 testsuite: Add dg-do run to more tests All of these are for wrong-code bugs. Confirmed to be used before but with no execution. 2024-07-18 Sam James <sam@gentoo.org> PR c++/53288 PR c++/57437 PR c/65345 PR libstdc++/88101 PR tree-optimization/96369 PR tree-optimization/102124 PR tree-optimization/108692 * c-c++-common/pr96369.c: Add dg-do run directive. * gcc.dg/torture/pr102124.c: Ditto. * gcc.dg/pr108692.c: Ditto. * gcc.dg/atomic/pr65345-4.c: Ditto. * g++.dg/cpp0x/lambda/lambda-return1.C: Ditto. * g++.dg/init/lifetime4.C: Ditto. * g++.dg/torture/builtin-clear-padding-1.C: Ditto. * g++.dg/torture/builtin-clear-padding-2.C: Ditto. * g++.dg/torture/builtin-clear-padding-3.C: Ditto. * g++.dg/torture/builtin-clear-padding-4.C: Ditto. * g++.dg/torture/builtin-clear-padding-5.C: Ditto.