Bug 91201 - [9/10/11/12 Regression] SIMD not generated for horizontal sum of bytes in array
Summary: [9/10/11/12 Regression] SIMD not generated for horizontal sum of bytes in array
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 9.1.0
: P2 normal
Target Milestone: 9.5
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2019-07-18 15:54 UTC by Joel Yliluoma
Modified: 2021-06-01 08:14 UTC (History)
1 user (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2019-07-19 00:00:00


Attachments
gcc10-pr91201.patch (1.14 KB, patch)
2019-07-30 10:32 UTC, Jakub Jelinek
Details | Diff
gcc10-pr91201-extr.patch (917 bytes, patch)
2019-08-01 09:14 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Joel Yliluoma 2019-07-18 15:54:49 UTC
For this code —

    typedef unsigned long long E;
    const unsigned D = 2;
    E bytes[D];
    unsigned char sum() 
    {
        E b[D]{};
        //#pragma omp simd
        for(unsigned n=0; n<D; ++n)
        {
            E temp = bytes[n];
            temp += (temp >> 32);
            temp += (temp >> 16);
            temp += (temp >> 8);
            b[n] = temp;
        }
        E result = 0;
        //#pragma omp simd
        for(unsigned n=0; n<D; ++n) result += b[n];
        return result;
    }

GCC 6.4 generates the following neat assembler code, but all versions since GCC 7 (including GCC 9.1) fail to utilize SIMD instructions at all.

        vmovdqa xmm0, XMMWORD PTR bytes[rip]
        vpsrlq  xmm1, xmm0, 32
        vpaddq  xmm1, xmm1, xmm0
        vpsrlq  xmm0, xmm1, 16
        vpaddq  xmm1, xmm0, xmm1
        vpsrlq  xmm0, xmm1, 8
        vpaddq  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 8
        vpaddq  xmm0, xmm0, xmm1
        vmovq   rax, xmm0
        ret

The code that GCC versions since 7.0, including and up to 9.1, generates, is:

        mov     rcx, QWORD PTR bytes[rip]
        mov     rdx, QWORD PTR bytes[rip+8]
        mov     rax, rcx
        shr     rax, 32
        add     rcx, rax
        mov     rax, rcx
        shr     rax, 16
        add     rcx, rax
        mov     rax, rdx
        shr     rax, 32
        add     rdx, rax
        mov     rax, rdx
        shr     rax, 16
        add     rdx, rax
        mov     rax, rcx
        shr     rax, 8
        add     rcx, rdx
        add     rcx, rax
        shr     rdx, 8
        lea     rax, [rcx+rdx]
        ret

Tested using compiler options -Ofast -std=c++17 -pedantic -Wall -Wextra -W -march=skylake. Tried also haswell, broadwell and znver1 for the -march option.

If I change the D constant to a larger one, such as 4 or 8, then SIMD instructions will begin appearing. Interestingly with D=4, it uses stack as a temporary, but with D=8, it manages without (on both AVX and non-AVX code).

If I uncomment the two OpenMP pragmas, then SIMD code will manifest, so it is clear that the compiler _can_ generate the optimal code, but for some reason chooses not to.

The testcase is a horizontal sum of all bytes in an array by the way.

Compiler Explorer link for quick testing: https://godbolt.org/z/azkXiL
Comment 1 Marc Glisse 2019-07-18 19:23:46 UTC
We unroll the loop (-fdisable-tree-cunrolli) and SLP does not handle reductions.
Comment 2 Richard Biener 2019-07-19 08:02:12 UTC
Thus confirmed.
Comment 3 Joel Yliluoma 2019-07-19 09:04:14 UTC
For the record, for this particular case (8-bit checksum of an array, 16 bytes in this case) there exists even more optimal SIMD code, which ICC (version 18 or greater) generates automatically.

        vmovups   xmm0, XMMWORD PTR bytes[rip]                  #5.9
        vpxor     xmm2, xmm2, xmm2                              #4.41
        vpaddb    xmm0, xmm2, xmm0                              #4.41
        vpsrldq   xmm1, xmm0, 8                                 #4.41
        vpaddb    xmm3, xmm0, xmm1                              #4.41
        vpsadbw   xmm4, xmm2, xmm3                              #4.41
        vmovd     eax, xmm4                                     #4.41
        movsx     rax, al                                       #4.41
        ret                                                     #7.16
Comment 4 Jakub Jelinek 2019-07-29 11:49:50 UTC
With the OpenMP directives and -fopenmp-simd we already do vectorize this, as loop->force_vectorize loops aren't unrolled until after vectorization.
Comment 5 Marc Glisse 2019-07-29 19:45:23 UTC
It looks like a convoluted way to write:

unsigned char sum() 
{
  unsigned char res=0;
  unsigned char*p=(unsigned char*)bytes;
  for(int n=0;n<sizeof(bytes);++n)
    res+=p[n];
  return res;
}

which gcc easily recognizes as a reduction. Except that I am not convinced the testcase actually computes the "horizontal sum of all bytes in an array" as claimed.

Yes, the i386 backend seems to be missing reduc_plus_scal_V*QI using (v)psadbw, that should be a separate bug report.
Comment 6 Joel Yliluoma 2019-07-30 04:37:00 UTC
Maybe a horizontal checksum is a bit obscure term. A 8-bit checksum is what is being accomplished, nonetheless. Yes, there are simpler ways to do it…

But I tried a number of different approaches in order to try and get maximum performance SIMD code out of GCC, and I came upon this curious case that I posted this bugreport about.

To another compiler, I reported a related bug concerning a code that looks like this:

    unsigned char calculate_checksum(const void* ptr)
    {
        unsigned char bytes[16], result = 0;
        memcpy(bytes, ptr, 16);
        // The reason the memcpy is there in place is because to
        // my knowledge, it is the only _safe_ way permitted by
        // the standard to do conversions between representations.
        // Union, pointer casting, etc. are not safe.
        for(unsigned n=0; n<16; ++n) result += bytes[n];
        return result;
    }

After my report, their compiler now generates:

        vmovdqu xmm0, xmmword ptr [rdi]
        vpshufd xmm1, xmm0, 78 # xmm1 = xmm0[2,3,0,1]
        vpaddb xmm0, xmm0, xmm1
        vpxor xmm1, xmm1, xmm1
        vpsadbw xmm0, xmm0, xmm1
        vpextrb eax, xmm0, 0
        ret

This is what GCC generates for the same code.

        vmovdqu xmm0, XMMWORD PTR [rdi]
        vpsrldq xmm1, xmm0, 8
        vpaddb  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 4
        vpaddb  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 2
        vpaddb  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 1
        vpaddb  xmm0, xmm0, xmm1
        vpextrb eax, xmm0, 0
        ret

So the bottom line is, (v)psadbw reductions should be added as M. Glisse correctly indicated.
Comment 7 Jakub Jelinek 2019-07-30 10:18:37 UTC
Untested patch to add the reduc_plus_scal_v{16,32,64}qi expanders.
Wonder if we don't need also reduc_plus_scal_v8qi expander for TARGET_MMX_WITH_SSE.

--- gcc/config/i386/sse.md.jj	2019-07-28 17:29:41.488143221 +0200
+++ gcc/config/i386/sse.md	2019-07-30 12:05:34.249034097 +0200
@@ -2728,9 +2728,30 @@ (define_expand "reduc_plus_scal_<mode>"
   DONE;
 })
 
+(define_expand "reduc_plus_scal_v16qi"
+ [(plus:V16QI
+    (match_operand:QI 0 "register_operand")
+    (match_operand:V16QI 1 "register_operand"))]
+ "TARGET_SSE2"
+{
+  rtx tmp = gen_reg_rtx (V1TImode);
+  emit_insn (gen_sse2_lshrv1ti3 (tmp, gen_lowpart (V1TImode, operands[1]),
+				 GEN_INT (64)));
+  rtx tmp2 = gen_reg_rtx (V16QImode);
+  emit_insn (gen_addv16qi3 (tmp2, operands[1], gen_lowpart (V16QImode, tmp)));
+  rtx tmp3 = gen_reg_rtx (V16QImode);
+  emit_move_insn (tmp3, CONST0_RTX (V16QImode));
+  rtx tmp4 = gen_reg_rtx (V2DImode);
+  emit_insn (gen_sse2_psadbw (tmp4, tmp2, tmp3));
+  tmp4 = gen_lowpart (V16QImode, tmp4);
+  emit_insn (gen_vec_extractv16qiqi (operands[0], tmp4, const0_rtx));
+  DONE;
+})
+
 (define_mode_iterator REDUC_PLUS_MODE
  [(V4DF "TARGET_AVX") (V8SF "TARGET_AVX")
-  (V8DF "TARGET_AVX512F") (V16SF "TARGET_AVX512F")])
+  (V8DF "TARGET_AVX512F") (V16SF "TARGET_AVX512F")
+  (V32QI "TARGET_AVX") (V64QI "TARGET_AVX512F")])
 
 (define_expand "reduc_plus_scal_<mode>"
  [(plus:REDUC_PLUS_MODE
@@ -2741,8 +2762,8 @@ (define_expand "reduc_plus_scal_<mode>"
   rtx tmp = gen_reg_rtx (<ssehalfvecmode>mode);
   emit_insn (gen_vec_extract_hi_<mode> (tmp, operands[1]));
   rtx tmp2 = gen_reg_rtx (<ssehalfvecmode>mode);
-  emit_insn (gen_add<ssehalfvecmodelower>3
-    (tmp2, tmp, gen_lowpart (<ssehalfvecmode>mode, operands[1])));
+  rtx tmp3 = gen_lowpart (<ssehalfvecmode>mode, operands[1]);
+  emit_insn (gen_add<ssehalfvecmodelower>3 (tmp2, tmp, tmp3));
   emit_insn (gen_reduc_plus_scal_<ssehalfvecmodelower> (operands[0], tmp2));
   DONE;
 })
Comment 8 Jakub Jelinek 2019-07-30 10:32:24 UTC
Created attachment 46642 [details]
gcc10-pr91201.patch

Full untested patch for the final reduction to scalar.
Comment 9 Marc Glisse 2019-07-30 13:22:29 UTC
(In reply to Jakub Jelinek from comment #7)
> Untested patch to add the reduc_plus_scal_v{16,32,64}qi expanders.
> Wonder if we don't need also reduc_plus_scal_v8qi expander for
> TARGET_MMX_WITH_SSE.

I was even kind of hoping that reduc_plus_scal_v8qi would be the only one needed, with v16qi reducing to it the same way you reduce v32qi and v64qi to v16qi, but that's probably too optimistic because of MMX.
Comment 10 Marc Glisse 2019-07-30 13:45:25 UTC
For AVX512, I wonder if we could use vpsadbw to compute the sums for each 64-bit part, then vcompressb to collect them in the lower 64 bits, then vpsadbw to conclude. Or whatever other faster variant (is Peter Cordes around?). But that's not required for this patch.
Comment 11 Jakub Jelinek 2019-07-30 14:07:36 UTC
I'm not aware of vcompressb insn, only vcompressps and vcompresspd.  Sure, one could just emit whatever we emit for __builtin_shuffle with (__v64qi) { 0, 8, 16, 24, 32, 40, 48, 56, 0, 8, 16, 24, 32, 40, 48, 56, 0, 8, 16, 24, 32, 40, 48, 56, 0, 8, 16, 24, 32, 40, 48, 56, 0, 8, 16, 24, 32, 40, 48, 56, 0, 8, 16, 24, 32, 40, 48, 56, 0, 8, 16, 24, 32, 40, 48, 56, 0, 8, 16, 24, 32, 40, 48, 56 } or similar perm, the question is if it will be faster that way or not.
Comment 12 Marc Glisse 2019-07-30 14:18:12 UTC
(In reply to Jakub Jelinek from comment #11)
> I'm not aware of vcompressb insn, only vcompressps and vcompresspd.

Intel lists it under VBMI2, so icelake+.

> Sure,
> one could just emit whatever we emit for __builtin_shuffle with (__v64qi) {
> 0, 8, 16, 24, 32, 40, 48, 56, 0, 8, 16, 24, 32, 40, 48, 56, 0, 8, 16, 24,
> 32, 40, 48, 56, 0, 8, 16, 24, 32, 40, 48, 56, 0, 8, 16, 24, 32, 40, 48, 56,
> 0, 8, 16, 24, 32, 40, 48, 56, 0, 8, 16, 24, 32, 40, 48, 56, 0, 8, 16, 24,
> 32, 40, 48, 56 } or similar perm, the question is if it will be faster that
> way or not.

Exactly.
Comment 13 Jakub Jelinek 2019-07-31 09:23:20 UTC
Author: jakub
Date: Wed Jul 31 09:22:48 2019
New Revision: 273927

URL: https://gcc.gnu.org/viewcvs?rev=273927&root=gcc&view=rev
Log:
	PR tree-optimization/91201
	* config/i386/sse.md (reduc_plus_scal_v16qi): New expander.
	(REDUC_PLUS_MODE): Add V32QImode for TARGET_AVX and V64QImode for
	TARGET_AVX512F.
	(reduc_plus_scal_<mode>): Improve formatting by introducing
	a temporary.

	* gcc.target/i386/sse2-pr91201.c: New test.
	* gcc.target/i386/avx2-pr91201.c: New test.
	* gcc.target/i386/avx512bw-pr91201.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/avx2-pr91201.c
    trunk/gcc/testsuite/gcc.target/i386/avx512bw-pr91201.c
    trunk/gcc/testsuite/gcc.target/i386/sse2-pr91201.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/sse.md
    trunk/gcc/testsuite/ChangeLog
Comment 14 Jakub Jelinek 2019-07-31 13:49:57 UTC
Author: jakub
Date: Wed Jul 31 13:49:26 2019
New Revision: 273932

URL: https://gcc.gnu.org/viewcvs?rev=273932&root=gcc&view=rev
Log:
	PR tree-optimization/91201
	* config/i386/mmx.md (reduc_plus_scal_v8qi): New expander.

	* gcc.target/i386/sse2-pr91201-2.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/sse2-pr91201-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/mmx.md
    trunk/gcc/testsuite/ChangeLog
Comment 15 Joel Yliluoma 2019-08-01 06:34:52 UTC
Seems to work neatly now.
Any reason why on vector size 128, non-AVX, it does the low byte move through the red zone? Are pextrb or movd instructions not available? Or does ABI specify that the upper bits of the eax register must be zero?

        movaps  XMMWORD PTR [rsp-40], xmm2
        movzx   eax, BYTE PTR [rsp-40]

Clang does just a simple movd here.

        movd    eax, xmm1
Comment 16 Joel Yliluoma 2019-08-01 06:40:00 UTC
In reference to my previous comment, this is the code I tested with and the compiler flags were -Ofast -mno-avx.

	unsigned char bytes[128];
	unsigned char sum (void)
	{
	  unsigned char r = 0;
	  const unsigned char *p = (const unsigned char *) bytes;
	  int n;
	  for (n = 0; n < sizeof (bytes); ++n)
	    r += p[n];
	  return r;
	}
Comment 17 Jakub Jelinek 2019-08-01 09:14:34 UTC
Created attachment 46655 [details]
gcc10-pr91201-extr.patch

pextrb is indeed not available (added in sse4.1) in -msse2 only, but we indeed can use movd for the zero element QImode extraction.  For HImode already SSE2 has extraction insn.
Comment 18 Joel Yliluoma 2019-08-01 11:33:00 UTC
Great, thanks. I can test this in a few days, but I would like to make sure that the proper thing still happens if the vector is of bytes but the return value of the function is a larger-than-byte integer type. Will it still generate a movd in this case? Because that would be wrong. :-)
Comment 19 Joel Yliluoma 2019-08-01 11:40:27 UTC
If the function return type is changed to "unsigned short", the AVX code with "vpextrb" will do a spurious "movzx eax, al" at the end — but if the return type is "unsigned int", it will not. The code with "(v)movd" should of course do it, if the vector element size is shorter than the return type.
Comment 20 Jakub Jelinek 2019-08-01 11:59:52 UTC
(In reply to Joel Yliluoma from comment #19)
> If the function return type is changed to "unsigned short", the AVX code
> with "vpextrb" will do a spurious "movzx eax, al" at the end — but if the
> return type is "unsigned int", it will not. The code with "(v)movd" should
> of course do it, if the vector element size is shorter than the return type.

With movd there is a non-redundant movzxl %al, %eax after the movd in both unsigned short and unsigned int cases.  For {,v}pextrb there is a pattern that makes the zero extension explicit in the IL:
(insn 28 27 29 2 (set (subreg:SI (reg:QI 87 [ stmp_r_10.10 ]) 0)
        (zero_extend:SI (vec_select:QI (subreg:V16QI (reg:V2DI 121) 0)
                (parallel [
                        (const_int 0 [0])
                    ])))) 4165 {*vec_extractv16qi_zext}
     (expr_list:REG_DEAD (reg:V2DI 121)
        (nil)))
and for unsigned int return type the combiner is able to combine that with the following
(insn 29 28 34 2 (set (reg:SI 122 [ stmp_r_10.10 ])
        (zero_extend:SI (reg:QI 87 [ stmp_r_10.10 ]))) "pr91201-4.c":9:11 119 {*zero_extendqisi2}
     (expr_list:REG_DEAD (reg:QI 87 [ stmp_r_10.10 ])
        (nil)))
but it isn't able to merge that for a different extension in the unsigned short return type.
Comment 21 Uroš Bizjak 2019-08-01 17:14:27 UTC
(In reply to Jakub Jelinek from comment #20)
> (In reply to Joel Yliluoma from comment #19)
> > If the function return type is changed to "unsigned short", the AVX code
> > with "vpextrb" will do a spurious "movzx eax, al" at the end — but if the
> > return type is "unsigned int", it will not. The code with "(v)movd" should
> > of course do it, if the vector element size is shorter than the return type.
> 
> With movd there is a non-redundant movzxl %al, %eax after the movd in both
> unsigned short and unsigned int cases.  For {,v}pextrb there is a pattern
> that makes the zero extension explicit in the IL:
> (insn 28 27 29 2 (set (subreg:SI (reg:QI 87 [ stmp_r_10.10 ]) 0)
>         (zero_extend:SI (vec_select:QI (subreg:V16QI (reg:V2DI 121) 0)
>                 (parallel [
>                         (const_int 0 [0])
>                     ])))) 4165 {*vec_extractv16qi_zext}
>      (expr_list:REG_DEAD (reg:V2DI 121)
>         (nil)))
> and for unsigned int return type the combiner is able to combine that with
> the following
> (insn 29 28 34 2 (set (reg:SI 122 [ stmp_r_10.10 ])
>         (zero_extend:SI (reg:QI 87 [ stmp_r_10.10 ]))) "pr91201-4.c":9:11
> 119 {*zero_extendqisi2}
>      (expr_list:REG_DEAD (reg:QI 87 [ stmp_r_10.10 ])
>         (nil)))
> but it isn't able to merge that for a different extension in the unsigned
> short return type.

I think an insn similar to

(define_insn "*vec_extract<PEXTR_MODE12:mode>_zext"

is missing. Like:

(define_insn "*vec_extractv16qi_zext_hi"
  [(set (match_operand:HI 0 "register_operand" "=r,r")
	(zero_extend:HI
	  (vec_select:QI
	    (match_operand:V16QI 1 "register_operand" "x,v")
	    (parallel
	      [(match_operand:SI 2 "const_0_to_15")]))))]
  "TARGET_SSE4_1"
  "@
   %vpextrb\t{%2, %1, %k0|%k0, %1, %2}
   vpextrb\t{%2, %1, %k0|%k0, %1, %2}"
Comment 22 Jakub Jelinek 2019-08-02 08:29:03 UTC
Author: jakub
Date: Fri Aug  2 08:28:31 2019
New Revision: 273998

URL: https://gcc.gnu.org/viewcvs?rev=273998&root=gcc&view=rev
Log:
	PR tree-optimization/91201
	* config/i386/i386-expand.c (ix86_expand_vector_extract): For elt == 0
	V16QImode extraction without sse4.1 try to use V4SImode lowpart
	extraction.

	* gcc.target/i386/sse2-pr91201-3.c: New test.
	* gcc.target/i386/sse2-pr91201-4.c: New test.
	* gcc.target/i386/sse2-pr91201-5.c: New test.
	* gcc.target/i386/sse2-pr91201-6.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/sse2-pr91201-3.c
    trunk/gcc/testsuite/gcc.target/i386/sse2-pr91201-4.c
    trunk/gcc/testsuite/gcc.target/i386/sse2-pr91201-5.c
    trunk/gcc/testsuite/gcc.target/i386/sse2-pr91201-6.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386-expand.c
    trunk/gcc/testsuite/ChangeLog
Comment 23 uros 2019-08-02 15:46:37 UTC
Author: uros
Date: Fri Aug  2 15:46:02 2019
New Revision: 274018

URL: https://gcc.gnu.org/viewcvs?rev=274018&root=gcc&view=rev
Log:
	PR target/91201
	* config/i386/sse.md (*vec_extractv16qi_zext): New insn pattern.

testsuite/ChangeLog:

	PR target/91201
	* gcc.target/i386/sse4_1-pr91201.c: New test.


Added:
    trunk/gcc/testsuite/gcc.target/i386/sse4_1-pr91201.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/sse.md
    trunk/gcc/testsuite/ChangeLog
Comment 24 Joel Yliluoma 2019-08-05 08:48:44 UTC
The simple horizontal 8-bit add seems to work nicely. Very nice work.

However, the original bug report — that the code snippet quoted below no longer receives love from the SIMD optimization unless you explicitly say “pragma #omp simd” — seems still unaddressed.

    #define num_words 2

    typedef unsigned long long E;
    E bytes[num_words];
    unsigned char sum() 
    {
        E b[num_words] = {};
        //#pragma omp simd
        for(unsigned n=0; n<num_words; ++n)
        {
            // Calculate the sum of all bytes in a word
            E temp = bytes[n];
            temp += (temp >> 32);
            temp += (temp >> 16);
            temp += (temp >> 8);
            // Save that number in an array
            b[n] = temp;
        }
        // Calculate sum of those sums
        unsigned char result = 0;
        //#pragma omp simd
        for(unsigned n=0; n<num_words; ++n) result += b[n];
        return result;
    }

Compiler Explorer link: https://godbolt.org/z/XL3cIK
Comment 25 Richard Biener 2019-10-22 12:29:04 UTC
We unroll the loop completely but our basic-block vectorization capabilities do not include reductions.  We see the following there:

  <bb 2> [local count: 357878154]:
  temp_33 = bytes[0];
  _34 = temp_33 >> 32;
  temp_35 = temp_33 + _34;
  _36 = temp_35 >> 16;
  temp_37 = temp_35 + _36;
  _38 = temp_37 >> 8;
  temp_44 = bytes[1];
  _45 = temp_44 >> 32;
  temp_46 = temp_44 + _45;
  _47 = temp_46 >> 16;
  temp_48 = temp_46 + _47;
  _40 = temp_37 + temp_48;
  _49 = temp_48 >> 8;
  _51 = _38 + _40;
  result_29 = _49 + _51;
  _20 = (unsigned char) result_29;
  b ={v} {CLOBBER};
  return _20;
Comment 26 Richard Biener 2019-11-14 07:54:23 UTC
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
Comment 27 Jakub Jelinek 2020-03-04 09:35:47 UTC
GCC 8.4.0 has been released, adjusting target milestone.
Comment 28 Jakub Jelinek 2021-05-14 09:51:56 UTC
GCC 8 branch is being closed.
Comment 29 Richard Biener 2021-06-01 08:14:52 UTC
GCC 9.4 is being released, retargeting bugs to GCC 9.5.