This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [21/32] Remove global call sets: LRA
- From: Richard Sandiford <richard.sandiford@arm.com>
- To: Uros Bizjak <ubizjak@gmail.com>
- Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>, "H. J. Lu" <hjl.tools@gmail.com>
- Date: Sun, 06 Oct 2019 15:32:18 +0100
- Subject: Re: [21/32] Remove global call sets: LRA
- References: <CAFULd4aEM9wUzDGS4=R66TG7-08nZb=nPEN9rdPNvPcz=qMoWw@mail.gmail.com>
Uros Bizjak <ubizjak@gmail.com> writes:
>>>> This caused:
>>>>
>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91994
>>
>> Thanks for reducing & tracking down the underlying cause.
>>
>>> This change doesn't work with -mvzeroupper. When -mvzeroupper is used,
>>> upper bits of vector registers are clobbered upon callee return if any
>>> YMM/ZMM registers are used in the callee. Even if YMM7 isn't used, upper
>>> bits of YMM7 can still be clobbered by vzeroupper when YMM1 is used.
>>
>> The problem here really is that the pattern is just:
>>
>> (define_insn "avx_vzeroupper"
>> [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)]
>> "TARGET_AVX"
>> "vzeroupper"
>> ...)
>>
>> and so its effect on the registers isn't modelled at all in rtl.
>> Maybe one option would be to add a parallel:
>>
>> (set (reg:V2DI N) (reg:V2DI N))
>>
>> for each register. Or we could do something like I did for the SVE
>> tlsdesc calls, although here that would mean using a call pattern for
>> something that isn't really a call. Or we could reinstate clobber_high
>> and use that, but that's very much third out of three.
>>
>> I don't think we should add target hooks to get around this, since that's
>> IMO papering over the issue.
>>
>> I'll try the parallel set thing first.
>
> Please note that the vzeroupper insertion pass runs after register
> allocation, so in effect the vzeroupper pattern is hidden from the
> register allocator.
Right, but even post-RA passes rely on the register usage being accurate.
Same for collect_fn_hard_reg_usage, which is the issue here.
The info collected by collect_fn_hard_reg_usage was always wrong for
vzeroupper. What changed with my patch is that we now use that info
for partly call-clobbered registers as well as "normally" clobbered
registers. So this is another instance of a problem that was previously
being masked by having ix86_hard_regno_call_part_clobbered enforce Win64
rules for all ABIs.
My first idea of adding:
(set (reg:V2DI N) (reg:V2DI N))
for all clobbered registers didn't work well because it left previously-
dead registers upwards exposed (obvious in hindsight). And the second
idea of using a fake call would require too many "is this really a call?"
hacks.
So in the end I went for a subpass that chooses between:
(set (reg:V2DI N) (reg:V2DI N))
and
(clobber (reg:V2DI N))
depending on whether register N is live or not. This fixes the testcase
and doesn't seem to regress code quality for the tests I've tried.
Tested on x86_64-linux-gnu. OK to install?
Richard
2019-10-06 Richard Sandiford <richard.sandiford@arm.com>
gcc/
PR target/91994
* config/i386/sse.md (avx_vzeroupper): Turn into a define_expand
and wrap the unspec_volatile in a parallel.
(*avx_vzeroupper): New define_insn. Use a match_parallel around
the unspec_volatile.
* config/i386/predicates.md (vzeroupper_pattern): Expect the
unspec_volatile to be wrapped in a parallel.
* config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper)
(ix86_add_reg_usage_to_vzerouppers): New functions.
(rest_of_handle_insert_vzeroupper): Use them to add register
usage information to the vzeroupper instructions.
gcc/testsuite/
PR target/91994
* gcc.target/i386/pr91994.c: New test.
Index: gcc/config/i386/sse.md
===================================================================
--- gcc/config/i386/sse.md 2019-09-17 15:27:10.214075253 +0100
+++ gcc/config/i386/sse.md 2019-10-06 15:19:10.062769500 +0100
@@ -19622,9 +19622,16 @@ (define_insn "*avx_vzeroall"
(set_attr "mode" "OI")])
;; Clear the upper 128bits of AVX registers, equivalent to a NOP
-;; if the upper 128bits are unused.
-(define_insn "avx_vzeroupper"
- [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)]
+;; if the upper 128bits are unused. Initially we expand the instructions
+;; as though they had no effect on the SSE registers, but later add SETs and
+;; CLOBBERs to the PARALLEL to model the real effect.
+(define_expand "avx_vzeroupper"
+ [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
+ "TARGET_AVX")
+
+(define_insn "*avx_vzeroupper"
+ [(match_parallel 0 "vzeroupper_pattern"
+ [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
"TARGET_AVX"
"vzeroupper"
[(set_attr "type" "sse")
Index: gcc/config/i386/predicates.md
===================================================================
--- gcc/config/i386/predicates.md 2019-09-10 19:56:45.337178032 +0100
+++ gcc/config/i386/predicates.md 2019-10-06 15:19:10.054769556 +0100
@@ -1441,8 +1441,9 @@ (define_predicate "vzeroall_pattern"
;; return true if OP is a vzeroupper pattern.
(define_predicate "vzeroupper_pattern"
- (and (match_code "unspec_volatile")
- (match_test "XINT (op, 1) == UNSPECV_VZEROUPPER")))
+ (and (match_code "parallel")
+ (match_code "unspec_volatile" "a")
+ (match_test "XINT (XVECEXP (op, 0, 0), 1) == UNSPECV_VZEROUPPER")))
;; Return true if OP is an addsub vec_merge operation
(define_predicate "addsub_vm_operator"
Index: gcc/config/i386/i386-features.c
===================================================================
--- gcc/config/i386/i386-features.c 2019-09-21 13:56:08.895934718 +0100
+++ gcc/config/i386/i386-features.c 2019-10-06 15:19:10.054769556 +0100
@@ -1757,6 +1757,68 @@ convert_scalars_to_vector (bool timode_p
return 0;
}
+/* Modify the vzeroupper pattern in INSN so that it describes the effect
+ that the instruction has on the SSE registers. LIVE_REGS are the set
+ of registers that are live across the instruction.
+
+ For a live register R we use:
+
+ (set (reg:V2DF R) (reg:V2DF R))
+
+ which preserves the low 128 bits but clobbers the upper bits.
+ For a dead register we just use:
+
+ (clobber (reg:V2DF R))
+
+ which invalidates any previous contents of R and stops R from becoming
+ live across the vzeroupper in future. */
+
+static void
+ix86_add_reg_usage_to_vzeroupper (rtx_insn *insn, bitmap live_regs)
+{
+ rtx pattern = PATTERN (insn);
+ unsigned int nregs = TARGET_64BIT ? 16 : 8;
+ rtvec vec = rtvec_alloc (nregs + 1);
+ RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0);
+ for (unsigned int i = 0; i < nregs; ++i)
+ {
+ unsigned int regno = GET_SSE_REGNO (i);
+ rtx reg = gen_rtx_REG (V2DImode, regno);
+ if (bitmap_bit_p (live_regs, regno))
+ RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg);
+ else
+ RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
+ }
+ XVEC (pattern, 0) = vec;
+ df_insn_rescan (insn);
+}
+
+/* Walk the vzeroupper instructions in the function and annotate them
+ with the effect that they have on the SSE registers. */
+
+static void
+ix86_add_reg_usage_to_vzerouppers (void)
+{
+ basic_block bb;
+ rtx_insn *insn;
+ auto_bitmap live_regs;
+
+ df_analyze ();
+ FOR_EACH_BB_FN (bb, cfun)
+ {
+ bitmap_copy (live_regs, df_get_live_out (bb));
+ df_simulate_initialize_backwards (bb, live_regs);
+ FOR_BB_INSNS_REVERSE (bb, insn)
+ {
+ if (!NONDEBUG_INSN_P (insn))
+ continue;
+ if (vzeroupper_pattern (PATTERN (insn), VOIDmode))
+ ix86_add_reg_usage_to_vzeroupper (insn, live_regs);
+ df_simulate_one_insn_backwards (bb, insn, live_regs);
+ }
+ }
+}
+
static unsigned int
rest_of_handle_insert_vzeroupper (void)
{
@@ -1773,6 +1835,7 @@ rest_of_handle_insert_vzeroupper (void)
/* Call optimize_mode_switching. */
g->get_passes ()->execute_pass_mode_switching ();
+ ix86_add_reg_usage_to_vzerouppers ();
return 0;
}
Index: gcc/testsuite/gcc.target/i386/pr91994.c
===================================================================
--- /dev/null 2019-09-17 11:41:18.176664108 +0100
+++ gcc/testsuite/gcc.target/i386/pr91994.c 2019-10-06 15:19:10.062769500 +0100
@@ -0,0 +1,35 @@
+/* { dg-do run } */
+/* { dg-require-effective-target avx } */
+/* { dg-options "-O2 -mavx -mvzeroupper" } */
+
+#include "avx-check.h"
+
+#include <immintrin.h>
+
+__m256i x1, x2, x3;
+
+__attribute__ ((noinline))
+static void
+foo (void)
+{
+ x1 = x2;
+}
+
+void
+bar (void)
+{
+ __m256i x = x1;
+ foo ();
+ x3 = x;
+}
+
+__attribute__ ((noinline))
+void
+avx_test (void)
+{
+ __m256i x = _mm256_set1_epi8 (3);
+ x1 = x;
+ bar ();
+ if (__builtin_memcmp (&x3, &x, sizeof (x)))
+ __builtin_abort ();
+}