This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [21/32] Remove global call sets: LRA
- From: Richard Sandiford <richard.sandiford@arm.com>
- To: Uros Bizjak <ubizjak@gmail.com>
- Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>, "H. J. Lu" <hjl.tools@gmail.com>
- Date: Sun, 06 Oct 2019 15:32:18 +0100
- Subject: Re: [21/32] Remove global call sets: LRA
- References: <CAFULd4aEM9wUzDGS4=R66TG7-08nZb=nPEN9rdPNvPcz=qMoWw@mail.gmail.com>
Uros Bizjak <ubizjak@gmail.com> writes:
>>>> This caused:
>>>>
>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91994
>>
>> Thanks for reducing & tracking down the underlying cause.
>>
>>> This change doesn't work with -mvzeroupper. When -mvzeroupper is used,
>>> upper bits of vector registers are clobbered upon callee return if any
>>> YMM/ZMM registers are used in the callee. Even if YMM7 isn't used, upper
>>> bits of YMM7 can still be clobbered by vzeroupper when YMM1 is used.
>>
>> The problem here really is that the pattern is just:
>>
>> (define_insn "avx_vzeroupper"
>> [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)]
>> "TARGET_AVX"
>> "vzeroupper"
>> ...)
>>
>> and so its effect on the registers isn't modelled at all in rtl.
>> Maybe one option would be to add a parallel:
>>
>> (set (reg:V2DI N) (reg:V2DI N))
>>
>> for each register. Or we could do something like I did for the SVE
>> tlsdesc calls, although here that would mean using a call pattern for
>> something that isn't really a call. Or we could reinstate clobber_high
>> and use that, but that's very much third out of three.
>>
>> I don't think we should add target hooks to get around this, since that's
>> IMO papering over the issue.
>>
>> I'll try the parallel set thing first.
>
> Please note that the vzeroupper insertion pass runs after register
> allocation, so in effect the vzeroupper pattern is hidden from the
> register allocator.
Right, but even post-RA passes rely on the register usage being accurate.
Same for collect_fn_hard_reg_usage, which is the issue here.
The info collected by collect_fn_hard_reg_usage was always wrong for
vzeroupper. What changed with my patch is that we now use that info
for partly call-clobbered registers as well as "normally" clobbered
registers. So this is another instance of a problem that was previously
being masked by having ix86_hard_regno_call_part_clobbered enforce Win64
rules for all ABIs.
My first idea of adding:
(set (reg:V2DI N) (reg:V2DI N))
for all clobbered registers didn't work well because it left previously-
dead registers upwards exposed (obvious in hindsight). And the second
idea of using a fake call would require too many "is this really a call?"
hacks.
So in the end I went for a subpass that chooses between:
(set (reg:V2DI N) (reg:V2DI N))
and
(clobber (reg:V2DI N))
depending on whether register N is live or not. This fixes the testcase
and doesn't seem to regress code quality for the tests I've tried.
Tested on x86_64-linux-gnu. OK to install?
Richard
2019-10-06 Richard Sandiford <richard.sandiford@arm.com>
gcc/
PR target/91994
* config/i386/sse.md (avx_vzeroupper): Turn into a define_expand
and wrap the unspec_volatile in a parallel.
(*avx_vzeroupper): New define_insn. Use a match_parallel around
the unspec_volatile.
* config/i386/predicates.md (vzeroupper_pattern): Expect the
unspec_volatile to be wrapped in a parallel.
* config/i386/i386-features.c (ix86_add_reg_usage_to_vzeroupper)
(ix86_add_reg_usage_to_vzerouppers): New functions.
(rest_of_handle_insert_vzeroupper): Use them to add register
usage information to the vzeroupper instructions.
gcc/testsuite/
PR target/91994
* gcc.target/i386/pr91994.c: New test.
Index: gcc/config/i386/sse.md
===================================================================
--- gcc/config/i386/sse.md 2019-09-17 15:27:10.214075253 +0100
+++ gcc/config/i386/sse.md 2019-10-06 15:19:10.062769500 +0100
@@ -19622,9 +19622,16 @@ (define_insn "*avx_vzeroall"
(set_attr "mode" "OI")])
;; Clear the upper 128bits of AVX registers, equivalent to a NOP
-;; if the upper 128bits are unused.
-(define_insn "avx_vzeroupper"
- [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)]
+;; if the upper 128bits are unused. Initially we expand the instructions
+;; as though they had no effect on the SSE registers, but later add SETs and
+;; CLOBBERs to the PARALLEL to model the real effect.
+(define_expand "avx_vzeroupper"
+ [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
+ "TARGET_AVX")
+
+(define_insn "*avx_vzeroupper"
+ [(match_parallel 0 "vzeroupper_pattern"
+ [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
"TARGET_AVX"
"vzeroupper"
[(set_attr "type" "sse")
Index: gcc/config/i386/predicates.md
===================================================================
--- gcc/config/i386/predicates.md 2019-09-10 19:56:45.337178032 +0100
+++ gcc/config/i386/predicates.md 2019-10-06 15:19:10.054769556 +0100
@@ -1441,8 +1441,9 @@ (define_predicate "vzeroall_pattern"
;; return true if OP is a vzeroupper pattern.
(define_predicate "vzeroupper_pattern"
- (and (match_code "unspec_volatile")
- (match_test "XINT (op, 1) == UNSPECV_VZEROUPPER")))
+ (and (match_code "parallel")
+ (match_code "unspec_volatile" "a")
+ (match_test "XINT (XVECEXP (op, 0, 0), 1) == UNSPECV_VZEROUPPER")))
;; Return true if OP is an addsub vec_merge operation
(define_predicate "addsub_vm_operator"
Index: gcc/config/i386/i386-features.c
===================================================================
--- gcc/config/i386/i386-features.c 2019-09-21 13:56:08.895934718 +0100
+++ gcc/config/i386/i386-features.c 2019-10-06 15:19:10.054769556 +0100
@@ -1757,6 +1757,68 @@ convert_scalars_to_vector (bool timode_p
return 0;
}
+/* Modify the vzeroupper pattern in INSN so that it describes the effect
+ that the instruction has on the SSE registers. LIVE_REGS are the set
+ of registers that are live across the instruction.
+
+ For a live register R we use:
+
+ (set (reg:V2DF R) (reg:V2DF R))
+
+ which preserves the low 128 bits but clobbers the upper bits.
+ For a dead register we just use:
+
+ (clobber (reg:V2DF R))
+
+ which invalidates any previous contents of R and stops R from becoming
+ live across the vzeroupper in future. */
+
+static void
+ix86_add_reg_usage_to_vzeroupper (rtx_insn *insn, bitmap live_regs)
+{
+ rtx pattern = PATTERN (insn);
+ unsigned int nregs = TARGET_64BIT ? 16 : 8;
+ rtvec vec = rtvec_alloc (nregs + 1);
+ RTVEC_ELT (vec, 0) = XVECEXP (pattern, 0, 0);
+ for (unsigned int i = 0; i < nregs; ++i)
+ {
+ unsigned int regno = GET_SSE_REGNO (i);
+ rtx reg = gen_rtx_REG (V2DImode, regno);
+ if (bitmap_bit_p (live_regs, regno))
+ RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg);
+ else
+ RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
+ }
+ XVEC (pattern, 0) = vec;
+ df_insn_rescan (insn);
+}
+
+/* Walk the vzeroupper instructions in the function and annotate them
+ with the effect that they have on the SSE registers. */
+
+static void
+ix86_add_reg_usage_to_vzerouppers (void)
+{
+ basic_block bb;
+ rtx_insn *insn;
+ auto_bitmap live_regs;
+
+ df_analyze ();
+ FOR_EACH_BB_FN (bb, cfun)
+ {
+ bitmap_copy (live_regs, df_get_live_out (bb));
+ df_simulate_initialize_backwards (bb, live_regs);
+ FOR_BB_INSNS_REVERSE (bb, insn)
+ {
+ if (!NONDEBUG_INSN_P (insn))
+ continue;
+ if (vzeroupper_pattern (PATTERN (insn), VOIDmode))
+ ix86_add_reg_usage_to_vzeroupper (insn, live_regs);
+ df_simulate_one_insn_backwards (bb, insn, live_regs);
+ }
+ }
+}
+
static unsigned int
rest_of_handle_insert_vzeroupper (void)
{
@@ -1773,6 +1835,7 @@ rest_of_handle_insert_vzeroupper (void)
/* Call optimize_mode_switching. */
g->get_passes ()->execute_pass_mode_switching ();
+ ix86_add_reg_usage_to_vzerouppers ();
return 0;
}
Index: gcc/testsuite/gcc.target/i386/pr91994.c
===================================================================
--- /dev/null 2019-09-17 11:41:18.176664108 +0100
+++ gcc/testsuite/gcc.target/i386/pr91994.c 2019-10-06 15:19:10.062769500 +0100
@@ -0,0 +1,35 @@
+/* { dg-do run } */
+/* { dg-require-effective-target avx } */
+/* { dg-options "-O2 -mavx -mvzeroupper" } */
+
+#include "avx-check.h"
+
+#include <immintrin.h>
+
+__m256i x1, x2, x3;
+
+__attribute__ ((noinline))
+static void
+foo (void)
+{
+ x1 = x2;
+}
+
+void
+bar (void)
+{
+ __m256i x = x1;
+ foo ();
+ x3 = x;
+}
+
+__attribute__ ((noinline))
+void
+avx_test (void)
+{
+ __m256i x = _mm256_set1_epi8 (3);
+ x1 = x;
+ bar ();
+ if (__builtin_memcmp (&x3, &x, sizeof (x)))
+ __builtin_abort ();
+}