Bug 83565 - [7/8 regression] RTL combine pass yields wrong rotate result
Summary: [7/8 regression] RTL combine pass yields wrong rotate result
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization (show other bugs)
Version: 8.0
: P3 normal
Target Milestone: 7.3
Assignee: Eric Botcazou
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2017-12-23 14:39 UTC by Sergei Trofimovich
Modified: 2018-01-12 10:24 UTC (History)
3 users (show)

See Also:
Host:
Target: ia64
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-12-29 00:00:00


Attachments
a.c (406 bytes, text/plain)
2017-12-23 14:39 UTC, Sergei Trofimovich
Details
rotate test case (391 bytes, text/plain)
2017-12-23 18:36 UTC, Jessica Clarke
Details
Zero-out high 32 bits after a rotate (336 bytes, patch)
2017-12-23 18:37 UTC, Jessica Clarke
Details | Diff
Higher bits of paradoxical subregs on ia64 are undefined (748 bytes, patch)
2017-12-24 16:31 UTC, Jessica Clarke
Details | Diff
Stopgap fix (1.26 KB, patch)
2018-01-09 09:31 UTC, Eric Botcazou
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Sergei Trofimovich 2017-12-23 14:39:58 UTC
Created attachment 42959 [details]
a.c

Original but was seen as an openssl DES testsuite failure.

Below is a distilled example that yields differetn results
when built as -O0 vs. -O1 built:

#include <stdio.h>

/*
$ gcc -O0 a.c -o a-O0 && ./a-O0 > o0
$ gcc -O1 a.c -o a-O1 && ./a-O1 > o1
$ diff -U0 o0 o1

-off>>: 7fffffff
+off>>: ffffffff
*/

typedef unsigned int u32;

u32 bug (u32 * result) __attribute__((noinline));
u32 bug (u32 * result)
{
    // non-static and non-volatile to inhibit constant propagation
    volatile u32 ss = 0xFFFFffff;
    volatile u32 d  = 0xEEEEeeee;
    u32 tt = d & 0x00800000;
    u32 r  = tt << 8;

    // rotate
    r = (r >> 31)
      | (r <<  1);

    u32 u = r^ss;
    u32 off = u >> 1;

    // seemingly unrelated but bug-triggering side-effect
    *result = tt;
    return off;
}

int main() {
    u32 l;
    u32 off = bug(&l);
    printf ("off>>: %08x\n", off);
    return 0;
}
Comment 1 Sergei Trofimovich 2017-12-23 14:47:02 UTC
Here is disassembled dump of gcc -O1 result (it's slightly easier to reason about than gcc's assembly output):

Dump of assembler code for function bug:
   0x40000000000005f0 <+0>:     [MII]       mov r14=r12
   // store 'ss=0xffffffff' on stack
   0x40000000000005f1 <+1>:                 mov r15=-1;;
   0x40000000000005f2 <+2>:                 nop.i 0x0

   // store 'd=0xeeeeeeee' on stack
   0x4000000000000600 <+16>:    [MLX]       st4.rel [r14]=r15,4
   0x4000000000000601 <+17>:                movl r15=0xffffffffeeeeeeee;;

   // reload 'd'
   0x4000000000000610 <+32>:    [MMI]       st4.rel [r14]=r15 
   0x4000000000000611 <+33>:                ld4.acq r15=[r14]
   0x4000000000000612 <+34>:                nop.i 0x0

   // u32 tt = d & 0x00800000;
   0x4000000000000620 <+48>:    [MLX]       nop.m 0x0
   0x4000000000000621 <+49>:                movl r14=0x800000;;
   0x4000000000000630 <+64>:    [MMI]       and r15=r14,r15;;

   // u32 r  = tt << 8;
   0x4000000000000631 <+65>:                nop.m 0x0
   0x4000000000000632 <+66>:                dep.z r14=r15,8,24
   0x4000000000000640 <+80>:    [MMI]       ld4.acq r8=[r12]
   // *result = tt;
   0x4000000000000641 <+81>:                st4 [r32]=r15
   0x4000000000000642 <+82>:                nop.i 0x0;;

   // rotate
   //r = (r >> 31)
   //  | (r <<  1);
   0x4000000000000650 <+96>:    [MII]       nop.m 0x0
   0x4000000000000651 <+97>:                mix4.r r14=r14,r14;;
   0x4000000000000652 <+98>:                shr.u r14=r14,31;;

   // u32 u = r^ss;
   0x4000000000000660 <+112>:   [MMI]       xor r8=r8,r14;;
   0x4000000000000661 <+113>:               nop.m 0x0

   // u32 off = u >> 1;
   // Note how we use 32 bits after 1-bit shift.
   0x4000000000000662 <+114>:               extr.u r8=r8,1,32
   0x4000000000000670 <+128>:   [MIB]       nop.m 0x0
   0x4000000000000671 <+129>:               nop.i 0x0
   0x4000000000000672 <+130>:               br.ret.sptk.many b0;;


Here 'extr.u r8=r8,1,32' should be 'extr.u r8=r8,1,31' to yield correct result.
But something in RTK combine pass decided to expand 31 to 32.
Comment 2 Sergei Trofimovich 2017-12-23 14:49:13 UTC
Where the change happens (today's gcc master):
    $./bin/ia64-unknown-linux-gnu-gcc -O1 -S a.c -fdump-rtl-all

    $ fgrep -A3 zero_extract a.c.*

zero_extract extracted 31 bits at offset 1:

a.c.262r.init-regs:        (zero_extract:DI (reg:DI 358)
a.c.262r.init-regs-            (const_int 31 [0x1f])
a.c.262r.init-regs-            (const_int 1 [0x1]))) "a.c":28 159 {extzv}
a.c.262r.init-regs-     (expr_list:REG_DEAD (reg:DI 358)

a.c.264r.combine:insn_cost 4 for    23: r359:DI=zero_extract(r358:DI,0x1f,0x1)
a.c.264r.combine-      REG_DEAD r358:DI
a.c.264r.combine-insn_cost 4 for    24: r356:SI#0=r359:DI
a.c.264r.combine-      REG_DEAD r359:DI

a.c.264r.combine:modifying insn i3    29: r8:DI=zero_extract(r358:DI,0x20,0x1)
a.c.264r.combine-      REG_DEAD r358:DI
a.c.264r.combine-deferring rescan insn with uid = 29.
a.c.264r.combine-starting the processing of deferred insns

zero_extract extracts 32 bits at offset 1:

a.c.264r.combine:        (zero_extract:DI (reg:DI 358)
a.c.264r.combine-            (const_int 32 [0x20])
a.c.264r.combine-            (const_int 1 [0x1]))) "a.c":33 159 {extzv}
a.c.264r.combine-     (expr_list:REG_DEAD (reg:DI 358)
Comment 3 Jessica Clarke 2017-12-23 18:36:07 UTC
Created attachment 42960 [details]
rotate test case

Jason (in Cc) sent me this example, which is similar. What's basically happening, as far as I understand, is that, since the source for the "extr.u" is being used for a 32-bit integer, it knows that the high 32 bits of the register should be 0, and thus it can always increase the length parameter to 32, but the earlier rotate (which is implemented with a mix4.r, i.e. copy lower 32 bits to upper 32 bits, and a shr.u) doesn't clear out the high 32 bits.
Comment 4 Jessica Clarke 2017-12-23 18:37:41 UTC
Created attachment 42961 [details]
Zero-out high 32 bits after a rotate

Please try the attached patch.
Comment 5 Sergei Trofimovich 2017-12-23 18:53:18 UTC
gcc's try_combine() scares me lots :)

Segher, can you guide me here what changes size of zero_extract in here from 0x1f to 0x20?
Comment 6 Sergei Trofimovich 2017-12-23 19:27:02 UTC
(In reply to James Clarke from comment #4)
> Created attachment 42961 [details]
> Zero-out high 32 bits after a rotate
> 
> Please try the attached patch.

The patch works on an attached example. Building 7.2.0 with your patch to check on openssl testsuite.

Thank you!
Comment 7 Segher Boessenkool 2017-12-23 19:53:23 UTC
(In reply to Sergei Trofimovich from comment #5)
> gcc's try_combine() scares me lots :)

It's an acquired taste.

> Segher, can you guide me here what changes size of zero_extract in here from
> 0x1f to 0x20?

Trying 23 -> 24:
   23: r359:DI=zero_extract(r358:DI,0x1f,0x1)
      REG_DEAD r358:DI
   24: r356:SI#0=r359:DI
      REG_DEAD r359:DI
Successfully matched this instruction:
(set (subreg:DI (reg:SI 356 [ off ]) 0)
    (lshiftrt:DI (reg:DI 358)
        (const_int 1 [0x1])))
allowing combination of insns 23 and 24
original costs 4 + 4 = 8
replacement cost 4
deferring deletion of insn with uid = 23.
modifying insn i3    24: r356:SI#0=r358:DI 0>>0x1
      REG_DEAD r358:DI
deferring rescan insn with uid = 24.

[ ... ]

Trying 24 -> 29:
   24: r356:SI#0=r358:DI 0>>0x1
      REG_DEAD r358:DI
   29: r8:SI=r356:SI
      REG_DEAD r356:SI
Successfully matched this instruction:
(set (reg:DI 8 r8)
    (zero_extract:DI (reg:DI 358)
        (const_int 32 [0x20])
        (const_int 1 [0x1])))
allowing combination of insns 24 and 29
original costs 4 + 4 = 8
replacement cost 4
deferring deletion of insn with uid = 24.
modifying insn i3    29: r8:DI=zero_extract(r358:DI,0x20,0x1)
      REG_DEAD r358:DI
deferring rescan insn with uid = 29.



r358 is known to have the high half zero:
   22: r358:DI=r357:SI#0^r341:SI#0

so this is correct.


The zero_extract(31,1) is transformed to a shift right, and that then
later is transformed to a zero_extract(32,1); there is no transform from
31 to 32 :-)

James' patch looks fine fwiw; it's a backend problem.
Comment 8 Sergei Trofimovich 2017-12-23 20:14:51 UTC
Ah, tricky. That makes sense!

Shows again how bad I am at reading RTL logs. Is there a doc I could get through to understand what '#0' means (my guess it's "zero-extend") in '22: r358:DI=r357:SI#0^r341:SI#0' and such?

Thank you!
Comment 9 Segher Boessenkool 2017-12-23 20:22:17 UTC
#0 means subreg 0; in this case,  (subreg:DI (reg:SI 357) 0)  etc.

(The "slim" RTL dumps are lossy in places btw; in _this_ case you
can deduce it has to be DImode, but not always.  OTOH slim dumps
take up less space ;-) )
Comment 10 Sergei Trofimovich 2017-12-23 21:57:25 UTC
> The patch works on an attached example. Building 7.2.0 with your patch to
> check on openssl testsuite.

Full openssl-1.0.2n test suite now passes with this patch.
Comment 11 Jim Wilson 2017-12-24 05:02:49 UTC
On itanium, when you take a paradoxical subreg, the upper bits are undefined.  Note for instance that we use the exact same instruction for addsi3 and adddi3.  So after an addsi3, the upper bits may be garbage because the add instruction writes to all 64 bits.  It is the compare instruction that matters.  We have separate cmp4 for SImode and cmp for DImode, where cmp4 ignores the upper bits of the register because they are garbage bits.

This part

>r358 is known to have the high half zero:
>   22: r358:DI=r357:SI#0^r341:SI#0

is wrong.  The upper bits of both registers are unknown bits, and xor of unknown bits does not return zero.

I think the problem is in nonzero_bits1 in rtlanal.c, in the SUBREG case.  On ia64, we have WORD_REGISTER_OPERATIONS true, and load_extend_op of SImode is ZERO_EXTEND, so the code decides that the upper bits must be zero.  But LOAD_EXTEND_OP only applies to memory operations, and we do not have a subreg of memory here, we have a subreg of a register.  When we have a subreg of a reg, on itanium, the upper bits are unknown.  So the optimization performed here is wrong.  Though one could perhaps argue that the ia64 port is using WORD_REGISTER_OPERATIONS wrong I suppose.  I'm not sure what would happen if we removed that define.

Anyways, since addsi3 and many other simode patterns leave the upper bits as garbage, it doesn't make sense to argue that rotate patterns must zero the upper bits.
Comment 12 Jason Duerstock 2017-12-24 06:34:48 UTC
Some additional information:

1) The code works properly under 6.4.0.
2) The code works properly with -O1 -fno-tree-ter.
Comment 13 Jessica Clarke 2017-12-24 16:31:48 UTC
Created attachment 42963 [details]
Higher bits of paradoxical subregs on ia64 are undefined

Thanks Jim, that makes sense. It seems to me that WORD_REGISTER_OPERATIONS should still be true on ia64 given the description in the documentation. This regression was introduced in r242326, which added the `&& !REG_P (SUBREG_REG (x))` to nonzero_bits1, thereby assuming that the high bits were defined, which is a target-specific assumption. The attached patch therefore encodes this assumption in whether WORD_REGISTER_OPERATIONS is positive or negative; I haven't modified any documentation as there's no point doing that if people disagree with this approach.
Comment 14 Segher Boessenkool 2017-12-24 18:26:25 UTC
(In reply to Jim Wilson from comment #11)

> This part
> 
> >r358 is known to have the high half zero:
> >   22: r358:DI=r357:SI#0^r341:SI#0
> 
> is wrong.  The upper bits of both registers are unknown bits, and xor of
> unknown bits does not return zero.

Yup, I don't know what I was thinking.

> I think the problem is in nonzero_bits1 in rtlanal.c, in the SUBREG case. 
> On ia64, we have WORD_REGISTER_OPERATIONS true, and load_extend_op of SImode
> is ZERO_EXTEND, so the code decides that the upper bits must be zero.  But
> LOAD_EXTEND_OP only applies to memory operations, and we do not have a
> subreg of memory here, we have a subreg of a register.

Right, that should be changed, that makes no sense at all.

Will that fix the problem here?

> When we have a
> subreg of a reg, on itanium, the upper bits are unknown.  So the
> optimization performed here is wrong.  Though one could perhaps argue that
> the ia64 port is using WORD_REGISTER_OPERATIONS wrong I suppose.

WORD_REGISTER_OPERATIONS isn't well-defined.

"""
@defmac WORD_REGISTER_OPERATIONS
Define this macro to 1 if operations between registers with integral mode
smaller than a word are always performed on the entire register.
Most RISC machines have this property and most CISC machines do not.
"""

What operations?  For some operations it can never be true (rotates, shifts,
all slightly more complex operations).  For machines that have explicit
operations in more than one size it cannot be true, either.

> I'm not sure what would happen if we removed that define.

When I removed WORD_REGISTER_OPERATIONS from the rs6000 backend a few things
regressed, but more things improved.
Comment 15 Eric Botcazou 2017-12-24 18:29:24 UTC
> Thanks Jim, that makes sense. It seems to me that WORD_REGISTER_OPERATIONS
> should still be true on ia64 given the description in the documentation.

I disagree, WORD_REGISTER_OPERATIONS means that the (general) registers are always modified as a whole by arithmetic operations.  If that isn't the case, then the macro should not be defined (e.g Aarch64 doesn't define it although ARM does).

> This regression was introduced in r242326, which added the `&& !REG_P
> (SUBREG_REG (x))` to nonzero_bits1, thereby assuming that the high bits were
> defined, which is a target-specific assumption.

No, see above, WORD_REGISTER_OPERATIONS means that the bits are defined.
Comment 16 Jim Wilson 2017-12-24 18:34:15 UTC
That referred patch was written by Eric Botcazou for PR59461 which is for SPARC, which operates same as Itanium, the upper 32-bits of a 32-bit value in a 64-bit reg are undefined.  So it does not appear to be correct for SPARC either.  Hence it appears that we need the same change for SPARC, and that breaks the fix for PR59461.  I think we need to revisit that.  If you have a paradoxical subreg of a reg, then it is only OK to use LOAD_EXTEND_OP if you know the value in the reg came from a memory location.  This check is missing.

If we are going to change the meaning of WORD_REGISTER_OPERATIONS, then we need a change to the docs also.  But I'm not convinced that this needs to change.
Comment 17 Jessica Clarke 2017-12-24 18:36:16 UTC
(In reply to Eric Botcazou from comment #15)
> > Thanks Jim, that makes sense. It seems to me that WORD_REGISTER_OPERATIONS
> > should still be true on ia64 given the description in the documentation.
> 
> I disagree, WORD_REGISTER_OPERATIONS means that the (general) registers are
> always modified as a whole by arithmetic operations.  If that isn't the
> case, then the macro should not be defined (e.g Aarch64 doesn't define it
> although ARM does).

Ok, fair enough, the docs are all I have to go on, as opposed to your existing knowledge of the codebase.

> > This regression was introduced in r242326, which added the `&& !REG_P
> > (SUBREG_REG (x))` to nonzero_bits1, thereby assuming that the high bits were
> > defined, which is a target-specific assumption.
> 
> No, see above, WORD_REGISTER_OPERATIONS means that the bits are defined.

Well, yes and no, then. The regression on ia64 *was* introduced then, but only because of what you believe to be the existing incorrect use of WORD_REGISTER_OPERATIONS. I doubt there's much more I can contribute to this bug myself as I lack the knowledge of what is meant to happen deep inside GCC, but certainly I can confirm that the issue is with nonzero_bits returning only the lower 32 bits due to that if condition.
Comment 18 Jim Wilson 2017-12-24 18:37:56 UTC
SPARC defines WORD_REGISTER_OPERATIONS, and works the same as ia64.  The upper bits of a 64-bit register after a 32-bit operation are don't care bits.
Comment 19 Jessica Clarke 2017-12-24 18:38:19 UTC
(In reply to Jim Wilson from comment #16)
> That referred patch was written by Eric Botcazou for PR59461 which is for
> SPARC, which operates same as Itanium, the upper 32-bits of a 32-bit value
> in a 64-bit reg are undefined.  So it does not appear to be correct for
> SPARC either.  Hence it appears that we need the same change for SPARC, and
> that breaks the fix for PR59461.  I think we need to revisit that.  If you
> have a paradoxical subreg of a reg, then it is only OK to use LOAD_EXTEND_OP
> if you know the value in the reg came from a memory location.  This check is
> missing.
> 
> If we are going to change the meaning of WORD_REGISTER_OPERATIONS, then we
> need a change to the docs also.  But I'm not convinced that this needs to
> change.

The difference is that SPARC has instructions to operate on 32-bit values in the 64-bit registers, like many other RISC architectures, including for things like shifts.
Comment 20 Jim Wilson 2017-12-24 19:10:31 UTC
I don't see the distinction here.  Ia64 has instructions that operate on 32-bit values too, like cmp4.

On sparc, given this testcase
int
sub (int i, int j, int k)
{
  return i + j + k;
}
the compiler generates
sub:
	add	%o0, %o1, %o0
	add	%o0, %o2, %o0
	jmp	%o7+8
	 sra	%o0, 0, %o0
Note that the add instruction operates on the entire 64-bit register, and after the first add, we no longer have a valid 32-bit value, because there might have been an overflow.  This is why we need the sra at the end to sign-extend the return value, because we know that the upper 32-bits are don't care bits.

If you take a paradoxical subreg of a SImode reg after an add instruction, you can't make any assumptions about the upper 32-bits of the value in the register.  Exactly the same as ia64.

I do see that the fact that sparc has 32-bit shift instructions defined means the testcase that fails on ia64 will not fail on sparc.  But if we rely on that for the WORD_REGISTER_OPERATIONS definition, then it gets even messier than it already is.

If nonzero_bits1 isn't changed, then we may need to remove the WORD_REGISTER_OPERATIONS definition in the ia64 port.  Unfortunately, I don't have access to ia64 hardware, so I can't easily check to see what will happen if this is done.
Comment 21 Sergei Trofimovich 2017-12-28 00:24:15 UTC
(In reply to Jim Wilson from comment #20)
> If nonzero_bits1 isn't changed, then we may need to remove the
> WORD_REGISTER_OPERATIONS definition in the ia64 port.  Unfortunately, I
> don't have access to ia64 hardware, so I can't easily check to see what will
> happen if this is done.

I've dropped WORD_REGISTER_OPERATIONS locally as:

--- gcc/config/ia64/ia64.h      (revision 256001)
+++ gcc/config/ia64/ia64.h      (working copy)
@@ -1538,11 +1538,6 @@
 
 #define CASE_VECTOR_PC_RELATIVE 1
 
-/* Define this macro if operations between registers with integral mode smaller
-   than a word are always performed on the entire register.  */
-
-#define WORD_REGISTER_OPERATIONS 1
-

and ran 'make check' on clean master compiler and master compiler with this change.

1. The original test case works as expected after removal:

# gcc-ia64-no-word-register/host-ia64-unknown-linux-gnu/gcc/xgcc -Bgcc-ia64-no-word-register/host-ia64-unknown-linux-gnu/gcc /a.c -o /a -O1 && /a
off>>: 7fffffff
# gcc-ia64-clean/host-ia64-unknown-linux-gnu/gcc/xgcc -Bgcc-ia64-clean/host-ia64-unknown-linux-gnu/gcc /a.c -o /a -O1 && /a
off>>: ffffffff

2. Result of 'make check' tests:

clean:

                === gcc Summary ===

# of expected passes            104484
# of unexpected failures        524
# of unexpected successes       36
# of expected failures          528
# of unresolved testcases       2
# of unsupported tests          2093
Executing on host: /root/gcc-ia64-clean/host-ia64-unknown-linux-gnu/gcc/xgcc -v    (timeout = 300)
spawn /root/gcc-ia64-clean/host-ia64-unknown-linux-gnu/gcc/xgcc -v

no-word-register:

                === gcc Summary ===

# of expected passes            104484
# of unexpected failures        524
# of unexpected successes       36
# of expected failures          528
# of unresolved testcases       2
# of unsupported tests          2093
Executing on host: /root/gcc-ia64-no-word-register/host-ia64-unknown-linux-gnu/gcc/xgcc -v    (timeout = 300)
spawn /root/gcc-ia64-no-word-register/host-ia64-unknown-linux-gnu/gcc/xgcc -v

Or in diff form:

--- clean.log   2017-12-28 00:19:11.897405845 +0000
+++ nwr.log     2017-12-28 00:19:24.497403922 +0000
@@ -10,2 +10,2 @@
-Executing on host: /root/gcc-ia64-clean/host-ia64-unknown-linux-gnu/gcc/xgcc -v    (timeout = 300)
-spawn /root/gcc-ia64-clean/host-ia64-unknown-linux-gnu/gcc/xgcc -v
+Executing on host: /root/gcc-ia64-no-word-register/host-ia64-unknown-linux-gnu/gcc/xgcc -v    (timeout = 300)
+spawn /root/gcc-ia64-no-word-register/host-ia64-unknown-linux-gnu/gcc/xgcc -v
@@ -13 +13 @@
-COLLECT_GCC=/root/gcc-ia64-clean/host-ia64-unknown-linux-gnu/gcc/xgcc
+COLLECT_GCC=/root/gcc-ia64-no-word-register/host-ia64-unknown-linux-gnu/gcc/xgcc
@@ -18 +18 @@
-/root/gcc-ia64-clean/host-ia64-unknown-linux-gnu/gcc/xgcc  version 8.0.0 20171226 (experimental) (GCC) 
+/root/gcc-ia64-no-word-register/host-ia64-unknown-linux-gnu/gcc/xgcc  version 8.0.0 20171226 (experimental) (GCC) 
@@ -20 +20 @@
-runtest completed at Wed Dec 27 17:05:35 2017
+runtest completed at Wed Dec 27 13:14:17 2017

Surprisingly (or maybe not so surprisingly) no effect!

Assembly diff is as expected:

--- /a-clean.s  2017-12-28 00:22:22.945376700 +0000
+++ /a-nwr.s    2017-12-28 00:22:56.665371555 +0000
@@ -23,2 +23,2 @@
-       dep.z r14 = r15, 8, 24
-       ld4.acq r8 = [r12]
+       dep.z r8 = r15, 8, 24
+       ld4.acq r16 = [r12]
@@ -27 +27 @@
-       mix4.r r14 = r14, r14
+       mix4.r r8 = r8, r8
@@ -29 +29 @@
-       shr.u r14 = r14, 31
+       shr.u r8 = r8, 31
@@ -31 +31 @@
-       xor r8 = r8, r14
+       xor r8 = r16, r8
@@ -33 +33 @@
-       extr.u r8 = r8, 1, 32
+       extr.u r8 = r8, 1, 31
Comment 22 Sergei Trofimovich 2017-12-28 13:08:42 UTC
glibc test built with WORD_REGISTER_OPERATIONS unset are also almost not affected:

# diff -U0 ../glibc-orig.log ../glibc-no-word.log 
--- ../glibc-orig.log   2017-12-28 10:49:36.535635523 +0000
+++ ../glibc-no-word.log        2017-12-28 13:03:20.911591933 +0000
@@ -9,0 +10 @@
+FAIL: nptl/tst-cancel21-static
@@ -13 +13,0 @@
-FAIL: rt/tst-cpuclock2
@@ -24,2 +24,2 @@
-    292 FAIL
-   5286 PASS
+    293 FAIL
+   5285 PASS

(large existing batch of ~280 failures is related to FPU failures on NaN)
Comment 23 Jim Wilson 2017-12-29 20:13:34 UTC
There should be no correctness issue with turning off WORD_REGISTER_OPERATIONS.  The only issue should be performance.  If we lose 5+% performance, then we should try to get the broken nonzero_bits1 patch reverted.  If performance stays about the same, then there is probably no point in fighting this and we should just turn off WORD_REGISTER_OPERATIONS.  None of the 3 most important targets set it, so maybe it doesn't work correctly anymore, except maybe on targets like sparc that have shift instructions that read subwords but write the entire word.
Comment 24 Segher Boessenkool 2017-12-29 20:42:13 UTC
Binary size (of vmlinux) grows by 0.09%.  I don't know about performance.
Comment 25 Eric Botcazou 2018-01-02 15:28:31 UTC
> I don't see the distinction here.  Ia64 has instructions that operate on
> 32-bit values too, like cmp4.

The distinction is precisely what WORD_REGISTER_OPERATIONS conveys.  On SPARC and MIPS for example, for a 32-bit operation in 64-bit mode, the entire 64-bit register is modified and the upper bits contain well defined bits.

> On sparc, given this testcase
> int
> sub (int i, int j, int k)
> {
>   return i + j + k;
> }
> the compiler generates
> sub:
> 	add	%o0, %o1, %o0
> 	add	%o0, %o2, %o0
> 	jmp	%o7+8
> 	 sra	%o0, 0, %o0
> Note that the add instruction operates on the entire 64-bit register, and
> after the first add, we no longer have a valid 32-bit value, because there
> might have been an overflow.  This is why we need the sra at the end to
> sign-extend the return value, because we know that the upper 32-bits are
> don't care bits.

No, the upper bits are well defined by the SPARC-V9 architecture.

> If you take a paradoxical subreg of a SImode reg after an add instruction,
> you can't make any assumptions about the upper 32-bits of the value in the
> register.  Exactly the same as ia64.

No, you know what the bits are on SPARC and MIPS for example if you know the contents of the entire 64-bit register before the addition.

> I do see that the fact that sparc has 32-bit shift instructions defined
> means the testcase that fails on ia64 will not fail on sparc.  But if we
> rely on that for the WORD_REGISTER_OPERATIONS definition, then it gets even
> messier than it already is.

I agree that we may need to do something for operations that don't have a clear meaning when "extended" to the full work though.

Let me give it some thought...
Comment 26 Eric Botcazou 2018-01-02 15:28:55 UTC
Investigating.
Comment 27 Jeffrey A. Law 2018-01-02 17:41:26 UTC
WORD_REGISTER_OPERATIONS says the operations in integral modes smaller than a word are performed on the entire register (ie a word).

I think WORD_REGISTER_OPERATIONS is fairly intuitive when it comes to arithmetic and logicals. For arith/logicals I would find myself agreeing with Eric in c#25.

However, I think it's not as clear for shifts/rotates.

On the PA (which defines WORD_REGISTER_OPERATIONS) we have 32bit deposit/extract  which are used for various shifts.  The 32bit extract/deposit instructions set the low 32 bits in the expected way, but leave the upper bits undefined.  So I suspect PA and IA-64 have similar behavior (big surprise).

One could argue that WORD_REGISTER_OPERATIONS should apply to just arith and logicals.  One could also argue that it should apply to shifts in which case ports like the PA and IA-64 probably shouldn't define WORD_REGISTER_OPERATIONS.

If we were to argue that WORD_REGISTER_OPERATIONS should apply to just arith/logicals, then the PA would need to be audited for cases where we might implement an arith operation using shifts (deposit/extract instructions).

And just to be clear here.  I don't think the PA or IA-64 ought to drive decisions in this area.  They're both dead architectures -- but others may have similar properties.


Note none of this touches on what are the high bits of a paradoxical subreg, which are undefined in RTL.  So I'm in partial agreement with Jim c#11 as well.
Comment 28 Sergei Trofimovich 2018-01-02 22:46:09 UTC
Thanks all for very insightful comments and sorting out WORD_REGISTER_OPERATIONS ambiguity! I've understood quite a bit on how RTL does it's magic.

I still have a few related questions to clarify things:

1. Is it directly visible for you from RTL dumps which bits GCC assumes as non-zero or you just know RTL invariants? I had to patch gcc locally to verify RTL prediction by adding printfs into print_pattern to get the idea how it works:

insn_cost 4 for    20: [r347:DI]=<nz_bits(set)=0x800000>r354:DI#0
insn_cost 4 for    21: r357:SI=<nz_bits(set)=0x1>r343:SI<-<0x1

2. Does gcc have any validation mode that could have caught this error after code generation and checking against RTL invariant?
Comment 29 Jim Wilson 2018-01-02 23:43:44 UTC
(In reply to Sergei Trofimovich from comment #28)
> 1. Is it directly visible for you from RTL dumps which bits GCC assumes as
> non-zero or you just know RTL invariants? I had to patch gcc locally to
> verify RTL prediction by adding printfs into print_pattern to get the idea
> how it works:

I stepped through combine in gdb to find the problem.
 
> 2. Does gcc have any validation mode that could have caught this error after
> code generation and checking against RTL invariant?

We have some compile-time and run-time checking code, but nothing that would catch a problem like this.  We do have the regression testsuite which requires manually adding testcases that get miscompiled, and hopefully prevents them from getting miscompiled again.
Comment 30 Eric Botcazou 2018-01-08 09:22:08 UTC
As pointed out by Segher in comment #14, the problem ultimately comes from the ambiguity of WORD_REGISTER_OPERATIONS.  Quoting him:

"WORD_REGISTER_OPERATIONS isn't well-defined.

"""
@defmac WORD_REGISTER_OPERATIONS
Define this macro to 1 if operations between registers with integral mode
smaller than a word are always performed on the entire register.
Most RISC machines have this property and most CISC machines do not.
"""

What operations?  For some operations it can never be true (rotates, shifts,
all slightly more complex operations).  For machines that have explicit
operations in more than one size it cannot be true, either."

I'm leaving out the second case (explicit operations in more than one size) and discussing the first case.  It turns out that, in SPARC-V9 (64-bit architecture), the 3 32-bit shift operations (sll, srl, sra) do operate on the entire 64-bit registers with the expected semantics (e.g. srl clears the upper 32 bits and sra fills them with a copy of the sign bit of the lower part) so, on SPARC, you can apply WORD_REGISTER_OPERATIONS to shifts (there is no rotate instruction).

So it appears that we have 2 classes of RISC machines, the ones supporting a strong version of WORD_REGISTER_OPERATIONS and the others only a weak one.
Comment 31 Eric Botcazou 2018-01-08 10:23:03 UTC
> So it appears that we have 2 classes of RISC machines, the ones supporting a
> strong version of WORD_REGISTER_OPERATIONS and the others only a weak one.

However I'm not sure whether exposing the distinction is really the way to go so I'm going to evaluate the pessimization that would be introduced on SPARC 64-bit by disregarding WORD_REGISTER_OPERATIONS for shift operations.
Comment 32 Eric Botcazou 2018-01-09 09:29:03 UTC
> However I'm not sure whether exposing the distinction is really the way to
> go so I'm going to evaluate the pessimization that would be introduced on
> SPARC 64-bit by disregarding WORD_REGISTER_OPERATIONS for shift operations.

Pessimization is real under the form of redundant sign/zero-extensions, so I'm going to test a stopgap fix instead on IA-64 and SPARC64.
Comment 33 Eric Botcazou 2018-01-09 09:31:49 UTC
Created attachment 43068 [details]
Stopgap fix

To be tested on IA-64 and SPARC.
Comment 34 Eric Botcazou 2018-01-12 10:18:55 UTC
Author: ebotcazou
Date: Fri Jan 12 10:18:24 2018
New Revision: 256572

URL: https://gcc.gnu.org/viewcvs?rev=256572&root=gcc&view=rev
Log:
	PR rtl-optimization/83565
	* rtlanal.c (nonzero_bits1): On WORD_REGISTER_OPERATIONS machines, do
	not extend the result to a larger mode for rotate operations.
	(num_sign_bit_copies1): Likewise.

Added:
    trunk/gcc/testsuite/gcc.c-torture/execute/20180112-1.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/rtlanal.c
    trunk/gcc/testsuite/ChangeLog
Comment 35 Eric Botcazou 2018-01-12 10:21:16 UTC
Author: ebotcazou
Date: Fri Jan 12 10:20:42 2018
New Revision: 256573

URL: https://gcc.gnu.org/viewcvs?rev=256573&root=gcc&view=rev
Log:
	PR rtl-optimization/83565
	* rtlanal.c (nonzero_bits1): On WORD_REGISTER_OPERATIONS machines, do
	not extend the result to a larger mode for rotate operations.
	(num_sign_bit_copies1): Likewise.

Added:
    branches/gcc-7-branch/gcc/testsuite/gcc.c-torture/execute/20180112-1.c
      - copied unchanged from r256572, trunk/gcc/testsuite/gcc.c-torture/execute/20180112-1.c
Modified:
    branches/gcc-7-branch/gcc/ChangeLog
    branches/gcc-7-branch/gcc/rtlanal.c
    branches/gcc-7-branch/gcc/testsuite/ChangeLog
Comment 36 Eric Botcazou 2018-01-12 10:24:04 UTC
Thanks for reporting the problem and distilling a reduced testcase.