Bug 24810 - [4.1/4.2 Regression] mov + mov + testl generated instead of testb
Summary: [4.1/4.2 Regression] mov + mov + testl generated instead of testb
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization (show other bugs)
Version: 4.1.0
: P2 normal
Target Milestone: 4.1.0
Assignee: Not yet assigned to anyone
Depends on:
Blocks: 23153
  Show dependency treegraph
Reported: 2005-11-11 19:27 UTC by Dan Nicolaescu
Modified: 2005-12-29 11:53 UTC (History)
2 users (show)

See Also:
Target: i?86-*-*, x86_64-*-*
Known to work:
Known to fail:
Last reconfirmed:

Preprocessed code containing the functions that exhibit the problem (34.59 KB, text/plain)
2005-11-11 19:29 UTC, Dan Nicolaescu

Note You need to log in before you can comment on or make changes to this bug.
Description Dan Nicolaescu 2005-11-11 19:27:59 UTC
Compiling i387.c from the Linux kernel using: 
 -nostdinc -isystem /usr/lib/gcc/i386-redhat-linux/4.0.1/include -D__KERNEL__ -Iinclude -Wall -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -ffreestanding -O2 -fomit-frame-pointer -g -save-temps -msoft-float -m32 -fno-builtin-sprintf -fno-builtin-log2 -fno-builtin-puts -mpreferred-stack-boundary=2 -fno-unit-at-a-time -march=i686 -mtune=pentium4 -mregparm=3 -Iinclude/asm-i386/mach-default -Wdeclaration-after-statement -Wno-pointer-sign -DKBUILD_BASENAME=i387 -DKBUILD_MODNAME=i387 -carch/i386/kernel/i387.c
(these are the flags generated by rpmbuild on a Fedora Core 4 system) 

Using 4.0 the restore_fpu function looks like:
        testb   $1, boot_cpu_data+15
        je      .L23

Using 4.1 it looks like:
        movl    %eax, %edx
        movl    boot_cpu_data+12, %eax
        testl   $16777216, %eax
        je      .L24

Similar code sequences appear in other functions in the same file: 
get_fpu_mxcsr, get_fpu_swd, get_fpu_cwd, set_fpregs.
The size of these functions increases by 5 bytes (i.e.20%) 

It seems that some of these functions might be on some critical path in the kernel, so the size increase (and maybe speed penalty) could have an impact.

For 4.0 the 00.expand dump looks like:

(insn 9 7 10 1 (set (reg/f:SI 59)
        (const:SI (plus:SI (symbol_ref:SI ("boot_cpu_data") [flags 0x40] <var_decl 0xb7ee2d
80 boot_cpu_data>)
                (const_int 12 [0xc])))) -1 (nil)

(insn 10 9 11 1 (set (reg:SI 60)
        (mem/s/j:SI (reg/f:SI 59) [0 boot_cpu_data.x86_capability+0 S4 A32])) -1 (nil)

(insn 11 10 12 1 (parallel [
            (set (reg:SI 61)
                (and:SI (reg:SI 60)
                    (const_int 16777216 [0x1000000])))
            (clobber (reg:CC 17 flags))
        ]) -1 (nil)

(insn 12 11 13 1 (set (reg:CCZ 17 flags)
        (compare:CCZ (reg:SI 61)
            (const_int 0 [0x0]))) -1 (nil)

for 4.1 is identical except for insn 10 which has mem/s/v/j:SI 
instead of mem/s/j:SI. 

The combine pass of 4.0 deletes insn 10, that does not happen for 4.1

For 4.1 the generated code does not change when using -Os or -march=pentium4

This is one of the causes for PR23153
Comment 1 Dan Nicolaescu 2005-11-11 19:29:06 UTC
Created attachment 10220 [details]
Preprocessed code containing the functions that exhibit the problem
Comment 2 Dan Nicolaescu 2005-11-13 02:47:46 UTC
Simplified testcase: 
struct cpuinfo_x86 {
  unsigned char x86;
  unsigned char x86_vendor;
  unsigned char x86_model;
  unsigned char x86_mask;
  char wp_works_ok;
  char hlt_works_ok;
  char hard_math;
  char rfu;
  int cpuid_level;
  unsigned long x86_capability[7];
} __attribute__((__aligned__((1 << (7)))));

struct task_struct;
extern void foo (struct task_struct *tsk);
extern void bar (struct task_struct *tsk);

extern struct cpuinfo_x86 boot_cpu_data;

static inline __attribute__((always_inline)) int
constant_test_bit(int nr, const volatile unsigned long *addr)
 return ((1UL << (nr & 31)) & (addr[nr >> 5])) != 0;

restore_fpu(struct task_struct *tsk)
  if (constant_test_bit(24, boot_cpu_data.x86_capability))
    foo (tsk);
    bar (tsk);

The generated code for this simplified tescase shows one additional issue:

        movl    %eax, %edx
        movl    boot_cpu_data+12, %eax  ; edx could be used here
        testl   $16777216, %eax         ; and here
        je      .L2
        movl    %edx, %eax  ; then all the mov %eax, %edx and mov %edx, %eax
        jmp     foo         ; instructions could be eliminated.
        .p2align 4,,7
        movl    %edx, %eax
        jmp     bar

Comment 3 Janis Johnson 2005-11-14 22:17:14 UTC
A regression hunt using an i686-linux cross compiler identified the following
patch where the code generation changes:


r99658 | hubicka | 2005-05-13 13:57:19 +0000 (Fri, 13 May 2005) | 15 lines
        * gcc.dg/builtins-43.c: Use gimple dump instead of generic.
        * gcc.dg/fold-xor-?.c: Likewise.
        * gcc.dg/pr15784-?.c: Likewise.
        * gcc.dg/pr20922-?.c: Likewise.
        * gcc.dg/tree-ssa/20050128-1.c: Likewise.
        * gcc.dg/tree-ssa/pr17598.c: Likewise.
        * gcc.dg/tree-ssa/pr20470.c: Likewise.
        * tree-inline.c (copy_body_r): Simplify substituted ADDR_EXPRs.
        * tree-optimize.c (pass_gimple): Kill.
        (init_tree_optimization_passes): Kill pass_gimple.
        * tree-cfg.c (build_tree_cfg): Do verify_stmts to check that we are gimple.
        * tree-dump.c (dump_files): Rename .generic to .gimple.*
Comment 4 Mark Mitchell 2005-11-19 02:10:40 UTC
Should be fixed before 4.1, if possible.
Comment 5 Jan Hubicka 2005-12-18 20:53:29 UTC
Simplified testcase seems to work for me on 4.1 branch:
        movl    4(%esp), %edx
        movl    boot_cpu_data+12, %eax
        testl   $16777216, %eax
        je      .L2
        jmp     foo
        movl    %edx, 4(%esp)
        jmp     bar
"jmp foo" is not elliminated because we don't have pattern for conditional tailcalls.  Should not be big issue to add the neccesary patterns however.

Comment 6 Dan Nicolaescu 2005-12-18 22:57:56 UTC
(In reply to comment #5)
> Simplified testcase seems to work for me on 4.1 branch:
> restore_fpu:
>         movl    4(%esp), %edx
>         movl    boot_cpu_data+12, %eax
>         testl   $16777216, %eax

4.0 still does better, it uses a single "testb" instruction instead of 2 dependent 
movl + testb instructions.
Comment 7 Kazu Hirata 2005-12-19 00:37:02 UTC
We are basically talking about narrowing the memory being loaded for testing.
Now, can we really optimize this case?  We've got

  const volatile unsigned long *addr

I am not sure if "volatile" allows us to change the width of a memory read.
I know a chip that expects you to read memory at one address repeatedly to
transfer a block of data, and people probably use volatile
for this kind of case.  If the compiler changes the width of memory access,
we may be screwing up something.

IMHO, if byte access is really desired, the code should be rewritten that way.
Comment 8 Jakub Jelinek 2005-12-29 11:53:00 UTC
I don't think this is a bug, in fact, not honoring the volatile in GCC 4.0.x
and earlier was a bug.  If you want to allow byte access rather than word
access, you really need to remove the volatile keyword and then it compiles into
        testb   $1, boot_cpu_data+15
        je      .L2
        jmp     foo
        jmp     bar
        .size   restore_fpu, .-restore_fpu
        .ident  "GCC: (GNU) 4.2.0 20051223 (experimental)"

You should report this against Linux kernel, it shouldn't use volatile in