Bug 11180 - [avr-gcc] Optimization decrease performance of struct assignment.
Summary: [avr-gcc] Optimization decrease performance of struct assignment.
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 3.3.1
: P4 normal
Target Milestone: 5.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: 16996
  Show dependency treegraph
 
Reported: 2003-06-13 02:35 UTC by Dmitry K.
Modified: 2023-06-21 03:54 UTC (History)
5 users (show)

See Also:
Host:
Target: avr-*-*
Build:
Known to work:
Known to fail: 4.6.4
Last reconfirmed: 2005-06-26 02:25:56


Attachments
quick and dirty patch to reduce code size (1.04 KB, patch)
2007-09-16 13:38 UTC, Rask Ingemann Lambertsen
Details | Diff
quick and dirty patch to reduce code size (1.04 KB, patch)
2007-09-16 17:30 UTC, Rask Ingemann Lambertsen
Details | Diff
Rask's patch modified from comments. (1.05 KB, patch)
2007-09-19 17:57 UTC, Eric Weddington
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Dmitry K. 2003-06-13 02:36:00 UTC
Any optimization level increase size and decrease speed 
for next program: 
 
	typedef struct { long lo; int in; } lo_in; 
	lo_in foo (void) 
	{ 
	    lo_in x; 
	    x.lo = 1; 
	    x.in = 2; 
	    return x; 
	} 
 
"avr-gcc -W -Wall -S -O0" produced: 
	/* File "lo_in.c": code   75 = 0x004b (  40), prologues  18, epilogues  17 */ 
"avr-gcc -W -Wall -S -Os" produced: 
	/* File "lo_in.c": code   80 = 0x0050 (  29), prologues  26, epilogues  25 */ 
"avr-gcc -W -Wall -S -O3" produced: 
	/* File "lo_in.c": code   79 = 0x004f (  28), prologues  26, epilogues  25 */ 
 
Compiler: 
	avr-gcc (GCC) 3.3.1 20030519 (prerelease)
Comment 1 Andrew Pinski 2003-07-06 01:24:58 UTC
I can confirm this on the mainline (20030706).
Comment 2 Steven Bosscher 2003-07-10 13:38:21 UTC
On i686-pc-linux-gnu I get identical asm for -O2 and -O3 and also for -O1 and
-Os.   The only difference between -O[23] and -O[1s] is:
<       .p2align 2,,3

        .file   "11180.c"
        .text
        .p2align 2,,3
.globl foo
        .type   foo, @function
foo:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %eax
        movl    $1, (%eax)
        movl    $2, 4(%eax)
        leave
        ret     $4
        .size   foo, .-foo
        .section        .note.GNU-stack,"",@progbits
        .ident  "GCC: (GNU) 3.4 20030710 (experimental)"

So -Os is optimal AFAICT (obviously it gets even smaller with -fomit-frame-pointer).

Apparently this is something target specific.  What target is avr-gcc?

Gr.
Steven
Comment 3 Andrew Pinski 2003-07-10 21:03:25 UTC
I was not the reporter but I was the one who confirmed it; avr-elf is the target, I will post the asm 
(or rtl) if you want.
Comment 4 Andrew Pinski 2004-01-03 19:00:25 UTC
This is a target problem of not showing how moves are done.
Comment 5 Andrew Pinski 2005-01-09 01:30:50 UTC
One issue I find is that avr does not define_insn_and_split (or just define_split) which will greatly 
improve the code generation because the hi part of the QI would be zero and you don't need to set it 
four times.
        ldi r25,hi8(2)
        ldi r25,hi8(1)
        ldi r26,hlo8(1)
        ldi r27,hhi8(1)

The main difference between -Os and -O0 (at least before 4.0.0) is that we use more registers at -Os 
which causes the prologue and eplogue to do more so my recommendation will help a lot.
Comment 6 andy hutchinson 2005-03-27 14:33:09 UTC
The problem here is that gcc is using  a DImode register to handle 6 byte
(int+long) structure. Why I have no idea!

Since the target has no insn for DI move, gcc turns this into individual QImode
byte moves (subregs all over the place!). 

The 'stacked' 6 byte structure is 'popped' into DI register (6 bytes ). Two
other byte registers are explicitely cleared (making our 8 byte DI register) 

What then  follows is a large amount of shuffling. i.e. Moving from intermediate
virtual DI register (8 bytes) into correct place for a 6 byte return. Which
seems to surpass the abilities of the register allocator (DI and return
registers overlap).

Smaller structures (<=4 bytes) are optimally handled. Larger structure >8 are
also much better since they are returned in memory.

So in summary, it would appear that the root cause is allocation of a DI mode
register for structures >4 and <=8 bytes.

A secondary factor is the use of QImode moves (when SI,HImode are available and
more  efficient) 

The problem can be partially alleviated by defining DImode moves (that a hell of
a change though). Poor code still remains - for example clearing unused padding
bytes and extra register usage.

PS -fpack-struct does not change this bug.

Comment 7 Andrew Pinski 2005-06-26 02:25:55 UTC
(In reply to comment #6)
> The problem here is that gcc is using  a DImode register to handle 6 byte
> (int+long) structure. Why I have no idea!
This is so it does not store it on the stack.  As I said in comment #5, this is a target issue and have 
nothing to do with DImode.
Comment 8 Paul Schlie 2005-06-26 15:06:47 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > The problem here is that gcc is using  a DImode register to handle 6 byte
> > (int+long) structure. Why I have no idea!
> This is so it does not store it on the stack.  As I said in comment #5, this is a target issue and have 
> nothing to do with DImode.

It would seem more desireable given the intended purpose to avoid pushing it on the stack
so that it's elements may be more effeciencly accessed, that it's coresponding elements be
allocated within the register file (as opposed to the whole struct remaining packed into an
alllocated DI mode integer), so that it's elements may be effeciently accessed without
needing to rip them out or reassemble them into the otherwise packed monolithic structure?
Comment 9 Eric Weddington 2007-07-24 23:50:35 UTC
Version 4.2.1 offers somewhat better results:

With -O0:
	.file	"test.c"
/* File "test.c": code  109 = 0x006d (  74), prologues  18, epilogues  17 */

With -O[123s]:
	.file	"test.c"
/* File "test.c": code   79 = 0x004f (  28), prologues  26, epilogues  25 */

At least the code no longer increases in size when optimization is turned on.

Dmitry, are the results for 4.2.1 acceptable for this bug report?
Comment 10 Dmitry K. 2007-07-27 01:24:07 UTC
Yes, results are:

avr-gcc-3.3.6:  O0 --> 75,  O1,O2,O3,Os --> 79
avr-gcc-4.2.1:  O0 --> 109, O1,O2,O3,Os --> 79

The mistake is corrected?  It is possible to tell and so as now
application of keys of optimization shortens a code. However,
"correction" does not improve quality of optimization, it only worsens
not optimized code.
Comment 11 Eric Weddington 2007-07-27 14:23:42 UTC
Subject: RE:  [avr-gcc] Optimization decrease performance of
 struct assignment.



> ------- Comment #10 from dmixm at marine dot febras dot ru
> 2007-07-27 01:24 -------
> Yes, results are:
>
> avr-gcc-3.3.6:  O0 --> 75,  O1,O2,O3,Os --> 79
> avr-gcc-4.2.1:  O0 --> 109, O1,O2,O3,Os --> 79
>
> The mistake is corrected?  It is possible to tell and so as now
> application of keys of optimization shortens a code. However,
> "correction" does not improve quality of optimization, it only worsens
> not optimized code.

Agreed. But is this sufficient to close this bug report?

Eric


Comment 12 Dmitry K. 2007-09-09 21:59:01 UTC
Andy Hutchinson wrote (comment #6) that addition a 'movdi' instruction improves the result. I have try to add a very simple 'movdi' (which split into 2 SImode instuctions). In result:
   -O0  --> 85 words,
   -O1,2,3,s  -->  50 words.
Version is 4.2.1
Comment 13 Eric Weddington 2007-09-11 16:10:51 UTC
(In reply to comment #12)
> Andy Hutchinson wrote (comment #6) that addition a 'movdi' instruction improves
> the result. I have try to add a very simple 'movdi' (which split into 2 SImode
> instuctions). In result:
>    -O0  --> 85 words,
>    -O1,2,3,s  -->  50 words.
> Version is 4.2.1
> 

That's great! Dmitry, could you attach your patch?
Comment 14 Dmitry K. 2007-09-16 03:22:00 UTC
It was:

(define_insn "movdi"
  [(set (match_operand:DI 0 "nonimmediate_operand" "")
        (match_operand:DI 1 "general_operand" ""))]
  ""
  "#")
(define_split
  [(set (match_operand:DI 0 "nonimmediate_operand" "")
        (match_operand:DI 1 "general_operand" ""))]
  "reload_completed"
  [(set (match_dup 2) (match_dup 4))
   (set (match_dup 3) (match_dup 5))]
  "{
     operands[2] = simplify_gen_subreg (SImode, operands[0], DImode, 0);
     operands[3] = simplify_gen_subreg (SImode, operands[0], DImode, 4);
     operands[4] = simplify_gen_subreg (SImode, operands[1], DImode, 0);
     operands[5] = simplify_gen_subreg (SImode, operands[1], DImode, 4);
   }")

Alas, this elementary addition has appeared erroneous.
The following program leads to emergency end of compilation:

int main ()
{
    volatile long long x = 0x0102030405060708LL;

    if (x != 0x0102030405060708LL)
        exit (__LINE__);
    exit (0);
}
Comment 15 Rask Ingemann Lambertsen 2007-09-16 12:57:13 UTC
1. You should use define_insn_and_split.
2. If possible (which I think it is here), splitting before reload should produc.e better code.

Btw, what is the ICE?

Also, it seems to me that avr.h defines MOVE_MAX incorrectly. Shouldn't it be 2 instead of 4?
Comment 16 Rask Ingemann Lambertsen 2007-09-16 13:38:47 UTC
Created attachment 14211 [details]
quick and dirty patch to reduce code size

A fundamental problem with the AVR back end is that it sabotages the RTL optimizer's attempts to optimize away redundant move insns. With this patch, I get better code (-O2):

foo:
	push r29	 ;  73	*pushhi/1	[length = 2]
	push r28
	rcall .	 ;  79	*addhi3_sp_R_pc2	[length = 3]
	rcall .
	rcall .
	in r28,__SP_L__	 ;  90	*movqi/6	[length = 1]
	in r29,__SP_H__	 ;  91	*movqi/6	[length = 1]
/* prologue: function */
/* frame size = 6 */
	ldi r24,lo8(2)	 ;  68	*movqi/2	[length = 1]
	std Y+5,r24	 ;  70	*movqi/3	[length = 1]
	std Y+6,__zero_reg__	 ;  71	*movqi/3	[length = 1]
	ldi r22,lo8(2)	 ;  21	*movqi/2	[length = 1]
	ldi r23,lo8(0)	 ;  23	*movqi/2	[length = 1]
	ldi r18,lo8(1)	 ;  42	*movqi/2	[length = 1]
	ldi r19,lo8(0)	 ;  43	*movqi/2	[length = 1]
	ldi r20,lo8(0)	 ;  44	*movqi/2	[length = 1]
	ldi r21,lo8(0)	 ;  45	*movqi/2	[length = 1]
	ldi r24,lo8(0)	 ;  48	*movqi/2	[length = 1]
	ldi r25,lo8(0)	 ;  49	*movqi/2	[length = 1]
/* epilogue start */
	adiw r28,6	 ;  85	*addhi3/2	[length = 1]
	out __SP_L__,r28	 ;  92	*movqi/5	[length = 1]
	out __SP_H__,r29	 ;  93	*movqi/5	[length = 1]
	pop r28	 ;  87	pophi	[length = 2]
	pop r29
	ret	 ;  88	return_from_epilogue	[length = 1]
	.size	foo, .-foo
/* File "/home/rask/pr11180.c": code    0 = 0x0000 (   0), prologues   0, epilogues   0 */

Perhaps a peephole optimization could replace insns 48 and 49 with "movw r25:r24, r21:r20".
The only big further improvement I see would be to get rid of the stack var.
Comment 17 Rask Ingemann Lambertsen 2007-09-16 13:54:31 UTC
The patch is against mainline revision 128431.
Comment 18 Eric Weddington 2007-09-16 15:44:08 UTC
Subject: RE:  [avr-gcc] Optimization decrease performance of
 struct assignment.



> ------- Comment #15 from rask at gcc dot gnu dot org
> 2007-09-16 12:57 -------

> Also, it seems to me that avr.h defines MOVE_MAX incorrectly.
> Shouldn't it be 2
> instead of 4?

Hmm. Yes and no. There are variants that have the MOVW instruction, and for
those it should be 2. Otherwise, if they don't have the MOVW instruction, it
should be 1, I think.


Comment 19 Rask Ingemann Lambertsen 2007-09-16 17:30:44 UTC
Created attachment 14213 [details]
quick and dirty patch to reduce code size

Here's a patch which doesn't mess up the stack pointer update in the epilogue.

The code size output is broken on mainline, so here's what I use:
$ gcc/xgcc -Bgcc/ -W -Wall ~/pr11180.c -S -dp -o /dev/stdout -O2 | awk 'match($0, /\[length = ([[:digit:]]+)\]/, field) { sum += field[1]; } END { print sum; }'
27

Compare that to unpatched mainline:
$ gcc/xgcc -Bgcc/ -W -Wall ~/pr11180.c -S -dp -o /dev/stdout -O2 | awk 'match($0, /\[length = ([[:digit:]]+)\]/, field) { sum += field[1]; } END { print sum; }'
36
Comment 20 Eric Weddington 2007-09-17 02:31:54 UTC
Subject: RE:  [avr-gcc] Optimization decrease performance of
 struct assignment.



> Here's a patch which doesn't mess up the stack pointer update
> in the epilogue.
>

Hi Rask,

Your patch causes a regression. Sort of.

I have a small patch that enables Objective-C for the AVR (not that anyone
would or should use it), that hasn't been committed yet:

--- gcc/config/avr/avr.h.old	2007-08-23 15:18:31.015625000 -0600
+++ gcc/config/avr/avr.h	2007-08-23 15:19:17.687500000 -0600
@@ -53,7 +53,7 @@ extern int avr_mega_p;
 extern int avr_have_mul_p;
 extern int avr_asm_only_p;
 extern int avr_have_movw_lpmx_p;
-#ifndef IN_LIBGCC2
+#if !defined(IN_LIBGCC2) && !defined(IN_TARGET_LIBS)
 extern GTY(()) section *progmem_section;
 #endif

Well, your patch causes a new error in configuring libobjc (when using the
patch above):

checking for thread model used by GCC... single
checking for exception model to use... configure: error: unable to detect
exception model
make[1]: *** [configure-target-libobjc] Error 1
make[1]: Leaving directory `/c/avrdev/gcc/build'
make: *** [all] Error 2

Normally, configure detects the exception model as sjlj.

The portion of your patch that changes MOVE_MAX in avr.h is fine. The
problem seems to be something with your changes in avr.md.

Why this is happening, I don't really know. I'm certainly more concerned
with optimizing the C compiler than I am in having ObjC build.

Thanks,
Eric Weddington


Comment 21 Rask Ingemann Lambertsen 2007-09-17 11:13:48 UTC
It's probably someting simple, see config.log. Like I said, the patch is a quick and dirty one and the AVR back end can use more work than that, most of which means deleting patterns. Examples: All and, ior, xor, one_cmpl and sign extend patterns larger than qimode, all zero_extend patterns, all movsi, movdi and such. Along with that all the output_xxx functions in avr.c that they use.
Comment 22 Eric Weddington 2007-09-17 22:53:45 UTC
Subject: RE:  [avr-gcc] Optimization decrease performance of
 struct assignment.



> ------- Comment #21 from rask at gcc dot gnu dot org
> 2007-09-17 11:13 -------
> It's probably someting simple, see config.log.

Here it is:

configure:10398: error: unrecognizable insn:
(insn 105 104 106 2 (set (subreg:QI (reg/f:HI 52) 0)
        (subreg:QI (label_ref:HI 57) 0)) -1 (nil))
configure:10398: internal compiler error: in extract_insn, at recog.c:1990


Comment 23 Rask Ingemann Lambertsen 2007-09-18 07:14:02 UTC
> configure:10398: error: unrecognizable insn:
> (insn 105 104 106 2 (set (subreg:QI (reg/f:HI 52) 0)
>         (subreg:QI (label_ref:HI 57) 0)) -1 (nil))
> configure:10398: internal compiler error: in extract_insn, at recog.c:1990

In define_insn_and_split "*movhi", add the line

   && LABEL_REF != GET_CODE (operands[1])

where it already says

   && SYMBOL_REF != GET_CODE (operands[1])

and the error should go away.
Comment 24 Eric Weddington 2007-09-18 19:06:04 UTC
Subject: RE:  [avr-gcc] Optimization decrease performance of
 struct assignment.



> ------- Comment #23 from rask at gcc dot gnu dot org
>
> In define_insn_and_split "*movhi", add the line
>
>    && LABEL_REF != GET_CODE (operands[1])
>
> where it already says
>
>    && SYMBOL_REF != GET_CODE (operands[1])
>
> and the error should go away.

I *added* the line (not replace) as you suggested, and now it generates a
new error during build:

c:/avrdev/gcc/gcc-4.3-20070914/libobjc/Object.m:66: error: unrecognizable
insn:
(insn 54 4 55 2 c:/avrdev/gcc/gcc-4.3-20070914/libobjc/Object.m:65 (set
(reg:QI 22 r22 [ D.2345 ])
        (subreg:QI (const:HI (plus:HI (symbol_ref:HI
("_OBJC_SELECTOR_TABLE") [flags 0x2] <var_decl 013C4A20
_OBJC_SELECTOR_TABLE>)
                    (const_int 8 [0x8]))) 0)) -1 (nil))


Comment 25 Rask Ingemann Lambertsen 2007-09-18 19:49:14 UTC
> c:/avrdev/gcc/gcc-4.3-20070914/libobjc/Object.m:66: error: unrecognizable
> insn:
> (insn 54 4 55 2 c:/avrdev/gcc/gcc-4.3-20070914/libobjc/Object.m:65 (set
> (reg:QI 22 r22 [ D.2345 ])
>         (subreg:QI (const:HI (plus:HI (symbol_ref:HI
> ("_OBJC_SELECTOR_TABLE") [flags 0x2] <var_decl 013C4A20
> _OBJC_SELECTOR_TABLE>)
>                     (const_int 8 [0x8]))) 0)) -1 (nil))

That's a similar story: Add a line with
&& CONST != GET_CODE (operands[1])
Comment 26 Eric Weddington 2007-09-19 05:28:36 UTC
Subject: RE:  [avr-gcc] Optimization decrease performance of
 struct assignment.



> > c:/avrdev/gcc/gcc-4.3-20070914/libobjc/Object.m:66: error:
> unrecognizable
> > insn:
> > (insn 54 4 55 2
> c:/avrdev/gcc/gcc-4.3-20070914/libobjc/Object.m:65 (set
> > (reg:QI 22 r22 [ D.2345 ])
> >         (subreg:QI (const:HI (plus:HI (symbol_ref:HI
> > ("_OBJC_SELECTOR_TABLE") [flags 0x2] <var_decl 013C4A20
> > _OBJC_SELECTOR_TABLE>)
> >                     (const_int 8 [0x8]))) 0)) -1 (nil))
>
> That's a similar story: Add a line with
> && CONST != GET_CODE (operands[1])

That finally did the trick! At least it builds now...


Comment 27 Eric Weddington 2007-09-19 17:57:55 UTC
Created attachment 14224 [details]
Rask's patch modified from comments.

Here is Rask's patch again, but slightly modified from the comments.
Comment 28 Andy Hutchinson 2008-02-02 15:44:46 UTC
The patch and suggestions on this are valid. However, memory moves  - particular with base pointers, may require additional instruction to be added to reach required displacments. Splitting such moves may well incur the overhead per BYTE!

So we need to get the pointers sorted so that (for example) 4 separate QI bytes is just as good as 1 access to 4 SI bytes.

As for other pattern removal YES!
 
The reason to change MOVE_MAX is not made clear. I understood this to control the threshold for using movmem rather than inline RTL. movmem wins (in size) at about 8+ bytes. Does it have another use related to this problem?
Comment 29 Eric Weddington 2008-02-19 02:47:03 UTC
Rask's patch (gcc-4.3-bug-11180-experimental.patch) causes worse code for the test case in bug #32871, than without the patch.
Comment 30 Andrew Pinski 2023-06-21 03:54:04 UTC
Fixed for GCC 5 (maybe earlier):

        ldi r18,lo8(1)
        ldi r22,lo8(2)
        ldi r19,0
        ldi r20,0
        ldi r21,0
        ldi r23,0
        ldi r24,0
        ldi r25,0