Any optimization level increase size and decrease speed for next program: typedef struct { long lo; int in; } lo_in; lo_in foo (void) { lo_in x; x.lo = 1; x.in = 2; return x; } "avr-gcc -W -Wall -S -O0" produced: /* File "lo_in.c": code 75 = 0x004b ( 40), prologues 18, epilogues 17 */ "avr-gcc -W -Wall -S -Os" produced: /* File "lo_in.c": code 80 = 0x0050 ( 29), prologues 26, epilogues 25 */ "avr-gcc -W -Wall -S -O3" produced: /* File "lo_in.c": code 79 = 0x004f ( 28), prologues 26, epilogues 25 */ Compiler: avr-gcc (GCC) 3.3.1 20030519 (prerelease)
I can confirm this on the mainline (20030706).
On i686-pc-linux-gnu I get identical asm for -O2 and -O3 and also for -O1 and -Os. The only difference between -O[23] and -O[1s] is: < .p2align 2,,3 .file "11180.c" .text .p2align 2,,3 .globl foo .type foo, @function foo: pushl %ebp movl %esp, %ebp movl 8(%ebp), %eax movl $1, (%eax) movl $2, 4(%eax) leave ret $4 .size foo, .-foo .section .note.GNU-stack,"",@progbits .ident "GCC: (GNU) 3.4 20030710 (experimental)" So -Os is optimal AFAICT (obviously it gets even smaller with -fomit-frame-pointer). Apparently this is something target specific. What target is avr-gcc? Gr. Steven
I was not the reporter but I was the one who confirmed it; avr-elf is the target, I will post the asm (or rtl) if you want.
This is a target problem of not showing how moves are done.
One issue I find is that avr does not define_insn_and_split (or just define_split) which will greatly improve the code generation because the hi part of the QI would be zero and you don't need to set it four times. ldi r25,hi8(2) ldi r25,hi8(1) ldi r26,hlo8(1) ldi r27,hhi8(1) The main difference between -Os and -O0 (at least before 4.0.0) is that we use more registers at -Os which causes the prologue and eplogue to do more so my recommendation will help a lot.
The problem here is that gcc is using a DImode register to handle 6 byte (int+long) structure. Why I have no idea! Since the target has no insn for DI move, gcc turns this into individual QImode byte moves (subregs all over the place!). The 'stacked' 6 byte structure is 'popped' into DI register (6 bytes ). Two other byte registers are explicitely cleared (making our 8 byte DI register) What then follows is a large amount of shuffling. i.e. Moving from intermediate virtual DI register (8 bytes) into correct place for a 6 byte return. Which seems to surpass the abilities of the register allocator (DI and return registers overlap). Smaller structures (<=4 bytes) are optimally handled. Larger structure >8 are also much better since they are returned in memory. So in summary, it would appear that the root cause is allocation of a DI mode register for structures >4 and <=8 bytes. A secondary factor is the use of QImode moves (when SI,HImode are available and more efficient) The problem can be partially alleviated by defining DImode moves (that a hell of a change though). Poor code still remains - for example clearing unused padding bytes and extra register usage. PS -fpack-struct does not change this bug.
(In reply to comment #6) > The problem here is that gcc is using a DImode register to handle 6 byte > (int+long) structure. Why I have no idea! This is so it does not store it on the stack. As I said in comment #5, this is a target issue and have nothing to do with DImode.
(In reply to comment #7) > (In reply to comment #6) > > The problem here is that gcc is using a DImode register to handle 6 byte > > (int+long) structure. Why I have no idea! > This is so it does not store it on the stack. As I said in comment #5, this is a target issue and have > nothing to do with DImode. It would seem more desireable given the intended purpose to avoid pushing it on the stack so that it's elements may be more effeciencly accessed, that it's coresponding elements be allocated within the register file (as opposed to the whole struct remaining packed into an alllocated DI mode integer), so that it's elements may be effeciently accessed without needing to rip them out or reassemble them into the otherwise packed monolithic structure?
Version 4.2.1 offers somewhat better results: With -O0: .file "test.c" /* File "test.c": code 109 = 0x006d ( 74), prologues 18, epilogues 17 */ With -O[123s]: .file "test.c" /* File "test.c": code 79 = 0x004f ( 28), prologues 26, epilogues 25 */ At least the code no longer increases in size when optimization is turned on. Dmitry, are the results for 4.2.1 acceptable for this bug report?
Yes, results are: avr-gcc-3.3.6: O0 --> 75, O1,O2,O3,Os --> 79 avr-gcc-4.2.1: O0 --> 109, O1,O2,O3,Os --> 79 The mistake is corrected? It is possible to tell and so as now application of keys of optimization shortens a code. However, "correction" does not improve quality of optimization, it only worsens not optimized code.
Subject: RE: [avr-gcc] Optimization decrease performance of struct assignment. > ------- Comment #10 from dmixm at marine dot febras dot ru > 2007-07-27 01:24 ------- > Yes, results are: > > avr-gcc-3.3.6: O0 --> 75, O1,O2,O3,Os --> 79 > avr-gcc-4.2.1: O0 --> 109, O1,O2,O3,Os --> 79 > > The mistake is corrected? It is possible to tell and so as now > application of keys of optimization shortens a code. However, > "correction" does not improve quality of optimization, it only worsens > not optimized code. Agreed. But is this sufficient to close this bug report? Eric
Andy Hutchinson wrote (comment #6) that addition a 'movdi' instruction improves the result. I have try to add a very simple 'movdi' (which split into 2 SImode instuctions). In result: -O0 --> 85 words, -O1,2,3,s --> 50 words. Version is 4.2.1
(In reply to comment #12) > Andy Hutchinson wrote (comment #6) that addition a 'movdi' instruction improves > the result. I have try to add a very simple 'movdi' (which split into 2 SImode > instuctions). In result: > -O0 --> 85 words, > -O1,2,3,s --> 50 words. > Version is 4.2.1 > That's great! Dmitry, could you attach your patch?
It was: (define_insn "movdi" [(set (match_operand:DI 0 "nonimmediate_operand" "") (match_operand:DI 1 "general_operand" ""))] "" "#") (define_split [(set (match_operand:DI 0 "nonimmediate_operand" "") (match_operand:DI 1 "general_operand" ""))] "reload_completed" [(set (match_dup 2) (match_dup 4)) (set (match_dup 3) (match_dup 5))] "{ operands[2] = simplify_gen_subreg (SImode, operands[0], DImode, 0); operands[3] = simplify_gen_subreg (SImode, operands[0], DImode, 4); operands[4] = simplify_gen_subreg (SImode, operands[1], DImode, 0); operands[5] = simplify_gen_subreg (SImode, operands[1], DImode, 4); }") Alas, this elementary addition has appeared erroneous. The following program leads to emergency end of compilation: int main () { volatile long long x = 0x0102030405060708LL; if (x != 0x0102030405060708LL) exit (__LINE__); exit (0); }
1. You should use define_insn_and_split. 2. If possible (which I think it is here), splitting before reload should produc.e better code. Btw, what is the ICE? Also, it seems to me that avr.h defines MOVE_MAX incorrectly. Shouldn't it be 2 instead of 4?
Created attachment 14211 [details] quick and dirty patch to reduce code size A fundamental problem with the AVR back end is that it sabotages the RTL optimizer's attempts to optimize away redundant move insns. With this patch, I get better code (-O2): foo: push r29 ; 73 *pushhi/1 [length = 2] push r28 rcall . ; 79 *addhi3_sp_R_pc2 [length = 3] rcall . rcall . in r28,__SP_L__ ; 90 *movqi/6 [length = 1] in r29,__SP_H__ ; 91 *movqi/6 [length = 1] /* prologue: function */ /* frame size = 6 */ ldi r24,lo8(2) ; 68 *movqi/2 [length = 1] std Y+5,r24 ; 70 *movqi/3 [length = 1] std Y+6,__zero_reg__ ; 71 *movqi/3 [length = 1] ldi r22,lo8(2) ; 21 *movqi/2 [length = 1] ldi r23,lo8(0) ; 23 *movqi/2 [length = 1] ldi r18,lo8(1) ; 42 *movqi/2 [length = 1] ldi r19,lo8(0) ; 43 *movqi/2 [length = 1] ldi r20,lo8(0) ; 44 *movqi/2 [length = 1] ldi r21,lo8(0) ; 45 *movqi/2 [length = 1] ldi r24,lo8(0) ; 48 *movqi/2 [length = 1] ldi r25,lo8(0) ; 49 *movqi/2 [length = 1] /* epilogue start */ adiw r28,6 ; 85 *addhi3/2 [length = 1] out __SP_L__,r28 ; 92 *movqi/5 [length = 1] out __SP_H__,r29 ; 93 *movqi/5 [length = 1] pop r28 ; 87 pophi [length = 2] pop r29 ret ; 88 return_from_epilogue [length = 1] .size foo, .-foo /* File "/home/rask/pr11180.c": code 0 = 0x0000 ( 0), prologues 0, epilogues 0 */ Perhaps a peephole optimization could replace insns 48 and 49 with "movw r25:r24, r21:r20". The only big further improvement I see would be to get rid of the stack var.
The patch is against mainline revision 128431.
Subject: RE: [avr-gcc] Optimization decrease performance of struct assignment. > ------- Comment #15 from rask at gcc dot gnu dot org > 2007-09-16 12:57 ------- > Also, it seems to me that avr.h defines MOVE_MAX incorrectly. > Shouldn't it be 2 > instead of 4? Hmm. Yes and no. There are variants that have the MOVW instruction, and for those it should be 2. Otherwise, if they don't have the MOVW instruction, it should be 1, I think.
Created attachment 14213 [details] quick and dirty patch to reduce code size Here's a patch which doesn't mess up the stack pointer update in the epilogue. The code size output is broken on mainline, so here's what I use: $ gcc/xgcc -Bgcc/ -W -Wall ~/pr11180.c -S -dp -o /dev/stdout -O2 | awk 'match($0, /\[length = ([[:digit:]]+)\]/, field) { sum += field[1]; } END { print sum; }' 27 Compare that to unpatched mainline: $ gcc/xgcc -Bgcc/ -W -Wall ~/pr11180.c -S -dp -o /dev/stdout -O2 | awk 'match($0, /\[length = ([[:digit:]]+)\]/, field) { sum += field[1]; } END { print sum; }' 36
Subject: RE: [avr-gcc] Optimization decrease performance of struct assignment. > Here's a patch which doesn't mess up the stack pointer update > in the epilogue. > Hi Rask, Your patch causes a regression. Sort of. I have a small patch that enables Objective-C for the AVR (not that anyone would or should use it), that hasn't been committed yet: --- gcc/config/avr/avr.h.old 2007-08-23 15:18:31.015625000 -0600 +++ gcc/config/avr/avr.h 2007-08-23 15:19:17.687500000 -0600 @@ -53,7 +53,7 @@ extern int avr_mega_p; extern int avr_have_mul_p; extern int avr_asm_only_p; extern int avr_have_movw_lpmx_p; -#ifndef IN_LIBGCC2 +#if !defined(IN_LIBGCC2) && !defined(IN_TARGET_LIBS) extern GTY(()) section *progmem_section; #endif Well, your patch causes a new error in configuring libobjc (when using the patch above): checking for thread model used by GCC... single checking for exception model to use... configure: error: unable to detect exception model make[1]: *** [configure-target-libobjc] Error 1 make[1]: Leaving directory `/c/avrdev/gcc/build' make: *** [all] Error 2 Normally, configure detects the exception model as sjlj. The portion of your patch that changes MOVE_MAX in avr.h is fine. The problem seems to be something with your changes in avr.md. Why this is happening, I don't really know. I'm certainly more concerned with optimizing the C compiler than I am in having ObjC build. Thanks, Eric Weddington
It's probably someting simple, see config.log. Like I said, the patch is a quick and dirty one and the AVR back end can use more work than that, most of which means deleting patterns. Examples: All and, ior, xor, one_cmpl and sign extend patterns larger than qimode, all zero_extend patterns, all movsi, movdi and such. Along with that all the output_xxx functions in avr.c that they use.
Subject: RE: [avr-gcc] Optimization decrease performance of struct assignment. > ------- Comment #21 from rask at gcc dot gnu dot org > 2007-09-17 11:13 ------- > It's probably someting simple, see config.log. Here it is: configure:10398: error: unrecognizable insn: (insn 105 104 106 2 (set (subreg:QI (reg/f:HI 52) 0) (subreg:QI (label_ref:HI 57) 0)) -1 (nil)) configure:10398: internal compiler error: in extract_insn, at recog.c:1990
> configure:10398: error: unrecognizable insn: > (insn 105 104 106 2 (set (subreg:QI (reg/f:HI 52) 0) > (subreg:QI (label_ref:HI 57) 0)) -1 (nil)) > configure:10398: internal compiler error: in extract_insn, at recog.c:1990 In define_insn_and_split "*movhi", add the line && LABEL_REF != GET_CODE (operands[1]) where it already says && SYMBOL_REF != GET_CODE (operands[1]) and the error should go away.
Subject: RE: [avr-gcc] Optimization decrease performance of struct assignment. > ------- Comment #23 from rask at gcc dot gnu dot org > > In define_insn_and_split "*movhi", add the line > > && LABEL_REF != GET_CODE (operands[1]) > > where it already says > > && SYMBOL_REF != GET_CODE (operands[1]) > > and the error should go away. I *added* the line (not replace) as you suggested, and now it generates a new error during build: c:/avrdev/gcc/gcc-4.3-20070914/libobjc/Object.m:66: error: unrecognizable insn: (insn 54 4 55 2 c:/avrdev/gcc/gcc-4.3-20070914/libobjc/Object.m:65 (set (reg:QI 22 r22 [ D.2345 ]) (subreg:QI (const:HI (plus:HI (symbol_ref:HI ("_OBJC_SELECTOR_TABLE") [flags 0x2] <var_decl 013C4A20 _OBJC_SELECTOR_TABLE>) (const_int 8 [0x8]))) 0)) -1 (nil))
> c:/avrdev/gcc/gcc-4.3-20070914/libobjc/Object.m:66: error: unrecognizable > insn: > (insn 54 4 55 2 c:/avrdev/gcc/gcc-4.3-20070914/libobjc/Object.m:65 (set > (reg:QI 22 r22 [ D.2345 ]) > (subreg:QI (const:HI (plus:HI (symbol_ref:HI > ("_OBJC_SELECTOR_TABLE") [flags 0x2] <var_decl 013C4A20 > _OBJC_SELECTOR_TABLE>) > (const_int 8 [0x8]))) 0)) -1 (nil)) That's a similar story: Add a line with && CONST != GET_CODE (operands[1])
Subject: RE: [avr-gcc] Optimization decrease performance of struct assignment. > > c:/avrdev/gcc/gcc-4.3-20070914/libobjc/Object.m:66: error: > unrecognizable > > insn: > > (insn 54 4 55 2 > c:/avrdev/gcc/gcc-4.3-20070914/libobjc/Object.m:65 (set > > (reg:QI 22 r22 [ D.2345 ]) > > (subreg:QI (const:HI (plus:HI (symbol_ref:HI > > ("_OBJC_SELECTOR_TABLE") [flags 0x2] <var_decl 013C4A20 > > _OBJC_SELECTOR_TABLE>) > > (const_int 8 [0x8]))) 0)) -1 (nil)) > > That's a similar story: Add a line with > && CONST != GET_CODE (operands[1]) That finally did the trick! At least it builds now...
Created attachment 14224 [details] Rask's patch modified from comments. Here is Rask's patch again, but slightly modified from the comments.
The patch and suggestions on this are valid. However, memory moves - particular with base pointers, may require additional instruction to be added to reach required displacments. Splitting such moves may well incur the overhead per BYTE! So we need to get the pointers sorted so that (for example) 4 separate QI bytes is just as good as 1 access to 4 SI bytes. As for other pattern removal YES! The reason to change MOVE_MAX is not made clear. I understood this to control the threshold for using movmem rather than inline RTL. movmem wins (in size) at about 8+ bytes. Does it have another use related to this problem?
Rask's patch (gcc-4.3-bug-11180-experimental.patch) causes worse code for the test case in bug #32871, than without the patch.
Fixed for GCC 5 (maybe earlier): ldi r18,lo8(1) ldi r22,lo8(2) ldi r19,0 ldi r20,0 ldi r21,0 ldi r23,0 ldi r24,0 ldi r25,0