See http://gcc.opensuse.org/SPEC/CINT/sb-balakirew-head-64-2006/recent.html First bad rev. is 154713, last good is 154686. It's quite obvious that either the regrename.c changes have code generation differences or that removing the vec_interleave_* expanders caused this regression. Richard - you removed all vec_interleave_* expanders but the vectorizer still checks for optab_for_tree_code (VEC_INTERLEAVE_*_EXPR) which ends up looking at vec_interleave_*_optab. Either this code is all dead now or you caused the vectorization no longer to apply. I'll check if reverting your revision gets back performance.
The vec_interleave_*_optab should still be populated. It's just that what was once "sse2_punpcklwd" is now "vec_interleave_lowv8hi" directly. If this patch *is* attributable to a regression, then perhaps there's a typo somewhere in the substitutions...
Subject: Re: [4.5 Regression] 464.h265ref peak regressed 20% On Sun, 29 Nov 2009, rth at gcc dot gnu dot org wrote: > ------- Comment #1 from rth at gcc dot gnu dot org 2009-11-29 17:58 ------- > The vec_interleave_*_optab should still be populated. It's just > that what was once "sse2_punpcklwd" is now "vec_interleave_lowv8hi" > directly. If this patch *is* attributable to a regression, then > perhaps there's a typo somewhere in the substitutions... Indeed it wasn't your patch. Checking with Bernds patch reverted now. Richard.
This may be related to revision 154688, which has caused PR 42202.
It was indeed Bernds patch. Bernd - the patch discription suggested it was a compile-time performance fix and not going to change code? There is quite big effect (including this negative one) on SPEC CPU 2006 on x86_64 ...
Code generation changes are expected for two reasons - the code became less conservative when determining conflicts with other registers, so we can usually rename in more cases. There are also a few cases where we have to give up on renaming due to multiword hard register overlaps we can't track. Unfortunately I don't have access to spec; I'll have to think about how to track this down.
Subject: Re: [4.5 Regression] rev 15458[78] regress 464.h264ref peak 20% On Tue, 1 Dec 2009, bernds_cb1 at t-online dot de wrote: > ------- Comment #5 from bernds_cb1 at t-online dot de 2009-12-01 15:26 ------- > Code generation changes are expected for two reasons - the code became less > conservative when determining conflicts with other registers, so we can usually > rename in more cases. There are also a few cases where we have to give up on > renaming due to multiword hard register overlaps we can't track. > > Unfortunately I don't have access to spec; I'll have to think about how to > track this down. I will try to come up with a testcase for you. I suppose it might be a backend issue as regrename should mainly effect the 2nd scheduling pass, right? Richard.
Just reverting rev. 154688 and using the training set gets us from 464.h264ref -- 228 -- S to 464.h264ref -- 170 -- S at -O3 -ffast-math -funroll-loops -march-native (-march=k8-sse3 -msahf --param l1-cache-size=64 --param l1-cache-line-size=64 --param l2-cache-size=512 -mtune=k8). After the patch the oprofile looks like CPU: AMD64 processors, speed 2000 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 2241755 51.1739 SetupFastFullPelSearch 442008 10.0900 BlockMotionSearch 271429 6.1961 SubPelBlockMotionSearch 269337 6.1483 FastPelY_14 170961 3.9026 UMVLine16Y_11 159311 3.6367 SetupLargerBlocks 155556 3.5510 SATD 127230 2.9044 FastLine16Y_11 72328 1.6511 dct_luma 69728 1.5917 getNonAffNeighbour All but the *Y_1[14] functions are inside mv-search.c, just re-compiling that file is enough to reproduce the issue. after reverting it CPU: AMD64 processors, speed 2000 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 1223269 36.8289 SetupFastFullPelSearch 450306 13.5573 BlockMotionSearch 320395 9.6461 SubPelBlockMotionSearch 245175 7.3815 FastPelY_14 160694 4.8380 SetupLargerBlocks 155353 4.6772 SATD 153292 4.6152 UMVLine16Y_11 79531 2.3944 FastLine16Y_11 72455 2.1814 dct_luma 70733 2.1296 getNonAffNeighbour 42828 1.2894 UMVPelY_14 34396 1.0356 UnifiedOneForthPix
The hot loop is mv-search.c:SetupFastFullPelSearch for (pos = 0; pos < max_pos; pos++) { abs_y = offset_y + spiral_search_y[pos]; abs_x = offset_x + spiral_search_x[pos]; if (range_partly_outside) { if (abs_y >= 0 && abs_y <= max_height && abs_x >= 0 && abs_x <= max_width ) { PelYline_11 = FastLine16Y_11; } else { PelYline_11 = UMVLine16Y_11; } } orgptr = orig_blocks; bindex = 0; for (blky = 0; blky < 4; blky++) { LineSadBlk0 = LineSadBlk1 = LineSadBlk2 = LineSadBlk3 = 0; for (y = 0; y < 4; y++) { refptr = PelYline_11 (ref_pic, abs_y++, abs_x, img_height, img_width); LineSadBlk0 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk0 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk0 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk0 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk1 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk1 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk1 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk1 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk2 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk2 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk2 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk2 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk3 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk3 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk3 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk3 += byte_abs [*refptr++ - *orgptr++]; } block_sad[bindex++][pos] = LineSadBlk0; block_sad[bindex++][pos] = LineSadBlk1; block_sad[bindex++][pos] = LineSadBlk2; block_sad[bindex++][pos] = LineSadBlk3; } } good assembly of the innermost loop: .L1422: leal 1(%rsi), %r9d movl 64(%rsp), %r8d movl 68(%rsp), %ecx movl 52(%rsp), %edx movq 72(%rsp), %rdi movl %r9d, 32(%rsp) call *%rax movzwl (%rbx), %ecx movzwl (%rax), %edx movzwl 2(%rbx), %r10d movzwl 2(%rax), %r11d movq byte_abs(%rip), %r9 movzwl 4(%rbx), %esi subl %ecx, %edx movzwl 4(%rax), %ecx movslq %edx, %r8 subl %r10d, %r11d movzwl 6(%rax), %r10d addl (%r9,%r8,4), %r14d movzwl 6(%rbx), %r8d movslq %r11d, %rdi addl (%r9,%rdi,4), %r14d movzwl 8(%rbx), %edi subl %esi, %ecx movslq %ecx, %rdx movzwl 8(%rax), %ecx subl %r8d, %r10d addl (%r9,%rdx,4), %r14d movzwl 10(%rax), %r8d movzwl 10(%rbx), %edx movslq %r10d, %r11 addl (%r9,%r11,4), %r14d movzwl 12(%rbx), %r11d subl %edi, %ecx movzwl 12(%rax), %edi movslq %ecx, %rsi subl %edx, %r8d addl (%r9,%rsi,4), %r13d movzwl 14(%rax), %edx movzwl 14(%rbx), %esi movslq %r8d, %r10 subl %r11d, %edi addl (%r9,%r10,4), %r13d movzwl 16(%rax), %r11d movzwl 16(%rbx), %r10d movslq %edi, %rcx addl (%r9,%rcx,4), %r13d movzwl 18(%rbx), %ecx subl %esi, %edx movslq %edx, %r8 movzwl 18(%rax), %edx subl %r10d, %r11d addl (%r9,%r8,4), %r13d movzwl 20(%rax), %r10d movzwl 20(%rbx), %r8d movslq %r11d, %rdi addl (%r9,%rdi,4), %ebp movzwl 22(%rbx), %edi subl %ecx, %edx movzwl 22(%rax), %ecx movslq %edx, %rsi subl %r8d, %r10d addl (%r9,%rsi,4), %ebp movzwl 24(%rax), %r8d movzwl 24(%rbx), %esi movslq %r10d, %r11 subl %edi, %ecx addl (%r9,%r11,4), %ebp movzwl 26(%rax), %edi movslq %ecx, %rdx movzwl 26(%rbx), %r11d addl (%r9,%rdx,4), %ebp movzwl 28(%rbx), %edx subl %esi, %r8d movslq %r8d, %r10 movzwl 28(%rax), %r8d movzwl 30(%rax), %eax addl (%r9,%r10,4), %r12d movzwl 30(%rbx), %r10d subl %r11d, %edi movslq %edi, %rcx addl (%r9,%rcx,4), %r12d subl %edx, %r8d subl %r10d, %eax movslq %r8d, %rsi addl (%r9,%rsi,4), %r12d cltq addl (%r9,%rax,4), %r12d addq $32, %rbx addq $32, %r15 cmpq $128, %r15 je .L1600 movq PelYline_11(%rip), %rax movl 32(%rsp), %esi jmp .L1422 bad assembly: .L1422: leal 1(%rsi), %edx movl 68(%rsp), %ecx movl 64(%rsp), %r8d movq 72(%rsp), %rdi movl %edx, 32(%rsp) movl 52(%rsp), %edx call *%rax movzwl (%rbx), %esi movzwl (%rax), %ecx movq byte_abs(%rip), %rdx subl %esi, %ecx movzwl 2(%rbx), %esi movslq %ecx, %rcx addl (%rdx,%rcx,4), %r14d movzwl 2(%rax), %ecx subl %esi, %ecx movzwl 4(%rbx), %esi movslq %ecx, %rcx addl (%rdx,%rcx,4), %r14d movzwl 4(%rax), %ecx subl %esi, %ecx movzwl 6(%rbx), %esi movslq %ecx, %rcx addl (%rdx,%rcx,4), %r14d movzwl 6(%rax), %ecx subl %esi, %ecx movzwl 8(%rbx), %esi movslq %ecx, %rcx addl (%rdx,%rcx,4), %r14d movzwl 8(%rax), %ecx subl %esi, %ecx movzwl 10(%rbx), %esi movslq %ecx, %rcx addl (%rdx,%rcx,4), %r13d movzwl 10(%rax), %ecx subl %esi, %ecx movzwl 12(%rbx), %esi movslq %ecx, %rcx addl (%rdx,%rcx,4), %r13d movzwl 12(%rax), %ecx subl %esi, %ecx movzwl 14(%rbx), %esi movslq %ecx, %rcx addl (%rdx,%rcx,4), %r13d movzwl 14(%rax), %ecx subl %esi, %ecx movzwl 16(%rbx), %esi movslq %ecx, %rcx addl (%rdx,%rcx,4), %r13d movzwl 16(%rax), %ecx subl %esi, %ecx movzwl 18(%rbx), %esi movslq %ecx, %rcx addl (%rdx,%rcx,4), %ebp movzwl 18(%rax), %ecx subl %esi, %ecx movzwl 20(%rbx), %esi movslq %ecx, %rcx addl (%rdx,%rcx,4), %ebp movzwl 20(%rax), %ecx subl %esi, %ecx movzwl 22(%rbx), %esi movslq %ecx, %rcx addl (%rdx,%rcx,4), %ebp movzwl 22(%rax), %ecx subl %esi, %ecx movzwl 24(%rbx), %esi movslq %ecx, %rcx addl (%rdx,%rcx,4), %ebp movzwl 24(%rax), %ecx subl %esi, %ecx movzwl 26(%rbx), %esi movslq %ecx, %rcx addl (%rdx,%rcx,4), %r12d movzwl 26(%rax), %ecx subl %esi, %ecx movzwl 28(%rbx), %esi movslq %ecx, %rcx addl (%rdx,%rcx,4), %r12d movzwl 28(%rax), %ecx movzwl 30(%rax), %eax subl %esi, %ecx movslq %ecx, %rcx addl (%rdx,%rcx,4), %r12d movzwl 30(%rbx), %ecx subl %ecx, %eax cltq addl (%rdx,%rax,4), %r12d addq $32, %rbx addq $32, %r15 cmpq $128, %r15 je .L1600 movq PelYline_11(%rip), %rax movl 32(%rsp), %esi jmp .L1422 seems to be really only scheduling differences...
Created attachment 19202 [details] Test patch to attempt to narrow down the problem
Yes, regrename should only affect the second scheduling pass. I'm attaching a stab-in-the-dark patch. It contains a fix for the ia64 regressions (testing in progress), it adds a few warnings which could be useful to spot whether the testcase is producing the corner cases we refuse to handle, and it makes some changes to the register selection. The idea is to prefer registers which aren't used in the basic block yet, which should give the best freedom to the scheduler. I'd be glad if you could play with this and let me know the results.
The patch didn't change performance. For the relevant functions I see mv-search.c:452:1: warning: failing block due to bad pattern (the location seems to be always the end of the function though). The insns are most of the time (insn 333 1737 334 39 mv-search.c:542 (set (reg:DI 0 ax [orig:1365 block_c.pos_x ] [1365]) (sign_extend:DI (reg:SI 0 ax))) 126 {extendsidi2_rex64} (nil)) or similar. For the hot function with the regression I see: (insn 287 2237 289 9 mv-search.c:327 (set (reg:DI 0 ax [803]) (sign_extend:DI (reg:SI 0 ax [802]))) 126 {extendsidi2_rex64} (nil)) (insn 437 436 438 27 mv-search.c:305 (parallel [ (set (reg:DI 5 di [853]) (zero_extend:DI (plus:SI (reg:SI 5 di [orig:737 ivtmp.1996 ] [737]) (const_int -1 [0xffffffffffffffff])))) (clobber (reg:CC 17 flags)) ]) 253 {*addsi_1_zext} (expr_list:REG_UNUSED (reg:CC 17 flags) (nil))) (insn 663 662 664 55 mv-search.c:436 (set (reg:DI 0 ax [940]) (sign_extend:DI (reg:SI 0 ax [939]))) 126 {extendsidi2_rex64} (nil)) (insn 719 2224 721 62 mv-search.c:327 (set (reg:DI 0 ax [948]) (sign_extend:DI (reg:SI 0 ax [947]))) 126 {extendsidi2_rex64} (nil)) line 436 is inside the hot loop in question, I'll attach a dump of the BB in question we skip due to it. After the above insn only the following extra insns appear: (insn 664 663 665 55 mv-search.c:436 (parallel [ (set (reg/v:SI 41 r12 [orig:394 LineSadBlk3 ] [394]) (plus:SI (reg/v:SI 41 r12 [orig:394 LineSadBlk3 ] [394]) (mem:SI (plus:DI (mult:DI (reg:DI 0 ax [940]) (const_int 4 [0x4])) (reg/f:DI 1 dx [orig:479 byte_abs.393 ] [479])) [2 S4 A32]))) (clobber (reg:CC 17 flags)) ]) 251 {*addsi_1} (expr_list:REG_DEAD (reg/f:DI 1 dx [orig:479 byte_abs.393 ] [479]) (expr_list:REG_DEAD (reg:DI 0 ax [940]) (expr_list:REG_UNUSED (reg:CC 17 flags) (nil))))) (note 665 664 666 55 NOTE_INSN_DELETED) (insn 666 665 667 55 mv-search.c:305 (parallel [ (set (reg/v/f:DI 3 bx [orig:404 orgptr ] [404]) (plus:DI (reg/v/f:DI 3 bx [orig:404 orgptr ] [404]) (const_int 32 [0x20]))) (clobber (reg:CC 17 flags)) ]) 252 {*adddi_1} (expr_list:REG_UNUSED (reg:CC 17 flags) (nil))) (insn 667 666 669 55 mv-search.c:305 (parallel [ (set (reg:DI 44 r15 [orig:398 ivtmp.1894 ] [398]) (plus:DI (reg:DI 44 r15 [orig:398 ivtmp.1894 ] [398]) (const_int 32 [0x20]))) (clobber (reg:CC 17 flags)) ]) 252 {*adddi_1} (expr_list:REG_UNUSED (reg:CC 17 flags) (nil))) (insn 669 667 670 55 mv-search.c:417 (set (reg:CCZ 17 flags) (compare:CCZ (reg:DI 44 r15 [orig:398 ivtmp.1894 ] [398]) (const_int 128 [0x80]))) 7 {*cmpdi_1} (nil)) (jump_insn 670 669 671 55 mv-search.c:417 (set (pc) (if_then_else (ne (reg:CCZ 17 flags) (const_int 0 [0x0])) (label_ref 668) (pc))) 619 {*jcc_1} (expr_list:REG_DEAD (reg:CCZ 17 flags) (expr_list:REG_BR_PROB (const_int 7500 [0x1d4c]) (nil))) -> 668) ;; lr out 3 [bx] 6 [bp] 7 [sp] 41 [r12] 42 [r13] 43 [r14] 44 [r15] ;; live out 3 [bx] 6 [bp] 7 [sp] 41 [r12] 42 [r13] 43 [r14] 44 [r15] They all look like they'd be handled just fine.
Created attachment 19208 [details] basic-block that is rejected
Created attachment 19211 [details] Patch to make it less conservative when accepting matching constraints with different modes
Would you mind trying another patch, both for testing performance, and if that fails, for interesting warnings?
With the second patch there are no warnings when building 464.h264ref (I have reverted the first patch before installing the 2nd, it didn't apply cleanly otherwise). Performance is back to normal with the patch applied.
Thanks for testing. This means the regression is fixed by the patch? I'll do a full test run then.
Subject: Re: [4.5 Regression] changes in scheduling regress 464.h264ref 20% On Wed, 2 Dec 2009, bernds_cb1 at t-online dot de wrote: > ------- Comment #16 from bernds_cb1 at t-online dot de 2009-12-02 17:06 ------- > Thanks for testing. This means the regression is fixed by the patch? I'll do > a full test run then. Yes, the regression is fixed by the second patch. Richard.
Unfortunately it causes failures. Tracking these mismatches really is quite tricky. I'll try to fix it.
Subject: Re: [4.5 Regression] changes in scheduling regress 464.h264ref 20% On Fri, 4 Dec 2009, bernds_cb1 at t-online dot de wrote: > ------- Comment #18 from bernds_cb1 at t-online dot de 2009-12-04 14:26 ------- > Unfortunately it causes failures. Tracking these mismatches really is quite > tricky. > > I'll try to fix it. Thanks. Richard.
Subject: Bug 42216 Author: bernds Date: Thu Dec 10 18:03:05 2009 New Revision: 155134 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=155134 Log: PR rtl-opt/42216 * regrename.c: Error out if MAX_RECOG_OPERANDS is larger than HOST_BITS_PER_WIDE_INT. (verify_reg_in_set): New function, broken out of verify_reg_tracked. (verify_reg_tracked): Use it. (scan_rtx_reg): When seeing a use involving a superset of the registers in an existing chain, enlarge that chain. Otherwise, allow subsets and set fail_current_block for all other kinds of overlap. (hide_operands): New argument UNTRACKED_OPERANDS; callers changed. Do not modify operands when the bit with the corresponding number is set in that bitmap. (build_def_use): When we see matching operands with different modes, don't set fail_current_block, but keep track of such operands in a bitmap if their registers aren't already tracked in an open chain. Pass this bitmap to all hide_operands calls. Modified: trunk/gcc/ChangeLog trunk/gcc/regrename.c
All is well again.
All is bad again after the fix for PR42220: 2010-03-07 Bernd Schmidt <bernd.schmidt@analog.com> PR rtl-optimization/42220 * regrename.c (scan_rtx) <case STRICT_LOW_PART, ZERO_EXTRACT>: Use verify_reg_tracked to determine if we should use OP_OUT rather than OP_INOUT. (build_def_use): If we see an in-out operand for a register that we know nothing about, treat is an output if possible, fail the block if not.
Created attachment 20048 [details] Alternative fix for 42220 If you wouldn't mind, please test the attached patch which should be an alternative fix for 42220 and avoids the need to set fail_block for some cases.
> If you wouldn't mind, please test the attached patch which should be an > alternative fix for 42220 and avoids the need to set fail_block for some cases. I cannot test if the patch in comment #23 fixes the regression for SPEC, but a partial test indicates that it does not reintroduce pr42220 (patch applied on top of r157293 with the patch from http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43287#c10 and a simple update). More testing during the coming nights).
Subject: Re: [4.5 Regression] changes in scheduling regress 464.h264ref 20% On Mon, 8 Mar 2010, bernds_cb1 at t-online dot de wrote: > ------- Comment #23 from bernds_cb1 at t-online dot de 2010-03-08 23:04 ------- > Created an attachment (id=20048) > --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=20048&action=view) > Alternative fix for 42220 > > If you wouldn't mind, please test the attached patch which should be an > alternative fix for 42220 and avoids the need to set fail_block for some cases. The patch fixes the regression for me. Thanks, Richard.
I have posted a full regression testing with the patch in comment #23 at http://gcc.gnu.org/ml/gcc-testresults/2010-03/msg01182.html The failures are the usual ones and pr42220 is still fixed.
Subject: Bug 42216 Author: bernds Date: Wed Mar 17 09:25:35 2010 New Revision: 157511 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=157511 Log: PR rtl-optimization/42216 * regrename.c (create_new_chain): New function, broken out from... (scan_rtx_reg): ... here. Call it. Handle the case where we are appending a use to an empty chain. (build_def_use): Remove previous changes that convert OP_INOUT to OP_OUT operands; instead detect the case where an OP_INOUT operand uses a previously untracked register and create an empty chain for it. Modified: trunk/gcc/ChangeLog trunk/gcc/regrename.c
All fine again. Fixed.