Bug 71361 - [7 Regression] Changes in ivopts caused perf regression on x86
Summary: [7 Regression] Changes in ivopts caused perf regression on x86
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 7.0
: P2 normal
Target Milestone: 8.0
Assignee: Martin Liška
URL:
Keywords: missed-optimization, needs-bisection
Depends on:
Blocks:
 
Reported: 2016-05-31 19:19 UTC by iverbin
Modified: 2018-02-19 12:40 UTC (History)
5 users (show)

See Also:
Host:
Target: i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-01-08 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description iverbin 2016-05-31 19:19:14 UTC
r235805 leads to performance regression on x86.

Reduced testcase:

int arr_1[512];
int arr_2[512];

int main ()
{
    int c1[512];
    int c2[512];
    int res[512];

    for (int i = 0; i < 512; i++)
      arr_1[i] = arr_2[i] = c1[i] = c2[i] = i;

    for (int l = 0; l < 1000000; l++)
      for (int k = 1; k <= 9; k++)
        {
          int n1 = 1 << k;
          int n2 = n1 >> 1;

          for (int j = 0; j < n2; j++)
            for (int i = j; i < 512; i += n1)
              {
                int idx = i + n2;
                int x1 = arr_1[idx] * c1[j] + arr_2[idx] * c2[j];
                int x2 = arr_2[idx] * c1[j] + arr_1[idx] * c2[j];

                arr_1[i] = x1;
                arr_2[i] = x2;
                arr_1[idx] = x1;
                arr_2[idx] = x2;
              }
        }

    return 0;
}

Compilation options: -Ofast -m32 -fPIE
GCC is configured --with-arch=corei7 --with-cpu=corei7 --with-fpmath=sse
Run time on Sandy Bridge increased by ~20%
Run time on Atom increased by ~60%

Below are the dumps of the innermost loop after ivopts pass.

Before regression there are 2 induction variables, which are used as bases for all 6 memory accesses:

  # i_66 = PHI <i_37(7), j_65(11)>
  # ivtmp.19_63 = PHI <ivtmp.19_95(7), ivtmp.19_76(11)>
  # ivtmp.20_17 = PHI <ivtmp.20_15(7), ivtmp.20_73(11)>
  _59 = (void *) ivtmp.19_63;
  _58 = (sizetype) n2_20;
  _22 = MEM[base: _59, index: _58, step: 4, offset: 0B];
  _24 = _22 * pretmp_105;
  _55 = (void *) ivtmp.20_17;
  _54 = (sizetype) n2_20;
  _25 = MEM[base: _55, index: _54, step: 4, offset: 0B];
  _27 = _25 * pretmp_107;
  x1_28 = _24 + _27;
  _30 = _25 * pretmp_105;
  _31 = _22 * pretmp_107;
  x2_32 = _30 + _31;
  _51 = (void *) ivtmp.19_63;
  MEM[base: _51, offset: 0B] = x1_28;
  _50 = (void *) ivtmp.20_17;
  MEM[base: _50, offset: 0B] = x2_32;
  _57 = (void *) ivtmp.19_63;
  _56 = (sizetype) n2_20;
  MEM[base: _57, index: _56, step: 4, offset: 0B] = x1_28;
  _53 = (void *) ivtmp.20_17;
  _52 = (sizetype) n2_20;
  MEM[base: _53, index: _52, step: 4, offset: 0B] = x2_32;
  i_37 = n1_19 + i_66;
  ivtmp.19_95 = ivtmp.19_63 + _77;
  ivtmp.20_15 = ivtmp.20_17 + _12;
  if (i_37 <= 511)
    goto <bb 7>;
  else
    goto <bb 9>;

After regression there is only one induction variable, which is used as index for 4 memory accesses.

  # i_66 = PHI <i_37(7), j_65(11)>
  # ivtmp.22_63 = PHI <ivtmp.22_95(7), ivtmp.22_76(11)>
  _22 = MEM[symbol: arr_1, index: ivtmp.22_63, offset: 0B];
  _24 = _22 * pretmp_105;
  _25 = MEM[symbol: arr_2, index: ivtmp.22_63, offset: 0B];
  _27 = _25 * pretmp_107;
  x1_28 = _24 + _27;
  _30 = _25 * pretmp_105;
  _31 = _22 * pretmp_107;
  x2_32 = _30 + _31;
  _17 = (sizetype) i_66;
  _15 = _17 * 4;
  MEM[symbol: arr_1, index: _15, offset: 0B] = x1_28;
  _14 = (sizetype) i_66;
  _12 = _14 * 4;
  MEM[symbol: arr_2, index: _12, offset: 0B] = x2_32;
  MEM[symbol: arr_1, index: ivtmp.22_63, offset: 0B] = x1_28;
  MEM[symbol: arr_2, index: ivtmp.22_63, offset: 0B] = x2_32;
  i_37 = n1_19 + i_66;
  ivtmp.22_95 = ivtmp.22_63 + _77;
  if (i_37 <= 511)
    goto <bb 7>;
  else
    goto <bb 9>;

As a result, the final assembly contains 13% more instructions.

Before regression:

  .L5:
        movl    (%edi,%ebx,4), %eax
        movd    %xmm1, %edx
        movd    %xmm0, %ecx
        imull   (%esi,%ebx,4), %ecx
        imull   %eax, %edx
        addl    %ecx, %edx
        movd    %xmm0, %ecx
        imull   %ecx, %eax
        movd    %xmm1, %ecx
        imull   (%esi,%ebx,4), %ecx
        movl    %edx, (%esi)
        movl    %edx, (%esi,%ebx,4)
        movd    %xmm5, %edx
        addl    %edx, %esi
        addl    %ecx, %eax
        movl    %eax, (%edi)
        movl    %eax, (%edi,%ebx,4)
        movd    %xmm4, %eax
        addl    %edx, %edi
        addl    %eax, -4124(%ebp)
        movl    -4124(%ebp), %ecx
        cmpl    $511, %ecx
        jle     .L5

After regression:

  .L5:
        movd    %xmm5, %edi
        movd    %xmm3, %edx
        movd    %xmm1, %ebx
        imull   (%eax,%edx), %ebx
        movd    %xmm4, %ecx
        movd    %xmm4, %edx
        imull   (%eax,%edi), %ecx
        addl    %ecx, %ebx
        movd    %xmm1, %ecx
        imull   (%eax,%edi), %ecx
        movd    %ecx, %xmm0
        movd    %xmm3, %ecx
        imull   (%eax,%ecx), %edx
        movd    %xmm0, %ecx
        addl    %edx, %ecx
        movd    %xmm3, %edx
        movl    %ebx, (%edx,%esi,4)
        movd    %xmm3, %edx
        movl    %ecx, (%edi,%esi,4)
        addl    -4124(%ebp), %esi
        movd    %xmm5, %edi
        movl    %ebx, (%eax,%edx)
        movl    %ecx, (%eax,%edi)
        addl    -4128(%ebp), %eax
        cmpl    $511, %esi
        jle     .L5
Comment 1 bin cheng 2016-06-01 07:59:25 UTC
It doesn't look like a regression from the dump, I suspect it's because how gcc handles symbol (arr_1/arr_2) in m32 PIE code.   I will have a look.
BTW, the patch itself is right, it triggers cost model issue again in which wrong/inaccurate cost gives better result.   I am doing experiments rewriting the whole cost computation part.
Comment 2 Jakub Jelinek 2017-01-02 14:55:07 UTC
Any progress with this?
Comment 3 bin cheng 2017-01-03 10:40:18 UTC
(In reply to Jakub Jelinek from comment #2)
> Any progress with this?

I had a patch which can resolve the reported regression, and generate even better code than the original:

.L5:
	movl	(%esi,%ecx,4), %eax
	movl	-4128(%ebp), %edx
	movl	-4124(%ebp), %ebx
	imull	(%edi,%ecx,4), %ebx
	imull	%eax, %edx
	imull	-4124(%ebp), %eax
	addl	%ebx, %edx
	movl	-4128(%ebp), %ebx
	imull	(%edi,%ecx,4), %ebx
	addl	%ebx, %eax
	movl	-4132(%ebp), %ebx
	movl	%edx, (%ebx,%ecx,4)
	movl	-4136(%ebp), %ebx
	movl	%edx, (%edi,%ecx,4)
	movl	%eax, (%ebx,%ecx,4)
	movl	%eax, (%esi,%ecx,4)
	addl	-4140(%ebp), %ecx
	cmpl	$511, %ecx
	jle	.L5

But this patch is a complete rewrite of major part of IVOPT, which is a stage 1 work.
Comment 4 Jakub Jelinek 2017-01-03 10:42:42 UTC
So shall we defer this PR to GCC 8 then (i.e. [8 Regression] and Target Milestone: 8.0?  Richard, are you ok with that?
Comment 5 bin cheng 2017-08-08 16:30:25 UTC
(In reply to Jakub Jelinek from comment #4)
> So shall we defer this PR to GCC 8 then (i.e. [8 Regression] and Target
> Milestone: 8.0?  Richard, are you ok with that?

With ivopt rewriting, we now generate below code:
.L5:
	movl	(%esi,%ecx,4), %eax
	movl	40(%esp), %edx
	movl	44(%esp), %ebx
	imull	(%edi,%ecx,4), %ebx
	imull	%eax, %edx
	imull	44(%esp), %eax
	addl	%ebx, %edx
	movl	40(%esp), %ebx
	imull	(%edi,%ecx,4), %ebx
	addl	%ebx, %eax
	movl	36(%esp), %ebx
	movl	%edx, (%ebx,%ecx,4)
	movl	32(%esp), %ebx
	movl	%edx, (%edi,%ecx,4)
	movl	%eax, (%ebx,%ecx,4)
	movl	%eax, (%esi,%ecx,4)
	addl	28(%esp), %ecx
	cmpl	$511, %ecx
	jle	.L5

Which I think is optimal.  Shall we consider this fixed?
Comment 6 bin cheng 2017-08-08 16:33:11 UTC
Hmm, but it can't be backported to 7 branch.
Comment 7 Jeffrey A. Law 2018-01-08 23:23:43 UTC
Fixed on the trunk.
Comment 8 Leslie Zhai 2018-02-07 04:43:04 UTC
(In reply to Jeffrey A. Law from comment #7)
> Fixed on the trunk.

Hi Jeff,

Could you please show me which commit fixed it on trunk? thanks a lot!

Regards,
Leslie Zhai
Comment 9 Jeffrey A. Law 2018-02-08 23:19:02 UTC
Leslie, you'd need to bisect.  Probably something from Bin in the summer of 2017.  Not something we're likely to backport.
Comment 10 Martin Liška 2018-02-12 08:55:31 UTC
Let me bisect that.
Comment 11 bin cheng 2018-02-12 09:39:52 UTC
(In reply to Martin Liška from comment #10)
> Let me bisect that.

IIRC, it's one of the IVOPTs rewriting patches.  The Bisected revision may not be appropriate for backport on itself I think.
Comment 12 Martin Liška 2018-02-12 10:11:10 UTC
(In reply to amker from comment #11)
> (In reply to Martin Liška from comment #10)
> > Let me bisect that.
> 
> IIRC, it's one of the IVOPTs rewriting patches.  The Bisected revision may
> not be appropriate for backport on itself I think.

Ok, so then probably close the PR as we're not planning a rewritten to gcc7 branch?
Comment 13 Martin Liška 2018-02-19 12:40:43 UTC
Closing.