User account creation filtered due to spam.

Bug 71361 - [7/8 Regression] Changes in ivopts caused perf regression on x86
Summary: [7/8 Regression] Changes in ivopts caused perf regression on x86
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 7.0
: P2 normal
Target Milestone: 8.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2016-05-31 19:19 UTC by iverbin
Modified: 2017-03-31 07:45 UTC (History)
6 users (show)

See Also:
Host:
Target: i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description iverbin 2016-05-31 19:19:14 UTC
r235805 leads to performance regression on x86.

Reduced testcase:

int arr_1[512];
int arr_2[512];

int main ()
{
    int c1[512];
    int c2[512];
    int res[512];

    for (int i = 0; i < 512; i++)
      arr_1[i] = arr_2[i] = c1[i] = c2[i] = i;

    for (int l = 0; l < 1000000; l++)
      for (int k = 1; k <= 9; k++)
        {
          int n1 = 1 << k;
          int n2 = n1 >> 1;

          for (int j = 0; j < n2; j++)
            for (int i = j; i < 512; i += n1)
              {
                int idx = i + n2;
                int x1 = arr_1[idx] * c1[j] + arr_2[idx] * c2[j];
                int x2 = arr_2[idx] * c1[j] + arr_1[idx] * c2[j];

                arr_1[i] = x1;
                arr_2[i] = x2;
                arr_1[idx] = x1;
                arr_2[idx] = x2;
              }
        }

    return 0;
}

Compilation options: -Ofast -m32 -fPIE
GCC is configured --with-arch=corei7 --with-cpu=corei7 --with-fpmath=sse
Run time on Sandy Bridge increased by ~20%
Run time on Atom increased by ~60%

Below are the dumps of the innermost loop after ivopts pass.

Before regression there are 2 induction variables, which are used as bases for all 6 memory accesses:

  # i_66 = PHI <i_37(7), j_65(11)>
  # ivtmp.19_63 = PHI <ivtmp.19_95(7), ivtmp.19_76(11)>
  # ivtmp.20_17 = PHI <ivtmp.20_15(7), ivtmp.20_73(11)>
  _59 = (void *) ivtmp.19_63;
  _58 = (sizetype) n2_20;
  _22 = MEM[base: _59, index: _58, step: 4, offset: 0B];
  _24 = _22 * pretmp_105;
  _55 = (void *) ivtmp.20_17;
  _54 = (sizetype) n2_20;
  _25 = MEM[base: _55, index: _54, step: 4, offset: 0B];
  _27 = _25 * pretmp_107;
  x1_28 = _24 + _27;
  _30 = _25 * pretmp_105;
  _31 = _22 * pretmp_107;
  x2_32 = _30 + _31;
  _51 = (void *) ivtmp.19_63;
  MEM[base: _51, offset: 0B] = x1_28;
  _50 = (void *) ivtmp.20_17;
  MEM[base: _50, offset: 0B] = x2_32;
  _57 = (void *) ivtmp.19_63;
  _56 = (sizetype) n2_20;
  MEM[base: _57, index: _56, step: 4, offset: 0B] = x1_28;
  _53 = (void *) ivtmp.20_17;
  _52 = (sizetype) n2_20;
  MEM[base: _53, index: _52, step: 4, offset: 0B] = x2_32;
  i_37 = n1_19 + i_66;
  ivtmp.19_95 = ivtmp.19_63 + _77;
  ivtmp.20_15 = ivtmp.20_17 + _12;
  if (i_37 <= 511)
    goto <bb 7>;
  else
    goto <bb 9>;

After regression there is only one induction variable, which is used as index for 4 memory accesses.

  # i_66 = PHI <i_37(7), j_65(11)>
  # ivtmp.22_63 = PHI <ivtmp.22_95(7), ivtmp.22_76(11)>
  _22 = MEM[symbol: arr_1, index: ivtmp.22_63, offset: 0B];
  _24 = _22 * pretmp_105;
  _25 = MEM[symbol: arr_2, index: ivtmp.22_63, offset: 0B];
  _27 = _25 * pretmp_107;
  x1_28 = _24 + _27;
  _30 = _25 * pretmp_105;
  _31 = _22 * pretmp_107;
  x2_32 = _30 + _31;
  _17 = (sizetype) i_66;
  _15 = _17 * 4;
  MEM[symbol: arr_1, index: _15, offset: 0B] = x1_28;
  _14 = (sizetype) i_66;
  _12 = _14 * 4;
  MEM[symbol: arr_2, index: _12, offset: 0B] = x2_32;
  MEM[symbol: arr_1, index: ivtmp.22_63, offset: 0B] = x1_28;
  MEM[symbol: arr_2, index: ivtmp.22_63, offset: 0B] = x2_32;
  i_37 = n1_19 + i_66;
  ivtmp.22_95 = ivtmp.22_63 + _77;
  if (i_37 <= 511)
    goto <bb 7>;
  else
    goto <bb 9>;

As a result, the final assembly contains 13% more instructions.

Before regression:

  .L5:
        movl    (%edi,%ebx,4), %eax
        movd    %xmm1, %edx
        movd    %xmm0, %ecx
        imull   (%esi,%ebx,4), %ecx
        imull   %eax, %edx
        addl    %ecx, %edx
        movd    %xmm0, %ecx
        imull   %ecx, %eax
        movd    %xmm1, %ecx
        imull   (%esi,%ebx,4), %ecx
        movl    %edx, (%esi)
        movl    %edx, (%esi,%ebx,4)
        movd    %xmm5, %edx
        addl    %edx, %esi
        addl    %ecx, %eax
        movl    %eax, (%edi)
        movl    %eax, (%edi,%ebx,4)
        movd    %xmm4, %eax
        addl    %edx, %edi
        addl    %eax, -4124(%ebp)
        movl    -4124(%ebp), %ecx
        cmpl    $511, %ecx
        jle     .L5

After regression:

  .L5:
        movd    %xmm5, %edi
        movd    %xmm3, %edx
        movd    %xmm1, %ebx
        imull   (%eax,%edx), %ebx
        movd    %xmm4, %ecx
        movd    %xmm4, %edx
        imull   (%eax,%edi), %ecx
        addl    %ecx, %ebx
        movd    %xmm1, %ecx
        imull   (%eax,%edi), %ecx
        movd    %ecx, %xmm0
        movd    %xmm3, %ecx
        imull   (%eax,%ecx), %edx
        movd    %xmm0, %ecx
        addl    %edx, %ecx
        movd    %xmm3, %edx
        movl    %ebx, (%edx,%esi,4)
        movd    %xmm3, %edx
        movl    %ecx, (%edi,%esi,4)
        addl    -4124(%ebp), %esi
        movd    %xmm5, %edi
        movl    %ebx, (%eax,%edx)
        movl    %ecx, (%eax,%edi)
        addl    -4128(%ebp), %eax
        cmpl    $511, %esi
        jle     .L5
Comment 1 amker 2016-06-01 07:59:25 UTC
It doesn't look like a regression from the dump, I suspect it's because how gcc handles symbol (arr_1/arr_2) in m32 PIE code.   I will have a look.
BTW, the patch itself is right, it triggers cost model issue again in which wrong/inaccurate cost gives better result.   I am doing experiments rewriting the whole cost computation part.
Comment 2 Jakub Jelinek 2017-01-02 14:55:07 UTC
Any progress with this?
Comment 3 amker 2017-01-03 10:40:18 UTC
(In reply to Jakub Jelinek from comment #2)
> Any progress with this?

I had a patch which can resolve the reported regression, and generate even better code than the original:

.L5:
	movl	(%esi,%ecx,4), %eax
	movl	-4128(%ebp), %edx
	movl	-4124(%ebp), %ebx
	imull	(%edi,%ecx,4), %ebx
	imull	%eax, %edx
	imull	-4124(%ebp), %eax
	addl	%ebx, %edx
	movl	-4128(%ebp), %ebx
	imull	(%edi,%ecx,4), %ebx
	addl	%ebx, %eax
	movl	-4132(%ebp), %ebx
	movl	%edx, (%ebx,%ecx,4)
	movl	-4136(%ebp), %ebx
	movl	%edx, (%edi,%ecx,4)
	movl	%eax, (%ebx,%ecx,4)
	movl	%eax, (%esi,%ecx,4)
	addl	-4140(%ebp), %ecx
	cmpl	$511, %ecx
	jle	.L5

But this patch is a complete rewrite of major part of IVOPT, which is a stage 1 work.
Comment 4 Jakub Jelinek 2017-01-03 10:42:42 UTC
So shall we defer this PR to GCC 8 then (i.e. [8 Regression] and Target Milestone: 8.0?  Richard, are you ok with that?