103903 – Loops handling r,g,b values are not vectorized to use power of 2 vectors even if they can

Status:	RESOLVED INVALID

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization
Version:	12.0

Importance:	P3 enhancement
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:	vectorizer

Reported:	2022-01-04 15:17 UTC by Jan Hubicka
Modified:	2022-01-05 09:49 UTC
CC List:	2 users

See Also:	99412
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2022-01-04 00:00:00

Description Jan Hubicka 2022-01-04 15:17:52 UTC

This is another testcase coming from Firefox's LightPixel. I am not sure if this is a duplicate, but I think the pattern is quite common in programs dealing with RGB values.

To match the code generated for the flat loop below, we would need to move from SLP vectorizing the 3 parallel computations to vectorizing the loop.

struct a {float r,g,b;};
struct a src[100000], dest[100000];

void
test ()
{
  int i;
  for (i=0;i<100000;i++)
  {
          dest[i].r/=src[i].g;
          dest[i].g/=src[i].g;
          dest[i].b/=src[i].b;
  }
}

is vectorized to do 3 operations at a time, while the equivalent:

float src[300000], dest[300000];

void
test ()
{
  int i;
  for (i=0;i<300000;i++)
  {
          dest[i]/=src[i];
  }
}

runs faster.

Comment 1 Andrew Pinski 2022-01-04 19:18:35 UTC

Basically this is re-rolling.

PR 99412 is another example of re-rolling; there might be others.
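
To illustrate what "re-rolling" means here, the following is a minimal sketch in plain C, not taken from GCC or from the testcase. It uses flat float arrays, and the names `div_unrolled`, `div_rerolled`, `rerolled_matches_unrolled` and the size `N` are made up for the example. It assumes each channel is divided by its own source channel, so the three statements per pixel collapse into a single statement over three times as many iterations.

```c
#include <stddef.h>

#define N 1024  /* pixel count; arbitrary value for this sketch */

/* Unrolled form: three independent divisions per pixel, matching the
   shape of an r,g,b loop where each channel uses its own source channel. */
static void div_unrolled(float *dest, const float *src, size_t npixels)
{
  for (size_t i = 0; i < npixels; i++)
    {
      dest[3 * i + 0] /= src[3 * i + 0];
      dest[3 * i + 1] /= src[3 * i + 1];
      dest[3 * i + 2] /= src[3 * i + 2];
    }
}

/* Re-rolled form: one division per iteration, 3x as many iterations.
   A loop vectorizer is free to pick any vector width for this shape. */
static void div_rerolled(float *dest, const float *src, size_t npixels)
{
  for (size_t i = 0; i < 3 * npixels; i++)
    dest[i] /= src[i];
}

/* Hypothetical self-check: returns 1 if both forms give identical results. */
int rerolled_matches_unrolled(void)
{
  static float src[3 * N], a[3 * N], b[3 * N];

  for (size_t i = 0; i < 3 * N; i++)
    {
      src[i] = (float) (i % 7 + 1);  /* divisor is never zero */
      a[i] = b[i] = (float) i;
    }

  div_unrolled(a, src, N);
  div_rerolled(b, src, N);

  for (size_t i = 0; i < 3 * N; i++)
    if (a[i] != b[i])
      return 0;
  return 1;
}
```

Calling `rerolled_matches_unrolled()` from a small driver is one way to confirm that the two loop shapes are interchangeable for this access pattern.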

Comment 2 Richard Biener 2022-01-05 07:43:35 UTC

If you fix the loop to do

  for (i=0;i<100000;i++)
  {
          dest[i].r/=src[i].r;
          dest[i].g/=src[i].g;
          dest[i].b/=src[i].b;
  }

it's vectorized just fine (with larger than necessary VF):

.L2:
        movaps  dest+16(%rax), %xmm1
        movaps  dest+32(%rax), %xmm0
        addq    $48, %rax
        divps   src-32(%rax), %xmm1
        movaps  dest-48(%rax), %xmm2
        divps   src-16(%rax), %xmm0
        divps   src-48(%rax), %xmm2
        movaps  %xmm1, dest-32(%rax)
        movaps  %xmm2, dest-48(%rax)
        movaps  %xmm0, dest-16(%rax)
        cmpq    $1200000, %rax
        jne     .L2

so not sure what you are asking for?  Is the unrolling harmful?  It should
be doable to do the "re-rolling" on the fly in some cases but it might be
some work to tie that in.
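
For reference, here is a small scalar sketch of how the loop as posted in the description (which reads `src[i].g` for the `r` channel) differs from a version that divides each channel by its own source channel. The function names and test values are invented for this illustration. Only the second form is a plain element-wise division and therefore equivalent to a flat `dest[i]/=src[i]` loop.

```c
struct a { float r, g, b; };

/* Loop body as posted in the description: the r channel is divided
   by the source g channel. */
static void div_as_posted(struct a *dest, const struct a *src, int n)
{
  for (int i = 0; i < n; i++)
    {
      dest[i].r /= src[i].g;
      dest[i].g /= src[i].g;
      dest[i].b /= src[i].b;
    }
}

/* Element-wise variant: each channel is divided by the same channel
   of the source. */
static void div_elementwise(struct a *dest, const struct a *src, int n)
{
  for (int i = 0; i < n; i++)
    {
      dest[i].r /= src[i].r;
      dest[i].g /= src[i].g;
      dest[i].b /= src[i].b;
    }
}

/* Hypothetical check: returns 1 if the two variants differ in the r
   channel only, for one pixel whose source r and g values differ. */
int posted_differs_from_elementwise(void)
{
  struct a src = { 2.0f, 4.0f, 8.0f };
  struct a x = { 8.0f, 8.0f, 8.0f };
  struct a y = { 8.0f, 8.0f, 8.0f };

  div_as_posted(&x, &src, 1);     /* x = { 8/4, 8/4, 8/8 } */
  div_elementwise(&y, &src, 1);   /* y = { 8/2, 8/4, 8/8 } */

  return x.r != y.r && x.g == y.g && x.b == y.b;
}
```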

Comment 3 Jan Hubicka 2022-01-05 09:49:25 UTC

Aha, sorry. I did not spot the typo in cut&paste.  Unrolling is fine.  I need to figure out why in the real testcase we don't do the same transformation.