The following testcase:

typedef struct { float r, i; } cf;

void f (cf *restrict a, cf *restrict b, cf *restrict c,
        cf *restrict d, cf e)
{
  for (int i = 0; i < 100; ++i)
    {
      b[i].r = e.r * (c[i].r - d[i].r) - e.i * (c[i].i - d[i].i);
      b[i].i = e.r * (c[i].i - d[i].i) + e.i * (c[i].r - d[i].r);
    }
}

forms an SLP tree when compiled at -O3 but fails to at -Ofast, because match.pd rewrites the expressions into

      b[i].r = e.r * (c[i].r - d[i].r) + e.i * (d[i].i - c[i].i);
      b[i].i = e.r * (c[i].i - d[i].i) + e.i * (c[i].r - d[i].r);

and so introduces a different interleaving in the second multiply operation.

It's unclear to me what the gain of doing this is, as it results in worse vector and scalar code because the sharing of the computed values between the nodes is lost.

Without the rewrite the code can re-use the result of the first vector subtract and just reverse its elements:

.L2:
        ldr     q1, [x3, x0]
        ldr     q0, [x2, x0]
        fsub    v0.4s, v0.4s, v1.4s
        fmul    v1.4s, v2.4s, v0.4s
        fmul    v0.4s, v3.4s, v0.4s
        rev64   v1.4s, v1.4s
        fneg    v0.2d, v0.2d
        fadd    v0.4s, v0.4s, v1.4s
        str     q0, [x1, x0]
        add     x0, x0, 16
        cmp     x0, 800
        bne     .L2

With the rewrite it forces an increase in the VF to be able to handle the interleaving:

.L2:
        ld2     {v0.4s - v1.4s}, [x3], 32
        ld2     {v4.4s - v5.4s}, [x2], 32
        fsub    v2.4s, v1.4s, v5.4s
        fsub    v3.4s, v4.4s, v0.4s
        fsub    v5.4s, v5.4s, v1.4s
        fmul    v2.4s, v2.4s, v6.4s
        fmul    v4.4s, v6.4s, v3.4s
        fmla    v2.4s, v7.4s, v3.4s
        fmla    v4.4s, v5.4s, v7.4s
        mov     v0.16b, v2.16b
        mov     v1.16b, v4.16b
        st2     {v0.4s - v1.4s}, [x1], 32
        cmp     x5, x1
        bne     .L2

In scalar code you lose the ability to re-use the subtract, so you get an extra sub.
In particular, the rewrite should probably be gated on the expression being single use.
Mine for GCC 13.
match.pd just does canonicalization here.  SLP discovery could handle this in the existing swap-operands or reassoc support, but I guess the desire here is to pull out a Complex SLP pattern.

So what should really be done in the end is to get rid of the restriction during SLP build that a node has to be from a single interleaving chain.  We can handle {d[i].i, c[i].r} as a permute of {d[i].r, d[i].i} and {c[i].r, c[i].i} during discovery.  Of course doing that interferes with the swap-operands logic, which, when successful, will produce a more optimal initial SLP graph but relies on the recursive SLP discovery to first fail.  Forming an optimal SLP graph is likely NP-complete, so we have to employ heuristics to some extent.

So - no perfect idea yet how to reliably match a Complex pattern here, but trying to attack this from the match.pd side sounds wrong.
(In reply to Richard Biener from comment #3)
> match.pd just does canonicalization here.  SLP discovery could handle this
> in the existing swap operands or reassoc support but I guess the desire here
> is to pull out a Complex SLP pattern.

Yes, though also to optimize the case where you don't have the optab; currently the generated code is much worse at -Ofast.

> So - no perfect idea yet how to reliably match a Complex pattern here but
> trying to attack this from the match.pd side sounds wrong.

Well, the problem is that the scalar code is suboptimal too, even without matching a complex pattern, so the epilogue here does an extra sub on each unrolled step.

So I initially figured we'd want to not perform the canonicalization if it comes at the expense of sharing.  However, that looks harder than I thought at first, as there are multiple points in fold-const.c that will try to force this form.

I can probably fix the epilogue post-vectorization, but that seemed like a worse solution.
(In reply to Tamar Christina from comment #4)
> (In reply to Richard Biener from comment #3)
> > match.pd just does canonicalization here.  SLP discovery could handle this
> > in the existing swap operands or reassoc support but I guess the desire here
> > is to pull out a Complex SLP pattern.
>
> Yes, though also to optimize the case where you don't have the optab;
> currently the generated code is much worse at -Ofast.

Well, the issue is then why the scalar code is not optimized (yes, it's not so easy).

> So I initially figured we'd want to not perform the canonicalization if it
> comes at the expense of sharing.  However, that looks harder than I thought
> at first, as there are multiple points in fold-const.c that will try to
> force this form.

Yep.  The canonicalization likely happens early, before we do CSE.

> I can probably fix the epilogue post-vectorization, but that seemed like a
> worse solution.

Well, the CSE opportunity needs to be realized despite the canonicalization.