Bug 108583 - [13 Regression] wrong code with vector division by uint16 at -O2
Summary: [13 Regression] wrong code with vector division by uint16 at -O2
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 13.0
Importance: P1 normal
Target Milestone: 13.0
Assignee: Tamar Christina
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2023-01-28 14:00 UTC by Zdenek Sojka
Modified: 2023-03-12 18:45 UTC
CC List: 4 users

See Also:
Host: x86_64-pc-linux-gnu
Target: aarch64-unknown-linux-gnu
Build:
Known to work: 11.3.1, 12.2.1
Known to fail: 13.0
Last reconfirmed: 2023-01-30 00:00:00


Attachments
reduced testcase (231 bytes, text/plain)
2023-01-28 14:00 UTC, Zdenek Sojka

Description Zdenek Sojka 2023-01-28 14:00:29 UTC
Created attachment 54366 [details]
reduced testcase

Output:
$ aarch64-unknown-linux-gnu-gcc -O2 testcase.c -static
$ qemu-aarch64 -- ./a.out 
qemu: uncaught target signal 6 (Aborted) - core dumped
Aborted

$ aarch64-unknown-linux-gnu-gcc -v
Using built-in specs.
COLLECT_GCC=/repo/gcc-trunk/binary-latest-aarch64/bin/aarch64-unknown-linux-gnu-gcc
COLLECT_LTO_WRAPPER=/repo/gcc-trunk/binary-trunk-r13-5461-20230127201120-g815e5740162-checking-yes-rtl-df-extra-aarch64/bin/../libexec/gcc/aarch64-unknown-linux-gnu/13.0.1/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: /repo/gcc-trunk//configure --enable-languages=c,c++ --enable-valgrind-annotations --disable-nls --enable-checking=yes,rtl,df,extra --with-cloog --with-ppl --with-isl --with-sysroot=/usr/aarch64-unknown-linux-gnu --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --target=aarch64-unknown-linux-gnu --with-ld=/usr/bin/aarch64-unknown-linux-gnu-ld --with-as=/usr/bin/aarch64-unknown-linux-gnu-as --disable-libstdcxx-pch --prefix=/repo/gcc-trunk//binary-trunk-r13-5461-20230127201120-g815e5740162-checking-yes-rtl-df-extra-aarch64
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 13.0.1 20230127 (experimental) (GCC)
Comment 1 Andrew Pinski 2023-01-30 04:19:35 UTC
Confirmed. The problem shows up in the expansion of:
  v_2 = v_1(D) / { 65535, 65535, 65535, 65535 };
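
(Editorial note, not part of the original comment: the attached reduced testcase is not inlined in this report; a minimal reproducer along these lines, assembled from the snippets in comments 5 and 9 below rather than taken from the actual attachment, aborts the same way when built at -O2 for aarch64:)

typedef unsigned __attribute__((__vector_size__ (16))) V;

static __attribute__((__noinline__, __noclone__)) V
foo (V v, unsigned short i)
{
  v /= i;   /* expands as v / { 65535, 65535, 65535, 65535 } */
  return v;
}

int
main (void)
{
  V v = foo ((V) { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff }, 0xffff);
  for (unsigned i = 0; i < 4; i++)
    if (v[i] != 0x00010001)   /* 0xffffffff / 0xffff == 0x10001 */
      __builtin_abort ();
  return 0;
}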
Comment 2 Andrew Pinski 2023-01-30 04:22:59 UTC
Most likely introduced by r13-4026-gc98aabc1427a4d .
Comment 3 Tamar Christina 2023-01-30 14:20:50 UTC
Right, so this is because in the expansion we don't have enough context to decide how to optimize the division.

This optimization is only possible when the input is widened because you need an additional free bit so that the second addition can't overflow.

The vectorizer has this context, but since we didn't want a new IFN the context should instead be derivable in the targetm.vectorize.can_special_div_by_const hook.

So my proposal to fix this and keep targetm.vectorize.can_special_div_by_const as a general divide optimization hook is to pass the actual `tree` operand0 as well to the hook such that the hook has a bit more context.

I was hoping to use get_nonzero_bits to get what the actual range of the operand is.  But it looks like for widening operations it still reports -1.

So my suggestion is that the backend should just check that the operand is non-constant and that the type of the operation being performed matches the precision we require.

Is this OK with you, Richards?  I'll need both of you to approve.
Comment 4 rguenther@suse.de 2023-01-30 14:52:55 UTC
On Mon, 30 Jan 2023, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583
> 
> Tamar Christina <tnfchris at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |rguenth at gcc dot gnu.org,
>                    |                            |rsandifo at gcc dot gnu.org
> 
> --- Comment #3 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> Right, so this is because in the expansion we don't have enough context to
> decide how to optimize the division.
> 
> This optimization is only possible when the input is widened because you need
> an additional free bit so that the second addition can't overflow.
> 
> The vectorizer has this context but since we didn't want a new IFN the 
> context should instead be derivable in 
> targetm.vectorize.can_special_div_by_const hook.

The vectorizer doesn't check for a widened operand zero.  In fact
"can_special_div_by_const" doesn't suggest that widening is required.
If the vectorizer checks that then why do we need another operand?
Comment 5 Tamar Christina 2023-01-30 15:01:03 UTC
> > The vectorizer has this context but since we didn't want a new IFN the 
> > context should instead be derivable in 
> > targetm.vectorize.can_special_div_by_const hook.
> 
> The vectorizer doesn't check for a widened operand zero.  In fact
> "can_special_div_by_const" doesn't suggest that widening is required.

Correct; at the moment the hook is generalized so far as to be a generic div-by-constant operation.  But the sequence we originally wanted to optimize needs more context than what was given to the hook.

> If the vectorizer checks that then why do we need another operand?

Because the expansion gets triggered by intrinsics code as well:

typedef unsigned __attribute__((__vector_size__ (16))) V;

static __attribute__((__noinline__)) __attribute__((__noclone__)) V
foo (V v, unsigned short i)
{
  v /= i;
  return v;
}

It's invalid to do this particular decomposition we do on AArch64, but the hook doesn't know this.  Since the new IFN wasn't liked, we're just leaving the `/` alone in the vectorizer.  But this means the expansion code, or the backend, needs to be able to check the context in which we are expanding.

We don't have an integer vector division operation. So the operation needs to be decomposed.  And if we decompose it, then the sequence becomes too long for combine to match and we're back where we started.
Comment 6 Richard Sandiford 2023-01-30 16:52:22 UTC
(In reply to Tamar Christina from comment #3)
> The vectorizer has this context but since we didn't want a new IFN the
> context should instead be derivable in
> targetm.vectorize.can_special_div_by_const hook.
I probably got lost in the threading, but could you point to where
the idea of using an ifn was rejected?  I know there was pushback
against hard-coding a specific constant, but that doesn't prevent
the use of ifns in itself.  (E.g. the gather/scatter IFNs have
constant arguments that are checked against what the target allows.)
Comment 7 Tamar Christina 2023-01-30 17:04:30 UTC
(In reply to rsandifo@gcc.gnu.org from comment #6)
> (In reply to Tamar Christina from comment #3)
> > The vectorizer has this context but since we didn't want a new IFN the
> > context should instead be derivable in
> > targetm.vectorize.can_special_div_by_const hook.
> I probably got lost in the threading, but could you point to where
> the idea of using an ifn was rejected?  I know there was pushback
> against hard-coding a specific constant, but that doesn't prevent
> the use of ifns in itself.  (E.g. the gather/scatter IFNs have
> constant arguments that are checked against what the target allows.)

https://patchwork.sourceware.org/project/gcc/patch/patch-15779-tamar@arm.com/#124788 is where the original patch used an IFN.
https://patchwork.sourceware.org/project/gcc/patch/patch-15779-tamar@arm.com/#124863 is where Richi wouldn't ACK a new optab as it didn't have a physical instruction backing it on any ISA, and in fairness no one else did either, effectively stranding the change.

https://patchwork.sourceware.org/project/gcc/patch/patch-15779-tamar@arm.com/#125144 is where I pinged and got no response to the ping.  After that I went on IRC and asked Richi how he'd like me to proceed.  In reply I was instructed that he would like to proceed the same way vector permutes are currently handled, with `can_perm` etc., and that's where the patch thread picks up again on the ML.
Comment 8 Richard Biener 2023-01-31 10:31:39 UTC
(In reply to Tamar Christina from comment #7)
> (In reply to rsandifo@gcc.gnu.org from comment #6)
> > (In reply to Tamar Christina from comment #3)
> > > The vectorizer has this context but since we didn't want a new IFN the
> > > context should instead be derivable in
> > > targetm.vectorize.can_special_div_by_const hook.
> > I probably got lost in the threading, but could you point to where
> > the idea of using an ifn was rejected?  I know there was pushback
> > against hard-coding a specific constant, but that doesn't prevent
> > the use of ifns in itself.  (E.g. the gather/scatter IFNs have
> > constant arguments that are checked against what the target allows.)
> 
> https://patchwork.sourceware.org/project/gcc/patch/patch-15779-tamar@arm.com/
> #124788 is where the original patch used an IFN.
> https://patchwork.sourceware.org/project/gcc/patch/patch-15779-tamar@arm.com/
> #124863 where Richi wouldn't ACK a new optab as it didn't have a physical
> instruction backing it on any ISA, and in fairness, no one else did either
> effectively stranding the change.

Yep.  The other option would have been to expose whatever special instruction
is used in the end as IFN and rewrite the division during vectorization
"properly", either via some pattern or via doing the expansion when
emitting the vectorized division instruction.

In hindsight that might have been the better option, but then ...

> https://patchwork.sourceware.org/project/gcc/patch/patch-15779-tamar@arm.com/
> #125144 where I pinged and got no response to the ping. After which I went
> on IRC and asked Richi how he'd like me to proceed.  In reply to this I was
> instructed he would like to proceed the same way vector permutes are
> currently handled with `can_perm` etc.  and that's where the patch thread
> picks off back on the ML.

... this looked simpler and also reasonable.  Only that it breaks now.
Comment 9 Richard Sandiford 2023-01-31 11:01:59 UTC
Are we sure this is a vectoriser vs. C vectors thing?
The equivalent vectoriser test:

void __attribute__((noipa))
f (unsigned *v)
{
  for (int i = 0; i < 4; ++i)
    v[i] /= 0xffff;
}

int
main (void)
{
  unsigned v[] = { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff };
  f (v);
  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
    if (v[i] != 0x00010001)
      __builtin_abort ();
  return 0;
}

fails in the same way for me (compiled at -O3).
Comment 10 Tamar Christina 2023-01-31 11:39:01 UTC
(In reply to rsandifo@gcc.gnu.org from comment #9)
> Are we sure this is a vectoriser vs. C vectors thing?

It's not; the issue we're debating is how to fix it.

As Richi pointed out https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583#c4 we can fix the vectorizer case by adding a guard around the recog code in the vectorizer.

But that only fixes the vectorizer, and Richi asked why that wasn't enough, to which I pointed out it's because intrinsics code hits it as well. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583#c5

This is why I mentioned that there are two ways to fix this:

1. We make the hook less general, and add checks in every place it is called.
2. We keep the hook general and pass enough context to the hook for it to decide how to optimize the division.

It's not intrinsics-only, but a vectorizer-only fix won't fix intrinsics.
Comment 11 rguenther@suse.de 2023-01-31 11:44:23 UTC
On Tue, 31 Jan 2023, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583
> 
> --- Comment #10 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> (In reply to rsandifo@gcc.gnu.org from comment #9)
> > Are we sure this is a vectoriser vs. C vectors thing?
> 
> it's not, the issue we're debating is how to fix it.
> 
> As Richi pointed out https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583#c4 we
> can fix the vectorizer case by adding a guard around the recog code in the
> vectorizer.
> 
> But that only fixes the vectorizer, and Richi asked why that wasn't enough, to
> which I pointed out it's because intrinsics code hits it as well.
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583#c5
> 
> This is why I mentioned that there are two ways to fix this:
> 
> 1. We make the hook less general, and add checks in every place it is called.
> 2. We keep the hook general and pass enough context to the hook for it to
> decide how to optimize the division.

I don't think passing in for example the tree operand 0 helps; the
target apparently wants to specialize not just by the constant divided by
but also by the number being divided (of which we already know the type).

Do you plan to do SSA def pattern matching in the hook for this at
expand time?

As said, recognizing the special case and open-coding the full division
in the vectorizer (or possibly even from RTL expansion) would be
another option here.
Comment 12 Tamar Christina 2023-01-31 11:58:07 UTC
(In reply to rguenther@suse.de from comment #11)
> On Tue, 31 Jan 2023, tnfchris at gcc dot gnu.org wrote:
> 
> 
> I don't think passing in for example the tree operand 0 helps, the
> target appearantly wants to not specialize by the constant divided by
> but instead also by the number divided (of we already know its type).
> 

The target wants to specialize division by a specific constant.  It just happens that there is an additional specialization when certain bits are known to be free in operand0.

> Do you plan to do SSA def pattern matching in the hook for this at
> expand time?
> 

My intention was to only check whether the DEF_STMT is a gimple assign and whether its operands are half the precision of the result.  This catches *w, *>> and any other case.  It doesn't need to look further up because it's harder to track non-zero bits across different math operations, and this is enough for the use case we're interested in.

> As said, recognizing the special case and open-coding the full divison
> in the vectorizer (or possibly even from RTL expansion) would be
> another option here.

open-coding in the vectorizer will require 4 new IFNs and 4 new optabs as it requires 4 different instructions: 2 for NEON, 2 for SVE, which are semantically different as permutes work differently across SVE and NEON.  So if you're happy with that I will do that.

> (or possibly even from RTL expansion) would be another option here.

I don't really follow this one. The open coding at expand time is what I was trying to do before.  The issue is there's no longer a single statement if we don't stop the vectorizer from decomposing it into shifts and multiplies.

As long as the vectorizer does this decomposition there's not much to do at expand time.  Additionally, the number of widening operations after vectorization is large, so semantically matching them with the decomposed shift seems impossible.

But those are only my thoughts :) ultimately I'll do the approach you two want.  If you want me to revert to using IFNs I can do that.
Comment 13 Richard Sandiford 2023-01-31 12:03:19 UTC
OK, hopefully I understand now.  Sorry for being slow.

But what specific constraints do we want to apply to the optimisation?

(In reply to Tamar Christina from comment #3)
> Right, so this is because in the expansion we don't have enough context to
> decide how to optimize the division.
> 
> This optimization is only possible when the input is widened because you
> need an additional free bit so that the second addition can't overflow.
> 
> The vectorizer has this context but since we didn't want a new IFN the
> context should instead be derivable in
> targetm.vectorize.can_special_div_by_const hook.
> 
> So my proposal to fix this and keep
> targetm.vectorize.can_special_div_by_const as a general divide optimization
> hook is to pass the actual `tree` operand0 as well to the hook such that the
> hook has a bit more context.
> 
> I was hoping to use get_nonzero_bits to get what the actual range of the
> operand is.  But it looks like for widening operations it still reports -1.
The original motivating example was:

void draw_bitmap1(uint8_t* restrict pixel, uint8_t level, int n)
{
  for (int i = 0; i < (n & -16); i+=1)
    pixel[i] = (pixel[i] * level) / 0xff;
}

where we then do a 16-bit division.  But since level and pixel[i] are
unconstrained, the maximum value of pixel[i] * level is 0xfe01.
There's no free bit in that sense.  It seems more that the range of the
dividend is [0, N*N] for a divisor of N.

If that's the condition we want to test for, it seems like something
we need to check in the vectoriser rather than the hook.  And it's
not something we can easily do in the vector form, since we don't
track ranges for vectors (AFAIK).
Comment 14 rguenther@suse.de 2023-01-31 12:19:40 UTC
On Tue, 31 Jan 2023, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583
> 
> --- Comment #12 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #11)
...
> > As said, recognizing the special case and open-coding the full divison
> > in the vectorizer (or possibly even from RTL expansion) would be
> > another option here.
> 
> open-coding in the vectorizer will require 4 new IFNs and 4 new optabs as it
> requires 4 different instructions. 2 for neon, 2 for SVE, which are
> semantically different as permutes work different across SVE and NEON.  So if
> you're happy with that I will do that.

You probably did so elsewhere some time ago, but what exactly are those
four instructions?  (pointers to specifications appreciated)
Comment 15 Tamar Christina 2023-01-31 13:35:49 UTC
> OK, hopefully I understand now.  Sorry for being slow.

Not at all.  Sorry if it came across a bit cranky, it wasn't meant that way!

> If that's the condition we want to test for, it seems like something
> we need to check in the vectoriser rather than the hook.  And it's
> not something we can easily do in the vector form, since we don't
> track ranges for vectors (AFAIK).

Ack, that also tracks with what I tried before; we indeed don't track ranges for vector ops.  The general case can still be handled slightly better (I think), but it doesn't become as clear a win as this one.

> You probably did so elsewhere some time ago, but what exactly are those
> four instructions?  (pointers to specifications appreciated)

For NEON we use:
https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/ADDHN--ADDHN2--Add-returning-High-Narrow-
https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/UADDW--UADDW2--Unsigned-Add-Wide-

In that order, and for SVE we use two https://developer.arm.com/documentation/ddi0602/2022-12/SVE-Instructions/ADDHNB--Add-narrow-high-part--bottom--

The difference between the SVE and the NEON versions is that the SVE version does even/odd and the NEON one the typical first-N/last-N element-wise operation.

At the moment the vectorizer doesn't know anything at all about SVE's even and odd instructions.  But if we create the bottom (B) variant we should create the top (T) variant as well.
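
(Editorial note, not part of the comment — roughly, per the linked documentation: addhn vd.8b, vn.8h, vm.8h computes (vn[i] + vm[i]) >> 8 truncated to 8 bits, dropping any carry out of bit 15; uaddw vd.8h, vn.8h, vm.8b computes vn[i] + zero_extend(vm[i]); and SVE2 addhnb zd.b, zn.h, zm.h does the same narrowing high-part add, writing the even (bottom) byte lanes.)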
Comment 16 rguenther@suse.de 2023-01-31 14:33:08 UTC
On Tue, 31 Jan 2023, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583
> 
> --- Comment #15 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> > OK, hopefully I understand now.  Sorry for being slow.
> 
> Not at all, Sorry if it came across a bit cranky, it wasn't meant that way!
> 
> > If that's the condition we want to test for, it seems like something
> > we need to check in the vectoriser rather than the hook.  And it's
> > not something we can easily do in the vector form, since we don't
> > track ranges for vectors (AFAIK).
> 
> Ack, that also tracks with what I tried before, we don't indeed track ranges
> for vector ops. The general case can still be handled slightly better (I think)
> but it doesn't become as clear of a win as this one.
> 
> > You probably did so elsewhere some time ago, but what exactly are those
> > four instructions?  (pointers to specifications appreciated)
> 
> For NEON we use:
> https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/ADDHN--ADDHN2--Add-returning-High-Narrow-

so that's an add + pack high

> https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/UADDW--UADDW2--Unsigned-Add-Wide-

and that unpacks (zero-extends) the high/low part of one operand of an add

I wonder, if we'd open-code the pack / unpack and use a regular add, whether
combine can synthesize uaddw and addhn?  The pack and unpack would be
vec_perms on GIMPLE (plus V_C_E).

> In that order, and for SVE we use two
> https://developer.arm.com/documentation/ddi0602/2022-12/SVE-Instructions/ADDHNB--Add-narrow-high-part--bottom--

probably similar.

So the difficulty here will be to decide whether that's in the end
better than what the pattern handling code does now, right?  Because
I think most targets will be able to do the above but lacking the
special adds it will be slower because of the extra packing/unpacking?

That said, can we possibly do just that costing (would be a first in
the pattern code I guess) with a target hook?  Or add optabs for
the addh operations so we can query support?
Comment 18 Tamar Christina 2023-01-31 15:01:41 UTC
> > 
> > Ack, that also tracks with what I tried before, we don't indeed track ranges
> > for vector ops. The general case can still be handled slightly better (I think)
> > but it doesn't become as clear of a win as this one.
> > 
> > > You probably did so elsewhere some time ago, but what exactly are those
> > > four instructions?  (pointers to specifications appreciated)
> > 
> > For NEON we use:
> > https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/ADDHN--ADDHN2--Add-returning-High-Narrow-
> 
> so thats a add + pack high
> 

Yes, though there is no overflow, since the addition is done in twice the precision of the original type.  So it's more a widening add + pack high, which narrows the result back and zero-extends.

> > https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/UADDW--UADDW2--Unsigned-Add-Wide-
> 
> and that unpacks (zero-extends) the high/low part of one operand of an add
> 
> I wonder if we'd open-code the pack / unpack and use regular add whether
> combine can synthesize uaddw and addhn?  The pack and unpack would be
> vec_perms on GIMPLE (plus V_C_E).

I don't think so for addhn, because it wouldn't truncate the top bits, it truncates the bottom bits.

The instruction does
    element1 = Elem[operand1, e, 2*esize];
    element2 = Elem[operand2, e, 2*esize];

So it widens on input. 

> 
> So the difficulty here will be to decide whether that's in the end
> better than what the pattern handling code does now, right?  Because
> I think most targets will be able to do the above but lacking the
> special adds it will be slower because of the extra packing/unpacking?
> 
> That said, can we possibly do just that costing (would be a first in
> the pattern code I guess) with a target hook?  Or add optabs for
> the addh operations so we can query support?

We could, but the alternative wouldn't be correct for costing I think.  If we generate *, +, vec_perm that's going to be more expensive.
Comment 19 rguenther@suse.de 2023-02-01 07:29:47 UTC
On Tue, 31 Jan 2023, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583
> 
> --- Comment #18 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> > > 
> > > Ack, that also tracks with what I tried before, we don't indeed track ranges
> > > for vector ops. The general case can still be handled slightly better (I think)
> > > but it doesn't become as clear of a win as this one.
> > > 
> > > > You probably did so elsewhere some time ago, but what exactly are those
> > > > four instructions?  (pointers to specifications appreciated)
> > > 
> > > For NEON we use:
> > > https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/ADDHN--ADDHN2--Add-returning-High-Narrow-
> > 
> > so thats a add + pack high
> > 
> 
> Yes, though with no overflow, the addition is done in twice the precision of
> the original type. So it's more a widening add + pack high which narrows it
> back and zero extends.
> 
> > > https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/UADDW--UADDW2--Unsigned-Add-Wide-
> > 
> > and that unpacks (zero-extends) the high/low part of one operand of an add
> > 
> > I wonder if we'd open-code the pack / unpack and use regular add whether
> > combine can synthesize uaddw and addhn?  The pack and unpack would be
> > vec_perms on GIMPLE (plus V_C_E).
> 
> I don't think so for addhn, because it wouldn't truncate the top bits, it
> truncates the bottom bits.
> 
> The instruction does
>     element1 = Elem[operand1, e, 2*esize];
>     element2 = Elem[operand2, e, 2*esize];
> 
> So it widens on input. 

OK, so that's an ADD_HIGHPART_EXPR then?  Though the highpart of an
add is only a single bit, isn't it?  For scalar you'd use the
carry bit here and instructions like adc to consume it.  Is addhn
there to do such a thing on vectors?

When writing generic vector code is combine able to synthesize addhn
from widen, plus, pack-high?

As said in the old discussion I'm not opposed to adding new IFNs,
but I'd like to see useful building blocks (that ideally map to ISAs)
instead of IFN-for-complex-pattern-X

The alternative way was to improve division expansion in general
which is the can_div_special_by_const_p thing, but we do not seem
to be able to capture the requirements correctly here.

> > So the difficulty here will be to decide whether that's in the end
> > better than what the pattern handling code does now, right?  Because
> > I think most targets will be able to do the above but lacking the
> > special adds it will be slower because of the extra packing/unpacking?
> > 
> > That said, can we possibly do just that costing (would be a first in
> > the pattern code I guess) with a target hook?  Or add optabs for
> > the addh operations so we can query support?
> 
> We could, the alternative wouldn't be correct for costing I think.. if we
> generate *+ , vec_perm that's gonna be more expensive.

Well, the target cost model can always detect such patterns ...

But sure, using the actual ISA is preferable for costing and also to
avoid "breaking" the combination by later "optimization".  OTOH at least
some basic constant folding for all such ISA IFNs is required to
avoid regressing cases where complete unrolling later allows constant
evaluation but vectorizing first would break that.
Comment 20 Tamar Christina 2023-02-01 16:22:27 UTC
> > I don't think so for addhn, because it wouldn't truncate the top bits, it
> > truncates the bottom bits.
> > 
> > The instruction does
> >     element1 = Elem[operand1, e, 2*esize];
> >     element2 = Elem[operand2, e, 2*esize];
> > 
> > So it widens on input. 
> 
> OK, so that's an ADD_HIGHPART_EXPR then?  Though the highpart of an
> add is only a single bit, isn't it?  For scalar you'd use the
> carry bit here and instructions like adc to consume it.  Is addhn
> to do such thing on vectors?

Sorry, it's not, but that's because I read the docs wrong.  This instruction
doesn't widen on addition but narrows the result.  The esize for the
instruction is determined by the output, not the input.


To unconfuse everyone lets just explain what this is doing:

The optimization is to try to optimize unsigned division by a mask formed by
setting all bits in half the size of the input.

i.e. for a divide of a short, the division needs to be by 0xff.

This works by decomposing the unsigned division of x / 0xff into
(x + ((x + 257) >> 8)) >> 8.  This is correct for all ranges as long
as the operations are done in at least double the precision of x.

If the operations are done in the same precision as x, then the only
valid ranges for this are those where x + 257 does not cause an overflow.

This is why the maximum range of a widened operation is fine, because
0xFE01 + 0x101 = 0xFF02 and so does not overflow.  In this case we can avoid
the widening and do the operation in a smaller precision.

The two vector instructions we're using do this, i.e.
addhn == (x + 257) >> 8 and uaddw == addhn_res + x.  Finally we emit a >> 8
at the end.  This shift later gets removed because a widening operation
will do the same sequence for the _lo and _hi pairs; since we are doing two
logical right shifts we can just permute the values out.

Hope this clears it up for you both.
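
(Editorial note, not part of the comment: a quick scalar sanity check of the decomposition above — a sketch only, it just verifies the arithmetic described:)

#include <assert.h>
#include <stdint.h>

int
main (void)
{
  for (uint32_t x = 0; x <= 0xffff; x++)
    {
      /* Evaluated in wider precision the identity always holds.  */
      assert (((x + ((x + 0x101u) >> 8)) >> 8) == x / 0xffu);
      /* Evaluated in 16 bits it only holds while x + 0x101 cannot
         overflow, i.e. for the widened range [0, 0xfe01].  */
      uint16_t t = (uint16_t) (x + 0x101u) >> 8;
      if (x <= 0xfe01u)
        assert (((uint16_t) (x + t) >> 8) == x / 0xffu);
    }
  /* The failure mode of this PR: the analogous 32-bit / 0xffff trick,
     evaluated without any headroom, is wrong for a full-range input.  */
  uint32_t x = 0xffffffffu;
  uint32_t t = (x + 0x10001u) >> 16;        /* wraps: 0x10000 >> 16 == 1 */
  assert (((x + t) >> 16) != x / 0xffffu);  /* gives 0, expected 0x10001 */
  return 0;
}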

> 
> When writing generic vector code is combine able to synthesize addhn
> from widen, plus, pack-high?

No, they're currently unspecs.  We could make combine recognize add, shift and
narrow as it, though.

> 
> But sure, using the actual ISA is preferable for costing and also to
> avoid "breaking" the combination by later "optimization".  OTOH at least
> some basic constant folding for all such ISA IFNs is required to
> avoid regressing cases where complete unrolling later allows constant
> evaluation but first vectorizing breaking that.

So I suppose one way to open-code this in the vectorizer would be to use
the hook to tell the vectorizer to decompose the operation that's within
the valid range into (x + ((x + 257) >> 8)) >> 8 (without any intermediate
promotions), fudge the costing for it, and use combine to recognize the
sequence?  It's a 4-instruction sequence so it should be able to.

I also spent the day wondering if I could come up with a sequence that's
correct in general and that I can use combine to optimize further in case
a widening operation is found. Unfortunately while I can create a sequence:

uint32x4_t
foo (uint32x4_t v)
{
  uint32x4_t cst = vreinterpretq_u32_u16 (vdupq_n_u16 (0x1));
  uint64x2_t l1 = vaddl_u32 (vget_low_u32 (cst), vget_low_u32 (v));
  uint64x2_t l2 = vaddl_high_u32 (cst, v);
  uint32x2_t t1 = vshrn_n_u64 (l1, 16);
  uint32x4_t t2 = vshrn_high_n_u64 (t1, l2, 16);
  uint64x2_t l3 = vaddl_u32 (vget_low_u32 (t2), vget_low_u32 (v));
  uint64x2_t l4 = vaddl_high_u32 (t2, v);
  uint32x2_t r1 = vshrn_n_u64 (l3, 16);
  uint32x4_t r2 = vshrn_high_n_u64 (r1, l4, 16);
  return r2;
}

But this isn't more efficient than what's there before, though it is easier to
match in combine.  Part of the problem is that if the addition produces 33 bits
of significance then you do need a shift and can't remove them with permutes.

So the core of the optimization relies on the range of the inputs, so it does need
to be done pre-vectorization.
Comment 21 Tamar Christina 2023-02-02 08:03:10 UTC
> 
> OK, so that's an ADD_HIGHPART_EXPR then?  Though the highpart of an
> add is only a single bit, isn't it?  For scalar you'd use the
> carry bit here and instructions like adc to consume it.  Is addhn
> to do such thing on vectors?
> 

So I think this is the only new IFN we'd need.  Basically we only need one representing (a + b) >> n; for certain values of n we have a single instruction, for others we reject it.

So I think we can just do two of these back to back and that should work.  If
this is ok I can implement it today.
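
(Editorial note, not part of the comment: in scalar terms, comment 20's decomposition expressed with that building block — a hypothetical sketch, with addh8 modelling (a + b) >> 8 at the 16-bit element precision so the sum wraps like the vector instruction would:)

static inline unsigned short
addh8 (unsigned short a, unsigned short b)
{
  return (unsigned short) (a + b) >> 8;   /* sum truncated to 16 bits, then shifted */
}

unsigned short
div255 (unsigned short x)
{
  /* Exact for x in the widened range [0, 0xfe01]; wrong for e.g. 0xffff
     because the inner sum wraps, which is the overflow constraint above.  */
  return addh8 (x, addh8 (x, 0x101));
}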
Comment 22 rguenther@suse.de 2023-02-02 08:50:04 UTC
On Thu, 2 Feb 2023, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583
> 
> --- Comment #21 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> > 
> > OK, so that's an ADD_HIGHPART_EXPR then?  Though the highpart of an
> > add is only a single bit, isn't it?  For scalar you'd use the
> > carry bit here and instructions like adc to consume it.  Is addhn
> > to do such thing on vectors?
> > 
> 
> So I think this is the only new IFN we'd need. basically we only need one
> representing (a + b) >> n, for certain values of n we have a single instruction
> for others we reject it.

But the ISA only implements n == width(a), so a true "highpart add".
So I'd add

OPTAB_D (sadd_highpart_optab, "sadd$a3_highpart")
OPTAB_D (uadd_highpart_optab, "uadd$a3_highpart")

only?  That is it's QImode + QImode -> QImode with the carry bit in
the result (sign or zero-extended)?
Comment 23 Tamar Christina 2023-02-02 08:55:00 UTC
(In reply to rguenther@suse.de from comment #22)
> On Thu, 2 Feb 2023, tnfchris at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583
> > 
> > --- Comment #21 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> > > 
> > > OK, so that's an ADD_HIGHPART_EXPR then?  Though the highpart of an
> > > add is only a single bit, isn't it?  For scalar you'd use the
> > > carry bit here and instructions like adc to consume it.  Is addhn
> > > to do such thing on vectors?
> > > 
> > 
> > So I think this is the only new IFN we'd need. basically we only need one
> > representing (a + b) >> n, for certain values of n we have a single instruction
> > for others we reject it.
> 
> But the ISA only implements n == width(a), so a true "highpart add".
> So I'd add
> 
> OPTAB_D (sadd_highpart_optab, "sadd$a3_highpart")
> OPTAB_D (uadd_highpart_optab, "uadd$a3_highpart")
> 
> only?  That is it's QImode + QImode -> QImode with the carry bit in
> the result (sign or zero-extended)?

Sure that works I think, I'll do that then.
Comment 24 Tamar Christina 2023-02-08 13:57:09 UTC
> Sure that works I think, I'll do that then.

Just to check, I'm regtesting the patch, I assume you want me to revert the hook as well right? Since nothing will be using it.
Comment 25 rguenther@suse.de 2023-02-09 07:41:43 UTC
On Wed, 8 Feb 2023, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583
> 
> --- Comment #24 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> > Sure that works I think, I'll do that then.
> 
> Just to check, I'm regtesting the patch, I assume you want me to revert the
> hook as well right? Since nothing will be using it.

Sure
Comment 26 GCC Commits 2023-03-12 18:43:46 UTC
The master branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:2246d576f922bae3629da0fe1dbfcc6ff06769ad

commit r13-6616-g2246d576f922bae3629da0fe1dbfcc6ff06769ad
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Sun Mar 12 18:39:33 2023 +0000

    middle-end: Revert can_special_div_by_const changes [PR108583]
    
    This reverts the changes for the CAN_SPECIAL_DIV_BY_CONST hook.
    
    gcc/ChangeLog:
    
            PR target/108583
            * doc/tm.texi (TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Remove.
            * doc/tm.texi.in: Likewise.
            * explow.cc (round_push, align_dynamic_address): Revert previous patch.
            * expmed.cc (expand_divmod): Likewise.
            * expmed.h (expand_divmod): Likewise.
            * expr.cc (force_operand, expand_expr_divmod): Likewise.
            * optabs.cc (expand_doubleword_mod, expand_doubleword_divmod): Likewise.
            * target.def (can_special_div_by_const): Remove.
            * target.h: Remove tree-core.h include
            * targhooks.cc (default_can_special_div_by_const): Remove.
            * targhooks.h (default_can_special_div_by_const): Remove.
            * tree-vect-generic.cc (expand_vector_operation): Remove hook.
            * tree-vect-patterns.cc (vect_recog_divmod_pattern): Remove hook.
            * tree-vect-stmts.cc (vectorizable_operation): Remove hook.
Comment 27 GCC Commits 2023-03-12 18:43:51 UTC
The master branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:03c6ba86757f0684c5419c90651106900f5ecb5a

commit r13-6617-g03c6ba86757f0684c5419c90651106900f5ecb5a
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Sun Mar 12 18:40:12 2023 +0000

    ranger: Add range-ops for widen addition and widen multiplication [PR108583]
    
    This adds range-ops for widening addition and widening multiplication.
    
    I couldn't figure out how to write a test for this.  It looks like there are
    self tests but not a way to write standalone ones?  I did create testcases in
    the patch 3/4 which tests the end result.
    
    gcc/ChangeLog:
    
            PR target/108583
            * gimple-range-op.h (gimple_range_op_handler): Add maybe_non_standard.
            * gimple-range-op.cc (gimple_range_op_handler::gimple_range_op_handler):
            Use it.
            (gimple_range_op_handler::maybe_non_standard): New.
            * range-op.cc (class operator_widen_plus_signed,
            operator_widen_plus_signed::wi_fold, class operator_widen_plus_unsigned,
            operator_widen_plus_unsigned::wi_fold, class operator_widen_mult_signed,
            operator_widen_mult_signed::wi_fold, class operator_widen_mult_unsigned,
            operator_widen_mult_unsigned::wi_fold,
            ptr_op_widen_mult_signed, ptr_op_widen_mult_unsigned,
            ptr_op_widen_plus_signed, ptr_op_widen_plus_unsigned): New.
            * range-op.h (ptr_op_widen_mult_signed, ptr_op_widen_mult_unsigned,
            ptr_op_widen_plus_signed, ptr_op_widen_plus_unsigned): New
    
    Co-Authored-By: Andrew MacLeod <amacleod@redhat.com>
Comment 28 GCC Commits 2023-03-12 18:43:56 UTC
The master branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:0b3c630fcc44063a61f6131af48a4171b1de2b37

commit r13-6618-g0b3c630fcc44063a61f6131af48a4171b1de2b37
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Sun Mar 12 18:40:50 2023 +0000

    middle-end: don't form FMAs when multiplication is not single use. [PR108583]
    
    The testcase
    
    typedef unsigned int vec __attribute__((vector_size(32)));
    vec
    f3 (vec a, vec b, vec c)
    {
      vec d = a * b;
      return d + ((c + d) >> 1);
    }
    
    shows a case where we don't want to form an FMA due to the MUL not being single
    use.  In this case to form an FMA we have to redo the MUL as well as we no
    longer have it to share.
    
    As such making an FMA here would be a de-optimization.
    
    gcc/ChangeLog:
    
            PR target/108583
            * tree-ssa-math-opts.cc (convert_mult_to_fma): Inhibit FMA in case not
            single use.
    
    gcc/testsuite/ChangeLog:
    
            PR target/108583
            * gcc.dg/mla_1.c: New test.
    
    Co-Authored-By: Richard Sandiford <richard.sandiford@arm.com>
Comment 29 GCC Commits 2023-03-12 18:44:01 UTC
The master branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:81fd62d1378b7ddc1fa0967cbddcdcdcdd2d8d8c

commit r13-6619-g81fd62d1378b7ddc1fa0967cbddcdcdcdd2d8d8c
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Sun Mar 12 18:42:04 2023 +0000

    middle-end: Implement preferred_div_as_shifts_over_mult [PR108583]
    
    This now implements a hook
    preferred_div_as_shifts_over_mult that indicates whether a target prefers that
    the vectorizer decomposes division as shifts rather than multiplication when
    possible.
    
    In order to be able to use this we need to check whether the current precision
    has enough bits to do the operation without any of the additions overflowing.
    
    We use range information to determine this and only do the operation if we're
    sure an overflow won't occur. This now uses ranger to do this range check.
    
    This seems to work better than vect_get_range_info which uses range_query, but I
    have not switched the interface of vect_get_range_info over in this PR fix.
    
    As Andy said before initializing a ranger instance is cheap but not free, and if
    the intention is to call it often during a pass it should be instantiated at
    pass startup and passed along to the places that need it.  This is a big
    refactoring and doesn't seem right to do in this PR.  But we should in GCC 14.
    
    Currently we only instantiate it after a long series of much cheaper checks.
    
    gcc/ChangeLog:
    
            PR target/108583
            * target.def (preferred_div_as_shifts_over_mult): New.
            * doc/tm.texi.in: Document it.
            * doc/tm.texi: Regenerate.
            * targhooks.cc (default_preferred_div_as_shifts_over_mult): New.
            * targhooks.h (default_preferred_div_as_shifts_over_mult): New.
            * tree-vect-patterns.cc (vect_recog_divmod_pattern): Use it.
    
    gcc/testsuite/ChangeLog:
    
            PR target/108583
            * gcc.dg/vect/vect-div-bitmask-4.c: New test.
            * gcc.dg/vect/vect-div-bitmask-5.c: New test.
Comment 30 GCC Commits 2023-03-12 18:44:06 UTC
The master branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:f23dc726875c26f2c38dfded453aa9beba0b9be9

commit r13-6620-gf23dc726875c26f2c38dfded453aa9beba0b9be9
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Sun Mar 12 18:42:59 2023 +0000

    AArch64: Update div-bitmask to implement new optab instead of target hook [PR108583]
    
    This replaces the custom division hook with just an implementation through
    add_highpart.  For NEON we implement the add highpart (Addition + extraction of
    the upper highpart of the register in the same precision) as ADD + LSR.
    
    This representation allows us to easily optimize the sequence using existing
    sequences. This gets us a pretty decent sequence using SRA:
    
            umull   v1.8h, v0.8b, v3.8b
            umull2  v0.8h, v0.16b, v3.16b
            add     v5.8h, v1.8h, v2.8h
            add     v4.8h, v0.8h, v2.8h
            usra    v1.8h, v5.8h, 8
            usra    v0.8h, v4.8h, 8
            uzp2    v1.16b, v1.16b, v0.16b
    
    To get the most optimal sequence however we match (a + ((b + c) >> n)) where n
    is half the precision of the mode of the operation into addhn + uaddw which is
    a general good optimization on its own and gets us back to:
    
    .L4:
            ldr     q0, [x3]
            umull   v1.8h, v0.8b, v5.8b
            umull2  v0.8h, v0.16b, v5.16b
            addhn   v3.8b, v1.8h, v4.8h
            addhn   v2.8b, v0.8h, v4.8h
            uaddw   v1.8h, v1.8h, v3.8b
            uaddw   v0.8h, v0.8h, v2.8b
            uzp2    v1.16b, v1.16b, v0.16b
            str     q1, [x3], 16
            cmp     x3, x4
            bne     .L4
    
    For SVE2 we optimize the initial sequence to the same ADD + LSR which gets us:
    
    .L3:
            ld1b    z0.h, p0/z, [x0, x3]
            mul     z0.h, p1/m, z0.h, z2.h
            add     z1.h, z0.h, z3.h
            usra    z0.h, z1.h, #8
            lsr     z0.h, z0.h, #8
            st1b    z0.h, p0, [x0, x3]
            inch    x3
            whilelo p0.h, w3, w2
            b.any   .L3
    .L1:
            ret
    
    and to get the most optimal sequence I match (a + b) >> n (same constraint on n)
    to addhnb which gets us to:
    
    .L3:
            ld1b    z0.h, p0/z, [x0, x3]
            mul     z0.h, p1/m, z0.h, z2.h
            addhnb  z1.b, z0.h, z3.h
            addhnb  z0.b, z0.h, z1.h
            st1b    z0.h, p0, [x0, x3]
            inch    x3
            whilelo p0.h, w3, w2
            b.any   .L3
    
    There are multiple RTL representations possible for these optimizations, I did
    not represent them using a zero_extend because we seem very inconsistent in this
    in the backend.  Since they are unspecs we won't match them from vector ops
    anyway. I figured maintainers would prefer this, but my maintainer ouija board
    is still out for repairs :)
    
    There are no new test as new correctness tests were added to the mid-end and
    the existing codegen tests for this already exist.
    
    gcc/ChangeLog:
    
            PR target/108583
            * config/aarch64/aarch64-simd.md (@aarch64_bitmask_udiv<mode>3): Remove.
            (*bitmask_shift_plus<mode>): New.
            * config/aarch64/aarch64-sve2.md (*bitmask_shift_plus<mode>): New.
            (@aarch64_bitmask_udiv<mode>3): Remove.
            * config/aarch64/aarch64.cc
            (aarch64_vectorize_can_special_div_by_constant,
            TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Removed.
            (TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT,
            aarch64_vectorize_preferred_div_as_shifts_over_mult): New.
Comment 31 Tamar Christina 2023-03-12 18:45:45 UTC
Fixed, thanks for the report.