Created attachment 54366 [details]
reduced testcase

Output:

$ aarch64-unknown-linux-gnu-gcc -O2 testcase.c -static
$ qemu-aarch64 -- ./a.out
qemu: uncaught target signal 6 (Aborted) - core dumped
Aborted

$ aarch64-unknown-linux-gnu-gcc -v
Using built-in specs.
COLLECT_GCC=/repo/gcc-trunk/binary-latest-aarch64/bin/aarch64-unknown-linux-gnu-gcc
COLLECT_LTO_WRAPPER=/repo/gcc-trunk/binary-trunk-r13-5461-20230127201120-g815e5740162-checking-yes-rtl-df-extra-aarch64/bin/../libexec/gcc/aarch64-unknown-linux-gnu/13.0.1/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: /repo/gcc-trunk//configure --enable-languages=c,c++ --enable-valgrind-annotations --disable-nls --enable-checking=yes,rtl,df,extra --with-cloog --with-ppl --with-isl --with-sysroot=/usr/aarch64-unknown-linux-gnu --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --target=aarch64-unknown-linux-gnu --with-ld=/usr/bin/aarch64-unknown-linux-gnu-ld --with-as=/usr/bin/aarch64-unknown-linux-gnu-as --disable-libstdcxx-pch --prefix=/repo/gcc-trunk//binary-trunk-r13-5461-20230127201120-g815e5740162-checking-yes-rtl-df-extra-aarch64
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 13.0.1 20230127 (experimental) (GCC)
Confirmed. The problem shows up in the expansion of:

  v_2 = v_1(D) / { 65535, 65535, 65535, 65535 };
Most likely introduced by r13-4026-gc98aabc1427a4d .
Right, so this is because in the expansion we don't have enough context to decide how to optimize the division.

This optimization is only possible when the input is widened, because you need an additional free bit so that the second addition can't overflow.

The vectorizer has this context, but since we didn't want a new IFN the context should instead be derivable in the targetm.vectorize.can_special_div_by_const hook.

So my proposal to fix this and keep targetm.vectorize.can_special_div_by_const as a general divide optimization hook is to pass the actual `tree` operand0 to the hook as well, so that the hook has a bit more context.

I was hoping to use get_nonzero_bits to get the actual range of the operand, but it looks like for widening operations it still reports -1. So my suggestion is that the backend should just check that the operand is a non-constant and that the type of the operation being performed matches the precision we require.

Is this OK with you, Richards? I'll need both of you to approve.
On Mon, 30 Jan 2023, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583
>
> Tamar Christina <tnfchris at gcc dot gnu.org> changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |rguenth at gcc dot gnu.org,
>                    |                            |rsandifo at gcc dot gnu.org
>
> --- Comment #3 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> Right, so this is because in the expansion we don't have enough context to
> decide how to optimize the division.
>
> This optimization is only possible when the input is widened because you need
> an additional free bit so that the second addition can't overflow.
>
> The vectorizer has this context but since we didn't want a new IFN the
> context should instead be derivable in
> targetm.vectorize.can_special_div_by_const hook.

The vectorizer doesn't check for a widened operand zero. In fact
"can_special_div_by_const" doesn't suggest that widening is required.

If the vectorizer checks that then why do we need another operand?
> > The vectorizer has this context but since we didn't want a new IFN the
> > context should instead be derivable in
> > targetm.vectorize.can_special_div_by_const hook.
>
> The vectorizer doesn't check for a widened operand zero. In fact
> "can_special_div_by_const" doesn't suggest that widening is required.

Correct; at the moment it's been generalized so far as to be a generic div-by-constant operation. But the original sequence wanted to optimize a sequence that needs more context than what was given to it.

> If the vectorizer checks that then why do we need another operand?

Because the expansion gets triggered by intrinsics code as well:

typedef unsigned __attribute__((__vector_size__ (16))) V;

static __attribute__((__noinline__)) __attribute__((__noclone__)) V
foo (V v, unsigned short i)
{
  v /= i;
  return v;
}

It's invalid to do this particular decomposition we do on AArch64, but the hook doesn't know this. Since the new IFN wasn't accepted, we're just leaving the `/` alone in the vectorizer. But this means the expansion code, or the backend, needs to be able to check the context in which we are expanding.

We don't have an integer vector division operation, so the operation needs to be decomposed. And if we decompose it, the sequence becomes too long for combine to match and we're back where we started.
(In reply to Tamar Christina from comment #3)
> The vectorizer has this context but since we didn't want a new IFN the
> context should instead be derivable in
> targetm.vectorize.can_special_div_by_const hook.

I probably got lost in the threading, but could you point to where
the idea of using an ifn was rejected? I know there was pushback
against hard-coding a specific constant, but that doesn't prevent
the use of ifns in itself. (E.g. the gather/scatter IFNs have
constant arguments that are checked against what the target allows.)
(In reply to rsandifo@gcc.gnu.org from comment #6)
> (In reply to Tamar Christina from comment #3)
> > The vectorizer has this context but since we didn't want a new IFN the
> > context should instead be derivable in
> > targetm.vectorize.can_special_div_by_const hook.
> I probably got lost in the threading, but could you point to where
> the idea of using an ifn was rejected? I know there was pushback
> against hard-coding a specific constant, but that doesn't prevent
> the use of ifns in itself. (E.g. the gather/scatter IFNs have
> constant arguments that are checked against what the target allows.)

https://patchwork.sourceware.org/project/gcc/patch/patch-15779-tamar@arm.com/#124788 is where the original patch used an IFN.

https://patchwork.sourceware.org/project/gcc/patch/patch-15779-tamar@arm.com/#124863 is where Richi wouldn't ACK a new optab as it didn't have a physical instruction backing it on any ISA; in fairness, no one else did either, effectively stranding the change.

https://patchwork.sourceware.org/project/gcc/patch/patch-15779-tamar@arm.com/#125144 is where I pinged and got no response to the ping. After that I went on IRC and asked Richi how he'd like me to proceed. In reply I was told he would like to proceed the same way vector permutes are currently handled, with `can_perm` etc., and that's where the patch thread picks up back on the ML.
(In reply to Tamar Christina from comment #7)
> (In reply to rsandifo@gcc.gnu.org from comment #6)
> > (In reply to Tamar Christina from comment #3)
> > > The vectorizer has this context but since we didn't want a new IFN the
> > > context should instead be derivable in
> > > targetm.vectorize.can_special_div_by_const hook.
> > I probably got lost in the threading, but could you point to where
> > the idea of using an ifn was rejected? I know there was pushback
> > against hard-coding a specific constant, but that doesn't prevent
> > the use of ifns in itself. (E.g. the gather/scatter IFNs have
> > constant arguments that are checked against what the target allows.)
>
> https://patchwork.sourceware.org/project/gcc/patch/patch-15779-tamar@arm.com/
> #124788 is where the original patch used an IFN.
> https://patchwork.sourceware.org/project/gcc/patch/patch-15779-tamar@arm.com/
> #124863 where Richi wouldn't ACK a new optab as it didn't have a physical
> instruction backing it on any ISA, and in fairness, no one else did either
> effectively stranding the change.

Yep. The other option would have been to expose whatever special instruction is used in the end as an IFN and rewrite the division during vectorization "properly", either via some pattern or by doing the expansion when emitting the vectorized division instruction. In hindsight that might have been the better option, but then ...

> https://patchwork.sourceware.org/project/gcc/patch/patch-15779-tamar@arm.com/
> #125144 where I pinged and got no response to the ping. After which I went
> on IRC and asked Richi how he'd like me to proceed. In reply to this I was
> instructed he would like to proceed the same way vector permutes are
> currently handled with `can_perm` etc. and that's where the patch thread
> picks off back on the ML.

... this looked simpler and also reasonable. Only that it breaks now.
Are we sure this is a vectoriser vs. C vectors thing? The equivalent
vectoriser test:

void __attribute__((noipa))
f (unsigned *v)
{
  for (int i = 0; i < 4; ++i)
    v[i] /= 0xffff;
}

int
main (void)
{
  unsigned v[] = { 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff };
  f (v);
  for (unsigned i = 0; i < sizeof (v) / sizeof (v[0]); i++)
    if (v[i] != 0x00010001)
      __builtin_abort ();
  return 0;
}

fails in the same way for me (compiled at -O3).
(In reply to rsandifo@gcc.gnu.org from comment #9)
> Are we sure this is a vectoriser vs. C vectors thing?

It's not; the issue we're debating is how to fix it.

As Richi pointed out in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583#c4 we can fix the vectorizer case by adding a guard around the recog code in the vectorizer.

But that only fixes the vectorizer, and Richi asked why that wasn't enough, to which I pointed out that intrinsics code hits it as well: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583#c5

This is why I mentioned that there are two ways to fix this:

1. We make the hook less general, and add checks in every place it is called.
2. We keep the hook general and pass enough context to the hook for it to decide how to optimize the division.

So it's not intrinsics-only, but a vectorizer-only fix won't fix intrinsics.
On Tue, 31 Jan 2023, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583
>
> --- Comment #10 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> (In reply to rsandifo@gcc.gnu.org from comment #9)
> > Are we sure this is a vectoriser vs. C vectors thing?
>
> it's not, the issue we're debating is how to fix it.
>
> As Richi pointed out https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583#c4 we
> can fix the vectorizer case by adding a guard around the recog code in the
> vectorizer.
>
> But that only fixes the vectorizer, and Richi asked why that wasn't enough, to
> which I pointed out it's because intrinsics code hits it as well.
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583#c5
>
> This is why I mentioned that there are two ways to fix this:
>
> 1. We make the hook less general, and add checks in every place it is called.
> 2. We keep the hook general and pass enough context to the hook for it to
>    decide how to optimize the division.

I don't think passing in for example the tree operand 0 helps; the
target apparently wants to specialize not just by the constant divided
by, but also by the number divided (of which we already know the type).
Do you plan to do SSA def pattern matching in the hook for this at
expand time?

As said, recognizing the special case and open-coding the full division
in the vectorizer (or possibly even from RTL expansion) would be
another option here.
(In reply to rguenther@suse.de from comment #11)
> I don't think passing in for example the tree operand 0 helps, the
> target appearantly wants to not specialize by the constant divided by
> but instead also by the number divided (of we already know its type).

The target wants to specialize division by a specific constant. It just happens that there is an additional specialization when certain bits are known to be free in operand0.

> Do you plan to do SSA def pattern matching in the hook for this at
> expand time?

My intention was to only look at whether the DEF_STMT is a gimple assign and whether its operands are half the precision of the result. This catches *w, *>> and any other such case. It doesn't need to look further up, because it's harder to track non-zero bits across different math operations, and this is enough for the use case we're interested in.

> As said, recognizing the special case and open-coding the full divison
> in the vectorizer (or possibly even from RTL expansion) would be
> another option here.

Open-coding in the vectorizer will require 4 new IFNs and 4 new optabs, as it requires 4 different instructions: 2 for NEON and 2 for SVE, which are semantically different since permutes work differently across SVE and NEON. So if you're happy with that I will do that.

> (or possibly even from RTL expansion) would be another option here.

I don't really follow this one. The open coding at expand time is what I was trying to do before. The issue is that there's no longer a single statement if we don't stop the vectorizer from decomposing it into shifts and multiplies. As long as the vectorizer does this decomposition, there's not much to do at expand time. Additionally, the number of widening operations after vectorization is large, so semantically matching them with the decomposed shift seems impossible.

But those are only my thoughts :) ultimately I'll do the approach you two want.
If you want me to revert to using IFNs I can do that.
OK, hopefully I understand now. Sorry for being slow. But what specific constraints do we want to apply to the optimisation?

(In reply to Tamar Christina from comment #3)
> Right, so this is because in the expansion we don't have enough context to
> decide how to optimize the division.
>
> This optimization is only possible when the input is widened because you
> need an additional free bit so that the second addition can't overflow.
>
> The vectorizer has this context but since we didn't want a new IFN the
> context should instead be derivable in
> targetm.vectorize.can_special_div_by_const hook.
>
> So my proposal to fix this and keep
> targetm.vectorize.can_special_div_by_const as a general divide optimization
> hook is to pass the actual `tree` operand0 as well to the hook such that the
> hook has a bit more context.
>
> I was hoping to use get_nonzero_bits to get what the actual range of the
> operand is. But it looks like for widening operations it still reports -1.

The original motivating example was:

void
draw_bitmap1 (uint8_t *restrict pixel, uint8_t level, int n)
{
  for (int i = 0; i < (n & -16); i += 1)
    pixel[i] = (pixel[i] * level) / 0xff;
}

where we then do a 16-bit division. But since level and pixel[i] are unconstrained, the maximum value of pixel[i] * level is 0xfe01. There's no free bit in that sense. It seems more that the range of the dividend is [0, N*N] for a divisor of N.

If that's the condition we want to test for, it seems like something we need to check in the vectoriser rather than the hook. And it's not something we can easily do in the vector form, since we don't track ranges for vectors (AFAIK).
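The range arithmetic behind the claim above can be checked directly. This is my own sketch (the function name is mine, not from the patch): the widened product of two uint8_t values is bounded by 255 * 255 = 0xFE01, i.e. [0, N*N] for divisor N = 0xff. Note that 0xFE01 already has bit 15 set, so there is no literally free top bit; the slack is in the value range, which is what keeps the second addition of the division rewrite from overflowing 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: does a 16-bit dividend bounded by MAX_DIVIDEND leave enough
   headroom for the +0x101 step of the division rewrite discussed in
   this thread?  (My own check, not code from the patch.)  */
static int
dividend_range_ok (uint16_t max_dividend)
{
  return (uint32_t) max_dividend + 0x101 <= 0xFFFF;
}
```

Under this model, the widened-product bound 0xFE01 passes while the full 16-bit range 0xFFFF does not, which matches the distinction being drawn in the comment above.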
On Tue, 31 Jan 2023, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583
>
> --- Comment #12 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #11)
...
> > As said, recognizing the special case and open-coding the full divison
> > in the vectorizer (or possibly even from RTL expansion) would be
> > another option here.
>
> open-coding in the vectorizer will require 4 new IFNs and 4 new optabs as it
> requires 4 different instructions. 2 for neon, 2 for SVE, which are
> semantically different as permutes work different across SVE and NEON. So if
> you're happy with that I will do that.

You probably did so elsewhere some time ago, but what exactly are those
four instructions? (pointers to specifications appreciated)
> OK, hopefully I understand now. Sorry for being slow.

Not at all, sorry if it came across a bit cranky; it wasn't meant that way!

> If that's the condition we want to test for, it seems like something
> we need to check in the vectoriser rather than the hook. And it's
> not something we can easily do in the vector form, since we don't
> track ranges for vectors (AFAIK).

Ack, that also tracks with what I tried before; we indeed don't track ranges for vector ops. The general case can still be handled slightly better (I think) but it doesn't become as clear of a win as this one.

> You probably did so elsewhere some time ago, but what exactly are those
> four instructions? (pointers to specifications appreciated)

For NEON we use:
https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/ADDHN--ADDHN2--Add-returning-High-Narrow-
https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/UADDW--UADDW2--Unsigned-Add-Wide-

In that order, and for SVE we use two:
https://developer.arm.com/documentation/ddi0602/2022-12/SVE-Instructions/ADDHNB--Add-narrow-high-part--bottom--

The difference between the SVE and the NEON versions is that the SVE version does even/odd and the NEON version the typical first-N, last-N element-wise operation. At the moment the vectorizer doesn't know anything at all about SVE's even and odd instructions. But if we create B we should create T as well.
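The per-element behaviour of the two NEON instructions linked above can be modelled in scalar C. This is my reading of the Arm pseudocode at esize = 8 (16-bit inputs narrowing to 8 bits), chosen for brevity; the function names are mine and the model is a sketch, not authoritative:

```c
#include <assert.h>
#include <stdint.h>

/* ADDHN (per element, my model): add two 2*esize-bit values and keep
   bits [2*esize-1:esize] of the sum, i.e. the high narrow half.  */
static uint8_t
addhn_elem (uint16_t a, uint16_t b)
{
  return (uint8_t) (((uint32_t) a + b) >> 8);
}

/* UADDW (per element, my model): add a zero-extended esize-bit value
   to a 2*esize-bit value, producing a 2*esize-bit result.  */
static uint16_t
uaddw_elem (uint16_t a, uint8_t b)
{
  return (uint16_t) (a + b);
}
```

Chaining the two (addhn then uaddw, followed by a final shift) is the shape of the division sequence discussed later in this thread.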
On Tue, 31 Jan 2023, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583
>
> --- Comment #15 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> > OK, hopefully I understand now. Sorry for being slow.
>
> Not at all, Sorry if it came across a bit cranky, it wasn't meant that way!
>
> > If that's the condition we want to test for, it seems like something
> > we need to check in the vectoriser rather than the hook. And it's
> > not something we can easily do in the vector form, since we don't
> > track ranges for vectors (AFAIK).
>
> Ack, that also tracks with what I tried before, we don't indeed track ranges
> for vector ops. The general case can still be handled slightly better (I think)
> but it doesn't become as clear of a win as this one.
>
> > You probably did so elsewhere some time ago, but what exactly are those
> > four instructions? (pointers to specifications appreciated)
>
> For NEON we use:
> https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/ADDHN--ADDHN2--Add-returning-High-Narrow-

so that's an add + pack high

> https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/UADDW--UADDW2--Unsigned-Add-Wide-

and that unpacks (zero-extends) the high/low part of one operand of an add.

I wonder, if we'd open-code the pack / unpack and use a regular add, whether
combine can synthesize uaddw and addhn? The pack and unpack would be
vec_perms on GIMPLE (plus V_C_E).

> In that order, and for SVE we use two
> https://developer.arm.com/documentation/ddi0602/2022-12/SVE-Instructions/ADDHNB--Add-narrow-high-part--bottom--

probably similar.

So the difficulty here will be to decide whether that's in the end
better than what the pattern handling code does now, right? Because
I think most targets will be able to do the above, but lacking the
special adds it will be slower because of the extra packing/unpacking?
That said, can we possibly do just that costing (would be a first in the pattern code I guess) with a target hook? Or add optabs for the addh operations so we can query support?
> > Ack, that also tracks with what I tried before, we don't indeed track ranges
> > for vector ops. The general case can still be handled slightly better (I think)
> > but it doesn't become as clear of a win as this one.
>
> > > You probably did so elsewhere some time ago, but what exactly are those
> > > four instructions? (pointers to specifications appreciated)
>
> > For NEON we use:
> > https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/ADDHN--ADDHN2--Add-returning-High-Narrow-
>
> so thats a add + pack high

Yes, though with no overflow: the addition is done in twice the precision of the original type. So it's more a widening add + pack high, which narrows it back and zero-extends.

> > https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/UADDW--UADDW2--Unsigned-Add-Wide-
>
> and that unpacks (zero-extends) the high/low part of one operand of an add
>
> I wonder if we'd open-code the pack / unpack and use regular add whether
> combine can synthesize uaddw and addhn? The pack and unpack would be
> vec_perms on GIMPLE (plus V_C_E).

I don't think so for addhn, because it wouldn't truncate the top bits; it truncates the bottom bits. The instruction does

  element1 = Elem[operand1, e, 2*esize];
  element2 = Elem[operand2, e, 2*esize];

so it widens on input.

> So the difficulty here will be to decide whether that's in the end
> better than what the pattern handling code does now, right? Because
> I think most targets will be able to do the above but lacking the
> special adds it will be slower because of the extra packing/unpacking?
>
> That said, can we possibly do just that costing (would be a first in
> the pattern code I guess) with a target hook? Or add optabs for
> the addh operations so we can query support?

We could; the alternative wouldn't be correct for costing, I think. If we generate *, +, vec_perm, that's going to be more expensive.
On Tue, 31 Jan 2023, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583
>
> --- Comment #18 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> > > Ack, that also tracks with what I tried before, we don't indeed track ranges
> > > for vector ops. The general case can still be handled slightly better (I think)
> > > but it doesn't become as clear of a win as this one.
> >
> > > > You probably did so elsewhere some time ago, but what exactly are those
> > > > four instructions? (pointers to specifications appreciated)
> >
> > > For NEON we use:
> > > https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/ADDHN--ADDHN2--Add-returning-High-Narrow-
> >
> > so thats a add + pack high
>
> Yes, though with no overflow, the addition is done in twice the precision of
> the original type. So it's more a widening add + pack high which narrows it
> back and zero extends.
>
> > > https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/UADDW--UADDW2--Unsigned-Add-Wide-
> >
> > and that unpacks (zero-extends) the high/low part of one operand of an add
> >
> > I wonder if we'd open-code the pack / unpack and use regular add whether
> > combine can synthesize uaddw and addhn? The pack and unpack would be
> > vec_perms on GIMPLE (plus V_C_E).
>
> I don't think so for addhn, because it wouldn't truncate the top bits, it
> truncates the bottom bits.
>
> The instruction does
>
>   element1 = Elem[operand1, e, 2*esize];
>   element2 = Elem[operand2, e, 2*esize];
>
> So it widens on input.

OK, so that's an ADD_HIGHPART_EXPR then? Though the highpart of an
add is only a single bit, isn't it? For scalar you'd use the
carry bit here and instructions like adc to consume it. Is addhn
to do such a thing on vectors?

When writing generic vector code, is combine able to synthesize addhn
from widen, plus, pack-high?
As said in the old discussion, I'm not opposed to adding new IFNs, but I'd like to see useful building blocks (that ideally map to ISAs) instead of IFN-for-complex-pattern-X.

The alternative way was to improve division expansion in general, which is the can_special_div_by_const thing, but we do not seem to be able to capture the requirements correctly here.

> > So the difficulty here will be to decide whether that's in the end
> > better than what the pattern handling code does now, right? Because
> > I think most targets will be able to do the above but lacking the
> > special adds it will be slower because of the extra packing/unpacking?
> >
> > That said, can we possibly do just that costing (would be a first in
> > the pattern code I guess) with a target hook? Or add optabs for
> > the addh operations so we can query support?
>
> We could, the alternative wouldn't be correct for costing I think.. if we
> generate *+ , vec_perm that's gonna be more expensive.

Well, the target cost model can always detect such patterns ...

But sure, using the actual ISA is preferable for costing and also to avoid "breaking" the combination by later "optimization". OTOH at least some basic constant folding for all such ISA IFNs is required, to avoid regressing cases where complete unrolling would later allow constant evaluation but vectorizing first breaks that.
> > I don't think so for addhn, because it wouldn't truncate the top bits, it
> > truncates the bottom bits.
> >
> > The instruction does
> > element1 = Elem[operand1, e, 2*esize];
> > element2 = Elem[operand2, e, 2*esize];
> >
> > So it widens on input.
>
> OK, so that's an ADD_HIGHPART_EXPR then? Though the highpart of an
> add is only a single bit, isn't it? For scalar you'd use the
> carry bit here and instructions like adc to consume it. Is addhn
> to do such thing on vectors?

Sorry, it's not, but that's because I read the docs wrong. This instruction doesn't widen on the addition but narrows on the result. The esize for the instruction is determined by the output, not the input.

To unconfuse everyone, let's just explain what this is doing. The optimization is to try to optimize unsigned division by a mask formed by setting all bits in half the size of the input, i.e. for a divide of a short, the division needs to be by 0xff.

This works by decomposing the unsigned division x / 0xff into (x + ((x + 257) >> 8)) >> 8. This is correct for all ranges as long as the operations are done in at least double the precision of x. If the operations are done in the same precision as x, then the only valid ranges are those where x + 257 does not cause an overflow. This is why the maximum range of a widened operation is fine, because 0xFE01 + 0x101 = 0xFF02 and so does not overflow.

In this case we can avoid the widening and do the operation in a smaller precision. The two vector instructions we're using do this, i.e. addhn == (x + 257) >> 8 and uaddw == addhn_res + x. Finally we emit a >> 8 at the end. This shift later gets removed because a widening operation will do the same sequence for the _lo and _hi pairs; since we are doing two logical right shifts we can just permute the values out.

Hope this clears it up for you both.

> When writing generic vector code is combine able to synthesize addhn
> from widen, plus, pack-high?

No, it's currently unspecs.
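The decomposition and its range requirement described above can be checked with a small self-contained model. This is my own sketch in plain C, not GCC's implementation; the narrow variant models the addhn/uaddw shape with the intermediate held in 8 bits:

```c
#include <assert.h>
#include <stdint.h>

/* Wide variant: (x + ((x + 257) >> 8)) >> 8 with spare precision.
   Equals x / 0xff for every 16-bit x.  */
static uint32_t
div255_wide (uint32_t x)
{
  return (x + ((x + 257) >> 8)) >> 8;
}

/* Narrow variant modelling the addhn/uaddw sequence: the intermediate
   (x + 257) >> 8 is narrowed to 8 bits (addhn), added back to x in
   16 bits (uaddw), then shifted.  Only valid when the dividend stays
   within the widened-multiply range, i.e. x + 257 keeps its
   significance within 16 bits.  */
static uint16_t
div255_narrow (uint16_t x)
{
  uint8_t t = (uint8_t) ((x + 257u) >> 8);     /* addhn: narrow high part */
  return (uint16_t) ((uint16_t) (x + t) >> 8); /* uaddw + final shift */
}
```

Exhaustively, the wide form matches x / 0xff for all 16-bit x, while the narrow form is correct at the maximum widened product 0xFE01 but wrong at 0xFFFF, which mirrors the miscompile this PR is about.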
We could recognize add, shift and narrow as it though.

> But sure, using the actual ISA is preferable for costing and also to
> avoid "breaking" the combination by later "optimization". OTOH at least
> some basic constant folding for all such ISA IFNs is required to
> avoid regressing cases where complete unrolling later allows constant
> evaluation but first vectorizing breaking that.

So I suppose one way to open-code this in the vectorizer would be to use the hook to tell the vectorizer to decompose the operation that's within the valid range to (x + ((x + 257) >> 8)) >> 8 (without any intermediate promotions), fudge the costing for it, and use combine to recognize the sequence? It's a 4-instruction sequence, so it should be able to.

I also spent the day wondering if I could come up with a sequence that's correct in general and that I can use combine to optimize further in case a widening operation is found. Unfortunately, while I can create a sequence:

uint32x4_t
foo (uint32x4_t v)
{
  uint32x4_t cst = vreinterpretq_u32_u16 (vdupq_n_u16 (0x1));
  uint64x2_t l1 = vaddl_u32 (vget_low_u32 (cst), vget_low_u32 (v));
  uint64x2_t l2 = vaddl_high_u32 (cst, v);
  uint32x2_t t1 = vshrn_n_u64 (l1, 16);
  uint32x4_t t2 = vshrn_high_n_u64 (t1, l2, 16);
  uint64x2_t l3 = vaddl_u32 (vget_low_u32 (t2), vget_low_u32 (v));
  uint64x2_t l4 = vaddl_high_u32 (t2, v);
  uint32x2_t r1 = vshrn_n_u64 (l3, 16);
  uint32x4_t r2 = vshrn_high_n_u64 (r1, l4, 16);
  return r2;
}

this isn't more efficient than what's there before, though it is easier to match in combine. Part of the problem is that if the addition produces 33 bits of significance then you do need a shift and can't remove them with permutes.

So the core of the optimization relies on the range of the inputs, and it does need to be done pre-vectorization.
> OK, so that's an ADD_HIGHPART_EXPR then? Though the highpart of an
> add is only a single bit, isn't it? For scalar you'd use the
> carry bit here and instructions like adc to consume it. Is addhn
> to do such thing on vectors?

So I think this is the only new IFN we'd need. Basically we only need one representing (a + b) >> n; for certain values of n we have a single instruction, for others we reject it. So I think we can just do two of these back to back and that should work. If this is ok I can implement it today.
On Thu, 2 Feb 2023, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583
>
> --- Comment #21 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> > > OK, so that's an ADD_HIGHPART_EXPR then? Though the highpart of an
> > > add is only a single bit, isn't it? For scalar you'd use the
> > > carry bit here and instructions like adc to consume it. Is addhn
> > > to do such thing on vectors?
>
> So I think this is the only new IFN we'd need. basically we only need one
> representing (a + b) >> n, for certain values of n we have a single instruction
> for others we reject it.

But the ISA only implements n == width(a), so a true "highpart add".
So I'd add

OPTAB_D (sadd_highpart_optab, "sadd$a3_highpart")
OPTAB_D (uadd_highpart_optab, "uadd$a3_highpart")

only? That is, it's QImode + QImode -> QImode with the carry bit in
the result (sign- or zero-extended)?
(In reply to rguenther@suse.de from comment #22)
> On Thu, 2 Feb 2023, tnfchris at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583
> >
> > --- Comment #21 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> > > > OK, so that's an ADD_HIGHPART_EXPR then? Though the highpart of an
> > > > add is only a single bit, isn't it? For scalar you'd use the
> > > > carry bit here and instructions like adc to consume it. Is addhn
> > > > to do such thing on vectors?
> >
> > So I think this is the only new IFN we'd need. basically we only need one
> > representing (a + b) >> n, for certain values of n we have a single instruction
> > for others we reject it.
>
> But the ISA only implements n == width(a), so a true "highpart add".
> So I'd add
>
> OPTAB_D (sadd_highpart_optab, "sadd$a3_highpart")
> OPTAB_D (uadd_highpart_optab, "uadd$a3_highpart")
>
> only? That is it's QImode + QImode -> QImode with the carry bit in
> the result (sign or zero-extended)?

Sure, that works I think; I'll do that then.
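The agreed-on optab semantics can be sketched as a scalar model. This is my own sketch of the "QImode + QImode -> QImode with the carry bit in the result" description above, not the committed pattern, and the function name is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Hedged scalar model of uadd<mode>3_highpart at QImode: add two 8-bit
   values and return the carry-out of the addition, zero-extended back
   into 8 bits.  This is the n == width(a) case of (a + b) >> n.  */
static uint8_t
uadd_qi3_highpart (uint8_t a, uint8_t b)
{
  return (uint8_t) (((unsigned int) a + b) >> 8);
}
```

Under this model the result is always 0 or 1, matching the observation earlier in the thread that the highpart of an add is only a single bit.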
> Sure that works I think, I'll do that then.

Just to check: I'm regtesting the patch; I assume you want me to revert the hook as well, right? Since nothing will be using it.
On Wed, 8 Feb 2023, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108583
>
> --- Comment #24 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> > Sure that works I think, I'll do that then.
>
> Just to check, I'm regtesting the patch, I assume you want me to revert the
> hook as well right? Since nothing will be using it.

Sure
The master branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:2246d576f922bae3629da0fe1dbfcc6ff06769ad

commit r13-6616-g2246d576f922bae3629da0fe1dbfcc6ff06769ad
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Sun Mar 12 18:39:33 2023 +0000

    middle-end: Revert can_special_div_by_const changes [PR108583]

    This reverts the changes for the CAN_SPECIAL_DIV_BY_CONST hook.

    gcc/ChangeLog:

            PR target/108583
            * doc/tm.texi (TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Remove.
            * doc/tm.texi.in: Likewise.
            * explow.cc (round_push, align_dynamic_address): Revert previous patch.
            * expmed.cc (expand_divmod): Likewise.
            * expmed.h (expand_divmod): Likewise.
            * expr.cc (force_operand, expand_expr_divmod): Likewise.
            * optabs.cc (expand_doubleword_mod, expand_doubleword_divmod): Likewise.
            * target.def (can_special_div_by_const): Remove.
            * target.h: Remove tree-core.h include.
            * targhooks.cc (default_can_special_div_by_const): Remove.
            * targhooks.h (default_can_special_div_by_const): Remove.
            * tree-vect-generic.cc (expand_vector_operation): Remove hook.
            * tree-vect-patterns.cc (vect_recog_divmod_pattern): Remove hook.
            * tree-vect-stmts.cc (vectorizable_operation): Remove hook.
The master branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:03c6ba86757f0684c5419c90651106900f5ecb5a

commit r13-6617-g03c6ba86757f0684c5419c90651106900f5ecb5a
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Sun Mar 12 18:40:12 2023 +0000

ranger: Add range-ops for widen addition and widen multiplication [PR108583]

This adds range-ops for widening addition and widening multiplication.

I couldn't figure out how to write a test for this.  It looks like there are
self tests, but not a way to write standalone ones?  I did create testcases
in patch 3/4, which test the end result.

gcc/ChangeLog:

        PR target/108583
        * gimple-range-op.h (gimple_range_op_handler): Add maybe_non_standard.
        * gimple-range-op.cc
        (gimple_range_op_handler::gimple_range_op_handler): Use it.
        (gimple_range_op_handler::maybe_non_standard): New.
        * range-op.cc (class operator_widen_plus_signed,
        operator_widen_plus_signed::wi_fold,
        class operator_widen_plus_unsigned,
        operator_widen_plus_unsigned::wi_fold,
        class operator_widen_mult_signed,
        operator_widen_mult_signed::wi_fold,
        class operator_widen_mult_unsigned,
        operator_widen_mult_unsigned::wi_fold,
        ptr_op_widen_mult_signed, ptr_op_widen_mult_unsigned,
        ptr_op_widen_plus_signed, ptr_op_widen_plus_unsigned): New.
        * range-op.h (ptr_op_widen_mult_signed, ptr_op_widen_mult_unsigned,
        ptr_op_widen_plus_signed, ptr_op_widen_plus_unsigned): New.

Co-Authored-By: Andrew MacLeod <amacleod@redhat.com>
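A toy model of why these widening range-ops can be exact (names and types here are mine, not ranger's API): a widening unsigned addition of two 8-bit ranges produces a 16-bit result, so the folded bounds lo1+lo2 .. hi1+hi2 can never wrap.

```c
#include <stdint.h>

/* Illustrative stand-in for wi_fold on a widening unsigned addition of
   8-bit inputs.  Because the result type has twice the precision, the
   bound arithmetic cannot overflow and the folded range is exact.  */
struct urange16 { uint16_t lo, hi; };

static struct urange16
widen_plus_fold (uint8_t lo1, uint8_t hi1, uint8_t lo2, uint8_t hi2)
{
  struct urange16 r = { (uint16_t) (lo1 + lo2), (uint16_t) (hi1 + hi2) };
  return r;
}
```

Even for full-range inputs the result tops out at 510, far below 65535 — the kind of headroom that the later division patch needs range information to prove.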
The master branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:0b3c630fcc44063a61f6131af48a4171b1de2b37

commit r13-6618-g0b3c630fcc44063a61f6131af48a4171b1de2b37
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Sun Mar 12 18:40:50 2023 +0000

middle-end: don't form FMAs when multiplication is not single use. [PR108583]

The testcase

typedef unsigned int vec __attribute__((vector_size(32)));
vec
f3 (vec a, vec b, vec c)
{
  vec d = a * b;
  return d + ((c + d) >> 1);
}

shows a case where we don't want to form an FMA due to the MUL not being
single use.  In this case, to form an FMA we would have to redo the MUL,
since we no longer have it to share.  As such, forming an FMA here would be
a de-optimization.

gcc/ChangeLog:

        PR target/108583
        * tree-ssa-math-opts.cc (convert_mult_to_fma): Inhibit FMA in case
        not single use.

gcc/testsuite/ChangeLog:

        PR target/108583
        * gcc.dg/mla_1.c: New test.

Co-Authored-By: Richard Sandiford <richard.sandiford@arm.com>
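To spell out why fusing loses here: the multiply result d has a second use, so an FMA would not replace the multiply, it would duplicate it. A scalar sketch of the two shapes (illustrative only, not GCC's internal representation):

```c
#include <stdint.h>

/* What the testcase computes per lane: d = a * b has two uses, so one
   multiply is shared between them.  */
static uint32_t
lane_no_fma (uint32_t a, uint32_t b, uint32_t c)
{
  uint32_t d = a * b;              /* 1 multiply */
  return d + ((c + d) >> 1);       /* both uses read the same d */
}

/* If "c + d" were fused into an FMA-style op, d would no longer exist
   as a shared value, so the outer add has to recompute a * b.  */
static uint32_t
lane_with_fma (uint32_t a, uint32_t b, uint32_t c)
{
  uint32_t fused = c + a * b;      /* multiply #1 (inside the "FMA") */
  return a * b + (fused >> 1);     /* multiply #2, redone */
}
```

Both shapes return the same values; the second simply costs an extra multiply, which is why convert_mult_to_fma now checks that the multiplication is single use.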
The master branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:81fd62d1378b7ddc1fa0967cbddcdcdcdd2d8d8c

commit r13-6619-g81fd62d1378b7ddc1fa0967cbddcdcdcdd2d8d8c
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Sun Mar 12 18:42:04 2023 +0000

middle-end: Implement preferred_div_as_shifts_over_mult [PR108583]

This implements a hook, preferred_div_as_shifts_over_mult, that indicates
whether a target prefers that the vectorizer decompose division into shifts
rather than multiplication when possible.

In order to be able to use this we need to check whether the current
precision has enough bits to do the operation without any of the additions
overflowing.  We use range information to determine this and only do the
operation if we're sure an overflow won't occur.  This now uses ranger to do
the range check.

This seems to work better than vect_get_range_info, which uses range_query,
but I have not switched the interface of vect_get_range_info over in this PR
fix.  As Andy said before, initializing a ranger instance is cheap but not
free, and if the intention is to call it often during a pass it should be
instantiated at pass startup and passed along to the places that need it.
That is a big refactoring and doesn't seem right to do in this PR, but we
should do it in GCC 14.  Currently we only instantiate ranger after a long
series of much cheaper checks.

gcc/ChangeLog:

        PR target/108583
        * target.def (preferred_div_as_shifts_over_mult): New.
        * doc/tm.texi.in: Document it.
        * doc/tm.texi: Regenerate.
        * targhooks.cc (default_preferred_div_as_shifts_over_mult): New.
        * targhooks.h (default_preferred_div_as_shifts_over_mult): New.
        * tree-vect-patterns.cc (vect_recog_divmod_pattern): Use it.

gcc/testsuite/ChangeLog:

        PR target/108583
        * gcc.dg/vect/vect-div-bitmask-4.c: New test.
        * gcc.dg/vect/vect-div-bitmask-5.c: New test.
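To make the overflow condition concrete, here is one standard shifts-and-adds form of division by 2^n - 1 (a sketch for the uint16_t / 255 case; not necessarily the exact sequence GCC emits). It is exact whenever the input has at least one free bit — which is what the range check has to prove — and wrong for the single full-range value 0xFFFF, the shape of the original miscompile:

```c
#include <stdint.h>

/* x / 255 as shifts and adds.  Exact for every x <= 0xFFFE; wrong for
   x == 0xFFFF (= 255 * 257), where it yields 256 instead of 257.
   The intermediate x + (x >> 8) + 1 can need 17 bits, hence the 32-bit
   temporary -- in a 16-bit vector lane that addition would overflow
   unless the input is known to be narrower than full range.  */
static uint16_t
udiv255 (uint16_t x)
{
  return (uint16_t) (((uint32_t) x + (x >> 8) + 1) >> 8);
}

/* Exhaustively verify the exactness claim below the top value.  */
static int
udiv255_exact_below_max (void)
{
  uint32_t x;
  for (x = 0; x <= 0xFFFE; x++)
    if (udiv255 ((uint16_t) x) != x / 255)
      return 0;
  return 1;
}
```

Inputs produced by a widening operation (e.g. a u8 x u8 multiply, at most 255 * 255 = 65025) always satisfy the bound, which is why the optimization is safe in the vectorizer's widened context but not for an arbitrary full-range operand.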
The master branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:f23dc726875c26f2c38dfded453aa9beba0b9be9

commit r13-6620-gf23dc726875c26f2c38dfded453aa9beba0b9be9
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Sun Mar 12 18:42:59 2023 +0000

AArch64: Update div-bitmask to implement new optab instead of target hook
[PR108583]

This replaces the custom division hook with an implementation through
add_highpart.  For NEON we implement the add highpart (addition + extraction
of the upper half of the register, in the same precision) as ADD + LSR.

This representation allows us to easily optimize the sequence using existing
sequences.  This gets us a pretty decent sequence using SRA:

        umull   v1.8h, v0.8b, v3.8b
        umull2  v0.8h, v0.16b, v3.16b
        add     v5.8h, v1.8h, v2.8h
        add     v4.8h, v0.8h, v2.8h
        usra    v1.8h, v5.8h, 8
        usra    v0.8h, v4.8h, 8
        uzp2    v1.16b, v1.16b, v0.16b

To get the most optimal sequence, however, we match (a + ((b + c) >> n)),
where n is half the precision of the mode of the operation, into addhn +
uaddw, which is a generally good optimization on its own and gets us back to:

.L4:
        ldr     q0, [x3]
        umull   v1.8h, v0.8b, v5.8b
        umull2  v0.8h, v0.16b, v5.16b
        addhn   v3.8b, v1.8h, v4.8h
        addhn   v2.8b, v0.8h, v4.8h
        uaddw   v1.8h, v1.8h, v3.8b
        uaddw   v0.8h, v0.8h, v2.8b
        uzp2    v1.16b, v1.16b, v0.16b
        str     q1, [x3], 16
        cmp     x3, x4
        bne     .L4

For SVE2 we optimize the initial sequence to the same ADD + LSR, which gets
us:

.L3:
        ld1b    z0.h, p0/z, [x0, x3]
        mul     z0.h, p1/m, z0.h, z2.h
        add     z1.h, z0.h, z3.h
        usra    z0.h, z1.h, #8
        lsr     z0.h, z0.h, #8
        st1b    z0.h, p0, [x0, x3]
        inch    x3
        whilelo p0.h, w3, w2
        b.any   .L3
.L1:
        ret

and to get the most optimal sequence I match (a + b) >> n (same constraint on
n) to addhnb, which gets us to:

.L3:
        ld1b    z0.h, p0/z, [x0, x3]
        mul     z0.h, p1/m, z0.h, z2.h
        addhnb  z1.b, z0.h, z3.h
        addhnb  z0.b, z0.h, z1.h
        st1b    z0.h, p0, [x0, x3]
        inch    x3
        whilelo p0.h, w3, w2
        b.any   .L3

There are multiple RTL representations possible for these optimizations; I
did not represent them using a zero_extend because we seem very inconsistent
about this in the backend.  Since they are unspecs, we won't match them from
vector ops anyway.  I figured maintainers would prefer this, but my
maintainer ouija board is still out for repairs :)

There are no new tests, as new correctness tests were added to the mid-end
and the existing codegen tests for this already exist.

gcc/ChangeLog:

        PR target/108583
        * config/aarch64/aarch64-simd.md (@aarch64_bitmask_udiv<mode>3):
        Remove.
        (*bitmask_shift_plus<mode>): New.
        * config/aarch64/aarch64-sve2.md (*bitmask_shift_plus<mode>): New.
        (@aarch64_bitmask_udiv<mode>3): Remove.
        * config/aarch64/aarch64.cc
        (aarch64_vectorize_can_special_div_by_constant,
        TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Removed.
        (TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT,
        aarch64_vectorize_preferred_div_as_shifts_over_mult): New.
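For reference, scalar models of the two lane idioms matched in this commit (my reconstruction of the semantics, with 16-bit elements narrowing to 8 bits): addhn keeps the high half of a same-width sum, and uaddw adds a zero-extended narrow value back in. Rewriting a + ((b + c) >> 8) as uaddw(a, addhn(b, c)) is only safe when b + c cannot overflow 16 bits — the same free-bit condition as in the mid-end patches.

```c
#include <stdint.h>

/* addhn: add two 16-bit lanes and keep the high 8 bits of the wrapped
   16-bit sum.  */
static uint8_t
addhn (uint16_t b, uint16_t c)
{
  return (uint8_t) ((uint16_t) (b + c) >> 8);
}

/* uaddw: zero-extend an 8-bit lane and add it to a 16-bit lane.  */
static uint16_t
uaddw (uint16_t a, uint8_t t)
{
  return (uint16_t) (a + t);
}
```

When b + c does overflow 16 bits, addhn sees only the wrapped sum, which is why the combine patterns can only fire under the range conditions the vectorizer establishes.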
Fixed, thanks for the report.