Recent patch introduced following testsuite FAILs: FAIL: gcc.target/i386/pr88531-1b.c scan-assembler-times vgatherqpd 4 FAIL: gcc.target/i386/pr88531-1b.c scan-assembler-times vmulpd 4 FAIL: gcc.target/i386/pr88531-1c.c scan-assembler-times vgatherqpd 4 FAIL: gcc.target/i386/pr88531-1c.c scan-assembler-times vmulpd 4 FAIL: gcc.target/i386/pr88531-2b.c scan-assembler-times vmulps 2 FAIL: gcc.target/i386/pr88531-2c.c scan-assembler-times vmulps 2
Needs bisection to find the offending commit.
https://gcc.gnu.org/pipermail/gcc-regression/2022-January/076174.html between commit r12-6426 and commit r12-6419
(In reply to Hongtao.liu from comment #2) > https://gcc.gnu.org/pipermail/gcc-regression/2022-January/076174.html > between commit r12-6426 and commit r12-6419 Cause by r12-6420
I think /* If the target does not support partial vectors we can shorten the number of modes to analyze for the epilogue as we know we can't pick a mode that has at least as many NUNITS as the main loop's vectorization factor, since that would imply the epilogue's vectorization factor would be at least as high as the main loop's and we would be vectorizing for more scalar iterations than there would be left. */ if (!supports_partial_vectors && maybe_ge (GET_MODE_NUNITS (vector_modes[mode_i]), first_vinfo_vf)) { is completely bogus - for -1b.c we autodetect V8SImode and first_vinfo_vf is 8 so we skip V8SImode which is OK. But then vector_modes[1] is V32QImode, it doesn't make sense to compare NUNITS of a vector mode with the VF of a loop (or NUNITS of two vector modes with different element mode). Previously we started from mode_i of the first loop or what it would consider as next mode which I suppose provided this "skipping" in case the mode array is sorted after VF (and not preference). That said, for this case we do nothing until we hit V4QImode which obviously cannot be used to vectorize a double.
Yeah I made a mistake there using the vector_mode like that, since that vector mode really only determines vector size (and vector ISA for aarch64). I am almost finished testing a patch that instead goes through the 'used_vector_modes' to find the largest element for all used vector modes, then use related_vector_mode to get the vector mode for that element with the same size as the current vector_mode[mode_i]. That would give us the lowest possible VF for that loop and vector size. Should be posting the fix soon.
On Thu, 13 Jan 2022, avieira at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103997 > > avieira at gcc dot gnu.org changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > CC| |avieira at gcc dot gnu.org > > --- Comment #5 from avieira at gcc dot gnu.org --- > Yeah I made a mistake there using the vector_mode like that, since that vector > mode really only determines vector size (and vector ISA for aarch64). > > I am almost finished testing a patch that instead goes through the > 'used_vector_modes' to find the largest element for all used vector modes, then > use related_vector_mode to get the vector mode for that element with the same > size as the current vector_mode[mode_i]. That would give us the lowest possible > VF for that loop and vector size. That's of course only accurate in case the vectorization will happen with the very same structure. But since we re-do pattern detection it might be we end up with a lower VF requirement even? Guess we can revisit that when it happens ... It does sound like a reasonable heuristic though.
Hmm thinking out loud here. As vector sizes (or ISAs) change vectorization strategies could indeed change. Best that I can think of is things like rounding, where you might need to do operations in higher precision, and some targets could potentially support instructions that widen, round and narrow again in the same instruction at some size + ISA combination and not in other, which means some would have a 'higher' element size mode in there where others don't. But that assumes the vectorizer would represent such 'widen + round + narrow' instructions in a single pattern, hiding the 'higher precision' elements. Which as far as I know don't exist right now. There may be other cases I can't think of ofc. We could always be even more conservative and only skip if the highest possible element size for the current vector size + ISA would lead to a mode with NUNITS greater or equal to the current vector mode. Or ... just never skip a mode, I don't have a good feeling for how much that would cost compile time wise though.
The master branch has been updated by Andre Simoes Dias Vieira <avieira@gcc.gnu.org>: https://gcc.gnu.org/g:f4ca0a53be18dfc7162fd5dcc1e73c4203805e14 commit r12-6740-gf4ca0a53be18dfc7162fd5dcc1e73c4203805e14 Author: Andre Vieira <andre.simoesdiasvieira@arm.com> Date: Wed Jan 19 14:11:32 2022 +0000 vect: Fix epilogue mode skipping gcc/ChangeLog: PR tree-optimization/103997 * tree-vect-loop.cc (vect_analyze_loop): Fix mode skipping for epilogue vectorization.
Compare to one commit before (ffc7f200adbdf47f14b3594d9b21855c19cf797a) commit r12-6740-gf4ca0a53be18dfc7162fd5dcc1e73c4203805e14 causes regression on AlderLake (12900K) Multi-Core 548.exchange2_r -3.48% Skylake Workstation(7920x) Single Core 538.imagick_r -2.29% 549.fotonik3d_r -3.81% With label march_native_ofast_lto and 5 iterations -march=native -Ofast -funroll-loops -flto
Hi Levy, I did a quick experiment, compiled exchange2_r with trunk and with trunk + all my epilogue and unroll vector patches reverted, with '-march=alderlake -Ofast -flto -funroll_loops' and the codegen is pretty much the same. Could it be that picking a different mode than we did before all of my patches, was a better choice? If this is the case then this is something that should be fixed by an appropriate cost-model, picking the best mode for the specific loop's epilogue. The patches I reverted were: f4ca0a53be18dfc7162fd5dcc1e73c4203805e14 7ca1582ca60dc84cc3fc46b9cda620e2a0bed1bb 016bd7523131b645bca5b5530c81ab5149922743 d3ff7420e941931d32ce2e332e7968fe67ba20af What were you using as a baseline for that last regression?
Hi Avieira The baseline was one commit before. (ffc7f200adbdf47f14b3594d9b21855c19cf797a) I'm experiencing some issue on local Vtune so I can't say which function or front/backend was affected. objdump shows some different binary, but too long to dig deep. I'll fix the Vtune and see if I can get some results back.
Right and did you happen to see a perf increase on these benchmarks with any of the patches I mentioned the hash of in the previous comment? Just to explain a bit further what I think is going on. Before my initial patches the epilogue loop analysis would start at the mode_i + 1 of the first loop, in other others, the next mode in the list of modes. After the patch (1) we started this from mode_i = 1, so the first mode after VOIDmode, this caused some ICEs if the target didn't add any, not sure about your targets, but that was fixed in (2). In patch (3) Kewen added a fix to my check for potential use of partial vectors, to check the param_vect_partial_vector_usage since that can disable partial vector even if the target supports them. So I suspect that either of these 3 patches inadvertently changed the vectorization strategy for the epilogue of some loop(s) in these benchmarks. So when I commited patch (4) f4ca0a53be18dfc7162fd5dcc1e73c4203805e14, the vectorization strategy went back to what it was previously. If this is indeed what happened then the regression you are seeing is just an indication that the original vectorization strategy was sub-optimal. This is something that should be looked at in separate and looked at as an optimization, probably by improving the cost modelling of the vectorizer for your target. Patch 1) commit d3ff7420e941931d32ce2e332e7968fe67ba20af Author: Andre Vieira <andre.simoesdiasvieira@arm.com> Date: Thu Dec 2 14:34:15 2021 +0000 [vect] Re-analyze all modes for epilogues Patch 2) commit 016bd7523131b645bca5b5530c81ab5149922743 Author: Andre Vieira <andre.simoesdiasvieira@arm.com> Date: Tue Jan 11 15:52:59 2022 +0000 [vect] PR103971, PR103977: Fix epilogue mode selection for autodetect only Patch 3) commit 6d51a9c6447bace21f860e70aed13c6cd90971bd Author: Kewen Lin <linkw@linux.ibm.com> Date: Fri Jan 14 07:02:10 2022 -0600 vect: Check partial vector param for supports_partial_vectors [PR104015] Patch 4) commit f4ca0a53be18dfc7162fd5dcc1e73c4203805e14 Author: Andre Vieira <andre.simoesdiasvieira@arm.com> Date: Wed Jan 19 14:11:32 2022 +0000 vect: Fix epilogue mode skipping
The change will basically re-enable epilogue vectorization which was disabled before on accident. At least for exchange I know it will not benefit from that because of low iteration counts that are not statically known. So the result was to be expected. Btw, this bug is now fixed.
Hi Avieira and Richard I checked the data for the last half month and you are right, that no real regression was caused. Thank you all for the detailed explanation.