Bug 103997 - [12 Regression] gcc.target/i386/pr88531-??.c scan-assembler-times FAILs
Summary: [12 Regression] gcc.target/i386/pr88531-??.c scan-assembler-times FAILs
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: regression
Version: 12.0
Importance: P1 normal
Target Milestone: 12.0
Assignee: Not yet assigned to anyone
URL:
Keywords: testsuite-fail
Depends on:
Blocks: 103998 104058
Reported: 2022-01-12 19:45 UTC by Uroš Bizjak
Modified: 2022-01-26 01:27 UTC
CC: 6 users

See Also:
Host:
Target: x86
Build:
Known to work:
Known to fail:
Last reconfirmed: 2022-01-13 00:00:00


Description Uroš Bizjak 2022-01-12 19:45:15 UTC
A recent patch introduced the following testsuite FAILs:

FAIL: gcc.target/i386/pr88531-1b.c scan-assembler-times vgatherqpd 4
FAIL: gcc.target/i386/pr88531-1b.c scan-assembler-times vmulpd 4
FAIL: gcc.target/i386/pr88531-1c.c scan-assembler-times vgatherqpd 4
FAIL: gcc.target/i386/pr88531-1c.c scan-assembler-times vmulpd 4
FAIL: gcc.target/i386/pr88531-2b.c scan-assembler-times vmulps 2
FAIL: gcc.target/i386/pr88531-2c.c scan-assembler-times vmulps 2
Comment 1 Uroš Bizjak 2022-01-12 19:46:14 UTC
Needs bisection to find the offending commit.
Comment 2 Hongtao.liu 2022-01-13 03:13:07 UTC
https://gcc.gnu.org/pipermail/gcc-regression/2022-January/076174.html
between commit r12-6426 and commit r12-6419
Comment 3 Hongtao.liu 2022-01-13 03:27:33 UTC
(In reply to Hongtao.liu from comment #2)
> https://gcc.gnu.org/pipermail/gcc-regression/2022-January/076174.html
> between commit r12-6426 and commit r12-6419

Caused by r12-6420
Comment 4 Richard Biener 2022-01-13 08:19:30 UTC
I think

      /* If the target does not support partial vectors we can shorten the
         number of modes to analyze for the epilogue as we know we can't pick a
         mode that has at least as many NUNITS as the main loop's vectorization
         factor, since that would imply the epilogue's vectorization factor
         would be at least as high as the main loop's and we would be
         vectorizing for more scalar iterations than there would be left.  */
      if (!supports_partial_vectors
          && maybe_ge (GET_MODE_NUNITS (vector_modes[mode_i]), first_vinfo_vf))
        {

is completely bogus - for -1b.c we autodetect V8SImode and first_vinfo_vf is 8,
so we skip V8SImode, which is OK.  But then vector_modes[1] is V32QImode, and
it doesn't make sense to compare the NUNITS of a vector mode with the VF of a loop
(or the NUNITS of two vector modes with different element modes).

Previously we started from the mode_i of the first loop, or whatever it would
consider the next mode, which I suppose provided this "skipping" in case the mode
array is sorted by VF (and not by preference).

That said, for this case we do nothing until we hit V4QImode, which obviously
cannot be used to vectorize a double.
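
For concreteness, a standalone toy (plain C++, not GCC's machmode/poly_int API; the numbers are hypothetical but mirror the -1b.c case described above) showing why comparing a mode's raw lane count against the main loop's VF goes wrong:

#include <cstdio>

int main ()
{
  const unsigned vector_bits = 256;      /* size of V8SImode / V32QImode  */
  const unsigned first_vinfo_vf = 8;     /* main loop VF (SImode elements)  */

  unsigned nunits_v32qi = vector_bits / 8;   /* 32 QImode lanes  */
  unsigned nunits_v4df  = vector_bits / 64;  /* 4 DFmode lanes  */

  /* The questionable check: 32 >= 8, so the 256-bit entry is skipped...  */
  printf ("skip V32QImode? %d\n", nunits_v32qi >= first_vinfo_vf);
  /* ...even though the same 256-bit size would vectorize a double loop
     with VF 4 < 8, i.e. it is still a valid epilogue candidate.  */
  printf ("V4DFmode VF %u < main loop VF %u\n", nunits_v4df, first_vinfo_vf);
  return 0;
}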
Comment 5 avieira 2022-01-13 09:17:06 UTC
Yeah, I made a mistake there using the vector_mode like that, since that vector mode really only determines the vector size (and the vector ISA for aarch64).

I am almost finished testing a patch that instead goes through 'used_vector_modes' to find the largest element across all used vector modes, then uses related_vector_mode to get the vector mode for that element with the same size as the current vector_modes[mode_i]. That would give us the lowest possible VF for that loop and vector size.

Should be posting the fix soon.
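
A minimal, self-contained sketch of that idea (hypothetical helper names and plain C++ containers, not the actual patch or GCC's internal API): the lowest VF a given vector size can achieve is bounded by the widest element type the loop uses, and only then is it safe to skip the candidate.

#include <algorithm>
#include <vector>

/* Return true if a candidate vector size (in bytes) can be skipped for the
   epilogue, i.e. even its lowest achievable VF is >= the main loop's VF.  */
static bool
skip_epilogue_candidate_p (const std::vector<unsigned> &used_elem_bytes,
                           unsigned candidate_vector_bytes,
                           unsigned main_loop_vf)
{
  unsigned widest = *std::max_element (used_elem_bytes.begin (),
                                       used_elem_bytes.end ());
  unsigned lowest_possible_vf = candidate_vector_bytes / widest;
  return lowest_possible_vf >= main_loop_vf;
}

int main ()
{
  /* A loop using 4-byte ints and 8-byte doubles, main loop VF 8:
     a 32-byte candidate still allows VF 32/8 = 4 < 8, so keep it.  */
  std::vector<unsigned> elem_bytes = { 4, 8 };
  return skip_epilogue_candidate_p (elem_bytes, 32, 8) ? 1 : 0;
}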
Comment 6 rguenther@suse.de 2022-01-13 10:14:48 UTC
On Thu, 13 Jan 2022, avieira at gcc dot gnu.org wrote:

> --- Comment #5 from avieira at gcc dot gnu.org ---
> Yeah I made a mistake there using the vector_mode like that, since that vector
> mode really only determines vector size (and vector ISA for aarch64).
> 
> I am almost finished testing a patch that instead goes through the
> 'used_vector_modes' to find the largest element for all used vector modes, then
> use related_vector_mode to get the vector mode for that element with the same
> size as the current vector_mode[mode_i]. That would give us the lowest possible
> VF for that loop and vector size.

That's of course only accurate in case the vectorization will happen
with the very same structure.  But since we re-do pattern detection,
it might be that we end up with an even lower VF requirement?  Guess we can
revisit that when it happens ...

It does sound like a reasonable heuristic though.
Comment 7 avieira 2022-01-13 10:42:59 UTC
Hmm, thinking out loud here. As vector sizes (or ISAs) change, vectorization strategies could indeed change. The best example I can think of is things like rounding, where you might need to do operations in higher precision: some targets could support instructions that widen, round and narrow again in a single instruction at one size + ISA combination but not at another, which means some would have a 'higher' element size mode in there where others don't. But that assumes the vectorizer would represent such 'widen + round + narrow' instructions in a single pattern, hiding the 'higher precision' elements, which as far as I know doesn't exist right now.

There may be other cases I can't think of, of course. We could always be even more conservative and only skip a mode if the highest possible element size for the current vector size + ISA would still lead to a mode with NUNITS greater than or equal to the main loop's vectorization factor. Or ... just never skip a mode; I don't have a good feeling for how much that would cost compile-time-wise, though.
Comment 8 GCC Commits 2022-01-19 14:15:07 UTC
The master branch has been updated by Andre Simoes Dias Vieira <avieira@gcc.gnu.org>:

https://gcc.gnu.org/g:f4ca0a53be18dfc7162fd5dcc1e73c4203805e14

commit r12-6740-gf4ca0a53be18dfc7162fd5dcc1e73c4203805e14
Author: Andre Vieira <andre.simoesdiasvieira@arm.com>
Date:   Wed Jan 19 14:11:32 2022 +0000

    vect: Fix epilogue mode skipping
    
    gcc/ChangeLog:
    
            PR tree-optimization/103997
            * tree-vect-loop.cc (vect_analyze_loop): Fix mode skipping for epilogue
            vectorization.
Comment 9 Levy Hsu 2022-01-24 04:00:21 UTC
Compared to the commit immediately before it (ffc7f200adbdf47f14b3594d9b21855c19cf797a),
commit r12-6740-gf4ca0a53be18dfc7162fd5dcc1e73c4203805e14 causes regressions on:

AlderLake (12900K) Multi-Core
548.exchange2_r -3.48%

Skylake Workstation (7920X) Single-Core
538.imagick_r -2.29%
549.fotonik3d_r -3.81%

With label march_native_ofast_lto and 5 iterations
-march=native -Ofast -funroll-loops -flto
Comment 10 avieira 2022-01-24 14:01:45 UTC
Hi Levy,

I did a quick experiment: I compiled exchange2_r with trunk and with trunk + all my epilogue and unroll vector patches reverted, using '-march=alderlake -Ofast -flto -funroll-loops', and the codegen is pretty much the same.

Could it be that picking a different mode than we did before all of my patches was a better choice? If this is the case then this is something that should be fixed by an appropriate cost model, picking the best mode for the specific loop's epilogue.

The patches I reverted were:
f4ca0a53be18dfc7162fd5dcc1e73c4203805e14
7ca1582ca60dc84cc3fc46b9cda620e2a0bed1bb
016bd7523131b645bca5b5530c81ab5149922743
d3ff7420e941931d32ce2e332e7968fe67ba20af

What were you using as a baseline for that last regression?
Comment 11 Levy Hsu 2022-01-25 06:01:02 UTC
Hi Avieira

The baseline was the commit immediately before (ffc7f200adbdf47f14b3594d9b21855c19cf797a).
I'm experiencing some issues with my local VTune, so I can't say which function or which front/back end was affected.
objdump shows some differences in the binary, but it is too long to dig into deeply.
I'll fix VTune and see if I can get some results back.
Comment 12 avieira 2022-01-25 10:01:23 UTC
Right, and did you happen to see a perf increase on these benchmarks with any of the patches whose hashes I mentioned in the previous comment?

Just to explain a bit further what I think is going on: before my initial patches, the epilogue loop analysis would start at the first loop's mode_i + 1, in other words, the next mode in the list of modes.

After patch (1) we started this from mode_i = 1, i.e. the first mode after VOIDmode; this caused some ICEs if the target didn't add any, not sure about your targets, but that was fixed in (2).
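
Purely as an illustration of that difference (hypothetical names, not the actual vect_analyze_loop code), roughly:

#include <cstdio>
#include <string>
#include <vector>

/* Stand-in for the real skip heuristic discussed in this PR.  */
static bool skip_candidate_p (const std::string &) { return false; }

int main ()
{
  std::vector<std::string> vector_modes
    = { "VOIDmode", "V8SImode", "V32QImode", "V16QImode", "V8QImode", "V4QImode" };
  unsigned first_loop_mode_i = 1;   /* mode the main loop was vectorized with  */

  /* Old behaviour: continue with the next mode in the list.  */
  for (unsigned mode_i = first_loop_mode_i + 1; mode_i < vector_modes.size (); ++mode_i)
    printf ("old: analyze epilogue with %s\n", vector_modes[mode_i].c_str ());

  /* New behaviour: restart from mode_i = 1 and rely on a skip check.  */
  for (unsigned mode_i = 1; mode_i < vector_modes.size (); ++mode_i)
    if (!skip_candidate_p (vector_modes[mode_i]))
      printf ("new: analyze epilogue with %s\n", vector_modes[mode_i].c_str ());
  return 0;
}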

In patch (3) Kewen added a fix to my check for the potential use of partial vectors, making it also check param_vect_partial_vector_usage, since that can disable partial vectors even if the target supports them.

So I suspect that one of these three patches inadvertently changed the vectorization strategy for the epilogue of some loop(s) in these benchmarks, and when I committed patch (4) f4ca0a53be18dfc7162fd5dcc1e73c4203805e14, the vectorization strategy went back to what it was previously. If this is indeed what happened, then the regression you are seeing is just an indication that the original vectorization strategy was sub-optimal. This is something that should be looked at separately, as an optimization, probably by improving the cost modelling of the vectorizer for your target.

Patch 1)
commit d3ff7420e941931d32ce2e332e7968fe67ba20af
Author: Andre Vieira <andre.simoesdiasvieira@arm.com>
Date:   Thu Dec 2 14:34:15 2021 +0000

    [vect] Re-analyze all modes for epilogues

Patch 2)
commit 016bd7523131b645bca5b5530c81ab5149922743
Author: Andre Vieira <andre.simoesdiasvieira@arm.com>
Date:   Tue Jan 11 15:52:59 2022 +0000

    [vect] PR103971, PR103977: Fix epilogue mode selection for autodetect only

Patch 3)
commit 6d51a9c6447bace21f860e70aed13c6cd90971bd
Author: Kewen Lin <linkw@linux.ibm.com>
Date:   Fri Jan 14 07:02:10 2022 -0600

    vect: Check partial vector param for supports_partial_vectors [PR104015]

Patch 4)
commit f4ca0a53be18dfc7162fd5dcc1e73c4203805e14
Author: Andre Vieira <andre.simoesdiasvieira@arm.com>
Date:   Wed Jan 19 14:11:32 2022 +0000

    vect: Fix epilogue mode skipping
Comment 13 Richard Biener 2022-01-25 11:02:56 UTC
The change basically re-enables epilogue vectorization, which had been disabled by accident before.  At least for exchange I know it will not benefit from that because of low iteration counts that are not statically known.  So the result was to be expected.

Btw, this bug is now fixed.
Comment 14 Levy Hsu 2022-01-26 01:27:35 UTC
Hi Avieira and Richard

I checked the data for the last half month and you are right, that no real regression was caused. Thank you all for the detailed explanation.