Bug 86557 - missed vectorization with std::vector compared to icc 18
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 8.1.0
Importance: P3 enhancement
Target Milestone: 12.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on: 101611
Blocks: vectorizer
Reported: 2018-07-17 23:46 UTC by nightstrike
Modified: 2021-08-03 00:34 UTC

See Also:
Host:
Target: x86_64
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-07-18 00:00:00


Description nightstrike 2018-07-17 23:46:37 UTC
ICC 18 is able to vectorize this loop, while GCC 8 is not.

#include <vector>

std::size_t f(std::vector<std::vector<float>> const & v) {
    std::size_t ret = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
      ret += v[i].size();
    return ret;
}
Comment 1 Marc Glisse 2018-07-18 06:23:25 UTC
Indeed, this example was mentioned during the discussion on better diagnostics but not entered in bugzilla, thanks. IIRC the issue is that we do not handle EXACT_DIV_EXPR in the vectorizer, which should be easy enough.

(then it isn't obvious to me that vectorizing this particular loop is a good idea, but that's an independent question)
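
For readers following along, here is a rough sketch of what the loop looks like once v[i].size() is inlined. It assumes the usual three-pointer std::vector layout; FloatVecRep and total_size are hypothetical names used only for illustration, not libstdc++ identifiers. The pointer subtraction is a byte difference exactly divided by sizeof(float), which is where the EXACT_DIV_EXPR comes from.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the three pointers a typical
// std::vector<float> implementation stores (assumption, not the real type).
struct FloatVecRep {
    const float* begin;
    const float* end;
    const float* end_of_storage;
};

// Hand-lowered form of the loop from the description.
std::size_t total_size(const FloatVecRep* first, const FloatVecRep* last) {
    std::size_t ret = 0;
    for (const FloatVecRep* p = first; p != last; ++p) {
        assert(p->begin <= p->end);
        // size(): (end - begin) in bytes, exactly divided by sizeof(float).
        ret += static_cast<std::size_t>(p->end - p->begin);
    }
    return ret;
}
```

Since sizeof(float) is 4 here, the exact division can be carried out as a right shift by 2 on the signed byte difference.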
Comment 2 Richard Biener 2018-07-18 09:12:32 UTC
But even with that we seem to need AVX512F to vectorize it, with AVX2 we get

t.C:5:31: note:   not vectorized: relevant stmt not supported: patt_45 = patt_44 >> 2;

Thus, somehow V2DI arithmetic right shifts are not available. Indeed,
it looks like we only have named patterns for V4SI arithmetic right shifts for AVX2.

I'm going to bootstrap / test the vectorizer fix.
Comment 3 Richard Biener 2018-07-18 09:18:25 UTC
ICC seems to emulate this even for SSE2 where I'm not sure this is profitable:

..B1.2:                         # Preds ..B1.2 ..B1.1
                                # Execution count [1.02e+03]
        movdqu    .L_2il0floatpacket.0(%rip), %xmm2             #6.19
        lea       x(,%rax,8), %rdx                              #6.12
        movdqu    (%rdx), %xmm1                                 #6.12
        movdqa    %xmm2, %xmm0                                  #6.19
        pand      %xmm1, %xmm0                                  #6.19
        movdqa    %xmm1, %xmm3                                  #6.19
        psrlq     $1, %xmm3                                     #6.19
        psrad     $1, %xmm0                                     #6.19
        por       %xmm0, %xmm3                                  #6.19
        psrlq     $62, %xmm3                                    #6.19
        paddq     %xmm1, %xmm3                                  #6.19
        pand      %xmm3, %xmm2                                  #6.19
        psrlq     $2, %xmm3                                     #6.19
        psrad     $2, %xmm2                                     #6.19
        por       %xmm2, %xmm3                                  #6.19
        movdqu    %xmm3, (%rdx)                                 #6.5
        addq      $2, %rax                                      #5.3
        cmpq      $1024, %rax                                   #5.3
        jb        ..B1.2        # Prob 99%                      #5.3


and for AVX2:

..B1.2:                         # Preds ..B1.2 ..B1.1
                                # Execution count [1.02e+03]
        lea       x(,%rax,8), %rdx                              #6.12
        vmovdqu   (%rdx), %ymm4                                 #6.12
        vpsrlq    $1, %ymm4, %ymm0                              #6.19
        vpsrad    $1, %ymm4, %ymm1                              #6.19
        vpblendw  $204, %ymm1, %ymm0, %ymm2                     #6.19
        vpsrlq    $62, %ymm2, %ymm3                             #6.19
        vpaddq    %ymm4, %ymm3, %ymm5                           #6.19
        vpsrlq    $2, %ymm5, %ymm6                              #6.19
        vpsrad    $2, %ymm5, %ymm7                              #6.19
        vpblendw  $204, %ymm7, %ymm6, %ymm8                     #6.19
        vmovdqu   %ymm8, (%rdx)                                 #6.5
        addq      $4, %rax                                      #5.3
        cmpq      $1024, %rax                                   #5.3
        jb        ..B1.2        # Prob 99%                      #5.3


long x[1024];

void foo()
{
  for (int i = 0; i < 1024; ++i)
    x[i] = x[i] / 4;
}
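
A scalar sketch of what the AVX2 sequence above appears to compute per 64-bit lane, as I read it: a 64-bit arithmetic right shift is assembled from a 64-bit logical shift (vpsrlq) and a 32-bit arithmetic shift of the high half (vpsrad), blended with mask 204, and that is used twice to get a signed division by 4 that rounds toward zero. The function names are made up for illustration, and the code uses unsigned arithmetic so it does not rely on how the compiler shifts negative values.

```cpp
#include <cassert>
#include <cstdint>

// 32-bit arithmetic right shift written with unsigned operations only
// (scalar analogue of one psrad/vpsrad lane).
std::uint32_t sra32(std::uint32_t v, unsigned n) {
    assert(n > 0 && n < 32);
    const std::uint32_t fill = (v & 0x80000000u) ? ~(0xFFFFFFFFu >> n) : 0u;
    return (v >> n) | fill;
}

// 64-bit arithmetic right shift: low half from a 64-bit logical shift,
// high half from a 32-bit arithmetic shift (the vpblendw $204 step).
std::uint64_t sra64_emulated(std::uint64_t v, unsigned n) {
    assert(n > 0 && n < 32);
    const std::uint64_t low = (v >> n) & 0xFFFFFFFFu;
    const std::uint64_t high = sra32(static_cast<std::uint32_t>(v >> 32), n);
    return (high << 32) | low;
}

// Signed x / 4 rounding toward zero: add 3 to negative inputs, then
// shift arithmetically by 2.
std::int64_t div4_emulated(std::int64_t x) {
    const std::uint64_t ux = static_cast<std::uint64_t>(x);
    // (x >> 1 arithmetic) >> 62 logical yields 3 if x < 0, else 0.
    const std::uint64_t bias = sra64_emulated(ux, 1) >> 62;
    // Conversion back to signed is modular on GCC (guaranteed from C++20).
    return static_cast<std::int64_t>(sra64_emulated(ux + bias, 2));
}
```

For example, div4_emulated(-5) goes through bias 3, sum -2, and an arithmetic shift by 2, giving -1, which matches -5 / 4 in C++.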
Comment 4 Richard Biener 2018-07-18 12:57:47 UTC
Author: rguenth
Date: Wed Jul 18 12:57:15 2018
New Revision: 262854

URL: https://gcc.gnu.org/viewcvs?rev=262854&root=gcc&view=rev
Log:
2018-07-18  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/86557
	* tree-vect-patterns.c (vect_recog_divmod_pattern): Also handle
	EXACT_DIV_EXPR.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-vect-patterns.c
Comment 5 Richard Biener 2018-07-19 07:27:31 UTC
The target part remains.
Comment 6 Andrew Pinski 2021-08-03 00:34:31 UTC
Fixed with the patch which fixes PR 101611.