Bug 78007 - Important loop from 482.sphinx3 is not vectorized
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 7.0
Importance: P3 normal
Target Milestone: ---
Assignee: Richard Biener
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2016-10-17 11:10 UTC by Yuri Rumyantsev
Modified: 2016-11-09 08:59 UTC (History)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2016-10-17 00:00:00


Attachments
test-case to reproduce (173 bytes, text/plain)
2016-10-17 11:13 UTC, Yuri Rumyantsev
untested patch (1.57 KB, patch)
2016-10-18 09:35 UTC, Richard Biener
patch I am testing (2.04 KB, patch)
2016-11-08 11:26 UTC, Richard Biener

Description Yuri Rumyantsev 2016-10-17 11:10:31 UTC
The issue is related to missing support for __builtin_bswap32:

t1.c:9:3: note: function is not vectorizable.
t1.c:9:3: note: not vectorized: relevant stmt not supported: _13 = __builtin_bswap32 (load_dst_8);

Simple reproducer is attached.
Comment 1 Yuri Rumyantsev 2016-10-17 11:13:25 UTC
Created attachment 39821 [details]
test-case to reproduce

It is sufficient to compile it with the -Ofast option on an x86 platform.
Comment 2 Richard Biener 2016-10-17 11:20:16 UTC
Should be relatively easy to handle with a VIEW_CONVERT, VEC_PERM_EXPR, VIEW_CONVERT sequence.
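To make the suggested sequence concrete, here is a minimal scalar model in C of what a view-convert, byte-permute, view-convert sequence computes for a vector of four 32-bit words. It is only a sketch for illustration, not the vectorizer's code; the names bswap32_sel and bswap32_x4 are hypothetical, and a 16-byte vector is assumed.

```c
#include <stdint.h>
#include <string.h>

/* Byte selector that reverses the bytes inside each 4-byte lane of a
   16-byte vector (hypothetical name, for illustration only). */
static const uint8_t bswap32_sel[16] = {
    3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12
};

/* Scalar model of VIEW_CONVERT -> VEC_PERM_EXPR -> VIEW_CONVERT applied
   to four 32-bit words held in v. */
static void bswap32_x4(uint32_t v[4])
{
    uint8_t in[16], out[16];

    memcpy(in, v, sizeof in);            /* view the words as 16 bytes */
    for (int i = 0; i < 16; i++)
        out[i] = in[bswap32_sel[i]];     /* permute the bytes */
    memcpy(v, out, sizeof out);          /* view the bytes as words again */
}
```

Because the permutation reverses the in-memory bytes of each word, the result matches __builtin_bswap32 applied to every element regardless of host endianness.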
Comment 3 Richard Biener 2016-10-18 09:35:57 UTC
Created attachment 39827 [details]
untested patch

Mostly untested prototype.  With -mavx2, the innermost loop of the testcase becomes:

.L6:
        vmovdqa (%r9,%rdx), %ymm0
        addl    $1, %r8d
        vperm2i128      $0, %ymm0, %ymm0, %ymm0
        vpshufb %ymm1, %ymm0, %ymm0
        vmovdqa %ymm0, (%r9,%rdx)
        addq    $32, %rdx
        cmpl    %r11d, %r8d
        jb      .L6

with -msse4:

.L6:
        movdqa  (%rax,%rdx), %xmm0
        addl    $1, %r8d
        pshufb  %xmm1, %xmm0
        movaps  %xmm0, (%rax,%rdx)
        addq    $16, %rdx
        cmpl    %r10d, %r8d
        jb      .L6

Not sure if I got the bswap permutation vector constant correct either ;)  (quick hack).  The vectorized GIMPLE:

  vect_load_dst_8.13_63 = MEM[(u32 *)vectp_b.11_61];
  load_dst_8 = *_3;
  _64 = VIEW_CONVERT_EXPR<vector(16) char>(vect_load_dst_8.13_63);
  _65 = VEC_PERM_EXPR <_64, _64, { 3, 2, 1, 0, 3, 2, 1, 0, 3, 2, 1, 0, 3, 2, 1, 0 }>;
  _66 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(_65);
  _13 = __builtin_bswap32 (load_dst_8);
  MEM[(u32 *)vectp_b.14_69] = _66;
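
On the doubt about the constant: with both VEC_PERM_EXPR inputs being the same vector, a selector that repeats 3, 2, 1, 0 picks bytes 0..3 for every lane, so each output word would be the byte-swapped first word. A per-lane byte swap needs indices that stay inside each lane. The sketch below generates such a selector for any element width; make_bswap_selector is a hypothetical helper written for illustration, not code from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill sel[0 .. vec_bytes-1] so that result byte i is taken from the
   mirrored position inside the same word_bytes-wide lane.
   Assumes vec_bytes is a multiple of word_bytes and at most 256. */
static void make_bswap_selector(uint8_t *sel, size_t vec_bytes,
                                size_t word_bytes)
{
    for (size_t i = 0; i < vec_bytes; i++) {
        size_t lane = i / word_bytes;   /* which element this byte is in */
        size_t byte = i % word_bytes;   /* position inside that element */
        sel[i] = (uint8_t)(lane * word_bytes + (word_bytes - 1 - byte));
    }
}
```

For a 16-byte vector of 32-bit elements this yields { 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12 }; for 64-bit elements it yields { 7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8 }.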
Comment 4 Richard Biener 2016-10-18 09:39:01 UTC
The handling should probably be moved after the targetm.vectorize.builtin_vectorized_function handling, so that ARM's builtin bswap vectorization via vrev can still apply (I am not sure whether its permutation handling selects vrev for a bswap permutation).
Comment 5 Richard Biener 2016-11-08 11:26:19 UTC
Created attachment 39990 [details]
patch I am testing
Comment 6 Richard Biener 2016-11-09 08:19:37 UTC
Author: rguenth
Date: Wed Nov  9 08:19:05 2016
New Revision: 241992

URL: https://gcc.gnu.org/viewcvs?rev=241992&root=gcc&view=rev
Log:
2016-11-09  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/78007
	* tree-vect-stmts.c (vectorizable_bswap): New function.
	(vectorizable_call): Call vectorizable_bswap for
	BUILT_IN_BSWAP{16,32,64} if arguments are not promoted.

	* gcc.dg/vect/vect-bswap32.c: Adjust.
	* gcc.dg/vect/vect-bswap64.c: Likewise.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/vect/vect-bswap32.c
    trunk/gcc/testsuite/gcc.dg/vect/vect-bswap64.c
    trunk/gcc/tree-vect-stmts.c
Comment 7 Richard Biener 2016-11-09 08:59:07 UTC
Fixed.