Bug 40771 - generated code is ~25% slower when autovectorization is enabled
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.5.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2009-07-16 15:03 UTC by Zdenek Sojka
Modified: 2024-04-03 23:36 UTC (History)

See Also:
Host:
Target: x86_64-pc-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
preprocessed source (3.66 KB, text/plain)
2009-07-16 15:06 UTC, Zdenek Sojka

Description Zdenek Sojka 2009-07-16 15:03:15 UTC
For the following code:

---------------------------
uint8_t data[16];
static __attribute__((noinline)) void test(unsigned i)
{
	unsigned j;
	for (j = 0; j < 16; j++)
		data[j] = (i + j) >> 8;
}
---------------------------

Code generated with -O3 -ftree-vectorize is ~25% slower than code generated with -O3 -fno-tree-vectorize, for both gcc 4.4 and 4.5. gcc 4.3 and older don't vectorize this loop.

Command line:
gcc tst2a.c -o tst2.o -O3 -march=k8 -fno-tree-vectorize
gcc tst2a.c -o tst2.o -O3 -march=k8 -ftree-vectorize
(using -m32 -fomit-frame-pointer has no significant effect on performance)

Tested versions (average time in ticks, 1<<24 loops):
3.4.6 (gentoo) - 66 ticks; very slow, probably because the loop isn't unrolled (I haven't looked at the generated code)
4.1.2 - 4.3.3 (gentoo) - 20 ticks; doesn't autovectorize even when -ftree-vectorize is specified
4.4.0 (gentoo) - 20 ticks without vectorization, 30 ticks with
4.5.0 (r149701) - 19 ticks without vectorization, 24 ticks with; non-vectorized code is 1 tick faster with -march=k8 than with -march=barcelona (even though my CPU is a Barcelona)

(I am reporting this only against 4.5.0 because I don't have vanilla builds of 4.4.0 or older.)
Tests were repeated several times, run with highest priority and with affinity set to one core.

CPU is AMD Phenom (4 cores, Barcelona) running at fixed 1400MHz.

The complete test program is attached.
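
For readers without access to the attachment, the measurement described above can be approximated with a sketch like the following. This is not the attached test program: the helper name `avg_ns_per_call` is hypothetical, `static` is dropped from `test` so the kernel can be called from other code, and the timer is an assumption (POSIX `clock_gettime`, reporting nanoseconds rather than the ticks quoted in the table).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

uint8_t data[16];

/* Kernel from the report; only `static` has been dropped. */
__attribute__((noinline)) void test(unsigned i)
{
	unsigned j;
	for (j = 0; j < 16; j++)
		data[j] = (i + j) >> 8;
}

/* Hypothetical helper: average wall-clock nanoseconds per call of test(). */
double avg_ns_per_call(unsigned loops)
{
	struct timespec t0, t1;
	unsigned i;

	assert(loops > 0);	/* avoid dividing by zero below */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < loops; i++)
		test(i);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	return ((double)(t1.tv_sec - t0.tv_sec) * 1e9
		+ (double)(t1.tv_nsec - t0.tv_nsec)) / loops;
}
```

Calling `avg_ns_per_call(1u << 24)` from a `main` and building the file once with each of the two command lines above should show whether the gap reproduces on a given machine. The absolute numbers will not match the tick counts in the table.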
Comment 1 Zdenek Sojka 2009-07-16 15:06:19 UTC
Created attachment 18205
preprocessed source

Includes the contents of the headers <stdint.h> and <stdio.h>.
Comment 2 Zdenek Sojka 2009-07-16 15:06:59 UTC
# ./gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --enable-languages=c,c++ --prefix=/mnt/svn/gcc-trunk/build/
Thread model: posix
gcc version 4.5.0 20090714 (experimental) (GCC)
Comment 3 Zdenek Sojka 2020-09-06 10:47:12 UTC
The vectorized code seems to have improved in gcc-9 over gcc-8.
Comment 4 Andrew Pinski 2024-04-03 23:36:42 UTC
AArch64 vectorization looks decent too:
```
        dup     v31.8h, w0
        adrp    x2, .LC0
        adrp    x0, .LC1
        adrp    x1, .LANCHOR0
        ldr     q30, [x2, #:lo12:.LC0]
        ldr     q29, [x0, #:lo12:.LC1]
        add     v30.8h, v31.8h, v30.8h
        add     v29.8h, v31.8h, v29.8h
        uzp2    v29.16b, v30.16b, v29.16b
        str     q29, [x1, #:lo12:.LANCHOR0]
```

The only improvement that can be made there is with SVE: those `ldr` instructions could be `index` instructions instead, but that is PR 113328.
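
The AArch64 sequence above does the additions in 16-bit lanes and then keeps only the high byte of each lane. A scalar sketch of why that narrowing is safe is given below. It assumes that `.LC0` and `.LC1` hold the halfword constants 0..7 and 8..15, which the listing does not show, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reference: the loop from the report, using full-width unsigned arithmetic. */
void ref_kernel(uint8_t out[16], unsigned i)
{
	unsigned j;
	for (j = 0; j < 16; j++)
		out[j] = (i + j) >> 8;
}

/* Scalar emulation of the vector code: 16-bit lanes, then the high byte. */
void narrow_kernel(uint8_t out[16], unsigned i)
{
	uint16_t base = (uint16_t)i;	/* dup v31.8h, w0 keeps the low 16 bits */
	unsigned j;
	for (j = 0; j < 16; j++) {
		/* add ... .8h wraps modulo 2^16 (lane constants assumed to be j) */
		uint16_t lane = (uint16_t)(base + j);
		/* uzp2 ... .16b keeps the odd bytes, i.e. the high byte of each
		   little-endian 16-bit lane */
		out[j] = (uint8_t)(lane >> 8);
	}
}

/* Returns 1 when both kernels produce the same 16 bytes for this i. */
int kernels_agree(unsigned i)
{
	uint8_t a[16], b[16];
	ref_kernel(a, i);
	narrow_kernel(b, i);
	return memcmp(a, b, sizeof a) == 0;
}

/* Asserts agreement for `count` consecutive inputs starting at `first`. */
void check_range(unsigned first, unsigned count)
{
	unsigned k;
	for (k = 0; k < count; k++)
		assert(kernels_agree(first + k));
}
```

The two kernels agree because the stored byte is bits 8..15 of `i + j`, and those bits depend only on the low 16 bits of the sum. That is why the vectorizer can use eight halfword lanes per register instead of four word lanes.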