For the following code:
---------------------------
uint8_t data[16];

static __attribute__((noinline)) void test(unsigned i)
{
    unsigned j;
    for (j = 0; j < 16; j++)
        data[j] = (i + j) >> 8;
}
---------------------------
code generated with -O3 -ftree-vectorize is ~25% slower than with -O3 -fno-tree-vectorize for gcc 4.4 and 4.5. 4.3 and older don't vectorize this code.

Command line:
gcc tst2a.c -o tst2.o -O3 -march=k8 -fno-tree-vectorize
gcc tst2a.c -o tst2.o -O3 -march=k8 -ftree-vectorize
(using -m32 -fomit-frame-pointer has no significant effect on performance)

Tested versions (average time in ticks, 1<<24 loops):
3.4.6 (gentoo) - (66 ticks) very slow, probably doesn't unroll the loop (I haven't looked at the code)
4.1.2 - 4.3.3 (gentoo) - (20 ticks) doesn't autovectorize even when -ftree-vectorize is specified
4.4.0 (gentoo) - (20 without vectorizing, 30 with)
4.5.0 (r149701) - (19 ticks / 24 ticks) non-vectorized code is faster by 1 tick with -march=k8 than with -march=barcelona (even though my arch is barcelona)
(I am reporting this only against 4.5.0 since I don't have vanilla 4.4.0 and older)

Tests were repeated several times, run with highest priority and with affinity set to one core. CPU is AMD Phenom (4 cores, Barcelona) running at fixed 1400MHz.

Attached is the whole test code.
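For readers without the attachment, here is a minimal sketch of a harness in the spirit of the measurement described above. It is not the attached test program. The helpers `check` and `average_ticks` are hypothetical names, and the sketch assumes `clock()` as the timer, so its "ticks" are `CLOCKS_PER_SEC` units of CPU time and are not comparable to the numbers in the table.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

uint8_t data[16];

/* Kernel from the report, unchanged. */
static __attribute__((noinline)) void test(unsigned i)
{
    unsigned j;
    for (j = 0; j < 16; j++)
        data[j] = (i + j) >> 8;
}

/* Hypothetical helper: runs test(i) and verifies that every data[j]
   equals bits 8..15 of (i + j). Returns 1 on success, 0 on mismatch. */
static int check(unsigned i)
{
    unsigned j;
    test(i);
    for (j = 0; j < 16; j++)
        if (data[j] != (uint8_t)(((i + j) >> 8) & 0xff))
            return 0;
    return 1;
}

/* Hypothetical helper: average clock() ticks per call of test()
   over `loops` calls (assumption: clock() is an acceptable timer). */
static double average_ticks(unsigned loops)
{
    unsigned i;
    clock_t start = clock();
    for (i = 0; i < loops; i++)
        test(i);
    return (double)(clock() - start) / loops;
}
```

Building this twice, once with -ftree-vectorize and once with -fno-tree-vectorize, and calling `average_ticks(1u << 24)` from a `main` should give a rough comparison. Running `check` first guards against comparing a miscompiled variant.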
Created attachment 18205
preprocessed source

Includes contents of headers <stdint.h>, <stdio.h>
# ./gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --enable-languages=c,c++ --prefix=/mnt/svn/gcc-trunk/build/
Thread model: posix
gcc version 4.5.0 20090714 (experimental) (GCC)
The vectorized code seems to have improved in gcc-9 over gcc-8.
AARCH64 vectorization looks decent too:
```
        dup     v31.8h, w0
        adrp    x2, .LC0
        adrp    x0, .LC1
        adrp    x1, .LANCHOR0
        ldr     q30, [x2, #:lo12:.LC0]
        ldr     q29, [x0, #:lo12:.LC1]
        add     v30.8h, v31.8h, v30.8h
        add     v29.8h, v31.8h, v29.8h
        uzp2    v29.16b, v30.16b, v29.16b
        str     q29, [x1, #:lo12:.LANCHOR0]
```
The only improvement that can be made there is with SVE: those ldr could be `index` instructions instead, but that is PR 113328.
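To make the sequence above easier to follow, here is a scalar C sketch of what it appears to compute. It assumes that `.LC0` and `.LC1` hold the lane offsets {0..7} and {8..15} (the constants are not shown in the listing) and that the target is little-endian. The function names `ref`, `model` and `models_agree` are hypothetical. The point is that the stored byte depends only on the low 16 bits of `i + j`, so 16-bit lane adds followed by `uzp2` on bytes (which keeps the odd-numbered bytes, i.e. the high byte of each 16-bit lane) are sufficient.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference: the loop from the original report. */
static void ref(unsigned i, uint8_t out[16])
{
    unsigned j;
    for (j = 0; j < 16; j++)
        out[j] = (i + j) >> 8;
}

/* Model of the vector sequence, one 16-bit lane at a time. */
static void model(unsigned i, uint8_t out[16])
{
    uint16_t base = (uint16_t)i;               /* dup v31.8h, w0 */
    unsigned j;
    for (j = 0; j < 16; j++) {
        /* add .8h: 16-bit wrapping add of the assumed lane offset j */
        uint16_t lane = (uint16_t)(base + j);
        /* uzp2 .16b: keep the high (odd-numbered) byte of each lane */
        out[j] = (uint8_t)(lane >> 8);
    }
}

/* Returns 1 if the model matches the reference for this i, else 0. */
static int models_agree(unsigned i)
{
    uint8_t a[16], b[16];
    unsigned j;
    ref(i, a);
    model(i, b);
    for (j = 0; j < 16; j++)
        if (a[j] != b[j])
            return 0;
    return 1;
}
```

Values of `i` near a 16-bit or 32-bit wrap (for example 0xFFFF or 0xFFFFFFF8) are the interesting cases to try with `models_agree`, since that is where truncating to 16-bit lanes could plausibly have gone wrong.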