This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: Altivec strangeness?
- From: Daniel Egger <degger at fhm dot edu>
- To: Aldy Hernandez <aldyh at redhat dot com>
- Cc: GCC Developer Mailinglist <gcc at gcc dot gnu dot org>
- Date: 25 Feb 2002 01:23:10 +0100
- Subject: Re: Altivec strangeness?
- References: <E65BBDE0-297D-11D6-97EE-000393750C1E@redhat.com>
On Mon, 2002-02-25 at 00:26, Aldy Hernandez wrote:
> > a) nasty because it requires a lot of typing.
> declare a macro:
> #define VSHORT_1S ((vector short int){1,1,1,1,1,1,1,1})
That's not much shorter than
const vector short shortones = (vector short int){1,1,1,1,1,1,1,1};
globally defined.
> as i have mentioned before, the vector initializers generate pretty
> bad code, but that will be remedied when, in 3.2, i rewrite them
> to use the vector constant infrastructure. right now, they just
> get initialized as arrays, which is less than optimal.
Indeed.
> in the code's defense, how many times do you initialize a given
> vector in a function? once! it's not like it's going to drag
> performance down.
No, not in my case. I have small functions which have a generic
implementation but can be replaced by vectorised code. A profile
of a short run of the application looks like this:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
 11.83      0.68     0.68    99864     6.81     9.81  synth_filter
 11.48      1.34     0.66  2105726     0.31     0.31  put_pixels_altivec
 11.48      2.00     0.66  1436416     0.46     0.46  j_rev_dct_altivec
For a function which is called a few million times per second of
runtime, it makes a lot of difference whether a constant vector is
loaded from memory, which requires extra code to set up the base
address for the vector load, or whether the vector simply gets
splatted into a vector register, which uses less memory and fewer
opcodes and is likely to happen in the same number of CPU cycles.
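
To make the two alternatives concrete, here is a minimal sketch in C.
It assumes a compiler with AltiVec support and <altivec.h>; the names
ones_in_memory, ones_loaded and ones_splatted are made up for this
illustration, and the #else branch is only a scalar stand-in so the
file also builds on machines without AltiVec. Which instructions a
given compiler actually emits depends on its version and flags.

```c
#include <string.h>

#ifdef __ALTIVEC__
#include <altivec.h>

/* Alternative 1: the constant lives in memory. Reading it needs the
 * address of the object first (for example a lis/la pair) and then a
 * vector load. */
static const vector signed short ones_in_memory = {1, 1, 1, 1, 1, 1, 1, 1};

static void ones_loaded(short out[8])
{
    union { vector signed short v; short s[8]; } u;
    u.v = ones_in_memory;
    memcpy(out, u.s, sizeof u.s);
}

/* Alternative 2: the constant is splatted from an immediate.
 * vec_splat_s16 corresponds to the vspltish instruction and accepts
 * a literal in the range -16..15, so no memory operand is needed. */
static void ones_splatted(short out[8])
{
    union { vector signed short v; short s[8]; } u;
    u.v = vec_splat_s16(1);
    memcpy(out, u.s, sizeof u.s);
}

#else

/* Scalar stand-ins (assumption: used only where AltiVec is absent). */
static void ones_loaded(short out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = 1;
}

static void ones_splatted(short out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = 1;
}

#endif
```

Both functions produce the same eight values; the point of interest is
only how the constant reaches the vector register.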
This is an example of assembly output produced by gcc 3.1:
        .align 2
        .globl put_pixels_clamped_altivec
        .type put_pixels_clamped_altivec,@function
put_pixels_clamped_altivec:
        lis %r0,0x108
        lis %r9,zeros@ha
        ori %r0,%r0,16
        la %r9,zeros@l(%r9)
        dst %r3,%r0,0
        lvx %v13,0,%r9
        li %r0,8
        li %r11,0
        mtctr %r0
        li %r9,4
.L53:
        lvx %v0,0,%r3
        addi %r3,%r3,16
        vpkshus %v0,%v0,%v13
        vspltw %v1,%v0,1
        vspltw %v0,%v0,0
        stvewx %v0,%r11,%r4
        stvewx %v1,%r9,%r4
        add %r4,%r4,%r5
        bdnz .L53
        blr
As you can see, it takes an additional lis and la to get the address
for the vector load. The inner loop is executed 8 times, BTW.
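
For readers who do not read PowerPC assembly, the following is a
scalar C sketch of what the loop above appears to compute, based only
on a reading of the listing: each of the 8 iterations takes 8 signed
16-bit values, saturates them to the range 0..255 (the effect of
vpkshus) and stores the resulting 8 bytes to the destination row. The
function name, parameter names and signature are assumptions made for
this illustration, not the original source.

```c
#include <stdint.h>

/* Saturate a signed 16-bit value to 0..255, as vpkshus does for each
 * element it packs. */
static uint8_t clamp_u8(int16_t v)
{
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return (uint8_t)v;
}

/* Hypothetical scalar counterpart of the loop at .L53.
 * block:     64 signed 16-bit values, 8 rows of 8
 * pixels:    destination, 8 bytes written per row
 * line_size: distance in bytes between destination rows */
static void put_pixels_clamped_ref(const int16_t *block, uint8_t *pixels,
                                   int line_size)
{
    for (int row = 0; row < 8; row++) {
        for (int col = 0; col < 8; col++)
            pixels[col] = clamp_u8(block[col]);
        block += 8;          /* addi %r3,%r3,16 (8 halfwords) */
        pixels += line_size; /* add  %r4,%r4,%r5 */
    }
}
```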
> and if you have it in a loop, it's probably invariant, so move it out of it.
You bet on it. :)
> let's concentrate on getting the bugs ironed out of the current
> implementation, and then we can tackle code quality issues.
I hope you don't mind if I fool around a bit with code generation
now. :)
--
Servus,
Daniel