This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: Altivec strangeness?


On Mon, 2002-02-25 at 00:26, Aldy Hernandez wrote:

> > a) nasty because it requires a lot of typing.
> declare a macro:
> 	#define VSHORT_1S ((vector short int){1,1,1,1,1,1,1,1})

That's not much shorter than
const vector short shortones = (vector short int){1,1,1,1,1,1,1,1};
globally defined.
 
> as i have mentioned before, the vector initializers generate pretty
> bad code, but that will be remedied when, in 3.2, i rewrite them
> to use the vector constant infrastructure.  right now, they just
> get initialized as arrays, which is less than optimal.

Indeed.
 
> in the code's defense, how many times do you initialize a given
> vector in a function?  once!  it's not like it's going to drag
> performance down.

No, not in my case. I have small functions which have a generic
implementation but can be replaced by vectorised code. A profile
of a short run of the application looks like this:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
 11.83      0.68     0.68    99864     6.81     9.81  synth_filter
 11.48      1.34     0.66  2105726     0.31     0.31  put_pixels_altivec
 11.48      2.00     0.66  1436416     0.46     0.46  j_rev_dct_altivec

For a function which is called a few million times per second of
runtime, it makes a lot of difference whether a constant vector is
loaded from memory, which requires extra code to set up the base
address for the vector load, or is simply splatted into a vector
register, which uses less memory and fewer opcodes and is likely to
take the same number of CPU cycles.
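
To make the two variants concrete, here is a minimal sketch in C. It
reuses the shortones constant from above; the function names, the v8s
typedef and the same_vector helper are made up for illustration. The
non-AltiVec branch is only a fallback based on GCC generic vectors so
that the sketch builds on other targets. What a given compiler actually
emits for either variant will vary.

```c
#include <assert.h>
#include <string.h>

#ifdef __ALTIVEC__
#include <altivec.h>
typedef __vector signed short v8s;
#else
/* Fallback (assumption): GCC/Clang generic vectors, no AltiVec involved. */
typedef short v8s __attribute__((vector_size(16)));
#endif

/* Variant 1: the constant lives in memory, so using it needs the
 * address of the object plus a vector load. */
static const v8s shortones = {1, 1, 1, 1, 1, 1, 1, 1};

v8s ones_from_memory(void)
{
    return shortones;
}

/* Variant 2: the constant is built in a register from an immediate. */
v8s ones_from_splat(void)
{
#ifdef __ALTIVEC__
    return vec_splat_s16(1);   /* immediate must fit in 5 signed bits */
#else
    return (v8s){1, 1, 1, 1, 1, 1, 1, 1};   /* no splat intrinsic here */
#endif
}

/* Hypothetical helper: returns 1 if both vectors hold the same elements. */
int same_vector(v8s a, v8s b)
{
    short x[8], y[8];
    assert(sizeof a == sizeof x);
    memcpy(x, &a, sizeof x);
    memcpy(y, &b, sizeof y);
    return memcmp(x, y, sizeof x) == 0;
}
```

Both functions return the same value; the difference is only in how the
value gets into the register. Note that the splat form only works for
constants whose elements are all equal and small enough for the
immediate field.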

This is an example of assembly output produced by gcc 3.1:
        .align 2
        .globl put_pixels_clamped_altivec
        .type   put_pixels_clamped_altivec,@function
put_pixels_clamped_altivec:
        lis %r0,0x108
        lis %r9,zeros@ha
        ori %r0,%r0,16
        la %r9,zeros@l(%r9)
        dst %r3,%r0,0
        lvx %v13,0,%r9
        li %r0,8
        li %r11,0
        mtctr %r0
        li %r9,4
.L53:
        lvx %v0,0,%r3
        addi %r3,%r3,16
        vpkshus %v0,%v0,%v13
        vspltw %v1,%v0,1
        vspltw %v0,%v0,0
        stvewx %v0,%r11,%r4
        stvewx %v1,%r9,%r4
        add %r4,%r4,%r5
        bdnz .L53
        blr


As you can see, it takes an additional lis/la pair to get the address
for the vector load. The inner loop is executed 8 times, BTW.
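
For readers who don't speak PowerPC assembly, this is roughly what the
loop computes, written as scalar C. It is a sketch read off the listing
above, not the actual generic implementation: the name
put_pixels_clamped_ref, the parameter names and the clamp_u8 helper are
assumptions. Each of the 8 iterations takes 8 signed shorts, saturates
them to unsigned bytes (the vpkshus step) and stores 8 bytes, then
advances the destination by the line stride.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: saturate a signed 16-bit value to 0..255. */
static unsigned char clamp_u8(short v)
{
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return (unsigned char)v;
}

/* Scalar sketch: block holds 8 rows of 8 shorts, pixels is the
 * destination, line_size is the distance in bytes between rows. */
void put_pixels_clamped_ref(const short *block, unsigned char *pixels,
                            int line_size)
{
    assert(block != NULL && pixels != NULL);
    for (int row = 0; row < 8; row++) {
        for (int col = 0; col < 8; col++)
            pixels[col] = clamp_u8(block[col]);
        block += 8;             /* next row of coefficients */
        pixels += line_size;    /* next line of the picture */
    }
}
```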

> and if you have it in a loop, it's probably invariant, so move it out of it.

You bet on it. :)

> let's concentrate on getting the bugs ironed out of the current
> implementation, and then we can tackle code quality issues.

I hope you don't mind if I fool around a bit with code generation
now. :)
 
-- 
Servus,
       Daniel

