This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



multiplication



CPU is an Athlon. Compilation was done using -mcpu=athlon.

-----------------------------------------------------------------------------

Measurement (columns: multiplier, code size, throughput, latency, instruction sequence):

        65:   7.00 byte   1.335 clock   2.059 clock   mov r,t; shl $6,r; add t,r
        65:   7.00 byte   1.377 clock   3.004 clock   mov r,t; shl $6,t; add t,r
 
Shifting the copy "t" costs one more clock of latency (3.004 vs. 2.059 clocks), so the first version is better.

int xxx;
main () { xxx *= 65; }

main:
        movl    xxx, %eax		
        movl    %eax, %ecx	; ecx is the copy t, eax is r
        sall    $6, %ecx	; shift the copy ????
        addl    %eax, %ecx	; add
        movl    %ecx, xxx	; store
        ret

This is the slower version, and it is the one gcc selected.
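
For reference, the two measured sequences can be modelled in C. This is only a sketch: the function names are made up for illustration, and C cannot express which register gets shifted, so it shows only that both orders compute r*65. The timing difference exists purely at the instruction level.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name. Models: mov r,t; shl $6,r; add t,r */
uint32_t mul65_shift_original(uint32_t r)
{
    uint32_t t = r;   /* mov r,t  : t is the copy            */
    r <<= 6;          /* shl $6,r : r = 64 * original value  */
    r += t;           /* add t,r  : r = 65 * original value  */
    return r;
}

/* Hypothetical name. Models: mov r,t; shl $6,t; add t,r */
uint32_t mul65_shift_copy(uint32_t r)
{
    uint32_t t = r;   /* mov r,t  : t is the copy                   */
    t <<= 6;          /* shl $6,t : depends on the preceding mov    */
    r += t;           /* add t,r  : r = 65 * original value         */
    return r;
}

/* Both models must agree with a plain multiplication (mod 2^32). */
void check_mul65(void)
{
    for (uint32_t x = 0; x < 100000u; x++) {
        assert(mul65_shift_original(x) == x * 65u);
        assert(mul65_shift_copy(x) == x * 65u);
    }
}
```

Calling check_mul65() from a test program exercises both variants. Unsigned arithmetic is used so that overflow wraps instead of being undefined.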

-----------------------------------------------------------------------------

Measurement (columns: multiplier, code size, throughput, latency, instruction sequence):

        70:   9.00 byte   1.419 clock   4.006 clock   lea (r,r,2),t; shl $6,r; lea (r,t,2),r
        70:  10.00 byte   1.586 clock   4.006 clock   lea (r,r,2),t; shl $6,r; shl t; add t,r
        70:  10.00 byte   1.627 clock   4.006 clock   lea (r,r,2),t; shl $6,r; add t,r; add t,r
        70:  10.00 byte   1.639 clock   4.006 clock   lea (r,r,2),t; shl $5,r; add t,r; shl r
        70:  10.00 byte   1.669 clock   4.006 clock   lea (r,r,2),t; shl t; shl $6,r; add t,r
        70:  10.00 byte   1.669 clock   4.006 clock   lea (r,r,4),t; add r,t; shl $6,r; add t,r
        70:  10.00 byte   1.711 clock   4.006 clock   shl r; lea (r,r,2),t; shl $5,r; add t,r
        70:  10.00 byte   1.846 clock   5.007 clock   lea (r,r,8),t; neg r; lea (r,t,4),r; shl r
        70:  11.00 byte   1.864 clock   5.007 clock   lea (r,r,4),r; lea (r,r,8),t; lea (r,r,4),r; add t,r
        70:  15.00 byte   1.829 clock   5.007 clock   lea (,r,2),t; lea (t,t,2),r; shl $5,t; add t,r
        70:  15.00 byte   1.836 clock   5.007 clock   lea (,r,2),t; shl $6,r; lea (r,t,2),r; add t,r
        70:  15.00 byte   1.843 clock   5.007 clock   lea (,r,2),r; lea (r,r,2),t; shl $5,r; add t,r
        70:  15.00 byte   1.898 clock   5.007 clock   lea (,r,4),t; lea (r,r,2),r; lea (r,t,8),r; shl r
        70:  15.00 byte   1.899 clock   5.007 clock   lea (,r,8),t; lea (r,r,2),r; lea (r,t,4),r; shl r
        70:  12.00 byte   2.462 clock   6.008 clock   mov r,t; shl $4,t; add r,t; lea (r,t,2),r; add r,r
        70:   3.00 byte   2.045 clock   5.007 clock   imul $70,r,r
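
As a sanity check that the rows above really compute r*70, here is a minimal C sketch modelling the first row, the mov/shl/add/lea row, and the imul row. The function names are hypothetical, and the sketch says nothing about timing; it only verifies the arithmetic of the decompositions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name. Models: lea (r,r,2),t; shl $6,r; lea (r,t,2),r */
uint32_t mul70_lea_shl_lea(uint32_t r)
{
    uint32_t t = r + r * 2;  /* lea (r,r,2),t : t = 3r              */
    r <<= 6;                 /* shl $6,r      : r = 64r             */
    r = r + t * 2;           /* lea (r,t,2),r : r = 64r + 6r = 70r  */
    return r;
}

/* Hypothetical name. Models: mov r,t; shl $4,t; add r,t; lea (r,t,2),r; add r,r */
uint32_t mul70_mov_shl_add(uint32_t r)
{
    uint32_t t = r;          /* mov r,t       : t = r               */
    t <<= 4;                 /* shl $4,t      : t = 16r             */
    t += r;                  /* add r,t       : t = 17r             */
    r = r + t * 2;           /* lea (r,t,2),r : r = r + 34r = 35r   */
    r += r;                  /* add r,r       : r = 70r             */
    return r;
}

/* Hypothetical name. Models: imul $70,r,r */
uint32_t mul70_imul(uint32_t r)
{
    return r * 70u;
}

/* All three models must agree (mod 2^32), including on wraparound. */
void check_mul70(void)
{
    for (uint32_t x = 0; x < 100000u; x++) {
        assert(mul70_lea_shl_lea(x) == mul70_imul(x));
        assert(mul70_mov_shl_add(x) == mul70_imul(x));
    }
    assert(mul70_lea_shl_lea(0xFFFFFFFFu) == mul70_imul(0xFFFFFFFFu));
    assert(mul70_mov_shl_add(0xFFFFFFFFu) == mul70_imul(0xFFFFFFFFu));
}
```

The first model needs three dependent steps, the second five, which matches the instruction counts in the table.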

main:
        movl    xxx, %edx
        movl    %edx, %eax		
        sall    $4, %eax		
        addl    %edx, %eax		
        leal    (%edx,%eax,2), %ecx	
        addl    %ecx, %ecx		
        movl    %ecx, xxx
        ret

Compared with the fastest version, this solution takes
	- 3 bytes more
	- 1 clock more throughput
	- 2 clocks more latency

Compared with the simple imul, it takes
	- 9 bytes more
	- 0.42 clock more throughput
	- 1 clock more latency
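
The differences quoted above follow directly from the table rows for 70. A small C sketch of that arithmetic, with the values copied from the table (the struct and function names are only illustrative):

```c
#include <assert.h>
#include <stdio.h>

/* One table row: code size in bytes, throughput and latency in clocks. */
struct row { double bytes, throughput, latency; };

/* Values copied from the measurement table for the multiplier 70. */
static const struct row fastest = {  9.0, 1.419, 4.006 };  /* lea; shl; lea     */
static const struct row gcc_seq = { 12.0, 2.462, 6.008 };  /* gcc's sequence    */
static const struct row imul    = {  3.0, 2.045, 5.007 };  /* imul $70,r,r      */

/* Component-wise difference a - b. */
struct row row_diff(struct row a, struct row b)
{
    struct row d = { a.bytes - b.bytes,
                     a.throughput - b.throughput,
                     a.latency - b.latency };
    return d;
}

/* Print how much worse gcc's sequence is than the two alternatives. */
void print_diffs(void)
{
    struct row f = row_diff(gcc_seq, fastest);
    struct row i = row_diff(gcc_seq, imul);

    assert(f.bytes == 3.0 && i.bytes == 9.0);
    printf("vs fastest: %+.0f byte, %+.3f clock throughput, %+.3f clock latency\n",
           f.bytes, f.throughput, f.latency);
    printf("vs imul:    %+.0f byte, %+.3f clock throughput, %+.3f clock latency\n",
           i.bytes, i.throughput, i.latency);
}
```

Calling print_diffs() prints both comparisons; the rounded values are the ones listed above.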

This is a real deoptimization: gcc's sequence is both larger and slower than the single imul it replaces.

-- 
Frank Klemm

