This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
multiplication
- To: gcc at gcc dot gnu dot org
- Subject: multiplication
- From: Frank Klemm <pfk at fuchs dot offl dot uni-jena dot de>
- Date: Sat, 1 Sep 2001 01:35:24 +0200
CPU is an Athlon. Compilation was done using -mcpu=athlon.
-----------------------------------------------------------------------------
Measurement (columns: multiplier, code size, throughput, latency, instruction sequence):
65: 7.00 byte 1.335 clock 2.059 clock mov r,t; shl $6,r; add t,r
65: 7.00 byte 1.377 clock 3.004 clock mov r,t; shl $6,t; add t,r
Shifting the copy "t" costs about one clock more latency (3.004 vs. 2.059 clocks), so the first version is better.
int xxx;
main () { xxx *= 65; }
main:
movl xxx, %eax
movl %eax, %ecx # ecx is the copy t, eax is r
sall $6, %ecx # shift the copy ????
addl %eax, %ecx # add
movl %ecx, xxx # store
ret
This is the slower version, and it is the one gcc selected.
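The two multiply-by-65 variants compute the same value and differ only in which register is shifted. A minimal C sketch of both, assuming 32-bit unsigned arithmetic so that overflow wraps the way the machine instructions do (the function names are made up for illustration and do not come from gcc):

```c
#include <assert.h>
#include <stdint.h>

/* Variant 1 from the table: mov r,t; shl $6,r; add t,r
   The shift works on the original r and does not wait for the copy. */
uint32_t mul65_shift_original(uint32_t r)
{
    uint32_t t = r;   /* mov r,t  */
    r <<= 6;          /* shl $6,r : r = 64 * x */
    r += t;           /* add t,r  : r = 65 * x */
    return r;
}

/* Variant 2 from the table: mov r,t; shl $6,t; add t,r
   The shift works on the copy t, so it can only start after the mov. */
uint32_t mul65_shift_copy(uint32_t r)
{
    uint32_t t = r;   /* mov r,t  */
    t <<= 6;          /* shl $6,t : t = 64 * x */
    r += t;           /* add t,r  : r = 65 * x */
    return r;
}

/* Hypothetical self-check: both variants must agree with a plain multiply. */
void check_mul65(void)
{
    for (uint32_t x = 0; x < 100000u; ++x) {
        assert(mul65_shift_original(x) == x * 65u);
        assert(mul65_shift_copy(x) == x * 65u);
    }
    /* Wrap-around case: all arithmetic is modulo 2^32. */
    assert(mul65_shift_original(0xFFFFFFFFu) == 0xFFFFFFFFu * 65u);
    assert(mul65_shift_copy(0xFFFFFFFFu) == 0xFFFFFFFFu * 65u);
}
```

The sketch only demonstrates that the results are identical. The timing difference comes from the dependency chain: in variant 2 all three instructions depend on each other in sequence, while in variant 1 the mov and the shl are independent.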
-----------------------------------------------------------------------------
Measurement (columns: multiplier, code size, throughput, latency, instruction sequence):
70: 9.00 byte 1.419 clock 4.006 clock lea (r,r,2),t; shl $6,r; lea (r,t,2),r
70: 10.00 byte 1.586 clock 4.006 clock lea (r,r,2),t; shl $6,r; shl t; add t,r
70: 10.00 byte 1.627 clock 4.006 clock lea (r,r,2),t; shl $6,r; add t,r; add t,r
70: 10.00 byte 1.639 clock 4.006 clock lea (r,r,2),t; shl $5,r; add t,r; shl r
70: 10.00 byte 1.669 clock 4.006 clock lea (r,r,2),t; shl t; shl $6,r; add t,r
70: 10.00 byte 1.669 clock 4.006 clock lea (r,r,4),t; add r,t; shl $6,r; add t,r
70: 10.00 byte 1.711 clock 4.006 clock shl r; lea (r,r,2),t; shl $5,r; add t,r
70: 10.00 byte 1.846 clock 5.007 clock lea (r,r,8),t; neg r; lea (r,t,4),r; shl r
70: 11.00 byte 1.864 clock 5.007 clock lea (r,r,4),r; lea (r,r,8),t; lea (r,r,4),r; add t,r
70: 15.00 byte 1.829 clock 5.007 clock lea (,r,2),t; lea (t,t,2),r; shl $5,t; add t,r
70: 15.00 byte 1.836 clock 5.007 clock lea (,r,2),t; shl $6,r; lea (r,t,2),r; add t,r
70: 15.00 byte 1.843 clock 5.007 clock lea (,r,2),r; lea (r,r,2),t; shl $5,r; add t,r
70: 15.00 byte 1.898 clock 5.007 clock lea (,r,4),t; lea (r,r,2),r; lea (r,t,8),r; shl r
70: 15.00 byte 1.899 clock 5.007 clock lea (,r,8),t; lea (r,r,2),r; lea (r,t,4),r; shl r
70: 12.00 byte 2.462 clock 6.008 clock mov r,t; shl $4,t; add r,t; lea (r,t,2),r; add r,r
70: 3.00 byte 2.045 clock 5.007 clock imul $70,r,r
For multiplying xxx by 70, gcc generated the 12-byte mov/shl/add/lea/add sequence from the table:
main:
movl xxx, %edx
movl %edx, %eax
sall $4, %eax
addl %edx, %eax
leal (%edx,%eax,2), %ecx
addl %ecx, %ecx
movl %ecx, xxx
ret
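To see how the decompositions reach 70, here is a sketch in C of the fastest measured sequence and of the one gcc emitted, with the running multiple noted at each step. It assumes 32-bit unsigned arithmetic; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Fastest measured sequence: lea (r,r,2),t; shl $6,r; lea (r,t,2),r */
uint32_t mul70_fastest(uint32_t r)
{
    uint32_t t = r + r * 2u;  /* lea (r,r,2),t : t = 3x         */
    r <<= 6;                  /* shl $6,r      : r = 64x        */
    r = r + t * 2u;           /* lea (r,t,2),r : r = 64x + 6x   */
    return r;
}

/* Sequence gcc emitted: mov r,t; shl $4,t; add r,t; lea (r,t,2),r; add r,r */
uint32_t mul70_gcc(uint32_t r)
{
    uint32_t t = r;           /* mov r,t       : t = x          */
    t <<= 4;                  /* shl $4,t      : t = 16x        */
    t += r;                   /* add r,t       : t = 17x        */
    r = r + t * 2u;           /* lea (r,t,2),r : r = x + 34x    */
    r += r;                   /* add r,r       : r = 70x        */
    return r;
}

/* Hypothetical self-check against a plain multiply. */
void check_mul70(void)
{
    for (uint32_t x = 0; x < 100000u; ++x) {
        assert(mul70_fastest(x) == x * 70u);
        assert(mul70_gcc(x) == x * 70u);
    }
    /* Wrap-around case: all arithmetic is modulo 2^32. */
    assert(mul70_fastest(0xFFFFFFFFu) == 0xFFFFFFFFu * 70u);
    assert(mul70_gcc(0xFFFFFFFFu) == 0xFFFFFFFFu * 70u);
}
```

In the gcc sequence every instruction depends on the result of the previous one, which is consistent with its higher measured latency. In the fastest sequence the first lea and the shl are independent of each other.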
Compared with the fastest version, gcc's solution takes:
- 3 bytes more
- about 1 clock more in throughput
- 2 clocks more latency
Compared with the simple imul:
- 9 bytes more
- 0.42 clock more in throughput
- 1 clock more latency
This is a real deoptimization.
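The differences quoted above follow directly from three rows of the multiply-by-70 table. A small sketch that recomputes them, with the numbers copied from the table (the struct and function names are made up for illustration):

```c
#include <assert.h>

/* One table row: code size in bytes, throughput and latency in clocks. */
struct cost {
    double bytes;
    double throughput;
    double latency;
};

/* Values copied from the multiply-by-70 measurements. */
const struct cost cost_fastest = {  9.0, 1.419, 4.006 }; /* lea; shl; lea           */
const struct cost cost_gcc     = { 12.0, 2.462, 6.008 }; /* mov; shl; add; lea; add */
const struct cost cost_imul    = {  3.0, 2.045, 5.007 }; /* imul $70,r,r            */

/* Extra cost of sequence a relative to sequence b, column by column. */
struct cost cost_excess(struct cost a, struct cost b)
{
    struct cost d;
    d.bytes = a.bytes - b.bytes;
    d.throughput = a.throughput - b.throughput;
    d.latency = a.latency - b.latency;
    return d;
}

/* Hypothetical self-check of the figures quoted in the text. */
void check_excess(void)
{
    struct cost vs_fast = cost_excess(cost_gcc, cost_fastest);
    struct cost vs_imul = cost_excess(cost_gcc, cost_imul);

    assert(vs_fast.bytes == 3.0);                                  /* 12 - 9        */
    assert(vs_fast.throughput > 1.04 && vs_fast.throughput < 1.05); /* 2.462 - 1.419 */
    assert(vs_fast.latency > 2.0 && vs_fast.latency < 2.01);        /* 6.008 - 4.006 */

    assert(vs_imul.bytes == 9.0);                                  /* 12 - 3        */
    assert(vs_imul.throughput > 0.41 && vs_imul.throughput < 0.42); /* 2.462 - 2.045 */
    assert(vs_imul.latency > 1.0 && vs_imul.latency < 1.01);        /* 6.008 - 5.007 */
}
```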
--
Frank Klemm