This is the mail archive of the
mailing list for the GCC project.
Re: G++ could optimize ASM code more
Look for the Intel Optimization Manual on intel.com. The appendixes
have latency and throughput information for the instruction set on
various Intel processors.
Uh-oh, that's hard. I tried to find the information, but I only found
part of what I was looking for.
First, I used -masm=intel to get Intel syntax, and got:
- for the no-typecast-variant (imull):
imul ecx, esi # imull
movsx rcx, ecx # movslq
- for the typecast-variant (imulq):
imul rcx, rsi # imulq
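The C++ source behind these two listings is not quoted in this message. As a rough sketch only, a pair of functions like the following would typically lead GCC on x86-64 to the two patterns above. The names mul_no_cast and mul_cast and the operand types are assumptions made for illustration, and the exact output depends on compiler version and flags.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the two variants; the original source is not shown.

// No-typecast variant: the multiplication is done in 32-bit int and the
// result is sign-extended to 64 bits afterwards (imull + movslq).
std::int64_t mul_no_cast(std::int32_t a, std::int32_t b) {
    return a * b;  // undefined behaviour if the product does not fit in int
}

// Typecast variant: one operand is widened first, so the multiplication
// itself is carried out in 64 bits (imulq).
std::int64_t mul_cast(std::int32_t a, std::int32_t b) {
    return static_cast<std::int64_t>(a) * b;
}
```

Note that the two are not equivalent for large operands: only the typecast variant gives a defined result when the product exceeds the range of int. Compiling with g++ -O2 -S -masm=intel should show which instructions a given compiler actually picks.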
From Appendix C of the Intel manual I collected the following information:

                 Latency         Throughput
                 0f_3h   0f_2h   0f_3h   0f_2h
imul r32         10      14      1       3
imul imm32       -       14      1       3
imul             -       15-18   -       5
mov              1       0.5     0.5     0.5
movsb/movsw      1       0.5     0.5     0.5

I have 3 problems:
1. I do not know my DisplayFamily_DisplayModel (0f_2h or 0f_3h?).
2. The table does not contain "movsx"
3. Should I compare Latency or Throughput if I want to produce fast
code? Or doesn't it matter which value I compare?
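On problem 3, the general rule is that latency matters when each instruction needs the result of the previous one, while throughput matters when the instructions are independent of each other. A minimal sketch of the two situations follows; the function names are made up for illustration, and an optimizing compiler may transform either loop, so any timing of them should be checked against the generated assembly.

```cpp
#include <cstdint>

// Latency-bound: every multiplication needs the previous result,
// so the multiplies cannot overlap in the pipeline.
std::uint64_t dependent_chain(std::uint64_t x, std::uint64_t m, int n) {
    for (int i = 0; i < n; ++i) {
        x *= m;
    }
    return x;
}

// Throughput-bound: four independent chains that the CPU may overlap.
std::uint64_t independent_chains(std::uint64_t x, std::uint64_t m, int n) {
    std::uint64_t a = x, b = x + 1, c = x + 2, d = x + 3;
    for (int i = 0; i < n; ++i) {
        a *= m;
        b *= m;
        c *= m;
        d *= m;
    }
    return a ^ b ^ c ^ d;  // combine so that no chain is optimized away
}
```

Timing both with the same n (for example with std::chrono::steady_clock) and dividing by the number of multiplications gives a rough per-multiply cost for each case.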
I assume that movsx has the same latency as movsw (but I am not sure),
and I think that "imul" in the table refers to AT&T's "imulq" (Intel's
"imul rcx, rsi"), while "imul r32" in the table refers to AT&T's "imull"
(Intel's "imul ecx, esi"). Am I right?
On 09.05.2012 20:30, Ian Lance Taylor wrote:
Daniel Marschall <firstname.lastname@example.org> writes:
I understood that the compiler used a "signed" multiplication
instead of an unsigned one because char*char needs to be extended.
Maybe I am wrong, but couldn't the compiler "know" that the result
will be unsigned, because unsigned * unsigned = unsigned?
Well, but the rules of C say that the unsigned char values are
zero-extended to int, and then they are multiplied using a signed
multiplication. So the result is not unsigned. The compiler really
would have to do some sort of type or value based reasoning here to
determine that an unsigned multiplication would work also.
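The promotion rule described here can be checked at compile time. The following is a small sketch, assuming a platform where int can represent every unsigned char value (true on all mainstream targets); the helper name product_of_bytes is hypothetical.

```cpp
#include <type_traits>

// Both operands are promoted to int before the multiplication,
// so the expression has type int, not an unsigned type.
static_assert(std::is_same<decltype(static_cast<unsigned char>(1) *
                                    static_cast<unsigned char>(1)),
                           int>::value,
              "unsigned char * unsigned char yields int");

// Hypothetical helper: the product of two unsigned char values.
int product_of_bytes(unsigned char a, unsigned char b) {
    return a * b;  // evaluated as int * int after integer promotion
}
```

Since the largest possible product, 255 * 255, fits easily in an int, the value is never negative here, but the compiler has to reason about the value range to see that; the type alone says signed.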
Mh... good point. I do not know much about assembler, so I just
thought: the shorter the code, the better.
If imull is faster than imulq, then
the question is whether imull+movslq is still faster than a single
imulq. Do you know where I can find this information for my CPU
(Intel Xeon X3440)? I was searching for a table which shows how many
CPU ticks imull, imulq and movslq need, but I have not found one yet.
My Linux is 2.6.32-5-amd64 #1 SMP Mon Jan 16 16:22:28 UTC 2012
And the CPU is "Intel(R) Xeon(R) CPU X3440 @ 2.53GHz". (I hope the
"amd64" version of Debian is the correct one, or should our admin have
installed the "ia64" variant since it is an Intel CPU?)
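To match a CPU against the column labels in Intel's latency tables (such as 0f_2h), the family and model have to be decoded from the value that CPUID leaf 1 returns in EAX. The sketch below does only the decoding, following the rule Intel documents for the extended family and model fields; the struct and function names are made up for illustration, and the input value has to be obtained separately (for example with GCC's <cpuid.h>).

```cpp
#include <cstdint>

struct CpuSignature {
    unsigned family;  // DisplayFamily
    unsigned model;   // DisplayModel
};

// Decode the EAX value returned by CPUID leaf 1.
CpuSignature decode_signature(std::uint32_t eax) {
    unsigned base_family = (eax >> 8) & 0xF;
    unsigned base_model = (eax >> 4) & 0xF;
    unsigned ext_family = (eax >> 20) & 0xFF;
    unsigned ext_model = (eax >> 16) & 0xF;

    CpuSignature sig;
    // The extended family is only added when the base family is 0FH.
    sig.family = (base_family == 0xF) ? base_family + ext_family : base_family;
    // The extended model only counts for base families 06H and 0FH.
    sig.model = (base_family == 0x6 || base_family == 0xF)
                    ? (ext_model << 4) + base_model
                    : base_model;
    return sig;
}
```

As an illustrative input, 0x000106E5 decodes to family 06H, model 1EH, written 06_1EH in Intel's notation. On Linux, the "cpu family" and "model" fields of /proc/cpuinfo should report the same two numbers in decimal. Signatures with family 0FH, like the two table columns quoted earlier, denote the older NetBurst (Pentium 4 era) processors.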