This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: G++ could optimize ASM code more


Hello,

Look for the Intel Optimization Manual on intel.com.  The appendixes
have latency and throughput information for the instruction set on
various Intel processors.

Uh-oh, that's hard. I tried to find the information, but I found only part of what I was looking for.


First, I used -masm=intel to get Intel syntax, with this result:

- for the no-typecast-variant (imull):

imul    ecx, esi   # imull
movsx   rcx, ecx   # movslq

- for the typecast-variant (imulq):

imul rcx, rsi # imulq

From Appendix C, Table C-16a of the Intel manual, I collected the following information:

                Latency          Throughput
                0f_3h   0f_2h    0f_3h   0f_2h
imul r32        10      14       1       3
imul imm32      -       14       1       3
imul            -       15-18    -       5
mov             1       0.5      0.5     0.5
movsb/movsw     1       0.5      0.5     0.5


I have 3 problems:
1. I do not know my DisplayFamily_DisplayModel signature (0f_2h or 0f_3h?).
2. The table does not contain "movsx".
3. Should I compare Latency or Throughput if I want to produce fast code? Or doesn't it matter which value I compare?


I assume that movsx has the same latency as movsw (but I am not sure). I also think that "imul" in the table refers to AT&T's "imulq" (Intel's "imul rcx, rsi"), while "imul r32" in the table refers to AT&T's "imull" (Intel's "imul ecx, esi"). Am I right?

Daniel

On 09.05.2012 20:30, Ian Lance Taylor wrote:
Daniel Marschall <daniel-marschall@viathinksoft.de> writes:

I understood that the compiler used "signed" multiplication
instead of an unsigned one because char*char needs to be extended.

Maybe I am wrong, but couldn't the compiler "know" that the result
will be at least unsigned because unsigned * unsigned = unsigned?

Well, but the rules of C say that the unsigned char values are zero-extended to int, and then they are multiplied using a signed multiplication. So the result is not unsigned. The compiler really would have to do some sort of type or value based reasoning here to determine that an unsigned multiplication would work also.

Mh... good point. I do not know much about assembly, so I just thought
the shorter the code, the better.

Sadly, no.



If imull is faster than imulq, then
the question is whether imull+movslq is still faster than a single
imulq. Do you know where I can find this information for my CPU
(Intel Xeon X3440)? I was searching for a table which shows how many
CPU ticks imull, imulq and movslq need, but I have not found
one yet.

My Linux is 2.6.32-5-amd64 #1 SMP Mon Jan 16 16:22:28 UTC 2012 x86_64
GNU/Linux.


And the CPU is "Intel(R) Xeon(R) CPU X3440 @ 2.53GHz". (I hope the
"amd64" version of Debian is the correct one, or should our admin have
installed the "ia64" variant since it is an Intel CPU?)


Ian

