This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.



[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq



------- Comment #30 from vvv at ru dot ru  2009-05-14 09:01 -------
Created an attachment (id=17863)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17863&action=view)
Testing tool.

Here are the results of my testing.
Code:
align   128
test_cikl:
        rept 14         ; 14 if SH=0, 15 if SH=1, 16 if SH=2
        {
        nop
        }
         cmp  al,0           ; 2 bytes

         jz   $+10h+NOPS     ; 2 bytes offset=xxxx0
         cmp  al,1           ; 2 bytes offset=xxxx2
         jz   $+0Ch+NOPS     ; 2 bytes offset=xxxx4
         cmp  al,2           ; 2 bytes offset=xxxx6
         jz   $+08h+NOPS     ; 2 bytes offset=xxxx8
         cmp  al,3           ; 2 bytes offset=xxxxA
match =1, NOPS
{
   nop
}
match =2, NOPS
{
   xchg eax,eax         ; 2-bytes NOP
}
         jz   $+04h          ; 2 bytes offset=xxxxC
         ja   $+02h          ; 2 bytes offset=xxxxE

         mov  eax,ecx
         and  eax,7h
         loop test_cikl

This code was tested on Core 2, Xeon and P4 CPUs. Results are in RDTSC ticks.

; Core 2 Duo
;    NOPS/tick/Max  NOPS/tick/Max    NOPS/tick/Max
; SH=0  0/571/729      1/306/594       2/315/630
; SH=1  0/338/612      1/338/648       2/339/648
; SH=2  0/339/666      1/339/675       2/333/693

; Xeon 3110
;    NOPS/tick/Max  NOPS/tick/Max    NOPS/tick/Max
; SH=0  0/586/693      1/310/675       2/310/675
; SH=1  0/333/657      1/330/648       2/464/630
; SH=2  0/333/657      1/470/594       2/474/603

; P4
;    NOPS/tick/Max  NOPS/tick/Max    NOPS/tick/Max
; SH=0 0/1027/1317     1/1094/1258     2/1028/1207
; SH=1 0/1151/1377     1/1068/1352     2/902/1275
; SH=2 0/1124/1275     1/1148/1335     2/979/1139

Conclusion:
1. Core2 and Xeon give similar results. The P4 results are strange.
For Core2 & Xeon, padding is very effective: with SH=0, the padded code is
almost 2 times faster (306 vs. 571 ticks on Core 2). Does padding make no
sense for P4?
2. My previous statement

VVV> 1. AMD limitation for 16-bytes page (memory range XXX0 - XXXF),but
VVV> Intel limitation for 16-bytes chunk  (memory range XXXX - XXXX+10h)

is wrong, at least for Core2 & Xeon. For these CPUs, "16-byte chunk" means
the aligned memory range XXX0 - XXXF.
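
To make the difference between the two readings concrete, here is a rough sketch (in Python rather than assembly, purely for the arithmetic) that lays out the test loop byte by byte from the size comments in the listing and counts how many jz/ja branches fall inside one 16-byte range. The function names are made up for illustration. It assumes that a branch counts toward a range only when both of its bytes lie inside it, and it ignores the trailing "loop" instruction. Neither assumption is confirmed by the measurements.

```python
# Sketch only: byte layout after "align 128", read off the listing's size comments.
# Assumption: a branch counts toward a window only if both of its bytes lie in it.
# Assumption: the trailing "loop" instruction is left out of the count.

def branch_ranges(sh, nops):
    """Return (first_byte, last_byte) offsets of the five 2-byte jz/ja branches."""
    sizes = [(False, 14 + sh)]              # leading one-byte NOPs (14, 15 or 16)
    for _ in range(3):
        sizes += [(False, 2), (True, 2)]    # cmp al,N ; jz
    sizes += [(False, 2), (False, nops)]    # cmp al,3 ; NOPS padding bytes
    sizes += [(True, 2), (True, 2)]         # jz ; ja
    ranges, offset = [], 0
    for is_branch, size in sizes:
        if is_branch:
            ranges.append((offset, offset + size - 1))
        offset += size
    return ranges

def max_in_aligned_window(sh, nops, size=16):
    """Most branches lying wholly inside one aligned range XXX0 - XXXF."""
    counts = {}
    for first, last in branch_ranges(sh, nops):
        if first // size == last // size:
            counts[first // size] = counts.get(first // size, 0) + 1
    return max(counts.values())

def max_in_sliding_window(sh, nops, size=16):
    """Most branches lying wholly inside any range XXXX - XXXX+0Fh."""
    ranges = branch_ranges(sh, nops)
    return max(
        sum(1 for first, last in ranges if first >= start and last < start + size)
        for start in range(ranges[-1][1] + 1)
    )

for sh in range(3):
    for nops in range(3):
        print(sh, nops, max_in_aligned_window(sh, nops), max_in_sliding_window(sh, nops))
```

Under these assumptions the aligned count reaches 5 only for SH=0 with NOPS=0, while the sliding count is 5 for NOPS=0 at every SH. Only the aligned reading singles out the one case that is slow on Core 2 (571 ticks). The sketch does not explain the slower Xeon entries at SH=1 and SH=2.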

Unfortunately, I can't test AMD.

PS. My testing tool is in the attachment. It starts under MS-DOS, switches to
32-bit mode, then to 64-bit mode, and measures RDTSC ticks for the test code.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942

