This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



long long / long long


---- Code ----------------------------------------------------

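.text
.type   __divdi3,@function
.global __divdi3

# long long __divdi3(long long a, long long b): a at 4(%esp), b at 12(%esp)
__divdi3:
        fildll  12(%esp)        # push divisor b
        fildll   4(%esp)        # push dividend a: st(0) = a, st(1) = b
        subl    $12,%esp        # scratch: two control words + 64-bit result
        movl    %esp,%ecx
        movw    $0x0C00,%ax     # RC field = 11b: round toward zero
        fnstcw  (%ecx)          # save the caller's control word
        orw     0(%ecx),%ax
        movw    %ax,2(%ecx)
        fldcw   2(%ecx)         # switch to truncation
        fdivp                   # a / b, pop
        fistpll 4(%ecx)         # store the truncated 64-bit quotient, pop
        fldcw   0(%ecx)         # restore the caller's control word
        movl    4(%esp),%eax    # low half of the result
        movl    8(%esp),%edx    # high half of the result
        addl    $12,%esp
        ret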



---- "Benchmark": Duration of a loop of --------------------------

    long long  x [1000];
    long long  y [1000];

    for (i = 0; i < 1000; i++)
        s += x[i] / y[i];
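
For anyone who wants to repeat the measurement, here is a minimal sketch of a harness around that loop. The function names and the use of clock() are my assumptions, not the setup that produced the numbers below; clock() reports CPU seconds rather than clocks, so results have to be scaled by the CPU frequency by hand.

```c
#include <stddef.h>
#include <time.h>

/* The loop from the post: s += x[i] / y[i] over n elements.
 * With GCC targeting 32-bit x86, each division is a call to __divdi3. */
long long sum_quotients(const long long *x, const long long *y, size_t n)
{
    long long s = 0;
    for (size_t i = 0; i < n; i++)
        s += x[i] / y[i];
    return s;
}

/* Hypothetical timing helper: average CPU seconds per division over
 * `reps` passes. The accumulated sum is written to *sink so that the
 * work cannot be discarded as dead code. */
double seconds_per_division(const long long *x, const long long *y,
                            size_t n, int reps, long long *sink)
{
    long long s = 0;
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        s += sum_quotients(x, y, n);
    clock_t t1 = clock();
    *sink = s;
    return (double)(t1 - t0) / CLOCKS_PER_SEC / ((double)reps * (double)n);
}
```

Use a large `reps` value; a single pass over 1000 elements is far below the resolution of clock().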


---- results ---------------------------------------------------- 
Old routine on Athlon:
	106 clocks including the outer loop and storing the arguments on the stack.
	
This routine on Athlon:
	79 clocks including the outer loop and storing the arguments on the stack.

  + shorter
  + can be inlined
  + the rounding control switch can sometimes be avoided by moving it outside a loop
  + faster for a lot of data
  - slower for trivial data (?)
  - does not work with SSE2 (needs a 63 or 64 bit mantissa)
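
To illustrate the last point, a small C sketch of the same idea at two mantissa widths. The helper names are hypothetical, and the first helper assumes that long double is the x87 extended type with a 64 bit mantissa (it falls back to plain integer division where it is not). The cast to long long truncates toward zero, which is what the control word change arranges for fistpll in the routine above.

```c
#include <float.h>

/* Quotient via long double, truncated toward zero by the cast.
 * Assumes a mantissa of at least 64 bits, so that every long long
 * converts exactly; otherwise falls back to integer division. */
long long div_via_long_double(long long a, long long b)
{
#if LDBL_MANT_DIG >= 64
    return (long long)((long double)a / (long double)b);
#else
    return a / b;
#endif
}

/* Same idea with a 53 bit double, as SSE2 arithmetic would use.
 * The volatile stores force every intermediate to double precision. */
long long div_via_double(long long a, long long b)
{
    volatile double da = (double)a;
    volatile double db = (double)b;
    volatile double q = da / db;
    return (long long)q;
}
```

With a = 2^53 + 1 = 9007199254740993 and b = 1, the double version already loses the low bit during the conversion of a and returns 9007199254740992, while the long double version returns a unchanged.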

---- optimization -----------------------------------------------
This routine on Athlon after inlining and moving fnstcw/fldcw outside the loop:
	21 clocks including the outer loop
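
Roughly what the inlined and hoisted form looks like, written as C with GCC-style inline assembly. This is only a sketch under assumptions: the function name is made up, the x87 precision control is assumed to be at its 64 bit default (as on Linux), and it is not the code that was measured. On other compilers or targets it falls back to plain division.

```c
#include <stddef.h>

/* Sum of x[i] / y[i] with the rounding control switched once,
 * outside the loop, instead of once per division. */
long long sum_quotients_hoisted(const long long *x, const long long *y,
                                size_t n)
{
    long long s = 0;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    unsigned short saved_cw, trunc_cw;

    __asm__ volatile ("fnstcw %0" : "=m" (saved_cw));
    trunc_cw = saved_cw | 0x0C00;       /* RC = 11b: round toward zero */
    __asm__ volatile ("fldcw %0" : : "m" (trunc_cw));

    for (size_t i = 0; i < n; i++) {
        long long q;
        __asm__ volatile (
            "fildll %2\n\t"             /* push divisor y[i]          */
            "fildll %1\n\t"             /* push dividend x[i]         */
            "fdivp\n\t"                 /* x[i] / y[i], pop           */
            "fistpll %0"                /* store truncated quotient   */
            : "=m" (q)
            : "m" (x[i]), "m" (y[i])
            : "st", "st(1)");
        s += q;
    }

    __asm__ volatile ("fldcw %0" : : "m" (saved_cw));  /* restore */
#else
    for (size_t i = 0; i < n; i++)
        s += x[i] / y[i];
#endif
    return s;
}
```

Only integer code runs between the two fldcw instructions, so nothing else sees the changed rounding mode.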


Interested? Or are 64 bit operations uninteresting for benchmarks?

-- 
Frank Klemm


Still remaining:
	long long % long long
	long long / long
	long long % long
	long long / const
	long long % const


