This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: Patch to improve x86 FP -> integer conversions


>  >   Old                   uops          New                   uops
>  > 
>  >   fnstcw -4(%ebp)        3            fnstcw -4(%ebp)        3
>  >   movl -4(%ebp),%ecx     1            movb $12,-3(%ebp)      1
>  >   movb $12,%ch           1            fnstcw -6(%ebp)        3
>  >   movl %ecx,-12(%ebp)    1
>  >   fldcw -12(%ebp)        3            fldcw -4(%ebp)         3
>  >   ...                                 ...
>  > 
>  > The old sequence (which required the scratch register) was
>  > 9 uops / 3 decode cycles / 2 memory writes / 2 memory reads.
>  > The new sequence (which doesn't require the scratch register) is
>  > 10 uops / 3 decode cycles / 3 memory writes / 1 memory read.
>
> In the old sequence we generated fewer uops.  Of the instructions, only two
> of them were multi-uop insns (which is good if we are able to schedule the
> individual instructions in the sequence since it's more likely the single
> uop insns will be able to go to the 2nd & 3rd decoders).

Currently the insns are not scheduled.  The sequence temporarily reprograms
the floating point control register in order to achieve the truncation, so
we can't allow other FP insns to mix with the sequence, which makes it
hard to schedule.  Burning 1 extra uop seemed to me like a good trade
considering that there are only four QImode registers on the x86.  We'll
make up that 1 uop if having that extra register prevents a spill.
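
For readers who want to see what the byte store does to the control word,
here is a small C model of the store / patch / reload steps.  It is only a
sketch: the function names are made up for illustration, the two-byte
array stands in for the stack slot at -4(%ebp), and 0x037F is assumed as
the starting value because that is the x87 default after finit.  It does
not touch the real FPU.

```c
#include <stdint.h>

/* Rounding-control (RC) field of the x87 control word: bits 10-11.
   RC = 3 selects round-toward-zero (truncation). */
enum { X87_RC_SHIFT = 10, X87_RC_MASK = 0x3 };

/* Model of "fnstcw mem": store the 16-bit control word, little-endian. */
void store_cw(uint8_t mem[2], uint16_t cw) {
    mem[0] = (uint8_t)(cw & 0xFF);
    mem[1] = (uint8_t)(cw >> 8);
}

/* Model of "fldcw mem": read the 16-bit control word back. */
uint16_t load_cw(const uint8_t mem[2]) {
    return (uint16_t)(mem[0] | (mem[1] << 8));
}

/* Hypothetical helper: the value fldcw would load after the byte patch. */
uint16_t truncating_cw(uint16_t current_cw) {
    uint8_t slot[2];              /* stands in for -4(%ebp) and -3(%ebp) */
    store_cw(slot, current_cw);   /* fnstcw -4(%ebp)   */
    slot[1] = 12;                 /* movb $12,-3(%ebp) */
    return load_cw(slot);         /* fldcw -4(%ebp)    */
}

/* Extract the RC field from a control word. */
unsigned rounding_control(uint16_t cw) {
    return (cw >> X87_RC_SHIFT) & X87_RC_MASK;
}
```

With the assumed default, truncating_cw(0x037F) gives 0x0C7F, whose RC
field is 3.  Note that in this model the byte store replaces the whole
high byte, not just the RC bits.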

> You've replaced 2 writes + 2 reads with 3 writes + 1 read.  My experience
> has been that writes are generally more expensive than reads (writeback
> buffers have size limits, serialization issues, etc).

Two of these writes in the new sequence are to the same memory.  Doesn't
the x86 merge these into one write?  The first read in the old sequence
is a large load following a small store, the second read in the old sequence
is a small load following a large store ... I believe that both cause
stalls on PPro / PII processors.
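
To make the size relationships concrete, here is a rough C sketch that
treats each memory access as a byte range relative to %ebp and compares a
load with the store before it.  The type and function names are invented
for this example, and the widths are simply the operand sizes of the
instructions (fnstcw and fldcw 2 bytes, movl 4 bytes, movb 1 byte).  It
only classifies the ranges; whether a given case actually stalls on a
particular processor is not something this code can tell you.

```c
#include <stdbool.h>

/* One memory access: byte offset from %ebp and width in bytes. */
typedef struct { int offset; int size; } Access;

typedef enum {
    DISJOINT,       /* load and store share no bytes                 */
    LOAD_EXACT,     /* same address, same width                      */
    LOAD_NARROWER,  /* load lies entirely inside the stored bytes    */
    LOAD_WIDER      /* load reads bytes the store did not write      */
} Relation;

/* True when the two byte ranges share at least one byte. */
bool overlaps(Access a, Access b) {
    return a.offset < b.offset + b.size && b.offset < a.offset + a.size;
}

/* Relate a load to the store that precedes it. */
Relation relate(Access store, Access load) {
    if (!overlaps(store, load))
        return DISJOINT;
    if (store.offset == load.offset && store.size == load.size)
        return LOAD_EXACT;
    bool contained = load.offset >= store.offset &&
                     load.offset + load.size <= store.offset + store.size;
    return contained ? LOAD_NARROWER : LOAD_WIDER;
}
```

For the old sequence, relating fnstcw -4(%ebp) (offset -4, size 2) to
movl -4(%ebp),%ecx (offset -4, size 4) gives LOAD_WIDER, and relating
movl %ecx,-12(%ebp) (offset -12, size 4) to fldcw -12(%ebp) (offset -12,
size 2) gives LOAD_NARROWER.  For the new sequence, overlaps() is true for
fnstcw -4(%ebp) and movb $12,-3(%ebp), which are the two writes to the
same memory.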

-- John
-------------------------------------------------------------------------
|   Feith Systems  |   Voice: 1-215-646-8000  |  Email: john@feith.com  |
|    John Wehle    |     Fax: 1-215-540-5495  |                         |
-------------------------------------------------------------------------


