#define SLOW_BYTE_ACCESS in i386.h

Fri Sep 1 20:47:00 GMT 2006

On Fri, Sep 01, 2006 at 12:24:18PM -0700, Hui-May Chang wrote:
> 
> On Sep 1, 2006, at 11:26 AM, H. J. Lu wrote:
> 
> >On Fri, Sep 01, 2006 at 11:03:31AM -0700, Eric Christopher wrote:
> >>Hui-May Chang wrote:
> >>>I have a question regarding "#define SLOW_BYTE_ACCESS" in i386.h. It is
> >>>used in the "get_best_mode" routine, which finds the best mode to use when
> >>>referencing a bit field. It is currently set to 0. If it is set to 1,
> >>>it means "accessing less than a word of memory is no faster than
> >>>accessing a word of memory". I experimented with it and observed a great
> >>>performance improvement. It is set to 1 for some other configurations
> >>>(e.g., rs6000, pa, ia64). Is it always a win to set it? Is it better to
> >>>set it for certain i386 architectures?
> >>>
> >>
> >>I'll bet that it's probably advantageous to set it for a couple of
> >>reasons in the new chips at least:
> >>
> >>1) You avoid the problem that got you here of large bitfields needing
> >>shift/insert operations
> >>
> >>2) You avoid length changing since you're mostly operating on things in
> >>word mode.
> >>
> >>However, I'm not an expert on the chip so I'd suggest posting a small
> >>testcase that shows #1 for people and the resultant code differences so
> >>they can see the difference. Hopefully someone with more Intel
> >>experience (like HJ or Jan) can comment on whether or not this is a good
> >>general idea for the processor.
> >>
> >
> >I tried this patch and enabled it for Conroe and Nocona. It doesn't
> >have much impact on SPEC CPU 2000 on Conroe and it seems bad on Nocona.
> >Maybe we should add -mslow-byte-access to investigate it further.
> >
> >
> >H.J.
> >---
> >--- gcc/config/i386/i386.c.slow	2006-08-23 17:15:14.000000000 -0700
> >+++ gcc/config/i386/i386.c	2006-08-24 12:10:25.000000000 -0700
> >@@ -831,6 +831,8 @@ const int x86_cmpxchg16b = m_NOCONA;
> > const int x86_xadd = ~m_386;
> > const int x86_pad_returns = m_ATHLON_K8 | m_GENERIC;
> >
> >+const int x86_slow_byte_access = 0;
> >+
> > /* In case the average insn count for single function invocation is
> >    lower than this constant, emit fast (but longer) prologue and
> >    epilogue code.  */
> >--- gcc/config/i386/i386.h.slow	2006-08-23 17:15:14.000000000 -0700
> >+++ gcc/config/i386/i386.h	2006-08-24 12:01:49.000000000 -0700
> >@@ -164,6 +164,7 @@ extern const int x86_use_bt;
> > extern const int x86_cmpxchg, x86_cmpxchg8b, x86_cmpxchg16b, x86_xadd;
> > extern const int x86_use_incdec;
> > extern const int x86_pad_returns;
> >+extern const int x86_slow_byte_access;
> > extern int x86_prefetch_sse;
> >
> > #define TARGET_USE_LEAVE (x86_use_leave & TUNEMASK)
> >@@ -219,6 +220,7 @@ extern int x86_prefetch_sse;
> > #define TARGET_USE_INCDEC (x86_use_incdec & TUNEMASK)
> > #define TARGET_PAD_RETURNS (x86_pad_returns & TUNEMASK)
> > #define TARGET_EXT_80387_CONSTANTS (x86_ext_80387_constants & TUNEMASK)
> >+#define TARGET_SLOW_BYTE_ACCESS (x86_slow_byte_access & TUNEMASK)
> >
> > #define ASSEMBLER_DIALECT (ix86_asm_dialect)
> >
> >@@ -1840,7 +1842,7 @@ do {						 \
> >    subsequent accesses occur to other fields in the same word of the
> >    structure, but to different bytes.  */
> >
> >-#define SLOW_BYTE_ACCESS 0
> >+#define SLOW_BYTE_ACCESS TARGET_SLOW_BYTE_ACCESS
> >
> > /* Nonzero if access to memory by shorts is slow and undesirable.  */
> > #define SLOW_SHORT_ACCESS 0
> 
> We got the following request from a customer:
> 
> When accessing a 32-bit bitfield on x86, gcc automatically allocates an
> 8-bit or 16-bit register to manipulate the portion of the bitfield being
> modified rather than using a whole 32-bit register. This leads to
> poor performance when multiple updates to that 32-bit bitfield are
> performed, as the modified portions are always written to memory
> before a read of another portion is performed. If a sequence
> contains N modifications, there will be N loads and N stores of 8-bit or
> 16-bit values rather than a single 32-bit load and 32-bit store.
> 

It is pretty bad.
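
For illustration, the access pattern in the customer report can be sketched as
a small C test case. The struct, its field widths, and the function name are
hypothetical and are not taken from the report. Bit-field layout is
implementation-defined, and whether the compiler emits 8-bit, 16-bit, or
32-bit accesses depends on the compiler version and tuning, so this is only a
starting point for comparing generated assembly with SLOW_BYTE_ACCESS set to
0 and to 1.

```c
/* Hypothetical 32-bit word split into bit-fields (widths sum to 32).
   The names and widths are illustrative assumptions.  */
struct flags {
    unsigned int a : 3;
    unsigned int b : 7;
    unsigned int c : 9;
    unsigned int d : 13;
};

/* Several consecutive updates to fields that share one 32-bit word.
   Field d is deliberately left untouched, so the compiler has to
   preserve its bits whichever access width it picks.  */
void update_fields(struct flags *f)
{
    f->a = 5;    /* fits in 3 bits */
    f->b = 100;  /* fits in 7 bits */
    f->c = 300;  /* fits in 9 bits */
}
```

Compiling this with -O2 -S for each setting and diffing the output for
update_fields should show whether the stores are narrow or word-sized.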

> I am interested to see which CPU 2000 benchmark got affected.
> 

Benchmark                -O2     -O2 -mslow-byte-access  Change
164.gzip                 998     998                     0%
175.vpr                  1118    1102                    -1.43113%
176.gcc                  1535    1534                    -0.0651466%
181.mcf                  819     820                     0.1221%
186.crafty               1543    1541                    -0.129618%
197.parser               966     965                     -0.10352%
252.eon                  1712    1713                    0.0584112%
253.perlbmk              1629    1626                    -0.184162%
254.gap                  1680    1679                    -0.0595238%
255.vortex               1701    1700                    -0.0587889%
256.bzip2                1289    1285                    -0.310318%
300.twolf                1650    1589                    -3.69697%
Est. SPECint_base2000    1346    1340                    -0.445765%
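
The last column appears to be the relative change of the second score against
the first, in percent. A minimal sketch of that arithmetic follows; the
function name is hypothetical and this is not the script that produced the
table.

```c
/* Relative change of `changed` against `base`, in percent.
   A negative result means the score dropped.  */
double percent_change(double base, double changed)
{
    return (changed - base) / base * 100.0;
}
```

For 300.twolf, percent_change(1650, 1589) comes to roughly -3.697, which
matches the row above.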

H.J.