#define SLOW_BYTE_ACCESS in i386.h

Hui-May Chang hm.chang@apple.com
Fri Sep 1 19:21:00 GMT 2006


On Sep 1, 2006, at 11:26 AM, H. J. Lu wrote:

> On Fri, Sep 01, 2006 at 11:03:31AM -0700, Eric Christopher wrote:
>> Hui-May Chang wrote:
>>> I have a question regarding "#define SLOW_BYTE_ACCESS" in i386.h.
>>> It is used in the "get_best_mode" routine, which finds the best mode
>>> to use when referencing a bit field. It is currently set to 0.
>>> Setting it to 1 means "accessing less than a word of memory is no
>>> faster than accessing a word of memory". I experimented with it and
>>> observed a great performance improvement. It is set to 1 for some
>>> other configurations (e.g., rs6000, pa, ia64). Is it always a win to
>>> set it? Is it better to set it for certain i386 architectures?
>>>
>>
>> I'll bet that it's probably advantageous to set it, for a couple of
>> reasons, on the new chips at least:
>>
>> 1) You avoid the problem that got you here of large bitfields needing
>> shift/insert operations.
>>
>> 2) You avoid length changing, since you're mostly operating on things
>> in word mode.
>>
>> However, I'm not an expert on the chip, so I'd suggest posting a
>> small testcase that shows #1 for people, and the resultant code
>> differences, so they can see the difference. Hopefully someone with
>> more Intel experience (like HJ or Jan) can comment on whether or not
>> this is a good general idea for the processor.
>>
>
> I tried this patch and enabled it for Conroe and Nocona. It doesn't
> have much impact on SPEC CPU 2000 on Conroe, and it seems bad on
> Nocona. Maybe we should add -mslow-byte-access to investigate it
> further.
>
>
> H.J.
> ---
> --- gcc/config/i386/i386.c.slow	2006-08-23 17:15:14.000000000 -0700
> +++ gcc/config/i386/i386.c	2006-08-24 12:10:25.000000000 -0700
> @@ -831,6 +831,8 @@ const int x86_cmpxchg16b = m_NOCONA;
>  const int x86_xadd = ~m_386;
>  const int x86_pad_returns = m_ATHLON_K8 | m_GENERIC;
>
> +const int x86_slow_byte_access = 0;
> +
>  /* In case the average insn count for single function invocation is
>     lower than this constant, emit fast (but longer) prologue and
>     epilogue code.  */
> --- gcc/config/i386/i386.h.slow	2006-08-23 17:15:14.000000000 -0700
> +++ gcc/config/i386/i386.h	2006-08-24 12:01:49.000000000 -0700
> @@ -164,6 +164,7 @@ extern const int x86_use_bt;
>  extern const int x86_cmpxchg, x86_cmpxchg8b, x86_cmpxchg16b, x86_xadd;
>  extern const int x86_use_incdec;
>  extern const int x86_pad_returns;
> +extern const int x86_slow_byte_access;
>  extern int x86_prefetch_sse;
>
>  #define TARGET_USE_LEAVE (x86_use_leave & TUNEMASK)
> @@ -219,6 +220,7 @@ extern int x86_prefetch_sse;
>  #define TARGET_USE_INCDEC (x86_use_incdec & TUNEMASK)
>  #define TARGET_PAD_RETURNS (x86_pad_returns & TUNEMASK)
>  #define TARGET_EXT_80387_CONSTANTS (x86_ext_80387_constants & TUNEMASK)
> +#define TARGET_SLOW_BYTE_ACCESS (x86_slow_byte_access & TUNEMASK)
>
>  #define ASSEMBLER_DIALECT (ix86_asm_dialect)
>
> @@ -1840,7 +1842,7 @@ do {							\
>     subsequent accesses occur to other fields in the same word of the
>     structure, but to different bytes.  */
>
> -#define SLOW_BYTE_ACCESS 0
> +#define SLOW_BYTE_ACCESS TARGET_SLOW_BYTE_ACCESS
>
>  /* Nonzero if access to memory by shorts is slow and undesirable.  */
>  #define SLOW_SHORT_ACCESS 0

We got the following request from a customer:

When accessing a 32-bit bitfield on x86, gcc automatically allocates an
8-bit or 16-bit register to manipulate the modified portion of the
bitfield rather than using a whole 32-bit register. This leads to poor
performance when multiple updates to that 32-bit bitfield are
performed, because each modified portion is written back to memory
before the next portion is read. If a sequence contains N
modifications, there will be N loads and N stores of 8- or 16-bit
values rather than a single 32-bit load and a single 32-bit store.

We used bitfields on PowerPC because they were essentially free on that
instruction set. To get around this on Intel, many areas would have to
revert to long series of statements that perform the appropriate
modifications on the entire 32-bit bitfield with shifts, ands, and ors.
If an entire bitfield fits in a register, please allocate a 32-bit
register and have all work performed on that register.

#define  ARRAY_LENGTH  16
union bitfield {
   struct {
     unsigned int field0 : 6;
     unsigned int field1 : 6;
     unsigned int field2 : 6;
     unsigned int field3 : 6;
     unsigned int field4 : 3;
     unsigned int field5 : 4;
     unsigned int field6 : 1;
   } bitfields, bits;
   unsigned int   u32All;
   signed int     i32All;
   float          f32All;
};

typedef struct program_t {
   union bitfield array[ARRAY_LENGTH];
} program;

void foo(program* prog, unsigned int fmt1)
{
   unsigned int shift = 0;
   unsigned int texCount = 0;
   unsigned int i;

   for (i = 0; i < 8; i++)
   {
     prog->array[i].bitfields.field0 = texCount;
     prog->array[i].bitfields.field1 = texCount + 1;
     prog->array[i].bitfields.field2 = texCount + 2;
     prog->array[i].bitfields.field3 = texCount + 3;
     texCount += (fmt1 >> shift) & 0x7;
     shift    += 3;
   }
}

Here are the -O2 assembly outputs before and after enabling it.

Attachments (assembly outputs):

  i386-bitfield4.old.s (917 bytes):
  <http://gcc.gnu.org/pipermail/gcc-patches/attachments/20060901/122f24f5/attachment.obj>

  i386-bitfield4.new.s (807 bytes):
  <http://gcc.gnu.org/pipermail/gcc-patches/attachments/20060901/122f24f5/attachment-0001.obj>

I am interested to see which CPU 2000 benchmarks were affected.

Hui-May Chang







