This is the mail archive of the mailing list for the GCC project.

Re: #define SLOW_BYTE_ACCESS in i386.h

On Sep 1, 2006, at 11:26 AM, H. J. Lu wrote:

On Fri, Sep 01, 2006 at 11:03:31AM -0700, Eric Christopher wrote:
Hui-May Chang wrote:
I have a question regarding "#define SLOW_BYTE_ACCESS" in i386.h. It is
used in "get_best_mode" routine which finds the best mode to use when
referencing a bit field. It is currently set to 0. If it is set to 1,
it means "accessing less than a word of memory is no faster than
accessing a word of memory". I experimented with it and observed great
performance improvement. It is set to 1 for some other configurations
(e.g., rs6000, pa, ia64). Is it always a win to set it? Is it better to
set it for certain i386 architectures?
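
To make the question concrete, here is a toy model of the choice that SLOW_BYTE_ACCESS influences. It is not GCC's get_best_mode: pick_access_bits and its parameters are hypothetical names invented for illustration, and the model assumes a 32-bit word, naturally aligned accesses, and a field that lies entirely within one word.

```c
#include <assert.h>

/* Hypothetical, simplified model of choosing an access width for a
   bit field.  NOT GCC's get_best_mode; it only illustrates the idea.
   bitpos  - offset of the field's first bit within its 32-bit word
   bitsize - width of the field in bits
   Returns the access width in bits (8, 16 or 32).  */
static unsigned pick_access_bits(unsigned bitpos, unsigned bitsize,
                                 int slow_byte_access)
{
    static const unsigned widths[] = { 8, 16, 32 };
    unsigned last = bitpos + bitsize - 1;   /* index of the last bit */
    unsigned i;

    assert(bitsize >= 1 && bitpos + bitsize <= 32);

    /* If sub-word accesses are no faster than word accesses,
       always use the full word.  */
    if (slow_byte_access)
        return 32;

    /* Otherwise use the narrowest naturally aligned unit that
       contains every bit of the field.  */
    for (i = 0; i < 3; i++)
        if (bitpos / widths[i] == last / widths[i])
            return widths[i];

    return 32;  /* not reached for fields within one word */
}
```

In this model, 6-bit fields at bit offsets 0, 6, 12 and 18 get 8-, 16-, 32- and 8-bit accesses when the flag is 0, and 32-bit accesses throughout when it is 1.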

I'll bet it's advantageous to set it, at least on the newer chips, for a couple of reasons:

1) You avoid the problem that got you here of large bitfields needing
shift/insert operations

2) You avoid length changing since you're mostly operating on things in
word mode.

However, I'm not an expert on the chip so I'd suggest posting a small
testcase that shows #1 for people and the resultant code differences so
they can see the difference. Hopefully someone with more Intel
experience (like HJ or Jan) can comment on whether or not this is a good
general idea for the processor.

I tried this patch and enabled it for Conroe and Nocona. It doesn't
have much impact on SPEC CPU 2000 on Conroe and it seems bad on Nocona.
Maybe we should add -mslow-byte-access to investigate it further.

H.J.
---
--- gcc/config/i386/i386.c.slow 2006-08-23 17:15:14.000000000 -0700
+++ gcc/config/i386/i386.c 2006-08-24 12:10:25.000000000 -0700
@@ -831,6 +831,8 @@ const int x86_cmpxchg16b = m_NOCONA;
 const int x86_xadd = ~m_386;
 const int x86_pad_returns = m_ATHLON_K8 | m_GENERIC;

+const int x86_slow_byte_access = 0;
/* In case the average insn count for single function invocation is
lower than this constant, emit fast (but longer) prologue and
epilogue code. */
--- gcc/config/i386/i386.h.slow 2006-08-23 17:15:14.000000000 -0700
+++ gcc/config/i386/i386.h 2006-08-24 12:01:49.000000000 -0700
@@ -164,6 +164,7 @@ extern const int x86_use_bt;
extern const int x86_cmpxchg, x86_cmpxchg8b, x86_cmpxchg16b, x86_xadd;
extern const int x86_use_incdec;
extern const int x86_pad_returns;
+extern const int x86_slow_byte_access;
extern int x86_prefetch_sse;

#define TARGET_USE_LEAVE (x86_use_leave & TUNEMASK)
@@ -219,6 +220,7 @@ extern int x86_prefetch_sse;
#define TARGET_USE_INCDEC (x86_use_incdec & TUNEMASK)
#define TARGET_PAD_RETURNS (x86_pad_returns & TUNEMASK)
#define TARGET_EXT_80387_CONSTANTS (x86_ext_80387_constants & TUNEMASK)
+#define TARGET_SLOW_BYTE_ACCESS (x86_slow_byte_access & TUNEMASK)

#define ASSEMBLER_DIALECT (ix86_asm_dialect)

@@ -1840,7 +1842,7 @@ do {							\
    subsequent accesses occur to other fields in the same word of the
    structure, but to different bytes.  */


 /* Nonzero if access to memory by shorts is slow and undesirable.  */

We got the following request from a customer:

When accessing a 32-bit bitfield on x86, gcc automatically allocates an 8-bit or 16-bit register to manipulate the modified portion of the bitfield rather than using a whole 32-bit register. This leads to poor performance when multiple updates to that 32-bit bitfield are performed, because each modified portion is always written to memory before another portion is read. If a sequence contains N modifications, there will be N loads and N stores of 8-bit or 16-bit values rather than a single 32-bit load and a single 32-bit store.

We used bitfields on PowerPC because they were essentially free on that instruction set. To get around this on Intel, many areas would have to revert to long series of statements that make the appropriate modifications on the entire 32-bit bitfield with shifts, ANDs, and ORs. If an entire bitfield fits in a register, please allocate a 32-bit register and have all work performed on that register.

#define  ARRAY_LENGTH  16
union bitfield {
  struct {
    unsigned int field0 : 6;
    unsigned int field1 : 6;
    unsigned int field2 : 6;
    unsigned int field3 : 6;
    unsigned int field4 : 3;
    unsigned int field5 : 4;
    unsigned int field6 : 1;
  } bitfields, bits;
  unsigned int   u32All;
  signed int     i32All;
  float          f32All;
};

typedef struct program_t {
  union bitfield array[ARRAY_LENGTH];
} program;

void foo(program* prog, unsigned int fmt1)
{
  unsigned int shift = 0;
  unsigned int texCount = 0;
  unsigned int i;

  for (i = 0; i < 8; i++)
  {
    prog->array[i].bitfields.field0 = texCount;
    prog->array[i].bitfields.field1 = texCount + 1;
    prog->array[i].bitfields.field2 = texCount + 2;
    prog->array[i].bitfields.field3 = texCount + 3;
    texCount += (fmt1 >> shift) & 0x7;
    shift    += 3;
  }
}
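
For comparison, the manual workaround the customer describes (one 32-bit load, updates with shifts, ANDs and ORs, one 32-bit store) might look roughly like the sketch below. It is only an illustration: set_field and foo_wordwise are hypothetical names, the function works on an array of plain 32-bit words standing in for the u32All members, and it assumes the layout GCC uses on little-endian x86, with field0 in bits 0-5, field1 in bits 6-11, field2 in bits 12-17 and field3 in bits 18-23. Bit-field layout is implementation-defined, so that assumption does not hold everywhere.

```c
#include <assert.h>
#include <stdint.h>

#define FIELD_BITS 6u     /* width of field0..field3 */
#define FIELD_MASK 0x3Fu  /* six low bits set */

/* Replace 6-bit field number `index` (0..3) in `word` with `value`.
   ASSUMPTION: field N starts at bit N * 6 (GCC, little-endian x86).  */
static uint32_t set_field(uint32_t word, unsigned index, uint32_t value)
{
    unsigned shift = index * FIELD_BITS;

    assert(index < 4);
    word &= ~(FIELD_MASK << shift);         /* clear the old field   */
    word |= (value & FIELD_MASK) << shift;  /* insert the new value  */
    return word;
}

/* Word-wise variant of the loop in foo(): each element is loaded
   once, updated in a register, and stored once.  Bits 24-31
   (field4..field6) are left unchanged.  */
void foo_wordwise(uint32_t *words, unsigned int fmt1)
{
    unsigned int shift = 0;
    unsigned int texCount = 0;
    unsigned int i;

    for (i = 0; i < 8; i++) {
        uint32_t w = words[i];               /* single 32-bit load  */
        w = set_field(w, 0, texCount);
        w = set_field(w, 1, texCount + 1);
        w = set_field(w, 2, texCount + 2);
        w = set_field(w, 3, texCount + 3);
        words[i] = w;                        /* single 32-bit store */
        texCount += (fmt1 >> shift) & 0x7;
        shift    += 3;
    }
}
```

This is the kind of hand-written code the request hopes to avoid by having the compiler do the work on a full 32-bit register.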

Here are the -O2 assembly outputs before and after enabling it.

Attachment: i386-bitfield4.old.s
Description: Binary data

Description: Binary data

I am interested to see which SPEC CPU 2000 benchmarks were affected.

Hui-May Chang
