Re: [PATCH][RFA][PR target/15184] Partial fix for direct byte access on x86


I'm withdrawing the combine_simplify_rtx hunk of this patch.  While cleaning up my improvements for the remaining testcases, I stumbled upon a simpler change which covers all the tests.

What's kind of funny is I'd been staring at the relevant code for a goodly part of the weekend without seeing how easily it could be extended, or that the result doesn't have to match a pattern: combine can split the horrid mess in such a way that we get two matched insns which, when combined with other nearby insns, ultimately collapse into precisely what we want.



We're still going to need the changes to the heuristic to enable 4 insn combinations, as we need to be giving nice big blobs of code to combine_simplify_rtx and its children.
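
For reference, the kind of source behind the dumps below is roughly this (my reconstruction from the RTL, not necessarily the exact testcase from the PR; the function name is made up).  The idea is to replace just the high byte of a 16-bit global, which on x86 ought to end up as a single byte store:

/* Hypothetical reconstruction of the testcase.  */
unsigned short y;

void
set_high_byte (unsigned char c)
{
  y = (y & 0x00ff) | ((unsigned short) c << 8);
}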

It's actually kind of cool to see something like this flow into make_field_assignment:

(set (mem/c:HI (symbol_ref:SI ("y") [flags 0x40] <var_decl 0x7ffff670bcf0 y>) [2 y+0 S2 A16])
    (subreg:HI (ior:SI (zero_extend:SI (mem/c:QI (symbol_ref:SI ("y") [flags 0x40] <var_decl 0x7ffff670bcf0 y>) [2 y+0 S1 A16]))
            (reg:SI 100 [ D.1569 ])) 0))

And make_field_assignment turns it into:

(set (mem/c:QI (const:SI (plus:SI (symbol_ref:SI ("y") [flags 0x40] <var_decl 0x7ffff670bcf0 y>)
                (const_int 1 [0x1]))) [2 y+1 S1 A8])
    (subreg:QI (lshiftrt:SI (reg:SI 100 [ D.1569 ])
            (const_int 8 [0x8])) 0))

Combine then chooses the lshift expression as a split point, emits an insn for the lshift, and substitutes a nice simple reg into that hunk of RTL above...

(set (mem/c:QI (const:SI (plus:SI (symbol_ref:SI ("y") [flags 0x40] <var_decl 0x7ffff670bcf0 y>)
                (const_int 1 [0x1]))) [2 y+1 S1 A8])
    (subreg:QI (reg:SI 103) 0))

ooooohhhhh, that looks perfect.  Now it's just a simple matter of cleanup :-), which is actually kind of fun to watch:

We'll have this after the combine & split step:

(insn 9 8 10 2 (parallel [
            (set (reg:SI 99 [ D.1569 ])
                (ashift:SI (reg:SI 96 [ c ])
                    (const_int 8 [0x8])))
            (clobber (reg:CC 17 flags))
        ]) j.c:33 510 {*ashlsi3_1}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (expr_list:REG_DEAD (reg:SI 96 [ c ])
            (nil))))

(insn 10 9 13 2 (parallel [
            (set (reg:SI 100 [ D.1569 ])
                (and:SI (reg:SI 99 [ D.1569 ])
                    (const_int 65280 [0xff00])))
            (clobber (reg:CC 17 flags))
        ]) j.c:33 380 {*andsi_1}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (expr_list:REG_DEAD (reg:SI 99 [ D.1569 ])
            (nil))))

(insn 13 10 14 2 (parallel [
            (set (reg:SI 103)
                (lshiftrt:SI (reg:SI 100 [ D.1569 ])
                    (const_int 8 [0x8])))
            (clobber (reg:CC 17 flags))
        ]) j.c:33 543 {*lshrsi3_1}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (expr_list:REG_DEAD (reg:SI 100 [ D.1569 ])
            (nil))))

(insn 14 13 0 2 (set (mem/c:QI (const:SI (plus:SI (symbol_ref:SI ("y") [flags 0x40] <var_decl 0x7ffff670bcf0 y>)
                    (const_int 1 [0x1]))) [2 y+1 S1 A8])
        (subreg:QI (reg:SI 103) 0)) j.c:33 93 {*movqi_internal}
     (expr_list:REG_DEAD (reg:SI 103)
        (nil)))

Eventually all those participate in combinations again resulting in just:

(insn 14 13 0 2 (set (mem/c:QI (const:SI (plus:SI (symbol_ref:SI ("y") [flags 0x40] <var_decl 0x7fd247c16cf0 y>)
                    (const_int 1 [0x1]))) [2 y+1 S1 A8])
        (subreg:QI (reg:SI 96 [ c ]) 0)) j.c:33 93 {*movqi_internal}
     (expr_list:REG_DEAD (reg:SI 96 [ c ])
        (nil)))

i.e., movb %al, y+1


Sometimes I malign the combiner, but there are also days when I just have to say "wow".  Not bad, given that its last major revamp was in 1990/1991 (which brought in the splitting code noted above).

Anyway, onward to bootstrapping and testing...


Jeff

On 01/26/15 20:07, Jeff Law wrote:
Segher: I know you're not officially noted as a maintainer or reviewer
for combine.c, but that's something I'd like to change if you're
interested in a larger role.  In the meantime, any feedback you have
would be appreciated.


So the issue mentioned in the BZ is that fairly obvious code sequences
that ought to use simple byte moves are expanding into hideous sequences
(load, store, a couple of bitwise logicals, maybe a shift or extension thrown
in for good measure).

As mentioned in the BZ, one of the issues is that combine is limited in
terms of how many insns it will look at.  As it turns out, that was
addressed not terribly long ago and we can do 4 insn combinations.  With
just a little work in combine.c we can get the desired code for the
first two testcases as well as two of my own.

The first issue is 4 insn combinations are (reasonably) guarded in such
a way as to avoid them if they are unlikely to succeed.  We basically
look at the operands of the 4 insns and try to guess if there's a
reasonable chance a combination would succeed.  If not, no 4 insn
combinations are tried.

So the first part of this patch improves that heuristic.  What we see
with these byte accesses is a pattern like

load word from memory
masking
bit-ior
store word to memory

That's very consistent.  The form of the masking and bit-ior in the
middle varies quite a bit, but consistently across all the tests I've
looked at we have the memory operations in i0 and i3.  It's worth noting
that the memory load might have a zero (or sign?) extension as well and
that the load/store might be using different modes.

So when we see that overall structure, we increase the "goodness" of
trying to combine i0-i3.
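
Something along these lines is the kind of structural check involved (this
is just a sketch of the idea, not the actual patch; the function name and
its exact placement in combine.c are made up):

/* Sketch only: return true if the i0..i3 group looks like the
   load/mask/ior/store shape described above, i.e. i0 loads from
   memory (possibly under an extension) and i3 stores to memory.  */
static bool
looks_like_byte_insert_p (rtx_insn *i0, rtx_insn *i3)
{
  rtx set0 = single_set (i0);
  rtx set3 = single_set (i3);

  if (!set0 || !set3)
    return false;

  rtx src0 = SET_SRC (set0);

  /* The load may be wrapped in a zero or sign extension.  */
  if (GET_CODE (src0) == ZERO_EXTEND || GET_CODE (src0) == SIGN_EXTEND)
    src0 = XEXP (src0, 0);

  return MEM_P (src0) && MEM_P (SET_DEST (set3));
}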

The second change we need is an additional simplification.

If we have

(subreg:M1 (zero_extend:M2 (x)))

where M1 > M2 and both are scalar integer modes, it's advantageous to
strip the SUBREG and instead have a wider extension.  So, as a concrete
example,

(subreg:SI (zero_extend:HI (X:QI)))

will get turned into

(zero_extend:SI (X:QI))


Now this does have an interesting side effect, namely that the bits
between HI and SI now have a defined value whereas they previously had
"don't care" values.  In theory, we lose a bit of flexibility in that
regard, but eliminating the subreg in favor of the wider extension
allows the existing simplification machinery to do a better job and
lowers the cost of the resulting simplified insn.
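
In code the transformation would look something like this (again just a
sketch of the idea, not the actual patch; the helper name is made up and
the real change sits inside the existing simplification machinery):

/* Sketch only: given a SUBREG rtx X whose operand is a narrower
   zero-extension, return the equivalent wider zero-extension, or
   NULL_RTX if the transformation does not apply.  */
static rtx
widen_subregged_zero_extend (rtx x)
{
  if (GET_CODE (x) != SUBREG || !subreg_lowpart_p (x))
    return NULL_RTX;

  machine_mode mode = GET_MODE (x);
  rtx op = SUBREG_REG (x);

  if (GET_CODE (op) == ZERO_EXTEND
      && SCALAR_INT_MODE_P (mode)
      && SCALAR_INT_MODE_P (GET_MODE (op))
      && GET_MODE_PRECISION (mode) > GET_MODE_PRECISION (GET_MODE (op)))
    /* e.g. (subreg:SI (zero_extend:HI (X:QI)) 0)
       -> (zero_extend:SI (X:QI)).  */
    return simplify_gen_unary (ZERO_EXTEND, mode, XEXP (op, 0),
                               GET_MODE (XEXP (op, 0)));

  return NULL_RTX;
}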

The net result is we get a simple movb for the first two testcases.

Getting a movb for the other two testcases in the PR is a bit harder,
but possible.  The patch to handle those testcases needs further cleanup.


Bootstrapped and regression tested on x86_64-unknown-linux-gnu.
Thoughts?  OK for the trunk?



