This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

multi-word operations


[ Retry with compressed data. ]

I thought I would share the results of an afternoon's exploration.

The problem concerns the optimization of multi-word operations, in
particular, mixing single- and double-word operands.  The case study
is on x86.

The executive summary is that I hate SUBREG, and we should banish it
to strange architecture corner cases, if not entirely.

Consider the following two functions

	int f1(unsigned long long foo)
	{
	  return (foo >> 32) != 0;
	}

	int f2(long long foo, unsigned int x)
	{
	  return (foo & x) != 0;
	}

With the current compiler, we get truely bletcherous code:

f1:
        movl 4(%esp),%eax
        movl 8(%esp),%edx
        movl %edx,%eax
        xorl %edx,%edx
        movl %eax,%ecx
        orl %edx,%ecx
        setne %al
        andl $255,%eax
        ret
f2:
        pushl %esi
        pushl %ebx
        movl 20(%esp),%eax
        xorl %edx,%edx
        movl 12(%esp),%ebx
        movl 16(%esp),%esi
        andl %eax,%ebx
        andl %edx,%esi
        movl %ebx,%edx
        movl %esi,%ecx
        movl %edx,%eax
        orl %ecx,%eax
        setne %al
        andl $255,%eax
        popl %ebx
        popl %esi
        ret

The base problem here is that the individual operations are not exposed
to combine, and so very little can be done.  This is, by the by, a
separate problem from the i386.md lossage I'll talk about later.  This
has to do with the fact that combine only gets to see operations on
entire registers.  I.e. from f2, we only see

  (set (reg:DI 24) (reg:DI 24))
    (expr_list:REG_EQUAL (zero_extend:DI (reg/v:SI 22)))

  (set (subreg:SI (reg:DI 25) 1)
       (and:SI (subreg:SI (reg:DI 21) 1)
	       (subreg:SI (reg:DI 24) 1)))

and not the insn immediately preceeding that sets (subreg:SI (reg:DI 24) 1)
to zero.

There are a couple of ways, I saw that this could be fixed.

  (1) Handle all sorts of new special cases, such as zero_extend.
  (2) Extend combine to look through individual subregs.
  (3) Generate multi-word operations without subregs.

I see (1) as being a lot of tedious work with little or no payoff.
Primarily because we'd only get to propogate such data one level,
and then it would vanish.  For instance, in f2, we'd be able to 
streamline the and, but the compare against zero would not benefit
from the information that the high word is known to be zero.

I see (3) as the ultimate goal, but to do it right would be a staggering
amount of work -- more than I was ready to get myself into at the moment. 
But with this we should be able to get the absolute best code, especially
on machines with few registers where we dislike in the extreme to require
two adjacent registers.  I.e. x86.

So I set about with (2), initially only to gauge the level of difficulty.
What I came up with works for the two test cases.  It is kind of gross
though -- so the posting is for illustration only.  What it does:

Modify flow to collect subreg useage in the LOG_LINKS.  These should
really be ordered so as to provide combine with the opportunity to
perform whole-register optimizations first.  As it is, we could miss
the forest for the trees so to speak.

Modify combine in hackish ways not to reject subreg combinations out
of hand.  The worst piece of this is where I could not find a good way
to update combinable_i3pat so as to prevent

  (set (subreg:SI (reg:DI 24) 1) (blah))
  (set (reg:DI 25) (lshiftrt:DI (reg:DI 24) (const_int 32)))

from being combined.  The problem being that the destination of the
first instruction does not appear in the second.  So the combination
simply throws away the first.  My hack was to notice that no substitutions
were made at all into i3 and to disallow the combination if so.  This
is probably an incomplete test.

Also involved in this was "enhancing" (read hacking) several rtlanal
routines to be able to detect when two subregs do not interfere with
one another.

The remaining (clobber (reg:DI 24)) insns were giving the register
allocator fits.  Off the top of my head I could think of nothing after
combine (or even flow) that ought to need them, so I deleted them all. 
I'm fully willing to believe this is wrong, but I also know that I can
produce DImode clobbers for registers that get completely optimized away.

Finally, the i386 back end does stupid things like advertising DImode
operations that it doesn't have.  Things that the middle-end can do in
exactly the same ways, with the huge disadvantage that we don't get to
see what's going on.  The solution being to brutally axe them. 

Anyway, the afternoon's result looks like this

f1:
        cmpl $0,8(%esp)
        setne %al
        andl $255,%eax
        ret
f2:
        movl 12(%esp),%eax
        andl 4(%esp),%eax
        xorl %edx,%edx
        testl %eax,%eax
        setne %al
        andl $255,%eax
        ret

f1 is optimal, f2 nearly so.  Removing the zeroing of edx requires
solution (3), so that it will be subject to the normal dead-code
elimination.

At this point, I believe (3) to be the only Correct solution. 

Inside expressions, exposing all of the pieces to the register allocator,
to combine, and to the other passes that deal less and less well with
subregs, can only be a win.  Certainly we have to care for interfacing
with external entities -- memory and other functions -- but this could be
done more along the lines of CONCAT (or a more potent alternative),
wherein the pieces are immediately availible and open to the optimizer.

The use of subreg for mode switching -- si<->sf and the like -- could
be handled by a new rtl code that is not as overloaded (perhaps LOW_PART).
The use of subreg for truncation or extension (paradoxical subregs)
could either be handled with TRUNCATE and *EXTEND, or again with LOW_PART.

The only case left that I can think of has to do with FP data movement
on hosts in which DFmode or TFmode values must occasionally be manipulated
in SFmode parts, but which support actual arithmetic on the wide form, 
which must be on aligned registers.  I'm thinking here of Sparc.  

(Also know that part of my crusade against subreg is that one can only
express increments of the word size, which fails when you have hard regs
smaller than the word size of the machine.  For example Sparc64 has 32
FP regs that are addressable as 32-bit entities.  So we loose hard on
certain operations.  Eliminating subreg entirely in favor of something
else might fix this.)

Enough babbling.  The patch follows, for your amusement.


r~

d-i386-sr.gz


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]