This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: long long / long long


In article <200109100451.VAA25485@racerx.synopsys.com>,
Joe Buck  <jbuck@synopsys.COM> wrote:
>Frank Klemm writes:
>[ improved long long code sequences]
>> 
>> Interested? Or are 64 bit are uninteresting for benchmarks?
>
>Well, the Linux kernel developers found that they couldn't let gcc
>do long long arithmetic because it did such a poor job, so they do
>it in assembly or in C on pairs of 32 bit values instead.  So at
>least some folks probably wouldn't mind seeing an improvement.

Well, the linux kernel people would also scream very loudly if the
compiler started using floating point for integer divides (Linux uses
-mno-fp-regs on architectures where it is needed/supported, but x86
doesn't even _have_ that flag right now).  In the kernel, we do NOT want
to pollute the (big) FP state, as the kernel doesn't want to
save/restore it all the time. 

Also, in the kernel we avoid things like "long long" divisions like the
plague anyway.  It's going to be slow however you do it, and there's
almost never any reason to do it at all.  It's somewhat more common to
have a 64/32->64 division, and Linux does that in inline assembly for
the (still fairly rare) cases that need it. 
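
For illustration only, here is a rough portable C sketch of what a
64/32->64 division has to do. The function name div64_by_32 is made up
for this example and is not the kernel's interface. The high word is a
plain 32-bit divide; the low word is divided bit by bit here, which is
the part that a single x86 "divl" can do in hardware (the reason for the
inline assembly).

```c
#include <stdint.h>

/*
 * Hypothetical helper: divide a 64-bit value by a 32-bit one using only
 * 32-bit arithmetic. Returns the 64-bit quotient and stores the
 * remainder in *rem. The caller must ensure base != 0.
 */
uint64_t div64_by_32(uint64_t n, uint32_t base, uint32_t *rem)
{
	uint32_t high = (uint32_t)(n >> 32);
	uint32_t low = (uint32_t)n;
	uint32_t q_high = high / base;	/* ordinary 32/32 divide */
	uint32_t r = high % base;	/* invariant: r < base */
	uint32_t q_low = 0;
	int i;

	/* Schoolbook long division of (r:low) by base, one bit per step. */
	for (i = 31; i >= 0; i--) {
		unsigned carry = r >> 31;	/* bit shifted out of r */

		r = (r << 1) | ((low >> i) & 1);
		if (carry || r >= base) {
			r -= base;	/* wraps to the right value if carry is set */
			q_low |= (uint32_t)1 << i;
		}
	}
	*rem = r;
	return ((uint64_t)q_high << 32) | q_low;
}
```

Because r stays below base, each step needs at most one subtraction,
and the low quotient always fits in 32 bits.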

However, at the same time 64-bit ops _are_ getting more and more common,
simply because 32 bits are starting to be a big limitation in things
like disk block numbers (verily, 2 terabytes isn't as big a number as it
used to be, and 32-bit sector offsets are starting to get tight). 
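
The arithmetic behind the 2 terabyte figure can be sketched as follows,
assuming 512-byte sectors (the sector size is an assumption, it is not
stated above). The names SECTOR_SIZE, addressable_bytes and sector_of
are invented for this example.

```c
#include <stdint.h>

/* Assumption: 512-byte sectors. */
#define SECTOR_SIZE 512u

/* Bytes addressable with a sector number 'bits' wide (bits <= 54). */
uint64_t addressable_bytes(unsigned bits)
{
	return ((uint64_t)1 << bits) * SECTOR_SIZE;
}

/* Sector index of a byte offset; it stops fitting in 32 bits at 2^41 bytes. */
uint64_t sector_of(uint64_t byte_offset)
{
	return byte_offset / SECTOR_SIZE;
}
```

With 32-bit sector numbers the limit is 2^32 * 2^9 = 2^41 bytes, and the
first sector past that point no longer fits in a 32-bit variable.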

So..

If gcc developers start looking at double-integer 64-bit things, the
highest priority by far should be making the _simple_ operations and the
spilling faster.  The code generated for many simple 64-bit ops is
horrible because gcc has a very strict notion of what a 64-bit entity is
on a 32-bit architecture.  And that notion doesn't always make much
sense. 

For example, gcc seems to be unable to think of a 64-bit entity as two
almost-independent 32-bit parts, and does some strange register
allocation (I _think_ gcc can't mix and match registers - it seems to
always use fixed pairings, ie eax:edx and ecx:ebx).

See this trivial example to see what I'm talking about:

	unsigned long long a;

	int main(void)
	{
	        a &= ~1ULL;
	}

which really _should_ result in

	main:
		andl $-2,a
		ret

but instead results in

	main:
	        movl    a, %eax
	        andl    $-2, %eax
	        movl    a+4, %edx
	        movl    %eax, a
	        movl    %edx, a+4
	        ret

Notice how gcc loaded the high bits, and stored them again unchanged. 
Stupid.  Also note how gcc did _not_ use the immediate-to-memory format,
even though you'll see it do so if "a" had been just a regular 32-bit
entity... 

It would be much better to actually split up the 64-bit operations into
32-bit operations at a VERY early stage, and then allow them to be
optimized as regular 32-bit operations. So

	a &= ~1ULL;

should be split up early to

	a.high &= ~0UL;
	a.low &= ~1UL;

and then it is trivially simple to notice that the first operation is a
no-op, and the second operation is a perfectly normal 32-bit op that gcc
is well able to optimize to the proper result (for normal 32-bit ops, gcc
does NOT generate the above stupid "load + op + store", but generates a
simple immediate "andl" to memory). 
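
The equivalence can be checked at the C level with a small sketch. The
struct u64_halves and all function names below are hypothetical, chosen
only to mirror the a.high / a.low notation above; this models the idea
and is not how gcc represents anything internally.

```c
#include <stdint.h>

/* Hypothetical two-word view of a 64-bit value. */
struct u64_halves {
	uint32_t low;
	uint32_t high;
};

struct u64_halves split_u64(uint64_t v)
{
	struct u64_halves h;

	h.low = (uint32_t)v;
	h.high = (uint32_t)(v >> 32);
	return h;
}

uint64_t join_u64(struct u64_halves h)
{
	return ((uint64_t)h.high << 32) | h.low;
}

/* The operation as one indivisible 64-bit AND. */
uint64_t clear_bit0_whole(uint64_t a)
{
	return a & ~1ULL;
}

/* The same operation after an early split into two 32-bit ANDs. */
struct u64_halves clear_bit0_split(struct u64_halves a)
{
	a.high &= 0xFFFFFFFFu;	/* AND with all ones: a no-op, can be dropped */
	a.low &= 0xFFFFFFFEu;	/* ordinary 32-bit AND with an immediate */
	return a;
}
```

Once the two halves are separate statements, ordinary 32-bit
simplification is enough to delete the first one.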

Yes, doing the above kind of splitting would mean that gcc _has_ to
understand about the carry flag in eflags, and would obviously require
creating a few new required patterns inside gcc (ie patterns for
"addsi3_c" and "addcsi3" etc to teach gcc about adds that generate carry
and adds that use carry).

But in return you'd get better code generation, and you could kill some
of the existing patterns (ie "adddi3" should just _go_away_).
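
As a C-level model of the idea (not of the RTL patterns themselves), a
64-bit add can be written as one 32-bit add that produces a carry plus
one 32-bit add that consumes it. The function names below are invented
for this sketch and do not correspond to real gcc pattern names.

```c
#include <stdint.h>

/* 32-bit add that also reports the carry out (models "add"). */
uint32_t add32_carry_out(uint32_t x, uint32_t y, unsigned *carry)
{
	uint32_t sum = x + y;	/* wraps modulo 2^32 */

	*carry = sum < x;	/* wrapped iff the result is below an input */
	return sum;
}

/* 32-bit add that consumes a carry in (models "adc"). */
uint32_t add32_carry_in(uint32_t x, uint32_t y, unsigned carry)
{
	return x + y + carry;
}

/* 64-bit add built from the two 32-bit steps above. */
uint64_t add64_via_halves(uint64_t a, uint64_t b)
{
	unsigned carry;
	uint32_t lo = add32_carry_out((uint32_t)a, (uint32_t)b, &carry);
	uint32_t hi = add32_carry_in((uint32_t)(a >> 32),
				     (uint32_t)(b >> 32), carry);

	return ((uint64_t)hi << 32) | lo;
}
```

The only link between the two halves is the carry, which is why the
compiler would have to track it explicitly once the halves are split.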

There are very few cases where you don't want to think of DI as just
2*SI, I suspect.  So doing the split early would probably result in
uglier RTL ("what the heck is this code doing") but better code by
allowing it to spill just one half of the DI, for example.

(Note: you might be able to do part of this by defining adddi3 to be a
define_expand instead of a define_insn, but I think that would still be
late enough that all the optimization passes would not be able to work
with it as well as they should.  It would be a much smaller change,
though, and maybe it might make most of the bletcherousness go away). 

		Linus

