This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: Inefficiencies in large integers


Hello Oleg,

On Sun, Aug 18, 2013 at 6:36 AM, Oleg Endo wrote:
> On Sun, 2013-08-18 at 00:55 -0400, Asm Twiddler wrote:
>> Hello all,
>>
>> I'm not sure whether this has been posted before, but gcc creates
>> slightly inefficient code for large integers in several cases:
>>
>
> I'm not sure what the actual question is.
> Bug reports and enhancement suggestions of that kind usually go to
> bugzilla and you should also specify which compiler version you're
> referring to.

Sorry, I thought it would be better to post here first.
Also, I'm using gcc 4.6.3, but I have also tested with current SVN
(4.9.0 20130817).

> Anyway, I've tried your examples on SH (4.9) which also does 64 bit
> operations with stitched 32 bit ops.
>
>> unsigned long long val;
>>
>> void example1() {
>>     val += 0x800000000000ULL;
>> }
>>
>> On x86 this results in the following assembly:
>> addl $0, val
>> adcl $32768, val+4
>> ret
>
> This is probably because if a target defines plus:DI / minus:DI
> patterns (which is most likely the case, because of carry / borrow
> bit handling peculiarities), these kinds of zero-bits special cases will
> not be handled automatically.
> Another example would be:
>
> unsigned long long example11 (unsigned long long val, unsigned long x)
> {
>   val += (unsigned long long)x << 32;
>   return val;
> }

I see that compiling for x86 emits a plus:DI, but it's split during
the .split2 pass into two plus:SI instructions.
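
For reference, here is a rough C-level sketch of how I read that split
(my own reconstruction, not the actual RTL dump; example1_split, val_lo
and val_hi are made-up names standing in for the two SImode halves of 'val'):

#include <stdint.h>

extern uint32_t val_lo, val_hi;  /* stand-ins for the two SImode halves of 'val' */

void example1_split (void)
{
    uint32_t carry = (val_lo + 0x00000000u) < val_lo;  /* adding 0 can never carry, so this is always 0 */
    val_lo += 0x00000000u;                             /* addl $0, val: a no-op */
    val_hi += 0x00008000u + carry;                     /* adcl $32768, val+4 */
}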

Also I'm not sure if the carry/borrow bit is the problem.
The following code will also generate an unnecessary instruction:

unsigned long long val;

void example12() {
    val |= 0x800000000000ULL;
}

This gives the following assembly:
orl $0, val
orl $32768, val+4
ret

Unlike the plus:DI, the two ior:SI instructions are already created during the .expand pass.
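
As a sketch of what that expand-time split looks like at the C level
(again my own decomposition, not the actual RTL; example12_split, val_lo
and val_hi are made-up names), note that no carry links the two halves,
so the low-word ior is a pure identity operation:

#include <stdint.h>

extern uint32_t val_lo, val_hi;  /* the two SImode halves of 'val' */

void example12_split (void)
{
    val_lo |= 0x00000000u;  /* orl $0, val: identity operation, dead */
    val_hi |= 0x00008000u;  /* orl $32768, val+4: the only useful operation */
}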

>> The first add is unnecessary as it shouldn't modify val or set the carry.
>> This isn't too bad, but compiling for something like AVR results in
>> 8 byte loads, followed by three additions (of the high bytes),
>> followed by 8 byte stores.
>> The compiler doesn't recognize that 5 of those loads and 5 of those
>> stores are unnecessary.
>
> This is probably because of the same or similar reason as mentioned
> above.  I've tried the following:
>
> void example4 (unsigned long long* x)
> {
>   *x |= 1;
> }
>
> and it results in:
>         mov.l   @(4,r4),r0
>         or      #1,r0
>         rts
>         mov.l   r0,@(4,r4)
>
> So I guess the fundamental subreg load/store handling seems to work.
>
>
>> Here is another inefficiency for x86:
>>
>> unsigned long long val = 0;
>> unsigned long small = 0;
>>
>> unsigned long long example1() {
>>     return val | small;
>> }
>>
>> unsigned long long example2() {
>>     return val & small;
>> }
>>
>> The RTL generated for example1 and example2 is very similar until
>> the fwprop1 stage.
>> Since the largest word size on x86 is 4 bytes, each operation is
>> actually split into two.
>> The forward propagator correctly realizes that anding the upper 4
>> bytes results in a zero.
>> However, it doesn't seem to recognize that oring the upper 4 bytes
>> should return val's high word.
>> This problem also occurs in the xor operation, and also when
>> subtracting (val - small).
>
> In my case the double ior:SI and and:SI operations are eliminated in
> the .cse1 pass and the resulting code is optimal.

I've modified
  unsigned long long val = 0;
  unsigned long small = 0;
to
  extern const unsigned long long val;
  extern const unsigned long small;
and now both examples are optimal, presumably because the MEMs are now
MEM_READONLY_P and the check below no longer fires.
The problematic code appears to be this block in fwprop.c:

  if (!(flags & PR_HANDLE_MEM) && MEM_P (x) && !MEM_READONLY_P (x))
    {
      /* If unsafe, change MEMs to CLOBBERs or SCRATCHes (to preserve whether
         they have side effects or not).  */
      *px = (side_effects_p (x)
             ? gen_rtx_CLOBBER (GET_MODE (x), const0_rtx)
             : gen_rtx_SCRATCH (GET_MODE (x)));
      return false;
    }

I see in the RTL that x86 does a zero_extend directly from memory.
This zero_extend is then propagated in the .fwprop1 pass in the
function propagate_rtx_1.
Thus the ior:SI referring to the high word is now oring the zero
extended part of 'small' with the high word of 'val'.
Of course, propagate_rtx_1 recurses until it hits the above block of code.
As a result, if the generated code isn't constant and PR_CAN_APPEAR
isn't set in the flags, then propagate_rtx_1 will return false.
This means that any generated code is discarded.
I believe this only happens when both operands refer to memory and
when the operation doesn't return a constant (as with the ior case).
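
A tiny illustration of that difference (hypothetical names; val_hi
stands for the high SImode word of 'val', and the 0 is the propagated
high word of the zero-extended 'small'):

#include <stdint.h>

extern uint32_t val_hi;

uint32_t hi_and (void) { return val_hi & 0; }  /* folds to the constant 0, so the result is kept */
uint32_t hi_ior (void) { return val_hi | 0; }  /* folds to val_hi, which is still a MEM, so it is discarded */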

The problem is that the above block is returning false because it
thinks we're going to access the propagated memory.
However, since we are only accessing the zero-extended part, the memory is
never actually touched in the propagation.

> My impression is that the stitched multiword add/sub thing could be
> addressed in a target independent way so that it would work for all
> affected targets automatically.
> The other issues seem to be individual target problems.
>
> Cheers,
> Oleg

Thanks for the response.

