Bug 66012 - Sub-optimal 64-bit load is generated instead of zero-extension
Summary: Sub-optimal 64-bit load is generated instead of zero-extension
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 6.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2015-05-05 05:01 UTC by Uroš Bizjak
Modified: 2016-09-21 09:17 UTC
CC: 1 user

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2016-09-21 00:00:00


Description Uroš Bizjak 2015-05-05 05:01:14 UTC
This testcase:

--cut here--
long l[4];

long long test (void)
{
  return (l[0] & (long long) 0xffffffff)
         | ((l[1] & (long long) 0xffffffff) << 32);
}
--cut here--

generates a sub-optimal 64-bit load (the line marked ">>") instead of a zero-extending 32-bit one:

        movl    l(%rip), %edx        # 32-bit load of l[0], zero-extended into %rdx
>>      movq    l+8(%rip), %rax      # 64-bit load of l[1]
        salq    $32, %rax            # only the low 32 bits of l[1] survive the shift
        orq     %rdx, %rax
        ret
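
Since the shift by 32 discards the upper half of l[1] anyway, a zero-extending 32-bit load would suffice. A sketch of the expected sequence (assumed output, not an actual GCC dump):

        movl    l(%rip), %edx
        movl    l+8(%rip), %eax      # 32-bit load; zero-extends into %rax
        salq    $32, %rax
        orq     %rdx, %rax
        ret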

The .optimized dump already lacks the masking of l[1] that would turn into a zero-extension on x86_64:

test ()
{
  long int _2;
  long long int _3;
  long int _4;
  long long int _5;
  long long int _6;

  <bb 2>:
  _2 = l[0];
  _3 = _2 & 4294967295;
  _4 = l[1];
  _5 = _4 << 32;
  _6 = _3 | _5;
  return _6;

}
Comment 1 Jakub Jelinek 2015-05-05 07:12:30 UTC
In GIMPLE that masking is generally useless, though: the left shift by 32 shifts the masked bits away, and without the extra BIT_AND_EXPR the expression is more canonical and shorter.
So presumably during expansion or combine one could derive the zero-extension from the left shift with a large shift count.
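
A minimal self-checking illustration of why the mask is redundant under the shift (a hypothetical test program, not part of the report):

--cut here--
#include <assert.h>

int main (void)
{
  unsigned long long x = 0xdeadbeefcafebabeULL;

  /* The left shift by 32 discards the high 32 bits, so masking them
     off first cannot change the result.  */
  assert (((x & 0xffffffffULL) << 32) == (x << 32));
  return 0;
}
--cut here--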
Comment 2 Andrew Pinski 2015-12-23 23:29:52 UTC
Happens on aarch64 also:
test:
        adrp    x0, l
        add     x1, x0, :lo12:l
        ldr     x1, [x1, 8]          ; 64-bit load of l[1]; 32-bit would suffice
        ldr     w0, [x0, #:lo12:l]   ; 32-bit load of l[0]
        orr     x0, x0, x1, lsl 32
        ret
Comment 3 Andrew Pinski 2015-12-23 23:30:32 UTC
Confirmed.
Comment 4 Andrew Pinski 2016-09-21 09:17:56 UTC
On the trunk we generate one 32-bit load and one 64-bit load (at least for aarch64):
test:
        adrp    x0, l
        add     x1, x0, :lo12:l
        ldr     w0, [x0, #:lo12:l]   ; 32-bit load to w0
        ldr     x1, [x1, 8]          ; 64-bit load to x1
        orr     x0, x0, x1, lsl 32
        ret
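
As on x86_64, the remaining 64-bit load could presumably become a zero-extending 32-bit load as well. A sketch of the fully optimized sequence (assumed output, not an actual GCC dump):

test:
        adrp    x0, l
        add     x1, x0, :lo12:l
        ldr     w0, [x0, #:lo12:l]   ; 32-bit load of l[0]
        ldr     w1, [x1, 8]          ; 32-bit load; zero-extends into x1
        orr     x0, x0, x1, lsl 32
        ret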