This testcase:

--cut here--
long l[4];

long long test (void)
{
  return (l[0] & (long long) 0xffffffff)
         | ((l[1] & (long long) 0xffffffff) << 32);
}
--cut here--

generates a non-optimal 64bit load:

        movl    l(%rip), %edx
>>      movq    l+8(%rip), %rax
        salq    $32, %rax
        orq     %rdx, %rax
        ret

The .optimized dump is already missing the masking of l[1] that would have generated a zero-extending 32bit load on x86_64:

test ()
{
  long int _2;
  long long int _3;
  long int _4;
  long long int _5;
  long long int _6;

  <bb 2>:
  _2 = l[0];
  _3 = _2 & 4294967295;
  _4 = l[1];
  _5 = _4 << 32;
  _6 = _3 | _5;
  return _6;
}
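For comparison, a sketch (not from the report; test2 is just an illustrative name) of a source-level variant that keeps the zero-extension explicit via unsigned 32bit casts. One would expect this to produce two zero-extending 32bit movl loads on x86_64, since movl already clears the upper half of the 64bit register:

--cut here--
long l[4];

long long test2 (void)
{
  /* (unsigned int) truncates to the low 32 bits; converting the
     result back to long long zero-extends, matching the masking
     in the original testcase.  */
  return (unsigned int) l[0]
         | ((long long) (unsigned int) l[1] << 32);
}
--cut here--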
In GIMPLE that masking is generally useless though: the left shift moves the masked bits out of the value anyway, and without the extra BIT_AND_EXPR the expression is more canonical and shorter. So, presumably during expansion or combine one could derive the zero-extension from the left shift with a large shift count.
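A minimal standalone check of that identity (illustrative only; unsigned types are used here so the shift is well defined in C):

--cut here--
#include <assert.h>

int main (void)
{
  /* For a 64bit value, << 32 discards exactly the bits that the
     & 0xffffffff mask would clear, so the mask is dead code.  */
  unsigned long long x = 0x123456789abcdef0ULL;
  assert (((x & 0xffffffffULL) << 32) == (x << 32));
  return 0;
}
--cut here--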
Happens on aarch64 also:

test:
        adrp    x0, l
        add     x1, x0, :lo12:l
        ldr     x1, [x1, 8]
        ldr     w0, [x0, #:lo12:l]
        orr     x0, x0, x1, lsl 32
        ret
Confirmed.
On the trunk we generate one 32bit load and one 64bit load (at least for aarch64):

test:
        adrp    x0, l
        add     x1, x0, :lo12:l
        ldr     w0, [x0, #:lo12:l]    ; 32bit load to w0
        ldr     x1, [x1, 8]           ; 64bit load to x1
        orr     x0, x0, x1, lsl 32
        ret