This testcase:

--cut here--
long l[4];

long long test (void)
{
  return (l[0] & (long long) 0xffffffff)
         | ((l[1] & (long long) 0xffffffff) << 32);
}
--cut here--

generates a non-optimal 64bit load:

        movl    l(%rip), %edx
>>      movq    l+8(%rip), %rax
        salq    $32, %rax
        orq     %rdx, %rax
        ret

The .optimized dump is already missing the masking of l[1] that would have generated a zero-extending 32bit load on x86_64:

test ()
{
  long int _2;
  long long int _3;
  long int _4;
  long long int _5;
  long long int _6;

  <bb 2>:
  _2 = l[0];
  _3 = _2 & 4294967295;
  _4 = l[1];
  _5 = _4 << 32;
  _6 = _3 | _5;
  return _6;
}
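For comparison, a sketch (not from the report; test2 is just an illustrative name) of a source-level variant that keeps the zero-extension explicit via unsigned 32bit casts. One would expect this to produce two zero-extending 32bit movl loads on x86_64, since movl already clears the upper half of the 64bit register:

--cut here--
long l[4];

long long test2 (void)
{
  /* (unsigned int) truncates to the low 32 bits; converting the
     result back to long long zero-extends, matching the masking
     in the original testcase.  */
  return (unsigned int) l[0]
         | ((long long) (unsigned int) l[1] << 32);
}
--cut here--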
In GIMPLE that masking is generally useless though: the left shift moves the masked bits out of the value anyway, and without the extra BIT_AND_EXPR the expression is more canonical and shorter. So, presumably during expansion or combine one could derive the zero-extension from the left shift with a large shift count.
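A minimal standalone check of that identity (illustrative only; unsigned types are used here so the shift is well defined in C):

--cut here--
#include <assert.h>

int main (void)
{
  /* For a 64bit value, << 32 discards exactly the bits that the
     & 0xffffffff mask would clear, so the mask is dead code.  */
  unsigned long long x = 0x123456789abcdef0ULL;
  assert (((x & 0xffffffffULL) << 32) == (x << 32));
  return 0;
}
--cut here--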
Happens on aarch64 also:

test:
        adrp    x0, l
        add     x1, x0, :lo12:l
        ldr     x1, [x1, 8]
        ldr     w0, [x0, #:lo12:l]
        orr     x0, x0, x1, lsl 32
        ret
Confirmed.
On the trunk we generate one 32bit load and one 64bit load (at least for aarch64):

test:
        adrp    x0, l
        add     x1, x0, :lo12:l
        ldr     w0, [x0, #:lo12:l]    ; 32bit load to w0
        ldr     x1, [x1, 8]           ; 64bit load to x1
        orr     x0, x0, x1, lsl 32
        ret