Bug 55177 - missed optimizations with __builtin_bswap
Summary: missed optimizations with __builtin_bswap
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.7.2
Importance: P3 enhancement
Target Milestone: 12.0
Assignee: Andrew Pinski
URL:
Keywords: missed-optimization, TREE
Depends on: 60669 103228
Blocks:
Reported: 2012-11-02 09:59 UTC by David Woodhouse
Modified: 2023-09-21 13:55 UTC
CC: 5 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2012-11-02 00:00:00


Description David Woodhouse 2012-11-02 09:59:31 UTC
extern int x;

void foo(void)
{
  int a = __builtin_bswap32(x);
  a &= 0x5a5b5c5d;
  x = __builtin_bswap32(a);
}

With GCC 4.7.2 (x86_64 Fedora) this compiles to:
foo:
.LFB0:
	.cfi_startproc
	movl	x(%rip), %eax
	bswap	%eax
	andl	$-1515936861, %eax
	bswap	%eax
	movl	%eax, x(%rip)
	ret
	.cfi_endproc


Surely the actual swap should be optimised out, and the 0x5a5b5c5d constant mask should be swapped instead so we just load, mask with 0x5d5c5b5a and store without any runtime swapping?

(See also PR42586 which may be related)
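
In other words, the expected result is a compile-time swap of the constant only. A minimal sketch of the equivalent code (an illustration, not from the report; the function name is made up):

extern int x;

void foo_expected(void)
{
  /* __builtin_bswap32(0x5a5b5c5d) == 0x5d5c5b5a, so the two runtime
     swaps cancel and only the mask needs swapping, at compile time.  */
  x &= 0x5d5c5b5a;
}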
Comment 1 David Woodhouse 2012-11-02 10:45:52 UTC
We have a similar issue with:

extern void func(void);
int baz(void)
{
	if (__builtin_bswap32(x) & 0x80000)
		func();
}

baz:
.LFB1:
	.cfi_startproc
	movl	x(%rip), %eax
	bswap	%eax
	testl	$524288, %eax
	jne	.L12
	rep
	ret
…

Again there's no need for a bswap followed by testing with a constant.
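
For illustration (not from the report), the equivalent bit test with the constant swapped instead: __builtin_bswap32(0x80000) is 0x800, since the single set bit moves from byte 2 to byte 1.

void baz_expected(void)
{
	/* (bswap32(x) & C) != 0  <=>  (x & bswap32(C)) != 0 */
	if (x & 0x800)
		func();
}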
Comment 2 Eric Botcazou 2012-11-02 10:56:17 UTC
If you don't need to swap, then do not use __builtin_bswap!  The first example is really dumb.  That being said, the second example has some merit and we should do something to optimize it at the RTL level.
Comment 3 David Woodhouse 2012-11-02 17:05:03 UTC
The first example isn't *that* dumb, as a cut-down test case of real code which may look more complex in reality.

If the real code really *is* as simple as my test case, you're right that perhaps we *could* optimise it ourselves by eschewing the normal accessor macros for explicit-endian values, and manually byteswapping the constant instead.

But really, we shouldn't *have* to. The compiler can see what's happening, and it can deal with it a whole lot better than we can, especially when it comes to loads and stores. 

Your argument applies just as well to the second test case. I could just byteswap the constant instead of the variable. But still I shouldn't have to.
Comment 4 Eric Botcazou 2012-11-02 17:45:54 UTC
> The first example isn't *that* dumb, as a cut-down test case of real code 
> which may look more complex in reality.

OK, and I can indeed think of some usage patterns in real life.

> If the real code really *is* as simple as my test case, you're right that
> perhaps we *could* optimise it ourselves by eschewing the normal accessor
> macros for explicit-endian values, and manually byteswapping the constant
> instead.
> 
> But really, we shouldn't *have* to. The compiler can see what's happening, and
> it can deal with it a whole lot better than we can, especially when it comes
> to loads and stores. 

Another category would be comparisons then:

int compare1 (int x)
{
  return __builtin_bswap32 (x) == 0xdeadbeef;
}

int compare2 (int x, int y)
{
  return __builtin_bswap32 (x) == __builtin_bswap32 (y);
}

Will have a look shortly.
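
For reference, the expected folds here (a sketch, not from the thread): a byte swap is invertible, so it can be dropped from both sides of an equality, or moved onto a constant operand at compile time.

int compare1_expected (int x)
{
  /* The swap moves to the constant: bswap32(0xdeadbeef) == 0xefbeadde.  */
  return x == (int) 0xefbeadde;
}

int compare2_expected (int x, int y)
{
  /* bswap32 is a bijection, so bswap32(x) == bswap32(y) iff x == y.  */
  return x == y;
}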
Comment 5 David Woodhouse 2012-11-02 19:41:28 UTC
Indeed. Bear in mind that sometimes we *hide* the actual variable (by prefixing its name or putting it in a small struct of its own), just to *force* people to use the appropriate byte-swapping accessor functions/macros. And with good reason, because people will often forget to handle the endianness issues otherwise.

So what you describe as 'really dumb' is actually something that we *force* people to do. We'd be much worse off without it.

Of course, if we could just mark the variable with __attribute__((bigendian)) then perhaps it would all JustWork™ ... but that's a separate discussion ☹
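
A sketch of that idiom (simplified, with illustrative names; not from the thread), assuming a little-endian host:

typedef struct { unsigned int raw; } be32;	/* value stored big-endian */

static inline unsigned int be32_to_cpu(be32 v)
{
	return __builtin_bswap32(v.raw);
}

static inline be32 cpu_to_be32(unsigned int v)
{
	be32 r = { __builtin_bswap32(v) };
	return r;
}

Because the raw member is only reachable through the struct, callers cannot forget the swap; this is exactly the pattern that produces back-to-back bswap calls for the compiler to cancel.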
Comment 6 Eric Botcazou 2012-11-02 21:59:27 UTC
> So what you describe as 'really dumb' is actually something that we *force*
> people to do. We'd be much worse off without it.

'really dumb' applied only to the example though, not to your design decisions which are probably reasonable, even if they yield these useless byte-swapping primitives in some cases.
Comment 7 David Woodhouse 2012-11-08 14:29:37 UTC
Linux kernel patch to use the builtins at http://marc.info/?l=linux-arch&m=135212414925921&w=2

I think I have the GCC version checks right there, for the availability of the builtins? That is, __builtin_bswap32() and __builtin_bswap64() since GCC 4.4, and __builtin_bswap16() since GCC 4.6 on PowerPC or GCC 4.9 on other platforms?
Comment 8 Eric Botcazou 2012-11-08 16:14:23 UTC
> I think I have the GCC version checks right there, for the availability of the
> builtins? That is, __builtin_bswap32() and __builtin_bswap64() since GCC 4.4,
> and __builtin_bswap16() since GCC 4.6 on PowerPC or GCC 4.9 on other platforms?

4.8 on other platforms.
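
Folding that correction in, a sketch of the resulting guard (the macro names are illustrative, not the kernel's):

#define GCC_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100)

#if GCC_VERSION >= 40400
# define HAVE_BUILTIN_BSWAP32 1
# define HAVE_BUILTIN_BSWAP64 1
#endif

/* __builtin_bswap16: 4.6 on PowerPC, 4.8 elsewhere.  */
#if GCC_VERSION >= 40800 || (defined(__powerpc__) && GCC_VERSION >= 40600)
# define HAVE_BUILTIN_BSWAP16 1
#endif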
Comment 9 David Woodhouse 2013-03-08 12:11:24 UTC
This is now enabled in the Linux kernel.

Core patch: http://git.kernel.org/linus/cf66bb93 (in v3.8 but does nothing there)
x86:        http://git.kernel.org/linus/83a57a4d (will be in v3.9)
PowerPC:    http://git.kernel.org/linus/fe3955cb (will be in v3.9)

There's an ARM patch which looks like it missed the boat for 3.9 but should hopefully be in v3.10: https://lkml.org/lkml/2013/2/22/475

On ARM, where we have no load-and-swap or store-and-swap primitives, we are already seeing a tiny win. But fixing up more of the missing optimisations would be good.
Comment 10 Andrew Pinski 2013-03-08 15:18:35 UTC
Eric,
  I have some patches to improve __builtin_bswap* that I can line up for 4.9.  Do you mind if I take this bug?  They are located on the pinskia/bytewiseunop branch in git:
http://gcc.gnu.org/git/?p=gcc.git;a=shortlog;h=refs/heads/pinskia/bytewiseunop
Comment 11 Eric Botcazou 2013-03-09 00:05:46 UTC
> Eric,
>   I have some patches to improve __builtin_bswap* that I can line up for 4.9. 
> Do you mind if I take this bug?  They are located on the pinskia/bytewiseunop
> branch in git:
> http://gcc.gnu.org/git/?p=gcc.git;a=shortlog;h=refs/heads/pinskia/bytewiseunop

I have something too (albeit preliminary), but go ahead.
Comment 12 Andrew Pinski 2013-03-09 00:16:23 UTC
Mine for 4.9.  I will also post the bswap for mipsr2 at the same time as I post these.
Comment 13 Andreas Krebbel 2014-04-04 12:40:03 UTC
Is that one fixed by the following change?
2013-05-24  Eric Botcazou  <ebotcazou@adacore.com>

	PR rtl-optimization/55177
	* simplify-rtx.c (simplify_unary_operation_1) <NOT>: Deal with BSWAP.
	(simplify_byte_swapping_operation): New.
	(simplify_binary_operation_1): Call it for AND, IOR and XOR.
	(simplify_relational_operation_1): Deal with BSWAP.
Comment 14 Andrew Pinski 2014-11-22 22:11:11 UTC
The original testcase in comment #0 is still not optimized at the gimple level due to extra casts.  If I use unsigned instead of int, the testcase is optimized at the gimple level.
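
For concreteness, the unsigned variant that does fold (a sketch based on the comment above; the variable is renamed so the snippet stands alone):

extern unsigned int ux;

void foo_unsigned(void)
{
  /* With unsigned arithmetic throughout, no int<->unsigned casts appear
     in the gimple, so the bswap folding already fires.  */
  unsigned int a = __builtin_bswap32(ux);
  a &= 0x5a5b5c5d;
  ux = __builtin_bswap32(a);
}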
Comment 15 David Woodhouse 2015-01-23 13:50:54 UTC
More missed optimisations (gcc version 4.9.2 20141101 (Red Hat 4.9.2-1) (GCC))

I see it using movbe for the pointer_abuse() function, but can't see a simple way to make it use movbe *without* killing kittens.

gcc -march=atom -O2 -x c -o - -S - -Wstrict-aliasing <<EOF

struct pkt {
	unsigned char data[10];
};

unsigned or(struct pkt *p)
{
	return (p->data[0] << 24) | (p->data[1] << 16) | (p->data[2] << 8) | p->data[1];
}

unsigned add(struct pkt *p)
{
	return (p->data[0] << 24) + (p->data[1] << 16) + (p->data[2] << 8) + p->data[1];
}

unsigned pointer_abuse(struct pkt *p)
{
	return __builtin_bswap32(*(unsigned *)p->data);
}
EOF
Comment 16 Uroš Bizjak 2015-01-23 14:00:01 UTC
(In reply to David Woodhouse from comment #15)

> {
> 	return (p->data[0] << 24) | (p->data[1] << 16) | (p->data[2] << 8) |
> p->data[1];
> }

p->data[3] ?

> unsigned add(struct pkt *p)
> {
> 	return (p->data[0] << 24) + (p->data[1] << 16) + (p->data[2] << 8) +
> p->data[1];

... and here?
Comment 17 David Woodhouse 2015-01-23 14:08:46 UTC
Er, yes. Sorry, I originally tried it with uint16_t but it wasn't even using movbe for the pointer_abuse() function then, so I made it 32-bit instead. Badly. Come to think of it, the lack of movbe in the 16-bit pointer_abuse() case is potentially another bug worth fixing.

Here's the corrected test case.

/* gcc -march=atom -O2 -c load_be.c -Wstrict-aliasing -o- -S */
struct pkt {
	unsigned char data[10];
};

unsigned or(struct pkt *p)
{
	return (p->data[0] << 24) | (p->data[1] << 16) | (p->data[2] << 8) | p->data[3];
}

unsigned add(struct pkt *p)
{
	return (p->data[0] << 24) + (p->data[1] << 16) + (p->data[2] << 8) + p->data[3];
}

unsigned pointer_abuse(struct pkt *p)
{
	return __builtin_bswap16(*(unsigned short *)p->data);
}
Comment 18 Uroš Bizjak 2015-01-23 14:11:31 UTC
(In reply to David Woodhouse from comment #17)

> unsigned or(struct pkt *p)

gcc 5.0 optimizes the above case ...

> unsigned add(struct pkt *p)

... but not this one.  Probably just the case of missing match.pd pattern.
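
The equivalence the missing pattern would need is that '+' and '|' agree on operands with disjoint set bits. A quick self-check (an illustration, not from the thread):

#include <assert.h>

int main(void)
{
	/* Bytes shifted into non-overlapping positions share no set bits,
	   so addition cannot carry and behaves exactly like bitwise OR.  */
	unsigned a = 0xAAu << 24, b = 0xBBu << 16;
	assert((a & b) == 0);
	assert(a + b == (a | b));
	return 0;
}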
Comment 19 Andrew Pinski 2016-11-30 05:31:23 UTC
Note some of these are optimized now at the RTL level but we should be able to catch some more at the TREE level.  I hope to do some of that over Christmas in time for stage 1.
Comment 20 Andrew Pinski 2020-02-09 01:54:35 UTC
So the original testcase shows there are other missing bitop-related optimizations at the tree level involving conversions.
I have two patches which fix the original issue.
The first patch also fixes:
unsigned foo(unsigned x, int b)
{
  int a = x;
  a &= b;
  x = a;
  x |= b;
  return x;
}
--- CUT ---
The problem here is:
  a_3 = (int) x_2(D);
  a_5 = a_3 & b_4(D);
  x_6 = (unsigned int) a_5;

is not optimized to:
  _7 = (unsigned int) b_4(D);
  x_6 = _7 & x_2(D);

Which shows up in testcase in comment #0:
  _2 = __builtin_bswap32 (x.0_1);
  a_6 = (int) _2;
  a_7 = a_6 & 1515936861;
  a.1_3 = (unsigned int) a_7;
  _4 = __builtin_bswap32 (a.1_3);

Fixing the bitop with convert, we get rid of the two byteswaps in comment #0.

diff --git a/gcc/match.pd b/gcc/match.pd
index 363006e28fd..d0258a19534 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -1393,7 +1393,12 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
           /* Or if the precision of TO is not the same as the precision
              of its mode.  */
           || !type_has_mode_precision_p (type)))
-   (convert (bitop @0 (convert @1))))))
+   (convert (bitop @0 (convert @1)))))
+ (simplify
+  (convert (bitop:c (nop_convert @0) @1))
+  (if (tree_nop_conversion_p (type, TREE_TYPE (@1))
+       && types_match (type, TREE_TYPE (@0)))
+   (bitop @0 (convert @1)))))

 (for bitop (bit_and bit_ior)
      rbitop (bit_ior bit_and)
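
For reference, what the helper testcase above should reduce to once the conversions no longer block the fold (my reading: after a_5 is rewritten as x & (unsigned) b, the absorption rule (x & b) | b --> b applies):

unsigned foo_expected(unsigned x, int b)
{
	(void) x;	/* the whole body folds to a conversion of b */
	return (unsigned) b;
}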
Comment 21 Andrew Pinski 2020-02-09 23:42:29 UTC
(In reply to Andrew Pinski from comment #20)
> diff --git a/gcc/match.pd b/gcc/match.pd

I ran into a testcase regression with my new correct version.  See PR 60669 for that case.
Comment 22 Andrew Pinski 2021-11-15 08:35:33 UTC
(In reply to Andrew Pinski from comment #20)
> So the original testcase shows there are missing other bitop related
> optimizations on the tree level and conversions.
> I have two patches which fix the original issue.
> The first patch also fixes:
> unsigned foo(unsigned x, int b)
> {
>   int a = x;
>   a &= b;
>   x = a;
>   x |= b;
>   return x;
> }
> --- CUT ---
> The problem here is:
>   a_3 = (int) x_2(D);
>   a_5 = a_3 & b_4(D);
>   x_6 = (unsigned int) a_5;
> 
> is not optimized to:
>   _7 = (unsigned int) b_4(D);
>   x_6 = _7 & x_2(D);
> 
> Which shows up in testcase in comment #0:
>   _2 = __builtin_bswap32 (x.0_1);
>   a_6 = (int) _2;
>   a_7 = a_6 & 1515936861;
>   a.1_3 = (unsigned int) a_7;
>   _4 = __builtin_bswap32 (a.1_3);
> 
> Fixing the bitop with convert, we get rid of the two byteswaps in comment #0.

I happened to file that as PR 103228 a few days ago, by accident, while looking into a different issue.  Since PR 60669 is now fixed, I will test my original patch to fix this.
Comment 23 Andrew Pinski 2021-11-15 09:19:52 UTC
(In reply to Andrew Pinski from comment #20)
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 363006e28fd..d0258a19534 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -1393,7 +1393,12 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>            /* Or if the precision of TO is not the same as the precision
>               of its mode.  */
>            || !type_has_mode_precision_p (type)))
> -   (convert (bitop @0 (convert @1))))))
> +   (convert (bitop @0 (convert @1)))))
> + (simplify
> +  (convert (bitop:c (nop_convert @0) @1))
> +  (if (tree_nop_conversion_p (type, TREE_TYPE (@1))
> +       && types_match (type, TREE_TYPE (@0)))
> +   (bitop @0 (convert @1)))))
> 
>  (for bitop (bit_and bit_ior)
>       rbitop (bit_ior bit_and)

Note this part of the patch went in as part of PR 94718.
Comment 24 GCC Commits 2021-11-17 23:40:14 UTC
The trunk branch has been updated by Andrew Pinski <pinskia@gcc.gnu.org>:

https://gcc.gnu.org/g:32221357007666124409ec3ee0d3a1cf263ebc9e

commit r12-5358-g32221357007666124409ec3ee0d3a1cf263ebc9e
Author: Andrew Pinski <apinski@marvell.com>
Date:   Mon Nov 15 09:31:20 2021 +0000

    Fix PR tree-optimization/103228 and 55177: folding of (type) X op CST where type is a nop convert
    
    Currently we fold (type) X op CST into (type) (X op ((type-x) CST)) when the conversion widens
    but not when the conversion is a nop. For the same reason why we move the widening conversion
    (the possibility of removing an extra conversion), we should do the same if the conversion is a
    nop.
    
    Committed as approved with the comment change.
    
            PR tree-optimization/103228
            PR tree-optimization/55177
    
    gcc/ChangeLog:
    
            * match.pd ((type) X bitop CST): Also do this
            transformation for nop conversions.
    
    gcc/testsuite/ChangeLog:
    
            * gcc.dg/tree-ssa/pr103228-1.c: New test.
            * gcc.dg/tree-ssa/pr55177-1.c: New test.
Comment 25 Andrew Pinski 2021-11-17 23:41:12 UTC
Fixed fully in GCC 12.
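
For the record, roughly the x86-64 code one would expect for foo() from comment #0 once the fold applies (a sketch, not verified compiler output):

foo:
	movl	x(%rip), %eax
	andl	$0x5d5c5b5a, %eax	# mask swapped at compile time
	movl	%eax, x(%rip)
	ret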