extern int x;

void foo(void)
{
	int a = __builtin_bswap32(x);
	a &= 0x5a5b5c5d;
	x = __builtin_bswap32(a);
}

With GCC 4.7.2 (x86_64 Fedora) this compiles to:

foo:
.LFB0:
	.cfi_startproc
	movl	x(%rip), %eax
	bswap	%eax
	andl	$1515936861, %eax
	bswap	%eax
	movl	%eax, x(%rip)
	ret
	.cfi_endproc

Surely the actual swap should be optimised out, and the 0x5a5b5c5d constant mask should be swapped instead, so we just load, mask with 0x5d5c5b5a and store without any runtime swapping?

(See also PR42586, which may be related.)
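To spell out the fold being asked for, here is a minimal sketch (function names are mine, purely for illustration) of the equivalence it relies on: byte-swapping commutes with bitwise AND, so both runtime swaps can cancel against a mask swapped at compile time.

```c
#include <stdint.h>

/* bswap32(x) & m  ==  bswap32(x & bswap32(m)): AND operates bytewise and
   bswap merely permutes bytes, so the two swaps in foo_naive cancel once
   the constant mask is swapped instead. */
static uint32_t foo_naive(uint32_t x)
{
    uint32_t a = __builtin_bswap32(x);
    a &= 0x5a5b5c5d;
    return __builtin_bswap32(a);
}

static uint32_t foo_folded(uint32_t x)
{
    return x & 0x5d5c5b5a;   /* 0x5a5b5c5d with its bytes swapped */
}
```

The folded form is what the reporter expects: a plain load, mask, store with no bswap instructions at all.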
We have a similar issue with:

extern void func(void);

int baz(void)
{
	if (__builtin_bswap32(x) & 0x80000)
		func();
}

baz:
.LFB1:
	.cfi_startproc
	movl	x(%rip), %eax
	bswap	%eax
	testl	$524288, %eax
	jne	.L12
	rep ret
…

Again there's no need for a bswap followed by testing with a constant.
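The same identity applies here (again, my own illustration, not from the report): testing a swapped value against a mask only asks whether certain bits are set, so the swap can be folded into the constant.

```c
#include <stdint.h>

/* (bswap32(x) & C) != 0  is equivalent to  (x & bswap32(C)) != 0,
   because bswap only permutes bytes.  For C = 0x00080000 the swapped
   mask is 0x00000800, so no runtime bswap is needed before the test. */
static int baz_naive(uint32_t x)
{
    return (__builtin_bswap32(x) & 0x80000) != 0;
}

static int baz_folded(uint32_t x)
{
    return (x & 0x00000800) != 0;   /* __builtin_bswap32(0x80000) */
}
```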
If you don't need to swap, then do not use __builtin_bswap! The first example is really dumb. That being said, the second example has some merit and we should do something to optimize it at the RTL level.
The first example isn't *that* dumb, as a cut-down test case of real code which may look more complex in reality. If the real code really *is* as simple as my test case, you're right that perhaps we *could* optimise it ourselves by eschewing the normal accessor macros for explicit-endian values, and manually byteswapping the constant instead. But really, we shouldn't *have* to. The compiler can see what's happening, and it can deal with it a whole lot better than we can, especially when it comes to loads and stores. Your argument applies just as well to the second test case. I could just byteswap the constant instead of the variable. But still I shouldn't have to.
> The first example isn't *that* dumb, as a cut-down test case of real code
> which may look more complex in reality.

OK, and I can indeed think of some usage patterns in real life.

> If the real code really *is* as simple as my test case, you're right that
> perhaps we *could* optimise it ourselves by eschewing the normal accessor
> macros for explicit-endian values, and manually byteswapping the constant
> instead.
>
> But really, we shouldn't *have* to. The compiler can see what's happening, and
> it can deal with it a whole lot better than we can, especially when it comes
> to loads and stores.

Another category would be comparisons then:

int compare1 (int x)
{
  return __builtin_bswap32 (x) == 0xdeadbeef;
}

int compare2 (int x, int y)
{
  return __builtin_bswap32 (x) == __builtin_bswap32 (y);
}

Will have a look shortly.
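For both comparison shapes the swap can disappear entirely; a quick sketch of the expected folds (my own illustration, assuming the functions above):

```c
#include <stdint.h>

static int compare1(uint32_t x) { return __builtin_bswap32(x) == 0xdeadbeef; }
static int compare2(uint32_t x, uint32_t y)
{
    return __builtin_bswap32(x) == __builtin_bswap32(y);
}

/* bswap is a bijection and its own inverse, so comparing against a
   constant can compare against the swapped constant instead... */
static int compare1_folded(uint32_t x)
{
    return x == 0xefbeadde;   /* __builtin_bswap32(0xdeadbeef) */
}

/* ...and swapping both sides of an equality changes nothing at all. */
static int compare2_folded(uint32_t x, uint32_t y)
{
    return x == y;
}
```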
Indeed. Bear in mind that sometimes we *hide* the actual variable (by prefixing its name or putting it in a small struct of its own), just to *force* people to use the appropriate byte-swapping accessor functions/macros. And with good reason, because people will often forget to handle the endianness issues otherwise. So what you describe as 'really dumb' is actually something that we *force* people to do. We'd be much worse off without it. Of course, if we could just mark the variable with __attribute__((bigendian)) then perhaps it would all JustWork™ ... but that's a separate discussion ☹
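A minimal sketch of the hiding pattern being described (names and layout are hypothetical, similar in spirit to the Linux kernel's __be32 wrappers; the swap here assumes a little-endian host):

```c
#include <stdint.h>

/* Wrapping the raw storage in a single-member struct means plain
   arithmetic on it won't compile, forcing callers through the
   byte-swapping accessors. */
typedef struct { uint32_t raw; } be32_t;

static inline uint32_t be32_to_cpu(be32_t v)
{
    return __builtin_bswap32(v.raw);   /* assumes little-endian host */
}

static inline be32_t cpu_to_be32(uint32_t v)
{
    return (be32_t){ __builtin_bswap32(v) };
}
```

This is exactly the situation where the builtins end up adjacent to masks, tests, and comparisons, and where the compiler is better placed than the programmer to cancel them.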
> So what you describe as 'really dumb' is actually something that we *force* > people to do. We'd be much worse off without it. 'really dumb' applied only to the example though, not to your design decisions which are probably reasonable, even if they yield these useless byte-swapping primitives in some cases.
Linux kernel patch to use the builtins at http://marc.info/?l=linux-arch&m=135212414925921&w=2 I think I have the GCC version checks right there, for the availability of the builtins? That is, __builtin_bswap32() and __builtin_bswap64() since GCC 4.4, and __builtin_bswap16() since GCC 4.6 on PowerPC or GCC 4.9 on other platforms?
> I think I have the GCC version checks right there, for the availability of the > builtins? That is, __builtin_bswap32() and __builtin_bswap64() since GCC 4.4, > and __builtin_bswap16() since GCC 4.6 on PowerPC or GCC 4.9 on other platforms? 4.8 on other platforms.
This is now enabled in the Linux kernel. Core patch: http://git.kernel.org/linus/cf66bb93 (in v3.8 but does nothing there) x86: http://git.kernel.org/linus/83a57a4d (will be in v3.9) PowerPC: http://git.kernel.org/linus/fe3955cb (will be in v3.9) There's an ARM patch which looks like it missed the boat for 3.9 but should hopefully be in v3.10: https://lkml.org/lkml/2013/2/22/475 On ARM where we have no load-and-swap or store-and-swap primitives we are already seeing a tiny win. But fixing up more of the missing optimisations would be good.
Eric, I have some patches to improve __builtin_bswap* that I can line up for 4.9. Do you mind if I take this bug? They are located on the pinskia/bytewiseunop branch in git: http://gcc.gnu.org/git/?p=gcc.git;a=shortlog;h=refs/heads/pinskia/bytewiseunop
> Eric, > I have some patches to improve __builtin_bswap* that I can line up for 4.9. > Do you mind if I take this bug? They are located on the pinskia/bytewiseunop > branch in git: > http://gcc.gnu.org/git/?p=gcc.git;a=shortlog;h=refs/heads/pinskia/bytewiseunop I have something too (albeit preliminary), but go ahead.
Mine for 4.9. I will also post the bswap for mipsr2 at the same time as I post these.
Is that one fixed with the following?

2013-05-24  Eric Botcazou  <ebotcazou@adacore.com>

	PR rtl-optimization/55177
	* simplify-rtx.c (simplify_unary_operation_1) <NOT>: Deal with BSWAP.
	(simplify_byte_swapping_operation): New.
	(simplify_binary_operation_1): Call it for AND, IOR and XOR.
	(simplify_relational_operation_1): Deal with BSWAP.
The original testcase in comment #0 is still not optimized at the gimple level due to extra casts. If I use unsigned instead of int, the testcase is optimized at the gimple level.
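For reference, the unsigned variant mentioned here, identical to the testcase in comment #0 except that the temporary stays unsigned so no int/unsigned casts separate the two bswaps:

```c
/* Same as comment #0 but with an unsigned temporary; without the nop
   conversions in between, the two __builtin_bswap32 calls cancel at the
   gimple level and the mask is swapped at compile time. */
unsigned int x;

void foo(void)
{
    unsigned int a = __builtin_bswap32(x);
    a &= 0x5a5b5c5d;
    x = __builtin_bswap32(a);
}
```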
More missed optimisations (gcc version 4.9.2 20141101 (Red Hat 4.9.2-1) (GCC)).

I see it using movbe for the pointer_abuse() function, but can't see a simple way to make it use movbe *without* killing kittens.

gcc -march=atom -O2 -x c -o - -S - -Wstrict-aliasing <<EOF
struct pkt {
	unsigned char data[10];
};

unsigned or(struct pkt *p)
{
	return (p->data[0] << 24) | (p->data[1] << 16) | (p->data[2] << 8) | p->data[1];
}

unsigned add(struct pkt *p)
{
	return (p->data[0] << 24) + (p->data[1] << 16) + (p->data[2] << 8) + p->data[1];
}

unsigned pointer_abuse(struct pkt *p)
{
	return __builtin_bswap32(*(unsigned *)p->data);
}
EOF
(In reply to David Woodhouse from comment #15)
> {
>   return (p->data[0] << 24) | (p->data[1] << 16) | (p->data[2] << 8) |
> p->data[1];
> }

p->data[3] ?

> unsigned add(struct pkt *p)
> {
>   return (p->data[0] << 24) + (p->data[1] << 16) + (p->data[2] << 8) +
> p->data[1];

... and here?
Er, yes. Sorry, I originally tried it with uint16_t but it wasn't even using movbe for the pointer_abuse() function then, so I made it 32-bit instead. Badly. Come to think of it, the lack of movbe in the 16-bit pointer_abuse() case is potentially another bug worth fixing.

Here's the corrected test case:

/* gcc -march=atom -O2 -c load_be.c -Wstrict-aliasing -o- -S */
struct pkt {
	unsigned char data[10];
};

unsigned or(struct pkt *p)
{
	return (p->data[0] << 24) | (p->data[1] << 16) | (p->data[2] << 8) | p->data[3];
}

unsigned add(struct pkt *p)
{
	return (p->data[0] << 24) + (p->data[1] << 16) + (p->data[2] << 8) + p->data[3];
}

unsigned pointer_abuse(struct pkt *p)
{
	return __builtin_bswap16(*(unsigned short *)p->data);
}
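For completeness, a strict-aliasing-safe way to get the same load (my suggestion, not from the thread; the result below assumes a little-endian host) is memcpy into a temporary, which GCC folds into a plain load and can combine with the bswap, e.g. into movbe:

```c
#include <string.h>
#include <stdint.h>

struct pkt {
    unsigned char data[10];
};

/* Load a big-endian 32-bit value without casting through a type-punned,
   possibly misaligned pointer.  GCC turns the memcpy into a single load,
   so no kittens are harmed and movbe is still available. */
unsigned load_be32(const struct pkt *p)
{
    uint32_t v;
    memcpy(&v, p->data, sizeof v);
    return __builtin_bswap32(v);   /* little-endian host assumed */
}
```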
(In reply to David Woodhouse from comment #17)
> unsigned or(struct pkt *p)

gcc 5.0 optimizes the above case ...

> unsigned add(struct pkt *p)

... but not this one. Probably just a case of a missing match.pd pattern.
Note some of these are optimized now at the RTL level but we should be able to catch some more at the TREE level. I hope to do some of that over Christmas in time for stage 1.
So the original testcase shows there are other missing bitop-related optimizations involving conversions at the tree level.

I have two patches which fix the original issue. The first patch also fixes:

unsigned foo(unsigned x, int b)
{
  int a = x;
  a &= b;
  x = a;
  x |= b;
  return x;
}
--- CUT ---
The problem here is:
  a_3 = (int) x_2(D);
  a_5 = a_3 & b_4(D);
  x_6 = (unsigned int) a_5;

is not optimized to:
  _7 = (unsigned int) b_4(D);
  x_6 = _7 & x_2(D);

Which shows up in the testcase in comment #0:
  _2 = __builtin_bswap32 (x.0_1);
  a_6 = (int) _2;
  a_7 = a_6 & 1515936861;
  a.1_3 = (unsigned int) a_7;
  _4 = __builtin_bswap32 (a.1_3);

Fixing the bitop with convert, we get rid of the two byteswaps in comment #0.

diff --git a/gcc/match.pd b/gcc/match.pd
index 363006e28fd..d0258a19534 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -1393,7 +1393,12 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 	   /* Or if the precision of TO is not the same as the precision
 	      of its mode.  */
 	   || !type_has_mode_precision_p (type)))
-   (convert (bitop @0 (convert @1))))))
+   (convert (bitop @0 (convert @1)))))
+ (simplify
+  (convert (bitop:c (nop_convert @0) @1))
+  (if (tree_nop_conversion_p (type, TREE_TYPE (@1))
+       && types_match (type, TREE_TYPE (@0)))
+   (bitop @0 (convert @1)))))

 (for bitop (bit_and bit_ior)
      rbitop (bit_ior bit_and)
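A quick sanity check (my own, not part of the PR) that the rewrite is sound: pushing a nop int/unsigned conversion through the bitwise AND never changes the bits, since the conversion only reinterprets them.

```c
/* The shape before the match.pd transform: the AND happens in int,
   surrounded by nop conversions. */
unsigned before(unsigned x, int b)
{
    int a = (int) x;
    a &= b;
    return (unsigned) a;
}

/* The shape after: the conversion moves onto the operand and the AND
   happens directly in unsigned, freeing the bswaps on either side to
   combine. */
unsigned after(unsigned x, int b)
{
    return x & (unsigned) b;
}
```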
(In reply to Andrew Pinski from comment #20)
> diff --git a/gcc/match.pd

I ran into a testcase regression with my new, corrected version. See PR 60669 for that case.
(In reply to Andrew Pinski from comment #20)
> So the original testcase shows there are missing other bitop related
> optimizations on the tree level and conversions.
> I have two patches which fix the original issue.
> The first patch also fixes:
> unsigned foo(unsigned x, int b)
> {
>   int a = x;
>   a &= b;
>   x = a;
>   x |= b;
>   return x;
> }
> --- CUT ---
> The problem here is:
>   a_3 = (int) x_2(D);
>   a_5 = a_3 & b_4(D);
>   x_6 = (unsigned int) a_5;
>
> is not optimized to:
>   _7 = (unsigned int) b_4(D);
>   x_6 = _7 & x_2(D);
>
> Which shows up in testcase in comment #0:
>   _2 = __builtin_bswap32 (x.0_1);
>   a_6 = (int) _2;
>   a_7 = a_6 & 1515936861;
>   a.1_3 = (unsigned int) a_7;
>   _4 = __builtin_bswap32 (a.1_3);
>
> Fixing the bitop with convert, we get rid of the two byteswaps in comment #0.

I happened to file that as PR 103228 a few days ago, though only by accident while looking into a different issue. Since PR 60669 is now fixed, I will test my original patch to fix this.
(In reply to Andrew Pinski from comment #20) > diff --git a/gcc/match.pd b/gcc/match.pd > index 363006e28fd..d0258a19534 100644 > --- a/gcc/match.pd > +++ b/gcc/match.pd > @@ -1393,7 +1393,12 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > /* Or if the precision of TO is not the same as the precision > of its mode. */ > || !type_has_mode_precision_p (type))) > - (convert (bitop @0 (convert @1)))))) > + (convert (bitop @0 (convert @1))))) > + (simplify > + (convert (bitop:c (nop_convert @0) @1)) > + (if (tree_nop_conversion_p (type, TREE_TYPE (@1)) > + && types_match (type, TREE_TYPE (@0))) > + (bitop @0 (convert @1))))) > > (for bitop (bit_and bit_ior) > rbitop (bit_ior bit_and) Note this part of the patch went in as part of PR 94718.
The trunk branch has been updated by Andrew Pinski <pinskia@gcc.gnu.org>: https://gcc.gnu.org/g:32221357007666124409ec3ee0d3a1cf263ebc9e commit r12-5358-g32221357007666124409ec3ee0d3a1cf263ebc9e Author: Andrew Pinski <apinski@marvell.com> Date: Mon Nov 15 09:31:20 2021 +0000 Fix PR tree-optimization/103228 and 103228: folding of (type) X op CST where type is a nop convert Currently we fold (type) X op CST into (type) (X op ((type-x) CST)) when the conversion widens but not when the conversion is a nop. For the same reason why we move the widening conversion (the possibility of removing an extra conversion), we should do the same if the conversion is a nop. Committed as approved with the comment change. PR tree-optimization/103228 PR tree-optimization/55177 gcc/ChangeLog: * match.pd ((type) X bitop CST): Also do this transformation for nop conversions. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/pr103228-1.c: New test. * gcc.dg/tree-ssa/pr55177-1.c: New test.
Fixed fully in GCC 12.