User account creation filtered due to spam.

Bug 55181 - [5/6/7/8 Regression] Expensive shift loop where a bit-testing instruction could be used
Summary: [5/6/7/8 Regression] Expensive shift loop where a bit-testing instruction cou...
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.7.0
: P4 normal
Target Milestone: 5.5
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2012-11-02 16:05 UTC by Georg-Johann Lay
Modified: 2016-08-04 07:57 UTC (History)
4 users (show)

See Also:
Host:
Target:
Build:
Known to work: 4.6.2
Known to fail: 4.7.2
Last reconfirmed: 2012-11-02 00:00:00


Attachments
foo.c Test case (102 bytes, text/plain)
2012-11-02 16:11 UTC, Georg-Johann Lay
Details
foo.s (from 4.8) (479 bytes, text/plain)
2012-11-02 16:13 UTC, Georg-Johann Lay
Details
foo.s (4.6.2, good) (344 bytes, application/octet-stream)
2012-11-02 16:15 UTC, Georg-Johann Lay
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Georg-Johann Lay 2012-11-02 16:05:57 UTC
The following C code:

unsigned char lfsr (unsigned long number)
{
  unsigned char b = 0;
  if (number & (1L << 29)) b++;
  if (number & (1L << 13)) b++;

  return b;
}

compiles to a right shift 29 bits of number which is very expensive because AVR has no barrel shifter.  Instead, a bit-testing instruction could be used which takes just a few cycles and not more than 100 like with the right shift.

4.6.2 uses a bit testing instruction.

== Command line ==
Comment 1 Georg-Johann Lay 2012-11-02 16:10:05 UTC
== Command Line ==

$ avr-gcc -S foo.c -Os -dp

with avr-gcc

GCC: (GNU) 4.8.0 20120915 (experimental)

Set component to "other", dunno if avr backend is the culprit or the middle-end disregarding costs.
Comment 2 Georg-Johann Lay 2012-11-02 16:11:40 UTC
Created attachment 28597 [details]
foo.c Test case
Comment 3 Georg-Johann Lay 2012-11-02 16:13:18 UTC
Created attachment 28598 [details]
foo.s (from 4.8)

Assembler output that shows the expensive shift loop in insn 44.
Comment 4 Georg-Johann Lay 2012-11-02 16:15:16 UTC
Created attachment 28599 [details]
foo.s (4.6.2, good)

Assembler output from 4.6.2 that generates a bit test instruction.
Comment 5 Steven Bosscher 2013-03-05 23:45:38 UTC
At trunk r196410, the .164t.optimized dump looks like this:

lfsr (long unsigned int number)
{
  unsigned char b;
  long unsigned int _4;
  long unsigned int _5;
  _Bool _8;

  <bb 2>:
  _4 = number_3(D) & 536870912;
  _8 = _4 != 0;
  b_10 = (unsigned char) _8;
  _5 = number_3(D) & 8192;
  if (_5 != 0)
    goto <bb 3>;
  else
    goto <bb 4>;

  <bb 3>:
  b_6 = b_10 + 1;

  <bb 4>:
  # b_2 = PHI <b_10(2), b_6(3)>
  return b_2;

}

and exact_log2 could be used to identify the and-instruction as
a bit test instruction:
(gdb) p exact_log2(536870912)
$1 = 29
(gdb) 

AFAIK there are no named patterns for bit tests, perhaps there should be.
(http://gcc.gnu.org/onlinedocs/gccint/Standard-Names.html#Standard-Names)

So at least on x86_64 the test gets expanded as a shift:

    6: {r65:DI=r64:DI 0>>0x1d;clobber flags:CC;}
    7: {r59:QI=r65:DI#0&0x1;clobber flags:CC;}
    8: {r66:DI=r64:DI&0x2000;clobber flags:CC;}
    9: flags:CCZ=cmp(r66:DI,0)

i.e. also far from optimal.
Comment 6 Richard Biener 2013-03-06 08:51:30 UTC
Note that a bit-test instruction can only be used if the and only feeds a
comparison against zero.  Not sure how canonical bit-test patterns should
look like, but combine should be able to work this out, no?  Unless we
mess things up earlier during generic expansion (which seems to be the
case)
Comment 7 Richard Biener 2013-04-11 07:59:32 UTC
GCC 4.7.3 is being released, adjusting target milestone.
Comment 8 Oleg Endo 2013-12-16 10:28:09 UTC
(In reply to Richard Biener from comment #6)
> Note that a bit-test instruction can only be used if the and only feeds a
> comparison against zero.  Not sure how canonical bit-test patterns should
> look like, but combine should be able to work this out, no?  Unless we
> mess things up earlier during generic expansion (which seems to be the
> case)

A single bit test pattern such as (taken from SH)

(define_insn "tstsi_t"
  [(set (reg:SI T_REG)
	(eq:SI (and:SI (match_operand:SI 0 "logical_operand" "%z,r")
		       (match_operand:SI 1 "logical_operand" "K08,r"))
	       (const_int 0)))]

is not enough.  If the second operand is a constant combine will try out various variations (canonicalized bit extractions) such as zero_extend or

  [(set (reg:SI T_REG)
	 (and:SI (not:SI (match_operand:SI 0 "logical_operand" "z"))
		 (const_int 1)))]
Comment 9 Oleg Endo 2014-03-02 18:08:18 UTC
The first "if (...) b++;" is transformed to a bit extraction (right shift + and), because the result is either b = 0 or b = 1.
The second "if (...) b++" uses an and + zero-compare + branch around add.
The and + zero-compare are then combined to a bit test insn.

The first bit extraction could be turned into a bit test followed by a test result store by implementing the according zero_extract combine pattern:

(set (reg:SI 169)
    (zero_extract:SI (reg/v:SI 165 [ number ])
        (const_int 1 [0x1])
        (const_int 29 [0x1d])))

Although doing so resulted in problems when matching bit test insns, if I remember correctly.

The second bit test + branch + add is a bit more difficult, as it is always expanded as a branch + add.

On SH there are multiple minimal sequences, depending on the context / surrounding code.  The following branchless variant could used if the tested reg dies after the tests and the whole thing is not inside a (inner) loop:

        mov     r4,r0                // r0 = number
        shlr8   r0                   // r0 = r0 >> 8 (logical shift)
        tst     #(1 << (13-8)),r0    // T = (r0 & (1 << (13-8))) == 0
        shlr8   r0                   // r0 = r0 >> 8 (logical shift)
        movrt   r1                   // r1 = !T
        tst     #(1 << (29-16)),r0   // T = (r0 & (1 << (26-16))) == 0
        movrt   r0                   // r0 = !T
        rts
        add     r1,r0                // r0 = r0 + r1

If the code is in a loop, it's more efficient to load constants (which might require a constant pool on SH):

        mov.l   (1 << 13),r1
        mov.l   (1 << 29),r2
        mov     #0,r0
        ...
loop:
        ...
        // r4 = number from somewhere
        tst     r1,r4               // T = (r4 & (1 << 13)) == 0
        movrt   r3                  // r3 = !T
        tst     r2,r4               // T = (r4 & (1 << 29)) == 0
        add     r3,r0               // r0 = r0 + r3
        movrt   r3                  // r3 = !T
        add     r3,r0
        ...
        <cbranch loop>
        

Using shift + and is usually worse on SH for these kind of sequences.

I've tried adding the standard name pattern extzv<mode> but it doesn't seem to be used neither for the first if (...) nor for the second during RTL expansion.

Maybe if-convert could be taught to transform the second if (...) to a zero_extract as well.  But probably it's better to catch this earlier.
Comment 10 Oleg Endo 2014-03-02 18:52:47 UTC
(In reply to Oleg Endo from comment #9)
> 
> Maybe if-convert could be taught to transform the second if (...) to a
> zero_extract as well.  But probably it's better to catch this earlier.

... which would make the original example

unsigned char lfsr (unsigned long number)
{
  unsigned char b = 0;
  if (number & (1L << 29)) b++;
  if (number & (1L << 13)) b++;

  return b;
}

produce the same code as

unsigned char lfsr (unsigned long number)
{
  return ((number & (1L << 29)) != 0) + ((number & (1L << 13)) != 0);
}
Comment 11 Richard Biener 2014-06-12 13:43:46 UTC
The 4.7 branch is being closed, moving target milestone to 4.8.4.
Comment 12 Jakub Jelinek 2014-12-19 13:27:04 UTC
GCC 4.8.4 has been released.
Comment 13 Richard Biener 2015-06-23 08:16:29 UTC
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
Comment 14 Jakub Jelinek 2015-06-26 19:53:13 UTC
GCC 4.9.3 has been released.
Comment 15 Richard Biener 2016-08-03 10:58:18 UTC
GCC 4.9 branch is being closed
Comment 16 Georg-Johann Lay 2016-08-04 07:51:25 UTC
Author: gjl
Date: Thu Aug  4 07:50:53 2016
New Revision: 239116

URL: https://gcc.gnu.org/viewcvs?rev=239116&root=gcc&view=rev
Log:
	PR 55181
	* config/avr/avr.md: New pattern to work around do_store_flag
	generating shift instructions for bit extractions.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/avr/avr.md
Comment 17 Georg-Johann Lay 2016-08-04 07:57:14 UTC
For now and for avr, hacked around the issue by supplying a pattern for the extract...