Bug 31695 - __builtin_ctzll slower than 2*__builtin_ctz
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 4.1.1
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2007-04-25 09:16 UTC by Jörg Richter
Modified: 2008-07-04 23:30 UTC
CC List: 2 users

See Also:
Host:
Target: i686-pc-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed: 2007-04-25 15:38:21


Description Jörg Richter 2007-04-25 09:16:58 UTC
int func1( unsigned long long val )
{
  return __builtin_ctzll( val );
}

int func2( unsigned long long val )
{
  unsigned lo = (unsigned)val;
  return lo ? __builtin_ctz(lo) : __builtin_ctz(unsigned(val>>32)) + 32;
}

func1 is more than twice as slow as func2, but it should be at least as fast as func2.

__builtin_ctzll is not expanded inline like __builtin_ctz.
Comment 1 Richard Biener 2007-04-25 15:38:21 UTC
That is because it calls into libgcc, and does so without a tail call:

_Z5func1y:
.LFB2:
        pushl   %ebp
.LCFI2:
        movl    %esp, %ebp
.LCFI3:
        subl    $24, %esp
.LCFI4:
        movl    8(%ebp), %eax
        movl    12(%ebp), %edx
        movl    %eax, (%esp)
        movl    %edx, 4(%esp)
        call    __ctzdi2
        leave
        ret

libgcc implements it as

int
__ctzDI2 (UDWtype x)
{
  const DWunion uu = {.ll = x};
  UWtype word;
  Wtype ret, add;

  if (uu.s.low)
    word = uu.s.low, add = 0;
  else
    word = uu.s.high, add = W_TYPE_SIZE;

  count_trailing_zeros (ret, word);
  return ret + add;
}

(count_trailing_zeros is expanded to asm bsfl on x86, that's ok)

The question remains why we don't tail-call.  We could also expand the
long long version inline.
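
For illustration, here is a minimal sketch of what an inline expansion would compute on a 32-bit target. It uses the same low-word/high-word split as the libgcc routine and func2 above. The names ctz64_split and check_ctz64_split are hypothetical and not part of GCC or libgcc. The sketch assumes unsigned int is 32 bits wide, as on i686, and a nonzero input. It is not the code GCC generates.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not a GCC or libgcc function.
// Assumes unsigned int is 32 bits (true on i686-pc-linux-gnu).
// Like __builtin_ctzll, the result is undefined for val == 0.
inline int ctz64_split(std::uint64_t val)
{
    const std::uint32_t lo = static_cast<std::uint32_t>(val);
    if (lo != 0)
        return __builtin_ctz(lo);        // lowest set bit is in the low word

    const std::uint32_t hi = static_cast<std::uint32_t>(val >> 32);
    return __builtin_ctz(hi) + 32;       // low word is empty: skip its 32 bits
}

// Compares the sketch with the builtin for every single-bit input.
inline void check_ctz64_split()
{
    for (int bit = 0; bit < 64; ++bit) {
        const std::uint64_t v = std::uint64_t{1} << bit;
        assert(ctz64_split(v) == bit);
        assert(ctz64_split(v) == __builtin_ctzll(v));
    }
}
```

Calling check_ctz64_split() exercises both branches. Each path ends in a single 32-bit count, which on x86 maps to one bsfl, so the inline form avoids the call and the stack setup shown in the assembly above.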