[Bug tree-optimization/53243] New: Use vector comparisons for if cascades
drepper.fsp at gmail dot com
gcc-bugzilla@gcc.gnu.org
Sat May 5 04:06:00 GMT 2012
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53243
Bug #: 53243
Summary: Use vector comparisons for if cascades
Classification: Unclassified
Product: gcc
Version: 4.8.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: tree-optimization
AssignedTo: unassigned@gcc.gnu.org
ReportedBy: drepper.fsp@gmail.com
Target: x86_64-linux
Created attachment 27312
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27312
Test program (compile with and without -DOLD)
The vector units can compare multiple comparisons concurrently but this is not
used automatically in gcc in situations where it can lead to better
performance. Assume a function like this:
void
f(float a)
{
if (a < 1.0)
cb(1);
else if (a < 2.0)
cb(2);
else if (a < 3.0)
cb(3);
else if (a < 4.0)
cb(4);
else if (a < 5.0)
cb(5);
else if (a < 6.0)
cb(6);
else if (a < 7.0)
cb(7);
else if (a < 8.0)
cb(8);
else
++o;
}
In this case the first or second if is not marked with __builtin_expect as
likely, otherwise the following *might* not apply.
The routine can be rewritten for AVX machines like this:
void
f(float a)
{
const __m256 fv = _mm256_set_ps(8.0,7.0,6.0,5.0,4.0,3.0,2.0,1.0);
__m256 r = _mm256_cmp_ps(fv, _mm256_set1_ps(a), _CMP_LT_OS);
int i = _mm256_movemask_ps(r);
asm goto ("bsr %0, %0; jz %l[less1]; .pushsection .rodata; 1: .quad %l2, %l3,
%l4, %l5, %l6, %l7, %l8, %l9; .popsection; jmp *1b(,%0,8)" : : "r" (i) : :
less1, less2, less3, less4, less5, less6, less7, less8, gt8);
__builtin_unreachable ();
less1:
cb(1);
return;
less2:
cb(2);
return;
less3:
cb(3);
return;
less4:
cb(4);
return;
less5:
cb(5);
return;
less6:
cb(6);
return;
less7:
cb(7);
return;
less8:
cb(8);
return;
gt8:
++o;
}
This might not generate the absolute best code but it runs for the test program
which I attach 20% faster.
The same technique can be applied to integer comparisons. More complex if
cascades can also be simplified a lot by masking the integer bsr result
accordingly. This should still be faster.
More information about the Gcc-bugs
mailing list