[Bug tree-optimization/53243] New: Use vector comparisons for if cascades

Sat May 5 04:06:00 GMT 2012

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53243

             Bug #: 53243
           Summary: Use vector comparisons for if cascades
    Classification: Unclassified
           Product: gcc
           Version: 4.8.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: drepper.fsp@gmail.com
            Target: x86_64-linux

Created attachment 27312
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=27312
Test program (compile with and without -DOLD)

The vector units can compare multiple comparisons concurrently but this is not
used automatically in gcc in situations where it can lead to better
performance.  Assume a function like this:

void
f(float a)
{
  if (a < 1.0)
    cb(1);
  else if (a < 2.0)
    cb(2);
  else if (a < 3.0)
    cb(3);
  else if (a < 4.0)
    cb(4);
  else if (a < 5.0)
    cb(5);
  else if (a < 6.0)
    cb(6);
  else if (a < 7.0)
    cb(7);
  else if (a < 8.0)
    cb(8);
  else
    ++o;
}

In this case the first or second if is not marked with __builtin_expect as
likely, otherwise the following *might* not apply.

The routine can be rewritten for AVX machines like this:

void
f(float a)
{
  const __m256 fv = _mm256_set_ps(8.0,7.0,6.0,5.0,4.0,3.0,2.0,1.0);
  __m256 r = _mm256_cmp_ps(fv, _mm256_set1_ps(a), _CMP_LT_OS);
  int i = _mm256_movemask_ps(r);
  asm goto ("bsr %0, %0; jz %l[less1]; .pushsection .rodata; 1: .quad %l2, %l3,
%l4, %l5, %l6, %l7, %l8, %l9; .popsection; jmp *1b(,%0,8)" : : "r" (i) : :
less1, less2, less3, less4, less5, less6, less7, less8, gt8);
  __builtin_unreachable ();
 less1:
  cb(1);
  return;
 less2:
  cb(2);
  return;
 less3:
  cb(3);
  return;
 less4:
  cb(4);
  return;
 less5:
  cb(5);
  return;
 less6:
  cb(6);
  return;
 less7:
  cb(7);
  return;
 less8:
  cb(8);
  return;
 gt8:
  ++o;
}

This might not generate the absolute best code but it runs for the test program
which I attach 20% faster.

The same technique can be applied to integer comparisons.  More complex if
cascades can also be simplified a lot by masking the integer bsr result
accordingly.  This should still be faster.