Bug 52056 - Vectorizer cost model is imprecise
Summary: Vectorizer cost model is imprecise
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.6.1
Importance: P3 minor
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
 
Reported: 2012-01-30 19:22 UTC by gccbug
Modified: 2012-07-13 08:52 UTC
CC: 3 users

See Also:
Host:
Target: x86_64-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed: 2012-01-31 00:00:00


Description gccbug 2012-01-30 19:22:35 UTC
GCC 4.6.1 (also in at least 4.5 and 4.4), 64-bit Ubuntu.

The GCC optimizer detects and significantly speeds up some simple inner-loop computations, but small changes to those inner loops can completely change the behavior of the optimizer, losing 50% or more of the code speed.

Example code, simplified down from a hash table computation:

#include <stdio.h>
const int UNROLL=8;
int main()
{
  // static keyword in next line can trigger optimizer behavior!
  static unsigned long accum=0;
  unsigned int sum=0;  

  for (long j=0; j<0x400000000/UNROLL; ++j)			
    for (int unroll=0; unroll<UNROLL; ++unroll)	{
      unsigned long j;
      ++accum;
      // unsigned versus signed shift in next line can trigger optimizer
      j=(accum^(((unsigned long)accum)>>22));

      j*=0x6543210123456789;
      sum+=(unsigned int)(j>>32);
    }
  printf("Sum=%u\n", sum);
}

compile with gcc -O3 -std=gnu99 test.c -o test   (the -std flag is only needed where the compiler's default standard predates C99, because of the for-loop declarations)


The inner loop of shifts, xors, multiplies, and adds is the core computation. The GCC optimizer is very sensitive to whether the shift above is done with a SIGNED or UNSIGNED long. It is also sensitive to whether the "accum" variable is static or not.
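For reference, the signed-shift variant presumably just swaps the cast in the line above, and the "local accum" variant drops the static keyword:

      // signed-shift variant of the same line (one of the four timed cases):
      j = (accum ^ (((long)accum) >> 22));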

Some timings on an i7 980X CPU:

Static accum, signed shift:   11.3 seconds
Static accum, unsigned shift: 24.3 seconds
Local accum, signed shift:    12.1 seconds
Local accum, unsigned shift:  14.4 seconds


Looking at the assembly (-S) output, it's clear that the optimizations are valid and effective, with the dramatic speed increase coming from switching over to SSE math. The program output is correct in all 4 cases.
The only problem is that the optimizer is unable to recognize the same opportunity in all four cases, giving inconsistent performance.


This example is simplified down from the inner loops of some of my code involving Monte Carlo simulation, and it has a significant effect on runtime. The Intel ICC compiler is significantly slower than GCC in most cases on my code; for this specific example, ICC has an execution time of 14.3 seconds for all 4 cases. That 25% speed advantage of GCC is huge to us when we have 500 machines in a cluster running simulations for days!
Comment 1 gccbug 2012-01-30 22:34:37 UTC
While not relevant to gcc itself, it is interesting that clang also has trouble consistently identifying this optimization, but in the opposite way to GCC. For clang, the unsigned-shift code is faster (11.9 seconds) than the signed-shift code (13.3 seconds); GCC's speeds are 24.3 and 11.3 seconds respectively. clang does not show any sensitivity to the static variable declaration.
Comment 2 Jakub Jelinek 2012-01-30 23:16:03 UTC
The signed vs. unsigned long right shift is quite significant, because Intel chips have no vector (SSE/AVX) arithmetic quadword right shift, only logical quadword right shifts (and left shifts); the exception is AMD chips with -mxop, which do support it.
So, with the unsigned long right shift the loop is vectorized, while with the signed long right shift it is not. Clearly in this case the vectorization (at least two elements at a time) isn't beneficial, but the cost model doesn't figure that out. The faster times are without vectorization; you can get the same speed with -O3 -fno-tree-vectorize even with the unsigned shift.
Even AVX can't process more than two elements at a time; only AVX2 will be able to. How fast that loop is on AVX2-capable chips compared to the non-vectorized version remains to be seen.
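To make the missing instruction concrete, here is a minimal C sketch of how an arithmetic (signed) 64-bit right shift can be emulated using only logical shifts, which is roughly what a vectorizer has to synthesize on hardware without a signed quadword shift. This is an assumed emulation strategy for illustration, not necessarily the sequence GCC emits; the helper name asr64 is made up, and it assumes a 64-bit long and a shift count between 1 and 63.

/* Emulate x >> n (arithmetic) using logical shifts only.
   Assumes 64-bit long and 1 <= n <= 63. */
static inline long asr64(long x, int n)
{
  unsigned long ux      = (unsigned long)x;
  unsigned long logical = ux >> n;              /* logical shift: zero-fills the top bits */
  unsigned long sign    = 0UL - (ux >> 63);     /* all-ones mask if x is negative, else 0 */
  return (long)(logical | (sign << (64 - n)));  /* re-insert the sign bits at the top */
}

In practice, as noted above, the non-vectorized path is the faster one here; compiling with gcc -O3 -fno-tree-vectorize test.c -o test reproduces the faster timing even for the unsigned-shift version.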
Comment 3 Richard Biener 2012-07-13 08:52:06 UTC
Link to vectorizer missed-optimization meta-bug.