Starting an OpenMP parallel section is extremely slow on a hyper-threaded Nehalem

Tim Prince
Thu Feb 11 15:02:00 GMT 2010

Sorry for getting confused about Ubuntu version dates.
Needing to set GOMP_CPU_AFFINITY to get good performance with HT is to 
be expected.  Adjacent threads are likely to touch some of the same 
cache lines, so they should run on sibling logical processors, which 
share the same cache.
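
As a rough sketch of what such a setting could look like, the helper 
below builds a GOMP_CPU_AFFINITY list which puts consecutive threads on 
sibling logical processors.  The function name and the numbering rule 
(logical CPU i and i + ncores being siblings) are assumptions made for 
illustration; the actual numbering depends on the BIOS and kernel, so 
check /proc/cpuinfo or 
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list first.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build a GOMP_CPU_AFFINITY list that places consecutive OpenMP threads
 * on sibling logical processors of the same core.
 *
 * ASSUMPTION (hypothetical numbering): logical CPU i and logical CPU
 * i + ncores are hyper-thread siblings.  Verify this on the real machine.
 *
 * Returns 0 on success, -1 if buf is too small for the whole list.
 */
int build_sibling_affinity(int ncores, char *buf, size_t len)
{
    size_t used = 0;

    assert(ncores > 0 && buf != NULL && len > 0);
    buf[0] = '\0';
    for (int core = 0; core < ncores; core++) {
        /* Append "<core> <sibling>", space-separated from earlier entries. */
        int n = snprintf(buf + used, len - used, "%s%d %d",
                         core ? " " : "", core, core + ncores);
        if (n < 0 || (size_t)n >= len - used)
            return -1;          /* output would have been truncated */
        used += (size_t)n;
    }
    return 0;
}
```

Under that assumption a 4-core CPU gives "0 4 1 5 2 6 3 7".  The string 
has to be in the environment before the program starts, since libgomp 
reads GOMP_CPU_AFFINITY when it initializes.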
The only OpenMP library I have seen which makes affinity setting a 
default is the one from PGI, and that tactic is inflexible.  Intel 
compilers have a seldom-used option to set such a default.
If you have the libiomp for Intel OpenMP, running with that library in 
place of libgomp might be an interesting comparison.
Situations which might make HT run slowly even with appropriate 
affinity include cache and TLB capacity shortage, all hot code sections 
depending on a shared resource such as the FPU, or lack of cache 
locality (inner loops not stride 1).
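
To illustrate the stride 1 point only, here is a minimal sketch with two 
loop nests which compute the same sum over a row-major C array.  The 
array sizes and function names are arbitrary choices for this example, 
not taken from the code under discussion.

```c
#include <stddef.h>

enum { ROWS = 64, COLS = 64 };   /* arbitrary illustrative sizes */

/* Inner loop runs along a row: consecutive addresses, i.e. stride 1. */
double sum_stride1(double a[ROWS][COLS])
{
    double s = 0.0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            s += a[i][j];
    return s;
}

/* Inner loop runs down a column: each step jumps COLS doubles ahead,
 * so successive accesses land on different cache lines. */
double sum_strided(double a[ROWS][COLS])
{
    double s = 0.0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            s += a[i][j];
    return s;
}
```

Both visit every element exactly once; only the order of memory accesses 
differs.  With HT, two threads per core compete for the same cache and 
TLB, so the strided form is likely to suffer more as the array grows.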
Intel MKL tries to detect HT and, by default, uses at most one thread 
per core.

Tim Prince

