This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Hi!

I've committed the attached patch to improve libgomp performance. It adds support for optional busy waiting rather than going to sleep immediately in case of contention (though this is incomplete so far; only GOMP_BLOCKTIME=0 and GOMP_BLOCKTIME=infinity make a difference). I still need to write code to estimate the speed of the

    static inline void
    do_wait (int *addr, int val)
    {
      unsigned long long i, count = gomp_spin_count_var;
      for (i = 0; i < count; i++)
        if (__builtin_expect (*addr != val, 0))
          return;
        else
          cpu_relax ();
    }

loop, to be able to translate milliseconds into gomp_spin_count_var; probably using a short benchmark on libgomp startup, but I don't want to make that too expensive. On some targets which don't have rep;nop or hint @pause, perhaps just reading /proc/cpuinfo to find the MHz and using a hardcoded number of ticks would work too - any suggestions appreciated.

The patch also rewrites the libgomp barriers, so that they use just one atomic operation in each thread, rather than at least 3 or even more as they needed so far (which was really bad in case of heavy contention).

On the attached microbenchmark this gives nice speedups on a quad-core CPU (similar speedups can be seen on real-world benchmarks):

    GOMP_BLOCKTIME=infinity ./micro-gcc-after-patch # busy waiting, new barriers
    barrier bench 0.990822 seconds
    parallel bench 2.21924 seconds
    static bench 0.114701 seconds
    dynamic bench 0.539615 seconds

    ./micro-gcc-after-patch # sleeping, new barriers
    barrier bench 15.7526 seconds
    parallel bench 7.05841 seconds
    static bench 0.357082 seconds
    dynamic bench 0.53934 seconds

    ./micro-gcc-before-patch # sleeping, old barriers
    barrier bench 47.8483 seconds
    parallel bench 8.22674 seconds
    static bench 0.412263 seconds
    dynamic bench 0.536502 seconds

While the barrier bench numbers look very good, I'm still not satisfied with the time it takes to execute #pragma omp parallel when no pthread_create is needed, as the threads are docked - I hope we can improve that some more.
This is the most important factor in the parallel bench and static bench. The dynamic bench is about the speed of gomp_loop_dynamic_next and gomp_iter_dynamic_next; I wonder whether the amount of work needed between reading ws->next and the __sync_val_compare_and_swap isn't too big, so that too many threads restart the loop in case of high contention. Guided is even more expensive.

	Jakub
Attachment:
P7
Description: Text document
Attachment:
micro.c
Description: Text document