This is the mail archive of the
mailing list for the GCC project.
Re: [PATCH] profile feedback: -fprofile-use= and -fprofile-correction, correctness fixes and option semantic changes.
> > It was me. I need to submit it again for trunk. IIRC it did slow down
> > the profiled runs somewhat, but not extremely.
> This depends on the application and the system you run it on.
> i.e. if you have many threads (16+) and a correspondingly high number of cores,
> the slowdown becomes significant. I don't remember the exact numbers,
> but IIRC, the slowdown was in the order of 10 on such a system.
> This is no surprise,
> since counters for hot regions in a function
> are clustered together, occupying the same cache line,
> and any of the counter updates to the same cache line
> will be essentially serialized across the entire system,
> making the slowdown proportional to the thread/core count.
> On systems with a single chip and a single L2,
> the slowdown is naturally not that significant,
> since the coherence traffic will be all on chip and the cache line ping-ponging
> will be bound to the single chip.
> But as you add more chips, the slowdown rapidly becomes significant
> as the average cost of grabbing the cache line from the other core
> increases significantly, and also the effect of serialization on the
> overall throughput
> becomes proportionally bigger.
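The clustering effect described in the quoted text can be sketched with a little layout arithmetic. This is only an illustration under stated assumptions: 64-byte cache lines, 8-byte counters, and an array that starts on a cache-line boundary. The type and function names are hypothetical and are not taken from GCC or libgcov.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines (common, but not universal). */
#define CACHE_LINE_SIZE 64

/* Packed layout: 8-byte counters placed back to back, so the counters
   of a hot region end up next to each other in memory. */
typedef int64_t packed_counter;

/* Padded layout: each counter is given a whole cache line to itself. */
typedef struct {
    int64_t value;
    char pad[CACHE_LINE_SIZE - sizeof(int64_t)];
} padded_counter;

static_assert(sizeof(padded_counter) == CACHE_LINE_SIZE,
              "padded counter should fill exactly one cache line");

/* Cache line index of element i in an array whose elements are
   elem_size bytes each and whose start is cache-line aligned. */
static inline size_t cache_line_of(size_t i, size_t elem_size)
{
    return (i * elem_size) / CACHE_LINE_SIZE;
}

/* Number of distinct cache lines covered by the first n elements. */
static inline size_t lines_touched(size_t n, size_t elem_size)
{
    return n == 0 ? 0 : cache_line_of(n - 1, elem_size) + 1;
}
```

Under these assumptions, eight packed counters share one cache line, so every thread updating any of them contends for the same line. The padded layout spreads them over eight lines at the cost of eight times the memory, which is one reason padding is not an obvious fix for profile counters.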
You are right that the costs of locking are only going to increase,
making the cost of the locking variant more noticeable. I am leaning towards
simply having both solutions in the compiler, perhaps with the locking
variant enabled by default.
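To make the trade-off concrete, here is a minimal sketch of the two update styles, using C11 atomics as a stand-in for the locking variant. That mapping is an assumption made for illustration; the counters and function names are hypothetical, and this is not the code GCC emits for instrumentation.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical counters, for illustration only. */
static int64_t plain_counter;
static _Atomic int64_t atomic_counter;

/* Plain read-modify-write: cheap, but two threads running this
   concurrently can overwrite each other's result and lose increments
   (formally a data race in C11). */
static inline int64_t bump_plain_n(int n)
{
    for (int i = 0; i < n; i++)
        plain_counter = plain_counter + 1;
    return plain_counter;
}

/* Atomic read-modify-write: no increment is lost, but each update
   needs exclusive ownership of the cache line, which is where the
   cross-chip cost discussed above comes from. */
static inline int64_t bump_atomic_n(int n)
{
    for (int i = 0; i < n; i++)
        atomic_fetch_add_explicit(&atomic_counter, 1, memory_order_relaxed);
    return atomic_load_explicit(&atomic_counter, memory_order_relaxed);
}
```

Single-threaded, both functions count identically. The difference only shows up under concurrent updates, where the plain version may produce an inconsistent profile and the atomic version pays for consistency in run time.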
From a maintainability POV it is very good to have a safe way for the compiler
to realize that the profile is messed up. This still happens quite
often and it is important that the problems are noticed and reported.
I also think that issuing diagnostics instead of reading a nonsensical profile
will keep users from making simple mistakes that misguide GCC
into wrong optimizations and disappoint the user as a result. Also, I believe
ICC and other compilers use this solution.
However I am happy to have the "error tolerant" variant as an
alternative when profiling code performance of a threaded program is