This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|
Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |
Other format: | [Raw text] |
Hi, the GFortran random number generator (RANDOM_NUMBER and RANDOM_SEED intrinsics) has a number of issues that the attached preliminary patch tries to address. - 64-bit integers are available on all targets GFortran supports, and the vast majority of users nowadays use targets with native 64-bit capability. Thus by using a PRNG that uses and generates 64-bit values we can get a bit of speedup compared to the current 32-bit generator. - The current implementation is a single stream generator protected by a mutex. This means that if multiple threads are calling RANDOM_NUMBER 1) It's impossible to provide repeatable streams by using the same seed, since thread scheduling is not deterministic. 2) Performance is not only bound by the performance of a single thread, but can easily be a lot slower due to lock contention. I have been thinking about how one could make use of a multi-threaded PRNG given the limitations of the RANDOM_NUMBER & RANDOM_SEED API, and I have come up with something that I think is usable. The attached patch replaces the current KISS generator by the late George Marsaglia with the xorshift1024* generator, an enhanced version of Marsaglias xorshift generator. It's a quite nice generator, e.g. it - passes the TestU1 suite. - has a quite long period, 2**1024 - 1. So even if one generates multiple seeds randomly, it's exceedingly unlikely that the streams will alias in any realistic timeframe. - furthermore, allows a "jump" function to quickly jump forwards 2**512 bits in the stream, which is nice for creating multiple independent streams. Thus, with a total period of 2**1024, it allows up to 2**512 substreams each providing 2**512 bits before any aliasing. - Code is relatively simple, similar to KISS. Now, in order to be able to use separate streams for each thread given the Fortran standard API, the patch is implemented as follows. There is a master_state, which is initially statically initialized. The per-thread state is stored in a thread-local variable, and is initially uninitialized. - When RANDOM_NUMBER is called, a check is made to see whether the per-thread generator state is initialized. If not, the state is copied from the master_state, and the jump() function is called N times, where N equals how many streams have previously been generated from the master_state (the njumps variable). When the per-thread generator state is initialized, a random number is generated by reading and updating the per-thread generator state. - When RANDOM_SEED(PUT=) is called, the master_state is updated with the new seed, njumps is reset to zero, and the current thread generator state is copied from master_state. Thus any new streams that are subsequently created use the new seed, whereas other existing streams will continue using their existing states. - When RANDOM_SEED is called without arguments, the master_seed and current thread seed is set to random data read from the OS /dev/urandom device. Otherwise like above. While the above description might appear a bit convoluted, I think the end results for users is somewhat intuitive and it supports the common use cases - For a single-threaded program, or a multi-threaded program that takes care to call RANDOM_NUMER/RANDOM_SEED from one thread only, the end result is just like with the current implementation. - For a multi-threaded program that doesn't call RANDOM_SEED, each thread gets its own deterministic random stream with up to 2**512 bits before any aliasing occurs. - In order to get random initial seeds on each invocation of the program, just call random_seed without arguments, either before calling random_number from any threads, or then in each thread. Note that the patch is preliminary, it works so one can evaluate performance but there are bugs (e.g. njumps is never incremented, so all threads generate the same stream), documentation needs to be updated, RANDOM_SEED is not tested, the fronted check for the seed size needs to be updated, etc. See the attached file rbench.f90 for a test program one can use to evaluate performance. README.md contains results for said program. In short, serial performance is roughly on par with the current implementation, with threads the new one is a lot faster, up to two orders of magnitude with 4 threads and harvesting only a single value per RANDOM_NUMBER call. Comments welcome. -- Janne Blomqvist
Benchmarks ========== Best of 5: Before ------ â?¯ ./rbench-serial Generate default real random variables Generating 10000000 default reals individually took 101691382 ticks. Generating 10000000 default reals as an array took 80101952 ticks. => ind/arr = 1.2695243931134164 Generate double real random variables Generating 10000000 double reals individually took 229964811 ticks. Generating 10000000 double reals as an array took 194742367 ticks. => ind/arr = 1.1808668783408593 â?¯ OMP_NUM_THREADS=1 ./rbench-openmp Using up to 1 threads. Generate default real random variables Generating 10000000 default reals individually took 267737830 ticks. Generating 10000000 default reals as an array took 80352250 ticks. => ind/arr = 3.3320514360207709 Generate double real random variables Generating 10000000 double reals individually took 413027737 ticks. Generating 10000000 double reals as an array took 193848274 ticks. => ind/arr = 2.1306753394151965 â?¯ OMP_NUM_THREADS=2 ./rbench-openmp Using up to 2 threads. Generate default real random variables Generating 10000000 default reals individually took 1402217107 ticks. Generating 10000000 default reals as an array took 89168399 ticks. => ind/arr = 15.725493815359409 Generate double real random variables Generating 10000000 double reals individually took 1747341779 ticks. Generating 10000000 double reals as an array took 204212526 ticks. => ind/arr = 8.5564867798560016 â?¯ OMP_NUM_THREADS=4 ./rbench-openmp Using up to 4 threads. Generate default real random variables Generating 10000000 default reals individually took 4049200541 ticks. Generating 10000000 default reals as an array took 99056032 ticks. => ind/arr = 40.877879511668709 Generate double real random variables Generating 10000000 double reals individually took 4967873698 ticks. Generating 10000000 double reals as an array took 208499589 ticks. => ind/arr = 23.826779332404342 After ----- â?¯ ./rbench-serial Generate default real random variables Generating 10000000 default reals individually took 143561900 ticks. Generating 10000000 default reals as an array took 77473457 ticks. => ind/arr = 1.8530462633157057 Generate double real random variables Generating 10000000 double reals individually took 210418243 ticks. Generating 10000000 double reals as an array took 168215964 ticks. => ind/arr = 1.2508815334554098 â?¯ OMP_NUM_THREADS=1 ./rbench-openmp Using up to 1 threads. Generate default real random variables Generating 10000000 default reals individually took 147006318 ticks. Generating 10000000 default reals as an array took 73772125 ticks. => ind/arr = 1.9927081943213647 Generate double real random variables Generating 10000000 double reals individually took 211567311 ticks. Generating 10000000 double reals as an array took 140390978 ticks. => ind/arr = 1.5069865173244965 â?¯ OMP_NUM_THREADS=2 ./rbench-openmp Using up to 2 threads. Generate default real random variables Generating 10000000 default reals individually took 77494186 ticks. Generating 10000000 default reals as an array took 37643005 ticks. => ind/arr = 2.0586609915972436 Generate double real random variables Generating 10000000 double reals individually took 106591371 ticks. Generating 10000000 double reals as an array took 70932583 ticks. => ind/arr = 1.5027137951539140 â?¯ OMP_NUM_THREADS=4 ./rbench-openmp Using up to 4 threads. Generate default real random variables Generating 10000000 default reals individually took 55788054 ticks. Generating 10000000 default reals as an array took 30241839 ticks. => ind/arr = 1.8447308710293708 Generate double real random variables Generating 10000000 double reals individually took 72714187 ticks. Generating 10000000 double reals as an array took 51929226 ticks. => ind/arr = 1.4002555516617945 Conclusions ----------- Serial ~ more or less the same Threads => new up to 2 orders of magnitude faster
Attachment:
rand_threads.diff.gz
Description: GNU Zip compressed data
Attachment:
rbench.F90
Description: Binary data
Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|
Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |