This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

[PATCH, RFC, gfortran] Multi-threaded random number generator


Hi,

the GFortran random number generator (RANDOM_NUMBER and RANDOM_SEED
intrinsics) has a number of issues that the attached preliminary patch
tries to address.

- 64-bit integers are available on all targets GFortran supports, and
  the vast majority of users nowadays use targets with native 64-bit
  capability.  Thus by using a PRNG that uses and generates 64-bit
  values we can get a bit of speedup compared to the current 32-bit
  generator.

- The current implementation is a single stream generator protected by
  a mutex. This means that if multiple threads are calling
  RANDOM_NUMBER

  1) It's impossible to provide repeatable streams by using the same
  seed, since thread scheduling is not deterministic.

  2) Performance is not only bound by the performance of a single
  thread, but can easily be a lot slower due to lock contention.

I have been thinking about how one could make use of a multi-threaded
PRNG given the limitations of the RANDOM_NUMBER & RANDOM_SEED API, and
I have come up with something that I think is usable.

The attached patch replaces the current KISS generator by the late
George Marsaglia with the xorshift1024* generator, an enhanced version
of Marsaglias xorshift generator. It's a quite nice generator, e.g. it

- passes the TestU1 suite.

- has a quite long period, 2**1024 - 1. So even if one generates
  multiple seeds randomly, it's exceedingly unlikely that the streams
  will alias in any realistic timeframe.

- furthermore, allows a "jump" function to quickly jump forwards
  2**512 bits in the stream, which is nice for creating multiple
  independent streams. Thus, with a total period of 2**1024, it allows
  up to 2**512 substreams each providing 2**512 bits before any
  aliasing.

- Code is relatively simple, similar to KISS.

Now, in order to be able to use separate streams for each thread given
the Fortran standard API, the patch is implemented as follows. There
is a master_state, which is initially statically initialized. The
per-thread state is stored in a thread-local variable, and is
initially uninitialized.

- When RANDOM_NUMBER is called, a check is made to see whether the
  per-thread generator state is initialized. If not, the state is
  copied from the master_state, and the jump() function is called N
  times, where N equals how many streams have previously been
  generated from the master_state (the njumps variable). When the
  per-thread generator state is initialized, a random number is
  generated by reading and updating the per-thread generator state.

- When RANDOM_SEED(PUT=) is called, the master_state is updated with
  the new seed, njumps is reset to zero, and the current thread
  generator state is copied from master_state. Thus any new streams
  that are subsequently created use the new seed, whereas other
  existing streams will continue using their existing states.

- When RANDOM_SEED is called without arguments, the master_seed and
  current thread seed is set to random data read from the OS
  /dev/urandom device. Otherwise like above.

While the above description might appear a bit convoluted, I think the
end results for users is somewhat intuitive and it supports the common
use cases

- For a single-threaded program, or a multi-threaded program that
  takes care to call RANDOM_NUMER/RANDOM_SEED from one thread only,
  the end result is just like with the current implementation.

- For a multi-threaded program that doesn't call RANDOM_SEED, each
  thread gets its own deterministic random stream with up to 2**512
  bits before any aliasing occurs.

- In order to get random initial seeds on each invocation of the program, just
  call random_seed without arguments, either before calling
  random_number from any threads, or then in each thread.

Note that the patch is preliminary, it works so one can evaluate
performance but there are bugs (e.g. njumps is never incremented, so
all threads generate the same stream), documentation needs to be
updated, RANDOM_SEED is not tested, the fronted check for the seed
size needs to be updated, etc.

See the attached file rbench.f90 for a test program one can use to
evaluate performance. README.md contains results for said program. In
short, serial performance is roughly on par with the current
implementation, with threads the new one is a lot faster, up to two
orders of magnitude with 4 threads and harvesting only a single value
per RANDOM_NUMBER call.

Comments welcome.

-- 
Janne Blomqvist
Benchmarks
==========

Best of 5:

Before
------

â?¯ ./rbench-serial
 Generate default real random variables
 Generating     10000000  default reals individually took             101691382  ticks.
 Generating     10000000  default reals as an array took               80101952  ticks. => ind/arr =    1.2695243931134164     
 Generate double real random variables
 Generating     10000000  double reals individually took              229964811  ticks.
 Generating     10000000  double reals as an array took               194742367  ticks. => ind/arr =    1.1808668783408593     


â?¯ OMP_NUM_THREADS=1 ./rbench-openmp
 Using up to            1  threads.
 Generate default real random variables
 Generating     10000000  default reals individually took             267737830  ticks.
 Generating     10000000  default reals as an array took               80352250  ticks. => ind/arr =    3.3320514360207709     
 Generate double real random variables
 Generating     10000000  double reals individually took              413027737  ticks.
 Generating     10000000  double reals as an array took               193848274  ticks. => ind/arr =    2.1306753394151965     


â?¯ OMP_NUM_THREADS=2 ./rbench-openmp
 Using up to            2  threads.
 Generate default real random variables
 Generating     10000000  default reals individually took            1402217107  ticks.
 Generating     10000000  default reals as an array took               89168399  ticks. => ind/arr =    15.725493815359409     
 Generate double real random variables
 Generating     10000000  double reals individually took             1747341779  ticks.
 Generating     10000000  double reals as an array took               204212526  ticks. => ind/arr =    8.5564867798560016     


â?¯ OMP_NUM_THREADS=4 ./rbench-openmp
 Using up to            4  threads.
 Generate default real random variables
 Generating     10000000  default reals individually took            4049200541  ticks.
 Generating     10000000  default reals as an array took               99056032  ticks. => ind/arr =    40.877879511668709     
 Generate double real random variables
 Generating     10000000  double reals individually took             4967873698  ticks.
 Generating     10000000  double reals as an array took               208499589  ticks. => ind/arr =    23.826779332404342     
 

After
-----

â?¯ ./rbench-serial
 Generate default real random variables
 Generating     10000000  default reals individually took             143561900  ticks.
 Generating     10000000  default reals as an array took               77473457  ticks. => ind/arr =    1.8530462633157057     
 Generate double real random variables
 Generating     10000000  double reals individually took              210418243  ticks.
 Generating     10000000  double reals as an array took               168215964  ticks. => ind/arr =    1.2508815334554098     


â?¯ OMP_NUM_THREADS=1 ./rbench-openmp
 Using up to            1  threads.
 Generate default real random variables
 Generating     10000000  default reals individually took             147006318  ticks.
 Generating     10000000  default reals as an array took               73772125  ticks. => ind/arr =    1.9927081943213647     
 Generate double real random variables
 Generating     10000000  double reals individually took              211567311  ticks.
 Generating     10000000  double reals as an array took               140390978  ticks. => ind/arr =    1.5069865173244965     
 
â?¯ OMP_NUM_THREADS=2 ./rbench-openmp
 Using up to            2  threads.
 Generate default real random variables
 Generating     10000000  default reals individually took              77494186  ticks.
 Generating     10000000  default reals as an array took               37643005  ticks. => ind/arr =    2.0586609915972436     
 Generate double real random variables
 Generating     10000000  double reals individually took              106591371  ticks.
 Generating     10000000  double reals as an array took                70932583  ticks. => ind/arr =    1.5027137951539140     

â?¯ OMP_NUM_THREADS=4 ./rbench-openmp
 Using up to            4  threads.
 Generate default real random variables
 Generating     10000000  default reals individually took              55788054  ticks.
 Generating     10000000  default reals as an array took               30241839  ticks. => ind/arr =    1.8447308710293708     
 Generate double real random variables
 Generating     10000000  double reals individually took               72714187  ticks.
 Generating     10000000  double reals as an array took                51929226  ticks. => ind/arr =    1.4002555516617945     


Conclusions
-----------

Serial ~ more or less the same

Threads => new up to 2 orders of magnitude faster

Attachment: rand_threads.diff.gz
Description: GNU Zip compressed data

Attachment: rbench.F90
Description: Binary data


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]