Bug 23283 - Sun's JIT faster than gcc for Random.nextDouble
Summary: Sun's JIT faster than gcc for Random.nextDouble
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: java (show other bugs)
Version: 3.4.4
: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-08-08 10:04 UTC by W Netzberg
Modified: 2005-08-23 14:51 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2005-08-23 14:15:25


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description W Netzberg 2005-08-08 10:04:55 UTC
I have a very compact code example (10 lines) for which running java interpreter
on a class file takes 4.5s and executing optimized binary created from the class
file with gcj takes 5.5s. Something is funny here. Can anyone explain what's
happening?

% uname -a
Linux localdomain 2.6.12-1.1372_FC3 #1 Fri Jul 15 00:59:10 EDT 2005 i686 i686
i386 GNU/Linux

>>>>>>>>>>> The java code (X.java):
import java.lang.*;
import java.util.*;
import java.io.*;
public class X {
    public static void main(String[] args) {
	Random R = new Random();
	for (int i=0; i < 10000000; i++)
	    R.nextDouble();
    }
}
>>>>>>>>>>The Makefile
all:
	@javac X.java
	@echo "*** Interpreted:"
	@time java -classpath . X
	@gcj -O3 --main=X -o x X.java
	@echo "*** Compiled:"
	@time ./x
>>>>>>>>>>Here is the output:
*** Interpreted:
4.30user 0.02system 0:04.51elapsed 95%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1major+2734minor)pagefaults 0swaps
*** Compiled:
5.55user 0.01system 0:05.58elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2069minor)pagefaults 0swaps
Comment 1 Andrew Pinski 2005-08-08 13:32:25 UTC
Actually for me Sun's java and GCJ take about the same at the default settings (well for GCJ compiled at 
-O3).

Now if I change Sun's java to use the server tuned JIT, Sun's java is 2 seconds while GCJ is still 4.5 or so.


timecopper:~>time ./a.out
4999358.6586720785
4.900u 0.070s 0:06.12 81.2%     0+0k 0+0io 148pf+0w
copper:~>javac X.java
java Xcopper:~>java X
5001132.020866861
copper:~>time java X
5000974.98115474
4.791u 0.037s 0:04.88 98.7%     0+0k 0+0io 0pf+0w
copper:~>time java -server X
5000248.791328681
3.290u 0.061s 0:03.91 85.6%     0+0k 0+0io 54pf+0w

This with keeping and printing out the return value for nextDouble.

This might be startup/shut down time taking into account.
Comment 2 Tom Tromey 2005-08-23 14:15:24 UTC
I see this too.

Compiling with -fno-bounds-check helps, but not enough.

One possibility is simply that our implementation of nextDouble is
inefficient.  I doubt this function was coded for maximum performance.
Comment 3 Mark Wielaard 2005-08-23 14:31:43 UTC
Subject: Re:  Sun's JIT faster than gcc for
	Random.nextDouble

It looks like the problem is that we don't remove the synchronization
for nextDouble() even though the test case is single-threaded.

qprof: /tmp/x: 299 samples, 299 counts
X::main(JArray<java::lang::String*>*):X.java:8       5      (  2%)
libc.so.6(memchr)                                    1      (  0%)
libgcj.so.6                                          2      (  1%)
libgcj.so.6(_Jv_MonitorEnter)                        110    ( 37%)
libgcj.so.6(_Jv_MonitorExit)                         108    ( 36%)
libgcj.so.6(_ZN4java4util6Random4nextEi)             27     (  9%)
libgcj.so.6(_ZN4java4util6Random10nextDoubleEv)      46     ( 15%)


Comment 4 Paolo Bonzini 2005-08-23 14:48:51 UTC
Yes, I think that most invocations of next should be inlined, and wrapped in a
single synchronized block.

Apart from that, I am pretty sure that here

    seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
    return (int) (seed >>> (48 - bits));

it makes no difference if you do the AND or not.

So nextDouble could be implemented as:

  long first;
  long second;
  synchronized (this) {
    seed = (seed * 0x5DEECE66DL + 0xBL);
    first = (seed & 0x0000FFFFFFC00000L) << 5;
    seed = (seed * 0x5DEECE66DL + 0xBL);
    second = (seed >> 21) & 0x7FFFFFF;
  }
  return (first | second) / (double) (1L << 53);

Similarly, for nextFloat

  float f;
  int bits;
  synchronized (this) {
    seed = (seed * 0x5DEECE66DL + 0xBL);
    bits = (int) (seed >> 24);
  }
  return bits / 16777216.0f;

nextInt

  int bits;
  synchronized (this) {
    seed = (seed * 0x5DEECE66DL + 0xBL);
    bits = (int) (seed >> 16);
  }

nextLong

  long first, second;
  synchronized (this) {
    seed = (seed * 0x5DEECE66DL + 0xBL);
    first = (seed << 16) & 0xFFFFFFFF00000000L;
    seed = (seed * 0x5DEECE66DL + 0xBL);
    second = (seed >> 16) & 0xFFFFFFFFL;
  }
  return first | second;

nextInt (n)

  int bits;
  synchronized (seed)
    {
      seed = (seed * 0x5DEECE66DL + 0xBL);
      bits = (int) (seed >> 17);
    }
  if ((n & -n) == n)  // i.e., n is a power of 2
    return (int)((n * (long) bits) >> 31);

  int bits, val;
  val = bits % n;
  if (bits - val + n - 1 < 0)
    synchronized (seed)
      {
        do
          {
            seed = (seed * 0x5DEECE66DL + 0xBL);
            bits = (int) (seed >> 17);
          }
        while (bits - val + n - 1 < 0);
      }
  return val;

nextBoolean

  boolean bit;
  synchronized (this) {
    seed = (seed * 0x5DEECE66DL + 0xBL);
    bit = (seed & 0x800000000000) != 0;
  }
  return bit;

And I left out nextBytes, I know.

Also these are untested, which is why I'm not preparing a full patch.

Paolo
Comment 5 Paolo Bonzini 2005-08-23 14:51:43 UTC
> It looks like the problem is that we don't remove the synchronization
> for nextDouble() even though the test case is single-threaded.

If we can remove even only half of the synchronization overhead, by
synchronizing just once per nextDouble() call, it's a win.