This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Substantial sort algorithm slow down from PIII to P4

From: DAVID MAY <dmay at tvi dot edu>
To: gcc at gnu dot org
Cc: DAVID MAY <dmay at tvi dot edu>, JONAH HARRIS <jharris at tvi dot edu>
Date: Mon, 26 Jan 2004 13:30:53 -0700 (MST)
Subject: Substantial sort algorithm slow down from PIII to P4

I have a performance issue regarding the P4 processor and an algorithm 
I created for sorting data.  The code used for this discussion can be 
found at

http://w3.tvi.edu/~dmay/sort-20040126.tar.bz2

This sort algorithm is one that I use in many of the projects that I 
work on as a UNIX/Linux system administrator.  It is foundational to a 
database engine I use for many sysadmin tasks.  I originally called it 
a shell sort 13 years ago, but I think that it is not really a shell 
sort after dozens of iterations.  Anyway, the algorithm uses a 
tree-like structure to index into the data points of a doubly linked 
list.  Overall, the performance and stability of the algorithm is 
respectable.  In all architectures except P4 that I have access to 
test, it gives better performance than a typical system provided qsort, 
although not as good performance as a non-recursive, highly optimized 
qsort.  One aspect of my algorithm is that it works better if the data 
is almost sorted or almost reverse sorted.  If the data points are 
mostly sorted, my algorithm is generally faster than quick sort.  On P4 
systems, though, the same algorithm is 39% slower than on a PIII system!

The following are some statistics that I gathered for this discussion:

System A is a Dell Optiplex GX240 (P4 1.8MHz, 512 Megs RAM, IDE HD) 
running Mandrake Linux 9.2 (GCC 3.3.1).  System B is an IBM Thinkpad 
T22 (PIII, 1.0GHz, 256 Megs, IDE HD) running Mandrake Linux 9.2.  The 
following is information derived from the running systems:

System A
=========
dfmay:dfmay:~$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 1
model name      : Intel(R) Pentium(R) 4 CPU 1.80GHz
stepping        : 2
cpu MHz         : 1794.203
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips        : 3578.26

System B
=========
dmay:dfmay:~$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Pentium III (Coppermine)
stepping        : 10
cpu MHz         : 995.680
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca 
cmov pat pse36 mmx fxsr sse
bogomips        : 1985.74

I ran a program that sorts 250,000 data points (variable length, lower 
case alpha strings from 5-24 characters) using 3 different methods:  1) 
my sorting algorithm, 2) a non-recursive qsort algorithm and 3) the 
system qsort that comes with glibc.  The following are the average 
results for 100 runs of that program on System A and System B, 
respectively:

           My sort      Non-recursive qsort     System qsort
Totals:	1.05887987	0.35854819		0.76479589

	   My sort	Non-recursive qsort	System qsort
Totals:	0.75686249	0.6505027		1.13078088

The totals are in seconds with a smaller number indicating better 
performance.  (In the source download, the C file and shell script that 
I used to generate the data for these numbers is available as 
outstats.{c,sh}.)

I'd like to point out the following:

- From PIII->P4, the non-recursive qsort improves performance by 44.88%.
- From PIII->P4, the system qsort improves performance by 32.37%.

I would expect this to be the case, although I would really expect a 
wider spread due to the jump in architecture and the increased clock 
speed.  That aside, though, at least the jumps are in the direction one 
would expect.  However, the following I did not expect:
 
- From PIII->P4, my sort alogrithm DEGRADES in performance by 39.90%.

The performance averages you see for the PIII system are consistent on 
Sparc SunFire systems and PIII systems running cygwin, as well.  The 
non-recursive quick sort is faster than my sort which is faster than 
the system qsort.  The only system I am able to test on for which this 
is not the case is P4 systems.  I don't have AMD or PowerPC systems to 
test this code on.  I have used dual processor P4 systems with 
hyperthreading enabled and get comparable (not good) results to this P4 
system.

I have searched this list for P4 performance issues and see nothing 
that would address this issue, although I may have overlooked 
something.  Also, I have googled the net and found nothing of 
relevance.  My apologies if I missed something obvious.

Any ideas you could provide to help me alter my algorithm to run better 
on P4 systems would be greatly appreciated.  Please reply directly to 
this email as I don't lurk the list.

Thanks in advance for any help.


--------------------------------------
David May
Senior UNIX System Administrator
Albuquerque TVI (505) 224-3015
Sig:http://w3.tvi.edu/~dmay/gpgkey.txt

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]