Bug 21617 - [4.6 Regression] CRC64 algorithm optimization problem on Intel 32-bit
Summary: [4.6 Regression] CRC64 algorithm optimization problem on Intel 32-bit
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization (show other bugs)
Version: unknown
: P2 normal
Target Milestone: 4.7.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization, ra
Depends on:
Blocks:
 
Reported: 2005-05-17 10:26 UTC by Mark Cave-Ayland
Modified: 2013-04-12 16:17 UTC (History)
5 users (show)

See Also:
Host:
Target: i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2011-05-22 14:17:47


Attachments
The crctest64 .c file mentioned, along with the .i and .s files (10.61 KB, application/octet-stream)
2005-05-17 10:31 UTC, Mark Cave-Ayland
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Mark Cave-Ayland 2005-05-17 10:26:06 UTC
Using a 64-bit CRC algorithm implementation implemented using "long long int",
it has been found that gcc fails to optimise the algorithm correctly when the
-O2 parameter is used. In fact, it is found that the processing time for CRC
calculation more than doubles compared to the processing time for the algorithm
when the -O1 parameter is used.

Tom Lane from Red Hat has done some testing and discovered that this problem
only occurs on 32-bit Intel processors (see
http://archives.postgresql.org/pgsql-hackers/2005-05/msg01051.php for more
information and the results of tests on other architectures).

I've marked this issue under version "unknown", however the URL above shows that
the problem exists in versions 3.2.3, 3.3.2 and 4.0.0.

The test case is easy: compile and run the attached crctest64.c on an Intel
32-bit processor such as a P4/Xeon with the following options:

gcc -O1 crctest64.c -o crctest64-o1
gcc -O2 crctest64.c -o crctest64-o2

Comparison of the timings will show that the second version is at least 100%
slower than the first.


Many thanks,

Mark.
Comment 1 Mark Cave-Ayland 2005-05-17 10:31:57 UTC
Created attachment 8910 [details]
The crctest64 .c file mentioned, along with the .i and .s files
Comment 2 Steven Bosscher 2011-05-22 14:17:47 UTC
The test case doesn't compile for me with GCC versions 3.4.6, 4.2.4, 4.3.3, 4.4.2, and 4.5.0, with the same error message:

crctest64.c: In function 'main':
crctest64.c:225: error: lvalue required as left operand of assignment

I have changed that line from 

        (unsigned char *) data = malloc(8192);

to

        data = (unsigned char *) malloc(8192);

to get timings on 32 bits:

$ for VER in 3.4.6 4.2.4 4.3.3 4.4.2 4.5.0 ; do
>  for OPT in O0 O1 O2 O3 ; do
>    echo GCC version ${VER} optimization level -${OPT}
>    gcc-${VER} -m32 -${OPT} crctest64.c -o crctest64-${VER}-${OPT}
>    for i in 1 2 3 ; do ./crctest64-${VER}-${OPT} ; done
>  done
>done
GCC version 3.4.6 optimization level -O0
Result of CRC64 (10000 runs): 782104059a01660 in time 0.852648 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.848188 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.822457 s
GCC version 3.4.6 optimization level -O1
Result of CRC64 (10000 runs): 782104059a01660 in time 0.355243 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.346605 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.346314 s
GCC version 3.4.6 optimization level -O2
Result of CRC64 (10000 runs): 782104059a01660 in time 0.398958 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.390089 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.411387 s
GCC version 3.4.6 optimization level -O3
Result of CRC64 (10000 runs): 782104059a01660 in time 0.545008 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.547569 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.546446 s
GCC version 4.2.4 optimization level -O0
Result of CRC64 (10000 runs): 782104059a01660 in time 0.831932 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.821412 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.831557 s
GCC version 4.2.4 optimization level -O1
Result of CRC64 (10000 runs): 782104059a01660 in time 0.342785 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.342183 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.341924 s
GCC version 4.2.4 optimization level -O2
Result of CRC64 (10000 runs): 782104059a01660 in time 0.351228 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.351935 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.342161 s
GCC version 4.2.4 optimization level -O3
Result of CRC64 (10000 runs): 782104059a01660 in time 0.350711 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.351894 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.340234 s
GCC version 4.3.3 optimization level -O0
Result of CRC64 (10000 runs): 782104059a01660 in time 0.862630 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.851423 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.852658 s
GCC version 4.3.3 optimization level -O1
Result of CRC64 (10000 runs): 782104059a01660 in time 0.272298 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.273758 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.294192 s
GCC version 4.3.3 optimization level -O2
Result of CRC64 (10000 runs): 782104059a01660 in time 0.282641 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.283198 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.303778 s
GCC version 4.3.3 optimization level -O3
Result of CRC64 (10000 runs): 782104059a01660 in time 0.273614 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.273006 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.293920 s
GCC version 4.4.2 optimization level -O0
Result of CRC64 (10000 runs): 782104059a01660 in time 0.937833 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.958881 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.928573 s
GCC version 4.4.2 optimization level -O1
Result of CRC64 (10000 runs): 782104059a01660 in time 0.308643 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.308575 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.307712 s
GCC version 4.4.2 optimization level -O2
Result of CRC64 (10000 runs): 782104059a01660 in time 0.273664 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.284265 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.274092 s
GCC version 4.4.2 optimization level -O3
Result of CRC64 (10000 runs): 782104059a01660 in time 0.283466 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.273742 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.273861 s
GCC version 4.5.0 optimization level -O0
Result of CRC64 (10000 runs): 782104059a01660 in time 0.928573 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.928108 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.938291 s
GCC version 4.5.0 optimization level -O1
Result of CRC64 (10000 runs): 782104059a01660 in time 0.267302 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.259114 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.279355 s
GCC version 4.5.0 optimization level -O2
Result of CRC64 (10000 runs): 782104059a01660 in time 0.375379 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.385842 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.375052 s
GCC version 4.5.0 optimization level -O3
Result of CRC64 (10000 runs): 782104059a01660 in time 0.407680 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.376708 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.377681 s


Timings on 64 bits:
GCC version 3.4.6 optimization level -O0
Result of CRC64 (10000 runs): 782104059a01660 in time 0.871975 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.716389 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.775837 s
GCC version 3.4.6 optimization level -O1
Result of CRC64 (10000 runs): 782104059a01660 in time 0.294281 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.272445 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.285114 s
GCC version 3.4.6 optimization level -O2
Result of CRC64 (10000 runs): 782104059a01660 in time 0.274701 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.273121 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.272264 s
GCC version 3.4.6 optimization level -O3
Result of CRC64 (10000 runs): 782104059a01660 in time 0.283980 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.273684 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.282508 s
GCC version 4.2.4 optimization level -O0
Result of CRC64 (10000 runs): 782104059a01660 in time 0.763965 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.754343 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.763644 s
GCC version 4.2.4 optimization level -O1
Result of CRC64 (10000 runs): 782104059a01660 in time 0.258482 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.244632 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.238194 s
GCC version 4.2.4 optimization level -O2
Result of CRC64 (10000 runs): 782104059a01660 in time 0.240009 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.259900 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.239865 s
GCC version 4.2.4 optimization level -O3
Result of CRC64 (10000 runs): 782104059a01660 in time 0.240682 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.241178 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.249200 s
GCC version 4.3.3 optimization level -O0
Result of CRC64 (10000 runs): 782104059a01660 in time 0.752887 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.754593 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.765622 s
GCC version 4.3.3 optimization level -O1
Result of CRC64 (10000 runs): 782104059a01660 in time 0.272266 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.273892 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.272209 s
GCC version 4.3.3 optimization level -O2
Result of CRC64 (10000 runs): 782104059a01660 in time 0.283182 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.274435 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.283880 s
GCC version 4.3.3 optimization level -O3
Result of CRC64 (10000 runs): 782104059a01660 in time 0.272292 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.272694 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.291955 s
GCC version 4.4.2 optimization level -O0
Result of CRC64 (10000 runs): 782104059a01660 in time 0.752652 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.754807 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.764424 s
GCC version 4.4.2 optimization level -O1
Result of CRC64 (10000 runs): 782104059a01660 in time 0.281111 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.304457 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.273711 s
GCC version 4.4.2 optimization level -O2
Result of CRC64 (10000 runs): 782104059a01660 in time 0.283895 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.272624 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.273521 s
GCC version 4.4.2 optimization level -O3
Result of CRC64 (10000 runs): 782104059a01660 in time 0.272190 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.273906 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.283429 s
GCC version 4.5.0 optimization level -O0
Result of CRC64 (10000 runs): 782104059a01660 in time 0.771673 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.764492 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.763854 s
GCC version 4.5.0 optimization level -O1
Result of CRC64 (10000 runs): 782104059a01660 in time 0.273928 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.272239 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.282403 s
GCC version 4.5.0 optimization level -O2
Result of CRC64 (10000 runs): 782104059a01660 in time 0.274135 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.282790 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.315162 s
GCC version 4.5.0 optimization level -O3
Result of CRC64 (10000 runs): 782104059a01660 in time 0.273543 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.291125 s
Result of CRC64 (10000 runs): 782104059a01660 in time 0.273867 s


Machine:
Linux laptop 2.6.32-31-generic #61-Ubuntu SMP Fri Apr 8 18:25:51 UTC 2011 x86_64 GNU/Linux, "AMD Turion(tm) 64 X2 Mobile Technology TL-68", 800MHz

Observations:
- Problem is 32-bits only.
- I cannot reproduce the difference in results at O1 and O2 for older GCC versions, but GCC 4.5 does show the difference.
- Speed is worse for GCC 4.4 and GCC 4.5, i.e. regression.
Comment 3 Andrew Pinski 2011-12-08 03:34:57 UTC
-O1:
.L2:
	movl	%eax, %edx
	sarl	$31, %edx
	shrl	$24, %edx
	leal	(%eax,%edx), %ecx
	andl	$255, %ecx
	subl	%edx, %ecx
	movb	%cl, (%ebx,%eax)
	addl	$1, %eax
	cmpl	$8192, %eax
	jne	.L2

-O2:
.L4:
	movl	76(%esp), %ecx
	movl	%edi, %ebx
	shrl	$24, %ebx
	movzbl	(%ecx), %edx
	addl	$1, %ecx
	movl	%ecx, 76(%esp)
	movl	%esi, %ecx
	sall	$8, %ecx
	xorl	%ebx, %edx
	movl	%edi, %ebx
	shldl	$8, %esi, %ebx
	movl	crc_table+4(,%edx,8), %edi
	movl	crc_table(,%edx,8), %esi
	xorl	%ebx, %edi
	xorl	%ecx, %esi
	cmpl	%eax, 76(%esp)
	jne	.L4
Comment 4 Andrew Pinski 2011-12-08 03:37:28 UTC
-O1:

<bb 5>:
  # __crc0_73 = PHI <__crc0_35(5), __crc0_54(7)>
  # __data_75 = PHI <__data_32(5), data_7(7)>
  D.1900_26 = __crc0_73 >> 56;
  D.1901_27 = (int) D.1900_26;
  D.1902_28 = MEM[base: __data_75, offset: 0B];
  D.1903_29 = (int) D.1902_28;
  D.1904_30 = D.1903_29 ^ D.1901_27;
  __tab_index_31 = D.1904_30 & 255;
  __data_32 = __data_75 + 1;
  D.1905_33 = crc_table[__tab_index_31];
  D.1906_34 = __crc0_73 << 8;
  __crc0_35 = D.1905_33 ^ D.1906_34;
  if (__data_32 != D.1928_55)
    goto <bb 5>;
  else
    goto <bb 6>;

-O2:
<bb 5>:
  # __crc0_1 = PHI <__crc0_35(5), __crc0_54(7)>
  # __data_67 = PHI <__data_32(5), data_7(7)>
  D.1900_26 = __crc0_1 >> 56;
  D.1901_27 = (int) D.1900_26;
  D.1902_28 = MEM[base: __data_67, offset: 0B];
  D.1903_29 = (int) D.1902_28;
  D.1904_30 = D.1903_29 ^ D.1901_27;
  __tab_index_31 = D.1904_30 & 255;
  __data_32 = __data_67 + 1;
  D.1905_33 = crc_table[__tab_index_31];
  D.1906_34 = __crc0_1 << 8;
  __crc0_35 = D.1905_33 ^ D.1906_34;
  if (__data_32 != D.1955_86)
    goto <bb 5>;
  else
    goto <bb 6>;

Aka nothing on the tree level causes the issue.
Comment 5 Andrew Pinski 2011-12-08 03:53:22 UTC
This is definitely a register allocation issue because the RTL is basically the same when entering IRA.
Comment 6 Vladimir Makarov 2011-12-09 19:09:52 UTC
There is small difference in the code which results in such degradation.

-O1 generates an insn in the major loop

(insn 43 42 44 5 /home/cygnus/vmakarov/build1/trunk/crctest64.c:241 (parallel [
            (set (reg/v:SI 77 [ __tab_index ])
		(xor:SI (reg:SI 108)
                    (reg:SI 120)))
            (clobber (reg:CC 17 flags))
        ]) 395 {*xorsi_1} (expr_list:REG_DEAD (reg:SI 108)
        (expr_list:REG_DEAD (reg:SI 120)
            (expr_list:REG_UNUSED (reg:CC 17 flags)
                (nil)))))

-O2 generates analogous insn

(insn 39 38 40 5 /home/cygnus/vmakarov/build1/trunk/crctest64.c:241 (parallel [
            (set (reg/v:SI 83 [ __tab_index ])
                (xor:SI (reg/v:SI 83 [ __tab_index ])
                    (reg:SI 143)))
            (clobber (reg:CC 17 flags))
        ]) 395 {*xorsi_1} (expr_list:REG_DEAD (reg:SI 143)
	(expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))

The reason for the difference because of regmove optimization.

The RTL insn in the second variant looks even better but it makes
pseudo 83 most frequently used and assigned first by pushing it last
to the coloring stack between bunch trivially colorable pseudos.  The
set of trivially colorable pseudos contains two double word pseudos
which need two adjacent hard registers each.  Assigning pseudo 83
first (the case is complicated more because some pseudos cross calls)
results in presence of only one pair of adjacent hard registers
although there are still 2 free hard register for the second double
word pseudos but they are not adjacent.  It results in spilling of one
double word pseudo and code performance degradation.

For -O1 analog pseudo 83 (p77) is assigned last after assigning to two
double word pseudos and spilling does not occur.

To solve the problem we should increase probability of keeping free
hard registers adjacent.  It can be done by pushing multi-word pseudos
last to the coloring stack and as consequence to assign them first by
modifying function bucket_allocno_compare_func.  I did the problem was
solved unfortunately, it results in 2% performance degradation of
SPEC2000 perlbmk although there is a small code size improvement on
SPEC2000 with this heuristic.

On a general note, RA allocation is all about heuristics.  So it is
possible to find a test where it will work worse than other
heuristics.  The most important that RA works well in overall (on big
credible set of tests).  With this point of view IRA is much better
than the previous register allocator.

But because crc code is important, I'll continue the work on tuning
which does not degrade SPEC2000 and which does solve problem.
Comment 7 Vladimir Makarov 2011-12-12 20:51:19 UTC
Author: vmakarov
Date: Mon Dec 12 20:51:16 2011
New Revision: 182263

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=182263
Log:
2011-12-12  Vladimir Makarov  <vmakarov@redhat.com>

	PR rtl-optimization/21617
	* ira-color.c (bucket_allocno_compare_func): Don't compare
	allocno classes.  Compare number of hard registers needed.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ira-color.c
Comment 8 Jakub Jelinek 2011-12-12 21:15:23 UTC
Fixed on the trunk.  I don't think it is desirable to backport this.
Comment 9 Jakub Jelinek 2012-03-13 12:48:16 UTC
4.4 branch is being closed, moving to 4.5.4 target.
Comment 10 Richard Biener 2012-07-02 12:10:39 UTC
The 4.5 branch is being closed, adjusting target milestone.
Comment 11 Jakub Jelinek 2013-04-12 16:17:59 UTC
The 4.6 branch has been closed, fixed in GCC 4.7.0.