Bug 32044 - [4.3 Regression] final value replacement too aggressive for e.g. targets with no native div/mod insns
Summary: [4.3 Regression] final value replacement too aggressive for e.g. targets with...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 4.3.0
: P2 normal
Target Milestone: 4.3.4
Assignee: Richard Biener
URL:
Keywords: missed-optimization, TREE
: 31990 33419 34167 38453 (view as bug list)
Depends on:
Blocks:
 
Reported: 2007-05-22 17:20 UTC by Ray Malitzke
Modified: 2009-05-26 10:17 UTC (History)
17 users (show)

See Also:
Host:
Target:
Build:
Known to work: 4.2.2 4.2.1 4.4.0
Known to fail: 4.3.0 4.3.3
Last reconfirmed: 2009-04-22 15:15:13


Attachments
*.s files (1.38 KB, application/octet-stream)
2007-05-22 17:27 UTC, Ray Malitzke
Details
*.s files for gcc-4.2 (1.29 KB, application/octet-stream)
2007-05-22 17:37 UTC, Ray Malitzke
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Ray Malitzke 2007-05-22 17:20:31 UTC
First I am herewith re-afirming my formal request for Mr. Pinsk to refrain from
having anything to do with my submissions.

Now to the heart of the matter:

According to my (admittedly second hand (Fifth Edition of "C A Reference Manual"
by Samuel P. Harbison III & Guy L. Stelle Jr) reading; GCC, by not providing a 
means to disable the use of libgcc (including udivdi3) is not in strict 
conformance with the C standard for free-standing use through C99. __udivdi3 is a reserved identifier 
and hence non-conforming.

The irony is that, besides, being non-conforming and prejudicing free standing 
applications that aim for maximum portability, it is highly counterproductive 
in its own right.

Also, the forced and silent use of libgcc (lld does not show it being used) 
violates one of the fundamental principles of both UNIX and C. Namely that 
the user (certainly root) is to be in full and absolute command of the system 
without hidden reinterpretation of his commands or MS type questions.  As a 
practical matter the use of __builtin_expect could be taken as signal to 
allow only reordering of instructions (to avoid pipeline stalls and 
reloading of caches) are to be avoided in the marked unlikely cases. Any 
fundamental changes like exchanging a while and a subtraction for a non-hardware 
divide should no occur

If anybody at GCC wants to know what others (including L. Torvalds and A. Morton) think; checking Google on udivdi3 might be instructive.

What follows are the result of tests using current versions of gcc-4.3 and 4.2.1. 
I believe the results speak for themselves. Besides the data for x86 I also 
have quite similar data for powerpc G4,  which I will make available as a follow on.

	Program

#define NSEC_PER_SEC  1000000000UL	
int rmg(void);

int main(void)
{
	/* int sec; */
	return rmg();
}

int rmg(void)
{
static unsigned long long nsec = 0;
static int sec = 0;
    while (sec < 1 ) { 
	nsec++;
	while (__builtin_expect(nsec >= NSEC_PER_SEC, 0)) {
		nsec -= NSEC_PER_SEC;
		++sec;
	}
    }	
	return sec;
}


	gcc_43 -O0

-rwxr-xr-x 1 root root  8478 2007-05-22 08:23 rmgg_O0
-rw-r--r-- 1 root root  1238 2007-05-22 08:18 rmgg_O0.s

real	0m27.613s
user	0m27.607s
sys	0m0.003s


	gcc_43 -O1

-rwxr-xr-x 1 root root 12586 2007-05-22 08:25 rmgg_O1
-rw-r--r-- 1 root root  1572 2007-05-22 08:25 rmgg_O1.s

real	0m12.776s
user	0m12.775s
sys	0m0.003s

	gcc_43 -O2

-rwxr-xr-x 1 root root 12586 2007-05-22 08:27 rmgg_O2
-rw-r--r-- 1 root root  1874 2007-05-22 08:27 rmgg_O2.s

real	0m16.415s
user	0m16.414s
sys	0m0.004s

	gcc_43 -Os

-rwxr-xr-x 1 root root 12586 2007-05-22 08:29 rmgg_Os
-rw-r--r-- 1 root root  1925 2007-05-22 08:29 rmgg_Os.s

real	2m8.817s
user	2m8.831s
sys	0m0.003s



	Program

#define NSEC_PER_SEC  1000000000UL	
int rmg(void);

int main(void)
{
	/* int sec; */
	return rmg();
}

int rmg(void)
{
static unsigned long long nsec = 0;
static int sec = 0;
    while (sec < 1 ) { 
	nsec++;
	while (__builtin_expect(nsec >= NSEC_PER_SEC, 0)) {
		nsec -= NSEC_PER_SEC;
		++sec;
	}
    }	
	return sec;
}

	gcc_42 -O0

-rwxr-xr-x 1 root root 8471 2007-05-21 16:46 rmgg_O0
-rw-r--r-- 1 root root 1236 2007-05-21 16:41 rmgg_O0.s
time ./rmgg_O0

real	0m27.678s
user	0m27.680s
sys	0m0.002s
Script done on Mon 21 May 2007 04:53:29 PM EDT



	gcc_42 -O1 

-rwxr-xr-x 1 root root 8471 2007-05-21 16:41 rmgg_O1
-rw-r--r-- 1 root root 1572 2007-05-22 09:39 rmgg_O1.s

Script started on Mon 21 May 2007 04:56:20 PM EDT
time ./rmgg_O1

real	0m12.771s
user	0m12.767s
sys	0m0.003s
Script done on Mon 21 May 2007 04:56:55 PM EDT



	gcc_42 -O2

-rwxr-xr-x 1 root root 8471 2007-05-21 16:41 rmgg_O2
-rw-r--r-- 1 root root 1262 2007-05-21 17:41 rmgg_O2.s
Script started on Mon 21 May 2007 04:57:14 PM EDT
time ./rmgg_O2

real	0m12.532s
user	0m12.531s
sys	0m0.003s
Script done on Mon 21 May 2007 04:58:18 PM EDT



	gcc -Os

-rwxr-xr-x 1 root root 8471 2007-05-21 16:41 rmgg_Os
-rw-r--r-- 1 root root 1017 2007-05-21 16:40 rmgg_Os.s
Script started on Mon 21 May 2007 04:58:30 PM EDT
time ./rmgg_O2

real	0m12.571s
user	0m12.562s
sys	0m0.004s
Script done on Mon 21 May 2007 04:59:11 PM EDT
Comment 1 Ray Malitzke 2007-05-22 17:27:33 UTC
Created attachment 13601 [details]
*.s files

I believe that the *.s files in this case a superior to the *.i files
Comment 2 Ray Malitzke 2007-05-22 17:37:30 UTC
Created attachment 13602 [details]
*.s files for gcc-4.2

*.s files generated by gcc-4.2.1 as more responsive to the intent and superior in performance according to the selected option. -march=pentium3 make no difference.
Comment 3 Seongbae Park 2007-05-22 21:11:15 UTC
(In reply to comment #0)
> First I am herewith re-afirming my formal request for Mr. Pinsk to refrain from
> having anything to do with my submissions.

Please refrain from insulting others.

> According to my (admittedly second hand (Fifth Edition of "C A Reference
> Manual"
> by Samuel P. Harbison III & Guy L. Stelle Jr) reading; GCC, by not providing a 
> means to disable the use of libgcc (including udivdi3) is not in strict 
> conformance with the C standard for free-standing use through C99. __udivdi3 is
> a reserved identifier 
> and hence non-conforming.

If you want to play the language laywer,
please read the C standard and tell me which clause this gcc behavior violates.

> As a 
> practical matter the use of __builtin_expect could be taken as signal to 
> allow only reordering of instructions (to avoid pipeline stalls and 
> reloading of caches) are to be avoided in the marked unlikely cases. Any 
> fundamental changes like exchanging a while and a subtraction for a
> non-hardware 
> divide should no occur

The meaning of __builtin_expect is as defined in the documentation.
This particular transformation has nothing to do with __builtin_expect.
The transformation happens regardless of __builtin_expect.

scev_const_prop eliminates the loop by replacing
the final value with an expression using divide.
Probably the right approach would be to prevent this from happening
ifthe final value expression looks expensive
(like DImode divide on targets without native DImode divide).


The following patch avoids the situation 
(probably the condition check is too conservative
so this isn't fully cooked yet):

diff -r f01bb94713f4 gcc/tree-scalar-evolution.c
--- a/gcc/tree-scalar-evolution.c       Mon May 14 13:52:18 2007 +0000
+++ b/gcc/tree-scalar-evolution.c       Tue May 22 20:09:08 2007 +0000
@@ -252,6 +252,9 @@ 02110-1301, USA.  */
 #include "tree-pass.h"
 #include "flags.h"
 #include "params.h"
+/* For optab access.  */
+#include "expr.h"
+#include "optabs.h"

 static tree analyze_scalar_evolution_1 (struct loop *, tree, tree);

@@ -2942,6 +2945,7 @@ scev_const_prop (void)
       edge exit;
       tree def, rslt, ass, niter;
       block_stmt_iterator bsi;
+      optab op;

       /* If we do not know exact number of iterations of the loop, we cannot
         replace the final value.  */
@@ -2957,6 +2961,10 @@ scev_const_prop (void)
         the elimination of the final value may reveal.  Therefore, we now
         eliminate the final values of induction variables unconditionally.  */
       if (niter == chrec_dont_know)
+       continue;
+      op = optab_for_tree_code (TREE_CODE (niter), TREE_TYPE (niter));
+      if (op->handlers[TYPE_MODE (TREE_TYPE (niter))].insn_code == CODE_FOR_nothing
+        && EDGE_FREQUENCY (exit) > 500)
        continue;

       /* Ensure that it is possible to insert new statements somewhere.  */

Comment 4 Andrew Pinski 2007-05-22 22:04:35 UTC
> (probably the condition check is too conservative
> so this isn't fully cooked yet):

Way too conservative because even if you don't have the opcode, sometimes code can be produced without calling the libcall.
If you change the define to:
#define NSEC_PER_SEC  0x10000000UL

You will not get an call udividi3, though there is still a divide in unsigned long long, the division can be done exactly (via shifting).
Comment 5 Ian Lance Taylor 2007-05-22 23:18:06 UTC
The bug report as stated is clearly incorrect.  gcc is perfectly free to insert calls to __udivdi3.  This is not forbidden by any part of the C language standards.

To clarify for future readers, I gather that the actual problem here is that the code is written to assume that the quotient is small, and that it will be more efficient to use a loop than to actually divide.  So this is an optimization issue.  Normally eliminating loops is a good idea.  In this case it may not be.

It would of course be easy to prevent the optimization by declaring nsec to be volatile.  The question is whether the compiler can reasonably determine that the optimization is inappropriate in this particular case.
Comment 6 Ray Malitzke 2007-05-23 01:06:44 UTC
I did try changing #define 1000000000Ul to its rightful hexadecimal value #define 0x3b9aca00UL. the results are:

	.file	"rmgg.c"
.globl __udivdi3
	.text
	.p2align 4,,15
.globl rmg
	.type	rmg, @function
rmg:
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%edi
	movl	$-1000000000, %edi
	pushl	%ebx
	subl	$48, %esp
	movl	nsec.1646, %eax
	movl	nsec.1646+4, %edx
	movl	%eax, -16(%ebp)
	movl	sec.1647, %eax
	movl	%edx, -12(%ebp)
	testl	%eax, %eax
	jg	.L10
	.p2align 4,,15
.L5:
	movl	-16(%ebp), %edx
	movl	-12(%ebp), %ecx
	addl	$1, %edx
	adcl	$0, %ecx
	cmpl	$0, %ecx
	jbe	.L11
.L7:
	addl	$-1000000000, %edx
	adcl	$-1, %ecx
	incl	%eax
	addl	$-999999999, -16(%ebp)
	movl	%edx, -32(%ebp)
	movl	$1000000000, %edx
	adcl	$-1, -12(%ebp)
	movl	%ecx, -28(%ebp)
	movl	%edx, 8(%esp)
	movl	-12(%ebp), %ecx
	movl	-16(%ebp), %edx
	movl	%eax, -20(%ebp)
	xorl	%eax, %eax
	movl	%eax, 12(%esp)
	movl	%ecx, 4(%esp)
	movl	%edx, (%esp)
	call	__udivdi3
	imull	$-1000000000, %edx, %ecx
	movl	%eax, %ebx
	subl	%eax, %ecx
	mull	%edi
	addl	%ecx, %edx
	movl	%eax, -40(%ebp)
	movl	-20(%ebp), %eax
	movl	%edx, -36(%ebp)
	movl	-40(%ebp), %edx
	addl	-32(%ebp), %edx
	movl	-36(%ebp), %ecx
	adcl	-28(%ebp), %ecx
	addl	%ebx, %eax
	movl	%edx, -16(%ebp)
	movl	%ecx, -12(%ebp)
	.p2align 4,,15
.L12:
	testl	%eax, %eax
	jle	.L5
.L10:
	movl	-16(%ebp), %edx
	movl	-12(%ebp), %ecx
	movl	%eax, sec.1647
	movl	%edx, nsec.1646
	movl	%ecx, nsec.1646+4
	addl	$48, %esp
	popl	%ebx
	popl	%edi
	popl	%ebp
	ret
	.p2align 4,,7
.L11:
	cmpl	$999999999, %edx
	ja	.L7
	movl	%edx, -16(%ebp)
	movl	%ecx, -12(%ebp)
	jmp	.L12
	.size	rmg, .-rmg
	.p2align 4,,15
.globl main
	.type	main, @function
main:
	leal	4(%esp), %ecx
	andl	$-16, %esp
	pushl	-4(%ecx)
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%ecx
	subl	$4, %esp
	call	rmg
	popl	%ecx
	popl	%ecx
	popl	%ebp
	leal	-4(%ecx), %esp
	ret
	.size	main, .-main
	.local	sec.1647
	.comm	sec.1647,4,4
	.local	nsec.1646
	.comm	nsec.1646,8,8
	.ident	"GCC: (GNU) 4.3.0 20070522 (experimental)"
	.section	.note.GNU-stack,"",@progbits

I will try both Mr. Taylor volatile suggestion and if I can retrieve Mr Park's proposed patch in ascii I will try that one too

Thanks
Comment 7 Ray Malitzke 2007-05-23 02:09:11 UTC
Thank you Mr Taylor; your suggestion to use volatile certainly work in this drastically reduced test case. If it will work when nsec is part of a kernel structure I will leave to the experts. I, certainly, know better than to argue
with you, who wrote uucp and has been active on gcc for probably 15 years or more.

The *.s file and the results of time follow: (both obtained with -O1)

	.file	"rmgg.c"
	.text
	.p2align 4,,15
.globl rmg
	.type	rmg, @function
rmg:
	movl	sec.1647, %ecx
	pushl	%ebp
	movl	%esp, %ebp
	testl	%ecx, %ecx
	jg	.L18
.L6:
	movl	nsec.1646, %eax
	movl	nsec.1646+4, %edx
	addl	$1, %eax
	adcl	$0, %edx
	movl	%eax, nsec.1646
	movl	%edx, nsec.1646+4
	movl	nsec.1646, %eax
	movl	nsec.1646+4, %edx
	cmpl	$0, %edx
	jbe	.L15
.L13:
	movl	nsec.1646, %eax
	movl	nsec.1646+4, %edx
	addl	$-1000000000, %eax
	adcl	$-1, %edx
	incl	%ecx
	movl	%eax, nsec.1646
	movl	%edx, nsec.1646+4
	movl	nsec.1646, %eax
	movl	nsec.1646+4, %edx
	cmpl	$0, %edx
	ja	.L13
	.p2align 4,,15
.L15:
	cmpl	$999999999, %eax
	ja	.L13
	testl	%ecx, %ecx
	jle	.L6
.L18:
	popl	%ebp
	movl	%ecx, %eax
	movl	%ecx, sec.1647
	ret
	.size	rmg, .-rmg
	.p2align 4,,15
.globl main
	.type	main, @function
main:
	leal	4(%esp), %ecx
	andl	$-16, %esp
	pushl	-4(%ecx)
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%ecx
	call	rmg
	popl	%ecx
	popl	%ebp
	leal	-4(%ecx), %esp
	ret
	.size	main, .-main
	.local	sec.1647
	.comm	sec.1647,4,4
	.local	nsec.1646
	.comm	nsec.1646,8,8
	.ident	"GCC: (GNU) 4.3.0 20070522 (experimental)"
	.section	.note.GNU-stack,"",@progbits




real	0m16.433s
user	0m16.425s
sys	0m0.004s
Comment 8 Richard Biener 2007-05-23 11:17:23 UTC
This particular case is indeed an optimization issue and __udivdi3 can be avoided
by using volatile as stated and verified.
Comment 9 Manuel López-Ibáñez 2007-05-23 13:15:32 UTC
(In reply to comment #3)
> 
> The meaning of __builtin_expect is as defined in the documentation.
> This particular transformation has nothing to do with __builtin_expect.
> The transformation happens regardless of __builtin_expect.
> 
> scev_const_prop eliminates the loop by replacing
> the final value with an expression using divide.
> Probably the right approach would be to prevent this from happening
> ifthe final value expression looks expensive
> (like DImode divide on targets without native DImode divide).
> 

Isn't there a way for __builtin_expect to modify this behaviour? After all, it is telling us that the loop is cheap. And the difference in computation time is not trivial at all.

Comment 10 Ray Malitzke 2007-05-23 13:17:46 UTC
Mr. Guenther!

The volatile fix would be fine, but (at least for me) does not work with the kernel. There is that little message:

kernel/time.c:479: warning: passing argument 3 of 'div_long_rem_signed' discards qualifiers from pointer target type.

and others like it, and, udivdi3 reappears.

Mr. Park!

The patch you kindly included in comment 3 presented two difficulties:

1. I Acould not extricate it cleanly enough from the html encoding apparently standard with bugzilla. (this is a Mozilla Product)

2. After some editing patch just accepted 1 hunk and upon checking it turned out that the svn derived tree-scalar.evolution.c did not match the enclosing lines around the lines to be added. I added those lines by hand (possibly imperfectly, enven on careful checking) the file compiled OK, but, runnign the gcc check sequence I got a stream of error. These errors disappeared on using a sequestered unpatched copy of the file. Hence, udivdi3 reappeared. If you see fit for me to test this not only on the reduced test case but on the actual kernel I suggest sending me a updated patch as a text attachment to my email.

Thanks to all for trying to help.

 
Comment 11 Ray Malitzke 2007-05-23 14:51:27 UTC
Mr. Ibanez!

Thank you (muchas gracias) for looking at the matter from a user's point of view and considering my arguments concerning __builtin_expect. You seem to be the first to look at the timings and amount of code generated. If you are interested I have equivalent data taken on a MAC with dual G4's. I did not send it so far because until you intervened I got mostly legalistic arguments and proposed fixes that do no solve the real problem of avoiding both udivdi3 and more importantly libgcc.
Comment 12 Richard Biener 2007-05-23 15:13:15 UTC
So this is now an enhancement request for sccp to honor loop roll count or
basic-block frequency and cost of the replacement.  Note the loop appears to be
peeled twice before sccp already, but peeling doesn't decay probabilities further.

Testcase:

int rmg(unsigned long long nsec)
{
   int sec = 0;
   nsec++;
   while (__builtin_expect(nsec >= 1000000000UL, 0)) {
      nsec -= 1000000000UL;
      ++sec;
   }
   return sec;
}

note this can be worked around with -fno-tree-scev-cprop as well.
Comment 13 Ray Malitzke 2007-05-24 14:08:55 UTC
Mr Guenther!

Thank you (herzlichen Dank) for the information about the hopefully disabling flag. If that information would have been posted after my initial intervention we could have saved a lot of bandwidth and storage space. I realize libgcc is one of the areas you are responsible for and libgcc definitely fills a need for a hosted implementation. Also, (at least for me) the optimization is strictly an internal matter for the gcc community and I have nothing to contribute in this regard.

To sum up; as a user, and, in the UNIX spirit (I started with the 7th edition), I just want freedom in choosing the facilities and features gcc has to offer. I hope that this or similar flags provide me with the capability to banish libgcc and similar facilities from my programs under the C free-standing specification.

Comment 14 Manuel López-Ibáñez 2007-05-24 16:27:35 UTC
(In reply to comment #13)
> 
> To sum up; as a user, and, in the UNIX spirit (I started with the 7th edition),
> I just want freedom in choosing the facilities and features gcc has to offer. I
> hope that this or similar flags provide me with the capability to banish libgcc
> and similar facilities from my programs under the C free-standing
> specification.
> 

The flag just disables an optimisation. If you want to disable optimisations just  use -O0. On the other hand, shouldn't -ffreestanding prevent udivdi3 ? What about -fno-builtin-udivdi3 ?

Comment 15 Manuel López-Ibáñez 2007-05-24 16:37:45 UTC
(In reply to comment #14)
> 
> The flag just disables an optimisation. If you want to disable optimisations
> just  use -O0. On the other hand, shouldn't -ffreestanding prevent udivdi3 ?
> What about -fno-builtin-udivdi3 ?
> 

OK. I found the answer myself: PR16470. So if you want a freestanding environment, use static libgcc. 
Comment 16 Zdenek Dvorak 2007-06-17 06:38:53 UTC
(In reply to comment #12)
> So this is now an enhancement request for sccp to honor loop roll count or
> basic-block frequency and cost of the replacement.

we used to take the cost of the replacement into account.  It caused so many missed-optimization PRs that I decided to just disable it.  The main problem is that while theoretically you can determine whether replacement is more costly then performing the computation in the loop (although even this is nontrivial in practice), it is very difficult to estimate the gains of enabling further optimizations.

One possible solution would be to annotate the division by the expected value of the result.  Division expanders then may decide whether to expand to machine instruction/libcall or to check for small values of the result in if-guards first.
Comment 17 Richard Biener 2007-10-12 13:44:01 UTC
*** Bug 31990 has been marked as a duplicate of this bug. ***
Comment 18 Manuel López-Ibáñez 2007-11-08 13:47:32 UTC
(In reply to comment #4)
> > (probably the condition check is too conservative
> > so this isn't fully cooked yet):
> 
> Way too conservative because even if you don't have the opcode, sometimes code
> can be produced without calling the libcall.
> If you change the define to:
> #define NSEC_PER_SEC  0x10000000UL
> 
> You will not get an call udividi3, though there is still a divide in unsigned
> long long, the division can be done exactly (via shifting).
> 

Just to understand the issue better. Which pass/function is responsible to detect that the division can be done exactly and replace it by a shift? Can't this be checked when the result is computed and moved outside the loop?

By the way, __builtin_expect generates a probability of >90% of exiting the loop. Shouldn't that be taken into account when moving anything outside the loop?
Comment 19 Manuel López-Ibáñez 2007-11-08 14:22:47 UTC
(In reply to comment #16)
> One possible solution would be to annotate the division by the expected value
> of the result.  Division expanders then may decide whether to expand to machine
> instruction/libcall or to check for small values of the result in if-guards
> first.
> 

Well, that won't achieve at all what the user wants, that is, to not get a call to __udividi3. 

There are 2 issues here:

1) GCC replaces the loop by division. This may or may not be more efficient. In this case, it is not. In many other cases, it is.

2) The user does not want a call to __udividi3.

If I understand correctly you need __udividi3 to do the division, am I wrong? Of course, you could "inline" it or link statically to avoid libgcc. The former will pessimize the code in most cases and the latter can be done by the user.  

Tricky. Is it possible to fix (2) without fixing (1)?
Comment 20 rakdver@kam.mff.cuni.cz 2007-11-08 14:46:53 UTC
Subject: Re:  [4.3 regression] udivdi3 counterproductive, unwarranted use

> ------- Comment #19 from manu at gcc dot gnu dot org  2007-11-08 14:22 -------
> (In reply to comment #16)
> > One possible solution would be to annotate the division by the expected value
> > of the result.  Division expanders then may decide whether to expand to machine
> > instruction/libcall or to check for small values of the result in if-guards
> > first.
> > 
> 
> Well, that won't achieve at all what the user wants, that is, to not get a call
> to __udividi3. 

However, it would avoid calling the division most of the time, i.e.,
for performance this would be a possible solution.

> There are 2 issues here:
> 
> 1) GCC replaces the loop by division. This may or may not be more efficient. In
> this case, it is not. In many other cases, it is.

The code performing the replacement should probably check what is the
expected # of iterations of the loop, and compute which way is more
efficient.  However, it is a bit difficult to get this work correctly.

> 2) The user does not want a call to __udividi3.

There is nothing wrong with calling __udividi3, once the division is
introduced.
Comment 21 Adrian Bunk 2007-11-09 16:26:49 UTC
Let's leave the right/wrong discussion and look at it more pragmatically:

Could gcc get some kind of --expensive-libgcc flag that tells gcc that libgcc calls are a bit more expensive than usually and should be avoided?

Comment 22 rguenther@suse.de 2007-11-09 16:33:17 UTC
Subject: Re:  [4.3 regression] udivdi3 counterproductive,
 unwarranted use

On Fri, 9 Nov 2007, bunk at stusta dot de wrote:

> ------- Comment #21 from bunk at stusta dot de  2007-11-09 16:26 -------
> Let's leave the right/wrong discussion and look at it more pragmatically:
> 
> Could gcc get some kind of --expensive-libgcc flag that tells gcc that libgcc
> calls are a bit more expensive than usually and should be avoided?

While this is technically possible we already face stability and
maintainability problems in other areas where we do so (for example
the conditionally available C99 math has in the past lead to many
internal compiler errors).  So I would prefer not to go this route.
Which does _not_ mean that we absolutely will not fix this issue
and try to avoid creating final value replacement that involves
a division/modulus.

But as there exist multiple work-arounds for this issue, a nice and
complete solution is certainly not very high priority.

Richard.
Comment 23 Adrian Bunk 2007-11-09 17:09:29 UTC
We need a way to globally prevent it in the kernel or it will be a repeating source of problems there.

Is -fno-tree-scev-cprop a reasonable (and not too expensive) workaround for the Linux kernel?
Comment 24 rguenther@suse.de 2007-11-09 17:11:47 UTC
Subject: Re:  [4.3 regression] udivdi3 counterproductive,
 unwarranted use

On Fri, 9 Nov 2007, bunk at stusta dot de wrote:

> ------- Comment #23 from bunk at stusta dot de  2007-11-09 17:09 -------
> We need a way to globally prevent it in the kernel or it will be a repeating
> source of problems there.
> 
> Is -fno-tree-scev-cprop a reasonable (and not too expensive) workaround for the
> Linux kernel?

Yes, as is using volatile on this particular loop variable.

Richard.
Comment 25 Adrian Bunk 2007-11-10 07:41:02 UTC
Adding workarounds in all affected places in the kernel would be horribly fragile, but I've confirmed your -fno-tree-scev-cprop suggestion works around it and I'll submit a patch to the Linux kernel to use it with gcc 4.3.
Comment 26 İsmail Dönmez 2007-11-20 21:19:00 UTC
*** Bug 34167 has been marked as a duplicate of this bug. ***
Comment 27 Mark Mitchell 2007-11-27 05:51:06 UTC
Ian, Diego, Richard G. --

I don't understand the state of this issue.  Do we still think we have a problem that needs fixing in GCC?  Has the Linux kernel worked around the issue?

I'm interested both in knowing whether or not the compiler has a bug and, if it does, whether than affects current versions of the kernel.

Thanks,

-- Mark
Comment 28 İsmail Dönmez 2007-11-27 05:55:56 UTC
Mark,

Linux 2.6 tree as of this minute doesn't compile with gcc 4.3 trunk. You need to add -fno-tree-scev-cprop to the KBUILD_CFLAGS, but this is not upstreamed yet and I am not sure if it'll be accepted in case it results in a performance regression.

Thanks.
Comment 29 Richard Biener 2007-11-27 09:43:36 UTC
This is IMHO at most a QOI issue - at Novell we mark timespec_add_ns's u64
parameter as volatile to work around this issue.  I expect upstream to adopt
a workaround as well.  Note that some targets in the kernel have parts of
libgcc implemented, but i?86 misses at least __udivdi3.
Comment 30 Mark Mitchell 2007-11-27 18:58:58 UTC
Subject: Re:  [4.3 regression] udivdi3 counterproductive,
 unwarranted use

rguenth at gcc dot gnu dot org wrote:
> ------- Comment #29 from rguenth at gcc dot gnu dot org  2007-11-27 09:43 -------
> This is IMHO at most a QOI issue - at Novell we mark timespec_add_ns's u64
> parameter as volatile to work around this issue.  I expect upstream to adopt
> a workaround as well.  Note that some targets in the kernel have parts of
> libgcc implemented, but i?86 misses at least __udivdi3.

I am not a kernel developer, but my feeling as a GCC developer is that
you must provide the entry points in libgcc whenever you are linking
code compiled with GCC.  In other words, that GCC should be free to use
functions from libgcc as it pleases.

Of course, it might be a GCC optimization bug to call __udivdi3; perhaps
it could generate more efficient code that doesn't call that function.

Do others agree?  That this is at most an optimization issue, but not a
correctness issue?

Comment 31 Adrian Bunk 2007-11-27 19:16:33 UTC
(In reply to comment #29)
> This is IMHO at most a QOI issue - at Novell we mark timespec_add_ns's u64
> parameter as volatile to work around this issue.  I expect upstream to adopt
> a workaround as well.  Note that some targets in the kernel have parts of
> libgcc implemented, but i?86 misses at least __udivdi3.

The problem is that this is very fragile - imagine what might happen
when a frequently used struct member gets changed from int to u64.

After all, when looking at current practice, the kernel will support
being built with gcc 4.3 until 2013 or 2014.
Comment 32 Adrian Bunk 2007-11-27 19:31:06 UTC
(In reply to comment #30)
>...
> I am not a kernel developer, but my feeling as a GCC developer is that
> you must provide the entry points in libgcc whenever you are linking
> code compiled with GCC.  In other words, that GCC should be free to use
> functions from libgcc as it pleases.
>...

Status quo up to gcc 4.2 was that gcc did not use them in the kernel on i386.

The root of the problem is that your feeling and what the kernel currently implements do not match, and some flamewars^Wdiscussions might be required for getting this sorted out in either gcc or the kernel...

Even if this specific issue in the kernel would turn out as a misoptimization, the general problem would still remain waiting to pop up at some later time at a  different place.
Comment 33 rakdver@kam.mff.cuni.cz 2007-11-27 19:42:40 UTC
Subject: Re:  [4.3 regression] udivdi3 counterproductive, unwarranted use

> > ------- Comment #29 from rguenth at gcc dot gnu dot org  2007-11-27 09:43 -------
> > This is IMHO at most a QOI issue - at Novell we mark timespec_add_ns's u64
> > parameter as volatile to work around this issue.  I expect upstream to adopt
> > a workaround as well.  Note that some targets in the kernel have parts of
> > libgcc implemented, but i?86 misses at least __udivdi3.
> 
> I am not a kernel developer, but my feeling as a GCC developer is that
> you must provide the entry points in libgcc whenever you are linking
> code compiled with GCC.  In other words, that GCC should be free to use
> functions from libgcc as it pleases.

I would agree with this.  Even if we somehow prevent gcc from using
__udivdi3 in this particular case, there is no guarantee that gcc will
not decide to use this or some other function from the libgcc runtime at
some other place (unless we decide to have some kind of -dont-use-libgcc
flag, which would however be a bit difficult to implement, and it would
make us fail to compile some valid codes).

> Of course, it might be a GCC optimization bug to call __udivdi3; perhaps
> it could generate more efficient code that doesn't call that function.

Definitely.  It is particularly annoying that this is the case where the
programmer went through the pain of avoiding the usage of division, just to
have it "optimized" back in.
Comment 34 Jakub Jelinek 2007-11-27 19:43:50 UTC
I certainly agree with Mark here, we can talk whether on a particular arch
with particular options this transformation is a win or not, but when the kernel
people decide to implement libgcc on their own and intentionally leave parts of
it out, they need to handle the consequences of that.
The addition of asm ("" : "+r" (nsec)); to the loop I suggested, perhaps hidden
in some macro, will act as an optimization barrier against it and won't (unlike volatile) cost anything.

Even when the current scev-cprop transformation might not be a win in the case of
__builtin_expect unlikely loop, the compiler e.g. could decide to transform this into:
if (__builtin_expect (nsec < NSEC_PER_SEC, 1))
  sec += nsec / NSEC_PER_SEC, nsec %= NSEC_PER_SEC;

or e.g.

if (__builtin_expect (nsec < NSEC_PER_SEC * 10, 1))
  while (__builtin_expect (nsec > NSEC_PER_SEC, 0))
    nsec -= NSEC_PER_SEC, sec++;
else
  sec += nsec / NSEC_PER_SEC, nsec %= NSEC_PER_SEC;

which would still reference __udivdi3, but wouldn't be (measurably) slower in
case nsec is really small, yet would be much faster if nsec happens to be big.
Comment 35 Mark Mitchell 2007-11-27 19:45:57 UTC
Subject: Re:  [4.3 regression] udivdi3 counterproductive,
 unwarranted use

bunk at stusta dot de wrote:

> Even if this specific issue in the kernel would turn out as a misoptimization,
> the general problem would still remain waiting to pop up at some later time at
> a  different place.

Indeed.  However, I think that the kernel developers should be aware
that GCC is not designed to avoid libgcc functions.  GCC fundamentally
assumes that it may call various functions from libgcc as it pleases.
(Sometimes it may do so for good reasons, sometimes it may be that it
does it suboptimally.)

Because there's nothing in GCC to keep it from "randomly" calling libgcc
functions, if the kernel wants to be robust against different versions
of GCC, it should provide definitions of these functions.  There's no
easy way to make GCC avoid these functions.

That's not meant to defend GCC calling this particular function in this
particular circumstance, of course.

Comment 36 Andrew Pinski 2007-12-26 01:29:20 UTC
*** Bug 33419 has been marked as a duplicate of this bug. ***
Comment 37 Mark Mitchell 2008-01-02 23:17:06 UTC
Downgrading to P4.  We seem to have consensus that this is now a GCC wrong-code bug.  I can't tell if we think this is an optimization bug or not; if we do think so, please bump this back to P3, with a test-case showing the poor optimization choice.  I will review and upgrade to P2.
Comment 38 Frank Ch. Eigler 2008-01-04 03:19:36 UTC
(In reply to comment #37)
> Downgrading to P4.  We seem to have consensus that this is [not] a GCC wrong-code
> bug.

Yeah, it seems to be a mistaken expectation of -ffreestanding not to
call libgcc.  Maybe a new option to that effect would help?
Comment 39 Mark Mitchell 2008-01-04 04:43:32 UTC
Subject: Re:  [4.3 regression] udivdi3 counterproductive,
 unwarranted use

fche at redhat dot com wrote:

>> Downgrading to P4.  We seem to have consensus that this is [not] a GCC wrong-code
>> bug.
> 
> Yeah, it seems to be a mistaken expectation of -ffreestanding not to
> call libgcc.  Maybe a new option to that effect would help?

I don't think there's a practical, platform-independent way for GCC to
avoid calling libgcc.  On some platforms, it has to do that for pretty
basic operations.  I think we just need to accept that the libgcc API is
part of what's required by the compiler; you could imagine that it is
"as-if" weak definitions of these functions were emitted in every
assembly file by the compiler itself.

Comment 40 рам╛б╩(Yi Tonglu) 2008-01-15 02:39:35 UTC
This bug cause linux kernel unable to compile. So I think it must be fixed before 4.3 is released
Comment 41 İsmail Dönmez 2008-01-15 02:42:48 UTC
(In reply to comment #40)
> This bug cause linux kernel unable to compile. So I think it must be fixed
> before 4.3 is released

Yes and there is a known workaround, see comment #28
Comment 42 Andrew Pinski 2008-01-15 05:47:08 UTC
(In reply to comment #40)
> This bug cause linux kernel unable to compile. So I think it must be fixed
> before 4.3 is released

Except there is two parts to this bug, first it is a bug in the linux kernel for not including this function.  Second the GCC bug is that it produces less than optimal code but that should not be a blocker for GCC 4.3.0.
Comment 43 Joseph S. Myers 2008-03-15 00:40:21 UTC
Update milestone after 4.3.0 release.
Comment 44 Richard Biener 2008-06-06 14:56:58 UTC
4.3.1 is being released, adjusting target milestone.
Comment 45 Joseph S. Myers 2008-08-27 22:02:06 UTC
4.3.2 is released, changing milestones to 4.3.3.
Comment 46 Steven Bosscher 2008-12-10 11:25:59 UTC
*** Bug 38453 has been marked as a duplicate of this bug. ***
Comment 47 Steven Bosscher 2008-12-10 11:42:50 UTC
Re. comment #37:
Mark, bug 38453 has a simple test case that shows the poor optimization choice for ARM-linux.  Also, there are now 4 bugs closed as duplicates of this one, so many users run into this and consider it important enough an issue to file a bug report about it.

Re. comment #16:
Zdenek, do you remember which revision / patch removed the cost check?  And do you recall (or can you recover) some of the missed-optimization bug report numbers?  I tried to find them with a Bugzilla query, but failed.

With the removal of the cost check, we've gone from missed-optimization bugs to too-aggressive-optimization bugs that even require hacks/workarounds from our users.  To me, it seems we have made the wrong trade-off, then.

In my opinion, this *is* an optimizer bug, and, actually, a much more important bug than some of the regressions that are P2/P3 now for gcc 4.3 and gcc 4.4.  

Comment 48 Steven Bosscher 2008-12-10 11:43:41 UTC
To P3 per comment #37.
Comment 49 Richard Biener 2008-12-10 11:45:30 UTC
P2, similar to all other missed-optimization regressions.
Comment 50 Steven Bosscher 2008-12-10 12:31:22 UTC
The cost check for final value replacement was removed in revision 122896 (from bug 33419, see http://gcc.gnu.org/viewcvs?view=rev&revision=122896)
Comment 51 Daniel Silverstone 2008-12-10 15:49:11 UTC
I'd like to raise here that bug 38453 which was marked as a duplicate of this is in fact an example not only of wrong optimisation, but of missed optimisation as well.

The compiler emits the loop in the bug report *and* then emits the umod call despite already having the same answer in a register.

If the compiler had only emitted the umod call then at least it wouldn't have done the work twice.

I think it's critical that there be costs applied to these builtins -- especially for platforms where they are a full function call away rather than being instructions.
Comment 52 Mark Mitchell 2008-12-10 17:11:45 UTC
Steven, thanks for your comments on this issue.  I agree with Richard G. that this is P2 -- but I also agree with you that it's a serious issue.  

Unfortunately, Bugzilla doesn't want to show me the test case attached to PR38453, so I can't look at that.  Is the issue there that we don't take advantage of the ARM __aeabi_uidivmod function that gives both the quotient and modulus?  Or is it that we're not giving CSE enough information to figure out that the value we have is already available?

Zdenek, it would certainly be helpful to have the original justification for your change available.  It does sound a bit like we've traded one set of problems for another, and if those are our only practical options, we need to decide which set of problems is worse.

Thanks,

-- Mark
Comment 53 rakdver@kam.mff.cuni.cz 2008-12-10 18:15:04 UTC
Subject: Re:  [4.3/4.4 regression] udivdi3 counterproductive, unwarranted use

Hi,

> Re. comment #16:
> Zdenek, do you remember which revision / patch removed the cost check?

rev. 122896

> And do
> you recall (or can you recover) some of the missed-optimization bug report
> numbers?  I tried to find them with a Bugzilla query, but failed.

No; the PRs addressed by that patch seem unrelated.
Comment 54 rakdver@kam.mff.cuni.cz 2008-12-10 18:23:36 UTC
Subject: Re:  [4.3/4.4 Regression] final value replacement too aggressive for e.g. targets with no native div/mod insns

> Zdenek, it would certainly be helpful to have the original justification for
> your change available.

Unfortunately, I do not remember the exact problems (it is almost two
years since that change was made).  Adding back expression_expensive_p
in some conservative form (e.g., checking that the computation involves
division, perhaps excluding divisions by a power of two) seems like
the easiest way to fix this problem now.
Comment 55 Steven Bosscher 2008-12-10 21:27:34 UTC
// This is the test case from PR38453.
// See comment #0 of that bug for further information:
// http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38453#c0

typedef struct 
{
  int lc;
  int pb;
} bar;

void foo(bar *propsRes, const unsigned char *propsData)
{
    unsigned char prop0;
    prop0 = propsData[0];

    while (prop0 >= 45) {
        propsRes->pb++;
        prop0-=45;
    }
    propsRes->lc = prop0;
}


int main(int argc, char **argv)
{
    bar propsRes;
    unsigned char propsData[1024];
    foo(&propsRes, propsData);
}
Comment 56 Steven Bosscher 2008-12-10 21:44:21 UTC
Re. comment #52:

I've pasted the test case in the audit trail here as plain text -- it's pretty small and it shows the problem nicely.  The issue is that with -Os, on all targets, the line,

  propsRes->lc = prop0;

is replaced with:

  propsRes->lc = (int) (int) (prop0.24 % 45); // with prop0.24 = propsData[0];

so there is no div/mod instruction in the original code, but one appears as a result of the final value replacement optimization.  With -O2 there is also a div instruction, but I am not sure where it comes from (I suspect loop header copying, but it's inconsequential for this issue anyway).

The problem is not that we don't give CSE enough information. The problem is reported for a GCC 4.3 based compiler, which still has libcall notes.  So at least in theory, CSE in GCC 4.3 should have all the information available to optimize away the mod instruction.  Apparently the result is just not available.

The bottom line, then, is still that the final value replacement has introduced an operation at the tree level that doesn't have a matching RTL instruction.  In this case a mod insn that makes gcc produce a libcall to umodsi.

I think Zdenek is right in comment #54: We should reincarnate expression_expensive_p in some form such that it takes into account (or makes a reasonable guess about) which instructions are particularly expensive.  Apparently, the old cost check was too conservative (it tested the expression cost against spill cost, ignoring further optimization opportunities).  So we need to test for something else than before.

Maybe we should not replace if the final value is not a constant?  Or if it is an expression involving div/mod?  Can we test somehow how many insns the final value replacement computation would take (maybe re-use the inliner heuristics)?
Comment 57 rakdver@kam.mff.cuni.cz 2008-12-10 22:02:26 UTC
Subject: Re:  [4.3/4.4 Regression] final value replacement too aggressive for e.g. targets with no native div/mod insns

> I think Zdenek is right in comment #54: We should reincarnate
> expression_expensive_p in some form such that it takes into account (or makes a
> reasonable guess about) which instructions are particularly expensive. 
> Apparently, the old cost check was too conservative (it tested the expression
> cost against spill cost, ignoring further optimization opportunities).  So we
> need to test for something else than before.
> 
> Maybe we should not replace if the final value is not a constant?

That is way too conservative (replacing the final values like say "n + 1"
should usually be profitable).

> Or if it is an expression involving div/mod? 

That is what I currently plan to do.

> Can we test somehow how many insns the final
> value replacement computation would take

I think that has the same problems as the original solution (it is
not possible to find a bound that would both prevent expensive things to be
moved out and to allow the non-expensive ones, as the exact costs are
target-specific and we are not estimating them precisely enough).

> (maybe re-use the inliner heuristics)?

That would be difficult, as estimate_num_insns works on gimple
statements, not on expressions.
Comment 58 Zdenek Dvorak 2008-12-10 22:55:12 UTC
(In reply to comment #56)
> Re. comment #52:
> 
> I've pasted the test case in the audit trail here as plain text -- it's pretty
> small and it shows the problem nicely.  The issue is that with -Os, on all
> targets, the line,
> 
>   propsRes->lc = prop0;
> 
> is replaced with:
> 
>   propsRes->lc = (int) (int) (prop0.24 % 45); // with prop0.24 = propsData[0];
> 
> so there is no div/mod instruction in the original code, but one appears as a
> result of the final value replacement optimization.  With -O2 there is also a
> div instruction, but I am not sure where it comes from (I suspect loop header
> copying, but it's inconsequential for this issue anyway).

actually, the div instruction also comes from the final value replacement.  We eliminate the loop completely and replace it with the following:

D.1286 = prop0 + 211;
prop0 = D.1286 % 45;
propsRes->pb = [plus_expr] (propsRes->pb + 1) + (int) (D.1286 / 45);
Comment 59 Mark Mitchell 2008-12-11 00:14:46 UTC
Steven, Zdenek --

Is there any way to teach the compiler to use the ARM __aeabi_divmod routine, which provides both the quotient and remainder?  At least with -Os, that is probably optimal.  In other words, once we get to:

prop0 = D.1286 % 45;
propsRes->pb = [plus_expr] (propsRes->pb + 1) + (int) (D.1286 / 45);

can we notice that we have both "D.1286 % 45" and "D.1286 / 45" and use the divmod routine?

Furthermore, if we want to generate the loop instead, are the alternatives you're considering sufficient to change code like:

  ... x / 45 ...

into the loop form?  In other words, rather than just considering this a problem of transforming the hand-written loop into a division operation, shouldn't we also be considering the problem of transforming division into the loop form?

The style:

  a = x / N;
  b = x % N;

is pretty common; recognizing that pattern and generating a loop, or libcall to __aeabi_divmod, seems like a good thing.  One could argue that __aeabi_divmod itself should be responsible for noticing divisors that more profitably use repeated subtraction than other methods, on CPUs that don't have a hardware divide instruction.

Do we need a divmod optab?  Or, should we expect that if we provide an appropriate pattern in the ARM back end, combine will be able to smush the division and remainder operations together?

I guess where I'm going with this line of thinking is to argue that the transformation to division is fine, and that we should treat that as canonical -- but that we need a way to change divisions into repeated divisions if appropriate.

-- Mark
Comment 60 Steven Bosscher 2008-12-11 00:27:26 UTC
IMHO I the transformation to division is not fine.  I would argue this is the core issue in this problem report.

You are right that a combination of div and mod is quite common in real-world code.  Right now, GCC can catch that when the port emits a divmod libcall by default instead of just a div and/or a mod.  However, I consider this a separate issue.

GCC currently does not recognize div/mod or mod/div as patterns, because at the tree level this would show up as two separate expressions (possibly with statements between the div and mod expresssions), and expand works one statement at a time.

Some targets (e.g. x86) call a divmod libcall by default for a div or a mod instruction.  If there is a div/mod or mod/div sequence in the code, GCC optimizes away one of the libcalls via the REG_EQUAL notes, IIRC.

But our concern now should be to avoid this situation where gcc "invents" a div or mod instruction in the first place.
Comment 61 rakdver@kam.mff.cuni.cz 2008-12-11 00:28:43 UTC
Subject: Re:  [4.3/4.4 Regression] final value replacement too aggressive for e.g. targets with no native div/mod insns

> Furthermore, if we want to generate the loop instead, are the alternatives
> you're considering sufficient to change code like:
> 
>   ... x / 45 ...
> 
> into the loop form?  

Although I think this is somewhat unrelated, we actually already do this
(see value-prof.c).

> The style:
> 
>   a = x / N;
>   b = x % N;
> 
> is pretty common; recognizing that pattern and generating a loop, or libcall to
> __aeabi_divmod, seems like a good thing.  One could argue that __aeabi_divmod
> itself should be responsible for noticing divisors that more profitably use
> repeated subtraction than other methods, on CPUs that don't have a hardware
> divide instruction.

The problem is that "profitably" depends on how large x is (i.e., using
the loop is only profitable when x is small).  As mentioned, we already
do the transformation in profile-driven optimizer, when we have this
information.

> Do we need a divmod optab?  Or, should we expect that if we provide an
> appropriate pattern in the ARM back end, combine will be able to smush the
> division and remainder operations together?

I think it should be able to (IIRC, we do so on x86).

> I guess where I'm going with this line of thinking is to argue that the
> transformation to division is fine, and that we should treat that as canonical
> -- but that we need a way to change divisions into repeated divisions if
> appropriate.

Users usually use this pattern (division by repeated subtraction) only
if the result is small; so transforming to actual division makes the
code worse (and understandably, the user angry :-) Unless profiling is
used, compiler does not have enough information to determine which form
is better, so I think we should prefer to keep whatever form was
originally in the program.
Comment 62 Mark Mitchell 2008-12-11 00:42:09 UTC
I take Zdenek's point about the transformation from division to a loop being profitable only if x is small.  But, that might argue that if we see the loop, we still transform it into the division form -- but with a note that allows the value-profiling code to turn it back.  The advantage to that is we can then take advantage of the fact that it's division otherwise; for example, CSE it with a manually written division, or eliminate it if multiplied by the same value we divided by, etc.

There are a bunch of issues somewhat tangled up in here.  In any case, I am also concerned about the fact that if the user writes division and modulus manually, we're not taking advantage of the ABI divmod function.  IIUC, it sounds like we could fix that by having the ARM back end always use the divmod function as on x86.  Is that right?

-- Mark
Comment 63 Steven Bosscher 2008-12-11 07:03:59 UTC
Re. comment #62:
Transforming the code and adding notes to allow the compiler to undo the transformation is not an option with the available infrastructure in GCC.  You'd have to add some kind of note (something like REG_EQUAL) to GIMPLE, and then, maybe it is possible -- but still only when profile information is available (i.e. practically never for -Os or for the kernel).

I agree with Zdenek that the best solution for now, to fix this regression, is to honor the user's decision to use a loop.  For anything more fancy than that, patches welcome ;-)

You also have to keep in mind that the issue in this problem report is that GCC introduces a div.  That is the regression here, and the only thing we should be addressing in this PR, IMHO.

The issue is *not* that GCC could do better at recognizing situations where divmod could be used.  GCC has never been able to do that, so there is no regression in that regard.  If someone wishes to add some kind of idiom recognition for divmod to GCC, and/or optimization hints from GIMPLE to RTL via a GIMPLE-equivalent of REG_EQUAL, then he/she should open new PRs for that, marked "enhancement".
Comment 64 Mark Mitchell 2008-12-11 17:37:20 UTC
I agree that the most practical short-term solution is to avoid transforming the loop into a division.

However, I'm also interested in what we think the right long-term solution is.  I'm not convinced that our goal should be to preserve the form used by the user; rather, I'm inclined to say that our goal should be to (a) recognize the loop as division, and, (b) generate loops to implement division when profitable.  It's reasonable is to say that the form user by the user influences the decision about profitability made in (b).
Comment 65 Zdenek Dvorak 2008-12-12 20:34:09 UTC
Subject: Bug 32044

Author: rakdver
Date: Fri Dec 12 20:32:47 2008
New Revision: 142719

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=142719
Log:
	PR tree-optimization/32044
	* tree-scalar-evolution.h (expression_expensive_p): Declare.
	* tree-scalar-evolution.c (expression_expensive_p): New function.
	(scev_const_prop): Avoid introducing expensive expressions.
	* tree-ssa-loop-ivopts.c (may_eliminate_iv): Ditto.

	* gcc.dg/pr34027-1.c: Change outcome.
	* gcc.dg/tree-ssa/pr32044.c: New test.


Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr32044.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/pr34027-1.c
    trunk/gcc/tree-scalar-evolution.c
    trunk/gcc/tree-scalar-evolution.h
    trunk/gcc/tree-ssa-loop-ivopts.c

Comment 66 Zdenek Dvorak 2008-12-12 20:42:22 UTC
(In reply to comment #64)
> I agree that the most practical short-term solution is to avoid transforming
> the loop into a division.
> 
> However, I'm also interested in what we think the right long-term solution is. 
> I'm not convinced that our goal should be to preserve the form used by the
> user; rather, I'm inclined to say that our goal should be to (a) recognize the
> loop as division, and, (b) generate loops to implement division when
> profitable.  It's reasonable is to say that the form user by the user
> influences the decision about profitability made in (b).

Agreed; there are also other parts of tree->rtl expansion that could benefit from using the information about the distribution of the inputs, whether obtained from value profiling, VRP, scev analysis, guesses based on the structure of the program (as in this case), or possibly hints by the user (builtin_expect).  I do not have time to work on that now, though (I think it would make a rather nice gsoc project, or a student project in general).
Comment 67 Richard Biener 2009-01-24 10:19:39 UTC
GCC 4.3.3 is being released, adjusting target milestone.
Comment 68 Paolo Bonzini 2009-02-03 16:32:20 UTC
This was fixed on trunk (see comment #65).

Open another bug for long-term solutions, this is cluttering the 4.4 regressions. :-)
Comment 69 Richard Biener 2009-04-22 15:15:13 UTC
I am going to do a backport of this to the 4.3 branch.
Comment 70 Richard Biener 2009-05-26 10:17:34 UTC
Subject: Bug 32044

Author: rguenth
Date: Tue May 26 10:17:19 2009
New Revision: 147865

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147865
Log:
2009-05-26  Richard Guenther  <rguenther@suse.de>

	Backport from mainline
	2008-12-12  Zdenek Dvorak  <ook@ucw.cz>

	PR tree-optimization/32044
	* tree-scalar-evolution.h (expression_expensive_p): Declare.
	* tree-scalar-evolution.c (expression_expensive_p): New function.
	(scev_const_prop): Avoid introducing expensive expressions.
	* tree-ssa-loop-ivopts.c (may_eliminate_iv): Ditto.

	* gcc.dg/pr34027-1.c: Change outcome.
	* gcc.dg/tree-ssa/pr32044.c: New test.

Added:
    branches/gcc-4_3-branch/gcc/testsuite/gcc.dg/tree-ssa/pr32044.c
Modified:
    branches/gcc-4_3-branch/gcc/ChangeLog
    branches/gcc-4_3-branch/gcc/testsuite/ChangeLog
    branches/gcc-4_3-branch/gcc/testsuite/gcc.dg/pr34027-1.c
    branches/gcc-4_3-branch/gcc/tree-scalar-evolution.c
    branches/gcc-4_3-branch/gcc/tree-scalar-evolution.h
    branches/gcc-4_3-branch/gcc/tree-ssa-loop-ivopts.c

Comment 71 Richard Biener 2009-05-26 10:17:44 UTC
Fixed.