First, I am herewith re-affirming my formal request for Mr. Pinsk to refrain from having anything to do with my submissions.

Now to the heart of the matter: according to my (admittedly second-hand) reading (Fifth Edition of "C: A Reference Manual" by Samuel P. Harbison III & Guy L. Steele Jr.), GCC, by not providing a means to disable the use of libgcc (including __udivdi3), is not in strict conformance with the C standard for free-standing use through C99. __udivdi3 is a reserved identifier and hence non-conforming. The irony is that, besides being non-conforming and prejudicing free-standing applications that aim for maximum portability, it is highly counterproductive in its own right. Also, the forced and silent use of libgcc (ldd does not show it being used) violates one of the fundamental principles of both UNIX and C, namely that the user (certainly root) is to be in full and absolute command of the system, without hidden reinterpretation of his commands or MS-type questions.

As a practical matter, the use of __builtin_expect could be taken as a signal that only reordering of instructions (to avoid pipeline stalls and reloading of caches) is permitted in the marked unlikely cases. Any fundamental change, like exchanging a while loop and a subtraction for a non-hardware divide, should not occur. If anybody at GCC wants to know what others (including L. Torvalds and A. Morton) think, checking Google on udivdi3 might be instructive.

What follows are the results of tests using current versions of gcc-4.3 and 4.2.1. I believe the results speak for themselves. Besides the data for x86, I also have quite similar data for PowerPC G4, which I will make available as a follow-on.
Program:

#define NSEC_PER_SEC 1000000000UL

int rmg(void);

int main(void)
{
	/* int sec; */
	return rmg();
}

int rmg(void)
{
	static unsigned long long nsec = 0;
	static int sec = 0;

	while (sec < 1) {
		nsec++;
		while (__builtin_expect(nsec >= NSEC_PER_SEC, 0)) {
			nsec -= NSEC_PER_SEC;
			++sec;
		}
	}
	return sec;
}

gcc_43 -O0
-rwxr-xr-x 1 root root  8478 2007-05-22 08:23 rmgg_O0
-rw-r--r-- 1 root root  1238 2007-05-22 08:18 rmgg_O0.s
real	0m27.613s
user	0m27.607s
sys	0m0.003s

gcc_43 -O1
-rwxr-xr-x 1 root root 12586 2007-05-22 08:25 rmgg_O1
-rw-r--r-- 1 root root  1572 2007-05-22 08:25 rmgg_O1.s
real	0m12.776s
user	0m12.775s
sys	0m0.003s

gcc_43 -O2
-rwxr-xr-x 1 root root 12586 2007-05-22 08:27 rmgg_O2
-rw-r--r-- 1 root root  1874 2007-05-22 08:27 rmgg_O2.s
real	0m16.415s
user	0m16.414s
sys	0m0.004s

gcc_43 -Os
-rwxr-xr-x 1 root root 12586 2007-05-22 08:29 rmgg_Os
-rw-r--r-- 1 root root  1925 2007-05-22 08:29 rmgg_Os.s
real	2m8.817s
user	2m8.831s
sys	0m0.003s

Program: (same source as above)

gcc_42 -O0
-rwxr-xr-x 1 root root 8471 2007-05-21 16:46 rmgg_O0
-rw-r--r-- 1 root root 1236 2007-05-21 16:41 rmgg_O0.s
time ./rmgg_O0
real	0m27.678s
user	0m27.680s
sys	0m0.002s
Script done on Mon 21 May 2007 04:53:29 PM EDT

gcc_42 -O1
-rwxr-xr-x 1 root root 8471 2007-05-21 16:41 rmgg_O1
-rw-r--r-- 1 root root 1572 2007-05-22 09:39 rmgg_O1.s
Script started on Mon 21 May 2007 04:56:20 PM EDT
time ./rmgg_O1
real	0m12.771s
user	0m12.767s
sys	0m0.003s
Script done on Mon 21 May 2007 04:56:55 PM EDT

gcc_42 -O2
-rwxr-xr-x 1 root root 8471 2007-05-21 16:41 rmgg_O2
-rw-r--r-- 1 root root 1262 2007-05-21 17:41 rmgg_O2.s
Script started on Mon 21 May 2007 04:57:14 PM EDT
time ./rmgg_O2
real	0m12.532s
user	0m12.531s
sys	0m0.003s
Script done on Mon 21 May 2007 04:58:18 PM EDT

gcc -Os
-rwxr-xr-x 1 root root 8471 2007-05-21 16:41 rmgg_Os
-rw-r--r-- 1 root root 1017 2007-05-21 16:40 rmgg_Os.s
Script started on Mon 21 May 2007 04:58:30 PM EDT
time ./rmgg_O2
real	0m12.571s
user	0m12.562s
sys	0m0.004s
Script done on Mon 21 May 2007 04:59:11 PM EDT
Created attachment 13601 [details] *.s files

I believe that the *.s files in this case are superior to the *.i files.
Created attachment 13602 [details] *.s files for gcc-4.2

*.s files generated by gcc-4.2.1, as more responsive to the intent and superior in performance according to the selected option. -march=pentium3 makes no difference.
(In reply to comment #0)
> First I am herewith re-afirming my formal request for Mr. Pinsk to refrain
> from having anything to do with my submissions.

Please refrain from insulting others.

> According to my (admittedly second hand (Fifth Edition of "C A Reference
> Manual" by Samuel P. Harbison III & Guy L. Stelle Jr) reading; GCC, by not
> providing a means to disable the use of libgcc (including udivdi3) is not
> in strict conformance with the C standard for free-standing use through
> C99. __udivdi3 is a reserved identifier and hence non-conforming.

If you want to play the language lawyer, please read the C standard and tell me which clause this gcc behavior violates.

> As a practical matter the use of __builtin_expect could be taken as signal
> to allow only reordering of instructions (to avoid pipeline stalls and
> reloading of caches) are to be avoided in the marked unlikely cases. Any
> fundamental changes like exchanging a while and a subtraction for a
> non-hardware divide should no occur

The meaning of __builtin_expect is as defined in the documentation. This particular transformation has nothing to do with __builtin_expect; the transformation happens regardless of __builtin_expect.

scev_const_prop eliminates the loop by replacing the final value with an expression using a divide. Probably the right approach would be to prevent this from happening if the final value expression looks expensive (like a DImode divide on targets without a native DImode divide). The following patch avoids the situation (probably the condition check is too conservative, so this isn't fully cooked yet):

diff -r f01bb94713f4 gcc/tree-scalar-evolution.c
--- a/gcc/tree-scalar-evolution.c	Mon May 14 13:52:18 2007 +0000
+++ b/gcc/tree-scalar-evolution.c	Tue May 22 20:09:08 2007 +0000
@@ -252,6 +252,9 @@ 02110-1301, USA.  */
 #include "tree-pass.h"
 #include "flags.h"
 #include "params.h"
+/* For optab access.  */
+#include "expr.h"
+#include "optabs.h"
 
 static tree analyze_scalar_evolution_1 (struct loop *, tree, tree);
@@ -2942,6 +2945,7 @@ scev_const_prop (void)
   edge exit;
   tree def, rslt, ass, niter;
   block_stmt_iterator bsi;
+  optab op;
 
   /* If we do not know exact number of iterations of the loop, we cannot
      replace the final value.  */
@@ -2957,6 +2961,10 @@ scev_const_prop (void)
      the elimination of the final value may reveal.  Therefore, we now
      eliminate the final values of induction variables unconditionally.  */
   if (niter == chrec_dont_know)
     continue;
+  op = optab_for_tree_code (TREE_CODE (niter), TREE_TYPE (niter));
+  if (op->handlers[TYPE_MODE (TREE_TYPE (niter))].insn_code == CODE_FOR_nothing
+      && EDGE_FREQUENCY (exit) > 500)
+    continue;
 
   /* Ensure that it is possible to insert new statements somewhere.  */
> (probably the condition check is too conservative
> so this isn't fully cooked yet):

Way too conservative, because even if you don't have the opcode, sometimes code can be produced without calling the libcall. If you change the define to:

#define NSEC_PER_SEC 0x10000000UL

you will not get a call to __udivdi3; though there is still a divide in unsigned long long, the division can be done exactly (via shifting).
The bug report as stated is clearly incorrect. gcc is perfectly free to insert calls to __udivdi3. This is not forbidden by any part of the C language standards. To clarify for future readers, I gather that the actual problem here is that the code is written to assume that the quotient is small, and that it will be more efficient to use a loop than to actually divide. So this is an optimization issue. Normally eliminating loops is a good idea. In this case it may not be. It would of course be easy to prevent the optimization by declaring nsec to be volatile. The question is whether the compiler can reasonably determine that the optimization is inappropriate in this particular case.
I did try changing #define 1000000000UL to its rightful hexadecimal value #define 0x3b9aca00UL. The results are:

	.file	"rmgg.c"
	.globl	__udivdi3
	.text
	.p2align 4,,15
.globl rmg
	.type	rmg, @function
rmg:
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%edi
	movl	$-1000000000, %edi
	pushl	%ebx
	subl	$48, %esp
	movl	nsec.1646, %eax
	movl	nsec.1646+4, %edx
	movl	%eax, -16(%ebp)
	movl	sec.1647, %eax
	movl	%edx, -12(%ebp)
	testl	%eax, %eax
	jg	.L10
	.p2align 4,,15
.L5:
	movl	-16(%ebp), %edx
	movl	-12(%ebp), %ecx
	addl	$1, %edx
	adcl	$0, %ecx
	cmpl	$0, %ecx
	jbe	.L11
.L7:
	addl	$-1000000000, %edx
	adcl	$-1, %ecx
	incl	%eax
	addl	$-999999999, -16(%ebp)
	movl	%edx, -32(%ebp)
	movl	$1000000000, %edx
	adcl	$-1, -12(%ebp)
	movl	%ecx, -28(%ebp)
	movl	%edx, 8(%esp)
	movl	-12(%ebp), %ecx
	movl	-16(%ebp), %edx
	movl	%eax, -20(%ebp)
	xorl	%eax, %eax
	movl	%eax, 12(%esp)
	movl	%ecx, 4(%esp)
	movl	%edx, (%esp)
	call	__udivdi3
	imull	$-1000000000, %edx, %ecx
	movl	%eax, %ebx
	subl	%eax, %ecx
	mull	%edi
	addl	%ecx, %edx
	movl	%eax, -40(%ebp)
	movl	-20(%ebp), %eax
	movl	%edx, -36(%ebp)
	movl	-40(%ebp), %edx
	addl	-32(%ebp), %edx
	movl	-36(%ebp), %ecx
	adcl	-28(%ebp), %ecx
	addl	%ebx, %eax
	movl	%edx, -16(%ebp)
	movl	%ecx, -12(%ebp)
	.p2align 4,,15
.L12:
	testl	%eax, %eax
	jle	.L5
.L10:
	movl	-16(%ebp), %edx
	movl	-12(%ebp), %ecx
	movl	%eax, sec.1647
	movl	%edx, nsec.1646
	movl	%ecx, nsec.1646+4
	addl	$48, %esp
	popl	%ebx
	popl	%edi
	popl	%ebp
	ret
	.p2align 4,,7
.L11:
	cmpl	$999999999, %edx
	ja	.L7
	movl	%edx, -16(%ebp)
	movl	%ecx, -12(%ebp)
	jmp	.L12
	.size	rmg, .-rmg
	.p2align 4,,15
.globl main
	.type	main, @function
main:
	leal	4(%esp), %ecx
	andl	$-16, %esp
	pushl	-4(%ecx)
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%ecx
	subl	$4, %esp
	call	rmg
	popl	%ecx
	popl	%ecx
	popl	%ebp
	leal	-4(%ecx), %esp
	ret
	.size	main, .-main
	.local	sec.1647
	.comm	sec.1647,4,4
	.local	nsec.1646
	.comm	nsec.1646,8,8
	.ident	"GCC: (GNU) 4.3.0 20070522 (experimental)"
	.section	.note.GNU-stack,"",@progbits

I will try both Mr. Taylor's volatile suggestion, and if I can retrieve Mr. Park's proposed patch in ASCII I will try that one too. Thanks.
Thank you, Mr. Taylor; your suggestion to use volatile certainly works in this drastically reduced test case. Whether it will work when nsec is part of a kernel structure I will leave to the experts. I certainly know better than to argue with you, who wrote uucp and have been active on gcc for probably 15 years or more. The *.s file and the results of time follow (both obtained with -O1):

	.file	"rmgg.c"
	.text
	.p2align 4,,15
.globl rmg
	.type	rmg, @function
rmg:
	movl	sec.1647, %ecx
	pushl	%ebp
	movl	%esp, %ebp
	testl	%ecx, %ecx
	jg	.L18
.L6:
	movl	nsec.1646, %eax
	movl	nsec.1646+4, %edx
	addl	$1, %eax
	adcl	$0, %edx
	movl	%eax, nsec.1646
	movl	%edx, nsec.1646+4
	movl	nsec.1646, %eax
	movl	nsec.1646+4, %edx
	cmpl	$0, %edx
	jbe	.L15
.L13:
	movl	nsec.1646, %eax
	movl	nsec.1646+4, %edx
	addl	$-1000000000, %eax
	adcl	$-1, %edx
	incl	%ecx
	movl	%eax, nsec.1646
	movl	%edx, nsec.1646+4
	movl	nsec.1646, %eax
	movl	nsec.1646+4, %edx
	cmpl	$0, %edx
	ja	.L13
	.p2align 4,,15
.L15:
	cmpl	$999999999, %eax
	ja	.L13
	testl	%ecx, %ecx
	jle	.L6
.L18:
	popl	%ebp
	movl	%ecx, %eax
	movl	%ecx, sec.1647
	ret
	.size	rmg, .-rmg
	.p2align 4,,15
.globl main
	.type	main, @function
main:
	leal	4(%esp), %ecx
	andl	$-16, %esp
	pushl	-4(%ecx)
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%ecx
	call	rmg
	popl	%ecx
	popl	%ebp
	leal	-4(%ecx), %esp
	ret
	.size	main, .-main
	.local	sec.1647
	.comm	sec.1647,4,4
	.local	nsec.1646
	.comm	nsec.1646,8,8
	.ident	"GCC: (GNU) 4.3.0 20070522 (experimental)"
	.section	.note.GNU-stack,"",@progbits

real	0m16.433s
user	0m16.425s
sys	0m0.004s
This particular case is indeed an optimization issue and __udivdi3 can be avoided by using volatile as stated and verified.
(In reply to comment #3)
> The meaning of __builtin_expect is as defined in the documentation.
> This particular transformation has nothing to do with __builtin_expect.
> The transformation happens regardless of __builtin_expect.
>
> scev_const_prop eliminates the loop by replacing
> the final value with an expression using divide.
> Probably the right approach would be to prevent this from happening
> if the final value expression looks expensive
> (like DImode divide on targets without native DImode divide).

Isn't there a way for __builtin_expect to modify this behaviour? After all, it is telling us that the loop is cheap. And the difference in computation time is not trivial at all.
Mr. Guenther! The volatile fix would be fine, but (at least for me) does not work with the kernel. There is that little message:

kernel/time.c:479: warning: passing argument 3 of 'div_long_rem_signed' discards qualifiers from pointer target type

and others like it, and __udivdi3 reappears.

Mr. Park! The patch you kindly included in comment 3 presented two difficulties:

1. I could not extricate it cleanly enough from the HTML encoding apparently standard with Bugzilla (this is a Mozilla product).

2. After some editing, patch accepted just one hunk, and upon checking it turned out that the svn-derived tree-scalar-evolution.c did not match the enclosing lines around the lines to be added. I added those lines by hand (possibly imperfectly, even on careful checking); the file compiled OK, but running the gcc check sequence I got a stream of errors. These errors disappeared on using a sequestered unpatched copy of the file. Hence, __udivdi3 reappeared.

If you see fit for me to test this not only on the reduced test case but on the actual kernel, I suggest sending me an updated patch as a text attachment to my email. Thanks to all for trying to help.
Mr. Ibanez! Thank you (muchas gracias) for looking at the matter from a user's point of view and considering my arguments concerning __builtin_expect. You seem to be the first to look at the timings and the amount of code generated. If you are interested, I have equivalent data taken on a Mac with dual G4s. I did not send it so far because, until you intervened, I got mostly legalistic arguments and proposed fixes that do not solve the real problem of avoiding both __udivdi3 and, more importantly, libgcc.
So this is now an enhancement request for sccp to honor loop roll count or basic-block frequency and the cost of the replacement. Note the loop appears to be peeled twice before sccp already, but peeling doesn't decay probabilities further. Testcase:

int rmg(unsigned long long nsec)
{
  int sec = 0;
  nsec++;
  while (__builtin_expect(nsec >= 1000000000UL, 0))
    {
      nsec -= 1000000000UL;
      ++sec;
    }
  return sec;
}

Note this can be worked around with -fno-tree-scev-cprop as well.
Mr. Guenther! Thank you (herzlichen Dank) for the information about the hopefully disabling flag. If that information had been posted after my initial intervention, we could have saved a lot of bandwidth and storage space. I realize libgcc is one of the areas you are responsible for, and libgcc definitely fills a need for a hosted implementation. Also, (at least for me) the optimization is strictly an internal matter for the gcc community, and I have nothing to contribute in this regard. To sum up: as a user, and in the UNIX spirit (I started with the 7th edition), I just want freedom in choosing the facilities and features gcc has to offer. I hope that this or similar flags provide me with the capability to banish libgcc and similar facilities from my programs under the C free-standing specification.
(In reply to comment #13) > > To sum up; as a user, and, in the UNIX spirit (I started with the 7th edition), > I just want freedom in choosing the facilities and features gcc has to offer. I > hope that this or similar flags provide me with the capability to banish libgcc > and similar facilities from my programs under the C free-standing > specification. > The flag just disables an optimisation. If you want to disable optimisations just use -O0. On the other hand, shouldn't -ffreestanding prevent udivdi3 ? What about -fno-builtin-udivdi3 ?
(In reply to comment #14) > > The flag just disables an optimisation. If you want to disable optimisations > just use -O0. On the other hand, shouldn't -ffreestanding prevent udivdi3 ? > What about -fno-builtin-udivdi3 ? > OK. I found the answer myself: PR16470. So if you want a freestanding environment, use static libgcc.
(In reply to comment #12)
> So this is now an enhancement request for sccp to honor loop roll count or
> basic-block frequency and cost of the replacement.

We used to take the cost of the replacement into account. It caused so many missed-optimization PRs that I decided to just disable it. The main problem is that while theoretically you can determine whether the replacement is more costly than performing the computation in the loop (although even this is nontrivial in practice), it is very difficult to estimate the gains from enabling further optimizations.

One possible solution would be to annotate the division by the expected value of the result. Division expanders then may decide whether to expand to a machine instruction/libcall, or to check for small values of the result in if-guards first.
*** Bug 31990 has been marked as a duplicate of this bug. ***
(In reply to comment #4)
> > (probably the condition check is too conservative
> > so this isn't fully cooked yet):
>
> Way too conservative because even if you don't have the opcode, sometimes
> code can be produced without calling the libcall.
> If you change the define to:
> #define NSEC_PER_SEC 0x10000000UL
>
> You will not get a call to __udivdi3, though there is still a divide in
> unsigned long long, the division can be done exactly (via shifting).

Just to understand the issue better: which pass/function is responsible for detecting that the division can be done exactly and replacing it by a shift? Can't this be checked when the result is computed and moved outside the loop?

By the way, __builtin_expect generates a probability of >90% of exiting the loop. Shouldn't that be taken into account when moving anything outside the loop?
(In reply to comment #16)
> One possible solution would be to annotate the division by the expected value
> of the result. Division expanders then may decide whether to expand to machine
> instruction/libcall or to check for small values of the result in if-guards
> first.

Well, that won't achieve at all what the user wants, that is, to not get a call to __udivdi3.

There are 2 issues here:

1) GCC replaces the loop by a division. This may or may not be more efficient. In this case, it is not. In many other cases, it is.

2) The user does not want a call to __udivdi3.

If I understand correctly, you need __udivdi3 to do the division, am I wrong? Of course, you could "inline" it or link statically to avoid libgcc. The former will pessimize the code in most cases, and the latter can be done by the user. Tricky. Is it possible to fix (2) without fixing (1)?
Subject: Re: [4.3 regression] udivdi3 counterproductive, unwarranted use

> ------- Comment #19 from manu at gcc dot gnu dot org 2007-11-08 14:22 -------
> (In reply to comment #16)
> > One possible solution would be to annotate the division by the expected value
> > of the result. Division expanders then may decide whether to expand to machine
> > instruction/libcall or to check for small values of the result in if-guards
> > first.
>
> Well, that won't achieve at all what the user wants, that is, to not get a
> call to __udivdi3.

However, it would avoid calling the division most of the time, i.e., for performance this would be a possible solution.

> There are 2 issues here:
>
> 1) GCC replaces the loop by division. This may or may not be more efficient.
> In this case, it is not. In many other cases, it is.

The code performing the replacement should probably check what the expected # of iterations of the loop is, and compute which way is more efficient. However, it is a bit difficult to get this to work correctly.

> 2) The user does not want a call to __udivdi3.

There is nothing wrong with calling __udivdi3, once the division is introduced.
Let's leave the right/wrong discussion and look at it more pragmatically: could gcc get some kind of --expensive-libgcc flag that tells gcc that libgcc calls are a bit more expensive than usual and should be avoided?
Subject: Re: [4.3 regression] udivdi3 counterproductive, unwarranted use

On Fri, 9 Nov 2007, bunk at stusta dot de wrote:

> ------- Comment #21 from bunk at stusta dot de 2007-11-09 16:26 -------
> Let's leave the right/wrong discussion and look at it more pragmatically:
>
> Could gcc get some kind of --expensive-libgcc flag that tells gcc that libgcc
> calls are a bit more expensive than usual and should be avoided?

While this is technically possible, we already face stability and maintainability problems in other areas where we do so (for example, the conditionally available C99 math has in the past led to many internal compiler errors). So I would prefer not to go this route. Which does _not_ mean that we absolutely will not fix this issue and try to avoid creating final value replacement that involves a division/modulus. But as there exist multiple work-arounds for this issue, a nice and complete solution is certainly not very high priority.

Richard.
We need a way to globally prevent it in the kernel or it will be a repeating source of problems there. Is -fno-tree-scev-cprop a reasonable (and not too expensive) workaround for the Linux kernel?
Subject: Re: [4.3 regression] udivdi3 counterproductive, unwarranted use On Fri, 9 Nov 2007, bunk at stusta dot de wrote: > ------- Comment #23 from bunk at stusta dot de 2007-11-09 17:09 ------- > We need a way to globally prevent it in the kernel or it will be a repeating > source of problems there. > > Is -fno-tree-scev-cprop a reasonable (and not too expensive) workaround for the > Linux kernel? Yes, as is using volatile on this particular loop variable. Richard.
Adding workarounds in all affected places in the kernel would be horribly fragile, but I've confirmed your -fno-tree-scev-cprop suggestion works around it and I'll submit a patch to the Linux kernel to use it with gcc 4.3.
*** Bug 34167 has been marked as a duplicate of this bug. ***
Ian, Diego, Richard G. --

I don't understand the state of this issue. Do we still think we have a problem that needs fixing in GCC? Has the Linux kernel worked around the issue? I'm interested both in knowing whether or not the compiler has a bug and, if it does, whether that affects current versions of the kernel.

Thanks,

-- Mark
Mark, Linux 2.6 tree as of this minute doesn't compile with gcc 4.3 trunk. You need to add -fno-tree-scev-cprop to the KBUILD_CFLAGS, but this is not upstreamed yet and I am not sure if it'll be accepted in case it results in a performance regression. Thanks.
This is IMHO at most a QOI issue - at Novell we mark timespec_add_ns's u64 parameter as volatile to work around this issue. I expect upstream to adopt a workaround as well. Note that some targets in the kernel have parts of libgcc implemented, but i?86 misses at least __udivdi3.
Subject: Re: [4.3 regression] udivdi3 counterproductive, unwarranted use rguenth at gcc dot gnu dot org wrote: > ------- Comment #29 from rguenth at gcc dot gnu dot org 2007-11-27 09:43 ------- > This is IMHO at most a QOI issue - at Novell we mark timespec_add_ns's u64 > parameter as volatile to work around this issue. I expect upstream to adopt > a workaround as well. Note that some targets in the kernel have parts of > libgcc implemented, but i?86 misses at least __udivdi3. I am not a kernel developer, but my feeling as a GCC developer is that you must provide the entry points in libgcc whenever you are linking code compiled with GCC. In other words, that GCC should be free to use functions from libgcc as it pleases. Of course, it might be a GCC optimization bug to call __udivdi3; perhaps it could generate more efficient code that doesn't call that function. Do others agree? That this is at most an optimization issue, but not a correctness issue?
(In reply to comment #29) > This is IMHO at most a QOI issue - at Novell we mark timespec_add_ns's u64 > parameter as volatile to work around this issue. I expect upstream to adopt > a workaround as well. Note that some targets in the kernel have parts of > libgcc implemented, but i?86 misses at least __udivdi3. The problem is that this is very fragile - imagine what might happen when a frequently used struct member gets changed from int to u64. After all, when looking at current practice, the kernel will support being built with gcc 4.3 until 2013 or 2014.
(In reply to comment #30) >... > I am not a kernel developer, but my feeling as a GCC developer is that > you must provide the entry points in libgcc whenever you are linking > code compiled with GCC. In other words, that GCC should be free to use > functions from libgcc as it pleases. >... Status quo up to gcc 4.2 was that gcc did not use them in the kernel on i386. The root of the problem is that your feeling and what the kernel currently implements do not match, and some flamewars^Wdiscussions might be required for getting this sorted out in either gcc or the kernel... Even if this specific issue in the kernel would turn out as a misoptimization, the general problem would still remain waiting to pop up at some later time at a different place.
Subject: Re: [4.3 regression] udivdi3 counterproductive, unwarranted use > > ------- Comment #29 from rguenth at gcc dot gnu dot org 2007-11-27 09:43 ------- > > This is IMHO at most a QOI issue - at Novell we mark timespec_add_ns's u64 > > parameter as volatile to work around this issue. I expect upstream to adopt > > a workaround as well. Note that some targets in the kernel have parts of > > libgcc implemented, but i?86 misses at least __udivdi3. > > I am not a kernel developer, but my feeling as a GCC developer is that > you must provide the entry points in libgcc whenever you are linking > code compiled with GCC. In other words, that GCC should be free to use > functions from libgcc as it pleases. I would agree with this. Even if we somehow prevent gcc from using __udivdi3 in this particular case, there is no guarantee that gcc will not decide to use this or some other function from the libgcc runtime at some other place (unless we decide to have some kind of -dont-use-libgcc flag, which would however be a bit difficult to implement, and it would make us fail to compile some valid codes). > Of course, it might be a GCC optimization bug to call __udivdi3; perhaps > it could generate more efficient code that doesn't call that function. Definitely. It is particularly annoying that this is the case where the programmer went through the pain of avoiding the usage of division, just to have it "optimized" back in.
I certainly agree with Mark here; we can talk about whether on a particular arch with particular options this transformation is a win or not, but when the kernel people decide to implement libgcc on their own and intentionally leave parts of it out, they need to handle the consequences of that.

The addition of

  asm ("" : "+r" (nsec));

to the loop I suggested, perhaps hidden in some macro, will act as an optimization barrier against it and won't (unlike volatile) cost anything.

Even when the current scev-cprop transformation might not be a win in the case of a __builtin_expect-unlikely loop, the compiler e.g. could decide to transform this into:

  if (__builtin_expect (nsec < NSEC_PER_SEC, 1))
    sec += nsec / NSEC_PER_SEC, nsec %= NSEC_PER_SEC;

or e.g.

  if (__builtin_expect (nsec < NSEC_PER_SEC * 10, 1))
    while (__builtin_expect (nsec > NSEC_PER_SEC, 0))
      nsec -= NSEC_PER_SEC, sec++;
  else
    sec += nsec / NSEC_PER_SEC, nsec %= NSEC_PER_SEC;

which would still reference __udivdi3, but wouldn't be (measurably) slower in case nsec is really small, yet would be much faster if nsec happens to be big.
Subject: Re: [4.3 regression] udivdi3 counterproductive, unwarranted use bunk at stusta dot de wrote: > Even if this specific issue in the kernel would turn out as a misoptimization, > the general problem would still remain waiting to pop up at some later time at > a different place. Indeed. However, I think that the kernel developers should be aware that GCC is not designed to avoid libgcc functions. GCC fundamentally assumes that it may call various functions from libgcc as it pleases. (Sometimes it may do so for good reasons, sometimes it may be that it does it suboptimally.) Because there's nothing in GCC to keep it from "randomly" calling libgcc functions, if the kernel wants to be robust against different versions of GCC, it should provide definitions of these functions. There's no easy way to make GCC avoid these functions. That's not meant to defend GCC calling this particular function in this particular circumstance, of course.
*** Bug 33419 has been marked as a duplicate of this bug. ***
Downgrading to P4. We seem to have consensus that this is not a GCC wrong-code bug. I can't tell whether we think this is an optimization bug or not; if we do think so, please bump this back to P3, with a test case showing the poor optimization choice. I will review and upgrade to P2.
(In reply to comment #37) > Downgrading to P4. We seem to have consensus that this is [not] a GCC wrong-code > bug. Yeah, it seems to be a mistaken expectation of -ffreestanding not to call libgcc. Maybe a new option to that effect would help?
Subject: Re: [4.3 regression] udivdi3 counterproductive, unwarranted use fche at redhat dot com wrote: >> Downgrading to P4. We seem to have consensus that this is [not] a GCC wrong-code >> bug. > > Yeah, it seems to be a mistaken expectation of -ffreestanding not to > call libgcc. Maybe a new option to that effect would help? I don't think there's a practical, platform-independent way for GCC to avoid calling libgcc. On some platforms, it has to do that for pretty basic operations. I think we just need to accept that the libgcc API is part of what's required by the compiler; you could imagine that it is "as-if" weak definitions of these functions were emitted in every assembly file by the compiler itself.
This bug causes the Linux kernel to fail to compile, so I think it must be fixed before 4.3 is released.
(In reply to comment #40) > This bug cause linux kernel unable to compile. So I think it must be fixed > before 4.3 is released Yes and there is a known workaround, see comment #28
(In reply to comment #40)
> This bug cause linux kernel unable to compile. So I think it must be fixed
> before 4.3 is released

Except there are two parts to this bug: first, it is a bug in the Linux kernel for not including this function; second, the GCC bug is that it produces less than optimal code, but that should not be a blocker for GCC 4.3.0.
Update milestone after 4.3.0 release.
4.3.1 is being released, adjusting target milestone.
4.3.2 is released, changing milestones to 4.3.3.
*** Bug 38453 has been marked as a duplicate of this bug. ***
Re. comment #37: Mark, bug 38453 has a simple test case that shows the poor optimization choice for ARM-linux. Also, there are now 4 bugs closed as duplicates of this one, so many users run into this and consider it important enough an issue to file a bug report about it. Re. comment #16: Zdenek, do you remember which revision / patch removed the cost check? And do you recall (or can you recover) some of the missed-optimization bug report numbers? I tried to find them with a Bugzilla query, but failed. With the removal of the cost check, we've gone from missed-optimization bugs to too-aggressive-optimization bugs that even require hacks/workarounds from our users. To me, it seems we have made the wrong trade-off, then. In my opinion, this *is* an optimizer bug, and, actually, a much more important bug than some of the regressions that are P2/P3 now for gcc 4.3 and gcc 4.4.
To P3 per comment #37.
P2, similar to all other missed-optimization regressions.
The cost check for final value replacement was removed in revision 122896 (from bug 33419, see http://gcc.gnu.org/viewcvs?view=rev&revision=122896)
I'd like to raise here that bug 38453 which was marked as a duplicate of this is in fact an example not only of wrong optimisation, but of missed optimisation as well. The compiler emits the loop in the bug report *and* then emits the umod call despite already having the same answer in a register. If the compiler had only emitted the umod call then at least it wouldn't have done the work twice. I think it's critical that there be costs applied to these builtins -- especially for platforms where they are a full function call away rather than being instructions.
Steven, thanks for your comments on this issue. I agree with Richard G. that this is P2 -- but I also agree with you that it's a serious issue.

Unfortunately, Bugzilla doesn't want to show me the test case attached to PR38453, so I can't look at that. Is the issue there that we don't take advantage of the ARM __aeabi_uidivmod function that gives both the quotient and modulus? Or is it that we're not giving CSE enough information to figure out that the value we have is already available?

Zdenek, it would certainly be helpful to have the original justification for your change available. It does sound a bit like we've traded one set of problems for another, and if those are our only practical options, we need to decide which set of problems is worse.

Thanks,

-- Mark
Subject: Re: [4.3/4.4 regression] udivdi3 counterproductive, unwarranted use

Hi,

> Re. comment #16: Zdenek, do you remember which revision / patch removed
> the cost check?

rev. 122896

> And do you recall (or can you recover) some of the missed-optimization
> bug report numbers? I tried to find them with a Bugzilla query, but
> failed.

No; the PRs addressed by that patch seem unrelated.
Subject: Re: [4.3/4.4 Regression] final value replacement too aggressive for e.g. targets with no native div/mod insns

> Zdenek, it would certainly be helpful to have the original justification
> for your change available.

Unfortunately, I do not remember the exact problems (it is almost two years since that change was made). Adding back expression_expensive_p in some conservative form (e.g., checking that the computation involves division, perhaps excluding divisions by a power of two) seems like the easiest way to fix this problem now.
// This is the test case from PR38453.
// See comment #0 of that bug for further information:
// http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38453#c0

typedef struct {
    int lc;
    int pb;
} bar;

void foo(bar *propsRes, const unsigned char *propsData)
{
    unsigned char prop0;

    prop0 = propsData[0];
    while (prop0 >= 45) {
        propsRes->pb++;
        prop0 -= 45;
    }
    propsRes->lc = prop0;
}

int main(int argc, char **argv)
{
    bar propsRes;
    unsigned char propsData[1024];
    foo(&propsRes, propsData);
}
Re. comment #52:

I've pasted the test case in the audit trail here as plain text -- it's pretty small and it shows the problem nicely. The issue is that with -Os, on all targets, the line

  propsRes->lc = prop0;

is replaced with:

  propsRes->lc = (int) (int) (prop0.24 % 45); // with prop0.24 = propsData[0];

so there is no div/mod instruction in the original code, but one appears as a result of the final value replacement optimization. With -O2 there is also a div instruction, but I am not sure where it comes from (I suspect loop header copying, but it's inconsequential for this issue anyway).

The problem is not that we don't give CSE enough information. The problem is reported for a GCC 4.3 based compiler, which still has libcall notes, so at least in theory CSE in GCC 4.3 should have all the information available to optimize away the mod instruction. Apparently the result is just not available.

The bottom line, then, is still that final value replacement has introduced an operation at the tree level that doesn't have a matching RTL instruction -- in this case a mod that makes GCC produce a libcall to umodsi.

I think Zdenek is right in comment #54: we should reincarnate expression_expensive_p in some form that takes into account (or makes a reasonable guess about) which instructions are particularly expensive. Apparently, the old cost check was too conservative (it tested the expression cost against spill cost, ignoring further optimization opportunities), so we need to test for something else than before. Maybe we should not replace if the final value is not a constant? Or if it is an expression involving div/mod? Can we somehow test how many insns the final value replacement computation would take (maybe re-use the inliner heuristics)?
Subject: Re: [4.3/4.4 Regression] final value replacement too aggressive for e.g. targets with no native div/mod insns

> I think Zdenek is right in comment #54: We should reincarnate
> expression_expensive_p in some form such that it takes into account (or
> makes a reasonable guess about) which instructions are particularly
> expensive. Apparently, the old cost check was too conservative (it tested
> the expression cost against spill cost, ignoring further optimization
> opportunities). So we need to test for something else than before.
>
> Maybe we should not replace if the final value is not a constant?

That is way too conservative (replacing final values like, say, "n + 1" should usually be profitable).

> Or if it is an expression involving div/mod?

That is what I currently plan to do.

> Can we test somehow how many insns the final value replacement computation
> would take

I think that has the same problems as the original solution (it is not possible to find a bound that would both prevent expensive things from being moved out and allow the non-expensive ones, as the exact costs are target-specific and we are not estimating them precisely enough).

> (maybe re-use the inliner heuristics)?

That would be difficult, as estimate_num_insns works on gimple statements, not on expressions.
(In reply to comment #56)
> ... so there is no div/mod instruction in the original code, but one
> appears as a result of the final value replacement optimization. With -O2
> there is also a div instruction, but I am not sure where it comes from (I
> suspect loop header copying, but it's inconsequential for this issue
> anyway).

Actually, the div instruction also comes from the final value replacement. We eliminate the loop completely and replace it with the following:

  D.1286 = prop0 + 211;
  prop0 = D.1286 % 45;
  propsRes->pb = [plus_expr] (propsRes->pb + 1) + (int) (D.1286 / 45);
Steven, Zdenek --

Is there any way to teach the compiler to use the ARM __aeabi_divmod routine, which provides both the quotient and remainder? At least with -Os, that is probably optimal. In other words, once we get to:

  prop0 = D.1286 % 45;
  propsRes->pb = [plus_expr] (propsRes->pb + 1) + (int) (D.1286 / 45);

can we notice that we have both "D.1286 % 45" and "D.1286 / 45" and use the divmod routine?

Furthermore, if we want to generate the loop instead, are the alternatives you're considering sufficient to change code like:

  ... x / 45 ...

into the loop form? In other words, rather than just considering this a problem of transforming the hand-written loop into a division operation, shouldn't we also be considering the problem of transforming division into the loop form? The style:

  a = x / N;
  b = x % N;

is pretty common; recognizing that pattern and generating a loop, or a libcall to __aeabi_divmod, seems like a good thing. One could argue that __aeabi_divmod itself should be responsible for noticing divisors that more profitably use repeated subtraction than other methods, on CPUs that don't have a hardware divide instruction.

Do we need a divmod optab? Or should we expect that if we provide an appropriate pattern in the ARM back end, combine will be able to smush the division and remainder operations together?

I guess where I'm going with this line of thinking is to argue that the transformation to division is fine, and that we should treat that as canonical -- but that we need a way to change divisions back into repeated subtraction where appropriate.

-- Mark
IMHO, the transformation to division is not fine. I would argue this is the core issue in this problem report.

You are right that a combination of div and mod is quite common in real-world code. Right now, GCC can catch that when the port emits a divmod libcall by default instead of just a div and/or a mod. However, I consider this a separate issue. GCC currently does not recognize div/mod or mod/div as patterns, because at the tree level this would show up as two separate expressions (possibly with statements between the div and mod expressions), and expand works one statement at a time.

Some targets (e.g. x86) call a divmod libcall by default for a div or a mod instruction. If there is a div/mod or mod/div sequence in the code, GCC optimizes away one of the libcalls via the REG_EQUAL notes, IIRC.

But our concern now should be to avoid the situation where GCC "invents" a div or mod instruction in the first place.
Subject: Re: [4.3/4.4 Regression] final value replacement too aggressive for e.g. targets with no native div/mod insns

> Furthermore, if we want to generate the loop instead, are the alternatives
> you're considering sufficient to change code like:
>
>   ... x / 45 ...
>
> into the loop form?

Although I think this is somewhat unrelated, we actually already do this (see value-prof.c).

> The style:
>
>   a = x / N;
>   b = x % N;
>
> is pretty common; recognizing that pattern and generating a loop, or
> libcall to __aeabi_divmod, seems like a good thing. One could argue that
> __aeabi_divmod itself should be responsible for noticing divisors that
> more profitably use repeated subtraction than other methods, on CPUs that
> don't have a hardware divide instruction.

The problem is that "profitably" depends on how large x is (i.e., using the loop is only profitable when x is small). As mentioned, we already do this transformation in the profile-driven optimizer, when we have that information.

> Do we need a divmod optab? Or, should we expect that if we provide an
> appropriate pattern in the ARM back end, combine will be able to smush the
> division and remainder operations together?

I think it should be able to (IIRC, we do so on x86).

> I guess where I'm going with this line of thinking is to argue that the
> transformation to division is fine, and that we should treat that as
> canonical -- but that we need a way to change divisions into repeated
> divisions if appropriate.

Users usually use this pattern (division by repeated subtraction) only if the result is small, so transforming to actual division makes the code worse (and, understandably, the user angry :-). Unless profiling is used, the compiler does not have enough information to determine which form is better, so I think we should prefer to keep whatever form was originally in the program.
I take Zdenek's point about the transformation from division to a loop being profitable only if x is small. But, that might argue that if we see the loop, we still transform it into the division form -- but with a note that allows the value-profiling code to turn it back. The advantage to that is we can then take advantage of the fact that it's division otherwise; for example, CSE it with a manually written division, or eliminate it if multiplied by the same value we divided by, etc. There are a bunch of issues somewhat tangled up in here. In any case, I am also concerned about the fact that if the user writes division and modulus manually, we're not taking advantage of the ABI divmod function. IIUC, it sounds like we could fix that by having the ARM back end always use the divmod function as on x86. Is that right? -- Mark
Re. comment #62:

Transforming the code and adding notes to allow the compiler to undo the transformation is not an option with the available infrastructure in GCC. You'd have to add some kind of note (something like REG_EQUAL) to GIMPLE, and then maybe it is possible -- but still only when profile information is available (i.e. practically never for -Os or for the kernel).

I agree with Zdenek that the best solution for now, to fix this regression, is to honor the user's decision to use a loop. For anything more fancy than that, patches welcome ;-)

You also have to keep in mind that the issue in this problem report is that GCC introduces a div. That is the regression here, and the only thing we should be addressing in this PR, IMHO. The issue is *not* that GCC could do better at recognizing situations where divmod could be used. GCC has never been able to do that, so there is no regression in that regard.

If someone wishes to add some kind of idiom recognition for divmod to GCC, and/or optimization hints from GIMPLE to RTL via a GIMPLE-equivalent of REG_EQUAL, then he/she should open new PRs for that, marked "enhancement".
I agree that the most practical short-term solution is to avoid transforming the loop into a division.

However, I'm also interested in what we think the right long-term solution is. I'm not convinced that our goal should be to preserve the form used by the user; rather, I'm inclined to say that our goal should be to (a) recognize the loop as division, and (b) generate loops to implement division when profitable. It's reasonable to say that the form used by the user influences the decision about profitability made in (b).
Subject: Bug 32044

Author: rakdver
Date: Fri Dec 12 20:32:47 2008
New Revision: 142719

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=142719

Log:
	PR tree-optimization/32044
	* tree-scalar-evolution.h (expression_expensive_p): Declare.
	* tree-scalar-evolution.c (expression_expensive_p): New function.
	(scev_const_prop): Avoid introducing expensive expressions.
	* tree-ssa-loop-ivopts.c (may_eliminate_iv): Ditto.

	* gcc.dg/pr34027-1.c: Change outcome.
	* gcc.dg/tree-ssa/pr32044.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr32044.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/pr34027-1.c
    trunk/gcc/tree-scalar-evolution.c
    trunk/gcc/tree-scalar-evolution.h
    trunk/gcc/tree-ssa-loop-ivopts.c
(In reply to comment #64)
> I agree that the most practical short-term solution is to avoid
> transforming the loop into a division.
>
> However, I'm also interested in what we think the right long-term solution
> is. I'm not convinced that our goal should be to preserve the form used by
> the user; rather, I'm inclined to say that our goal should be to (a)
> recognize the loop as division, and, (b) generate loops to implement
> division when profitable. It's reasonable is to say that the form user by
> the user influences the decision about profitability made in (b).

Agreed; there are also other parts of tree->rtl expansion that could benefit from using information about the distribution of the inputs, whether obtained from value profiling, VRP, scev analysis, guesses based on the structure of the program (as in this case), or possibly hints by the user (builtin_expect). I do not have time to work on that now, though (I think it would make a rather nice GSoC project, or a student project in general).
GCC 4.3.3 is being released, adjusting target milestone.
This was fixed on trunk (see comment #65). Open another bug for long-term solutions; this one is cluttering the 4.4 regressions list. :-)
I am going to do a backport of this to the 4.3 branch.
Subject: Bug 32044

Author: rguenth
Date: Tue May 26 10:17:19 2009
New Revision: 147865

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147865

Log:
2009-05-26  Richard Guenther  <rguenther@suse.de>

	Backport from mainline
	2008-12-12  Zdenek Dvorak  <ook@ucw.cz>

	PR tree-optimization/32044
	* tree-scalar-evolution.h (expression_expensive_p): Declare.
	* tree-scalar-evolution.c (expression_expensive_p): New function.
	(scev_const_prop): Avoid introducing expensive expressions.
	* tree-ssa-loop-ivopts.c (may_eliminate_iv): Ditto.

	* gcc.dg/pr34027-1.c: Change outcome.
	* gcc.dg/tree-ssa/pr32044.c: New test.

Added:
    branches/gcc-4_3-branch/gcc/testsuite/gcc.dg/tree-ssa/pr32044.c
Modified:
    branches/gcc-4_3-branch/gcc/ChangeLog
    branches/gcc-4_3-branch/gcc/testsuite/ChangeLog
    branches/gcc-4_3-branch/gcc/testsuite/gcc.dg/pr34027-1.c
    branches/gcc-4_3-branch/gcc/tree-scalar-evolution.c
    branches/gcc-4_3-branch/gcc/tree-scalar-evolution.h
    branches/gcc-4_3-branch/gcc/tree-ssa-loop-ivopts.c
Fixed.