56727 – Recursive call goes through the PLT unnecessarily

Bug 56727 - Recursive call goes through the PLT unnecessarily

Summary: Recursive call goes through the PLT unnecessarily

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	middle-end (show other bugs)
Version:	4.7.2

Importance:	P3 minor
Target Milestone:	8.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2013-03-25 19:16 UTC by Thiago Macieira
Modified:	2017-07-21 21:26 UTC (History)
CC List:	4 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:	7.0
Last reconfirmed:	2013-03-26 00:00:00

Attachments
Draft patch (560 bytes, patch) 2017-01-23 08:24 UTC, Yuri Gribov	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Thiago Macieira 2013-03-25 19:16:34 UTC

Consider the following code, compiled with -O2 -fPIC either in C or in C++:

===
void f(TYPE b)
{
    f(0);
}
===

if TYPE is a type of 32- or 64-bit width (int, unsigned, long, long long), GCC generates the following code (-m32, -mx32 or -m64):

===
f:
.L2:
        jmp     .L2
===

If TYPE is shorter than 32-bit (bool, _Bool, char, short), GCC generates the following code (-mx32, -m64):

===
f:
        xorl    %edi, %edi
        jmp     f@PLT
===

and much worse code for -m32.

For whatever reason, GCC decided to place the call via the PLT. That's a the missed optimisation: if this function was called, then the PLT must resolve back to itself. What's more, since the argument wasn't used, it's also unnecessary to set it.

The output happens without -fPIC, but in that case there is no PLT.

Tested on:
  GCC 4.7.2 (as shipped by Fedora 17)
  GCC 4.9 (trunk build from 20130318)

This is a contrived example (infinite recursion) that no one would write in their sane mind. But it might point to missed optimisations in legitimate recursive functions.

Comment 1 Andrew Pinski 2013-03-25 19:18:17 UTC

This is due to promoting of the type to int on some targets (x86 is the main one).

Comment 2 Richard Biener 2013-03-26 14:26:21 UTC

Confirmed.  We don't optimize callgraph cycles with one externally visible
entry that way.  And I believe we currently have no way of annotating a
single call to resolve locally.  Honza?

Comment 3 Jan Hubicka 2013-03-26 16:12:27 UTC

> Confirmed.  We don't optimize callgraph cycles with one externally visible
> entry that way.  And I believe we currently have no way of annotating a
> single call to resolve locally.  Honza?

The canonical way to do it is to create a static alias and the redirect call
to the alias. We do that in FE for some specific examples (like thunks) but not
in general.  Doing so would indeed make sense.

Honza

Comment 4 Jakub Jelinek 2013-03-26 17:05:12 UTC

Note it would need to be done with lots of care, because you can e.g. have aliases to the function and in that case you should go through the PLT:
__attribute__((noinline, noclone))
void f(short b)                                                                                                                                  
{                                                                                                                                               
  f(0);                                                                                                                                       
}                                                                                                                                               
static void g (short) __attribute__ ((alias ("f")));
void
h ()
{
  g (0);
}

Because the global scope f can be in different shared library, but if you call h () and is defined only in this CU, you call this CU's f, no the globally visible one.

Comment 5 Yuri Gribov 2017-01-23 08:24:02 UTC

Created attachment 40562 [details]
Draft patch

(In reply to Thiago Macieira from comment #0)
> What's more, since the argument wasn't used, it's also
> unnecessary to set it.

This may be due to contrivedness of example (endless loop, etc.). The argument seems to be forwarded in normal cases.

(In reply to Jan Hubicka from comment #3)
> The canonical way to do it is to create a static alias and the redirect call
> to the alias. We do that in FE for some specific examples (like thunks) but
> not in general.  Doing so would indeed make sense.

Would something like attached draft patch be ok?

(In reply to Jakub Jelinek from comment #4)
> Note it would need to be done with lots of care, because you can e.g. have
> aliases to the function and in that case you should go through the PLT:

Note that we already violate this. E.g. in Thiago's reprocase compiler shouldn't have figured out
that there is and endless loop. I can also see that recursive factorial implementation is expanded to loop
even under -fPIC. Should I fix this?

Comment 6 Alexander Monakov 2017-01-23 12:39:01 UTC

Note that even without symbol aliases, such calls are not necessarily self-recursive when 'f' is first called via dlsym with RTLD_NEXT or a specific module as the first argument.  In that case the result of dlsym won't necessarily be the prevailing definition of 'f' placed into the PLT.

Comment 7 Yuri Gribov 2017-01-23 19:05:44 UTC

(In reply to Alexander Monakov from comment #6)
> Note that even without symbol aliases, such calls are not necessarily
> self-recursive when 'f' is first called via dlsym with RTLD_NEXT or a
> specific module as the first argument.

Right.  It seems -fno-semantic-interposition is the only option to disable strict interposition rules.

So could someon close this as wontfix?

Comment 8 Alexander Monakov 2017-01-23 21:35:18 UTC

Well, if my argument is correct, then GCC generates wrong code for the very first example in comment #0. If that is deliberate as a compromise even though otherwise GCC suppresses optimizations to honor possible ELF interposition, I think this ought to be documented?  To my knowledge, that is the sole instance where optimization doesn't fully honor ELF interposition possibility.

Comment 9 Yuri Gribov 2017-01-24 09:43:29 UTC

(In reply to Alexander Monakov from comment #8)
> Well, if my argument is correct, then GCC generates wrong code for the very
> first example in comment #0.

I believe it does (see my #5, most probly author of some pass failed to check for interposition).  Note that simple recursive factorial is expanded to loop too which is a more impressive instance of this bug.

> To my knowledge, that
> is the sole instance where optimization doesn't fully honor ELF
> interposition possibility.

+1

Comment 10 Jakub Jelinek 2017-02-09 15:35:12 UTC

(In reply to Yuri Gribov from comment #9)
> (In reply to Alexander Monakov from comment #8)
> > Well, if my argument is correct, then GCC generates wrong code for the very
> > first example in comment #0.
> 
> I believe it does (see my #5, most probly author of some pass failed to
> check for interposition).  Note that simple recursive factorial is expanded
> to loop too which is a more impressive instance of this bug.
> 
> > To my knowledge, that
> > is the sole instance where optimization doesn't fully honor ELF
> > interposition possibility.
> 
> +1

Don't we also inline any beneficial inline functions at -O3 even if they could be interposed (definitely not suggesting we stop doing that, that would totally kill compiler performance)?  I'd say we shouldn't care about the semantic interposition for self-recursion either, just make sure we don't crash on such code.

Comment 11 Florian Weimer 2017-02-09 15:41:23 UTC

(In reply to Jakub Jelinek from comment #10)

> Don't we also inline any beneficial inline functions at -O3 even if they
> could be interposed (definitely not suggesting we stop doing that, that
> would totally kill compiler performance)?  I'd say we shouldn't care about
> the semantic interposition for self-recursion either, just make sure we
> don't crash on such code.

At least we should make things consistent and use direct calls for all local locals which could potentially be subject to inlining.  Otherwise, interposition could break with apparently unrelated code changes.

Comment 12 Yuri Gribov 2017-02-09 15:48:47 UTC

(In reply to Jakub Jelinek from comment #10)
> (In reply to Yuri Gribov from comment #9)
> > (In reply to Alexander Monakov from comment #8)
> > > Well, if my argument is correct, then GCC generates wrong code for the very
> > > first example in comment #0.
> > 
> > I believe it does (see my #5, most probly author of some pass failed to
> > check for interposition).  Note that simple recursive factorial is expanded
> > to loop too which is a more impressive instance of this bug.
> > 
> > > To my knowledge, that
> > > is the sole instance where optimization doesn't fully honor ELF
> > > interposition possibility.
> > 
> > +1
> 
> Don't we also inline any beneficial inline functions at -O3 even if they
> could be interposed (definitely not suggesting we stop doing that, that
> would totally kill compiler performance)?

Inlining inline functions is fine due to ODR rule.

Comment 13 Jakub Jelinek 2017-02-09 15:53:58 UTC

(In reply to Yuri Gribov from comment #12)
> Inlining inline functions is fine due to ODR rule.

ODR doesn't apply just to inline functions.  So all semantic interposition, except for the case when both functions do the very same thing, is ODR violation.
But many programs and libraries rely on it heavily.

Comment 14 Jan Hubicka 2017-02-09 15:55:47 UTC

For the draft patch you need to check for aliases.  If global symbol is indeed the only way to reach the function, then the transformation is IMO valid.

As for tailcall, we have recursive_call_p predicate that uses symtab_node::semantically_equivalent_p which returns true, because it is called for t and t. It is valid for testcase as written, because we know there is no alias.

However

void f(int b)
{
    f(0);
}

void g(int b) __attribute__((alias(("f"))));

is indeed misopitmized and the bug is that recursive_call_p needs to check all aliases of the function being semantically equivalent to function itself.  I will fix that.

For the optimization redirecting to noninterposable aliases, I would be fine with the patch if it was extended by the check verifying that there is at most one externally visible symbol aliasing the definition or that they are all non-interposable.

Honza

Comment 15 Jan Hubicka 2017-02-11 17:56:34 UTC

Author: hubicka
Date: Sat Feb 11 17:56:02 2017
New Revision: 245359

URL: https://gcc.gnu.org/viewcvs?rev=245359&root=gcc&view=rev
Log:

	PR tree-ssa/56727
	* gcc.dg/tree-ssa/pr56727.c: New testcase.
	* ipa-utils.c (recursive_call_p): Be more careful about interposition.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr56727.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ipa-utils.c
    trunk/gcc/testsuite/ChangeLog

Comment 16 Yury Gribov 2017-07-21 19:49:26 UTC

Author: ygribov
Date: Fri Jul 21 19:48:51 2017
New Revision: 250442

URL: https://gcc.gnu.org/viewcvs?rev=250442&root=gcc&view=rev
Log:
2017-07-21  Yury Gribov  <tetra2005@gmail.com>

	PR middle-end/56727
	* ipa-visibility (function_and_variable_visibility): Convert
	recursive PLT call to direct call if appropriate.

	* gcc.dg/pr56727-1.c: New test.
	* gcc.dg/pr56727-2.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/pr56727-1.c
    trunk/gcc/testsuite/gcc.dg/pr56727-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ipa-visibility.c
    trunk/gcc/testsuite/ChangeLog

Comment 17 Yury Gribov 2017-07-21 19:50:15 UTC

Fixed.