This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH 4/4] define ASM_OUTPUT_LABEL to the name of a function
- From: Richard Sandiford <richard dot sandiford at arm dot com>
- To: Trevor Saunders <tbsaunde at tbsaunde dot org>
- Cc: David Malcolm <dmalcolm at redhat dot com>, Richard Biener <richard dot guenther at gmail dot com>, tbsaunde+gcc at tbsaunde dot org, GCC Patches <gcc-patches at gcc dot gnu dot org>
- Date: Fri, 07 Aug 2015 10:45:57 +0100
- Subject: Re: [PATCH 4/4] define ASM_OUTPUT_LABEL to the name of a function
- Authentication-results: sourceware.org; auth=none
- References: <1437791820-17927-1-git-send-email-tbsaunde+gcc at tbsaunde dot org> <1437791820-17927-5-git-send-email-tbsaunde+gcc at tbsaunde dot org> <CAFiYyc2eP1e=yBUVwVQbCRHn1jsB3oE3b8exV6qfEYZm9F9WVQ at mail dot gmail dot com> <20150805105650 dot GA27755 at tsaunders-iceball dot corp dot tor1 dot mozilla dot com> <CAFiYyc0UQXp=X__=gynjd7ibBOOebAapJzvyeP+mUEn7Nc3XhA at mail dot gmail dot com> <1438788499 dot 21752 dot 39 dot camel at surprise> <1438788868 dot 21752 dot 45 dot camel at surprise> <20150805202212 dot GA6847 at tsaunders-iceball dot corp dot tor1 dot mozilla dot com> <1438886869 dot 21752 dot 62 dot camel at surprise> <87io8sp5tn dot fsf at googlemail dot com> <20150807043113 dot GA24176 at tsaunders-iceball dot corp dot tor1 dot mozilla dot com>
Trevor Saunders <tbsaunde@tbsaunde.org> writes:
> On Thu, Aug 06, 2015 at 08:36:36PM +0100, Richard Sandiford wrote:
>> An integrated assembler or tighter asm output would be nice, but when
>> I last checked LLVM was usually faster than GCC even when compiling to asm,
>> even though LLVM does use indirection (in the form of virtual functions)
>> for its output routines. I don't think indirect function calls themselves
>> are the problem -- as long as we get the abstraction right :-)
>
> yeah, last time I looked (tbf a while ago) the C++ front end took up by
> far the largest part of the time. So it may not be terribly important,
> but it would still be nice to figure out what a good design looks like.
I tried getting final to output the code a large number of times.
Obviously just sticking "for (i = 0; i < n; ++i)" around something
isn't the best way of measuring performance (for all the usual reasons)
but it was interesting even so. A lot of the time is taken in calls to
strlen and in assemble_name itself (called by ASM_OUTPUT_LABEL).
Each time we call assemble_name we do:
real_name = targetm.strip_name_encoding (name);
id = maybe_get_identifier (real_name);
if (id)
{
tree id_orig = id;
mark_referenced (id);
ultimate_transparent_alias_target (&id);
if (id != id_orig)
name = IDENTIFIER_POINTER (id);
gcc_assert (! TREE_CHAIN (id));
}
Doing an identifier lookup every time we output a reference to a label
is pretty expensive. Especially when many of the labels we're dealing
with are internal ones (basic block labels, debug labels, etc.) for which
the lookup is bound to fail.
So if compile-time for asm output is a concern, that seems like a good
place to start. We should try harder to keep track of the identifier
behind a name (when there is one) and avoid this overhead for
internal labels.
Converting ASM_OUTPUT_LABEL to an indirect function call was in the
noise even with my for-loop hack. The execution time of the hook is
dominated by assemble_name itself. I hope patches like yours aren't
held up simply because they have the equivalent of a virtual function.
Also, although we seem to be paranoid about virtual functions and
indirect calls, it's worth remembering that on most targets every
call to fputs(_unlocked), fwrite(_unlocked) and strlen is a PLT call.
Our current code calls fputs several times for one line of assembly,
including for short strings like register names. This is doubly
inefficient because:
(a) we could reduce the number of PLT calls by doing the buffering
ourselves and
(b) the names of those registers are known at compile time (or at least
at start-up time) and are short, but we call strlen() on them
each time we write them out.
E.g. for the attached microbenchmark I get:
Time taken, normalised to VERSION==1
VERSION==1: 1.000
VERSION==2: 1.377
VERSION==3: 3.202 (1.638 with -minline-all-stringops)
VERSION==4: 4.242 (2.921 with -minline-all-stringops)
VERSION==5: 4.526
VERSION==6: 4.543
VERSION==7: 10.884
where the results for 5 vs. 6 are in the noise.
The 5->4 gain is by doing the buffering ourselves. The 4->3 gain is for
keeping track of the string length rather than recomputing it each time.
This suggests that if we're serious about trying to speed up the asm output,
it would be worth adding an equivalent of LLVM's StringRef that pairs a
const char * string with its length.
Thanks,
Richard
#define _GNU_SOURCE 1
#include <stdio.h>
#include <string.h>
#include <iostream>
struct S
{
S () : end (buffer) {}
~S ()
{
fwrite_unlocked (buffer, end - buffer, 1, stdout);
}
#if VERSION == 3
void __attribute__((noinline))
#else
void
#endif
write (const char *x, size_t len)
{
if (__builtin_expect (buffer + sizeof (buffer) - end < len, 0))
{
fwrite_unlocked (buffer, end - buffer, 1, stdout);
end = buffer;
}
memcpy (end, x, len);
end += len;
}
#if VERSION == 1 || VERSION == 3
template <size_t N>
void
write (const char (&x)[N])
{
write (x, N - 1);
}
#elif VERSION == 2
template <size_t N>
void __attribute__((noinline))
write (const char (&x)[N])
{
write (x, N - 1);
}
#else
void __attribute__((noinline))
write (const char *x)
{
write (x, strlen (x));
}
#endif
char buffer[4096];
char *end;
};
int
main ()
{
S s;
for (int i = 0; i < 100000000; ++i)
{
#if VERSION <= 4
s.write ("Hello!");
#elif VERSION == 5
fputs_unlocked ("Hello!", stdout);
#elif VERSION == 6
fwrite_unlocked ("Hello!", 6, 1, stdout);
#elif VERSION == 7
std::cout << "Hello!";
#else
#error Please define VERSION
#endif
}
return 0;
}