This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH 4/4] define ASM_OUTPUT_LABEL to the name of a function
- From: Trevor Saunders <tbsaunde at tbsaunde dot org>
- To: David Malcolm <dmalcolm at redhat dot com>, Richard Biener <richard dot guenther at gmail dot com>, tbsaunde+gcc at tbsaunde dot org, GCC Patches <gcc-patches at gcc dot gnu dot org>, richard dot sandiford at arm dot com
- Date: Fri, 7 Aug 2015 09:50:33 -0400
- Subject: Re: [PATCH 4/4] define ASM_OUTPUT_LABEL to the name of a function
- Authentication-results: sourceware.org; auth=none
- References: <CAFiYyc2eP1e=yBUVwVQbCRHn1jsB3oE3b8exV6qfEYZm9F9WVQ at mail dot gmail dot com> <20150805105650 dot GA27755 at tsaunders-iceball dot corp dot tor1 dot mozilla dot com> <CAFiYyc0UQXp=X__=gynjd7ibBOOebAapJzvyeP+mUEn7Nc3XhA at mail dot gmail dot com> <1438788499 dot 21752 dot 39 dot camel at surprise> <1438788868 dot 21752 dot 45 dot camel at surprise> <20150805202212 dot GA6847 at tsaunders-iceball dot corp dot tor1 dot mozilla dot com> <1438886869 dot 21752 dot 62 dot camel at surprise> <87io8sp5tn dot fsf at googlemail dot com> <20150807043113 dot GA24176 at tsaunders-iceball dot corp dot tor1 dot mozilla dot com> <87k2t7ig8a dot fsf at e105548-lin dot cambridge dot arm dot com>
On Fri, Aug 07, 2015 at 10:45:57AM +0100, Richard Sandiford wrote:
> Trevor Saunders <tbsaunde@tbsaunde.org> writes:
> > On Thu, Aug 06, 2015 at 08:36:36PM +0100, Richard Sandiford wrote:
> >> An integrated assembler or tighter asm output would be nice, but when
> >> I last checked LLVM was usually faster than GCC even when compiling to asm,
> >> even though LLVM does use indirection (in the form of virtual functions)
> >> for its output routines. I don't think indirect function calls themselves
> >> are the problem -- as long as we get the abstraction right :-)
> >
> > yeah, last time I looked (tbf a while ago) the C++ front end took up by
> > far the largest part of the time. So it may not be terribly important,
> > but it would still be nice to figure out what a good design looks like.
>
> I tried getting final to output the code a large number of times.
> Obviously just sticking "for (i = 0; i < n; ++i)" around something
> isn't the best way of measuring performance (for all the usual reasons)
> but it was interesting even so. A lot of the time is taken in calls to
> strlen and in assemble_name itself (called by ASM_OUTPUT_LABEL).
yeah, this data looks great. I find it interesting that you say we
spend so much time outputting labels as opposed to instructions.
> Each time we call assemble_name we do:
>
> real_name = targetm.strip_name_encoding (name);
>
> id = maybe_get_identifier (real_name);
> if (id)
> {
> tree id_orig = id;
>
> mark_referenced (id);
> ultimate_transparent_alias_target (&id);
> if (id != id_orig)
> name = IDENTIFIER_POINTER (id);
> gcc_assert (! TREE_CHAIN (id));
> }
>
> Doing an identifier lookup every time we output a reference to a label
> is pretty expensive. Especially when many of the labels we're dealing
> with are internal ones (basic block labels, debug labels, etc.) for which
> the lookup is bound to fail.
well, there's ASm_OUTPUT_INTERNAL_LABEL, and I think something similar
for debug labels. I guess we don't always use those where we could. Or
maybe the problem is we have places where we need to look at data to
find out. Maybe it would make sense to have the generally used
output_label routine take a tree / rtx, and check if its a internal or
debug label and dispatch appropriately.
> So if compile-time for asm output is a concern, that seems like a good
> place to start. We should try harder to keep track of the identifier
> behind a name (when there is one) and avoid this overhead for
> internal labels.
>
> Converting ASM_OUTPUT_LABEL to an indirect function call was in the
> noise even with my for-loop hack. The execution time of the hook is
> dominated by assemble_name itself. I hope patches like yours aren't
> held up simply because they have the equivalent of a virtual function.
Well, I think it makes sense to reroll this series, but I think I'll
keep working on trying to replace these macros with something else.
> Also, although we seem to be paranoid about virtual functions and
> indirect calls, it's worth remembering that on most targets every
> call to fputs(_unlocked), fwrite(_unlocked) and strlen is a PLT call.
> Our current code calls fputs several times for one line of assembly,
> including for short strings like register names. This is doubly
> inefficient because:
>
> (a) we could reduce the number of PLT calls by doing the buffering
> ourselves and
yeah, I mentioned that earlier, but its great to have data showing its a
win! I think its also probably important to enabling the other
optimizations below.
> (b) the names of those registers are known at compile time (or at least
> at start-up time) and are short, but we call strlen() on them
> each time we write them out.
yeah, that seems like something that should be fixed, but I'm not sure
off hand where to look for the code doing this.
> E.g. for the attached microbenchmark I get:
>
> Time taken, normalised to VERSION==1
>
> VERSION==1: 1.000
> VERSION==2: 1.377
> VERSION==3: 3.202 (1.638 with -minline-all-stringops)
> VERSION==4: 4.242 (2.921 with -minline-all-stringops)
> VERSION==5: 4.526
> VERSION==6: 4.543
> VERSION==7: 10.884
>
> where the results for 5 vs. 6 are in the noise.
>
> The 5->4 gain is by doing the buffering ourselves. The 4->3 gain is for
> keeping track of the string length rather than recomputing it each time.
>
> This suggests that if we're serious about trying to speed up the asm output,
> it would be worth adding an equivalent of LLVM's StringRef that pairs a
> const char * string with its length.
I've thought a tiny bit about working on that, so its nice to have data.
Trev
>
> Thanks,
> Richard
>
> #define _GNU_SOURCE 1
>
> #include <stdio.h>
> #include <string.h>
> #include <iostream>
>
> struct S
> {
> S () : end (buffer) {}
>
> ~S ()
> {
> fwrite_unlocked (buffer, end - buffer, 1, stdout);
> }
>
> #if VERSION == 3
> void __attribute__((noinline))
> #else
> void
> #endif
> write (const char *x, size_t len)
> {
> if (__builtin_expect (buffer + sizeof (buffer) - end < len, 0))
> {
> fwrite_unlocked (buffer, end - buffer, 1, stdout);
> end = buffer;
> }
> memcpy (end, x, len);
> end += len;
> }
>
> #if VERSION == 1 || VERSION == 3
> template <size_t N>
> void
> write (const char (&x)[N])
> {
> write (x, N - 1);
> }
> #elif VERSION == 2
> template <size_t N>
> void __attribute__((noinline))
> write (const char (&x)[N])
> {
> write (x, N - 1);
> }
> #else
> void __attribute__((noinline))
> write (const char *x)
> {
> write (x, strlen (x));
> }
> #endif
> char buffer[4096];
> char *end;
> };
>
> int
> main ()
> {
> S s;
> for (int i = 0; i < 100000000; ++i)
> {
> #if VERSION <= 4
> s.write ("Hello!");
> #elif VERSION == 5
> fputs_unlocked ("Hello!", stdout);
> #elif VERSION == 6
> fwrite_unlocked ("Hello!", 6, 1, stdout);
> #elif VERSION == 7
> std::cout << "Hello!";
> #else
> #error Please define VERSION
> #endif
> }
> return 0;
> }