This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Speedup area - Strings/Identifiers

From: Ben White <topgunis at sbcglobal dot net>
To: gcc at gcc dot gnu dot org
Date: Tue, 07 Aug 2007 12:08:48 -0500
Subject: Speedup area - Strings/Identifiers
Reply-to: topgunis at sbcglobal dot net

From the GCC Wiki, Speedup Areas page, Strings/identifiers section:

/3. Replace identifier hash table with a better data structure (have already tried a ternary tree, it's not faster; could try to code it even cleverer than it already was; B* trees might be worth looking into)/

A different approach:

Separate the "handle identifier characteristics" process from the "which identifier is this string?" process.

Identifier characteristics are usually handled as a table, with characteristics as fields; e.g. undefined/defined, float/double/int/long, structure/not-structure. The table key is usually the row number itself.

As I understand it, in GCC now, when the parser encounters an identifier string, it first looks up the string using a hash table, returning the internal key for the identifier. Then it uses the key and the identifier table to handle characteristics.

This is costly. Maintaing a hash table using insertions and deletions is costly.

Stopping all parsing, and using memory storage for hashing and hash tables, is also expensive. If they are used, they will be brought into the cache. Information needed when the identification is over will be paged out. This is true, no matter which method is used - hashing, ternary trees, or B* trees.

A two-pass approach might be significantly faster.

In the first pass, each identifier is parsed out, and appended to the unsorted list of all identifier instances, matched with its location - (a location id might be simply the count of identifier instances, e.g. the 1,024th identifier instance , or physical location in the source text.)

The list of identifier instances, with their matching locations, are then sorted en-mass.

A unique, numeric, ID is assigned to each unique identifier string, and a cross-reference table is populated. For each identifier instance in the source, there is an entry containing the unique identifier ID.

In the second pass, the identifier instance ID is the key to return the unique identifier ID. By its nature, this table will be processed in sequence, once.

Collisions are guaranteed. They are not show stoppers. When an identifier is re-defined in a sub-section, the characteristics for the previous definition will have to be temporarily stored somewhere else, and restored at the conclusion of the subsection.

Benefits:

1. All the sorting can be done at one time, using the most efficient sort - Radix/Bucket. 2. The second parsing pass will be considerably more efficient because it will not have to be continually interrupted to do a hash-search-insertion. 3. The larger the program, e.g. whole Kernel compiles, or GCC compiles, the more efficient the sort will be. Compiler perfomance is often judged by how fast it compiles the largest programs.

Drawbacks:

1. The unique ID table will be large. 2. Two parsing passes will be assumed to be less efficient, before this is ever tried.

Ben White

Follow-Ups:
- Re: Speedup area - Strings/Identifiers
  - From: Joe Buck

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]