This is the mail archive of the
mailing list for the GCC project.
Re: Repository for the conversion machinery
Joseph Myers <email@example.com>:
> This is still hypothetical, since I haven't seen any scripts posted that
> would actually implement this, or any resulting mappings of commits, and one
> wouldn't normally expect a repository conversion to attempt to distinguish
> committer from author when the source version control system has no such
Wow. Even attempting this would be a huge, ugly job.
I strongly recommend that if you want to try this, you separate it from the
initial repo conversion. That is, get the project to git first. Then
see if you can data-mine author information out of the history. If,
and only if, you get results that look reasonable, then you patch the repo
and force-push it, warning everyone there'll be a flag day.
The reason I recommend this is that I think you're going to have serious
trouble getting clean authorship data with good coverage. The data
mining will be messy and take longer than you expect.
Here's how I'd do it:
1. Write an analyzer for commit logs. Its goal should be to parse
logs and produce a list of records each consisting of an author, a
commit date, and a list of modified-file paths - one record
per commit-log entry.
2. Run this once on each terminal commit log - that is, at each branch
head on both the main commit log and all its archival versions.
Aggregate all the records, dropping duplicates.
3. Write a custom Python extension to reposurgeon that generates the
same report, only this time per-commit and thus yielding a committer ID.
4. Set a recognition time window. It must be more than 24 hours or you're
going to have spurious negatives due to time-zone skew.
5. Write a program that fuzzy-matches the commit-log file-modification
cliques to the per-commit cliques. One aspect of "fuzzy" is the
time window; you need to include as potential matches any commits back
from the date of the commit-log entry *and those up to 24 hours forward*
(time-zone skew again). Also, you can't only look at the most recent
matching commit if it's within the 24-hour window - time zone skew might
mean that another one that looks older also matches, and might actually
be more recent.
6. Try the naive implementation using a 24-hour time window. Now look
at the percentages of unmatched commits and commit-log entries. If
they're too high, how do they vary as the time window rises?
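To make the windowed matching concrete, here is a minimal Python sketch
of the naive version. Everything in it is an assumption for illustration:
the record shapes (LogEntry, Commit), the function names, the sample
data, and the choice of exact file-set equality as the match test. It
reads the window as "look back this far from the log entry, and always
allow 24 hours forward", and it returns every candidate commit rather
than only the most recent one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LogEntry:
    """One commit-log entry (hypothetical record shape)."""
    author: str
    date: datetime
    files: frozenset


@dataclass(frozen=True)
class Commit:
    """One repository commit (hypothetical record shape)."""
    committer: str
    date: datetime
    files: frozenset


SKEW = timedelta(hours=24)  # forward allowance for time-zone skew


def candidates(entry, commits, window):
    """Every commit touching exactly the entry's files whose date lies
    between `window` before the entry and 24 hours after it."""
    lo, hi = entry.date - window, entry.date + SKEW
    return [c for c in commits
            if c.files == entry.files and lo <= c.date <= hi]


def residual_rates(entries, commits, window):
    """Fractions of log entries and of (distinct) commits with no match."""
    matched = set()
    unmatched_entries = 0
    for entry in entries:
        found = candidates(entry, commits, window)
        if found:
            matched.update(found)
        else:
            unmatched_entries += 1
    return (unmatched_entries / (len(entries) or 1),
            (len(commits) - len(matched)) / (len(commits) or 1))


# Made-up sample data, not taken from any real history.
entries = [
    LogEntry("alice", datetime(2015, 8, 1), frozenset({"gcc/a.c"})),
    LogEntry("bob", datetime(2015, 8, 3), frozenset({"gcc/b.c"})),
]
commits = [
    # 20 hours after alice's entry: inside the forward skew allowance.
    Commit("carol", datetime(2015, 8, 1, 20), frozenset({"gcc/a.c"})),
    # 3.5 days before bob's entry: only a wider window catches it.
    Commit("carol", datetime(2015, 7, 30, 12), frozenset({"gcc/b.c"})),
]

for hours in (24, 48, 96):
    print(hours, residual_rates(entries, commits, timedelta(hours=hours)))
```

Sweeping the window like this is one cheap way to watch how the
unmatched percentages respond before investing in fuzzier matching.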
Alas, there are other dimensions of 'fuzzy'. Here are a couple:
1. Typos or omissions in the commit-log file cliques and/or author
names. To get good coverage you might find you need to do
something like a Ratcliff-Obershelp fuzzy match. Set a high
similarity percentage, then back off it if you have lots of
2. What if someone did two or more commits on different filesets, but
described them in one commit-log entry? Ideally you'd like to propagate
the commit-log author info correctly to both, but testing for this case
mechanically would be combinatorially explosive. Your only hope is that
you end up with few enough unmatched commits and commit-log entries
that the problem can be solved manually.
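For the typo case, a rough Python sketch of threshold-based fuzzy
matching follows. It leans on the standard library's
difflib.SequenceMatcher, which its documentation describes as related
to Ratcliff-Obershelp "gestalt pattern matching"; treat it as a
stand-in, not necessarily the exact algorithm. The function names,
thresholds and file paths are made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(name, known, threshold):
    """The member of `known` most similar to `name`, or None if even
    the best one scores below `threshold`."""
    if not known:
        return None
    best = max(known, key=lambda k: similarity(name, k))
    return best if similarity(name, best) >= threshold else None


def unmatched(names, known, threshold):
    """Names from the logs that match nothing at this threshold."""
    return [n for n in names if best_match(n, known, threshold) is None]


# Made-up paths: what the repository knows vs. what the logs say.
known_paths = ["gcc/config/i386/i386.c", "gcc/tree-ssa.c"]
logged_paths = ["gcc/config/i386/i386.c",   # exact
                "gcc/config/i386/i386c",    # dropped dot
                "gcc/tree_ssa.c"]           # underscore for hyphen

# Start strict, then back off while watching what is left over.
for threshold in (0.99, 0.95, 0.90):
    print(threshold, unmatched(logged_paths, known_paths, threshold))
```

Each step down in the threshold trades fewer residuals for a higher
risk of false matches, so the leftovers at each level are worth
eyeballing before accepting the looser setting.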
Maybe you'll get lucky and the residuals (the sets of commits and commit-log
entries that don't have a match in the other set) will be tiny. I wouldn't
count on it - I'd expect that you will trip over other noise sources and
have to figure out ways to fuzzy-match around them.
Once you have the residuals down to an acceptably low number, make your
matcher grind out a set of reposurgeon commands that patches the attributions
appropriately. Apply. Be careful to add a predicate check that prevents
each transformation from applying if the date matches more than one commit;
those two will have to be treated as residuals and hand-patched.
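The uniqueness check might be sketched in Python like this, ahead of
generating any commands. The function name, the (date, author) proposal
shape and the sample values are all hypothetical, and the sketch stops
short of emitting reposurgeon commands, since their exact syntax is not
spelled out here.

```python
from collections import Counter


def partition_by_unique_date(proposals, commit_dates):
    """Split (date, author) proposals into those whose date identifies
    exactly one commit and those that must be hand-patched."""
    counts = Counter(commit_dates)
    safe, residuals = [], []
    for date, author in proposals:
        if counts[date] == 1:
            safe.append((date, author))
        else:
            # Zero commits or several share this date: do not auto-patch.
            residuals.append((date, author))
    return safe, residuals


# Made-up timestamps; two commits share the second one.
commit_dates = ["2015-08-01T20:00:00Z",
                "2015-08-02T09:30:00Z",
                "2015-08-02T09:30:00Z"]
proposals = [("2015-08-01T20:00:00Z", "alice"),
             ("2015-08-02T09:30:00Z", "bob")]

safe, residuals = partition_by_unique_date(proposals, commit_dates)
print("auto-patch:", safe)
print("by hand:   ", residuals)
```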
Eric S. Raymond <http://www.catb.org/~esr/>