This is the mail archive of the
mailing list for the GCC project.
Re: Acceptance criteria for the git conversion
- From: David Malcolm <dmalcolm at redhat dot com>
- To: esr at thyrsus dot com
- Cc: Joseph Myers <joseph at codesourcery dot com>, gcc at gcc dot gnu dot org
- Date: Tue, 01 Sep 2015 20:32:38 -0400
- Subject: Re: Acceptance criteria for the git conversion
- Authentication-results: sourceware.org; auth=none
- References: <20150901105414 dot GA30270 at thyrsus dot com> <alpine dot DEB dot 2 dot 10 dot 1509011245480 dot 11400 at digraph dot polyomino dot org dot uk> <20150901153036 dot GA1223 at thyrsus dot com>
On Tue, 2015-09-01 at 11:30 -0400, Eric S. Raymond wrote:
> Joseph Myers <firstname.lastname@example.org>:
> > With 227369 revisions I don't think adding git-style summary lines is
> > really practical without some very reliable automation to match commits to
> > corresponding gcc-patches messages (whose Subject: headers would be the
> > natural choice for such summary lines)....
> In this case you may be right. Select =L tells me there are 101139
> commits wanting that sort of adjustment, which I think is at least
> 2.5x the bulk I've ever had to deal with before.
> Still, if anyone else is brave enough to write a script that will munch
> through gcc-patches producing committer/date/subject-line triples, I'll
> give it a try.
I don't think committer/date/subject-line triples are adequate: the
dates are unlikely to match up, for one thing.
I think such a solution would need to somehow locate and match patches
I was feeling brave, so I had a go at writing a scraper; see:
for what I have so far (tested with Python 2.7).
This can scrape the gcc-patches archives and locate mails containing
patches, extracting the patches (some of them anyway...). The idea
would be to stuff the patches into some kind of big data store, and
somehow then try to locate them (perhaps within a rough date "window").
Does this seem like a viable approach?
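As a rough illustration of the date-window idea, here is a minimal sketch. It is not taken from the scraper: the dict layout ("date", "files", "rev"), the function name match_patch, the 30-day default window, and the choice to score candidates by overlap of touched file paths are all assumptions made for the example. It is written for Python 3, whereas the scraper was tested with Python 2.7.

```python
from datetime import datetime, timedelta


def match_patch(patch, commits, window_days=30):
    """Return the commit that best matches `patch`, or None.

    Hypothetical data layout: `patch` and each commit are dicts with a
    "date" (datetime) and "files" (set of paths touched). Only commits
    dated within `window_days` of the patch, in either direction, are
    considered. Candidates are scored by Jaccard overlap of file sets.
    """
    window = timedelta(days=window_days)
    best, best_score = None, 0.0
    for commit in commits:
        # Skip anything outside the rough date window.
        if abs(commit["date"] - patch["date"]) > window:
            continue
        union = patch["files"] | commit["files"]
        if not union:
            continue
        score = len(patch["files"] & commit["files"]) / len(union)
        # Require a strictly positive overlap to count as a match.
        if score > best_score:
            best, best_score = commit, score
    return best


if __name__ == "__main__":
    patch = {"date": datetime(2015, 9, 1),
             "files": {"gcc/tree.c", "gcc/ChangeLog"}}
    commits = [
        {"rev": 1, "date": datetime(2015, 9, 3),
         "files": {"gcc/tree.c", "gcc/ChangeLog"}},
        {"rev": 2, "date": datetime(2015, 9, 2),
         "files": {"gcc/expr.c"}},
    ]
    print(match_patch(patch, commits))
```

File-set overlap is only one possible signal. Comparing the patch hunks against the committed diff would be stricter, at the cost of a much heavier comparison per candidate.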
Caution: this script performs numerous URL GETs on gcc.gnu.org;
it caches everything, but the first time you run it, the cache
will be cold. (So please be careful!)
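For readers unfamiliar with the caching pattern described in the caution, a minimal sketch of an on-disk cache in front of URL GETs follows. It does not reproduce the scraper's actual code: the names cached_get and fetch_url, the cache directory argument, and the use of a SHA-256 hash of the URL as the cache filename are assumptions for illustration, and it targets Python 3 rather than the 2.7 the scraper was tested with.

```python
import hashlib
import os
import urllib.request


def fetch_url(url):
    """Perform a real HTTP GET and return the response body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def cached_get(url, cache_dir, fetch=fetch_url):
    """Return the body for `url`, hitting the network only on a cache miss.

    The cache key is a hash of the URL (an assumption of this sketch), so
    each distinct URL is fetched at most once per cache directory.
    """
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key)
    if os.path.exists(path):
        # Warm cache: no request is sent to the server.
        with open(path, "rb") as f:
            return f.read()
    data = fetch(url)
    with open(path, "wb") as f:
        f.write(data)
    return data
```

The fetch parameter exists so the cache logic can be exercised with a stand-in function instead of real requests, which keeps test runs from adding load to the server.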
> About scale: The largest repository I've dealt with before this was
> NetBSD, with a working set of 18GB, vs 45GB for this one. The way reposurgeon's
> internal representations work, working set is dominated by comment text. So
> the GCC repo has about 2.5x the comment bulk of NetBSD.