This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: Proposal for the transition timetable for the move to GIT


Alexandre Oliva <oliva@gnu.org>:
> I know very little about reposurgeon, but I'm concerned that, should we
> make the conversion with it, and later identify e.g. missed branches, we
> might be unable to make such an incremental recovery.  Can anyone
> alleviate my concerns and let me know we could indeed make such an
> incremental recovery of a branch missed in the initial conversion, in
> such a way that its commit history would be shared with that of the
> already-converted branch it branched from?

Reposurgeon has a reparent command.  If you have determined that a
branch is detached or has an incorrect attachment point, patching the
metadata of the root node to fix that is very easy.

> Now, would it be too much of a burden to insist that the commit graphs
> out of both conversions be isomorphic, and maybe mappings between the
> commit ids (if they can't be made identical to begin with, that is) be
> generated and shared, so that the results of both conversions can be
> efficiently and mechanically compared (disregarding expected
> differences) not only in terms of branch and tag names and commit
> graphs, but also tree contents, commit messages and any other metadata?
> Has anything like this been done yet?

On the GCC repository, no. 

There are very serious practical problems with full verification of
git against SVN, stemming mainly from the fact that Subversion checkout
on a repository of this size is extremely slow. IIRC Joseph at one
point estimated a check time on the order of months due to that
overhead alone.

If you're talking about a commit-by-commit comparison between two
conversions that assumes one or the other is correct, that is
theoretically possible and - because git retrieval is much faster -
could theoretically be done in a reasonable amount of time.  But there
is a lot of devil in the practical details.

The reposurgeon suite once included a tool for such comparisons.
Last year this happened:

commit b8a609925ba70a6b68f9eda1d748eb667ad2fa59
Author: Eric S. Raymond <esr@thyrsus.com>
Date:   Fri Aug 24 12:40:46 2018 -0400

    Retire repodiffer.  Its only use case was checks against git-svn...
    
    ...which we now know to make such bad conversions that on larger than trivial
    repos the differ would be prohibitively noisy.

Maxim's scripts probably make a better conversion than bare git-svn,
because he uses git-svn only for linear basic blocks and thereby
avoids its worst failure modes. In theory I could dust off repodiffer
and apply it.

That's in theory. In practice, on a repository this size I am not
greatly optimistic about getting a result that could be interpreted by
a Mark I brain.  The reasons go beyond git-svn's brain damage to the
same ontological-mismatch problems that make SVN-to-git conversion a 
headache in general.

You might think at least there'd be a 1:1 correspondence between
commits in the two conversions, but that's not going to be true for a
couple of different reasons.

1. Split commits. Reposurgeon decomposes these into pieces, one per
git branch.  I don't know what Maxim's scripts do.  I think Joseph turned
up that there are over a thousand of these in the GCC history.

2. There are three classes of commits in Subversion that don't really fit
the git data model: (1) directory creation/deletion commits, (2) directory
copy commits, and (3) property changes with no associated blob.

For each of these exceptional commits a converter to Git has a choice
of dropping the commit, turning it into some sort of annotated tag, or
leaving it in place as a zero-op commit (anomalous but not forbidden
in the git model). It is pretty much guaranteed that different
converters will make different choices about these, which will make
for huge amounts of noise in your attempt at a diff.
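
To make the noise problem concrete, here is a minimal sketch of one way
a differ might neutralize those choices: splice the exceptional commits
out of both graphs before comparing.  This is an illustration, not
anything reposurgeon or repodiffer does.  It assumes each conversion has
already been reduced to a hypothetical mapping from legacy ID to the
legacy IDs of its parents, and that the set of exceptional legacy IDs
(here `ignore`) is known in advance, which is itself a large assumption.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical representation: legacy ID -> legacy IDs of the parents.
Graph = Dict[str, FrozenSet[str]]


def splice_out(graph: Graph, ignore: Set[str]) -> Graph:
    """Drop the commits in `ignore`, reattaching children to surviving ancestors."""

    def surviving_parents(node: str) -> FrozenSet[str]:
        result = set()
        seen = set()
        stack = list(graph[node])
        while stack:
            parent = stack.pop()
            if parent in seen:
                continue
            seen.add(parent)
            if parent in ignore:
                # Look through the ignored commit to its own parents.
                stack.extend(graph.get(parent, ()))
            else:
                result.add(parent)
        return frozenset(result)

    return {n: surviving_parents(n) for n in graph if n not in ignore}


# One converter kept r2 as a zero-op commit, the other dropped it.
kept = {"r1": frozenset(), "r2": frozenset({"r1"}), "r3": frozenset({"r2"})}
dropped = {"r1": frozenset(), "r3": frozenset({"r1"})}
print(splice_out(kept, {"r2"}) == splice_out(dropped, {"r2"}))
```

Commits turned into annotated tags would need separate handling; the
sketch only covers the drop-versus-keep case.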

Checking for DAG isomorphism: again, theoretically possible,
practically pretty daunting.  It could be worse - general graph
isomorphism is not even known to be polynomial-time - but in this case
we can label corresponding commits with matching legacy IDs, which
should make possible an isomorphism check in linear time with a trivial
algorithm.

Well, except for split commits. That one would be solvable, albeit
painful.
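
A minimal sketch of the trivial labeled check, assuming each conversion
has been reduced to a hypothetical mapping from legacy ID to the set of
legacy IDs of its parents, and that every legacy ID appears exactly once
per conversion (the assumption split commits violate):

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical representation: legacy ID -> legacy IDs of the parents.
Graph = Dict[str, FrozenSet[str]]


def compare_labeled_dags(a: Graph, b: Graph) -> List[Tuple[str, str]]:
    """Return (legacy_id, problem) pairs; an empty list means the DAGs match."""
    problems = []
    for legacy_id in a:
        if legacy_id not in b:
            problems.append((legacy_id, "only in first conversion"))
        elif a[legacy_id] != b[legacy_id]:
            problems.append((legacy_id, "parents differ"))
    for legacy_id in b:
        if legacy_id not in a:
            problems.append((legacy_id, "only in second conversion"))
    return problems


first = {"r1": frozenset(), "r2": frozenset({"r1"}), "r3": frozenset({"r2"})}
second = {"r1": frozenset(), "r2": frozenset({"r1"}), "r3": frozenset({"r1"})}
for legacy_id, problem in compare_labeled_dags(first, second):
    print(legacy_id, problem)
```

Because the labels fix the node correspondence, the check visits each
commit and each parent link once.  It says nothing about tree contents
or metadata, only about graph shape.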

The real problem here would be mergeinfo links.  It's not even obvious
what a "correct" mapping of mergeinfo links is, in general, due to the
mismatch between Subversion's cherry-pick-based merge model and git's
branch merging.  Again, different converters will make different
choices. Reconciling them would be not fun.

There is another world of hurt lurking in "(disregarding expected
differences)".  How do you know what differences to expect? How are
you going to specify them?  What will interpret that spec?

There are more months of work here - nasty, wearing toil, with no
guarantee of a result with a decent signal-to-noise ratio.  Even
though I'm quite literally the best-qualified person on earth to do
it, I flinch at the thought.
-- 
		Eric S. Raymond <http://www.catb.org/~esr/>


