This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: Proposal for the transition timetable for the move to GIT


> On Dec 30, 2019, at 7:08 PM, Richard Earnshaw (lists) <Richard.Earnshaw@arm.com> wrote:
> 
> On 30/12/2019 15:49, Maxim Kuvyrkov wrote:
>>> On Dec 30, 2019, at 6:31 PM, Richard Earnshaw (lists) <Richard.Earnshaw@arm.com> wrote:
>>> 
>>> On 30/12/2019 13:00, Maxim Kuvyrkov wrote:
>>>>> On Dec 30, 2019, at 1:24 AM, Richard Earnshaw (lists) <Richard.Earnshaw@arm.com> wrote:
>>>>> 
>>>>> On 29/12/2019 18:30, Maxim Kuvyrkov wrote:
>>>>>> Below are several more issues I found in reposurgeon-6a conversion comparing it against gcc-reparent conversion.
>>>>>> 
>>>>>> I am sure these and whatever other problems I may find in the reposurgeon conversion can be fixed in time.  However, I don't see why we should bother.  My conversion has been available since summer 2019, I made it ready in time for GCC Cauldron 2019, and it hasn't changed in any significant way since then.
>>>>>> 
>>>>>> With the "Missed merges" problem (see below) I don't see how reposurgeon conversion can be considered "ready".  Also, I expected a diligent developer to compare new conversion (aka reposurgeon's) against existing conversion (aka gcc-pretty / gcc-reparent) before declaring the new conversion "better" or even "ready".  The data I'm seeing in differences between my and reposurgeon conversions shows that gcc-reparent conversion is /better/.
>>>>>> 
>>>>>> I suggest that GCC community adopts either gcc-pretty or gcc-reparent conversion.  I welcome Richard E. to modify his summary scripts to work with svn-git scripts, which should be straightforward, and I'm ready to help.
>>>>>> 
>>>>> 
>>>>> I don't think either of these conversions is any more ready to use than
>>>>> the reposurgeon one, possibly less so.  In fact, there are still some
>>>>> major issues to resolve first before they can be considered.
>>>>> 
>>>>> gcc-pretty has completely wrong parent information for the gcc-3 era
>>>>> release tags, showing the tags as being made directly from trunk with
>>>>> massive deltas representing the roll-up of all the commits that were
>>>>> made on the gcc-3 release branch.
>>>> 
>>>> I will clarify the above statement, and please correct me where you think I'm wrong.  Gcc-pretty conversion has the exact right parent information for the gcc-3 era
>>>> release tags as recorded in SVN version history.  Gcc-pretty conversion aims to produce an exact copy of SVN history in git.  IMO, it manages to do so just fine.
>>>> 
>>>> It is a different thing that SVN history has a screwed up record of gcc-3 era tags.
>>> 
>>> It's not screwed up in svn.  Svn shows the correct history information for the gcc-3 era release tags, but the git-svn conversion in gcc-pretty does not.
>>> 
>>> For example, looking at gcc_3_0_release in expr.c with git blame and svn blame shows
>> 
>> In SVN history tags/gcc_3_0_release has been copied off /trunk:39596 and in the same commit a bunch of files were replaced from /branches/gcc-3_0-branch/ (and from different revisions of this branch!).
>> 
>> $ svn log -qv --stop-on-copy file://$(pwd)/tags/gcc_3_0_release | grep "/tags/gcc_3_0_release \|/tags/gcc_3_0_release/gcc/expr.c \|/tags/gcc_3_0_release/gcc/reload.c "
>>   A /tags/gcc_3_0_release (from /trunk:39596)
>>   R /tags/gcc_3_0_release/gcc/expr.c (from /branches/gcc-3_0-branch/gcc/expr.c:43255)
>>   R /tags/gcc_3_0_release/gcc/reload.c (from /branches/gcc-3_0-branch/gcc/reload.c:42007)
>> 
> 
> Right, (and wrong).  You have to understand how the release branches and
> tags are represented in CVS to understand why the SVN conversion is done
> this way.  When a branch was created in CVS a tag was added to each
> commit which would then be used in any future revisions along that
> branch.  But until a commit is made on that branch, the release branch
> is just a placeholder.
> 
> When a CVS release tag is created, the tag labels the relevant commit
> that is to be used.  If that commit is unchanged from the trunk revision
> (no commit on the branch), then that is what gets labelled, and it
> *appears* to still come from trunk - but that does not matter, since it
> is the same as the version on trunk.
> 
> The svn copy operations are formed from this set of information by
> copying the SVN revision of trunk that applied at the point the branch
> was made, and then overriding the copy information for each file that
> was then modified on the branch with information about that copy.  This
> is sufficient for svn to fully understand the history information for
> each and every file in the tag.
> 
> Unfortunately, git-svn mis-interprets this when building its graph of
> what happened and while it copies the right *content* into the release
> branch, it does not copy the right *history*.  The SVN R operation
> copies the history from named revision, not just the content.  That's
> the significant difference between the two.
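
A minimal sketch of how the copy records discussed here could be pulled out of `svn log -v` output for inspection. It only parses changed-path lines of the form shown earlier in this thread and does not decide which parent is correct. The names `CHANGED_PATH` and `copy_sources` are hypothetical, and paths containing spaces are assumed not to occur.

```python
import re

# Matches a changed-path line as printed by `svn log -v`, for example:
#   "   R /tags/x/gcc/expr.c (from /branches/b/gcc/expr.c:43255)"
# Assumption: paths contain no whitespace.
CHANGED_PATH = re.compile(
    r"^\s*(?P<action>[AMDR]) (?P<path>/\S+)"
    r"(?: \(from (?P<src>/\S+):(?P<rev>\d+)\))?\s*$"
)


def copy_sources(log_lines):
    """Return (action, path, source_path, source_rev) for each copied path."""
    copies = []
    for line in log_lines:
        m = CHANGED_PATH.match(line)
        if m and m.group("src"):
            copies.append(
                (m.group("action"), m.group("path"),
                 m.group("src"), int(m.group("rev")))
            )
    return copies


sample = [
    "   A /tags/gcc_3_0_release (from /trunk:39596)",
    "   R /tags/gcc_3_0_release/gcc/expr.c"
    " (from /branches/gcc-3_0-branch/gcc/expr.c:43255)",
    "   R /tags/gcc_3_0_release/gcc/reload.c"
    " (from /branches/gcc-3_0-branch/gcc/reload.c:42007)",
]

for action, path, src, rev in copy_sources(sample):
    print(action, path, "<-", src, "@", rev)
```

Listing the `R` entries separately from the top-level `A` entry makes it visible that the per-file sources point at the release branch, at different revisions, while the directory copy points at trunk.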
> 
> R
>> IMO, from such history (absent external knowledge about better reparenting options) the best choice for parent branch is /trunk@39596, not /branches/gcc-3_0-branch at a random revision from the replaced files.
>> 
>> Still, I see your point, and I will fix reparenting support.  Whether GCC community opts to reparent or not reparent is a different topic.

I've added proper reparenting support to the svn-git scripts, and gcc-reparent will be updated in a day or so.  I've also added a few minor improvements and fixed things that Joseph pointed out in my conversion.

Once gcc-reparent conversion is regenerated, I'll do another round of comparisons between it and whatever the latest reposurgeon version is.

--
Maxim Kuvyrkov
https://www.linaro.org

>> --
>> Maxim Kuvyrkov
>> https://www.linaro.org
>> 
>> 
>>> git blame expr.c:
>>> 
>>> ba0a9cb85431 (Richard Kenner         1992-03-03 23:34:57 +0000   396)         return temp;
>>> ba0a9cb85431 (Richard Kenner         1992-03-03 23:34:57 +0000   397)       }
>>> 5fbf0b0d5828 (no-author              2001-06-17 19:44:25 +0000   398)     /* Copy the address into a pseudo, so that the returned value
>>> 5fbf0b0d5828 (no-author              2001-06-17 19:44:25 +0000   399)        remains correct across calls to emit_queue.  */
>>> 5fbf0b0d5828 (no-author              2001-06-17 19:44:25 +0000   400)     XEXP (new, 0) = copy_to_reg (XEXP (new, 0));
>>> 59f26b7caad9 (Richard Kenner         1994-01-11 00:23:47 +0000   401)     return new;
>>> 
>>> git log 5fbf0b0d5828
>>> commit 5fbf0b0d5828687914c1c18a83ff12c8627d5a70 (HEAD, tag: gcc_3_0_release)
>>> Author: no-author <no-author@gcc.gnu.org>
>>> Date:   Sun Jun 17 19:44:25 2001 +0000
>>> 
>>>   This commit was manufactured by cvs2svn to create tag
>>>   'gcc_3_0_release'.
>>> 
>>> while svn blame expr.c correctly shows:
>>> 
>>>  386     kenner             return temp;
>>>  386     kenner           }
>>> 42209     bernds         /* Copy the address into a pseudo, so that the returned value
>>> 42209     bernds            remains correct across calls to emit_queue.  */
>>> 42209     bernds         XEXP (new, 0) = copy_to_reg (XEXP (new, 0));
>>> 6375     kenner         return new;
>>> 
>>> svn log -r42209 ^/
>>> ------------------------------------------------------------------------
>>> r42209 | bernds | 2001-05-17 18:07:08 +0100 (Thu, 17 May 2001) | 2 lines
>>> 
>>> Fix queueing-related bugs
>>> 
>>> In other words, svn can correctly track the files that were modified on the release branch, while the git conversion loses that information, rolling up all the diffs on the release branch into a single unattributed commit.
>>> 
>>> As I said, gcc-reparent is better in this regard, but there are still artefacts from conversion, such as incorrect merge records, that show up.
>>> 
>>> R.
>>> 
>>>> 
>>>>> 
>>>>> gcc-reparent is better, but many (most?) of the release tags are shown
>>>>> as merge commits with a fake parent back to the gcc-3 branch point,
>>>>> which is certainly not what happened when the tagging was done at that
>>>>> time.
>>>> 
>>>> I agree with you here.
>>>> 
>>>>> 
>>>>> Both of these factually misrepresent the history at the time of the
>>>>> release tag being made.
>>>> 
>>>> Yes and no.  Gcc-pretty repository mirrors SVN history.  And regarding the need for reparenting -- we lived with current history for gcc-3 release tags for a long time.  I would argue their continued brokenness is not a show-stopper.
>>>> 
>>>> Looking at this from a different perspective, when I posted the initial svn-git scripts back in Summer, the community roughly agreed on a plan to
>>>> 1. Convert entire SVN history to git.
>>>> 2. Use the stock git history rewrite tools (git filter-branch) to fixup what we want, e.g., reparent tags and branches or set better author/committer entries.
>>>> 
>>>> Gcc-pretty does (1) in entirety.
>>>> 
>>>> For reparenting, I tried a 15min fix to my scripts to enable reparenting, which worked, but with artifacts like the merge commit from old and new parents.  I will drop this and instead use tried-and-true "git filter-branch" to reparent those tags and branches, thus producing gcc-reparent from gcc-pretty.
>>>> 
>>>>> 
>>>>> As for converting my script to work with your tools, I'm afraid I don't
>>>>> have time to work on that right now.  I'm still bogged down validating
>>>>> the incorrect bug ids that the script has identified for some commits.
>>>>> I'm making good progress (we're down to 160 unreviewed commits now), but
>>>>> it is still going to take what time I have over the next week to
>>>>> complete that task.
>>>>> 
>>>>> Furthermore, there is no documentation on how your conversion scripts
>>>>> work, so it is not possible for me to test any work I might do in order
>>>>> to validate such changes.  Not being able to run the script locally to
>>>>> test changes would be a non-starter.
>>>>> 
>>>>> You are welcome, of course, to clone the script I have and attempt to
>>>>> modify it yourself, it's reasonably well documented.  The sources can be
>>>>> found in esr's gcc-conversion repository here:
>>>>> https://gitlab.com/esr/gcc-conversion.git
>>>> 
>>>> --
>>>> Maxim Kuvyrkov
>>>> https://www.linaro.org
>>>> 
>>>>> 
>>>>> 
>>>>>> Meanwhile, I'm going to add additional root commits to my gcc-reparent conversion to bring in "missing" branches (the ones, which don't share history with trunk@1) and restart daily updates of gcc-reparent conversion.
>>>>>> 
>>>>>> Finally, with the comparison data I have, I consider statements about git-svn's poor quality to be very misleading.  Git-svn may have had serious bugs years ago when Eric R. evaluated it and started his work on reposurgeon.  But a lot of development has happened and many problems have been fixed since then.  At the moment it is reposurgeon that is producing conversions with obscure mistakes in repository metadata.
>>>>>> 
>>>>>> 
>>>>>> === Missed merges ===
>>>>>> 
>>>>>> Reposurgeon misses merges from trunk on 130+ branches.  I've spot-checked ARM/hard_vfp_branch and redhat/gcc-9-branch and, indeed, rather mundane merges were omitted.  Below is analysis for ARM/hard_vfp_branch.
>>>>>> 
>>>>>> $ git log --stat refs/remotes/gcc-reposurgeon-6a/ARM/hard_vfp_branch~4
>>>>>> ----
>>>>>> commit ef92c24b042965dfef982349cd5994a2e0ff5fde
>>>>>> Author: Richard Earnshaw <rearnsha@gcc.gnu.org>
>>>>>> Date:   Mon Jul 20 08:15:51 2009 +0000
>>>>>> 
>>>>>>  Merge trunk through to r149768
>>>>>> 
>>>>>>  Legacy-ID: 149804
>>>>>> 
>>>>>> COPYING.RUNTIME                                     |    73 +
>>>>>> ChangeLog                                           |   270 +-
>>>>>> MAINTAINERS                                         |    19 +-
>>>>>> <MANY OTHER FILES>
>>>>>> ----
>>>>>> 
>>>>>> at the same time for svn-git scripts we have:
>>>>>> 
>>>>>> $ git log --stat refs/remotes/gcc-reparent/ARM/hard_vfp_branch~4
>>>>>> ----
>>>>>> commit ce7d5c8df673a7a561c29f095869f20567a7c598
>>>>>> Merge: 4970119c20da 3a69b1e566a7
>>>>>> Author: Richard Earnshaw <rearnsha@arm.com>
>>>>>> Date:   Mon Jul 20 08:15:51 2009 +0000
>>>>>> 
>>>>>>  Merge trunk through to r149768
>>>>>> 
>>>>>>  git-svn-id: https://gcc.gnu.org/svn/gcc/branches/ARM/hard_vfp_branch@149804 138bc75d-0d04-0410-961f-82ee72b054a4
>>>>>> ----
>>>>>> 
>>>>>> ... which agrees with
>>>>>> $ svn propget svn:mergeinfo file:///home/maxim.kuvyrkov/tmpfs-stuff/svnrepo/branches/ARM/hard_vfp_branch@149804
>>>>>> /trunk:142588-149768
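
A minimal sketch of reading an `svn:mergeinfo` value such as the one above into revision ranges, so that a converted branch can be checked for a matching merge parent. The function name `parse_mergeinfo` is hypothetical; the sketch assumes the usual property layout of one `path:ranges` entry per line, with comma-separated revisions or ranges and an optional trailing `*` on non-inheritable ranges.

```python
def parse_mergeinfo(text):
    """Map each merge source path to a list of (first_rev, last_rev) ranges."""
    merged = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The revision list follows the last colon on the line.
        path, _, revs = line.rpartition(":")
        ranges = []
        for item in revs.split(","):
            item = item.rstrip("*")  # '*' marks a non-inheritable range
            first, _, last = item.partition("-")
            ranges.append((int(first), int(last or first)))
        merged[path] = ranges
    return merged


info = parse_mergeinfo("/trunk:142588-149768\n")
print(info)
```

With the value quoted above, the result records a single range on `/trunk` ending at r149768, the same revision named in the commit message "Merge trunk through to r149768".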
>>>>>> 
>>>>>> === Bad author entries ===
>>>>>> 
>>>>>> Reposurgeon-6a conversion has authors "12:46:56 1998 Jim Wilson" and "2005-03-18 Kazu Hirata".  It is rather obvious that a person's name is unlikely to start with a digit.
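
A minimal sketch of a sanity check that would catch author entries like the two above. The helper name `looks_malformed` and the exact patterns are assumptions chosen for illustration; they are not taken from either conversion's scripts.

```python
import re

# Date (YYYY-MM-DD) or time (HH:MM or HH:MM:SS) fragments that have
# leaked into an author name, typically from a ChangeLog header line.
DATE_OR_TIME = re.compile(r"\d{4}-\d{2}-\d{2}|\d{1,2}:\d{2}(?::\d{2})?")


def looks_malformed(name):
    """Return True if an author name is empty, starts with a digit,
    or contains a date or time fragment."""
    name = name.strip()
    return not name or name[0].isdigit() or bool(DATE_OR_TIME.search(name))


for candidate in ["12:46:56 1998 Jim Wilson", "2005-03-18 Kazu Hirata", "Jim Wilson"]:
    print(candidate, "->", looks_malformed(candidate))
```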
>>>>>> 
>>>>>> === Missed authors ===
>>>>>> 
>>>>>> Reposurgeon-6a conversion misses many authors, below is a list of people with names starting with "A".
>>>>>> 
>>>>>> Akos Kiss
>>>>>> Anders Bertelrud
>>>>>> Andrew Pochinsky
>>>>>> Anton Hartl
>>>>>> Arthur Norman
>>>>>> Aymeric Vincent
>>>>>> 
>>>>>> === Conservative author entries ===
>>>>>> 
>>>>>> Reposurgeon-6a conversion uses default "@gcc.gnu.org" emails for many commits where svn-git conversion manages to extract valid email from commit data.  This happens for hundreds of author entries.
>>>>>> 
>>>>>> Regards,
>>>>>> 
>>>>>> --
>>>>>> Maxim Kuvyrkov
>>>>>> https://www.linaro.org
>>>>>> 
>>>>>> 
>>>>>>> On Dec 26, 2019, at 7:11 PM, Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>>> On Dec 26, 2019, at 2:16 PM, Jakub Jelinek <jakub@redhat.com> wrote:
>>>>>>>> 
>>>>>>>> On Thu, Dec 26, 2019 at 11:04:29AM +0000, Joseph Myers wrote:
>>>>>>>> Is there some easy way (e.g. file in the conversion scripts) to correct
>>>>>>>> spelling and other mistakes in the commit authors?
>>>>>>>> E.g. there are misspelled surnames, etc. (e.g. looking at my name, I see
>>>>>>>> Jakub Jakub Jelinek (1):
>>>>>>>> Jakub Jeilnek (1):
>>>>>>>> Jelinek (1):
>>>>>>>> entries next to the expected one with most of the commits.
>>>>>>>> For the misspellings, wonder if e.g. we couldn't compute edit distances from
>>>>>>>> other names and if we have one with many commits and then one with very few
>>>>>>>> with small edit distance from those, flag it for human review.
>>>>>>> 
>>>>>>> This is close to what svn-git-author.sh script is doing in gcc-pretty and gcc-reparent conversions.  It ignores 1-3 character differences in author/committer names and email addresses.  I've audited results for all branches and didn't spot any mistakes.
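
A minimal sketch of the edit-distance idea discussed above: names with very few commits that sit within a small edit distance of a name with many commits are flagged for human review. This is not the logic of svn-git-author.sh; the function names and the thresholds `max_distance`, `rare` and `common` are hypothetical.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def suspicious_names(names, max_distance=3, rare=2, common=20):
    """Pair each rarely used name with a frequently used name close to it."""
    counts = Counter(names)
    frequent = [n for n, c in counts.items() if c >= common]
    flagged = []
    for name, count in counts.items():
        if count > rare:
            continue
        for target in frequent:
            if name != target and edit_distance(name, target) <= max_distance:
                flagged.append((name, target))
                break
    return sorted(flagged)


names = ["Jakub Jelinek"] * 50 + [
    "Jakub Jeilnek", "Jakub Jakub Jelinek", "Jelinek", "Richard Biener",
]
print(suspicious_names(names))
```

With these thresholds only the transposed spelling is flagged. The doubled first name and the bare surname each differ from the expected name by six characters, so catching them would need a different heuristic, such as comparing individual name tokens.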
>>>>>>> 
>>>>>>> In other news, I'm working on comparison of gcc-pretty, gcc-reparent and gcc-reposurgeon-5a repos among themselves.  Below are current notes for comparison of gcc-pretty/trunk and gcc-reposurgeon-5a/trunk.
>>>>>>> 
>>>>>>> == Merges on trunk ==
>>>>>>> 
>>>>>>> Reposurgeon creates merge entries on trunk when changes from a branch are merged into trunk.  This brings entire development history from the branch to trunk, which is both good and bad.  The good part is that we get more visibility into how the code evolved.  The bad part is that we get many "noisy" commits from merged branch (e.g., "Merge in trunk" every few revisions) and that our SVN branches are work-in-progress quality, not ready for review/commit quality.  It's common for files to be re-written in large chunks on branches.
>>>>>>> 
>>>>>>> Also, reposurgeon's commit logs don't have information on SVN path from which the change came, so there is no easy way to determine that a given commit is from a merged branch, not an original trunk commit.  Git-svn, on the other hand, provides "git-svn-id: <path>@<revision>" tags in its commit logs.
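
A minimal sketch of recovering the SVN origin from a converted commit message, based on the two trailer formats that appear in this thread: `git-svn-id: <url>@<revision> <uuid>` and `Legacy-ID: <revision>`. The function name `svn_origin` and the `repo_root` default are assumptions for illustration.

```python
import re

GIT_SVN_ID = re.compile(
    r"^\s*git-svn-id: (?P<url>\S+)@(?P<rev>\d+) (?P<uuid>[0-9a-f-]+)\s*$",
    re.MULTILINE,
)
LEGACY_ID = re.compile(r"^\s*Legacy-ID: (?P<rev>\d+)\s*$", re.MULTILINE)


def svn_origin(message, repo_root="https://gcc.gnu.org/svn/gcc"):
    """Return (svn_path, revision) for a commit message.

    svn_path is None when only a Legacy-ID trailer is present, because
    that trailer carries no path. Returns None if neither trailer is found.
    """
    m = GIT_SVN_ID.search(message)
    if m:
        url = m.group("url")
        path = url[len(repo_root):] if url.startswith(repo_root) else url
        return path, int(m.group("rev"))
    m = LEGACY_ID.search(message)
    if m:
        return None, int(m.group("rev"))
    return None


git_svn_msg = (
    "Merge trunk through to r149768\n\n"
    "git-svn-id: https://gcc.gnu.org/svn/gcc/branches/ARM/hard_vfp_branch@149804 "
    "138bc75d-0d04-0410-961f-82ee72b054a4\n"
)
legacy_msg = "Merge trunk through to r149768\n\nLegacy-ID: 149804\n"

print(svn_origin(git_svn_msg))
print(svn_origin(legacy_msg))
```

The second call returns a revision but no path, which is the gap described above: the revision alone does not say whether a commit originated on trunk or on a merged branch.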
>>>>>>> 
>>>>>>> My conversion follows current GCC development policy that trunk history should be linear.  Branch merges to trunk are squashed.  Merges between non-trunk branches are handled as specified by svn:mergeinfo SVN properties.
>>>>>>> 
>>>>>>> == Differences in trees ==
>>>>>>> 
>>>>>>> Git trees (aka filesystem content) match between pretty/trunk and reposurgeon-5a/trunk from current tip and up to svn's r130805.
>>>>>>> Here is SVN log of that revision (restoration of deleted trunk):
>>>>>>> ------------------------------------------------------------------------
>>>>>>> r130805 | dberlin | 2007-12-13 01:53:37 +0000 (Thu, 13 Dec 2007)
>>>>>>> Changed paths:
>>>>>>> A /trunk (from /trunk:130802)
>>>>>>> ------------------------------------------------------------------------
>>>>>>> 
>>>>>>> Reposurgeon conversion has:
>>>>>>> -------------
>>>>>>> commit 7e6f2a96e89d96c2418482788f94155d87791f0a
>>>>>>> Author: Daniel Berlin <dberlin@gcc.gnu.org>
>>>>>>> Date:   Thu Dec 13 01:53:37 2007 +0000
>>>>>>> 
>>>>>>> Readd trunk
>>>>>>> 
>>>>>>> Legacy-ID: 130805
>>>>>>> 
>>>>>>> .gitignore | 17 -----------------
>>>>>>> 1 file changed, 17 deletions(-)
>>>>>>> -------------
>>>>>>> and my conversion has:
>>>>>>> -------------
>>>>>>> commit fb128f3970789ce094c798945b4fa20eceb84cc7
>>>>>>> Author: Daniel Berlin <dberlin@dbrelin.org>
>>>>>>> Date:   Thu Dec 13 01:53:37 2007 +0000
>>>>>>> 
>>>>>>> Readd trunk
>>>>>>> 
>>>>>>> 
>>>>>>> git-svn-id: https://gcc.gnu.org/svn/gcc/trunk@130805 138bc75d-0d04-0410-961f-82ee72b054a4
>>>>>>> -------------
>>>>>>> 
>>>>>>> It appears that .gitignore has been added in r1 by reposurgeon and then deleted at r130805.  In the SVN repository .gitignore was added in r195087.  I speculate that addition of .gitignore at r1 is expected, but its deletion at r130805 is highly suspicious.
>>>>>>> 
>>>>>>> == Committer entries ==
>>>>>>> 
>>>>>>> Reposurgeon uses $user@gcc.gnu.org for committer email addresses even when it correctly detects author name from ChangeLog.
>>>>>>> 
>>>>>>> reposurgeon-5a:
>>>>>>> r278995 Martin Liska <mliska@suse.cz> Martin Liska <marxin@gcc.gnu.org>
>>>>>>> r278994 Jozef Lawrynowicz <jozef.l@mittosystems.com> Jozef Lawrynowicz <jozefl@gcc.gnu.org>
>>>>>>> r278993 Frederik Harwath <frederik@codesourcery.com> Frederik Harwath <frederik@gcc.gnu.org>
>>>>>>> r278992 Georg-Johann Lay <avr@gjlay.de> Georg-Johann Lay <gjl@gcc.gnu.org>
>>>>>>> r278991 Richard Biener <rguenther@suse.de> Richard Biener <rguenth@gcc.gnu.org>
>>>>>>> 
>>>>>>> pretty:
>>>>>>> r278995 Martin Liska <mliska@suse.cz> Martin Liska <mliska@suse.cz>
>>>>>>> r278994 Jozef Lawrynowicz <jozef.l@mittosystems.com> Jozef Lawrynowicz <jozef.l@mittosystems.com>
>>>>>>> r278993 Frederik Harwath <frederik@codesourcery.com> Frederik Harwath <frederik@codesourcery.com>
>>>>>>> r278992 Georg-Johann Lay <avr@gjlay.de> Georg-Johann Lay <avr@gjlay.de>
>>>>>>> r278991 Richard Biener <rguenther@suse.de> Richard Biener <rguenther@suse.de>
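
A minimal sketch of flagging the pattern shown in the two listings above, where the committer has the same name as the author but carries a default `@gcc.gnu.org` address. It assumes the one-line layout used in the listings (`rNNN Author <email> Committer <email>`); the names `ENTRY` and `defaulted_committers` are hypothetical.

```python
import re

# One listing line: revision, author name/email, committer name/email.
ENTRY = re.compile(
    r"^(?P<rev>r\d+) (?P<author>.+?) <(?P<author_email>[^>]+)> "
    r"(?P<committer>.+?) <(?P<committer_email>[^>]+)>$"
)


def defaulted_committers(lines, default_domain="gcc.gnu.org"):
    """Return (rev, author_email, committer_email) for entries where the
    committer shares the author's name but uses the default domain."""
    flagged = []
    for line in lines:
        m = ENTRY.match(line.strip())
        if not m:
            continue
        a_email = m.group("author_email")
        c_email = m.group("committer_email")
        if (m.group("author") == m.group("committer")
                and c_email.endswith("@" + default_domain)
                and a_email != c_email):
            flagged.append((m.group("rev"), a_email, c_email))
    return flagged


listing = [
    "r278995 Martin Liska <mliska@suse.cz> Martin Liska <marxin@gcc.gnu.org>",
    "r278991 Richard Biener <rguenther@suse.de> Richard Biener <rguenther@suse.de>",
]
print(defaulted_committers(listing))
```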
>>>>>>> 
>>>>>>> == Bad summary line ==
>>>>>>> 
>>>>>>> While looking around r138087, the commit below caught my eye.  Are the contents of the summary line as expected?
>>>>>>> 
>>>>>>> commit cc2726884d56995c514d8171cc4a03657851657e
>>>>>>> Author: Chris Fairles <chris.fairles@gmail.com>
>>>>>>> Date:   Wed Jul 23 14:49:00 2008 +0000
>>>>>>> 
>>>>>>> acinclude.m4 ([GLIBCXX_CHECK_CLOCK_GETTIME]): Define GLIBCXX_LIBS.
>>>>>>> 
>>>>>>> 2008-07-23  Chris Fairles <chris.fairles@gmail.com>
>>>>>>> 
>>>>>>>         * acinclude.m4 ([GLIBCXX_CHECK_CLOCK_GETTIME]): Define GLIBCXX_LIBS.
>>>>>>>         Holds the lib that defines clock_gettime (-lrt or -lposix4).
>>>>>>>         * src/Makefile.am: Use it.
>>>>>>>         * configure: Regenerate.
>>>>>>>         * configure.in: Likewise.
>>>>>>>         * Makefile.in: Likewise.
>>>>>>>         * src/Makefile.in: Likewise.
>>>>>>>         * libsup++/Makefile.in: Likewise.
>>>>>>>         * po/Makefile.in: Likewise.
>>>>>>>         * doc/Makefile.in: Likewise.
>>>>>>> 
>>>>>>> Legacy-ID: 138087
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Maxim Kuvyrkov
>>>>>>> https://www.linaro.org
>> 
> 

