This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: Git conversion: fixing email addresses from ChangeLog files
- From: Joseph Myers <jsm at polyomino dot org dot uk>
- To: "Richard Earnshaw (lists)" <Richard dot Earnshaw at arm dot com>
- Cc: Jakub Jelinek <jakub at redhat dot com>, gcc at gcc dot gnu dot org
- Date: Sat, 28 Dec 2019 17:14:53 +0000 (UTC)
- Subject: Re: Git conversion: fixing email addresses from ChangeLog files
- References: <c23ff406-c3af-52c8-5c1e-4c921790389f@arm.com> <20191228120427.GQ10088@tucnak> <8aea9992-b24e-8e31-d515-a55fb45639e0@arm.com>
On Sat, 28 Dec 2019, Richard Earnshaw (lists) wrote:
> My suggestion would be that we try to canonicalize all the author
> entries to UTF-8 as that avoids the limitations of ISO-8859-1, but that
> would probably need further fixups to detect the additional names that
> need rewriting.
What I've implemented in bugdb.py already includes converting ISO-8859-1
to UTF-8 (in any case where the author name is not valid UTF-8 - a general
property of text encodings is that if something is valid UTF-8, it almost
certainly is encoded in ASCII or UTF-8 already), with special
handling of NBSP and with fixups for all the cases where the results of
converting ISO-8859-1 to UTF-8 looked wrong (i.e. where it looked like the
name in the original ChangeLog was not in fact ISO-8859-1).
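
A minimal sketch of that recoding rule follows. It is not the code from
bugdb.py: the function name recode_author is hypothetical, and replacing
NBSP with an ordinary space is only an assumption about what the "special
handling" might do.

```python
NBSP = "\u00a0"  # U+00A0, byte 0xA0 in ISO-8859-1


def recode_author(raw: bytes) -> str:
    """Decode an author name, treating input that is not valid UTF-8 as ISO-8859-1."""
    try:
        # Valid UTF-8 (which includes plain ASCII) is kept as-is.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte sequence decodes under ISO-8859-1, so this cannot fail.
        text = raw.decode("iso-8859-1")
    # Assumption: NBSP is normalised to a plain space.
    return text.replace(NBSP, " ")
```

Names that were in some other 8-bit encoding would still decode without
error here but come out wrong, which is why separate fixups are needed for
those cases.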
I've also now made bugdb.py check the list of fixups both before and after
recoding (which may help in some cases where e.g. a fixup is putting a
name in canonical form, meaning such a fixup doesn't need to be given in
forms with both UTF-8 and ISO-8859-1 encodings even if the name appears
with both those encodings in the history).
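
As a rough illustration of checking the fixups both before and after
recoding, consider the sketch below. The FIXUPS table, its single entry,
and the function names are all hypothetical, and keying the table by raw
bytes is an assumption made for the example rather than a description of
bugdb.py.

```python
# Hypothetical fixup table: raw name as written -> canonical name.
# The entry is given in its UTF-8 form only.
FIXUPS = {
    "J. Do\u00e9".encode("utf-8"): "Jane Do\u00e9",
}


def recode(raw: bytes) -> str:
    """Decode as UTF-8 if valid, otherwise as ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


def canonical_author(raw: bytes) -> str:
    # First check: the name exactly as it appears in the ChangeLog.
    if raw in FIXUPS:
        return FIXUPS[raw]
    # Second check: the name after recoding to UTF-8.
    recoded = recode(raw)
    return FIXUPS.get(recoded.encode("utf-8"), recoded)
```

With this arrangement the single UTF-8 entry also matches the ISO-8859-1
spelling of the same name, since that spelling is recoded to UTF-8 before
the second lookup.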
Because the author extraction is based on the ChangeLog entry included in
the original commit, any subsequent commits that (wrongly or correctly)
recoded ChangeLog entries are not relevant.
--
Joseph S. Myers
jsm@polyomino.org.uk