This is the mail archive of the
mailing list for the GCC project.
Re: Git conversion: fixing email addresses from ChangeLog files
On 28/12/2019 17:14, Joseph Myers wrote:
> On Sat, 28 Dec 2019, Richard Earnshaw (lists) wrote:
>> My suggestion would be that we try to canonicalize all the author
>> entries to UTF-8 as that avoids the limitations of ISO-8859-1, but that
>> would probably need further fixups to detect the additional names that
>> need rewriting.
> What I've implemented in bugdb.py already includes converting ISO-8859-1
> to UTF-8 (in any case where the author name is not valid UTF-8 - a general
> property of text encodings is that if something is valid UTF-8, it almost
> certainly is encoded in ASCII or UTF-8 already), with special
> handling of NBSP and with fixups for all the cases where the results of
> converting ISO-8859-1 to UTF-8 looked wrong (i.e. where it looked like the
> name in the original ChangeLog was not in fact ISO-8859-1).
> I've also now made bugdb.py check the list of fixups both before and after
> recoding (which may help in some cases where e.g. a fixup is putting a
> name in canonical form, meaning such a fixup doesn't need to be given in
> forms with both UTF-8 and ISO-8859-1 encodings even if the name appears
> with both those encodings in the history).
> Because the author extraction is based on the ChangeLog entry included in
> the original commit, any subsequent commits that (wrongly or correctly)
> recoded ChangeLog entries are not relevant.
I've added the list of emails that I posted yesterday to the conversion
scripts. I've not written anything to reprocess that yet. I want to
leave that until we've completed the general review of the changes we
want. Auto-generating that data from the list will probably
be easier than maintaining it inside bugdb.py for now.
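
For reference, the recoding and fixup lookup described in the quoted text could be sketched roughly as below. This is not the code in bugdb.py: the names recode_author, canonical_author and FIXUPS are hypothetical, the fixup entry is made up, and turning NBSP into an ordinary space is only an assumption about what the "special handling" does.

```python
NBSP = "\u00a0"

# Hypothetical fixup table: observed spelling -> canonical form.
# The entry is invented for illustration only.
FIXUPS = {
    "José Exmaple": "José Example",
}


def recode_author(raw: bytes) -> str:
    """Decode an author name, trying UTF-8 first and ISO-8859-1 otherwise."""
    try:
        # Bytes that decode cleanly as UTF-8 are taken to be ASCII or UTF-8.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this cannot fail.
        text = raw.decode("iso-8859-1")
    # Assumed NBSP handling: replace it with an ordinary space.
    return text.replace(NBSP, " ")


def canonical_author(raw: bytes, fixups=FIXUPS) -> str:
    """Look the name up in the fixup table before and after recoding."""
    # "Before" view: the raw bytes read byte-for-byte as ISO-8859-1.
    before = raw.decode("iso-8859-1")
    if before in fixups:
        return fixups[before]
    # "After" view: the recoded name; unchanged if no fixup applies.
    after = recode_author(raw)
    return fixups.get(after, after)
```

With a single entry in the table, the same misspelt name is corrected whether the history has it in ISO-8859-1 (matched before recoding) or in UTF-8 (matched after recoding), which is the point of checking at both stages.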