Bug 81114 - GNAT mishandles filenames with UTF8 chars on case-insensitive filesystems
Status: SUSPENDED
Alias: None
Product: gcc
Classification: Unclassified
Component: ada
Version: 8.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-06-16 16:14 UTC by simon
Modified: 2017-07-09 10:39 UTC
CC List: 1 user

See Also:
Host:
Target:
Build: x86_64-apple-darwin16
Known to work:
Known to fail:
Last reconfirmed: 2017-07-04 00:00:00


Attachments
Demonstrator (with BOM) (255 bytes, application/octet-stream)
2017-06-16 16:14 UTC, simon

Description simon 2017-06-16 16:14:16 UTC
Created attachment 41575
Demonstrator (with BOM)

The attached demonstrator contains two files, each with a UTF8
BOM. One file, pack3_user.adb, contains

   with Páck3;
   procedure Pack3_User is
   begin
      null;
   end Pack3_User;

while the other, páck3.ads, contains just

   package Páck3 is
   end Páck3;

There is no problem compiling on Linux (Debian Jessie). However, on
Darwin and Windows, we get

   $ gnatmake -c -f pack3_user.adb
   gcc -c pack3_user.adb
   gnatmake: "p?ck3.ads" not found

This is perhaps partly explained by looking at pack3_user.ali:

====================
V "GNAT Lib v8"
M P W=8
P ZX

RN

U pack3_user%b		pack3_user.adb		be67fdbd NE OO SU
W pUe1ck3%s		p?ck3.ads		p?ck3.ali           [A]

D p?ck3.ads		20170615165452 7221d8b1 páck3%s             [B]
D pack3_user.adb	20170616143450 cc46250c pack3_user%b
D system.ads		20161018202953 085b6ffb system%s
X 1 páck3.ads                                                       [C]
[...]
====================

from which ([A], [B]) it is clear that GNAT is sometimes confused
about the file names.

Interestingly, sometimes it gets it right (last component on [B],
[C]).

The ALI file is written by Lib.Writ.Write_ALI. In two places it says

   if not File_Names_Case_Sensitive then
      Get_Name_String (Fname);
      To_Lower (Name_Buffer (1 .. Name_Len));    <<<<<<<<<
      Fname := Name_Find;
   end if;

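which is clearly the Wrong Thing to do if the file name is not
ASCII. In the ALI file above, the small-a-acute, which should be
encoded as C3 A1, has been rendered as E3 A1. This is consistent with
To_Lower treating each byte as a Latin-1 character: C3 on its own is
capital-A-tilde, whose lower-case form is E3.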

Using the undocumented env var GNAT_FILE_NAME_CASE_SENSITIVE alters
things:

   $ GNAT_FILE_NAME_CASE_SENSITIVE=1 gnatmake -c -f pack3_user.adb
   gcc -c pack3_user.adb
   gcc -c páck3.ads

so it's clear that the problem lies in this region.

Interestingly, [B] and [C] above show that the compiler does
understand how to lower-case extended characters in strings. I haven't
yet been able to find where this is done.
Comment 1 simon 2017-07-04 13:49:56 UTC
Further:

$ GNAT_FILE_NAME_CASE_SENSITIVE=1 gnatmake -c p*.ads
gcc -c páck3.ads
páck3.ads:1:10: warning: file name does not match unit name, should be "páck3.ads"

The reason for this apparently-bizarre message is[1] that macOS takes 
the composed form (lowercase a acute) and converts it under the hood 
to what HFS+ insists on, the fully decomposed form (lowercase a, combining 
acute); thus the names are actually different even though they _look_ 
the same.

I have to say that, great as it would be to have this fixed, the changes 
required would be extensive, and I can’t see that anyone would think it 
worth the trouble.

The recommendation would be "don’t use international characters in the 
names of library units".

[1] https://stackoverflow.com/a/6153713/40851
Comment 2 Eric Botcazou 2017-07-04 19:46:01 UTC
Right.  And people should use sane filesystems (and sane OSes to begin with).
Comment 3 simon 2017-07-04 19:53:46 UTC
Just for interest, this not-very-good code will successfully convert
the uppercase-a-acute input c381 to uppercase-a/combining-acute 41cc81:

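#include <stdio.h>
#include <iconv.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
  /* U+00C1 (uppercase a acute), precomposed, in UTF-8 */
  uint8_t codepoint[] = {0xc3, 0x81, 0};
  char *input = (char *) codepoint;
  size_t in_size = 2;

  char output_buffer[10];
  memset(output_buffer, 0, sizeof(output_buffer));
  char *output = output_buffer;
  size_t out_size = sizeof(output_buffer) - 1;  /* keep a terminating NUL */

  /* the "utf8-mac" encoding is only provided by macOS's iconv */
  iconv_t cd = iconv_open("utf8-mac", "UTF-8");
  if (cd == (iconv_t) -1) {
    perror("iconv_open");
    return 1;
  }

  if (iconv(cd, &input, &in_size, &output, &out_size) == (size_t) -1) {
    perror("iconv");
    iconv_close(cd);
    return 1;
  }
  iconv_close(cd);

  /* in_size and out_size now hold the bytes left unconverted / unused */
  printf("in %zu out %zu result \"%s\"\n", in_size, out_size, output_buffer);

  return 0;
}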

but of course only on macOS - https://stackoverflow.com/a/23159081/40851
Comment 4 simon 2017-07-09 10:39:22 UTC
(In reply to Eric Botcazou from comment #2)

When I said in comment 1 

>I have to say that, great as it would be to have this fixed, the changes 
>required would be extensive, and I can’t see that anyone would think it 
>worth the trouble.

I meant that coping with macOS’s HFS+ behaviour w.r.t. NFC vs NFD was 
something it’d be unreasonable to spend effort on fixing.

The main point of this PR is that you can’t use extended characters in 
unit names on case-insensitive filesystems, *which includes Windows*. 
Fixing that problem (I can see it might mean introducing a new adaint.c 
interface "is filesystem UTF8?") would be a good thing. Can the compiler 
use iconv, or Ada.Wide_Characters.Handling, or Ada.Strings.UTF_Encoding.[Wide_]Strings?

The awkwardness discussed in comment 1 isn’t really a problem except 
when compiling the offending unit from the command line; when compiled 
as part of the closure by gnatmake there’s no problem. I guess gnatmake 
reads the unit name in NFC and gets the file name in NFD from the file 
system.

I think there _is_ a problem in gprbuild but of course that’s nothing 
to do with GCC.

Please can this PR be reopened?