Bug 81114 - GNAT mishandles filenames with UTF8 chars on case-insensitive filesystems
Status: SUSPENDED
Alias: None
Product: gcc
Classification: Unclassified
Component: ada
Version: 8.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-06-16 16:14 UTC by simon
Modified: 2023-10-21 15:58 UTC
CC List: 2 users

See Also:
Host:
Target:
Build: x86_64-apple-darwin16
Known to work:
Known to fail:
Last reconfirmed: 2023-10-18 00:00:00


Attachments
Demonstrator (with BOM) (255 bytes, application/octet-stream)
2017-06-16 16:14 UTC, simon
C demonstrator (205 bytes, application/zip)
2023-10-18 16:09 UTC, simon

Description simon 2017-06-16 16:14:16 UTC
Created attachment 41575
Demonstrator (with BOM)

The attached demonstrator contains two files, each with a UTF8
BOM. One file, pack3_user.adb, contains

   with Páck3;
   procedure Pack3_User is
   begin
      null;
   end Pack3_User;

while the other, páck3.ads, contains just

   package Páck3 is
   end Páck3;

There is no problem compiling on Linux (Debian Jessie). However, on
Darwin and Windows, we get

   $ gnatmake -c -f pack3_user.adb
   gcc -c pack3_user.adb
   gnatmake: "p?ck3.ads" not found

This is perhaps partly explained by looking at pack3_user.ali:

====================
V "GNAT Lib v8"
M P W=8
P ZX

RN

U pack3_user%b		pack3_user.adb		be67fdbd NE OO SU
W pUe1ck3%s		p?ck3.ads		p?ck3.ali           [A]

D p?ck3.ads		20170615165452 7221d8b1 páck3%s             [B]
D pack3_user.adb	20170616143450 cc46250c pack3_user%b
D system.ads		20161018202953 085b6ffb system%s
X 1 páck3.ads                                                       [C]
[...]
====================

from which ([A], [B]) it is clear that GNAT is sometimes confused
about the file names.

Interestingly, sometimes it gets it right (last component on [B],
[C]).

The ALI file is written by Lib.Writ.Write_ALI. In two places it says

   if not File_Names_Case_Sensitive then
      Get_Name_String (Fname);
      To_Lower (Name_Buffer (1 .. Name_Len));    <<<<<<<<<
      Fname := Name_Find;
   end if;

which is clearly the Wrong Thing to do if the file name is not
ASCII. In the ALI file above, the small-a-acute, which should be
encoded as C3 A1, has been rendered as E3 A1.
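
The C3 -> E3 change is what a byte-wise Latin-1 lower-casing would do to
UTF-8 input. The sketch below is only an illustration, not GNAT's code:
latin1_to_lower is a hypothetical stand-in, and it assumes To_Lower treats
each byte as one ISO 8859-1 character. Under that assumption the lead byte
C3 (capital A tilde in Latin-1) is folded to E3, while A1 is left alone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a Latin-1 To_Lower applied byte by byte.
   Assumption: every byte is treated as one ISO 8859-1 character. */
void latin1_to_lower(unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];
        int ascii_upper = (c >= 'A' && c <= 'Z');
        /* 0xC0..0xDE are Latin-1 upper-case letters, except 0xD7
           (the multiplication sign). */
        int latin1_upper = (c >= 0xC0 && c <= 0xDE && c != 0xD7);
        if (ascii_upper || latin1_upper)
            buf[i] = (unsigned char)(c + 0x20);
    }
}
```

Applied to the bytes of "páck3.ads" (70 C3 A1 63 ...), this turns the
second byte into E3 and leaves an invalid UTF-8 sequence, which matches
the "p?ck3.ads" seen at [A] and [B].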

Using the undocumented env var GNAT_FILE_NAME_CASE_SENSITIVE alters
things:

   $ GNAT_FILE_NAME_CASE_SENSITIVE=1 gnatmake -c -f pack3_user.adb
   gcc -c pack3_user.adb
   gcc -c páck3.ads

so it's clear that the problem lies in this region.

Interestingly, [B] and [C] above show that the compiler does
understand how to lower-case extended characters in strings. I haven't
yet been able to find where this is done.
Comment 1 simon 2017-07-04 13:49:56 UTC
Further:

$ GNAT_FILE_NAME_CASE_SENSITIVE=1 gnatmake -c p*.ads
gcc -c páck3.ads
páck3.ads:1:10: warning: file name does not match unit name, should be "páck3.ads"

The reason for this apparently-bizarre message is[1] that macOS takes 
the composed form (lowercase a acute) and converts it under the hood 
to what HFS+ insists on, the fully decomposed form (lowercase a, combining 
acute); thus the names are actually different even though they _look_ 
the same.
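
To make the difference concrete, here is a minimal sketch that compares
the two spellings byte by byte. The names nfc_name, nfd_name and
same_file_name are made up for the illustration; the byte values are the
standard UTF-8 encodings of U+00E1 and of "a" followed by U+0301.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Composed (NFC) spelling: U+00E1 is the two bytes C3 A1. */
const char nfc_name[] = "p\xC3\xA1" "ck3.ads";

/* Decomposed (NFD) spelling: 'a' followed by U+0301, bytes CC 81. */
const char nfd_name[] = "pa\xCC\x81" "ck3.ads";

/* Returns 1 when the two names are byte-for-byte identical,
   which is all that a plain string comparison can see. */
int same_file_name(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

same_file_name (nfc_name, nfd_name) returns 0, and the two strings even
differ in length (10 bytes against 11), although a terminal displays both
as páck3.ads.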

I have to say that, great as it would be to have this fixed, the changes 
required would be extensive, and I can’t see that anyone would think it 
worth the trouble.

The recommendation would be "don’t use international characters in the 
names of library units".

[1] https://stackoverflow.com/a/6153713/40851
Comment 2 Eric Botcazou 2017-07-04 19:46:01 UTC
Right.  And people should use sane filesystems (and sane OSes to begin with).
Comment 3 simon 2017-07-04 19:53:46 UTC
Just for interest, this not-very-good code will successfully convert
the uppercase-a-acute input c381 to uppercase-a/combining-acute 41cc81:

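#include <stdio.h>
#include <iconv.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
  /* UTF-8 encoding of U+00C1 (uppercase A acute), NUL-terminated. */
  uint8_t codepoint[] = {0xc3, 0x81, 0};
  char *input = (char *) codepoint;
  size_t in_size = 2;

  char output_buffer[10];
  memset(output_buffer, 0, sizeof(output_buffer));
  char *output = output_buffer;
  size_t out_size = sizeof(output_buffer) - 1;  /* keep the final NUL */

  /* "utf8-mac" is the macOS-only encoding that gives the decomposed form. */
  iconv_t cd = iconv_open("utf8-mac", "UTF-8");
  if (cd == (iconv_t) -1) {
    perror("iconv_open");
    return 1;
  }

  if (iconv(cd, &input, &in_size, &output, &out_size) == (size_t) -1)
    perror("iconv");
  iconv_close(cd);

  /* in/out are the byte counts left unconsumed/unused after conversion. */
  printf("in %zu out %zu result \"%s\"\n", in_size, out_size, output_buffer);

  return 0;
}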

but of course only on macOS - https://stackoverflow.com/a/23159081/40851
Comment 4 simon 2017-07-09 10:39:22 UTC
(In reply to Eric Botcazou from comment #2)

When I said in comment 1 

>I have to say that, great as it would be to have this fixed, the changes 
>required would be extensive, and I can’t see that anyone would think it 
>worth the trouble.

I meant that coping with macOS’s HFS+ behaviour w.r.t. NFC vs NFD was 
something it’d be unreasonable to spend effort on fixing.

The main point of this PR is that you can’t use extended characters in 
unit names on case-insensitive filesystems, *which includes Windows*. 
Fixing that problem (I can see it might mean introducing a new adaint.c 
interface "is filesystem UTF8?") would be a good thing. Can the compiler 
use iconv? or Ada.Wide_Characters.Handling, Ada.Strings.UTF_Encoding.[Wide_]Strings?
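
As a rough sketch of the least invasive option (not a proposed patch, and
ascii_only_to_lower is a hypothetical name), the folding could be limited
to ASCII letters when the filesystem is known to use UTF-8. Every byte of
a multi-byte UTF-8 sequence is 0x80 or above, so it would pass through
unchanged and the name would stay valid UTF-8. This does not fold extended
upper-case letters, so it is only a partial answer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: lower-case ASCII letters only. Bytes >= 0x80,
   i.e. every byte of a multi-byte UTF-8 sequence, are left untouched. */
void ascii_only_to_lower(unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] >= 'A' && buf[i] <= 'Z')
            buf[i] = (unsigned char)(buf[i] + ('a' - 'A'));
    }
}
```

With this, "PáCK3.ADS" becomes "páck3.ads" with C3 A1 intact; a name
containing uppercase A acute (C3 81) would keep that character as it is.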

The awkwardness discussed in comment 1 isn’t really a problem except 
when compiling the offending unit from the command line; when compiled 
as part of the closure by gnatmake there’s no problem, I guess gnatmake 
reads the unit name in NFC and gets the file name in NFD from the file 
system.

I think there _is_ a problem in gprbuild but of course that’s nothing 
to do with GCC.

Please can this PR be reopened?
Comment 5 Eric Gallager 2023-09-10 10:07:16 UTC
(In reply to simon from comment #4)
> (In reply to Eric Botcazou from comment #2)
> 
> When I said in comment 1 
> 
> >I have to say that, great as it would be to have this fixed, the changes 
> >required would be extensive, and I can’t see that anyone would think it 
> >worth the trouble.
> 
> I meant that coping with macOS’s HFS+ behaviour w.r.t. NFC vs NFD was 
> something it’d be unreasonable to spend effort on fixing.
> 
> The main point of this PR is that you can’t use extended characters in 
> unit names on case-insensitive filesystems, *which includes Windows*. 
> Fixing that problem (I can see it might mean introducing a new adaint.c 
> interface "is filesystem UTF8?") would be a good thing. Can the compiler 
> use iconv? or Ada.Wide_Characters.Handling,
> Ada.Strings.UTF_Encoding.[Wide_]Strings?
> 
> The awkwardness discussed in comment 1 isn’t really a problem except 
> when compiling the offending unit from the command line; when compiled 
> as part of the closure by gnatmake there’s no problem, I guess gnatmake 
> reads the unit name in NFC and gets the file name in NFD from the file 
> system.
> 
> I think there _is_ a problem in gprbuild but of course that’s nothing 
> to do with GCC.
> 
> Please can this PR be reopened?

Well, it was never closed in the first place, just marked as SUSPENDED, but I can put it back to UNCONFIRMED, I guess...
Comment 6 simon 2023-09-17 18:37:54 UTC
(In reply to simon from comment #1)
> Further:
> 
> $ GNAT_FILE_NAME_CASE_SENSITIVE=1 gnatmake -c p*.ads
> gcc -c páck3.ads
> páck3.ads:1:10: warning: file name does not match unit name, should be
> "páck3.ads"
> 
> The reason for this apparently-bizarre message is[1] that macOS takes 
> the composed form (lowercase a acute) and converts it under the hood 
> to what HFS+ insists on, the fully decomposed form (lowercase a, combining 
> acute); thus the names are actually different even though they _look_ 
> the same.

This behaviour (I think it was an error) was fixed by darwin 19. Opening by a name with the composed form now correctly finds the file named with the fully decomposed form.
Comment 7 Eric Gallager 2023-10-01 17:34:11 UTC
(In reply to simon from comment #6)
> (In reply to simon from comment #1)
> > Further:
> > 
> > $ GNAT_FILE_NAME_CASE_SENSITIVE=1 gnatmake -c p*.ads
> > gcc -c páck3.ads
> > páck3.ads:1:10: warning: file name does not match unit name, should be
> > "páck3.ads"
> > 
> > The reason for this apparently-bizarre message is[1] that macOS takes 
> > the composed form (lowercase a acute) and converts it under the hood 
> > to what HFS+ insists on, the fully decomposed form (lowercase a, combining 
> > acute); thus the names are actually different even though they _look_ 
> > the same.
> 
> This behaviour (I think it was an error) was fixed by darwin 19. Opening by
> a name with the composed form now correctly finds the file named with the
> fully decomposed form.

OK, so do we still want to fix it for older darwin versions, or...?
Comment 8 simon 2023-10-18 16:05:09 UTC
I think I’d forgotten that compiling páck3.ads on its own, rather than as 
part of the closure, was the way to demonstrate this problem. It was NOT 
fixed in darwin19 (it’s still present in darwin23).

For interest, I made a C file which #includes a header with an a-acute in 
its name; the C file uses the composed a-acute, but the header’s file name
(as shown by ls) uses the combining a-acute. Compiles without complaint.
Attachment c-demo.zip.

On third thoughts, this should probably go back to SUSPENDED. When I looked
into it, it seemed to involve quite deep parts of the compiler, which
probably means that the Ada maintainers would be resistant (especially
since AdaCore don’t support macOS).
Comment 9 simon 2023-10-18 16:09:57 UTC
Created attachment 56140
C demonstrator

As noted in comment 8, the C compiler doesn’t have a problem with 
finding a file with a combining filename when the #include
directive uses a composed filename.