Created attachment 41575 [details]
Demonstrator (with BOM)

The attached demonstrator contains two files, each with a UTF8 BOM.
One file, pack3_user.adb, contains

   with Páck3;
   procedure Pack3_User is
   begin
      null;
   end Pack3_User;

while the other, páck3.ads, contains just

   package Páck3 is
   end Páck3;

There is no problem compiling on Linux (Debian Jessie). However, on
Darwin and Windows, we get

   $ gnatmake -c -f pack3_user.adb
   gcc -c pack3_user.adb
   gnatmake: "p?ck3.ads" not found

This is perhaps partly explained by looking at pack3_user.ali:

====================
V "GNAT Lib v8"
M P W=8
P ZX
RN
U pack3_user%b pack3_user.adb be67fdbd NE OO SU
W pUe1ck3%s p?ck3.ads p?ck3.ali   [A]
D p?ck3.ads 20170615165452 7221d8b1 páck3%s   [B]
D pack3_user.adb 20170616143450 cc46250c pack3_user%b
D system.ads 20161018202953 085b6ffb system%s
X 1 páck3.ads   [C]
[...]
====================

from which ([A], [B]) it is clear that GNAT is sometimes confused about
the file names. Interestingly, sometimes it gets it right (last
component on [B], [C]).

The ALI file is written by Lib.Writ.Write_ALI. In two places it says

   if not File_Names_Case_Sensitive then
      Get_Name_String (Fname);
      To_Lower (Name_Buffer (1 .. Name_Len));    <<<<<<<<<
      Fname := Name_Find;
   end if;

which is clearly the Wrong Thing to do if the file name is not ASCII.
In the ALI file above, the small-a-acute, which should be encoded as
C3 A1, has been rendered as E3 A1.

Using the undocumented env var GNAT_FILE_NAME_CASE_SENSITIVE alters
things:

   $ GNAT_FILE_NAME_CASE_SENSITIVE=1 gnatmake -c -f pack3_user.adb
   gcc -c pack3_user.adb
   gcc -c páck3.ads

so it's clear that the problem lies in this region.

Interestingly, [B] and [C] above show that the compiler does understand
how to low-case extended characters in strings. I haven't yet been able
to find where this is done.
Further:

   $ GNAT_FILE_NAME_CASE_SENSITIVE=1 gnatmake -c p*.ads
   gcc -c páck3.ads
   páck3.ads:1:10: warning: file name does not match unit name, should be "páck3.ads"

The reason for this apparently-bizarre message is[1] that macOS takes
the composed form (lowercase a acute) and converts it under the hood
to what HFS+ insists on, the fully decomposed form (lowercase a, combining
acute); thus the names are actually different even though they _look_
the same.

I have to say that, great as it would be to have this fixed, the changes
required would be extensive, and I can’t see that anyone would think it
worth the trouble. The recommendation would be "don’t use international
characters in the names of library units".

[1] https://stackoverflow.com/a/6153713/40851
Right. And people should use sane filesystems (and sane OSes to begin with).
Just for interest, this not-very-good code will successfully convert the
uppercase-a-acute input c381 to uppercase-a/combining-acute 41cc81:

   #include <stdio.h>
   #include <iconv.h>
   #include <stdint.h>
   #include <string.h>

   int main(void)
   {
     /* U+00C1 (uppercase a acute) in composed UTF-8 form. */
     uint8_t codepoint[] = {0xc3, 0x81, 0};
     char *input = (char *) codepoint;
     size_t in_size = 2;
     char output_buffer[10];
     memset(output_buffer, 0, sizeof(output_buffer));
     char *output = output_buffer;
     size_t out_size = sizeof(output_buffer);
     /* The "utf8-mac" encoding is only provided by macOS's iconv. */
     iconv_t cd = iconv_open("utf8-mac", "UTF-8");
     if (cd == (iconv_t) -1) {
       perror("iconv_open");
       return 1;
     }
     iconv(cd, &input, &in_size, &output, &out_size);
     iconv_close(cd);
     /* in_size and out_size now hold the bytes left in each buffer. */
     printf("in %zu out %zu result \"%s\"\n", in_size, out_size, output_buffer);
     return 0;
   }

but of course only on macOS - https://stackoverflow.com/a/23159081/40851
(In reply to Eric Botcazou from comment #2)

When I said in comment 1

>I have to say that, great as it would be to have this fixed, the changes
>required would be extensive, and I can’t see that anyone would think it
>worth the trouble.

I meant that coping with macOS’s HFS+ behaviour w.r.t. NFC vs NFD was
something it’d be unreasonable to spend effort on fixing.

The main point of this PR is that you can’t use extended characters in
unit names on case-insensitive filesystems, *which includes Windows*.
Fixing that problem (I can see it might mean introducing a new adaint.c
interface "is filesystem UTF8?") would be a good thing. Can the compiler
use iconv? or Ada.Wide_Characters.Handling,
Ada.Strings.UTF_Encoding.[Wide_]Strings?

The awkwardness discussed in comment 1 isn’t really a problem except
when compiling the offending unit from the command line; when compiled
as part of the closure by gnatmake there’s no problem, I guess gnatmake
reads the unit name in NFC and gets the file name in NFD from the file
system.

I think there _is_ a problem in gprbuild but of course that’s nothing
to do with GCC.

Please can this PR be reopened?
(In reply to simon from comment #4)
> (In reply to Eric Botcazou from comment #2)
>
> When I said in comment 1
>
> >I have to say that, great as it would be to have this fixed, the changes
> >required would be extensive, and I can’t see that anyone would think it
> >worth the trouble.
>
> I meant that coping with macOS’s HFS+ behaviour w.r.t. NFC vs NFD was
> something it’d be unreasonable to spend effort on fixing.
>
> The main point of this PR is that you can’t use extended characters in
> unit names on case-insensitive filesystems, *which includes Windows*.
> Fixing that problem (I can see it might mean introducing a new adaint.c
> interface "is filesystem UTF8?") would be a good thing. Can the compiler
> use iconv? or Ada.Wide_Characters.Handling,
> Ada.Strings.UTF_Encoding.[Wide_]Strings?
>
> The awkwardness discussed in comment 1 isn’t really a problem except
> when compiling the offending unit from the command line; when compiled
> as part of the closure by gnatmake there’s no problem, I guess gnatmake
> reads the unit name in NFC and gets the file name in NFD from the file
> system.
>
> I think there _is_ a problem in gprbuild but of course that’s nothing
> to do with GCC.
>
> Please can this PR be reopened?

Well, it was never closed in the first place, just marked as SUSPENDED,
but I can put it back to UNCONFIRMED, I guess...
(In reply to simon from comment #1)
> Further:
>
> $ GNAT_FILE_NAME_CASE_SENSITIVE=1 gnatmake -c p*.ads
> gcc -c páck3.ads
> páck3.ads:1:10: warning: file name does not match unit name, should be
> "páck3.ads"
>
> The reason for this apparently-bizarre message is[1] that macOS takes
> the composed form (lowercase a acute) and converts it under the hood
> to what HFS+ insists on, the fully decomposed form (lowercase a, combining
> acute); thus the names are actually different even though they _look_
> the same.

This behaviour (I think it was an error) was fixed by darwin 19. Opening by
a name with the composed form now correctly finds the file named with the
fully decomposed form.
(In reply to simon from comment #6)
> (In reply to simon from comment #1)
> > Further:
> >
> > $ GNAT_FILE_NAME_CASE_SENSITIVE=1 gnatmake -c p*.ads
> > gcc -c páck3.ads
> > páck3.ads:1:10: warning: file name does not match unit name, should be
> > "páck3.ads"
> >
> > The reason for this apparently-bizarre message is[1] that macOS takes
> > the composed form (lowercase a acute) and converts it under the hood
> > to what HFS+ insists on, the fully decomposed form (lowercase a, combining
> > acute); thus the names are actually different even though they _look_
> > the same.
>
> This behaviour (I think it was an error) was fixed by darwin 19. Opening by
> a name with the composed form now correctly finds the file named with the
> fully decomposed form.

OK, so do we still want to fix it for older darwin versions, or...?
I think I’d forgotten that compiling páck3.ads on its own, rather than as
part of the closure, was the way to demonstrate this problem. It was NOT
fixed in darwin19 (it’s still present in darwin23).

For interest, I made a C file which #includes a header with an a-acute in
its name; the C file uses the composed a-acute, but the header’s file name
(as shown by ls) uses the combining a-acute. Compiles without complaint.
Attachment c-demo.zip.

On third thoughts, this should probably go back to SUSPENDED. When I looked
into it, it seemed to involve quite deep parts of the compiler, which
probably means that the Ada maintainers would be resistant (especially
since AdaCore don’t support macOS).
Created attachment 56140 [details]
C demonstrator

As noted in comment 8, the C compiler doesn’t have a problem with
finding a file with a combining filename when the #include
directive uses a composed filename.
(In reply to simon from comment #9)
> Created attachment 56140 [details]
> C demonstrator
>
> As noted in comment 8, the C compiler doesn’t have a problem with
> finding a file with a combining filename when the #include
> directive uses a composed filename.

clang has -Wnonportable-include-path, which judging by the text looks like
it's for case differences only, but if adding it for GCC, too, we'd probably
want to extend it to handle this sort of thing as well:
https://clang.llvm.org/docs/DiagnosticsReference.html#wnonportable-include-path