Created attachment 47547 [details] Candidate patch Hello- When source lines are needed for diagnostics output, they are retrieved from the source file by the fcache infrastructure in input.c, since libcpp has generally already forgotten them already (plus not all front ends are using libcpp). This infrastructure does not read the files in the same way as libcpp does; in particular, it does not translate the encoding as requested by -finput-charset, and it does not strip a UTF-8 BOM if present. The attached patch adds this ability. My thinking in deciding how to do it was the following: - Use of -finput-charset is rare, and use of UTF-8 BOMs must be rarer still, so this patch should try hard not to introduce any worse performance unless these features are being used. - It is desirable to reuse libcpp's encoding infrastructure from charset.c rather than repeat it in input.c. (Notably, libcpp uses iconv but it also has hand-coded routines for certain charsets to make sure they are available.) - There is a performance degradation required in order to make use of libcpp directly, because the input.c infrastructure only reads as much of the source file as necessary, whereas libcpp interfaces require to read the entire file into memory. - It can't be quite as simple as just "only delegate to libcpp if -finput-charset was specified", because the stripping of the UTF-8 BOM has to happen with or without this option. - So it seemed a reasonable compromise to me, if -finput-charset is specified, then use libcpp to convert the file, otherwise, strip the BOM in input.c and then process the file the same way it is done now. There's a little bit of leakage of charset logic from libcpp this way, but it seems worthwhile, since otherwise, diagnostics would always be reading the entire file into memory, which is not a cost paid currently. Hope that makes some sense. One thing I wasn't sure of, is what's the right way to communicate to input.c that libcpp is being used to process the source files. I added a global variable for this because I didn't see any other way to do it without making drastic changes. This requires C-family and Fortran front-ends, and any future libcpp users, to set the variable; basically it makes explicit the fact that there's a single global cpp_reader object in use. If there's a less-bad way to do it, please let me know... If the global cpp_reader variable doesn't get set, then the behavior is the same as current. Separate from the attached patch are two testcases that both fail before this patch and pass after. I attached them gzipped because they use non-standard encodings. If this overall approach seems OK, please let me know and I can prepare it for gcc-patches. bootstrap + reg test look good on x86-64 linux. Thanks! -Lewis
Created attachment 47548 [details] testcases
Did the patch ever get posted to gcc-patches? Does it still apply cleanly? (I just ran into this issue and was about to file my own bug about it)
Created attachment 49557 [details] updated patch to apply to current master It seems I did not pursue this beyond filing this bug report with the patch. The patch requires some minor rebasing, just due to surrounding context, so here is a version that applies now, and still bootstraps and works. Should I send to gcc-patches as well? I think the main potential issue with it is that it requires input.h / input.c to get a global variable so that frontends can let it know they are using libcpp to read their input. (Otherwise -finput-charset is not applicable). I am sure there is a better way to do that, but not sure the cleanest one, none of the routines in input.c seem to make use of a diagnostic_context or similar thing. -Lewis
Fixed for GCC 12 by r12-3140