I've run into issues with -fanalyzer integration testing when a diagnostic occurs in a source file containing non-UTF-8 bytes. Specifically, haproxy-2.7.1's include/import/ist.h has a comment within function "istalloc" that isn't quite UTF-8. When a diagnostic occurs on this, with -fdiagnostics-format=sarif-file, the file is quoted in the SARIF output, but the individual bytes are copied directly into the JSON, leading to e.g.:

$ sarif ls integration-tests/haproxy-2.7.1/haproxy-2.7.1/http_client.c.sarif
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/sarif/loader.py", line 57, in load_sarif_file
    data = json.load(file_in)
  File "/usr/lib64/python3.8/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/lib64/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 96006: invalid start byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/sarif", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/sarif/cmdline/main.py", line 40, in main
    exitcode = args.func(args)
  File "/usr/local/lib/python3.8/site-packages/sarif/cmdline/main.py", line 384, in _ls
    ls_op.print_ls(args.files_or_dirs, args.output)
  File "/usr/local/lib/python3.8/site-packages/sarif/operations/ls_op.py", line 17, in print_ls
    sarif_files = loader.load_sarif_files(path)
  File "/usr/local/lib/python3.8/site-packages/sarif/loader.py", line 30, in load_sarif_files
    path_exists = _add_path_to_sarif_file_set(path, ret)
  File "/usr/local/lib/python3.8/site-packages/sarif/loader.py", line 17, in _add_path_to_sarif_file_set
    sarif_file_set.add_file(load_sarif_file(path))
  File "/usr/local/lib/python3.8/site-packages/sarif/loader.py", line 60, in load_sarif_file
    raise IOError(f"Cannot load {file_path}") from exception
OSError: Cannot load integration-tests/haproxy-2.7.1/haproxy-2.7.1/http_client.c.sarif

I'm working on a fix.
I would have assumed you need -finput-charset= for the non-UTF-8 files if your LANG/LANGUAGE is not set to a C/UTF-8 locale.
https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/Preprocessor-Options.html#index-finput-charset even has the following:

  -finput-charset=charset
    Set the input character set, used for translation from the character set
    of the input file to the source character set used by GCC. If the locale
    does not specify, or GCC cannot get this information from the locale, the
    default is UTF-8. This can be overridden by either the locale or this
    command-line option. Currently the command-line option takes precedence
    if there's a conflict. charset can be any encoding supported by the
    system's iconv library routine.

So I think there is a bug in that code ...
(In reply to Andrew Pinski from comment #1)
> I would have assumed you need -finput-charset= for the non-utf8 ones really
> if your LANG/LANGUAGE is not set to C/UTF8 really.

Yeah, but when complaining about encoding issues, the error message we emit should at least be properly encoded :/

It's a major pain for my integration testing, where two(?) bad bytes in one source file lead to an unparseable .sarif file (out of thousands).

When quoting source in the .sarif output, we should ensure that the final JSON output is all valid UTF-8, perhaps falling back to not quoting source for cases where e.g.
- the source file isn't validly encoded, or
- the -finput-charset= is wrong, or
- the -finput-charset= is missing, or
- the source file (erroneously) uses a mixture of different encodings in different parts of itself.

We should probably also check that we do something sane for trojan source attacks.
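As a rough sketch of that fallback policy (in Python rather than GCC's C++, with hypothetical helper names), a SARIF writer could validate the raw bytes before quoting them and simply omit the snippet when they aren't valid UTF-8, so the emitted JSON stays well-formed:

```python
import json

def maybe_quote_source(raw_bytes: bytes):
    """Return decoded source text for embedding in SARIF, or None if
    the bytes aren't valid UTF-8 (hypothetical fallback policy)."""
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to not quoting source rather than emitting
        # invalid JSON.
        return None

def make_artifact(raw_bytes: bytes) -> dict:
    """Build a minimal SARIF-style artifact object; the "contents"
    property is only present when the source could be decoded."""
    artifact = {"location": {"uri": "file.c"}}
    text = maybe_quote_source(raw_bytes)
    if text is not None:
        artifact["contents"] = {"text": text}
    return artifact

# Valid UTF-8 gets quoted; a stray 0x80 byte does not.
print(json.dumps(make_artifact(b"int x;")))
print(json.dumps(make_artifact(b"/* caf\x80 */")))
```

Either way, json.dumps then produces output any strict UTF-8 JSON parser can load.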
(In reply to Andrew Pinski from comment #2)
> So I think there is a bug in that code ...

The issue is in sarif_builder::maybe_make_artifact_content_object, which uses:

  char *text_utf8 = maybe_read_file (filename);

where there's no guarantee that "text_utf8" is (ahem) actually UTF-8. Sorry about that. Working on a fix to make it use the input.cc source-quoting machinery, which should handle encoding.
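The failure mode is easy to simulate outside GCC. This Python sketch (not GCC code) copies raw source bytes straight into a JSON document, the way the unfixed maybe_make_artifact_content_object did, and then shows a strict UTF-8 consumer choking on it:

```python
import json

# Simulate the old behaviour: copy raw source bytes straight into
# the JSON output without checking that they are valid UTF-8.
source_bytes = b"/* comment with a stray byte: \x80 */"
sarif_bytes = b'{"text": "' + source_bytes.replace(b'"', b'\\"') + b'"}'

# A consumer that (correctly) insists on UTF-8 then fails to load it,
# just like the "sarif ls" traceback above.
try:
    json.loads(sarif_bytes.decode("utf-8"))
    print("loaded OK")
except UnicodeDecodeError as e:
    print("UnicodeDecodeError:", e.reason)
```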
While considering UTF-8 in SARIF files, please also have a look at PR105959.
For diagnosis of non-UTF-8 in strings / comments, see commit 0b8c57ed40f19086e30ce54faec3222ac21cc0df, "libcpp: Add -Winvalid-utf8 warning [PR106655]" (implementing a new C++ requirement).
Should be fixed on trunk by r13-6861-gd495ea2b232f3e:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=d495ea2b232f3eb50155d7c7362c09a744766746
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff_plain;h=d495ea2b232f3eb50155d7c7362c09a744766746

The invalid UTF-8 in the patch seems to have broken the server-side script:

Enumerating objects: 51, done.
Counting objects: 100% (51/51), done.
Delta compression using up to 64 threads
Compressing objects: 100% (29/29), done.
Writing objects: 100% (29/29), 7.74 KiB | 1.29 MiB/s, done.
Total 29 (delta 22), reused 0 (delta 0), pack-reused 0
remote: Traceback (most recent call last):
remote:   File "hooks/post_receive.py", line 118, in <module>
remote:     post_receive(refs_data, args.submitter_email)
remote:   File "hooks/post_receive.py", line 65, in post_receive
remote:     submitter_email)
remote:   File "hooks/post_receive.py", line 47, in post_receive_one
remote:     update.send_email_notifications()
remote:   File "/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line 189, in send_email_notifications
remote:     self.__email_new_commits()
remote:   File "/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line 1031, in __email_new_commits
remote:     commit, self.get_standard_commit_email(commit))
remote:   File "/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line 1011, in __send_commit_email
remote:     default_diff=email.diff)
remote:   File "/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line 946, in __maybe_get_email_custom_contents
remote:     hook_input=json.dumps(hooks_data),
remote:   File "/usr/lib64/python2.7/json/__init__.py", line 244, in dumps
remote:     return _default_encoder.encode(obj)
remote:   File "/usr/lib64/python2.7/json/encoder.py", line 207, in encode
remote:     chunks = self.iterencode(o, _one_shot=True)
remote:   File "/usr/lib64/python2.7/json/encoder.py", line 270, in iterencode
remote:     return _iterencode(o, 0)
remote: UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 13147: invalid start byte
To git+ssh://gcc.gnu.org/git/gcc.git
   13ec81eb4c3..d495ea2b232  master -> master
(In reply to David Malcolm from comment #7)
> The invalid UTF-8 in the patch seems to have broken the server-side script:

Maybe the not-really-utf8 files need to be marked in some way in the git repo to be safely handled for future checkout and updates, including the problematic scripting? However, reading gitattributes(5) it's not obvious how.
(In reply to Hans-Peter Nilsson from comment #8)
> (In reply to David Malcolm from comment #7)
> > The invalid UTF-8 in the patch seems to have broken the server-side script:
>
> Maybe the not-really-utf8 files need to be marked in some way in the git
> repo to be safely handled for future checkout and updates, including the
> problematic scripting? However, reading gitattributes(5) it's not obvious
> how.

Perhaps https://www.git-scm.com/docs/gitattributes#_marking_files_as_binary, though it's not clear if we can do that for individual files (or if it's worth bothering).
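For reference, gitattributes patterns can name an individual file rather than a glob, so a per-file "binary" marking would look something like this (the path is hypothetical, and whether the sourceware hooks honor it is untested):

```
# .gitattributes: treat one specific file's contents as opaque bytes.
# "binary" is a built-in macro attribute equivalent to -diff -merge -text.
gcc/testsuite/some-dir/not-really-utf8.c binary
```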