Bug 109098 - Encoding errors on SARIF output for non-UTF-8 source files
Summary: Encoding errors on SARIF output for non-UTF-8 source files
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: analyzer
Version: 13.0
Importance: P3 normal
Target Milestone: ---
Assignee: David Malcolm
URL:
Keywords: diagnostic, SARIF
Depends on:
Blocks:
 
Reported: 2023-03-11 00:22 UTC by David Malcolm
Modified: 2024-06-21 16:04 UTC
CC: 1 user

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2023-03-11 00:00:00


Description David Malcolm 2023-03-11 00:22:30 UTC
I've run into issues with -fanalyzer integration testing when a diagnostic occurs in a source file with non-UTF-8 bytes in it.

Specifically, haproxy-2.7.1's include/import/ist.h has a comment within function "istalloc" that isn't quite UTF-8.

When a diagnostic occurs in this file with -fdiagnostics-format=sarif-file, the file is quoted in the SARIF output, but the individual bytes are copied directly into the JSON, leading to e.g.:

$ sarif ls integration-tests/haproxy-2.7.1/haproxy-2.7.1/http_client.c.sarif
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/sarif/loader.py", line 57, in load_sarif_file
    data = json.load(file_in)
  File "/usr/lib64/python3.8/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/lib64/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 96006: invalid start byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/sarif", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/sarif/cmdline/main.py", line 40, in main
    exitcode = args.func(args)
  File "/usr/local/lib/python3.8/site-packages/sarif/cmdline/main.py", line 384, in _ls
    ls_op.print_ls(args.files_or_dirs, args.output)
  File "/usr/local/lib/python3.8/site-packages/sarif/operations/ls_op.py", line 17, in print_ls
    sarif_files = loader.load_sarif_files(path)
  File "/usr/local/lib/python3.8/site-packages/sarif/loader.py", line 30, in load_sarif_files
    path_exists = _add_path_to_sarif_file_set(path, ret)
  File "/usr/local/lib/python3.8/site-packages/sarif/loader.py", line 17, in _add_path_to_sarif_file_set
    sarif_file_set.add_file(load_sarif_file(path))
  File "/usr/local/lib/python3.8/site-packages/sarif/loader.py", line 60, in load_sarif_file
    raise IOError(f"Cannot load {file_path}") from exception
OSError: integration-tests/haproxy-2.7.1/haproxy-2.7.1/http_client.c.sarif
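The failure mode can be reproduced without the haproxy sources at all; this is a hypothetical minimal example (the payload byte and file contents are made up, not taken from the actual .sarif file) showing that a single raw 0x80 byte copied verbatim into the JSON makes the whole document undecodable:

```python
import io
import json

# A made-up SARIF-like payload containing one raw non-UTF-8 byte (0x80),
# as happens when source bytes are copied directly into the JSON output.
raw = b'{"text": "comment with bad byte \x80"}'

try:
    # json.load reads the stream as UTF-8 text, just as sarif-tools does.
    json.load(io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8"))
    result = "parsed"
except UnicodeDecodeError as exc:
    result = exc.reason  # "invalid start byte", matching the traceback above

print(result)
```

One bad byte is enough: the decoder fails before the JSON parser ever sees the document, so the entire file is rejected, not just the offending string.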

I'm working on a fix.
Comment 1 Andrew Pinski 2023-03-11 00:26:12 UTC
I would have assumed you need -finput-charset= for the non-UTF-8 files if your LANG/LANGUAGE is not set to a UTF-8 locale.
Comment 2 Andrew Pinski 2023-03-11 00:28:25 UTC
https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/Preprocessor-Options.html#index-finput-charset

It even has the following:

-finput-charset=charset
Set the input character set, used for translation from the character set of the input file to the source character set used by GCC. If the locale does not specify, or GCC cannot get this information from the locale, the default is UTF-8. This can be overridden by either the locale or this command-line option. Currently the command-line option takes precedence if there’s a conflict. charset can be any encoding supported by the system’s iconv library routine.
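Conceptually, the translation the option describes can be sketched in a few lines of Python, using the codecs machinery in place of the system's iconv (the Latin-1 input charset here is an assumed example, not something from this bug report):

```python
# Sketch of the translation -finput-charset performs: bytes in the
# declared input character set (assumed Latin-1 here) are decoded, then
# re-encoded into UTF-8, GCC's internal source character set.
latin1_source = b'/* na\xefve comment */'  # 0xEF is "i-with-diaeresis" in Latin-1

utf8_source = latin1_source.decode("latin-1").encode("utf-8")

# The single Latin-1 byte 0xEF becomes the two-byte UTF-8 sequence C3 AF.
print(utf8_source)
```

Without the right -finput-charset=, bytes like 0xEF or 0x80 pass through untranslated and are not valid UTF-8 on their own, which is exactly what ends up quoted into the SARIF output here.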

So I think there is a bug in that code ...
Comment 3 David Malcolm 2023-03-11 00:31:26 UTC
(In reply to Andrew Pinski from comment #1)
> I would have assumed you need -finput-charset= for the non-UTF-8 files if
> your LANG/LANGUAGE is not set to a UTF-8 locale.

Yeah, but when complaining about encoding issues, the error message we emit should at least be properly encoded :/

It's a major pain for my integration testing where two(?) bad bytes in one source file lead to an unparseable .sarif file (out of thousands).

When quoting source in the .sarif output, we should ensure that the final JSON output is all valid UTF-8, perhaps falling back to not quoting source for cases where e.g.:
- the source file isn't validly encoded,
- the -finput-charset= is wrong,
- the -finput-charset= is missing, or
- the source file (erroneously) uses a mixture of different encodings in different parts of itself.

We should probably also check that we do something sane for Trojan Source attacks.
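The fall-back policy suggested above could look something like this; a minimal sketch in Python rather than GCC's C++ (the function name is hypothetical, not part of GCC):

```python
# Hypothetical sketch of the proposed policy: quote the source file in
# the SARIF output only if its bytes decode cleanly as UTF-8; otherwise
# return None so the snippet is simply omitted and the JSON stays valid.
def maybe_quote_source(raw: bytes):
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None  # fall back to not quoting source

assert maybe_quote_source(b"int x;") == "int x;"
assert maybe_quote_source(b"bad \x80 byte") is None
```

Omitting the snippet loses some context in the report, but a parseable .sarif file with no quoted source is strictly more useful than an unparseable one.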
Comment 4 David Malcolm 2023-03-11 00:33:50 UTC
(In reply to Andrew Pinski from comment #2)
> So I think there is a bug in that code ...

The issue is in sarif_builder::maybe_make_artifact_content_object, which uses:

 char *text_utf8 = maybe_read_file (filename);

where there's no guarantee that "text_utf8" is (ahem) actually utf-8.  Sorry about that.

Working on a fix to make it use the input.cc source-quoting machinery, which should handle encoding.
Comment 5 Hans-Peter Nilsson 2023-03-11 00:55:00 UTC
While considering UTF-8 in SARIF files, please also have a look at PR105959.
Comment 6 jsm-csl@polyomino.org.uk 2023-03-13 21:46:27 UTC
For diagnosis of non-UTF-8 in strings / comments, see commit 0b8c57ed40f19086e30ce54faec3222ac21cc0df, "libcpp: Add -Winvalid-utf8 warning [PR106655]" (implementing a new C++ requirement).
Comment 7 David Malcolm 2023-03-25 01:01:27 UTC
Should be fixed on trunk by r13-6861-gd495ea2b232f3e:

https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=d495ea2b232f3eb50155d7c7362c09a744766746

https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff_plain;h=d495ea2b232f3eb50155d7c7362c09a744766746

The invalid UTF-8 in the patch seems to have broken the server-side script:

Enumerating objects: 51, done.
Counting objects: 100% (51/51), done.
Delta compression using up to 64 threads
Compressing objects: 100% (29/29), done.
Writing objects: 100% (29/29), 7.74 KiB | 1.29 MiB/s, done.
Total 29 (delta 22), reused 0 (delta 0), pack-reused 0
remote: Traceback (most recent call last):
remote:   File "hooks/post_receive.py", line 118, in <module>
remote:     post_receive(refs_data, args.submitter_email)
remote:   File "hooks/post_receive.py", line 65, in post_receive
remote:     submitter_email)
remote:   File "hooks/post_receive.py", line 47, in post_receive_one
remote:     update.send_email_notifications()
remote:   File "/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line 189, in send_email_notifications
remote:     self.__email_new_commits()
remote:   File "/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line 1031, in __email_new_commits
remote:     commit, self.get_standard_commit_email(commit))
remote:   File "/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line 1011, in __send_commit_email
remote:     default_diff=email.diff)
remote:   File "/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line 946, in __maybe_get_email_custom_contents
remote:     hook_input=json.dumps(hooks_data),
remote:   File "/usr/lib64/python2.7/json/__init__.py", line 244, in dumps
remote:     return _default_encoder.encode(obj)
remote:   File "/usr/lib64/python2.7/json/encoder.py", line 207, in encode
remote:     chunks = self.iterencode(o, _one_shot=True)
remote:   File "/usr/lib64/python2.7/json/encoder.py", line 270, in iterencode
remote:     return _iterencode(o, 0)
remote: UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 13147: invalid start byte
To git+ssh://gcc.gnu.org/git/gcc.git
   13ec81eb4c3..d495ea2b232  master -> master
Comment 8 Hans-Peter Nilsson 2023-03-25 01:57:24 UTC
(In reply to David Malcolm from comment #7)
> The invalid UTF-8 in the patch seems to have broken the server-side script:

Maybe the not-really-utf8 files need to be marked in some way in the git repo to be safely handled for future checkout and updates, including the problematic scripting?  However, reading gitattributes(5) it's not obvious how.
Comment 9 David Malcolm 2023-03-27 15:34:59 UTC
(In reply to Hans-Peter Nilsson from comment #8)
> (In reply to David Malcolm from comment #7)
> > The invalid UTF-8 in the patch seems to have broken the server-side script:
> 
> Maybe the not-really-utf8 files need to be marked in some way in the git
> repo to be safely handled for future checkout and updates, including the
> problematic scripting?  However, reading gitattributes(5) it's not obvious
> how.

Perhaps
  https://www.git-scm.com/docs/gitattributes#_marking_files_as_binary 
though it's not clear whether we can do that for individual files (or whether it's worth bothering).