Bug 116724 - RFE: can generated SARIF diagnostics contain untranslated messages (plus translations?)
Summary: RFE: can generated SARIF diagnostics contain untranslated messages (plus tran...
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: other (show other bugs)
Version: unknown
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: SARIF
Depends on:
Blocks:
 
Reported: 2024-09-15 14:31 UTC by David Malcolm
Modified: 2024-09-19 20:35 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description David Malcolm 2024-09-15 14:31:46 UTC
We were chatting about this at Cauldron; filing this as a placeholder for now
Comment 1 David Malcolm 2024-09-15 14:52:18 UTC
I tried this for some examples; consider:

$ LANG=C ./xgcc -B. ../../src/gcc/testsuite/gcc.dg/format/sentinel-1.c -fdiagnostics-format=text -Wall -S -fdiagnostics-format=text
../../src/gcc/testsuite/gcc.dg/format/sentinel-1.c:15:1: warning: 'sentinel' attribute only applies to function types [-Wattributes]
   15 | extern int a ATTR; /* { dg-warning "applies to function types" "sentinel" } */
      | ^~~~~~
../../src/gcc/testsuite/gcc.dg/format/sentinel-1.c:19:1: warning: 'sentinel' attribute requires prototypes with named arguments [-Wattributes]
   19 | extern void foo3 () ATTR; /* { dg-warning "named arguments" "sentinel" } */
      | ^~~~~~
../../src/gcc/testsuite/gcc.dg/format/sentinel-1.c:20:1: warning: 'sentinel' attribute only applies to variadic functions [-Wattributes]
   20 | extern void foo4 (const char *, int) ATTR; /* { dg-warning "variadic functions" "sentinel" } */
      | ^~~~~~
[...snip...]

Now try running with LANG=ja_JP.UTF-8

$ LANG=ja_JP.UTF-8 ./xgcc -B. ../../src/gcc/testsuite/gcc.dg/format/sentinel-1.c -fdiagnostics-format=text -Wall -S -fdiagnostics-format=text
../../src/gcc/testsuite/gcc.dg/format/sentinel-1.c:15:1: 警告: ‘sentinel’ 属性は関数型にのみ適用できます [-Wattributes]
   15 | extern int a ATTR; /* { dg-warning "applies to function types" "sentinel" } */
      | ^~~~~~
../../src/gcc/testsuite/gcc.dg/format/sentinel-1.c:19:1: 警告: ‘sentinel’ 属性は名前付き引数があるプロトタイプが必要です [-Wattributes]
   19 | extern void foo3 () ATTR; /* { dg-warning "named arguments" "sentinel" } */
      | ^~~~~~
../../src/gcc/testsuite/gcc.dg/format/sentinel-1.c:20:1: 警告: ‘sentinel’ attribute only applies to variadic functions [-Wattributes]
   20 | extern void foo4 (const char *, int) ATTR; /* { dg-warning "variadic functions" "sentinel" } */
      | ^~~~~~
[...lots snipped...]

Note that some of the msg fmt strings have translations in ja.po; others don't.

Trying again with -fdiagnostics-format=sarif-file

           "results": [{"ruleId": "-Wattributes",
                        "level": "warning",
                        "message": {"text": "‘[sentinel](https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-sentinel-function-attribute)’ 属性は関数型にのみ適用できます"},

[...snipped...]

Note how the message text in the SARIF output is translated via the ja.po, which could be a problem if the maintainer doesn't speak the same language as the user.

Perhaps we should try to capture both the untranslated text and the translated text?  SARIF has various abilities for handling translations.
Comment 3 David Malcolm 2024-09-15 15:12:12 UTC
Note to self:
499	diagnostic_set_info (diagnostic_info *diagnostic, const char *gmsgid,
500			     va_list *args, rich_location *richloc,
501			     diagnostic_t kind)
502	{
503	  gcc_assert (richloc);
504	  diagnostic_set_info_translated (diagnostic, _(gmsgid), args, richloc, kind);
505	}

Perhaps the diagnostic_info should store both the untranslated *and* translated strings.  Could probably defer the _() lookup until later, e.g. if the warning isn't actually emitted.
Comment 4 Hans-Peter Nilsson 2024-09-15 22:26:50 UTC
(In reply to David Malcolm from comment #1)

> Perhaps we should try to capture both the untranslated text and the
> translated text?  SARIF has various abilities for handling translations.

Works for me! The use-case I was thinking of, is for the SARIF output to be a nice containment of the non-source-code part of bug-reports: "instead of quoting stderr, use --diagnostics-format=sarif-file and send sourcename.sarif".  But, to fulfill that, more is needed, including the gcc arguments.  (Maybe that's all.)
I don't see that included, right?
Sorry for the "creaturization request"!
Comment 5 David Malcolm 2024-09-16 04:24:11 UTC
(In reply to Hans-Peter Nilsson from comment #4)
> (In reply to David Malcolm from comment #1)
> 
> > Perhaps we should try to capture both the untranslated text and the
> > translated text?  SARIF has various abilities for handling translations.

To clarify, consider this hypothethical diagnostic:

  error_at (location,
              "missing %qs after %qs",
              "decl-name", "foo");
with a hypothetical "pig-latin" locale and translation (pig-latin.po); see https://en.wikipedia.org/wiki/Pig_Latin

where
  "missing %qs after %qs"
has this translation in the .po file:
  "issingmay %qs afteray %qs"

The classic text output format might read:

  foo.c:42:11: erroray: issingmay `decl-name' afteray `foo'

and currently GCC's SARIF output would presumably capture the text of the message with:

message: {"text": {"issingmay `decl-name' afteray `foo'"}}

i.e. currently GCC's SARIF output for a formatted string "bakes in" both localization of the format string *and* param substitution.  

We could instead defer parameter substitution to the SARIF consumer via §3.11.5 "Messages with placeholders" (https://docs.oasis-open.org/sarif/sarif/v2.1.0/errata01/os/sarif-v2.1.0-errata01-os-complete.html#_Toc141790716) like this:

message: {"text": {"issingmay `{0}' afteray `{1}'",
          "arguments: ["decl-name", "foo"]}

and, with that, potentially capture the pre-translated message string *and* its translation in the currently after .po file e.g.:
message: {id: "missing %qs after %qs",
          "arguments: ["decl-name", "foo"]}

with something like this: (see 3.11.7 "Message string lookup")

 "translations": [
    {                              # A toolComponent object.
      "language": "pig-latin",
      "contents": ["localizedData"],
      "globalMessageStrings": [
         {"missing %qs after %qs": {"text": "issingmay {0} afteray {1}"}}]}]

where we'd list the subset of format strings that got used by diagnostics in the particular log, and their translations, with the caveats that:
- I'm not sure that that's how translations of strings are meant to be stored (the SARIF spec's tutorial doesn't seem to cover translations yet)
- I'm using (abusing?) the string as its own "id"

If gettext supported it, could even try to capture translations from *all* .po files.  But if needed that's probably much easier to handle via a post-processing script.
 
> Works for me! The use-case I was thinking of, is for the SARIF output to be
> a nice containment of the non-source-code part of bug-reports: "instead of
> quoting stderr, use --diagnostics-format=sarif-file and send
> sourcename.sarif".  

Sounds like an interesting idea; can you open this as a separate RFE please?

> But, to fulfill that, more is needed, including the gcc
> arguments.  (Maybe that's all.)

I've added support for capturing the command-line arguments in GCC 15:
  https://gcc.gnu.org/pipermail/gcc-patches/2024-July/658206.html
though note that it's capturing the arguments as supplied by the driver to e.g. cc1, as opposed to those that the user supplied to the driver.  

> I don't see that included, right?
> Sorry for the "creaturization request"!

Thanks for the feedback; hope the above makes sense.
Comment 6 David Malcolm 2024-09-16 04:38:17 UTC
(In reply to David Malcolm from comment #5)
[...]
> (the SARIF spec's tutorial doesn't seem to cover translations yet)

FWIW I've requested this as https://github.com/microsoft/sarif-tutorials/issues/40

[...]
Comment 7 David Malcolm 2024-09-16 04:42:00 UTC
(In reply to Hans-Peter Nilsson from comment #4)
> (In reply to David Malcolm from comment #1)
> 
> > Perhaps we should try to capture both the untranslated text and the
> > translated text?  SARIF has various abilities for handling translations.
> 
> Works for me! The use-case I was thinking of, is for the SARIF output to be
> a nice containment of the non-source-code part of bug-reports

FWIW that SARIF can capture embedded copies of source files of interest, and that the GCC SARIF generation code does this for any source file mentioned in a diagnostic.  So perhaps the .sarif file could contain everything.
Comment 8 Joseph S. Myers 2024-09-19 20:35:31 UTC
Note that there are various places where translation happens without the diagnostic machinery ever seeing an untranslated message.  A representative example is cp/typeck.cc:cp_build_unary_op, where messages are translated with _() then passed to error_at (location, "%s", errstring);. Note that using error_at (location, errstring) would cause issues with -Wformat-security warnings; I've previously suggested adding error_at_no_args (allowing no-arguments formats such as %< and %> but not others) to help with such issues.