This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [RFC] New pragma exec_charset


My concern with this pragma/attribute and inlining has to do with
strings in one exec charset being propagated into functions that
operate on strings in another charset.  E.g., like in the test
case below that's "miscompiled" with your patch -- the first test
for n == 7 is eliminated and the buffer overflow is not detected.
If this cannot be made to work then I think some effort should be
made to detect this mixing and matching and existing optimizations
that assume the same charset (like the sprintf one does) disabled.

static inline int
f (char *d, const char *fmt)
{
#pragma GCC exec_charset ("utf8")
   int n = __builtin_sprintf (d, fmt, 12345);
#pragma GCC exec_charset (pop)

   if (n == 7)   // incorrectly optimized away
     __builtin_abort ();

To my understanding the problem in your example isn't triggered through inlining.
The check gets optimized away because sprintf is called with "i=%i" being converted to EBCDIC first.

It's optimized away because the sprintf pass interprets the constant
contents of the format string, which it only does when the function
is inlined into its caller.  Otherwise, without inlining, it treats
the format string as an unknown and so it cannot do the buffer
overflow analysis or the optimization, and everything works fine
at runtime.

 The sprintf implementation does not recognize the format letter and hence only prints the converted
format string without inserting the integer. That's kinda expected I would say.  You call a function
which is expected to deal with a UTF-8 string with an EBCDIC string. GCC is reasoning about what the
sprintf invocation would return and removes the condition since it would never be triggered.
Compiling with -O0 makes the program behave the same way even without the inlining and the check not
being removed.

Right, at -O0 the function isn't inlined so it behaves sensibly.
The buffer overflow analysis doesn't happen but that is expected
and documented.

It would also happen without using the pragma but compiling with -fexec-charset=EBCDIC-US instead.
E.g. compiling this program with -fexec-charset=EBCDIC-US does abort while it does not abort without
using -fexec-charset:

int main (void)
{
   char d[5];

   int n = __builtin_sprintf (d, "i=%i", 12345);

   if (n != 7)
     __builtin_abort ();
}

It aborts in GCC 7 because its sprintf pass doesn't support
-fexec-charset.  The sprintf pass in GCC 8 does support it so it's
supposed to work correctly.  It does with -fexec-charse=IBM1047,
the only charset I tested the implementation with.  The output is:

y.c: In function ‘main’:
y.c:7:34: warning: too many arguments for format [-Wformat-extra-args]
    int n = __builtin_sprintf (d, "i=%i", 12345);
                                  ^~~~~~
y.c:7:34: warning: ‘%i’ directive writing 5 bytes into a region of size 3 [-Wformat-overflow=]
    int n = __builtin_sprintf (d, "i=%i", 12345);
                                  ^~~~~~
y.c:7:12: note: ‘__builtin_sprintf’ output 8 bytes into a destination of size 5
    int n = __builtin_sprintf (d, "i=%i", 12345);

The first warning is wrong, caused by the missing support for
-fexec-charset in -Wformat.  But the other warning is expected
and correct.

The abort seems like a bug in the implementation, although since
the buffer overflows the behavior is undefined so one could argue
the abort is okay.  I will look into what's going on and fix it.
 When there is no buffer overflow there is no abort, and that's
the correct behavior that needs to be preserved with the pragma.

For some reason, GCC 8 gets an ICE when -fexec-charset=EBCDIC-US.
I opened pr82700 to investigate and fix this bug.

I'm not sure what kind of warning to expect for such code? Would it be helpful if the compiler tells
that the exec_charset setting does not affect "fmt" for this code?!

#pragma GCC exec_charset ("EBCDIC-US")
   int n = __builtin_sprintf (d, fmt, 12345);
#pragma GCC exec_charset (pop)

Other than making it work as one would expect I think the only
reasonable approach would be to disable the optimization and
analysis.  Losing the optimization wouldn't be a big deal, but
it would be very unfortunate to have to trade off the pragma for
the sprintf buffer overflow detection.  But I'd like to think
we don't have to make such a tradeoff.

On the other hand, diagnostics like this might create false positives. One reason of course is that
this might be intentional. But it might also be because in C it is difficult to detect whether a
char* variable will eventually be interpreted as string or not.

While I agree with you that the feature is not that easy to use and more diagnostics would be good
I'm not really sure what to do about it.

I think the suggestion to associate each string with its own
charset is a good one.  That should let the sprintf pass (and
others like it) do the right thing even as strings get inlined
and charsets are mixed and matched, e.g., like so:

  static inline void f (const char *f)
  {
    #pragma GCC exec_charset ("UTF-8")
      const char format[] = "fmt = %s\n";   // format in UTF-8

      printf (format, f);         // f interpreted in EBCDIC-US

      printf (f, 12345);          // ditto here
    #pragma GCC exec_charset (pop)
  }

  void g (void)
  {
    #pragma GCC exec_charset ("EBCDIC-US")

    f ("i = %i\n");   // literal in EBCDIC-US

    #pragma GCC exec_charset (pop)
  }

Martin


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]