This is the mail archive of the
mailing list for the libstdc++ project.
Re: Unicode and C++
On Mon, Jul 03, 2000 at 07:29:35PM -0400, Havoc Pennington wrote:
> Per Hedbor <firstname.lastname@example.org> writes:
> > > Or in the GTK+ case, massive quantities of legacy code that has to
> > > keep working. UTF8 is pretty easy to port to
> > Only if you live in the US or some other 8-bit challenged country.
> > If you do not, you have to decode from UTF8 everywhere to support
> > things like file-names, names etc. anyway. Thus, the porting job is
> > not really smaller, only more hidden.
> You only need to decode from UTF8 to interface outside your program
> (for display, for filenames, output, etc.). Most uses of strings are
> passing them around inside the program itself, and those uses don't
> change with UTF8.
That's just because of legacy interfaces still in the code. As the
libraries get converted to take pointers to wchar_t, there will be
no need to "decode from UTF-8" at any point.
It is almost always better to code to the goal, and put in hacks for
current limitations, than to code to current limitations and then add
hacks for modern implementations. With the former, as the rest of
the "world" comes up to speed you get to just delete the hacks. With
the latter you live with the hacks forever.
> For display, things continue to work unchanged in
> GTK since GTK handles those details. The frequency of loading/saving a
> file is substantially less than the frequency of gtk_label_set_text()
> in your average GUI app.
Exactly. That's why you want to convert to/from UTF-8 only when
loading/saving a file, or (equivalently) sending over the net.
Otherwise, you want UCS4 uniformly throughout, in memory.
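A minimal sketch of that boundary conversion, assuming C++11's char32_t
stands in for UCS4 (wchar_t is narrower than 32 bits on some platforms).
The names utf8_to_ucs4 and ucs4_to_utf8 are hypothetical, not part of
any library, and the decoder does not reject overlong forms or
surrogate code points:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UCS-4 code points.
// Called once when data enters the program (file load, network read).
std::u32string utf8_to_ucs4(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Hypothetical helper: encode UCS-4 code points as UTF-8 bytes.
// Called once when data leaves the program (file save, network write).
std::string ucs4_to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp <= 0x10FFFF) {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            throw std::invalid_argument("code point out of range");
        }
    }
    return out;
}
```

Everything between the two calls then works on one fixed-width element
per character, so indexing and length need no decoding step.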
> We have some convenience functions for filenames:
> gchar* g_filename_to_utf8 (const gchar *opsysstring);
> gchar* g_filename_from_utf8 (const gchar *utf8string);
> So it's relatively easy to fix all such cases.
> The patch for a non-unicode app is certainly going to cover a lot
> fewer lines of code with UTF8, though conceptually you're doing the
> same thing, you just avoid a lot of busywork hassle by keeping the
> char* type.
In avoiding busywork for legacy code, you may create a lot of busywork
hassles and inefficiencies for anyone trying to write new code the
right way. That would be short-sighted.
> Also you can use ASCII string literals with UTF8, which means you can
> usually use string literals (most apps use English for their gettext
> keys). People simply won't accept not being able to use string
There is no problem with wide character string literals in C or C++,
and there hasn't been for a very long time. Simply prefix the literal
with the letter 'L':
wchar_t hello[] = L"hello, world";
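As a slightly fuller sketch, a wide literal can initialize an array, a
pointer, or a std::wstring, and the standard wide-character functions
work on it directly. The function name greeting_length is hypothetical
and exists only for this illustration:

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Hypothetical function, for illustration only: returns the number of
// wide characters in the greeting, measured two ways.
std::size_t greeting_length() {
    const wchar_t* hello = L"hello, world";  // pointer to a wide literal
    std::wstring s = hello;                  // same text as a std::wstring
    // std::wcslen counts wchar_t elements, not bytes.
    return std::wcslen(hello) == s.size() ? s.size() : 0;
}
```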
ncm at cantrip dot org