This is the mail archive of the
gcc-help@gcc.gnu.org
mailing list for the GCC project.
Unicode and C++ (GCC 4.0)
- From: Eljay Love-Jensen <eljay at adobe dot com>
- To: "gcc-help at gcc dot gnu dot org" <gcc-help at gcc dot gnu dot org>
- Date: Thu, 04 Aug 2005 10:00:16 -0500
- Subject: Unicode and C++ (GCC 4.0)
Hi everyone (and especially Benjamin Kosnik if you are listening),
I have a stupid C++ question. I've spun my wheels for days on this issue.
I've been lost in a morass of std:: ... basic_istream, basic_ostream,
basic_istringstream, basic_ostringstream, locale, facet, and codecvt.
A simplified version of my problem:
I have these files:
utf8.txt
utf8-w-bom.txt ***
utf16le-w-bom.txt
utf16be-w-bom.txt
utf16le-wo-bom.txt ***
utf16be-wo-bom.txt *
utf32le-w-bom.txt
utf32be-w-bom.txt
utf32le-wo-bom.txt ***
utf32be-wo-bom.txt *
I want to read those files into big behemoth strings. Strings that are
IN MEMORY utf8, or utf16, or utf32 at my discretion (i.e.,
programmatically). I want to write out those Unicode strings into new
Unicode files.
Now, granted, some of them are peculiar, as indicated with ***. The ones
marked with a single * are copacetic, because "without BOM" is supposed to
be presumed big-endian.
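A minimal sketch of how the BOM (or its absence) at the start of the files
listed above could be classified. The names Encoding, Detected, and detectBom
are hypothetical, and the choice for BOM-less input is left to the caller as
an assumption, since a missing BOM alone cannot tell UTF-8 from BOM-less
UTF-16 or UTF-32.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical names for illustration; none of these come from a library.
enum Encoding { UTF8, UTF16LE, UTF16BE, UTF32LE, UTF32BE };

struct Detected
{
    Encoding encoding;    // encoding named by the BOM, or the fallback
    std::size_t bomSize;  // number of leading BOM bytes to skip (0 if none)
};

static Detected makeDetected(Encoding e, std::size_t n)
{
    Detected d;
    d.encoding = e;
    d.bomSize = n;
    return d;
}

// Classify the leading bytes of a file. If no BOM is present, the caller's
// fallback is returned (e.g. UTF16BE for a BOM-less UTF-16 file).
Detected detectBom(const std::vector<unsigned char>& b, Encoding fallback)
{
    const std::size_t n = b.size();
    // UTF-32 is tested before UTF-16 because the UTF-32LE BOM (FF FE 00 00)
    // begins with the UTF-16LE BOM (FF FE). A UTF-16LE file whose first
    // character is U+0000 would be misclassified; that case is ignored here.
    if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
        return makeDetected(UTF32LE, 4);
    if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
        return makeDetected(UTF32BE, 4);
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return makeDetected(UTF8, 3);
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return makeDetected(UTF16LE, 2);
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return makeDetected(UTF16BE, 2);
    return makeDetected(fallback, 0);
}
```

For a file such as utf16be-wo-bom.txt the caller would pass UTF16BE as the
fallback and get bomSize 0 back; the *** files need a fallback chosen by some
other means.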
I want to write out those files from those strings, but not necessarily UTF8
to UTF8. I want to go from anything (UTF8, 16, 32) to anything (UTF8, 16,
32).
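One way to picture "anything to anything" is to decode every input into
Unicode code points and then encode those into the desired output form. The
sketch below shows only the UTF-8 to UTF-16 leg, working on plain byte
vectors rather than on std::basic_string or streams. CodePoint, Bytes,
decodeUtf8, and encodeUtf16 are hypothetical names, and validation is
deliberately minimal.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

typedef unsigned long CodePoint;            // at least 32 bits wide
typedef std::vector<unsigned char> Bytes;   // raw encoded bytes

// Decode UTF-8 bytes into code points. Truncated or malformed sequences
// throw, but overlong forms and encoded surrogates are NOT rejected.
std::vector<CodePoint> decodeUtf8(const Bytes& in)
{
    std::vector<CodePoint> out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = in[i++];
        CodePoint cp;
        int extra;                          // continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        for (; extra > 0; --extra) {
            if (i >= in.size() || (in[i] & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (in[i++] & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}

// Append one 16-bit code unit in the requested byte order.
static void appendUnit16(Bytes& out, CodePoint unit, bool bigEndian)
{
    const unsigned char hi = static_cast<unsigned char>((unit >> 8) & 0xFF);
    const unsigned char lo = static_cast<unsigned char>(unit & 0xFF);
    out.push_back(bigEndian ? hi : lo);
    out.push_back(bigEndian ? lo : hi);
}

// Encode code points as UTF-16 bytes (no BOM is written). Code points
// above U+FFFF become a surrogate pair.
Bytes encodeUtf16(const std::vector<CodePoint>& cps, bool bigEndian)
{
    Bytes out;
    for (std::size_t i = 0; i < cps.size(); ++i) {
        CodePoint cp = cps[i];
        if (cp >= 0x10000) {
            cp -= 0x10000;
            appendUnit16(out, 0xD800 | (cp >> 10), bigEndian);
            appendUnit16(out, 0xDC00 | (cp & 0x3FF), bigEndian);
        } else {
            appendUnit16(out, cp, bigEndian);
        }
    }
    return out;
}
```

Calling encodeUtf16(decodeUtf8(bytes), true) would turn UTF-8 input into
big-endian UTF-16; the other directions would follow the same decode-then-
encode pattern with their own pair of functions.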
Pseudo code example:
-----------------------------------------
#include <stdint.h> // C99-ism
#include <iostream>
#include <sstream>
#include <string>
class Utf8Char
{
    uint8_t m;
public:
    explicit Utf8Char(char in) : m(in) { }
    operator uint8_t () const { return m; }
};
class Utf16Char
{
    uint16_t m;
public:
    explicit Utf16Char(char in) : m(in) { }
    operator uint16_t () const { return m; }
};
class Utf32Char
{
    uint32_t m;
public:
    explicit Utf32Char(char in) : m(in) { }
    operator uint32_t () const { return m; }
};
typedef std::basic_string<Utf8Char> Utf8String;
typedef std::basic_string<Utf16Char> Utf16String;
typedef std::basic_string<Utf32Char> Utf32String;
typedef std::basic_istream<Utf8Char> istream8;
typedef std::basic_istream<Utf16Char> istream16;
typedef std::basic_istream<Utf32Char> istream32;
typedef std::basic_ostream<Utf8Char> ostream8;
typedef std::basic_ostream<Utf16Char> ostream16;
typedef std::basic_ostream<Utf32Char> ostream32;
typedef std::basic_istringstream<Utf8Char> istringstream8;
typedef std::basic_istringstream<Utf16Char> istringstream16;
typedef std::basic_istringstream<Utf32Char> istringstream32;
typedef std::basic_ostringstream<Utf8Char> ostringstream8;
typedef std::basic_ostringstream<Utf16Char> ostringstream16;
typedef std::basic_ostringstream<Utf32Char> ostringstream32;
-----------------------------------------
BUT... none of that works. At all.
I'm completely dazed and confused.
I can't even get this to work (COMPILABLE example with GCC 4.0, so I didn't
break my own often-given advice on this forum)...
-----------------------------------------
#include <ios>
#include <iostream>
#include <sstream>
#include <ext/stdio_filebuf.h>
// Following Stroustrup's 11.7.1 advice...
class Utf16Char
{
public:
    Utf16Char() : c(0) { }
    Utf16Char(unsigned short int in) : c(in) { }
    operator unsigned short int () const { return c; }
private:
    unsigned short int c; // UTF16.
};
typedef std::basic_ostream<Utf16Char> uostream;
int main()
{
    __gnu_cxx::stdio_filebuf<Utf16Char> buf_ucerr(stderr, std::ios_base::out);
    uostream ucerr(&buf_ucerr);
    ucerr.flags(std::ios_base::unitbuf);
    // Comingling cerr and ucerr output isn't going to really work.
    // At this nascent stage, this is just for show-and-tell.
    std::cerr
        << (ucerr.good() ? "ucerr is good" : "ucerr is not good")
        << std::endl;
    // Prints: ucerr is good.
    ucerr << Utf16Char(0xFEFF); // BOM, to kick things off.
    // Where is my FF FE hex bytes output?
    for(int i = 0; i < 1000; ++i)
        ucerr << Utf16Char('x');
    // Where are my 00 78 hex bytes on output?
    // Heck, where is ANY of the output going?
    // Oh, gdb says ucerr is in a bad state.
    // But why?
    // What did I miss?
    // How can I fix it?
    std::cerr
        << (ucerr.good() ? "ucerr is good" : "ucerr is not good")
        << std::endl;
    // Prints: ucerr is not good.
}
-----------------------------------------
My immediate goal is to understand how basic_string and basic_istream and
basic_ostream can make my life easier.
Then I want to be able to write a little program that does this:
$ unicat --help
unicat [--utfX] [--Xbom] [-o file] [-i | [--] files...]
    --utf8      output utf8 (default)
    --utf16le   output utf16le
    --utf16be   output utf16be
    --utf32le   output utf32le
    --utf32be   output utf32be
    --bom       output bom (even if incorrect)
    --nobom     suppress bom (even if required)
    --autobom   does the right thing (default)
    -o file     output file, otherwise stdout
    -i          input from stdin, not files...
    --          subsequent parms are files...
    files...    any Unicode encoded format
(I already have a little program that does this, but it is written using a
little state machine and regular ifstream and ofstream on a byte-by-byte
basis. My goal is to understand std::basic_string/stream, not to make this
trivial Unicode text concatenation program.)
Does anyone grok this C++ (and GCC) string and stream magic and the
bewildering locale, facet, codecvt -- and how to get it to work with a
variety of Unicode encoded inputs, in memory Unicode encodings, and Unicode
encoded outputs?
NOTE: I *must* stay away from char and wchar_t. They are insufficiently
portable and reliable for my needs.
HELP! Insights, understandings, explanations, enlightenments welcome,
--Eljay