Why do I get a ? in front of the text when I open a Unicode text file
Posted by Angus on March 30th, 2007


Hello

I have a file which I created in Notepad on Windows. I pasted some
text from the clipboard into the file and, when I saved it, I chose
Unicode in the Save As options.

But when I attempt to open the file like this:

typedef std::basic_string<TCHAR> tstring;
tstring str;

DWORD dwSize = GetFileSize (hFile, &fs) ;

DWORD dwBytesRead;

TCHAR* szRead = new TCHAR[(dwSize + 1)];

BOOL bSuccess = ReadFile(hFile, szRead, dwSize, &dwBytesRead, NULL) ;

CloseHandle(hFile);

str = szRead;

delete [] szRead;

There is a ? as the first character of the string in szRead. Why? Is
there something I have missed when saving the file?

Posted by [Jongware] on March 30th, 2007

To mark a file as Unicode, Notepad writes the byte sequence FFh FEh as the
first 2 bytes. This is the byte order mark (BOM), the character U+FEFF stored
as little-endian UTF-16, and it is as recommended by the Unicode consortium --
a smart program can (a) see that the file is actually Unicode, and (b) see
what byte ordering is used.
The FFh-FEh sequence works well because the Windows characters FFh and FEh
aren't very likely to start a text file (FYI these are a y with umlaut and a
thorn). The byte-swapped value U+FFFE is guaranteed by the Unicode standard
never to be a character, so a reader that finds FFFEh where it expected FEFFh
knows the byte order is reversed.
Beware, what you now have in memory still starts with the BOM, which is a
marker and not part of the text.
Why the '?' ? Because there is no visible glyph associated with that code --
and there shouldn't be one, either.

Best practice is to examine the first 2 bytes of a file to check if they are
one of the Unicode markers, and ignore them if they are there.
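
A minimal sketch of that check, assuming the file's contents have already been
read into a byte buffer. The names Utf16Bom, detectUtf16Bom and bomLength are
made up for illustration, and only the two UTF-16 markers are handled:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical names for illustration; not part of any Windows API.
enum class Utf16Bom { None, LittleEndian, BigEndian };

// Inspect the first two bytes of a file's contents for a UTF-16 byte order mark.
Utf16Bom detectUtf16Bom(const std::vector<unsigned char>& bytes)
{
    if (bytes.size() < 2)
        return Utf16Bom::None;
    if (bytes[0] == 0xFF && bytes[1] == 0xFE)
        return Utf16Bom::LittleEndian;   // what Notepad's "Unicode" option writes
    if (bytes[0] == 0xFE && bytes[1] == 0xFF)
        return Utf16Bom::BigEndian;
    return Utf16Bom::None;
}

// Number of leading bytes to skip before the actual text starts.
std::size_t bomLength(Utf16Bom bom)
{
    return bom == Utf16Bom::None ? 0 : 2;
}
```

The caller would then start interpreting the text at offset bomLength(...)
instead of at offset 0.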

[Jongware]



Posted by Ulrich Eckhardt on March 30th, 2007


Choosing "Unicode" in Notepad's Save As options will in fact save the file
not as Unicode, which is not a file format, but as UTF-16, which is an
encoding capable of holding the whole Unicode range of characters.

GetFileSize returns the file size in bytes.

new TCHAR[dwSize + 1] therefore allocates a string of TCHARs, one for each
byte in the file plus one more. Use std::vector for such things.

ReadFile in turn is a read operation that counts in bytes.

FYI: A TCHAR isn't necessarily the same as a byte or a char. Also take into
account that you might have to convert the external file to the internal
codeset; it is only under Win32 that a wchar_t is interpreted as UTF-16.
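
To make the bytes-versus-characters point concrete, here is a rough sketch
that takes the raw file bytes in a std::vector and builds the string from
byte pairs. It assumes a C++11 compiler (for std::u16string) and that a file
without a BOM is little-endian; decodeUtf16 is a hypothetical helper, not an
existing API:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode raw UTF-16 file bytes into a std::u16string,
// dropping a leading byte order mark if there is one.
std::u16string decodeUtf16(const std::vector<unsigned char>& bytes)
{
    std::size_t pos = 0;
    bool bigEndian = false;   // assumption: no BOM means little-endian

    if (bytes.size() >= 2) {
        if (bytes[0] == 0xFF && bytes[1] == 0xFE) {
            pos = 2;                      // little-endian BOM, skip it
        } else if (bytes[0] == 0xFE && bytes[1] == 0xFF) {
            pos = 2;                      // big-endian BOM, skip it
            bigEndian = true;
        }
    }

    std::u16string text;
    text.reserve((bytes.size() - pos) / 2);   // two bytes per UTF-16 code unit

    // Combine each byte pair into one 16-bit code unit.
    // A trailing odd byte is ignored.
    for (; pos + 1 < bytes.size(); pos += 2) {
        const unsigned lo = bigEndian ? bytes[pos + 1] : bytes[pos];
        const unsigned hi = bigEndian ? bytes[pos] : bytes[pos + 1];
        text.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return text;
}
```

The vector would be sized with the byte count from GetFileSize and filled by
ReadFile; the resulting string has half as many elements as the file has
bytes, minus the BOM.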

As already pointed out, the '?' is a BOM (in fact it is two bytes, so it
should show up as two characters if TCHAR is char) and the debugger can't
display it correctly anyway.

Uli



