- Why do I get a ? in front of the text when I open a Unicode text file
- Posted by Angus on March 30th, 2007
Hello
I have a file which I created in Notepad on Windows. I pasted some
text from the clipboard into the file, and when I saved it I chose
Unicode in the Save As options.
But when I attempt to open the file like this:
typedef std::basic_string<TCHAR> tstring;
tstring str;
DWORD dwSize = GetFileSize (hFile, &fs) ;
DWORD dwBytesRead;
TCHAR* szRead = new TCHAR[(dwSize + 1)];
BOOL bSuccess = ReadFile(hFile, szRead, dwSize, &dwBytesRead, NULL) ;
CloseHandle(hFile);
str = szRead;
delete [] szRead;
There is a ? as the first character of the string in szRead. Why? Is
there something I have missed when saving the file?
- Posted by [Jongware] on March 30th, 2007
"Angus" <anguscomber@gmail.com> wrote in message
news:1175247903.905060.204040@e65g2000hsc.googlegroups.com...
To mark a file as Unicode, Notepad writes the byte sequence FFh FEh as the
first 2 bytes of the file. This is the byte order mark (BOM) recommended by
the Unicode Consortium -- a smart program can (a) see that the file is
actually Unicode, and (b) see what byte ordering is used.
The FFh-FEh sequence was chosen because the Windows characters FFh and FEh
aren't very likely to start a text file (FYI these are a y with umlaut and a
thorn in the Windows-1252 code page). The Unicode standard assigns U+FEFF to
the BOM and declares its byte-swapped counterpart U+FFFE a noncharacter, so a
reader that sees FFFEh knows it has the byte order reversed.
Beware, what you now have in memory is a string that starts with the BOM
rather than with your text.
Why the '?' ? Because there is no visible glyph associated with that code --
and there shouldn't be one, either.
Best practice is to examine the first 2 bytes of a file to check if they are
one of the Unicode markers, and ignore them if they are there.
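A minimal sketch of that check, assuming the file contents are already in memory as raw bytes. The names Bom, detect_utf16_bom and bom_length are made up for this illustration; they are not part of the Win32 API or the standard library.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper type: which UTF-16 marker, if any, starts the buffer.
enum class Bom { None, Utf16LE, Utf16BE };

// Inspect the first two bytes of a buffer for a UTF-16 byte order mark.
// FF FE is U+FEFF stored little-endian (what Notepad writes for "Unicode"),
// FE FF is the same code point stored big-endian.
inline Bom detect_utf16_bom(const std::vector<unsigned char>& bytes) {
    if (bytes.size() >= 2) {
        if (bytes[0] == 0xFF && bytes[1] == 0xFE) return Bom::Utf16LE;
        if (bytes[0] == 0xFE && bytes[1] == 0xFF) return Bom::Utf16BE;
    }
    return Bom::None;
}

// Number of leading bytes to skip before the actual text starts.
inline std::size_t bom_length(Bom bom) {
    return bom == Bom::None ? 0 : 2;
}
```

A caller would start reading the text at offset bom_length(detect_utf16_bom(bytes)) instead of at offset 0.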
[Jongware]
- Posted by Ulrich Eckhardt on March 30th, 2007
Angus wrote:
This will in fact save it not as "Unicode", which is not a file format, but
as UTF-16, which is an encoding capable of holding the whole Unicode range of
characters.
GetFileSize returns the file size in bytes.
new TCHAR[dwSize + 1] allocates a string of TCHARs, one for each byte in the
file plus one more. Use std::vector for such things.
ReadFile, in turn, is a read operation that counts in bytes.
FYI: A TCHAR isn't necessarily the same as a byte or a char. Also take into
account that you might have to convert the external file to the internal
codeset; it is only under win32 that a wchar_t is interpreted as UTF-16.
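To illustrate the byte-versus-character distinction, here is a rough sketch that reads the file as raw bytes into a std::vector and then converts those bytes to 16-bit code units, skipping a leading BOM. It uses standard C++ streams rather than ReadFile so that it is self-contained, and it assumes the file is little-endian UTF-16 as Notepad writes it. read_all_bytes and decode_utf16le are names invented for this example.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Read the whole file as raw bytes. The vector owns the memory, so there is
// no new[]/delete[] pair to get wrong. Returns an empty vector if the file
// cannot be opened.
inline std::vector<unsigned char> read_all_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::vector<unsigned char>(std::istreambuf_iterator<char>(in),
                                      std::istreambuf_iterator<char>());
}

// Convert little-endian UTF-16 bytes to 16-bit code units, dropping a
// leading FF FE byte order mark if present. Two bytes make one code unit;
// a trailing odd byte is ignored.
inline std::u16string decode_utf16le(const std::vector<unsigned char>& bytes) {
    std::size_t i = 0;
    if (bytes.size() >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE) {
        i = 2;  // skip the BOM
    }
    std::u16string out;
    out.reserve((bytes.size() - i) / 2);
    for (; i + 1 < bytes.size(); i += 2) {
        out.push_back(static_cast<char16_t>(bytes[i] | (bytes[i + 1] << 8)));
    }
    return out;
}
```

The result of decode_utf16le(read_all_bytes(path)) carries its own length, so no manual null terminator is needed, and it no longer starts with the BOM.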
As already pointed out, this is a BOM (in fact it should be two characters)
and the debugger can't display it correctly anyway.
Uli