- Questions about MSDN for some DDK functions
- Posted by Norman Diamond on July 8th, 2005
1.
http://msdn.microsoft.com/library/de...ef171a.xml.asp
says:
MSDN gives no exceptions. Even if the uppercase version of the specified
Unicode character requires two Unicode characters to express,
RtlUpcaseUnicodeChar returns it in one WCHAR. May I express some doubts
about this definition?
2.
http://msdn.microsoft.com/library/en...3d3a4b.xml.asp
says:
That makes sense. In a case where the relevant ANSI character fits in a
single byte and was specified, but the uppercase version requires two ANSI
characters and therefore cannot be converted, we can figure out what this
function will do.
The page continues:
That makes sense too. If the relevant ANSI character doesn't fit in a
single byte then we have to convert the character from ANSI to Unicode, then
call RtlUpcaseUnicodeChar, then convert the result from Unicode to ANSI.
But something is missing. If the relevant ANSI character does fit in a
single byte but cannot be directly converted, then in this case also should
we convert the character from ANSI to Unicode and call RtlUpcaseUnicodeChar
and convert the result back? I think not because of problem 1 above. So
there is still no way?
3.
http://msdn.microsoft.com/library/de...3d3a4b.xml.asp
says (for RtlUpperString):
So even if the MaximumLength of DestinationString is longer than the Length
of SourceString and is also long enough to hold the entire uppercase
conversion of SourceString, RtlUpperString will still truncate the result at
the Length of SourceString, will waste the remaining available space, and
will lose some of the characters that should have been converted. Why?
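The truncation described in question 3 can be modeled in user mode. This is only a sketch of the behavior as paraphrased above, not the DDK implementation: `counted_string` and `upper_string_model` are hypothetical names, and the assumption is that the routine copies the smaller of the source Length and the destination MaximumLength, uppercasing each byte on its own.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for a counted ANSI string; not the DDK STRING type. */
typedef struct {
    unsigned short length;         /* bytes currently in use */
    unsigned short maximum_length; /* bytes available in buffer */
    char *buffer;                  /* not necessarily null-terminated */
} counted_string;

/* Assumed model: copy min(src->length, dst->maximum_length) bytes and
 * uppercase each byte independently. The output can never be longer than
 * the input, so a one-to-two mapping such as sharp s to "SS" has no room. */
void upper_string_model(counted_string *dst, const counted_string *src)
{
    assert(dst != NULL && src != NULL);
    unsigned short n = src->length < dst->maximum_length
                           ? src->length
                           : dst->maximum_length;
    for (unsigned short i = 0; i < n; i++) {
        dst->buffer[i] = (char)toupper((unsigned char)src->buffer[i]);
    }
    dst->length = n;
}
```

With a 6-byte source and a 16-byte destination, this model still produces exactly 6 bytes; the spare destination space is never used.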
4.
A little bird told me why RtlIsValidOemCharacter isn't documented. Probably
the thing takes a single byte parameter and returns a single byte result and
is thoroughly incapable of distinguishing valid OEM characters from garbage.
But why is the thing exported? Are there really some callers that call it?
If so, aren't the callers guaranteed to fail? Wouldn't it be better to
delete the function RtlIsValidOemCharacter entirely?
(Where I live, Microsoft products default to code page 932 for both ANSI and
OEM. The contents of boot.ini are read and displayed in Shift-JIS not
Unicode. I've also seen a Microsoft product with a different default code
page where one particular single byte lowercase letter uppercases to SS, but
didn't experiment with kernel mode programming in it.)
- Posted by Don Burn on July 8th, 2005
See comments inline:
"Norman Diamond" <ndiamond@community.nospam> wrote in message
news:uxBBwK6gFHA.3608@TK2MSFTNGP12.phx.gbl...
You are confusing things here. All UNICODE characters are 16 bits; this is
not a multibyte model where some things are lead characters.
There is no direct call to do this. Multibyte character support is limited
in the kernel; the expectation is that you go with UNICODE, where
everything is the same size and therefore faster and safer to use.
I have not checked the source, so this is from memory: RtlUpperString uses
the RtlUpperChar type of conversion model. If a character can be uppercased
and is not part of a multi-character sequence, it is uppercased; otherwise
it is left unchanged.
It is documented in the IFS kit.
--
Don Burn (MVP, Windows DDK)
Windows 2k/XP/2k3 Filesystem and Driver Consulting
Remove StopSpam from the email to reply
- Posted by Skywing on July 8th, 2005
Correct me if I'm wrong, but I think that the source of confusion here is
that the Rtl routines only really support UCS-2 characters.
(There are in fact Unicode characters that use more than 16 bits, but these
are not supported by the Rtl routines as far as I know.)
"Don Burn" <burn@stopspam.acm.org> wrote in message
news:S6tze.1451$yL4.339@fe02.lga...
- Posted by Norman Diamond on July 11th, 2005
"Don Burn" <burn@stopspam.acm.org> wrote in message
news:S6tze.1451$yL4.339@fe02.lga...
I am not confusing anything. In a character set which Microsoft commonly
uses for German and other languages, there is a lowercase letter which is a
single character but which uppercases to two characters SS. In the ANSI
encoding which Microsoft commonly uses for that character set, one single
byte lowercase letter properly uppercases to two single byte uppercase
letters. In Unicode one 16-bit lowercase letter properly uppercases to two
16-bit uppercase letters. I still doubt that RtlUpcaseUnicodeChar performs
the way MSDN documents it.
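The point about return types can be illustrated with a small sketch. It says nothing about what RtlUpcaseUnicodeChar actually does internally; `upcase_simple` and `upcase_full` are hypothetical helpers with a toy mapping table (ASCII letters plus U+00DF only). The first has the one-in, one-out shape that a WCHAR return type forces; the second writes into a caller-supplied buffer so that a one-to-two mapping can be expressed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One-to-one mapping: the return type has room for exactly one 16-bit unit. */
uint16_t upcase_simple(uint16_t ch)
{
    if (ch >= 0x0061 && ch <= 0x007A) {      /* 'a'..'z' */
        return (uint16_t)(ch - 0x0020);
    }
    return ch; /* U+00DF has no single-unit uppercase, so it comes back as is */
}

/* Full mapping: writes 1 or 2 units into out and returns how many. */
size_t upcase_full(uint16_t ch, uint16_t out[2])
{
    assert(out != NULL);
    if (ch == 0x00DF) {                      /* sharp s uppercases to "SS" */
        out[0] = 0x0053;
        out[1] = 0x0053;
        return 2;
    }
    out[0] = upcase_simple(ch);
    return 1;
}
```

Any function shaped like `upcase_simple` has to either leave U+00DF alone or return something other than its proper uppercase form; only the buffer-based shape can return SS.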
Fine, we agree that Microsoft's ANSI functions are incapable of uppercasing
the German letter referred to above. But here you say the expectation is to
go with Unicode, where it still looks like Microsoft's Unicode functions
don't deal with it the way MSDN says.
Meanwhile as originally mentioned, some of Microsoft's code still uses ANSI
(for example, in the country where I live, boot.ini is both read and
displayed in ANSI code page 932 = OEM code page 932). This kind of thing
doesn't get solved by setting an expectation for me to go with Unicode.
That looks like an answer to 0% of what I asked in question 3. Did I
misunderstand your answer?
Yikes. From that I infer that there really are some callers that call it,
and that is why it is exported. But surely the thing is thoroughly
incapable of distinguishing valid OEM characters from garbage? I messed up
in guessing that it probably returns a single byte result because obviously
it returns some representation of a Boolean, but what is the type of its
argument? It sounds like you have the IFS kit and its documentation, so
please kindly look it up. Does it really take a parameter of type pointer
to multibyte character (i.e. char* or something equivalent to that or
pointer to an unsigned or signed version) and does it really parse all chars
of the character to check for validity?
- Posted by Maxim S. Shatskih on July 11th, 2005
Really? From what I know, Windows Unicode only supports 16bit characters, and
no multi-character sequences. Am I wrong?
--
Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
maxim@storagecraft.com
http://www.storagecraft.com
- Posted by Don Burn on July 11th, 2005
Since all the comments talk about single character UNICODE, I am confused on
how you can have a multi-char unicode item!
"Norman Diamond" <ndiamond@community.nospam> wrote in message
news:uae4j9ahFHA.3608@TK2MSFTNGP12.phx.gbl...
- Posted by Skywing on July 11th, 2005
Win32k.sys (and thus user32/gdi32) support UTF-16 character encodings as of
Win2K, IIRC. So, yes, there is some support.
From MSDN, WM_CHAR documentation:
"Remarks
The WM_CHAR message uses Unicode Transformation Format (UTF)-16. "
"Maxim S. Shatskih" <maxim@storagecraft.com> wrote in message
news:uEGKjQghFHA.2904@tk2msftngp13.phx.gbl...
- Posted by Maxim S. Shatskih on July 11th, 2005
IIRC UTF-16 means - any char is UINT16. The UTF-16 charset contains no
other chars.
"Skywing" <skywing_NO_SPAM_@valhallalegends.com> wrote in message
news:%23xBkNSkhFHA.3164@TK2MSFTNGP15.phx.gbl...
- Posted by Don Burn on July 11th, 2005
And from the Unicode Standards Group web page:
Q. I understand that all Unicode characters are 16 bits, and that the high
byte is used to switch between code blocks. Is that correct?
A. Absolutely not! Unicode characters may be encoded at any code point from
U+0000 to U+10FFFF. The size of the code unit used for expressing those code
points may be 8 bits (for UTF-8), 16 bits (for UTF-16), or 32 bits (for
UTF-32) [See UTF & BOM]. Even when Unicode characters are expressed with
16-bit code units, there is no concept of a high byte switching values
between "code pages" expressed in the low byte. The entire 16-bit value
expresses the entire character, period. [KW]
UTF-16 is 16-bit characters only.
"Skywing" <skywing_NO_SPAM_@valhallalegends.com> wrote in message
news:%23xBkNSkhFHA.3164@TK2MSFTNGP15.phx.gbl...
- Posted by Skywing on July 11th, 2005
No. You're thinking of UCS-2, which only supports 64K unique character
values. UTF-16 supports characters whose code values need more than 16 bits
of information by using an escape sequence.
See rfc 2781 "UTF-16, an encoding of ISO 10646" at
http://www.ietf.org/rfc/rfc2781.txt for the technical details if you're
interested.
"Maxim S. Shatskih" <maxim@storagecraft.com> wrote in message
news:%23gVa0ikhFHA.2472@TK2MSFTNGP15.phx.gbl...
- Posted by Skywing on July 11th, 2005
My understanding is that UTF-16 supports using multiple 16-bit integers to
encode characters with values > 0xffff.
From rfc 2781 http://www.ietf.org/rfc/rfc2781.txt
" The rules for how characters are encoded in UTF-16 are:
- Characters with values less than 0x10000 are represented as a
single 16-bit integer with a value equal to that of the character
number.
- Characters with values between 0x10000 and 0x10FFFF are
represented by a 16-bit integer with a value between 0xD800 and
0xDBFF (within the so-called high-half zone or high surrogate
area) followed by a 16-bit integer with a value between 0xDC00 and
0xDFFF (within the so-called low-half zone or low surrogate area).
- Characters with values greater than 0x10FFFF cannot be encoded in
UTF-16."
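The three rules quoted above can be sketched in portable C. `utf16_encode` is a hypothetical helper name, and rejecting code points in the surrogate range 0xD800 to 0xDFFF is an extra assumption of this sketch rather than one of the quoted rules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes one code point as UTF-16.
 * Returns the number of 16-bit integers written (1 or 2),
 * or 0 if the value cannot be encoded. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF) {
        return 0;                       /* rule 3: not encodable */
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;                       /* assumption: surrogate values rejected */
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* rule 1: single 16-bit integer */
        return 1;
    }
    cp -= 0x10000;                      /* rule 2: 20 bits split 10 + 10 */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

For example, 0x20000 (the first code point of plane two) encodes as 0xD840 followed by 0xDC00, and 0x10FFFF encodes as 0xDBFF followed by 0xDFFF.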
"Don Burn" <burn@stopspam.acm.org> wrote in message
news:enzAe.3343$5R1.3243@fe07.lga...
- Posted by Maxim S. Shatskih on July 11th, 2005
Let's ask MS for this.
I'm absolutely sure that Windows Unicode has all characters = UINT16. No
multi-word characters. You can parse any string by just ++, without any
IsItALeadWord stuff.
"Skywing" <skywing_NO_SPAM_@valhallalegends.com> wrote in message
news:O2sQ%23vkhFHA.2156@TK2MSFTNGP14.phx.gbl...
- Posted by Skywing on July 11th, 2005
I think that you're correct as far as ntoskrnl/ntdll go (no multi-word
characters), but user/gdi are apparently documented to support UTF-16.
"Maxim S. Shatskih" <maxim@storagecraft.com> wrote in message
news:uMJsFzkhFHA.1148@TK2MSFTNGP12.phx.gbl...
- Posted by Maxim S. Shatskih on July 11th, 2005
Then this means that MS understands UTF-16 this way.
Same strings - filenames and such - which are used in the UI are used in
the kernel.
"Skywing" <skywing_NO_SPAM_@valhallalegends.com> wrote in message
news:OjHH78khFHA.1244@TK2MSFTNGP14.phx.gbl...
- Posted by Skywing on July 11th, 2005
In the same sense that a program which understands ASCII understands UTF-8,
yes.
"Maxim S. Shatskih" <maxim@storagecraft.com> wrote in message
news:eScN4flhFHA.576@TK2MSFTNGP15.phx.gbl...
- Posted by Maxim S. Shatskih on July 11th, 2005
Can Windows support the Unicode encoding where some chars are several 16-bit
words (a leading one and then one or more trailing ones)? The Unicode
standard defines such an encoding.
Or does Windows always assume that 1 WCHAR = 1 character?
"Skywing" <skywing_NO_SPAM_@valhallalegends.com> wrote in message
news:uT6H6ilhFHA.3544@TK2MSFTNGP15.phx.gbl...
- Posted by Skywing on July 11th, 2005
It depends on what API you're using, it seems.
Some more details:
http://www.microsoft.com/globaldev/D...8/default.mspx
http://blogs.msdn.com/michkap/archiv...11/416552.aspx
http://blogs.msdn.com/michkap/archiv...12/416735.aspx
From the Platform SDK [Surrogates, International Features]
"Surrogates
There is a need to support more characters than the 65,536 that fit in the
16-bit Unicode code space. For example, the Chinese speaking community alone
uses over 55,000 characters. To answer this need, the Unicode Standard
defines surrogates. A surrogate or surrogate pair is a pair of 16-bit
Unicode code values that represent a single character. The first (high)
surrogate is a 16-bit code value in the range U+D800 to U+DBFF. The second
(low) surrogate is a 16-bit code value in the range U+DC00 to U+DFFF. Using
surrogates, Unicode can support over one million characters. For more
details about surrogates, refer to The Unicode Standard, version 2.0.
Windows 2000 introduces support for basic input, output, and simple sorting
of surrogates. However, not all system components are surrogate compatible.
Also, surrogates are not supported in Windows 95/98/Me.
The system supports surrogates in the following ways:
a. The cmap 12 OpenType font format is introduced, which directly
supports the 4-byte character code. Refer to the OpenType font specification
for more detail.
b. Windows USER supports surrogate-enabled IMEs.
c. Windows GDI APIs support cmap 12 so surrogates can be displayed
correctly.
d. Uniscribe APIs support surrogates.
e. Windows controls, including Edit and Rich Edit, support surrogates.
f. HTML engine supports HTML page that includes surrogates for display,
editing (through Outlook Express), and forms submission.
g. System sorting table supports surrogates.
h. Planes two and three (defined in ISO/IEC 10646) are reserved for
ideographic characters. These planes fall in the high surrogate range of
U+D840 to U+D8BF."
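The practical difference between "1 WCHAR = 1 character" code and surrogate-aware code can be sketched as follows. The helper names are hypothetical, `uint16_t` stands in for WCHAR, and counting an unpaired surrogate as one character is a choice made for this sketch only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Surrogate ranges as given in the Platform SDK text quoted above. */
int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Counts characters (code points) in a counted buffer of 16-bit units.
 * A high surrogate followed by a low surrogate counts once. */
size_t utf16_count_code_points(const uint16_t *s, size_t units)
{
    assert(s != NULL || units == 0);
    size_t count = 0;
    for (size_t i = 0; i < units; i++) {
        if (is_high_surrogate(s[i]) && i + 1 < units &&
            is_low_surrogate(s[i + 1])) {
            i++; /* skip the low half of the pair */
        }
        count++;
    }
    return count;
}
```

A plain `++` loop over the same buffer would report 4 for the units 0x0041, 0xD840, 0xDC00, 0x0042, while this loop reports 3.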
"Maxim S. Shatskih" <maxim@storagecraft.com> wrote in message
news:uCkE1vlhFHA.2152@TK2MSFTNGP14.phx.gbl...
- Posted by Pavel A. on July 12th, 2005
Ok, ok... this is about Chinese.
What about less widespread languages, German for example?
How come that the original 16-bit table does not fully cover German?
--PA
"Skywing" <skywing_NO_SPAM_@valhallalegends.com> wrote in message news:%23wvXyMmhFHA.2372@TK2MSFTNGP14.phx.gbl...
- Posted by Maxim S. Shatskih on July 12th, 2005
Thanks.
The docs on MultiByteToWideChar say that some Chinese code pages have
limitations on how this routine can be called. Maybe these are the only code
pages which have surrogates.
"Skywing" <skywing_NO_SPAM_@valhallalegends.com> wrote in message
news:%23wvXyMmhFHA.2372@TK2MSFTNGP14.phx.gbl...
- Posted by Norman Diamond on July 12th, 2005
"Maxim S. Shatskih" <maxim@storagecraft.com> wrote in message
news:uEGKjQghFHA.2904@tk2msftngp13.phx.gbl...
[Norman Diamond:]
I agree with you that in most contexts Windows Unicode only supports 16-bit
characters. A multi-character sequence is a string. For example the
multi-character sequence L'S' followed by L'S' followed by L'\0' is the
multi-character string L"SS". Many parts of Windows (especially in the
kernel) support counted multi-character strings without null terminators but
I couldn't write it in simple C for this posting, sorry. Anyway, that isn't
the problem.
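For reference, a counted string without a null terminator can be written in plain C roughly like this. `counted_wstring` is a hypothetical stand-in laid out like the DDK's UNICODE_STRING (lengths in bytes, not characters), with `uint16_t` standing in for WCHAR; `counted_wstring_append` is likewise made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in shaped like UNICODE_STRING. */
typedef struct {
    uint16_t length;         /* bytes in use */
    uint16_t maximum_length; /* bytes available in buffer */
    uint16_t *buffer;        /* no null terminator required */
} counted_wstring;

/* Appends one 16-bit unit. Returns 1 on success, 0 if the buffer is full. */
int counted_wstring_append(counted_wstring *s, uint16_t unit)
{
    assert(s != NULL && s->buffer != NULL);
    if ((size_t)s->length + sizeof(uint16_t) > s->maximum_length) {
        return 0;
    }
    s->buffer[s->length / sizeof(uint16_t)] = unit;
    s->length = (uint16_t)(s->length + sizeof(uint16_t));
    return 1;
}
```

Appending L'S' twice to a two-element buffer gives the string SS with a length of 4 bytes and no terminator; a third append fails because there is no room left.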
There is a lowercase letter that is one 16-bit character in Unicode
(including Windows Unicode). Although not relevant to this subthread of my
original questions, it is also one 8-bit character in an ANSI code page that
is commonly used in Windows (though not in the country where I live). In
either encoding, when that letter is uppercased it becomes a sequence of two
letters, SS.
MSDN says that RtlUpcaseUnicodeChar returns a value of type WCHAR not
pointer to array of WCHAR values. So I doubt very much that
RtlUpcaseUnicodeChar can return a sequence of two letters, SS. MSDN goes on
to say:
definition.