Using SCASB opcode asm or intrinsic
Posted by Vetzak on October 21st, 2005



Hello

I'm writing a driver that exposes a serial interface (ntddser.h
compatible). The driver is required to scan all incoming data bytes for
2 special characters, the "event" character and the "break" character.
At the moment I use for-loops to scan for these special characters.
However, on x86 32-bit platforms there's the SCASB opcode which is
quicker.

I tried to find an intrinsic function for scasb in the Visual C/C++
compiler but I didn't find one. Does somebody know how to add it?
Alternatively, does somebody know how to inline the scasb opcode? Thx.
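
For reference, a minimal sketch of what inlining the opcode might look like with the 32-bit Visual C++ inline assembler. The helper name scan_byte and its signature are hypothetical and not from this thread; the __asm path is assumed to apply only to 32-bit x86 builds with the Microsoft compiler, and every other build falls through to a plain loop with the same contract.

```c
#include <stddef.h>

/* Hypothetical helper (not from the thread): returns a pointer to the first
 * byte equal to `target` in buf[0..len), or NULL if there is none. */
static const unsigned char *scan_byte(const unsigned char *buf, size_t len,
                                      unsigned char target)
{
#if defined(_MSC_VER) && defined(_M_IX86)
    /* Assumption: 32-bit MSVC build, where the __asm keyword is available. */
    const unsigned char *hit = NULL;

    if (len == 0)
        return NULL;          /* with ECX == 0, REPNE SCASB sets no flags */

    __asm {
        mov   edi, buf        ; EDI = start of the buffer
        mov   ecx, len        ; ECX = maximum number of bytes to scan
        mov   al, target      ; AL  = byte to look for
        cld                   ; scan forward
        repne scasb           ; stop on a match or when ECX reaches 0
        jne   not_found       ; ZF clear: no match
        dec   edi             ; EDI is one past the matching byte
        mov   hit, edi
    not_found:
    }
    return hit;
#else
    /* Portable fallback with the same contract. */
    size_t i;

    for (i = 0; i < len; ++i)
        if (buf[i] == target)
            return buf + i;
    return NULL;
#endif
}
```

Whether this beats the compiler's own loop is exactly what the replies below dispute, so it should be measured rather than assumed.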

Posted by Pavel A. on October 21st, 2005


Strange. ntddk.h has intrinsics for movsb, bsf, bsr but not for scasb.

--PA


Posted by Maxim S. Shatskih on October 21st, 2005


Is it really so on modern CPUs? IIRC, starting with the Pentium such opcodes are
obsolete, since they are slower than the normal code: the normal code is
pipelined, and exotic opcodes use microcode.

--
Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
maxim@storagecraft.com
http://www.storagecraft.com



Posted by 440gtx@email.com on October 22nd, 2005


That's right. Even "rep movs" is slower than individual instructions in
a loop since the debut of the 486 processor. Similarly, putting inline
assembly for scasb is also likely to slow down your code not just
because it is slower than a loop, but because it can interfere with
optimization in the function.

If you are trying to measure the difference, be careful not to time one
right after the other, because caching effects will favor the code run
second.


Posted by Vetzak on October 22nd, 2005



The question is, which individual opcodes beat the "rep movs" or "rep
scas" opcodes? This is what my for-loop looks like in the release build
of my driver:

.text:00012393 mov dl, [ecx+0ABh]
.text:00012399 jbe short loc_123B6
.text:0001239B
.text:0001239B loc_1239B: ; For-loop starts here
.text:0001239B mov edi, [ebp+var_4]
.text:0001239E cmp [edi+eax], dl
.text:000123A1 jz short loc_123B0
.text:000123A3 inc [ebp+var_4]
.text:000123A6 mov edi, [ebp+arg_4]
.text:000123A9 cmp [ebp+var_4], edi
.text:000123AC jb short loc_1239B

(Disregard instruction at .text:00012399)

5 out of 7 instructions in the code for the for-loop access stack
variables. If I were to write code like this in assembler, I'd use
registers.

Another remark I have is that the memcpy() is translated to "rep movs"
instructions, like this code:

.text:000156F8 rep movsd
.text:000156FA mov ecx, eax
.text:000156FC and ecx, 3
.text:000156FF rep movsb

So if rep movs is slower than individual opcodes, does the Visual C/C++
compiler generate bad code?

Posted by Pavel A. on October 22nd, 2005


Comparing an incoming byte with just two "special characters"
probably can be coded shorter and faster than setup of scasb.
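
A minimal sketch of that idea in C, assuming a single pass that tests each byte against both characters. The function name scan_special and the HIT_* flags are made up for illustration and do not come from ntddser.h.

```c
#include <stddef.h>

/* Hypothetical result flags, not taken from ntddser.h. */
#define HIT_EVENT 0x1u
#define HIT_BREAK 0x2u

/* One pass over the buffer, checking every byte against both special
 * characters. Pointer and limit are locals, so the compiler is free to
 * keep them in registers. */
static unsigned scan_special(const unsigned char *buf, size_t len,
                             unsigned char event_char,
                             unsigned char break_char)
{
    const unsigned char *p = buf;
    const unsigned char *end = buf + len;
    unsigned hits = 0;

    for (; p != end; ++p) {
        unsigned char c = *p;

        if (c == event_char)
            hits |= HIT_EVENT;
        if (c == break_char)
            hits |= HIT_BREAK;
    }
    return hits;
}
```

A scasb-based version would need one scan, with its own register setup, per special character; this loop reads each byte once.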

--PA


Posted by Tim Roberts on October 23rd, 2005


440gtx@email.com wrote:

That's not true. As long as you're doing more than 7 iterations, "rep
movs" is faster than the equivalent loop.
--
- Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.

Posted by 440gtx@email.com on October 24th, 2005


Admittedly I have not visited this topic in years so I freshly
benchmarked moving 64K comparing "rep movsd" versus this loop:

l: mov eax,[esi+0]
mov ebx,[esi+4]
mov [edi+0],eax
mov [edi+4],ebx
add esi,8
add edi,8
sub ecx,8
jnz l

The benchmark was run by executing each algorithm 8 times in succession
and choosing the quickest run, to rule out interrupts and caching
impacting the results. The results are as follows:
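
The "quickest of N runs" method can be sketched in portable C roughly as follows. This is not the harness used for the numbers in this post; the names best_of, bench_fn, copy_job and run_memcpy are hypothetical, and clock() is assumed here only for portability (its resolution is coarse, so a real measurement would repeat the copy many times per run or use a cycle counter).

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical callback type: one timed unit of work. */
typedef void (*bench_fn)(void *ctx);

/* Runs fn `runs` times and returns the quickest run in seconds, so that
 * runs disturbed by interrupts or cold caches are discarded.
 * Returns a negative value if runs <= 0. */
static double best_of(bench_fn fn, void *ctx, int runs)
{
    double best = -1.0;
    int i;

    for (i = 0; i < runs; ++i) {
        clock_t t0, t1;
        double secs;

        t0 = clock();
        fn(ctx);
        t1 = clock();
        secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (best < 0.0 || secs < best)
            best = secs;
    }
    return best;
}

/* Example workload: one 64K-style copy described by a small job struct. */
struct copy_job {
    unsigned char *dst;
    const unsigned char *src;
    size_t len;
};

static void run_memcpy(void *ctx)
{
    struct copy_job *job = ctx;

    memcpy(job->dst, job->src, job->len);
}
```

Calling best_of(run_memcpy, &job, 8) once per candidate routine mirrors the 8-run procedure described above.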

Pentium IV: rep movsd is 270% faster (wow!!!)
Athlon 64: rep movsd is 15% SLOWER
Pentium II: rep movsd is 10% faster


Posted by Vetzak on October 24th, 2005


Some remarks. Your routine moves 2x 4 bytes per loop, while "rep movsd"
moves 1x 4 bytes at a time. Also, "sub ecx,8" should be "sub ecx,1" or
"dec ecx".

A better benchmark would be:

mov ebx,4
L:
mov eax,[esi]
mov [edi],eax
add esi,ebx
add edi,ebx
dec ecx
jnz L

Posted by 440gtx@email.com on October 24th, 2005


That's right. What's wrong with making it as fast as possible? I also
experimented with unrolling the loop to 16 and 32 bytes at a time, but it
yielded only slight gains and sometimes losses in performance. This is
not cheating. Copies that are not a multiple of 8 bytes are easy to
handle, just as the movsd version needs back-end code for the remainder
when the count is not a multiple of 4.
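
As a rough C illustration of that remainder handling (not the benchmarked assembly), assuming two 32-bit loads and stores per iteration followed by a byte-wise tail. The name copy_by_8 is made up for this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies len bytes, 8 at a time, then finishes the 0..7 leftover bytes
 * one by one. The fixed-size memcpy calls stand in for the 32-bit loads
 * and stores of the assembly loop and avoid alignment assumptions. */
static void copy_by_8(unsigned char *dst, const unsigned char *src, size_t len)
{
    size_t chunks = len / 8;   /* full 8-byte iterations */
    size_t tail = len % 8;     /* bytes left for the back-end loop */

    while (chunks--) {
        uint32_t lo, hi;

        memcpy(&lo, src, 4);
        memcpy(&hi, src + 4, 4);
        memcpy(dst, &lo, 4);
        memcpy(dst + 4, &hi, 4);
        src += 8;
        dst += 8;
    }
    while (tail--)
        *dst++ = *src++;
}
```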

ECX had been loaded with the count in bytes so this one is ok. I know I
didn't post the entire thing so no way for you to know that.
Conversely, the "rep movsd" method had ECX loaded with the (count / 4).

I benchmarked that and it proved to always be the slowest way to go.


Posted by Maxim S. Shatskih on October 24th, 2005


Now let's try benchmarking REPNE SCASB and compare it to memchr(). REP
MOVSB can be specifically optimized, since Intel knows that this is how
memcpy() is implemented in most compilers.

--
Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
maxim@storagecraft.com
http://www.storagecraft.com


