Tech Support > Microsoft Windows > Drivers > Debugging system hang
Debugging system hang
Posted by wenhua on January 2nd, 2005



Hi,

I'm testing my multi-path storage device driver in a multi-node W2K3 cluster
enviroment and have seen the system hang many times. Here is the test
scenario:
the LUNs are assigned from a SAN, 2 or 4 nodes in cluster, storport driver
and
MPIO driver involved, a test test is running on all nodes to issue io. On
storage system, the 2 storage controler are rebooted perioically and
alternately.

Because don't know which node hangs, so when a node hangs, I take a memory
dump
to debug the problem. Here is what I do for debugging the problem using
WinDbg.

1. Run !ready and !running command
1: kd> !ready
Processor 0: No threads in READY state
Processor 1: No threads in READY state
1: kd> !running

System Processors 3 (affinity mask)
Idle Processors 3
All processors idle.

This shows that the system might hang due to the deadlock.

2. Run command !process 0 1, this shows that the process blast.exe (the test
tool)
has used the kernel time 2'14'' while all other processes just have used
several
seconds. Here is the part of output of the above command:

PROCESS 8905b780 SessionId: 0 Cid: 0b74 Peb: 7ffdf000 ParentCid: 0ab0
DirBase: 7530b000 ObjectTable: e2020e28 HandleCount: 148.
Image: ResrcMon.exe
VadRoot 89bbf430 Vads 111 Clone 0 Private 300. Modified 568. Locked 0.
DeviceMap e1fc58e0
Token e1fc3d00
ElapsedTime 3:09:53.0546
UserTime 0:00:00.0656
KernelTime 0:00:02.0156
QuotaPoolUsage[PagedPool] 34196
QuotaPoolUsage[NonPagedPool] 4696
Working Set Sizes (now,min,max) (1275, 50, 345) (5100KB, 200KB, 1380KB)
PeakWorkingSetSize 1347
VirtualSize 37 Mb
PeakVirtualSize 41 Mb
PageFaultCount 15913
MemoryPriority BACKGROUND
BasePriority 13
CommitCharge 897

PROCESS 89b67020 SessionId: 0 Cid: 07e0 Peb: 7ffdf000 ParentCid: 0940
DirBase: 7ac1f000 ObjectTable: e204efb8 HandleCount: 96.
Image: blast.EXE
VadRoot 89868308 Vads 72 Clone 0 Private 416. Modified 386603. Locked 0.
DeviceMap e1aac590
Token e20699e8
ElapsedTime 2:12:02.0343
UserTime 0:00:13.0828
KernelTime 0:02:14.0578
QuotaPoolUsage[PagedPool] 12328
QuotaPoolUsage[NonPagedPool] 4568
Working Set Sizes (now,min,max) (880, 50, 345) (3520KB, 200KB, 1380KB)
PeakWorkingSetSize 895
VirtualSize 33 Mb
PeakVirtualSize 34 Mb
PageFaultCount 903
MemoryPriority BACKGROUND
BasePriority 8
CommitCharge 500

3. I then run the command !process 0 7 blast.exe to get the detailed stack
trace
of every thread in the process blast.exe . This showed there are 15
threads
belonging to this process, most of them are waiting for the resource
0x88d384c0
by calling ExAcquireResourceSharedLite, this resource is owned by another
thread in the same process. this thread is waiting for some other resource
owned by another thread in the same process. Here is a part of the output
of the above command.

THREAD 8905cb20 Cid 07e0.0680 Teb: 7ffae000 Win32Thread: 00000000
WAIT: (WrResource) KernelMode Non-Alertable
88ee6cb0 Semaphore Limit 0x7fffffff
8905cb98 NotificationTimer
IRP List:
8893fe00: (0006,01fc) Flags: 00000884 Mdl: 00000000
88946008: (0006,01fc) Flags: 00000404 Mdl: 00000000
Not impersonating
DeviceMap e1aac590
Owning Process 89b67020
Wait Start TickCount 1184260 Elapsed Ticks: 48
Context Switch Count 30604
UserTime 00:00:00.0953
KernelTime 00:00:09.0515
Start Address 0x77e4a99b
Win32 Start Address 0x0040195b
Stack Init b528a000 Current b5289110 Base b528a000 Limit b5287000
Call 0
Priority 8 BasePriority 8 PriorityDecrement 0
ChildEBP RetAddr Args to Child
b5289128 804edb2b 8905cbc0 8905cb20 88ee6cb0 nt!KiSwapContext+0x26
(FPO: [EBP 0xb5289150] [0,0,4])
b5289150 804ed790 8905cb20 88d384c0 00000000 nt!KiSwapThread+0x280
b5289184 8051c0fb 88ee6cb0 0000001b 00000000
nt!KeWaitForSingleObject+0x249
b52891c0 804f9c75 e2632b08 e25757b0 804e97e0
nt!ExpWaitForResource+0xd3
b52891e0 bab5d218 88d384c0 00000001 bab8ebb1
nt!ExAcquireResourceSharedLite+0xc7
b52891ec bab8ebb1 88e0db90 e25757b0 00000001
Ntfs!NtfsAcquireResourceShared+0x1b (FPO: [3,0,0])
b5289204 bab63eda 88e0db90 e25757b0 00000001
Ntfs!NtfsAcquireSharedFcbCheckWait+0x35
b5289250 bab8eb29 88e0db90 89015100 e26328b8
Ntfs!NtfsTeardownFromLcb+0xe5
b52892a8 bab64080 88e0db90 e2632980 e2632b08
Ntfs!NtfsTeardownStructures+0x125
b52892d4 bab845d1 88e0db90 e2632980 e2632b08
Ntfs!NtfsDecrementCloseCounts+0xa7
b528935c bab86e5d 88e0db90 e2632980 e26328b8
Ntfs!NtfsCommonClose+0x39b
b52893f0 baba8d70 00000000 00000000 b528953c Ntfs!NtfsFspClose+0xe3
b528951c bab8551f 88ddb6b0 8893fe00 b528955c
Ntfs!NtfsCommonCreate+0x132
b5289620 804f0473 89057020 8893fe00 b528968c Ntfs!NtfsFsdCreate+0x17d
b5289630 f762825a 00000000 b528968c 00000000 nt!IofCallDriver+0x3f
(FPO: [0,0,0])
WARNING: Stack unwind information not available. Following frames may be
wrong.
b5289658 f762ae6b 89057020 88c8bb10 b528968c SYMEVENT+0x125a
891587a0 890b0098 00630000 00000000 00000000
SYMEVENT!SYMEvent_Operation+0x335
891587a0 890b0098 00630000 00000000 00000000 0x890b0098
f7627708 f762ad94 f762ad99 f7628bcf f7628bcf 0x890b0098
f762ac44 000000aa 082444f6 56077401 ffd020e8
SYMEVENT!SYMEvent_Operation+0x25e

4. Run command !stack 0 to get a summary of kernel stack. For blast.exe, it
is:

[89b67020 blast.EXE]
7e0.0000c8 898cea40 000002c Blocked Ntfs!NtfsAcquireResourceShared+0x1b
7e0.000168 8913e8a0 0000032 Blocked Ntfs!NtfsAcquireResourceShared+0x1b
7e0.0001d8 89ae2bf0 0000032 Blocked Ntfs!NtfsAcquireResourceShared+0x1b
7e0.0001dc 8914e7b0 0000031 Blocked Ntfs!NtfsAcquireResourceShared+0x1b
7e0.0001e0 898d1788 0000030 Blocked Ntfs!NtfsAcquireResourceShared+0x1b
7e0.000288 898d1c00 000002b Blocked Ntfs!NtfsAcquireResourceShared+0x1b
7e0.000228 898c2400 0000033 Blocked Ntfs!NtfsAcquireResourceShared+0x1b
7e0.0006c0 8908bb70 0000031 Blocked Ntfs!NtfsAcquireResourceShared+0x1b
7e0.000398 89084020 0000032 Blocked Ntfs!NtfsAcquireResourceShared+0x1b
7e0.000680 8905cb20 0000030 Blocked Ntfs!NtfsAcquireResourceShared+0x1b
7e0.000650 8909c680 0000028 Blocked
Ntfs!NtfsAcquireResourceExclusive+0x1d
7e0.000494 89058350 0000025 Blocked
Ntfs!NtfsAcquireResourceExclusive+0x1d
7e0.0004e0 89051be0 0077d4c Blocked Ntfs!NtfsWaitSync+0x18

The 4th column is the tickcount that the thread has beening running. This
shows that
the thread 0x89051be0 has used the most of time.

5. Run command dt _ERESOURCE 88d384c0

1: kd> dt _ERESOURCE 88d384c0
+0x000 SystemResourcesList : _LIST_ENTRY [ 0x88d2b1b0 - 0x88d38500 ]
+0x008 OwnerTable : 0x883464d0
+0x00c ActiveCount : 1
+0x00e Flag : 0x80
+0x010 SharedWaiters : 0x88ee6cb0
+0x014 ExclusiveWaiters : 0x88e83c48
+0x018 OwnerThreads : [2] _OWNER_ENTRY
+0x028 ContentionCount : 0x11
+0x02c NumberOfSharedWaiters : 0x10
+0x02e NumberOfExclusiveWaiters : 0
+0x030 Address : (null)
+0x030 CreatorBackTraceIndex : 0
+0x034 SpinLock : 0

6. then
1: kd> dt _OWNER_ENTRY 88D384c0+0x18
+0x000 OwnerThread : 0x89058350
+0x004 OwnerCount : 1
+0x004 TableSize : 1

So, the resource 0x88d384c0 is owned by the thread 0x89058350, another
thread
in blast.exe .

7. run !thread 0x89058350

1: kd> !threaad 89058350
No export threaad found
1: kd> !thread 89058350
THREAD 89058350 Cid 07e0.0494 Teb: 7ffac000 Win32Thread: 00000000 WAIT:
(WrResource) KernelMode Non-Alertable
89afe5f8 SynchronizationEvent
890583c8 NotificationTimer
IRP List:
8980ac40: (0006,01fc) Flags: 00000884 Mdl: 00000000
Not impersonating
DeviceMap e1aac590
Owning Process 89b67020
Wait Start TickCount 1184271 Elapsed Ticks: 37
Context Switch Count 32022
UserTime 00:00:01.0125
KernelTime 00:00:12.0750
Start Address 0x77e4a99b
Win32 Start Address 0x0040195b
Stack Init baaa8000 Current baaa7358 Base baaa8000 Limit baaa5000 Call 0
Priority 14 BasePriority 8 PriorityDecrement 6
ChildEBP RetAddr Args to Child
baaa7370 804edb2b 890583f0 89058350 89afe5f8 nt!KiSwapContext+0x26 (FPO:
[EBP 0xbaaa7398] [0,0,4])
baaa7398 804ed790 89058350 89802258 00000000 nt!KiSwapThread+0x280
baaa73cc 8051c0fb 89afe5f8 0000001b 00000000 nt!KeWaitForSingleObject+0x249
baaa7408 804f62f6 00000000 89882e28 8890bb78 nt!ExpWaitForResource+0xd3
baaa7428 bab5d29c 89802258 8890bb01 bab8203c
nt!ExAcquireResourceExclusiveLite+0x8b
baaa7434 bab8203c 8890bb78 89882e28 8890bb01
Ntfs!NtfsAcquireResourceExclusive+0x1d (FPO: [3,0,0])
baaa7458 bab82114 8890bb01 89882e28 8908ccd0 Ntfs!NtfsAcquireExclusiveFcb+0x40
baaa7470 bab9787a 8890bb78 8908ccd0 8890bb78
Ntfs!NtfsAcquireExclusiveScb+0x14 (FPO: [2,0,0])
baaa74e0 bab96379 8890bb78 89015100 00000000 Ntfs!NtfsAllocateMftRecord+0x1a
baaa769c bab8f5cf 8890bb78 8980ac40 8980adf4 Ntfs!NtfsCreateNewFile+0x2fd
baaa78c0 bab8551f 8890bb78 8980ac40 baaa7900 Ntfs!NtfsCommonCreate+0x1200
baaa79c4 804f0473 89015020 8980ac40 baaa7a30 Ntfs!NtfsFsdCreate+0x17d
baaa79d4 f762825a 00000000 baaa7a30 00000000 nt!IofCallDriver+0x3f (FPO:
[0,0,0])
WARNING: Stack unwind information not available. Following frames may be
wrong.
baaa79fc f762ae6b 89015020 88ebaec0 baaa7a30 SYMEVENT+0x125a
89085580 890b00b0 00630000 00000000 00000000 SYMEVENT!SYMEvent_Operation+0x335
89085580 890b00b0 00630000 00000000 00000000 0x890b00b0
f7627708 f762ad94 f762ad99 f7628bcf f7628bcf 0x890b00b0
f762ac44 000000aa 082444f6 56077401 ffd020e8 SYMEVENT!SYMEvent_Operation+0x25e

This thread is now acquiring the resource 0x89892258 and getting stuck.

8. Run dt _ERESOURCE

1: kd> dt _ERESOURCE 89802258
+0x000 SystemResourcesList : _LIST_ENTRY [ 0x89802218 - 0x89c4bb38 ]
+0x008 OwnerTable : (null)
+0x00c ActiveCount : 1
+0x00e Flag : 0x80
+0x010 SharedWaiters : (null)
+0x014 ExclusiveWaiters : 0x89afe5f8
+0x018 OwnerThreads : [2] _OWNER_ENTRY
+0x028 ContentionCount : 0x14e
+0x02c NumberOfSharedWaiters : 0
+0x02e NumberOfExclusiveWaiters : 2
+0x030 Address : (null)
+0x030 CreatorBackTraceIndex : 0
+0x034 SpinLock : 0

9. then
1: kd> dt _OWNER_ENTRY 89802258+0x18
+0x000 OwnerThread : 0x89051be0
+0x004 OwnerCount : 1
+0x004 TableSize : 1

the resource 0x89802258 is owned by thread 0x89051b10

10. Run !thread

1: kd> !thread 89051be0
THREAD 89051be0 Cid 07e0.04e0 Teb: 7ffab000 Win32Thread: 00000000 WAIT:
(Executive) KernelMode Non-Alertable
b4e68f3c NotificationEvent
IRP List:
88930008: (0006,01fc) Flags: 00000884 Mdl: 00000000
Not impersonating
DeviceMap e1aac590
Owning Process 89b67020
Wait Start TickCount 693480 Elapsed Ticks: 490828
Context Switch Count 29320
UserTime 00:00:00.0750
KernelTime 00:00:09.0796
Start Address 0x77e4a99b
Win32 Start Address 0x0040195b
Stack Init b4e6a000 Current b4e68cb4 Base b4e6a000 Limit b4e67000 Call 0
Priority 14 BasePriority 8 PriorityDecrement 5
ChildEBP RetAddr Args to Child
b4e68ccc 804edb2b 89051c80 89051be0 b4e68f3c nt!KiSwapContext+0x26 (FPO:
[EBP 0xb4e68cf4] [0,0,4])
b4e68cf4 804ed790 889bc008 b4e69000 8908ccd0 nt!KiSwapThread+0x280
b4e68d28 bab5ec78 b4e68f3c 00000000 00000000 nt!KeWaitForSingleObject+0x249
b4e68d44 bab5e951 b4e69000 00001000 8908ccd0 Ntfs!NtfsWaitSync+0x18 (FPO:
[1,0,0])
b4e68f00 bab610f8 b4e69000 889bc008 8908ccd0 Ntfs!NtfsNonCachedIo+0x2f3
b4e68fec bab61040 b4e69000 889bc008 00000001 Ntfs!NtfsCommonRead+0xaf1
b4e69194 804f0473 89015020 889bc008 889bc008 Ntfs!NtfsFsdRead+0x20f
b4e691a4 f7628065 0023c000 89ad2298 804f0473 nt!IofCallDriver+0x3f (FPO:
[0,0,0])
WARNING: Stack unwind information not available. Following frames may be
wrong.
b4e691d4 80508fa9 890a1020 89c48b0b 89c48b18 SYMEVENT+0x1065
b4e691f4 80509286 89093500 89c48b38 89c48b18 nt!IoPageRead+0x19
b4e69274 804f4c21 00000001 ce63c000 c03398f0 nt!MiDispatchFault+0x939
b4e692d0 80507494 00000000 ce63c000 00000000 nt!MmAccessFault+0x5ca
b4e692fc 804fdd46 ce63c000 00000000 b4e69428 nt!MmCheckCachedPageState+0x41d
(FPO: [EBP 0xb4e69344] [2,2,0])
b4e69344 804fd656 8914c3f0 b4e69384 00001000 nt!CcMapAndRead+0x93
b4e693d8 80590586 89093500 b4e69418 00000400 nt!CcPinFileData+0x245
b4e6944c bab8ecea 89093500 b4e69484 00000400 nt!CcPinRead+0xc4
b4e69474 bab8ed8b 88ee43f0 8908ccd0 0023c000 Ntfs!NtfsPinStream+0x6f
b4e694d0 bab963aa 88ee43f0 89015100 b4e6958c Ntfs!NtfsPinMftRecord+0x6b
b4e6969c bab8f5cf 88ee43f0 88930008 889301bc Ntfs!NtfsCreateNewFile+0x32e
b4e698c0 bab8551f 88ee43f0 88930008 b4e69900 Ntfs!NtfsCommonCreate+0x1200
b4e699c4 804f0473 89015020 88930008 b4e69a30 Ntfs!NtfsFsdCreate+0x17d
b4e699d4 f762825a 00000000 b4e69a30 00000000 nt!IofCallDriver+0x3f (FPO:
[0,0,0])
b4e699fc f762ae6b 89015020 8905e428 b4e69a30 SYMEVENT+0x125a
89085580 890b00b0 00630000 00000000 00000000 SYMEVENT!SYMEvent_Operation+0x335
89085580 890b00b0 00630000 00000000 00000000 0x890b00b0
f7627708 f762ad94 f762ad99 f7628bcf f7628bcf 0x890b00b0
f762ac44 000000aa 082444f6 56077401 ffd020e8 SYMEVENT!SYMEvent_Operation+0x25e

this thread is waiting on object 0xb4e68f3c, what's the argument
0xb4e69000
passed to function Ntfs!NtfsWaitSync

12. Run

1: kd> dt _ERESOURCE b4e68f3c
+0x000 SystemResourcesList : _LIST_ENTRY [ 0x40000 - 0x0 ]
+0x008 OwnerTable : 0x89051c80
+0x00c ActiveCount : 7296
+0x00e Flag : 0x8905
+0x010 SharedWaiters : 0x89051be0
+0x014 ExclusiveWaiters : 0x00000200
+0x018 OwnerThreads : [2] _OWNER_ENTRY
+0x028 ContentionCount : 0x21000
+0x02c NumberOfSharedWaiters : 0
+0x02e NumberOfExclusiveWaiters : 0
+0x030 Address : 0x80589a87
+0x030 CreatorBackTraceIndex : 0x80589a87
+0x034 SpinLock : 5
1: kd> !object b4e68f3c
b4e68f3c: Not a valid object (ObjectType.Name at 0x0 invalid)

What is b4e68f3c, a resource ???

13. Run !locks

1: kd> !locks 88d384c0

Resource @ 0x88d384c0 Exclusively owned
Contention Count = 17
NumberOfSharedWaiters = 16
Threads: 89058350-01<*> 8ae08018-01 898c2400-01 8913e8a0-01
89ae2bf0-01 89084020-01 8914e7b0-01 8908bb70-01
8905cb20-01 898d1788-01 898d1c00-01 8914f798-01
89806578-01 8ae12018-01 88565da0-01 89815718-01
898cea40-01
1 total locks, 1 locks currently held

This shows that the resource 0x88d384c0 is shared by many threads, the
thread
0x89058350 ownes the resource.




Is my above analysis the right direction to figure out what's wrong ? How do
I
move forward and get this problem resolved.

Any suggestion is welcome.

Thanks in advance.

Wenhua


Similar Posts