- Catastrophic Corruption of Dynamic Disks
- Posted by Will on February 14th, 2008
I had a really disappointing event take place today with a critical system
that runs Windows 2003 32-bit. Effectively *all* of the dynamic disk
structures were corrupted, even though we were not working on them at the
time of the reboot. Upon reboot the system gives some brief message about
the disk being corrupt, and the Windows 2003 boot sequence never starts.
Looking at all of the drives inside the Disk Management utility in
Microsoft's ERD Commander Boot CD, the Dynamic volumes show in the state
"Offline". A Google search seems to suggest that Dynamic volumes in an
offline state normally mean that the Dynamic volume information is corrupt
and cannot be loaded.
What is particularly horrifying to me is we have two separate hardware RAID
controllers with three and five volumes respectively, and then we had
mirrored drives across those different controllers using Windows 2003
mirroring. When we rebooted, the corruption of the Dynamic volume
information resulted in ALL EIGHT drives effectively disappearing and going
"offline". So while having each Dynamic volume carry information about the
other volumes has its advantages, the downside of this design is now very
clear to me. If you hit any bug that writes the Dynamic volume information
incorrectly, you are going to lose EVERYTHING on that system that is
Dynamic!
We are in the process of recovering by hacking volume structures to convert
the Dynamic back to Simple volumes, and so far that seems to be going in the
right direction.
Can someone explain to me under what scenarios this kind of dramatic
corruption of Dynamic volume structures can take place? If I have a loose
end in my hardware, I need to know what possibilities to chase. I didn't
have a good backup of this system since it was just being built up, and
losing it would have been a complete catastrophe.
--
Will
- Posted by Will on February 14th, 2008
In attempting to recover the boot device from our failure of dynamic disks,
we did these steps:
1) Converted the boot volume from Dynamic to Simple using DSKPROBE from
inside ERD Commander 2005.
2) Made the Simple partition Active, probably from Disk Management in ERD
Commander 2005.
3) Ran chkdsk /r from the Windows 2003 recovery console.
4) Ran Fixboot from recovery console
5) Ran FixMBR from recovery console
In spite of all of these steps, any attempt to boot from the system
volume gets:
"Error Loading Operating System"
I normally associate that message with a BIOS configuration problem or a
hard drive cylinder mapping issue. I am not finding the problem in our
case. Can someone give me more detail about this error and how to overcome
it?
Would an incorrect disk number inside of BOOT.INI ever cause this error?
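On the BOOT.INI question, the disk number lives in the rdisk() field of each ARC path. Below is a minimal Python sketch that pulls those fields out so the rdisk value can be compared against the disk order the BIOS presents. The sample boot.ini text and the helper names (parse_arc_path, boot_entries) are illustrative assumptions, not taken from the affected system. Note also that boot.ini is read by NTLDR, which only runs after the MBR and boot sector code, so a wrong entry there would normally show up later in the boot than a message printed at the MBR stage.

```python
import re

# ARC path syntax as it appears in boot.ini, for example:
#   multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
ARC_RE = re.compile(
    r"^(?P<adapter_type>multi|scsi)\((?P<adapter>\d+)\)"
    r"disk\((?P<disk>\d+)\)"
    r"rdisk\((?P<rdisk>\d+)\)"
    r"partition\((?P<partition>\d+)\)"
    r"(?P<path>\\.*)?$",
    re.IGNORECASE,
)


def parse_arc_path(arc):
    """Split one ARC path into its numeric fields and the Windows directory."""
    m = ARC_RE.match(arc.strip())
    if m is None:
        raise ValueError("not a recognized ARC path: %r" % arc)
    fields = {
        "adapter_type": m.group("adapter_type").lower(),
        "path": m.group("path") or "",
    }
    for key in ("adapter", "disk", "rdisk", "partition"):
        fields[key] = int(m.group(key))
    return fields


def boot_entries(boot_ini_text):
    """Return (parsed_arc, description) pairs from the [operating systems] section."""
    entries = []
    in_os_section = False
    for raw in boot_ini_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            in_os_section = line.lower() == "[operating systems]"
            continue
        if in_os_section and "=" in line:
            arc, _, description = line.partition("=")
            entries.append((parse_arc_path(arc), description.strip()))
    return entries


# Hypothetical boot.ini contents, for illustration only.
SAMPLE = r'''[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003" /fastdetect
multi(0)disk(0)rdisk(7)partition(1)\WINDOWS="Windows Server 2003 (disk 7)" /fastdetect
'''

if __name__ == "__main__":
    for arc, description in boot_entries(SAMPLE):
        print(arc["rdisk"], arc["partition"], arc["path"], description)
```

Adding a second entry with a different rdisk() value, as in the sample, is one low-risk way to test whether the disk ordering seen at boot differs from what is expected.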
--
Will
- Posted by Pegasus (MVP) on February 14th, 2008
Try to separate the boot process from the Windows startup process,
by booting the machine with a Windows boot diskette. Format a
floppy disk on some Windows 2000/XP machine, then copy these
files to it:
- ntldr
- ntdetect.com
- boot.ini
- Posted by Will on February 14th, 2008
"Pegasus (MVP)" <I.can@fly.com.oz> wrote in message
news:O3Ab4dubIHA.1168@TK2MSFTNGP02.phx.gbl...
I AM able to boot the system using a Windows boot diskette. So what does
that suggest, and what further correction is required to allow booting
without the floppy?
--
Will
- Posted by Pegasus (MVP) on February 14th, 2008
Congratulations! This proves that there is nothing wrong with
Windows and that the problem lies with the boot environment.
I would now launch diskmgmt.msc and make sure that the boot
partition is marked "active".
- Posted by Pegasus (MVP) on February 14th, 2008
"Will" <westes-usc@noemail.nospam> wrote in message
news:fMednVwUeIbpISnanZ2dnUVZ_rSrnZ2d@giganews.com ...
I suspect that the tools you used when converting your damaged
partition altered something that is essential for the boot-up process.
What it is, I have no idea. I can now think of these options:
a) Keep booting off the FDD.
b) Make a bootable CD and keep booting off it.
c) Partition and format a spare disk on some ***other*** machine.
d) Boot your machine with a Bart PE boot CD. Now use robocopy.exe
to copy the old disk to the new disk, then test the new disk. Do NOT
boot the machine with both disks connected!
Instead of performing the robocopy process under a Bart PE boot, you
could perform it while both disks are connected as slaves to some
other machine.
If the new disk works then you could format the system partition on
the old disk and restore its content from the new disk.
By the way, what's happened to the date & time on your posting
computer? A leap into the future?
- Posted by Will on February 14th, 2008
"Pegasus (MVP)" <I.can@fly.com.oz> wrote in message
news:OPig3O1bIHA.6024@TK2MSFTNGP06.phx.gbl...
Partition was definitely active, and I had checked for that per my procedure
posted below.
Other possible causes for the hardware to not be able to bootstrap? How
can I investigate that at a hardware level?
Microsoft should create some special version of a boot floppy (startup
option?) that tells the user how the hardware looks to the BIOS and reports
errors on any mismatch between what it needs to see and what it
actually sees.
You could also report with such a utility the more obvious conditions, like
a partition not marked Active.
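A minimal Python sketch of the kind of check described above, working on the first 512-byte sector of a disk. It relies on the standard MBR layout (four 16-byte partition entries at offset 446, boot signature 0x55AA at offset 510) and on partition type 0x42 being the marker for a dynamic disk. The function names and the synthetic test sector are assumptions for illustration, and this only inspects the partition table, not BIOS geometry.

```python
import struct

SECTOR_SIZE = 512
PARTITION_TABLE_OFFSET = 446   # four 16-byte entries start here
BOOT_SIGNATURE = b"\x55\xaa"   # expected at offset 510
ENTRY_FORMAT = "<B3sB3sII"     # status, CHS start, type, CHS end, LBA start, sector count
TYPE_DYNAMIC = 0x42            # partition type used by dynamic (LDM) disks


def parse_mbr(sector):
    """Decode the partition table from one MBR sector."""
    if len(sector) < SECTOR_SIZE:
        raise ValueError("need at least %d bytes" % SECTOR_SIZE)
    entries = []
    for index in range(4):
        offset = PARTITION_TABLE_OFFSET + 16 * index
        status, _chs_start, ptype, _chs_end, lba_start, sectors = struct.unpack_from(
            ENTRY_FORMAT, sector, offset
        )
        entries.append(
            {"index": index, "status": status, "type": ptype,
             "lba_start": lba_start, "sectors": sectors}
        )
    return {"signature_ok": sector[510:512] == BOOT_SIGNATURE, "entries": entries}


def diagnose(sector):
    """Return a list of human-readable problems; an empty list means none found."""
    mbr = parse_mbr(sector)
    problems = []
    if not mbr["signature_ok"]:
        problems.append("boot signature 0x55AA missing at offset 510")
    used = [e for e in mbr["entries"] if e["type"] != 0x00]
    active = [e for e in used if e["status"] == 0x80]
    if not active:
        problems.append("no partition is marked active")
    elif len(active) > 1:
        problems.append("more than one partition is marked active")
    for e in used:
        if e["status"] not in (0x00, 0x80):
            problems.append("partition %d has invalid status byte 0x%02X" % (e["index"], e["status"]))
        if e["type"] == TYPE_DYNAMIC:
            problems.append("partition %d has type 0x42 (dynamic disk)" % e["index"])
    return problems


def make_sector(entries):
    """Build a synthetic MBR sector from (status, type, lba_start, sectors) tuples."""
    sector = bytearray(SECTOR_SIZE)
    for index, (status, ptype, lba_start, count) in enumerate(entries):
        struct.pack_into(ENTRY_FORMAT, sector, PARTITION_TABLE_OFFSET + 16 * index,
                         status, b"\x00\x00\x00", ptype, b"\x00\x00\x00", lba_start, count)
    sector[510:512] = BOOT_SIGNATURE
    return bytes(sector)


if __name__ == "__main__":
    for problem in diagnose(make_sector([(0x00, TYPE_DYNAMIC, 63, 1000)])):
        print(problem)
```

To run it against real hardware, the sector bytes would have to come from a raw disk read (on Windows, something like opening \\.\PhysicalDrive0 in binary mode with administrator rights and reading 512 bytes), which is left out here on purpose.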
--
Will
- Posted by Will on February 15th, 2008
"Pegasus (MVP)" <I.can@fly.com.oz> wrote in message
news:Osxifn1bIHA.1204@TK2MSFTNGP03.phx.gbl...
Probably this is not the case, because we were getting the "Error Loading
OS" boot time message even before we made any change to the Dynamic boot
volume.
Sounds like fun.
I'm still thinking that somehow the computer BIOS is not finding the drive
it wants or that the geometry somehow doesn't match what it wants to see.
The posting time of 3:10p is approximately when I remember doing the post.
Was I an hour ahead of the clock? Sounds like a daylight saving time error on
the posting computer....
--
Will
- Posted by Will on February 15th, 2008
Not sure if this is part of our problem, but the controller reports a
different SCSI ID ordering of volumes than does the ERD Commander 2005 boot
environment. The boot volume is of course SCSI ID=0 as reported by the
controller. Inside of ERD Commander, it is reported as the highest SCSI
ID. In fact, all of our drives have an inverted SCSI ID ordering as seen
by ERD Commander. I'm not sure how that is even possible if Windows is
simply reporting its drive numbers as a reflection of the hardware drive ID
order.
What would such reordering of drives suggest, and is there a way to force a
reset of that mapping?
--
Will
- Posted by Edwin vMierlo [MVP] on February 15th, 2008
This will not solve your problem, but here is a little advice.
The Dynamic disk implementation in Windows will write an LDM database header
to each and every dynamic disk on the same system. This means that if for
some reason this data (in a private region of the disk) gets corrupted, it
could potentially affect all dynamic disks on your system. Exactly as you
describe in your post.
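A rough Python sketch for checking which disks still carry a recognizable LDM private header. It assumes the layout described in the Linux kernel's LDM driver documentation: on an MBR-partitioned dynamic disk, the primary private header starts at sector 6 with the ASCII signature PRIVHEAD. Treat that offset as an assumption to verify against your own disks. The function names and the fake disk images are hypothetical.

```python
SECTOR_SIZE = 512
PRIVHEAD_SECTOR = 6           # assumption: primary private header location on MBR disks
PRIVHEAD_MAGIC = b"PRIVHEAD"  # signature at the start of the private header


def has_privhead(image, sector=PRIVHEAD_SECTOR):
    """True if the raw bytes contain the PRIVHEAD signature at the given sector."""
    offset = sector * SECTOR_SIZE
    return image[offset:offset + len(PRIVHEAD_MAGIC)] == PRIVHEAD_MAGIC


def survey(disks):
    """Map each disk name to whether its private header signature was found.

    `disks` maps a name to the raw bytes of at least the first 7 sectors.
    """
    return {name: has_privhead(image) for name, image in disks.items()}


def fake_image(with_header):
    """Build a synthetic 8-sector image, optionally with a PRIVHEAD signature."""
    image = bytearray(SECTOR_SIZE * 8)
    if with_header:
        start = PRIVHEAD_SECTOR * SECTOR_SIZE
        image[start:start + len(PRIVHEAD_MAGIC)] = PRIVHEAD_MAGIC
    return bytes(image)


if __name__ == "__main__":
    result = survey({"disk0": fake_image(True), "disk1": fake_image(False)})
    for name in sorted(result):
        print(name, "header found" if result[name] else "header missing or damaged")
```

A signature match only shows that the header is present, not that the database behind it is consistent, so this is a first-pass triage at best.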
Here is my take on dynamic disks :
<opinion>
Practically, there should be only one reason to use dynamic disks, and
that is to create a spanned volume over 2TB. And even this reason is now
less compelling given support for GPT disks, which support partition
sizes over 2TB.
Any other reason would be frowned upon if you have hardware RAID, either
internally or in a SAN.
</opinion>
There is an unfounded misconception about dynamic disk versus basic disk,
and that is "that it has a performance gain". No, people, it does not, and
although "dynamic disk" sounds better than "basic disk", it has nothing to do
with performance.
So, if you are using dynamic disk, for the right reasons, for your "data"
disks, then ensure your boot and system partitions are on basic disk.
Microsoft even recommends this for SAN setups, boot/system/internal-disks on
basic when using dynamic on your SAN.
from this article http://support.microsoft.com/kb/816307 :
"If you decide to use dynamic disks and you have both locally attached
storage (IDE-based storage or Small Computer System Interface [SCSI]-based
storage) and storage that is located on a storage area network (SAN),
consider the following recommendations, depending on your situation:
- Use dynamic disks on only the SAN storage drives and keep the
  locally attached storage as basic disks.
  -or-
- Use basic disks on the SAN storage drives and configure the locally
  attached storage as dynamic disks."
You can apply the same logic to a server with only internal disk
controllers: if your data is on dynamic disks, your boot/system disks should
be basic.
Again, please use dynamic disks for the right (technical) reasons.
HTH,
Edwin.
- Posted by Will on February 17th, 2008
Well, after we had this bad experience with Dynamic Disk corruption taking
out every single one of eight volumes on the system, I certainly understand
why you are not eager to use them for your own applications. But I have a
slightly different take on this. Dynamic disks have been the only way we
have found to make a consistent disk image of the boot volume that does not
corrupt or miss backing up the registry files while they are in use by the
OS. Microsoft (and Veritas, since the technology is OEM from them) have
some kind of very low level notification technology there when you "break" a
dynamic mirror to force the OS to write out consistent versions of the
registry files. I have many times broken a mirror, removed the drive from
the system, and then later used that drive to immediately recover a crashed
boot volume. It's extremely reliable technology, and extremely good at
what it does.
After this experience losing all of our Dynamic volumes, I am keen on the
idea that we should take our dynamic disk backups, then find a way to
disconnect the storage in our script and power off the backed up drives.
That way if we corrupt the dynamic disk structures, the backups are offline
and intact and can be used for immediate recovery while the dynamic disks
are manually recovered.
Using Symantec Storage Foundation for Windows Basic (which is free), you get
some enhancements to Dynamic Disks, most notably the ability to script
mirroring and breaking off of drives. Our nightly backup of the host
that runs virtual machines looks something like this:
1) Resynchronize the dynamic volumes
2) Stop all virtual machines (to guarantee consistent disk images of the
virtual machines)
3) Break off the dynamic volume
4) Restart all VMs
5) Back up the "broken" mirror to tape
Step 3 only takes about 10 minutes for about 300 MB, so our virtual machines
only have about 10 minutes of downtime to ensure consistent images are
available for backup to tape later on.
After this bad experience, I would like to modify the above with:
0) Turn on the drive array with the backup drives programmatically and wait
for it to signal ready
....
6) Flush the broken mirror to disk.
7) Remove the drive letter from the broken mirror.
8) Power off the drive array programmatically.
To do 0) and 8), I need to find a PDU that comes with a command line
interface that works under Windows. I don't suppose you know of one?
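The full sequence, steps 0 through 8, can be sketched as a small Python driver. Every step is passed in as a callable, because the real commands (the Storage Foundation mirror operations, the VM start/stop calls, and whatever CLI the PDU ends up offering) are not known here. All step names are hypothetical placeholders. The one design choice shown is that the VMs are restarted even if breaking the mirror fails, while the array is powered off only after a fully successful run so that a failed run can be inspected.

```python
from typing import Callable, Dict, List

# Step names in the order given in the post; all are placeholder labels.
STEP_NAMES = [
    "power_on_array",        # 0) turn on the backup array and wait for ready
    "resync_mirror",         # 1) resynchronize the dynamic volumes
    "stop_vms",              # 2) stop all virtual machines
    "break_mirror",          # 3) break off the dynamic volume
    "start_vms",             # 4) restart all VMs
    "backup_to_tape",        # 5) back up the broken mirror to tape
    "flush_broken_mirror",   # 6) flush the broken mirror to disk
    "remove_drive_letter",   # 7) remove the drive letter
    "power_off_array",       # 8) power off the array
]


def run_backup(steps: Dict[str, Callable[[], None]], log: List[str]) -> None:
    """Run the nightly sequence, recording each step name in `log` as it starts."""

    def do(name: str) -> None:
        log.append(name)
        steps[name]()

    do("power_on_array")
    do("resync_mirror")
    do("stop_vms")
    try:
        do("break_mirror")
    finally:
        # Bring the VMs back whether or not the break succeeded.
        do("start_vms")
    do("backup_to_tape")
    do("flush_broken_mirror")
    do("remove_drive_letter")
    do("power_off_array")


if __name__ == "__main__":
    log: List[str] = []
    run_backup({name: (lambda: None) for name in STEP_NAMES}, log)
    print(" -> ".join(log))
```

In practice each callable would wrap a subprocess call to the relevant command-line tool and raise on a non-zero exit code, so that a failed resync or break stops the run before the tape backup starts.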
--
Will