After bringing a Windows 2003 cluster back online following an unexpected outage today, we found that the file cluster service group wouldn’t come back online. In particular, the disk resource (a separate volume on a SAN) was stuck in the ‘Online Pending’ state, as were all of its dependent resources. Because it was pending, you couldn’t take it offline or move it to another cluster node (not that it would have helped!).
The event log wasn’t much help in identifying the issue, until I came across an entry advising that ‘chkdsk /f’ should be run against the volume on the SAN. Wondering how you can perform a chkdsk on a volume that the system is having problems mounting, I turned to Google and found the following KB article: How to run the “chkdsk /f” command on a shared cluster disk. The article starts to explain how the chkdsk can be performed, but mentions the following interesting point:
“If the dirty bit was previously set, Chkdsk may automatically run and the Physical Disk resource may take awhile to come online. In Windows NT 4.0, you will see a Command Prompt window with Chkdsk running. In Windows 2000, if you open Task Manager you will see Chkdsk running as a process.”
A quick look in Task Manager did indeed reveal the chkdsk process running! Its output was being dumped into a file in c:\windows\cluster\chkdsk……. which is not brilliant to read, although ‘type c:\windows\cluster\chkdsk…’ at the command line made it a bit better to look at! Once the chkdsk had completed (after around 3hrs on our 1.7TB volume!) the disk resource came straight online again!
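If you would rather check from a script than from Task Manager, something like the following rough sketch could do it. It assumes a Python 3 interpreter is available on the node (not a given on a box of this age) and that ‘tasklist /FO CSV /NH’ prints one quoted CSV row per process with the image name first and the PID second. The function names are my own, not anything from the KB article.

```python
import csv
import io
import subprocess


def chkdsk_pids(tasklist_csv):
    """Return the PIDs of any chkdsk.exe rows in `tasklist /FO CSV /NH` output."""
    pids = []
    for row in csv.reader(io.StringIO(tasklist_csv)):
        # Assumed column order: image name, PID, session name, session number, memory.
        if len(row) >= 2 and row[0].strip().lower() == "chkdsk.exe":
            pids.append(row[1].strip())
    return pids


def query_tasklist():
    """Run tasklist on the local Windows node and return its CSV output as text."""
    return subprocess.check_output(
        ["tasklist", "/FO", "CSV", "/NH"], universal_newlines=True
    )
```

On a cluster node, chkdsk_pids(query_tasklist()) should come back as an empty list when no chkdsk is running, and as a list of PID strings when the disk resource is busy checking the volume.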
I believe the chkdsk process could have been killed to bring the volume back online quickly, but as the dirty bit was set, the same thing would most probably happen the next time the disk resource moves nodes.
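To see whether the dirty bit is still set once the volume is mounted on a node, ‘fsutil dirty query’ can be asked directly. Below is a minimal sketch of wrapping that in Python; it assumes the command prints a line ending in “is Dirty” or “is NOT Dirty”, and the drive letter S: in the comment is purely a placeholder for wherever your cluster disk is mounted.

```python
import subprocess


def volume_is_dirty(fsutil_output):
    """Interpret the text printed by `fsutil dirty query <volume>`."""
    text = fsutil_output.strip().lower()
    # Check the negative form first, since "is dirty" alone would not match it
    # but the ordering keeps the intent obvious.
    if "is not dirty" in text:
        return False
    if "is dirty" in text:
        return True
    raise ValueError("unrecognised fsutil output: %r" % fsutil_output)


def query_dirty_bit(volume):
    """Run fsutil against a mounted volume, e.g. query_dirty_bit("S:")."""
    output = subprocess.check_output(
        ["fsutil", "dirty", "query", volume], universal_newlines=True
    )
    return volume_is_dirty(output)
```

If this reports the volume as dirty before a planned failover, it is probably worth allowing time for the chkdsk to run rather than being surprised by another long ‘Online Pending’.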