Hello,
ENVIRONMENT:
We have a 2-node Windows 2003 Enterprise SP2 64-bit MS SQL 2005 cluster running two SQL instances. Several LUNs originating from the same SAN are presented to it. Our 3rd party server backup software only backs up the local drives, while the LUNs holding SQL data are backed up using another 3rd party product intended for MS SQL backups.
PROBLEM:
We recently ran chkdsk against each LUN and local disk, and it reported errors on every one of them. Because of the nature of the setup (a cluster), chkdsk was first run in read-only mode, so no specifics were given. After seeing the indication of errors, we ran it with the /f switch in offline mode; for each LUN it did not complete and reported "An unspecified error occurred" with no error code. Partial example output is listed below:
"
CHKDSK is verifying files (stage 1 of 3)...
0 percent complete. (0 of 8272 file records processed)
Deleted corrupt attribute list entry
with type code 128 in file 40.
Deleted corrupt attribute list entry
with type code 128 in file 40.
Deleting corrupt attribute record (128, "")
from file record segment 304.
Deleted corrupt attribute list entry
with type code 128 in file 343.
An unspecified error occurred.
"
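In case it helps anyone compare the output across LUNs, here is a minimal Python sketch that tallies which file records chkdsk touched and whether the run aborted. The regular expressions are assumptions based only on the message wording in the sample above, and the function name summarize_chkdsk is made up for illustration; other chkdsk versions or locales may word these messages differently.

```python
import re
from collections import Counter

# Patterns derived only from the sample output above (an assumption:
# other chkdsk builds or locales may phrase these messages differently).
ATTR_LIST = re.compile(
    r"Deleted corrupt attribute list entry\s+"
    r"with type code (\d+) in file (\d+)\."
)
ATTR_RECORD = re.compile(
    r'Deleting corrupt attribute record \((\d+), "[^"]*"\)\s+'
    r"from file record segment (\d+)\."
)


def summarize_chkdsk(text):
    """Return (hits, aborted).

    hits is a Counter keyed by (type_code, file_record_number);
    aborted is True if the run ended with the unspecified error.
    """
    hits = Counter()
    for pattern in (ATTR_LIST, ATTR_RECORD):
        for type_code, file_no in pattern.findall(text):
            hits[(int(type_code), int(file_no))] += 1
    aborted = "An unspecified error occurred" in text
    return hits, aborted


# The partial output quoted in this post, used as sample input.
SAMPLE = '''CHKDSK is verifying files (stage 1 of 3)...
0 percent complete. (0 of 8272 file records processed)
Deleted corrupt attribute list entry
with type code 128 in file 40.
Deleted corrupt attribute list entry
with type code 128 in file 40.
Deleting corrupt attribute record (128, "")
from file record segment 304.
Deleted corrupt attribute list entry
with type code 128 in file 343.
An unspecified error occurred.
'''

if __name__ == "__main__":
    hits, aborted = summarize_chkdsk(SAMPLE)
    for (type_code, file_no), count in sorted(hits.items()):
        print(f"type {type_code}, file record {file_no}: {count} message(s)")
    print("run aborted:", aborted)
```

Running the same function over the saved output from each LUN would show whether the same file record numbers and type codes recur everywhere, which might hint at a tool-side issue rather than per-volume corruption.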
DISCOVERED SO FAR:
None of the LUN volumes is flagged as dirty (checked with chkntfs). Nothing alarming gets logged to the Event Logs, apart from entries noting that chkdsk was run in read-only mode. The disk space reported on the OS side is valid. Nothing apart from the OS and SQL holds handles on the LUNs (checked with the SysInternals Handle tool). It is worth mentioning that the OS itself never prompted for a chkdsk scan; we only discovered this problem when running chkdsk on demand.
Interestingly, for troubleshooting purposes we set up a brand new LUN and replicated the entire content of one of the troubled LUNs (including permissions and GUID) to it. We then presented this test LUN to a test Windows 2008 based machine and ran chkdsk against it, which returned no errors.
So it seems that the issue is somehow limited to Windows 2003?
Please advise how this can be mitigated, or whether this is a false positive.
Thank you,