Thursday, July 3, 2008

Utilizing Log File for MS SQL Server Data Recovery (Part 1)

In my previous posts, I already explain the method to enable the crashed database to be attached, especially crashed database caused by torn page error.

The second phase (or second part) after this, is to examine the crashed data in terms of pages that is damaged or broken. For example, pages that has torn page error that we already 'fixed' in order that database can be attached.

Please remember that what we do at this phase is only to fix the torn bits so that SQL server do not trip on this error, but the integrity of the data is still questionable. Now the question is, is it possible to correct, or at lease determine the causes of page damage ? In this post, I will describe how it is possible and the restrictions about this issue.

The context of data recovery phase that I am about to explain is applied to MS SQL Server 2000 only, but this knowledge can also be expanded to be applied as a starting point for MS SQL Server 2005.

First, the restrictions. The restrictions I am talking about is by assuming the integrity of the MS SQL Server's LOG data is intact. As we already know, MS SQL Server technology is using the write-ahead method when writing the database, i.e. the system first write to the log file and ensure that the log file (LDF) is written successfully before applied to the actual database (MDF).

Using this assumption, theoretically, by using the database log file it is possible to view the modification that is about to be done/already done to the database seconds before the system crashed.

So, in order to determine which data is valid in the event of database crash, or particularly the page damage, I am relying on the log file as the reference point of operation that is about to be or already done to actual MDF file, seconds before disaster. Or in short, I am relying to the log file as the only valid data to be checked against the data in MDF file.

To present the sample case, I will have some quick review of the steps that I already elaborated on previous post, and then go on to second phase of data integrity checking. In the process of the sample case walkthrough, I am also present some caveat or issue that is emerged, and unless this issue is resolved, I can not go to next step.

Usually, the surfaced issue is about the design of MS SQL Server 2000, for example, the inability to view the log file when the database mode is in emergency mode or bypass mode. I am hoping that, this will be resolved in the next version of MS SQL Server.

First, I am trying to attached the damaged database file (MDF). And I am facing with torn page error :

Server: Msg 823, Level 24, State 2, Line 1
I/O error (torn page) detected during read at offset 0x00000000012000 in file 'c:\Program Files\Microsoft SQL Server\MSSQL\Data\edoc_data.mdf'.


I am recording this offset, so I can examine it later against the database log file. After fixing the tornbits (m_tornbits), and re-attached it :

Server: Msg 1813, Level 16, State 2, Line 1
Could not open new database 'edoc'. CREATE DATABASE is aborted.
Device activation error. The physical file name 'D:\Program Files\Microsoft SQL Server\MSSQL\data\eDoc_Log.LDF' may be incorrect.


This is the first issue that is I am talking about in MS SQL Server 2000. I tripped this error because, the damaged database is located at D: drive, whereas the server that the database I am about to be examine is using C: drive as location for the database. There are two ways to correct this issue, either create the exact location in D: drive in destination server, or using utility such as hex editor to change the drive in MDF file, when we have no D: drive in destination server.

After resolving the above error and re-attach again :

Server: Msg 823, Level 24, State 2, Line 1
I/O error (torn page) detected during read at offset 0x00000004142000 in file 'c:\Program Files\Microsoft SQL Server\MSSQL\Data\edoc_data.mdf'.


This is the second issue pertaining to MS SQL Server 2000. Actually, the server is trying to redone or rollback, depends on transaction condition at the time of crash. But, alas, the page is already damaged and have the torn page checking enabled, this again, will tripped the torn page error, unless something has to be done to fix the m_tornbits value.

Again, I am fixing the m_tornbits value, record the damaged offset (0x4142000) for later examination, and re-attach the database. This time, the database can be successfully attached with the messages below :

604 transactions rolled forward in database 'edoc' (26).
0 transactions rolled back in database 'edoc' (26).


Now, it's time to examine the damaged page(s) using the log file.

DBCC TRACEON(3604)
dbcc log(edoc, 1)


Instead of providing me with the log data before the database is crashed, I have only practically empty log data to be worked with :

00000ec1:0000011a:0001 LOP_BEGIN_CKPT
00000ec1:0000011a:0002 LOP_MODIFY_ROW
00000ec1:0000011b:0001 LOP_END_CKPT


It seems that every time the database is detached, then re-attached, the server is resetting the position of log recording, using the term 'Checkpoint'.

By using option 2 of DBCC LOG command, it is possible to view the previous LSN value :

DBCC TRACEON(3604)
dbcc log(edoc, 2)


For example, the previous LSN for 00000ec1:0000011a:0001 is 00000ec1:00000117:0003. So, is it possible, using DBCC LOG to view the previous LSN ?

DBCC TRACEON(3604)
dbcc log(edoc, 2, 'lsn', '0xec1:117:3')


Nope, it can't. This became the third issue of MS SQL Server 2000. It can only filter the lsn from the last checkpoint onward, as proved by this statement :

DBCC TRACEON(3604)
dbcc log(edoc, 2, 'lsn', '0xec1:11a:2')


It will give :

00000ec1:0000011a:0002
00000ec1:0000011b:0001


This is also to prove that the above syntax to view the previous LSN is not a garbage one, but a valid syntax. The lsn option in DBCC LOG is clearly not intended to be used to view any lsn. So, there is no starting lsn option in DBCC LOG.

Also, because of this kind of checkpointing mechanism, the information about the last checkpoint before the crash is also lost :

DBCC TRACEON(3604)
DBCC PAGE ('edoc', 1, 9, 3)


dbi_checkptLSN
--------------
m_fSeqNo = 3777 m_blockOffset = 282 m_slotId = 1


Converting the above value to hex value yield 0xEC1:11A:1 which is equal to the first record of log above. So, what is the checkpoint value at the time of crash ? Is it the same with previous LSN recorded in the active portion of log shown above ? (i.e. 00000ec1:00000117:0003).

To get those value and to prove it, I am marking the current database to emergency mode :

Sp_configure 'allow updates', 1
Reconfigure with override


Marking the database as emergency mode or bypass mode :

update master..sysdatabases set status = 32768 where name = 'edoc'

After this, I am shutting down the MS SQL Server, re-copy the database from damaged one, and restart the server. This will bypass the recovery and allow me to view the data as it is :

2008-07-03 17:01:50.80 spid8 Bypassing recovery for database 'edoc' because it is marked BYPASS.

DBCC TRACEON(3604)
DBCC PAGE ('edoc', 1, 9, 3)


dbi_checkptLSN
--------------
m_fSeqNo = 3776 m_blockOffset = 108 m_slotId = 6


Again, by converting the above value to hex equals 00000ec0:000006c:0006 which is NOT equals to 00000ec1:00000117:0003.

Now, is it possible, in emergency mode, to view the log from this checkpoint onward ?

DBCC TRACEON(3604)
dbcc log(edoc, 2)


Again, nope, it is impossible :

2008-07-03 16:31:01.98 spid52 Cannot do a dbcc log() on a database in emergency mode

This, if I am not mistaken, this will be the fourth issue in MS SQL Server 2000. In my opinion, the reason behind this error, is that in emergency or bypass mode, the log data is not initialized, and any attempt to view the log file is irrelevant :

DBCC TRACEON(3604)
DBCC DBTABLE(edoc)


LogTruncMgr @0x19FF10E8
-----------------------
m_replLSN = (0:0:0) m_ckptLSN = (0:0:0) m_oldActXact = (0:0:0)
m_backupLSN = (0:0:0) m_oldestBackupXactLSN = (0:0:0)


So, I have the dead end situation. In order to breaktrough this barrier, I have to go back to the previous fixed database, and this will initialized the log, and make the DBCC LOG operation to read the intended LSN (i.e 00000ec0:000006c:0006).

To make this post short, this is done by breaking to the DBCC LOG routine using WinDBG, and changing the in memory value of the current LSN to the LSN to be read. The detail of how this is done will be the subject of my next post.

Before the re-copying to set the database to emergency mode, I already make a backup copy so that I can step back. For stepping back, I reset the database to online mode, shut down the server, re-copy the database of previous step, and restart the server.

After that, firing the WinDBG, attach the sqlservr.exe program, and break into the DBCC LOG, setting the in memory of LSN to be read to 00000ec0:000006c:0006 and voila ! I have the log file for crosschecking to the damaged MDF file.

For example, by converting the error page at offset 0x4142000 to m_pageId = 1:8393, which is in hex value is 1:20A1, I can track the log file, looking for any operation at those page, which is, subject of my next posting.

So, keep in touch for my next two posting :)