Note: This web page was automatically created from a PalmOS "pedit32" memo.

Usenix/LISA December 2005, Recovering from Linux hard drive disasters

Theodore Ts'o

Failed hardware (S.M.A.R.T.) and software

Old IDE, prior to UDMA, had no error checking at all

same is true of PCI now

hardware failures are often progressive.  Image backup least likely to
contribute to that.

1K blocksize minimizes data loss, but is slower

fsck, gpart

4500 rpm on a laptop

Disk heads just a few nanometers above the platters

The disk heads are aerodynamically shaped, and air flow raises the heads away
from the platters, while springs push them back

The heads can "crash" or "high write".  A "high write" is too weak
magnetically, and impacts a too-large region of the disk.

Heads park physically against platters, in a landing zone.

The probability of disk failure hits 50% after about 50,000
landing/takeoff cycles (but this is derived from somewhat old data)

In servers, you can improve disk useful life by cutting the number
of landings/takeoffs, but some drives may not be designed to run
continuously.  Eg the old IBM deskstar desktop drives were not good in
servers because they shaved a few pennies off the cost to produce each
drive, which made them work best when used once in a while.  As a result,
when used continuously in servers, spindle motors started failing.

The Apple iPod has the same issue, but it's not due to spin time, it's
due to heat dissipation - or rather the lack of proper heat dissipation.
So despite their firewire interface, iPods are not good for using as
a desktop drive.

A disk head crash can scrape off particles.  Those loose particles
hit the head(s) and cause another crash - iteratively, resulting in a
possible exponential increase in loose particles.

For this reason, some feel it is best to swap out bad disks ASAP

old days, always even number of heads, one for each side of platter.
Now one side of one platter has reference points, for track location
known as "indexing info".

Also used to have constant blocks per cylinder.  Now more bits on outer
tracks (they're bigger)

Some drives store in a series of partial spirals - IE there's a number
of spirals that don't go the full length of the platter like an old
vinyl record would have.

CHS data lies to you today.  It's a pleasant fiction for the sake
of old software.

fsck needs same CHS fantasy as bios.  Different hardware people have
mapped CHS to linear addressing in different ways, but this was a problem
years ago, not so much today.

Low level format.  Does different things on different generations of
drives.  Can do any number of things.  Speaker does not know of any way
of rewriting the indexing information.  "Ask your drive manufacturer"
"Buy a new harddrive"

MS-DOS partitions probably should have been retired about a decade ago.

EFI partitioning scheme by intel was supposed to become The Scheme
but it's not become common due to itanium's failure to catch on in
the marketplace.  But it's sad EFI didn't catch on.

First 512 bytes of a disk are a bit magic.

Hex aa55 at end of 512 - positions 510 and 511, numbered from 0.

LBA is linear addressing, in units of 512 bytes; 512 bytes times 2**32
(2 TiB) is the max size

One of the four primary partitions can contain extended partition info
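
The MBR notes above (magic bytes at 510 and 511, four primary entries, one of which may be an extended partition) can be sketched in code. This is a minimal illustration, not a recovery tool: it assumes the conventional MBR layout (partition table at offset 446, 16 bytes per entry, type byte 0x05/0x0F/0x85 for extended), and the function name `parse_mbr` is made up for this example.

```python
import struct

# Partition type bytes conventionally used for extended partitions.
EXTENDED_TYPES = {0x05, 0x0F, 0x85}

def parse_mbr(sector):
    """Return the non-empty primary partition entries of a 512-byte MBR."""
    if len(sector) != 512:
        raise ValueError("an MBR sector is exactly 512 bytes")
    # Bytes 0x55, 0xAA at offsets 510 and 511 (0xAA55 as a little-endian word).
    if sector[510:512] != b"\x55\xaa":
        raise ValueError("missing 0xAA55 signature")
    entries = []
    for i in range(4):
        raw = sector[446 + 16 * i: 446 + 16 * (i + 1)]
        # status, 3 CHS bytes (skipped), type, 3 CHS bytes (skipped),
        # starting LBA, sector count
        status, ptype, lba_start, num_sectors = struct.unpack("<B3xB3xII", raw)
        if ptype == 0:
            continue  # unused slot
        entries.append({
            "slot": i + 1,
            "bootable": status == 0x80,
            "type": ptype,
            "extended": ptype in EXTENDED_TYPES,
            "lba_start": lba_start,
            "bytes": num_sectors * 512,
        })
    return entries
```

To try it on a real disk, read the first 512 bytes of the device (read-only) and pass them in; the 32-bit sector count is where the 2**32 * 512 byte limit comes from.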

In an extended MBR you use only 1 or 2 partitions - it's a linked list.

Linux supports up to 16 extended partitions, but theoretically it could
go on forever due to it being a linked list.  But that's fragile.

Reiserfs mkfs really bad about making itself obvious to gpart and similar
tools - doesn't clear enough of the disk to be clear.  Heuristic tools
for ID'ing a filesystem type can get confused.

ext3 clears 64K at the beginning and end of the partition.  Not a bad idea.

LVM or hardware RAID probably won't use fat partitioning

EFI/GUID (itanium) is checksummed.  Also, partition data is stored at
both the beginning and end of the disk in case of damage.

EFI allows large partitions, and has a human readable label.

EFI is also much more complicated than FAT/MBR.

A disk GUID is 16 bytes, originally from OSF/1 which has died.

no central registry for part types, just pick a random one, because
there is such a huge number of distinct ones...

72 byte human readable (GUID's?)

"AFAIK" AMD 64 bit architectures use MBR/FAT partitioning

An attendee asks if linuxbios has a different scheme?  Speaker: Probably not.

LVM may be thought of as an extension of partition tables.

some LVM's can migrate data to another disk without downtime, when it's
time to replace a disk.

Snapshots, Copy On Write, useful for fsck, again without downtime.

Physical volumes are usually but not always physical disks

Physical Extents (PE's) are usually 4 megabytes in length.

Physical Extents are then combined into a logical volume.
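
As a rough illustration of how extents combine into a logical volume, the sketch below maps a byte offset inside a logical volume to a physical volume and extent. It is a toy model only: the list-of-pairs representation, the name `locate`, and the device names in the usage note are all invented for this example and do not reflect LVM's on-disk metadata.

```python
PE_SIZE = 4 * 1024 * 1024  # 4 megabytes, the usual extent size per the notes

def locate(lv_extents, byte_offset, pe_size=PE_SIZE):
    """Map a byte offset in a logical volume to (pv_name, pe_index, offset_in_pe).

    lv_extents is a list of (pv_name, pe_index) pairs, one per logical
    extent, in logical order (a hypothetical in-memory representation).
    """
    if byte_offset < 0:
        raise ValueError("negative offset")
    logical_extent, within = divmod(byte_offset, pe_size)
    if logical_extent >= len(lv_extents):
        raise ValueError("offset is past the end of the logical volume")
    pv_name, pe_index = lv_extents[logical_extent]
    return pv_name, pe_index, within
```

For example, with `lv = [("hda2", 10), ("hda2", 11), ("hdb1", 0)]`, an offset in the third extent resolves to the first extent of `hdb1`, which is the sense in which a logical volume can span disks.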

LVM1 in 2.4, by Heinz, removed in 2.5

EVMS is by ibm.  

Now the "device mapper" is in kernel, and then both LVM and EVMS talk
to device mapper from userspace.

LVM2 is used more than EVMS2, but both can work with device mapper.
LVM is more scriptable and easier to understand.  LVM is much like HPUX
volume management.

EVMS2 has a GUI and poor scriptability - it's possible but painful.

One can mix EVMS for GUI, and LVM for command line, but EVMS supports
more on disk formats, which would naturally cause problems when later
using LVM.

In this talk, only discussing non-cluster disk-based filesystems.

FAT is a family of filesystems.

GPT partition scheme on itanium is in a FAT filesystem.

"File Allocation Table"

Blocks are "clusters" in microsoft parlance

FAT uses a linked list of clusters to store files and directories

root directory appears in a fixed location with FAT.

linked list makes random access i/o hard, good for sequential.
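
The linked-list point can be made concrete with a toy model. In the sketch below the FAT is just a Python dict from cluster number to next cluster, and `END_OF_CHAIN` is a made-up marker (real FAT variants use reserved values); it only illustrates why reaching byte N of a file means hopping through every earlier cluster.

```python
END_OF_CHAIN = -1  # hypothetical end marker for this toy model

def cluster_chain(fat, first_cluster):
    """Follow the linked list from first_cluster and return every cluster."""
    chain, seen = [], set()
    cluster = first_cluster
    while cluster != END_OF_CHAIN:
        if cluster in seen:
            raise ValueError("loop in FAT chain (corrupted table)")
        seen.add(cluster)
        chain.append(cluster)
        cluster = fat[cluster]
    return chain

def cluster_for_offset(fat, first_cluster, offset, cluster_size):
    """Find the cluster holding byte `offset`; costs one hop per cluster."""
    cluster = first_cluster
    for _ in range(offset // cluster_size):
        cluster = fat[cluster]
        if cluster == END_OF_CHAIN:
            raise ValueError("offset is past the end of the file")
    return cluster
```

Sequential reads just follow the chain once; a seek to the end of a large file has to walk the whole chain first.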

FAT16 supports at most 64K clusters.

FAT32 supports 2**32 clusters, greater fragmentation.

VFAT is "virtual fat".

For a while a bunch of digital cameras only supported FAT16.

VFAT added long filenames.
VFAT is just FAT with long filenames.  In microsoft literature it's
known as "LFN" for "long filenames".

"VFAT" originally referred to its protected mode implementation, not to
any functional attribute of itself.

IPA, the International Phonetic Alphabet, is the transliteration system used
in Ted's last name.

During break:
My dd hypothesis, due to writeback cache, but can use a sync/fsync or
O_DIRECT or writethrough cache.

After break
ext2 focus now, not other filesystems like xfs, reiserfs.

Inode concept goes back to version 7 unix.

v7 unix used a simple linked list for directories, today best fs's use
btrees or hashes.

32768 * 4K = 128 meg

Backup copies of the filesystem descriptors are kept in block groups whose
numbers are powers of 3, 5 and 7; they are used to figure out where the inode
table is.  The primary is usually at 0, and the first backup at 8192 or 32768
(depending on block size).  Fsck should automatically use the first backup.
Barring that, mkfs -n.  The exponentiation gives a diminishing percentage of
blocks used for backup filesystem descriptors as the size of disks increases.
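
The backup locations can be computed by hand when fsck cannot find them. The sketch below assumes the default layout that mke2fs produces with sparse backups: 8 * blocksize blocks per group, backups in group 1 (the zeroth power) and in groups that are powers of 3, 5 and 7, and a first data block of 1 on 1K-block filesystems (which is why the first backup there lands at 8193 rather than 8192). The function names are invented for this example.

```python
def backup_groups(num_groups):
    """Block groups that hold backup descriptors: 1 and powers of 3, 5, 7."""
    groups = set()
    if num_groups > 1:
        groups.add(1)
    for base in (3, 5, 7):
        g = base
        while g < num_groups:
            groups.add(g)
            g *= base
    return sorted(groups)

def backup_superblock_blocks(block_size, num_groups):
    """Block numbers of the backup superblocks, assuming default geometry."""
    blocks_per_group = 8 * block_size  # one bitmap block covers this many blocks
    first_data_block = 1 if block_size == 1024 else 0
    return [g * blocks_per_group + first_data_block
            for g in backup_groups(num_groups)]

print(backup_superblock_blocks(4096, 10))
```

A block number from this list is what one would hand to e2fsck as an alternate superblock; on a real filesystem, prefer the numbers reported by mkfs -n, since non-default options change the geometry.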

With modern EXT3, inodes can be a number of bytes long that is a power of
2 at or above 128.

Samba 4 has extended attributes, but requires a very new 2.6 kernel.
These attributes are stored in the inodes, which is much faster than
hopping to some other part of the disk to store them.

Samba 4 extended attributes in the inode were implemented by Tridge
(Andrew Tridgell).  EXT3 beat XFS, ReiserFS and others when used this way
(IE SMB/CIFS performance).

The basic 128 byte inode is the standard inode structure
I'm not certain that's true though, because:

strombrg@blee:/tmp$ cat /tmp/t.c
#include <linux/ext3_fs.h>

main()
{
        printf("%d\n", sizeof(struct ext3_inode));
}
strombrg@blee:/tmp$ /tmp/t
132
strombrg@blee:/tmp$

Inodes store the location of data blocks.

In EXT3 there are 12 direct blocks.

Larger files use indirect blocks.  Double indirect.  Triple indirect blocks.

Directories contain a list of names and (initial?) inode pointers (numbers?)

Modern EXT3 can hash directory indexes.  These indexes are stored in what
looks like an empty directory entry.

When using hashed EXT3 directories, you can still boot old kernels.
The older kernel will clear a bit that says the indexing is valid.

Fsck -D will clean up directory hashing if the feature is enabled on the
filesystem, when returning to a newer kernel.

Skipping lots of stuff about specific filesystems.

How to recover lost data?

First you need to figure out why it's messed up.

Then you need to know "what is the lowest level at which things are messed
up".  fsck can do damage if the partition table is bad, for example.

You also need to ask "How important is the data?"  Also "When were backups
last done?"

Finally, create a plan of recovery, to avoid further damage.  Don't just
dive in and try this, then that, then that without thinking about how the
various things you want to attempt might impact each other.

The main levels to consider are:
1) drive
2) raid
3) partition table
4) LVM
5) Filesystem
6) Application

Disk errors have no standard reporting scheme on linux.  IDE is one kind,
SCSI is another.  Also, the messages are intended more for programmers than
system administrators.

"lbasect" is numbered from the beginning of hard disk.  "sector" number is
relative to (the beginning of the partition?)  Both lbasect and sector are in
512 byte blocks.

If you get these errors, it means something is wrong at the hard disk level.

Bad CRC messages look very similar to previous error messages, and indicate
that you may have a problem with the IDE controller cable - it might be loose
or poorly shielded.

SCSI errors are very different from IDE errors.

The SCSI driver identifies disks using "major, minor in hex", not the device
name like in IDE.

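
The "major, minor in hex" form can be turned back into a device name with a little arithmetic. The sketch below assumes the standard Linux numbering for the first SCSI disk major (major 8, 16 minors per disk, minor 0 of each disk being the whole disk); the function name is made up for this example, and other majors are deliberately rejected rather than guessed at.

```python
def decode_scsi_dev(hex_pair):
    """Turn a 'major:minor' pair in hex, e.g. '08:13', into a name like 'sdb3'."""
    major_text, minor_text = hex_pair.split(":")
    major, minor = int(major_text, 16), int(minor_text, 16)
    if major != 8:
        raise ValueError("only SCSI disk major 8 is handled in this sketch")
    if not 0 <= minor <= 255:
        raise ValueError("minor out of range")
    disk, partition = divmod(minor, 16)   # 16 minors are reserved per disk
    name = "sd" + chr(ord("a") + disk)    # disk 0 is sda, disk 1 is sdb, ...
    return name if partition == 0 else name + str(partition)

print(decode_scsi_dev("08:13"))
```

Note that the minor is hex: 0x13 is 19, which is disk 1 (the second disk), partition 3.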
08:13 is 8: scsi, 1 is second disk, 3 is 3rd (?) partition.  I believe 0
will be the whole disk, including the primary partition table and boot
program.

Examine logs, find frequency of occurrence.

Check S.M.A.R.T. reports.

Do complete image backup before starting any recovery attempts.

IDE may yield correctable or uncorrectable errors.  Correctable errors are
from when the ECC data allowed data recovery, transparent to the application,
but it appears in the logs.

If a disk is 2 years old, you might just replace it: It's the familiar
people time vs hardware expense.

dd_rescue

Hard drive recovery services.  Some only do logical data recovery.  Others
will have clean rooms that you probably couldn't use yourself.

Freezer, as discussed during the break.  Put the disk in a freezer bag, and
get it good and cold.  Then plug it into your machine, and start using it.
The more you use it, the faster it'll warm up.  There may be a sweet spot
among the temperatures that allows you to get your data back.

Some people will replace the PCB.  This does not involve a clean room.
However, even if the PCBs look identical, they could have different firmware
versions, which could actually cause further damage, "really, really, really
toast your data".  EG if the two firmware revs use two different ECC codes,
then your data gets really messed up.

If the PCBs are from disks from the same lot ("nearby" serial numbers) and
all from the same hw revs, then you're more likely to be able to swap PCBs to
get back data.

Clean room: open drive, try to clean it up with distilled water.

If your machine room ever gets flooded, and your servers are floating, don't
try to spin up the disks!  Put them in zip lock bags, and keep them wet!
Recovery shops can get a pretty good success rate this way.

If fsck says the filesystem super block is not found, check the partition
table.  fdisk -l.

backing up MBR with dd only does first 4 partitions.

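
The advice above about taking a complete image before any recovery attempt, and the mention of dd_rescue, suggests the following sketch of the underlying idea: copy block by block and keep going past read errors instead of aborting. This is not dd_rescue's actual algorithm; the function name, the `reader` hook (used here only to simulate a bad block), and the zero-fill policy are assumptions made for illustration.

```python
import os
import tempfile

def rescue_copy(src_path, dst_path, block_size=4096, reader=os.pread):
    """Copy src to dst block by block; unreadable blocks become zeros.

    Returns the list of offsets that could not be read.
    """
    bad_offsets = []
    fd = os.open(src_path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)
        with open(dst_path, "wb") as out:
            offset = 0
            while offset < size:
                want = min(block_size, size - offset)
                try:
                    data = reader(fd, want, offset)
                except OSError:
                    data = b""
                    bad_offsets.append(offset)
                # Pad so every later block stays at its original offset.
                out.write(data.ljust(want, b"\0"))
                offset += want
    finally:
        os.close(fd)
    return bad_offsets

# Demo on an ordinary temporary file standing in for a failing disk.
def flaky_read(fd, length, offset):
    if offset == 4096:
        raise OSError("simulated unreadable block")
    return os.pread(fd, length, offset)

original = bytes(range(256)) * 40   # 10240 bytes of test data
with tempfile.TemporaryDirectory() as tmp:
    src, dst = os.path.join(tmp, "disk.img"), os.path.join(tmp, "copy.img")
    with open(src, "wb") as f:
        f.write(original)
    bad = rescue_copy(src, dst, reader=flaky_read)
    with open(dst, "rb") as f:
        copied = f.read()
print("unreadable offsets:", bad)
```

Keeping the image the same length as the source matters: partition tables and filesystem metadata refer to absolute offsets, so later recovery work on the image only makes sense if nothing has shifted.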
LILO likes to put backup MBRs in /boot

gpart searches a hard drive and tries to guess the partition table.  It can
write the guessed partition table directly to disk, if you're feeling brave.
It can also keep a backup of the preexisting partition table at the same
time: -b -W.

As far as LVM problems, there aren't that many.  The LVM designer was good
about keeping backup copies of things.

vgcfgrestore

Don't worry about lvm metadata corruption, there's always a backup.

LVM is not a substitute for RAID.

Data in LVM without RAID - if a disk goes bad, you can recover from the loss
of one or more physical volumes.  You need lvm2 to do this.

vgchange --partial -ay

lvm2 requires device mapper

You =can= upgrade lvm1 to lvm2 to recover lvm1 data, and there may be a
backport of LVM2 to 2.4.x Linux.

vgreduce --removemissing vg_name

Any logical volumes will be deleted.  (Eh?  We don't want our logical volumes
toasted!)

Recovers what data you can, when some disks gone, some disks still good

You'll probably want to link /dev/ioerror to /dev/null beforehand.

Filesystem corruption

lost+found can lead to security problems if it is world readable.

ext3-fs error

tune2fs -e continue|remount-ro|panic /dev/hda1 - reactions to filesystem
problems

The journal will take care of "data loss" due to panic.  That's true with
EXT3, XFS (but see below), ReiserFS

e2fsck is your primary recovery tool.  Strongly prioritizes not toasting
additional data.  You should be able to answer yes always, but no and debugfs
can be better sometimes, in uncommon situations.

The errors on the "Filesystem corruption" slide are progressively more dire.

locatedb / locate can be used to find where a file was

we should cron lost+found file detection

backup locatedb?

Files harder than directories: use file, look in file

BSD deletes files until there is no more than one file claiming a given
block.  Better is to fork the block, so one file is kept intact, the other is
nearly intact.

Corrupted inode table "nastiest".
Usually due to bad hardware, but unclean shutdowns can do it too.

Voltage gradually goes to 0 during a powerdown.  Memory (RAM) is much more
sensitive to voltage drops than other parts of the system.  Memory can be
messed up but DMA works and the disk controller writes fine, so garbage goes
to disk "well".  Capacitors' discharge yields the gradual voltage drops.

Non-PC hardware may shut down DMA ASAP to protect against this.

EXT3 relatively safe, due to its journal.

EXT2 had more corrupted inode tables, due to lack of journaling.

Logical journalling (like xfs) is faster, but not as safe as EXT3
journalling, due to substantial shortcutting in the xfs journal.  IOW, xfs
can still have inode table problems from an incautious poweroff.  This was
not a problem with IRIX due to the hardware and OS taking steps to protect
against this.  BUT this is troublesome on linux.  Speaker recommends using
a UPS if you use xfs on x86 hardware.

If inode table really is trashed... then you don't know where data blocks
are.

Best is to have something like the MacOS trashcan - IE, don't delete, mv to
a magic directory for a while, then delete.

Made to go away in EXT3, worked better in EXT2: lsdel, find deleted files,
finds files and deletion time, under debugfs.  Does not work in EXT3.  Num
blocks to recover always 0 in EXT3 today, but someday they may fix that.
It's related to how EXT3 journals

grep -ab regex /dev/hda1, round up to 1K block offsets

EXT2/3 tries to store files contiguously, which helps with this kind of file
recovery.

backups!

Auto cool laptops can notice when plugged in on a particular subnet and try
to initiate a backup of themselves.  cron a check for subnet at 3AM PRIOR
to backup

S.M.A.R.T.: "self monitoring analysis and reporting technology".

Do a restore once in a while!  Eg multivol could yield a serious problem you
wouldn't notice until much later.

0-255, units?  Who knows?  Scaled to 0-255.  Low bad, high good.

Check number of spare bad blocks, among other things.
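
The scaled values described above are what smartctl -A prints in its VALUE, WORST and THRESH columns, so a monitoring script mostly needs to compare VALUE against THRESH. The sketch below parses text in that column layout and lists attributes at or below their threshold. The sample table is made-up illustrative data, not output from a real drive, and the function name is invented for this example.

```python
# Made-up sample in the column layout of `smartctl -A` (illustrative only).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  4 Start_Stop_Count        0x0032   098   098   020    Old_age   Always       -       2512
  5 Reallocated_Sector_Ct   0x0033   030   030   036    Pre-fail  Always   FAILING_NOW 1890
194 Temperature_Celsius     0x0022   034   040   000    Old_age   Always       -       34
"""

def parse_attributes(text):
    """Return a list of dicts, one per attribute row in the table."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),   # normalized: low is bad, high is good
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": fields[9],
        })
    return rows

def at_risk(rows):
    """Names of attributes whose normalized value is at or below threshold."""
    return [r["name"] for r in rows
            if r["thresh"] > 0 and r["value"] <= r["thresh"]]

attributes = parse_attributes(SAMPLE)
print(at_risk(attributes))
```

Attributes with a threshold of 0 (such as the temperature row in the sample) are skipped by `at_risk`, since a zero threshold can never be crossed.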
If the number of spare bad blocks gets too low, then the drive is likely to
have problems soon.

Hard drives engineered to have only so many stop/start cycles.

Temperature_Celsius likely not scaled - but it's an exception to the general
rule.

smartctl -H /dev/hda
smartctl -A /dev/hda

smartd daemon can email you, distribution may have it set up already.

/etc/smartd.conf

-m in smartd.conf is the option for setting up automated e-mailing.

doing checksum tests

integrit: md5 or tripwire

Can also detect accidental corruption, can use CRC for this, faster than MD5.
cfv command for example, in python.

mtime is same, checksum has changed, that's a problem.  Ted hasn't seen a
program that does this yet, but wants to.

e2image: backing up inode data, or more generally metadata

e2image /dev/hda3 /some/file - saves filesystem metadata to a file

dumpe2fs can look at it

This requires a modern e2fsprogs, 1.36 or later.

e2debugfs -d dev -i foo.e2i ...

Can ls, cat...

e2debugfs can operate on "loopback" filesystems.

Safer unmounted.  LVM snapshot also safer than a live filesystem.

.e2i extension on e2image outputs.

debugfs test.img

root fs still trashed

ls <inum>
cat <inum>
dump <inum> new.file

or -I option to e2image, then fsck

everything that was in the root directory is now a dir or file in lost+found

effectively recovers from a mkfs

can take a couple of minutes, will consume a lot of disk bandwidth

Hardware RAID

Software RAID

tradeoffs between the two

One not necessarily always better than the other

Hardware RAID may have battery backed-up journal

Software RAID can RAID across multiple controllers (controller types)

"in general you get what you pay for"

A "logging filesystem" (as opposed to the journalled filesystems discussed
above) uses the entire disk as a big journal for storing data in.  In-use
blocks are never overwritten - they are instead replaced by new blocks in
the journal.  This makes writing very fast, because of minimal head motion,
but reads tend to be pretty slow.

Some logging filesystem implementations will cache frequently needed parts
of the filesystem in a huge amount of memory for performance.

The "log cleaner" coalesces free space by copying used blocks to the head
of the log - IOW, it defragments the filesystem.  Defragmenting can load up
the I/O subsystem.
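
A toy model may help picture the logging filesystem idea. In the sketch below, writes only ever append to a log, an in-memory index remembers where the live copy of each block is, and a cleaner drops stale copies. It is a deliberately simplified illustration: the class and its methods are invented for this example, and the cleaner here rebuilds a fresh log rather than copying live blocks to the head of the existing one as described above.

```python
class LogStore:
    """Toy append-only block store in the style of a logging filesystem."""

    def __init__(self):
        self.log = []     # append-only list of (block_no, data) records
        self.index = {}   # block_no -> position of the live copy in the log

    def write(self, block_no, data):
        # Never overwrite in place: append a new copy and repoint the index.
        self.index[block_no] = len(self.log)
        self.log.append((block_no, data))

    def read(self, block_no):
        return self.log[self.index[block_no]][1]

    def clean(self):
        """Log cleaner: keep only live copies, reclaiming stale space."""
        new_log, new_index = [], {}
        for position, (block_no, data) in enumerate(self.log):
            if self.index.get(block_no) == position:
                new_index[block_no] = len(new_log)
                new_log.append((block_no, data))
        self.log, self.index = new_log, new_index


store = LogStore()
store.write(1, b"first version")
store.write(2, b"other block")
store.write(1, b"second version")   # the old copy of block 1 is now stale
stale_before = len(store.log) - len(store.index)
store.clean()
print(stale_before, len(store.log))
```

The index is what makes reads workable at all: without it (or a cache of it in memory), finding the current copy of a block would mean scanning the log.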
