[root@esmfsn05 root]# egrep '^DEVICE|^MAILADDR' /etc/mdadm.conf
DEVICE /dev/sd[ab]1
MAILADDR root@esmft1
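With a MAILADDR line in place, a monitoring daemon can be started so that failure events are mailed to that address. A minimal sketch (exact init-script integration varies by distribution):

mdadm --monitor --scan --daemonise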
mdadm --create /dev/md0 --level=stripe --chunk=4096 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm --examine --scan >> /etc/mdadm.conf
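The appended entry should look something like the following (the UUID here is only a placeholder; use the value mdadm reports for your own array):

ARRAY /dev/md0 level=raid0 num-devices=2 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx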
[root@esmfsn05 root]# grep md /etc/fstab
/dev/md0 /metadata ext3 defaults 1 3
mkfs -t ext3 /dev/md0
mount /metadata
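To confirm that the array is running and the filesystem is mounted, something like the following can be used:

cat /proc/mdstat
df -h /metadata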
#!/bin/sh
mdadm --assemble --uuid=9cb23d0b:ca6af799:778dca49:a0a9019c /dev/md0
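If the ARRAY lines are already recorded in /etc/mdadm.conf (as above), an equivalent script could rely on the configuration file instead of hard-coding the UUID:

#!/bin/sh
mdadm --assemble --scan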
--
Dan Stromberg DCS/NACS/UCI <strombrg@dcs.nac.uci.edu>
Hi All,

FYI, the machine platform is a 2xOpteron running Ubuntu Hoary preview (64-bit) with 4GB RAM, the system running off a single IDE drive. The RAID drives are attached to the on-board 4-way Silicon Image SATA controller. The drives are identical 250GB WD SATAs (Model: WDC WD2500JD-00G), each partitioned with 232GB on /dev/sdX1 and 1.8GB on /dev/sdX2 (for parallel swap partitions). I'm using the mdadm suite to set them up and control the RAID:

1 - create the raid:
$ mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 --spare-devices=0 -c128 /dev/sd{a,b,c,d}1
mdadm: layout defaults to left-symmetric
mdadm: /dev/sda1 appears to contain a reiserfs file system size = 242220004K
mdadm: /dev/sdb1 appears to contain a reiserfs file system size = 242220004K
mdadm: /dev/sdc1 appears to contain a reiserfs file system size = 242220004K
mdadm: /dev/sdd1 appears to contain a reiserfs file system size = 242220004K
mdadm: size set to 242219904K
Continue creating array? y
mdadm: array /dev/md0 started.

2 - make sure we monitor it:
$ nohup mdadm --monitor --mail='hjm@tacgi.com' --delay=300 /dev/md0 &

3 - make the reiserfs on md0 (it was made on the individual partitions before, but apparently it needs to be made on the virtual device):
$ mkreiserfs /dev/md0

4 - then mount it:
$ mount -t reiserfs /dev/md0 /r

5 - then admire it:
$ df
Filesystem           1K-blocks      Used Available Use% Mounted on
...
/dev/md0             726637532     32840 726604692   1% /r

So for a RAID5 array, we end up with about 78% of the input space (more than I expected) - the rest is lost to the parity info, which is striped across all the disks, giving the redundancy.

When the RAID initialized, mdadm immediately sent me an email warning of a degraded array. This was not welcome news, but it turns out to be normal: in building the parity checksums, it essentially fakes a dead disk and rebuilds all the parity info. This took about 8 hrs for 1TB; however, the array was available and pretty peppy without waiting for it to finish. And the message did confirm that mdadm was actually monitoring the array.

I immediately tried a few cp's to and from it, and on the 'degraded' array got ~40MB/s to and from the IDE drive on some 100-600MB files. There was not much difference after it finished the parity rebuild - possibly it was deferring the parity calculations until afterwards? If anything it's slightly slower now that the parity info is complete - maybe 38-40MB/s. (This measure includes the sync time - with 4GB of RAM, GB-sized files can be buffered to RAM and so appear to be copied in a few seconds.) On my home 2xPIII system with IDE drives, I only get ~7-8MB/s between drives, so 40MB/s sounds pretty good.

Bonnie++ reports a bunch of confusing numbers, but seems to indicate that, depending on CPU utilization, type of I/O, and size of file, disk I/O will range from ~80MB/s to 24MB/s on the SATA RAID. On my old IDE laptop (but with a newer disk), bonnie returns numbers that are surprisingly good - about 1/3 to 1/4 of the RAID speed. On the 2xPIII home IDE system, bonnie returns numbers that are not much better than the laptop.

So there you have it - Linux software SATA RAID is pretty easy to set up, can be configured to be reasonably informative via email, is pretty cheap (relative to the true HW RAID cards that go for $300-$400 each), and seems to be pretty fast. Long term, I can't say yet.
Also note that this is using an md device without any further wrapping with LVM - we just need a huge data space, and not much is needed in the way of administering different group allocations, etc. I would like to hear others' experiences.
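[For reference: an N-disk RAID5 gives (N-1)/N of the raw space, so 4 disks should yield 75%. From the df output above, 726637532K / (4 x 242219904K) works out to about 0.75, exactly as expected; the ~78% figure quoted appears to come from comparing against the nominal 232GB partition sizes instead.]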
To re-incorporate sda1 into the array, use
NeilBrown
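(The command itself was omitted above; with mdadm, re-adding a disk to a running array is normally done along these lines, using the device names from this example:)

mdadm /dev/md0 --add /dev/sda1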
> When trying to read sectors from a disk and the disk fails the read:
> 1.) Read the data from the other disks in the RAID and
> 2.) Overwrite the sectors where the read error occurs.

Note: this is NOT how the current Linux softraid code works; it's how it's *supposed* to work. Right now, the Linux RAID code kicks a drive out of the array after *any* error (read or write), without trying to "understand" what happened.

/mjt
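When a drive has been kicked out of an array like this, /proc/mdstat shows the member flagged as failed; the output looks roughly like the following (illustrative only, with made-up device names and sizes):

md0 : active raid1 sdb1[1] sda1[0](F)
      242219904 blocks [2/1] [_U]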
I use RAID in 3 layers!
Layer 1 : (RAID5 over 11+1 raw disks in one PC) x4 (4 disk nodes)
Layer 2 : RAID1 of 4x half-nodes (only for the ability to back up complete nodes)
Layer 3 : RAID0 across the 4 nodes -> one big 8TB disk. :-)
I use gnbd for this.
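As a rough sketch of the top layer only (device names here are hypothetical; gnbd imports typically appear under /dev/gnbd/ on the client), the RAID0 across the four nodes would be created with something like:

mdadm --create /dev/md3 --level=0 --raid-devices=4 /dev/gnbd/node1 /dev/gnbd/node2 /dev/gnbd/node3 /dev/gnbd/node4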
gnbd sends only small packets, and I thought too much readahead was the problem, because the whole 8TB array is physically striped across 44 raw disks - but no! When I disabled the readahead, the performance was even worse.
Now I have grown the readahead on the raw disks to 1MB, on the RAID5 to 10MB, on the RAID1 to 1MB, and on the RAID0 to 8MB, and the performance is great now! :-)
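The message does not say which tool was used to change the readahead; one common way to do it per block device is blockdev --setra, which takes the value in 512-byte sectors (device names below are hypothetical):

blockdev --setra 2048 /dev/sda    # 1MB readahead on a raw disk
blockdev --setra 20480 /dev/md1   # 10MB on the RAID5 layer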
You can e-mail the author with questions or comments: