Ones I wrote, with this license: This software is owned
by The University of California, Irvine,
and is not distributed under
any version of the GPL. The GPL is a fine series of licenses, but the owner of this software needs it to be distributed under these terms.
Here's the script I wrote
to set up system disk mirroring using SVM. It's called
SVM-root-mirror. I've tested it on two machines so far; both
appear to be working flawlessly now. Both machines run Solaris
9 04/2004.
The script has a bit of an affinity for controller 0, but is
designed to work with other controllers too (except that it won't
set the boot-device in the eeprom for you if your root
filesystems aren't both on controller 0).
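If you do need to set boot-device by hand, something like this should work from the running OS (a hedged example; it assumes the OBP device aliases disk and disk1 point at the two halves of the root mirror, as in the prtpicl output further down):
eeprom boot-device="disk:a disk1:a"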
Here's another script - this one to set
up RAID 5 relatively easily. It's called simply "SVM-RAID-5". It
has just four tunables at the top. Note that this
script will destroy ALL of the data in the slices that you add to
your new RAID 5 volume! Apparently, that's business as usual
in the SVM world. Anyway, ignore the "SRCDISK" nomenclature -
it's only a "source" in the sense that the program will look for
$SRCDISK slices in /etc/vfstab and rewrite them to work with the
new metadevice(s) you're creating. Note that this program can cope
without dup-label, but it works better with it.
Outside efforts
metachk
looks really valuable. It detected errors I wouldn't have
realized I needed to be checking for. This is the crontab entry I use
to run metachk periodically:
0 * * * * OUTPUT="`/dcs/bin/metachk`"; if [ "$OUTPUT" = "" ]; then : ; else echo "$OUTPUT" | /bin/mailx -s 'meter metachk' strombrg@dcs.nac.uci.edu dcs@dcs.nac.uci.edu dmelzer@uci.edu; fi
This will create a 10 megabyte partition called /rm-me. After the
install, umount /rm-me and then comment out the partition in
/etc/vfstab, and use it for your metadb's. 5 meg is supposedly
all that's really necessary, but autoinstall didn't want to create
something smaller than 10 meg - and, when I tried to
create 3 metadb's in a 10M partition, SVM said there wasn't enough
room. Creating 2 metadb's in a 10M partition worked though.
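After the install, the conversion might look something like this (a sketch; c0t0d0s7 is just a placeholder for wherever /rm-me actually landed):
umount /rm-me
vi /etc/vfstab          # comment out the /rm-me line
metadb -a -f -c 2 /dev/dsk/c0t0d0s7     # -f because these are the first replicas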
Anyway, to use the script, just save it in a file, make it
executable, and then fire up your favorite text editor on it. There
are five tunable parameters (shell variables) at the top of the
file that will most likely need to be adjusted for your system.
To simulate a disk failure, I used the following procedure. You can do
the wiping with dd instead, though. Do not do this
with slice 2! It's a special slice that has access to the disk
label, and newfs is special-cased to jump past that
label. Most other programs, like dd or
reblock, will clobber
your disk label if you write to the beginning of slice 2! Note,
though, that if you do, you can usually get your label back
with Sun format's autoconfigure option.
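For the record, recovering the label that way goes roughly like this inside format (a sketch from memory; the disk you pick from the menu will vary):
format                  # pick the damaged disk from the menu
format> type
        ... choose "0. Auto configure" ...
format> label
Anyway, the failure-simulation procedure itself: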
Turn off the disk.
Write something to all relevant partitions through
the filesystem - for example, echo foo > /bar for the root
filesystem. (The efficacy of this is speculative.)
Make sure all partitions on the turned-off disk are marked
down by checking metastat.
Then turn the disk back on, and wipe the partitions, so you
know you're starting over from scratch:
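(A sketch, assuming the failed disk is c0t2d0 and the mirrored slices are s0, s1 and s6, as in the metainit commands below; newfs is used because, as noted above, it skips the disk label:)
newfs /dev/rdsk/c0t2d0s0
newfs /dev/rdsk/c0t2d0s1
newfs /dev/rdsk/c0t2d0s6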
Another (IMO better) alternative is to use my dup-label
program.
On x86 this
is very different. It's simpler if you have a single partition on your system. It's not at all
complicated to copy the fdisk partitions, but then there are Solaris slices within the fdisk partitions,
which I'm guessing may not necessarily be at the beginning of the disk, much as with fdisk "extended"
partitions.
Then create metadb's on the new good disk:
metadb -a -c 2 /dev/dsk/c0t2d0s7
Then metainit the replacement disks into the mirror:
metainit /dev/md/dsk/d20 1 1 /dev/dsk/c0t2d0s0
metainit /dev/md/dsk/d21 1 1 /dev/dsk/c0t2d0s1
metainit /dev/md/dsk/d26 1 1 /dev/dsk/c0t2d0s6
Then add the new submirrors to the mirrors:
metattach d0 d20
metattach d1 d21
metattach d6 d26
After doing these, the new disk will get really busy, well
after the metattach's look like they've completed. I
recommend waiting until it quiesces again before
continuing. You can run metastat to see whether the resyncing is
done yet.
That should be it. Now, if your new disk is
relatively quiet, go ahead and reboot. If you don't have a
disk light on the new disk (or if you're working remotely), try
iostat. Or better, look at metastat - it tells you when
something is "Resyncing". (speculative: It's possible you can
go ahead and reboot before the disks quiet down)
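If you'd rather wait for the resync from a script than eyeball it, a simple Bourne shell loop like this ought to do (a sketch; adjust the sleep interval to taste):
while metastat | grep 'Resync in progress' > /dev/null
do
        sleep 60
done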
Note that Solaris 9 has some sort of UUID-like feature for
disks, so you may have to set that up when actually switching to a
different disk.
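That's presumably SVM's device ID tracking; if you do swap in a physically different disk at the same cXtXdX address, something like the following (a hedged example - c0t2d0 is just a placeholder) updates the recorded device ID:
metadevadm -u c0t2d0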
See also "metareplace"; it appears to be easier in some cases.
You're probably also going to need to clear any lingering metadb
errors - see immediately below where this is covered under the RAID 5
recovery.
Here's what to do if a disk in a RAID 5 volume fails
Check with metastat, to see what component has failed. It'll
probably give you a metareplace command fragment that's a good
starting point for how to recover from the failure.
If you trust the same disk that already failed once, then you can
probably use, for example, "metareplace -e d6 c0t3d0s6".
If you are replacing the disk with an entirely different disk, you
might try (this is 100% untested): "metareplace d6 c0t3d0s6
c0t5d0s6" - the -e flag is only for re-enabling the same component.
Don't forget to clear any error that might've shown up in "metadb"
(run with no arguments). The "W" below is indicative of a metadb
that has a "write error" in it. What we're doing below is running
"metadb" to check the state of the metadb's, then deleting the
one with an error on it, then adding it back, and finally checking
the state of the metadb's again to verify that the error has been
cleared.
bingy-root> metadb
        flags           first blk       block count
     a m  pc luo        16              1034            /dev/dsk/c0t2d0s7
      W   pc l          16              1034            /dev/dsk/c0t3d0s7
     a    pc luo        16              1034            /dev/dsk/c0t4d0s7
bingy-root> metadb -d /dev/dsk/c0t3d0s7
bingy-root> metadb -a /dev/dsk/c0t3d0s7
bingy-root> metadb
        flags           first blk       block count
     a m  pc luo        16              1034            /dev/dsk/c0t2d0s7
     a        u         16              1034            /dev/dsk/c0t3d0s7
     a    pc luo        16              1034            /dev/dsk/c0t4d0s7
Tue Mar 22 18:04:13
bingy-root>
An experiment: What happens if we force a failure of each disk in
a 3-disk RAID 5 in turn: fail a drive, metareplace it back in, then
the next drive, and so on - but without first clearing any write
errors in metadb?
The first disk went without a hitch.
When I failed the second disk, the write error on the first
disk was magically cleared. Also, the second disk was giving
errors when I powered it back up - I couldn't metareplace -e
it, and format couldn't detect the disk's label. So I
rebooted, and then metareplace -e of the second disk worked
fine. Also, the write error was cleared by the time I got back
from lunch...
The third disk... After failing the third disk, I saw:
bingy-root> metastat
d6: RAID
    State: Needs Maintenance
    Invoke: metareplace -f d6 c0t2d0s6 <new device>
    Interlace: 32 blocks
    Size: 4184856 blocks
Original device:
    Size: 4185408 blocks
        Device       Start Block  Dbase  State        Hot Spare
        c0t2d0s6          330     No     Maintenance
        c0t3d0s6          330     No     Okay
        c0t4d0s6          330     No     Last Erred
Wed Mar 23 12:17:59
bingy-root> metadb
        flags           first blk       block count
    W m   pc l          16              1034            /dev/dsk/c0t2d0s7
     a    pc luo        16              1034            /dev/dsk/c0t3d0s7
     a    pc luo        16              1034            /dev/dsk/c0t4d0s7
Wed Mar 23 12:18:07
bingy-root> metareplace -e d6 c0t2d0s6
metareplace: bingy.nac.uci.edu: d6: operation requires -f (force) flag
Wed Mar 23 12:18:26
bingy-root> metareplace -e -f d6 c0t2d0s6
d6: device c0t2d0s6 is enabled
Wed Mar 23 12:18:29
bingy-root> metastat
d6: RAID
    State: Resyncing
    Invoke: metareplace -f d6 c0t4d0s6 <new device>
    Resync in progress: 0% done
    Interlace: 32 blocks
    Size: 4184856 blocks
Original device:
    Size: 4185408 blocks
        Device       Start Block  Dbase  State        Hot Spare
        c0t2d0s6          330     No     Resyncing
        c0t3d0s6          330     No     Okay
        c0t4d0s6          330     No     Last Erred
Wed Mar 23 12:18:33
...and interestingly, I could not get the RAID 5 volume out of "Needs
Maintenance" state by using "metareplace -e d6 c0t2d0s6" or even
"metareplace -f -e d6 c0t2d0s6". It's beginning to sound like
clearing the write error is important.
SVM and hotspares
What to do...
To set up a hot spare pool from SCSI targets 1 and 2 (using their s2 slices):
metainit hsp001 c0t1d0s2 c0t2d0s2
To associate a hot spare pool with a RAID 5:
metaparam -h hsp001 d10
To associate a hot spare pool with two submirrors:
metaparam -h hsp001 d21
metaparam -h hsp001 d22
My testing
I'm starting with disks with the following numbers of (half-K)
sectors, so t2 will be the SRCDISK, t1 and t4 will be in the RAID 5
initially (DSTDISKS), and t5 will be a hot spare.
/dev/rdsk/c0t1d0s2: 17682083 (SOE internal, "disk 5", does
have a disk light, but it's hard to see. It reflects off of
t0 a bit (t0 is on the bottom, t1 is on the top), or you can
lift an Ultra 1's case a bit to see it directly)
/dev/rdsk/c0t2d0s2: 8886314 (lowest DCS external, has a
useful disk light)
/dev/rdsk/c0t3d0s2: (installed but not used, because it's
2G or less, second lowest DCS external)
/dev/rdsk/c0t4d0s2: 71681511 (second from top DCS external,
no useful disk light?)
/dev/rdsk/c0t5d0s2: 35836799 (top DCS external)
Changed values in SVM-RAID-5 to use the new devices on bingy.
Running script... Waiting for RAID 5 to initialize itself
using newly-extended notify-when-up, like:
notify-when-up -f 'metastat | grep "Resync in progress"'
This gives a display of the percentage on the tty, and sends
e-mail and starts up a *dialog window when the Resync is done.
Created hot spare pool using t5, using the instructions above.
Pulling out t1... Expecting t5 to be immediately set up using
parity... though maybe I will have to generate some disk
activity... I've made 0 effort to duplicate the same
partitioning onto t5 as on t1, t2 and t4, but the hot spare
functionality appears to be coping fine. SVM noticed the
problem in less than 5 minutes, but not immediately.
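To confirm the hot spare really kicked in, metastat can be pointed at the pool or at the RAID 5 volume (using the names from this test):
metastat hsp001         # the t5 slice should show up as "In use"
metastat d6             # the failed component should list the hot spare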
This command let me move off of the hot spare, and back to t1:
bingy-root> metareplace -e d6 c0t1d0s6
d6: device c0t1d0s6 is enabled
I'm putting some test load on the RAID 5 volume, as it replaces
the hot spare with t1... Worked fine.
Now I'm trying a run-after, followed by a reboot, to see if
the RAID gets messed up. This was a problem with mirroring.
SVM and soft partitioning (getting more partitions out of a disk)
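The gist (a hedged example; d7 and c0t5d0s0 are placeholders, and 1g is an arbitrary size) is one metainit -p per soft partition:
metainit d7 -p c0t5d0s0 1g      # carve a 1 gig soft partition out of the slice
newfs /dev/md/rdsk/d7           # then use d7 like any other metadevice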
Checking which disk you're actually booting from, when you have an SVM-based system disk mirror:
The easiest way to do this is with the prtpicl command.
prtpicl -v | grep disk
:disk-write-fix
disk-label (disk-label, 3d00000032)
:devfs-path /packages/disk-label
:binding-name disk-label
:_class disk-label
:name disk-label
:bootpath /pci@1c,600000/scsi@2/disk@0,0:a
:boot-device disk:a disk1:a
:diag-device disk0:f
:disk /pci@1c,600000/scsi@2/disk@0,0
:disk0 /pci@1c,600000/scsi@2/disk@0,0
:disk1 /pci@1c,600000/scsi@2/disk@1,0
:disk2 /pci@1c,600000/scsi@2/disk@2,0
:disk3 /pci@1c,600000/scsi@2/disk@3,0
:SlotType disk-slot
disk (fru, 3d000006fe)
:name disk
:SlotType disk-slot
disk (fru, 3d00000701)
:name disk
:SlotType disk-slot
:SlotType disk-slot
Thanks to Pascal Grostabussiat [pascal@azoria.com] for pointing this
out!
SVM and power failures, from a message to sun-managers:
From: David Graves
To: sunmanagers@sunmanagers.org
Subject: SUMMARY: D1000 power failure with Disksuite: how to restore to running state?
Date: Mon, 27 Feb 2006 22:20:16 -0500 (19:20 PST)
Many thanks to all who replied. This story has a happy ending.
In a situation where an array loses power, and the server does not, each
disk that the system attempts to read will fail. In a mirror situation, it
is possible, then, to have multiple read failures. For each failure that is
not fatal (i.e. the server still thinks there's an available mirror or
slice), it marks the disk in 'maintenance' mode. When a read is attempted
from the last available slice with a failed result, then that disk is
placed into the 'last erred' mode.
In a RAID 5 system, only 1 disk will enter the 'maintenance' mode. The next
failed read places the failed disk into the 'last erred' mode, and the
entire metadevice is taken offline. Further attempts at reads result in IO
errors.
As dersmythe pointed out, the Disksuite user manual makes reference to a power
failure like this. The procedure is to use the metareplace command with the
-e switch on the disk in 'maintenance' mode. An example is: metareplace -e
dx cxtxdxsx (replacing the x's with the metadevice and slice that have
failed).
It is important to use this command first on the 'maintenance' disk before
attempting to enable the 'last erred' disk.
I personally ran into a problem with this method: execution of metareplace
-e failed and reported that I must use the -f (force) switch. Feeling
uncomfortable with proceeding, I held off to do more research.
A SECOND method of recovery is available as well: it is possible to CLEAR
the metadevice with the metaclear command, and rebuild it, as long as the
slices/disks are ordered just as they were prior to failure. The order is
revealed with the metastat -p command. The manual recommends using this
method only when the 1st method is unsuccessful. All metadevices (mirrors,
raids, concats, etc) can be rebuilt in this fashion according to the manual.
There are valuable documents on docs.sun.com as well as an article on
sunsolve.sun.com referring to the second method of recovery. Sunsolve
requires a subscription.
As it happens, I employed neither of these procedures (unintentionally). In
my case, a coincidental memory error caused a panic (I'm wondering if this
was due to the original power outage as this machine is quite stable).
Upon reboot, the metadevice was online and the 'last erred' disk was
cleared. I used the metareplace -e command to clear the 'maintenance' disk
and all was well.
Thanks again to Prabir Sarkar, dersmythe, Michael T Pins, and Damian Wiest
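For what it's worth, a sketch of that second (metaclear and rebuild) method, using the 3-disk RAID 5 from my experiments above (d6 on c0t2d0s6, c0t3d0s6 and c0t4d0s6); the -k flag tells metainit the components already contain data, so only use it when the slice order exactly matches what metastat -p recorded:
metastat -p d6 > /var/tmp/d6.md.cf      # record the exact component order first
metaclear d6
metainit d6 -r c0t2d0s6 c0t3d0s6 c0t4d0s6 -k -i 32b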
Fixing a bad boot block in an SVM-mirrored root disk, from a summary on sun-managers by Andy Malato:
It turns out that installboot is the proper method to use to reinstall a
corrupted boot sector even when disksuite is used.
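On SPARC, that boils down to something like the following (assuming the mirror half needing a new boot block is c0t2d0, and that root is on slice 0):
installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c0t2d0s0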