Ones I wrote, with this license: This software is owned
by The University of California, Irvine,
and is not distributed under
any version of the GPL. The GPL is a fine series of licenses, but the owner of this software needs it to be distributed under these terms.
Here's the script I wrote
to set up system disk mirroring using SVM. It's called
SVM-root-mirror. I've tested it on two machines so far; both
appear to be working flawlessly now. Both machines run Solaris
9 04/2004.
The script has a bit of an affinity for controller 0, but is
designed to work with other controllers too (except that it won't
set the boot-device in the eeprom for you if your root
filesystems aren't both on controller 0).
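If you do need to set boot-device by hand, something like this should work from the running OS (a hedged example; it assumes the OBP device aliases disk and disk1 point at the two halves of the root mirror, as in the prtpicl output further down):
eeprom boot-device="disk:a disk1:a"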
Here's another script - this one to set
up RAID 5 relatively easily. It's called simply "SVM-RAID-5". It
has just four tunables at the top. Note that this
script will destroy ALL of the data in the slices that you add to
your new RAID 5 volume! Apparently, that's business as usual
in the SVM world. Anyway, ignore the "SRCDISK" nomenclature -
it's only a "source" in the sense that the program will look for
$SRCDISK slices in /etc/vfstab and rewrite them to work with the
new metadevice(s) you're creating. Note that this program can cope
without dup-label, but it works better with it.
Outside efforts
metachk
looks really valuable. It detected errors I wouldn't have
realized I needed to be checking for. This is the crontab entry I use
to run metachk periodically:
0 * * * * OUTPUT="`/dcs/bin/metachk`"; if [ "$OUTPUT" = "" ]; then : ; else echo "$OUTPUT" | /bin/mailx -s 'meter metachk' strombrg@dcs.nac.uci.edu dcs@dcs.nac.uci.edu dmelzer@uci.edu; fi
This will create a 10 megabyte partition called /rm-me. After the
install, umount /rm-me and then comment out the partition in
/etc/vfstab, and use it for your metadb's. 5 meg is supposedly
all that's really necessary, but autoinstall didn't want to create
something smaller than 10 meg - and, when I tried to
create 3 metadb's in a 10M partition, SVM said there wasn't enough
room. Creating 2 metadb's in a 10M partition worked though.
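After the install, the conversion might look something like this (a sketch; c0t0d0s7 is just a placeholder for wherever /rm-me actually landed):
umount /rm-me
vi /etc/vfstab          # comment out the /rm-me line
metadb -a -f -c 2 /dev/dsk/c0t0d0s7     # -f because these are the first replicas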
Anyway, to use the script, just save it in a file, make it
executable, and then fire up your favorite text editor on it. There
are five tunable parameters (shell variables) at the top of the
file that will most likely need to be adjusted for your system.
To simulate a disk failure, I used the following procedure. You can do
the wiping with dd instead, though. Do not do this
with slice 2! It's a special slice that has access to the disk
label, and newfs is special-cased to jump past that
label. Most other programs, like dd or
reblock, will clobber
your disk label if you write to the beginning of slice 2! Note,
though, that if you do, you can usually get your label back
with Sun format's autoconfigure option.
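For the record, recovering the label that way goes roughly like this inside format (a sketch from memory; the disk you pick from the menu will vary):
format                  # pick the damaged disk from the menu
format> type
        ... choose "0. Auto configure" ...
format> label
Anyway, the failure-simulation procedure itself: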
Turn off the disk.
Write something to all relevant partitions through
the filesystem - for example, echo foo > /bar for the root
filesystem. (The efficacy of this is speculative.)
Make sure all partitions on the turned-off disk are marked
down by checking metastat.
Then turn the disk back on, and wipe the partitions, so you
know you're starting over from scratch:
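(A sketch, assuming the failed disk is c0t2d0 and the mirrored slices are s0, s1 and s6, as in the metainit commands below; newfs is used because, as noted above, it skips the disk label:)
newfs /dev/rdsk/c0t2d0s0
newfs /dev/rdsk/c0t2d0s1
newfs /dev/rdsk/c0t2d0s6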
Another (IMO better) alternative is to use my dup-label
program.
On x86 this
is very different. It's simpler if you have a single partition on your system. It's not at all
complicated to copy the fdisk partitions, but then there are Solaris slices within the fdisk partitions,
which I'm guessing may not necessarily be at the beginning of the disk, much as with fdisk "extended"
partitions.
Then create metadb's on the new good disk:
metadb -a -c 2 /dev/dsk/c0t2d0s7
Then metainit the replacement disks into the mirror:
metainit /dev/md/dsk/d20 1 1 /dev/dsk/c0t2d0s0
metainit /dev/md/dsk/d21 1 1 /dev/dsk/c0t2d0s1
metainit /dev/md/dsk/d26 1 1 /dev/dsk/c0t2d0s6
Then add the new submirrors to the mirrors:
metattach d0 d20
metattach d1 d21
metattach d6 d26
After doing these, the new disk will get really busy, well
after the metattach's look like they've completed. I
recommend waiting until it quiesces again before
continuing. You can run metastat to see whether the resyncing is
done yet.
That should be it. Now, if your new disk is
relatively quiet, go ahead and reboot. If you don't have a
disk light on the new disk (or if you're working remotely), try
iostat. Or better, look at metastat - it tells you when
something is "Resyncing". (speculative: It's possible you can
go ahead and reboot before the disks quiet down)
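If you'd rather wait for the resync from a script than eyeball it, a simple Bourne shell loop like this ought to do (a sketch; adjust the sleep interval to taste):
while metastat | grep 'Resync in progress' > /dev/null
do
        sleep 60
done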
Note that Solaris 9 has some sort of UUID-like feature for
disks, so you may have to set that up when actually switching to a
different disk.
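That's presumably SVM's device ID tracking; if you do swap in a physically different disk at the same cXtXdX address, something like the following (a hedged example - c0t2d0 is just a placeholder) updates the recorded device ID:
metadevadm -u c0t2d0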
See also "metareplace"; it appears to be easier in some cases.
You're probably also going to need to clear any lingering metadb
errors - see immediately below where this is covered under the RAID 5
recovery.
Here's what to do if a disk in a RAID 5 volume fails
Check with metastat, to see what component has failed. It'll
probably give you a metareplace command fragment that's a good
starting point for how to recover from the failure.
If you trust the same disk that already failed once, then you can
probably use, for example, "metareplace -e d6 c0t3d0s6".
If you are replacing the disk with an entirely different disk, you
might try (this is 100% untested): "metareplace d6 c0t3d0s6
c0t5d0s6" - the -e flag is only for re-enabling the same component.
Don't forget to clear any error that might've shown up in "metadb"
(run with no arguments). The "W" below is indicative of a metadb
that has a "write error" in it. What we're doing below is running
"metadb" to check the state of the metadb's, then deleting the
one with an error on it, then adding it back, and finally checking
the state of the metadb's again to verify that the error has been
cleared.
bingy-root> metadb
        flags           first blk       block count
     a m  pc luo        16              1034            /dev/dsk/c0t2d0s7
      W   pc l          16              1034            /dev/dsk/c0t3d0s7
     a    pc luo        16              1034            /dev/dsk/c0t4d0s7
bingy-root> metadb -d /dev/dsk/c0t3d0s7
bingy-root> metadb -a /dev/dsk/c0t3d0s7
bingy-root> metadb
        flags           first blk       block count
     a m  pc luo        16              1034            /dev/dsk/c0t2d0s7
     a        u         16              1034            /dev/dsk/c0t3d0s7
     a    pc luo        16              1034            /dev/dsk/c0t4d0s7
Tue Mar 22 18:04:13
bingy-root>
An experiment: What happens if we force a failure of each disk in
a 3-disk RAID 5 in turn: fail a drive, metareplace it back in, then
the next drive, and so on - but without first clearing any write
errors in metadb?
The first disk went without a hitch.
When I failed the second disk, the write error on the first
disk was magically cleared. Also, the second disk was giving
errors when I powered it back up - I couldn't metareplace -e
it, and format couldn't detect the disk's label. So I
rebooted, and then metareplace -e of the second disk worked
fine. Also, the write error was cleared by the time I got back
from lunch...
The third disk... After failing the third disk, I saw:
bingy-root> metastat
d6: RAID
    State: Needs Maintenance
    Invoke: metareplace -f d6 c0t2d0s6 <new device>
    Interlace: 32 blocks
    Size: 4184856 blocks
Original device:
    Size: 4185408 blocks
        Device       Start Block  Dbase  State        Hot Spare
        c0t2d0s6          330     No     Maintenance
        c0t3d0s6          330     No     Okay
        c0t4d0s6          330     No     Last Erred
Wed Mar 23 12:17:59
bingy-root> metadb
        flags           first blk       block count
    W m   pc l          16              1034            /dev/dsk/c0t2d0s7
     a    pc luo        16              1034            /dev/dsk/c0t3d0s7
     a    pc luo        16              1034            /dev/dsk/c0t4d0s7
Wed Mar 23 12:18:07
bingy-root> metareplace -e d6 c0t2d0s6
metareplace: bingy.nac.uci.edu: d6: operation requires -f (force) flag
Wed Mar 23 12:18:26
bingy-root> metareplace -e -f d6 c0t2d0s6
d6: device c0t2d0s6 is enabled
Wed Mar 23 12:18:29
bingy-root> metastat
d6: RAID
    State: Resyncing
    Invoke: metareplace -f d6 c0t4d0s6 <new device>
    Resync in progress: 0% done
    Interlace: 32 blocks
    Size: 4184856 blocks
Original device:
    Size: 4185408 blocks
        Device       Start Block  Dbase  State        Hot Spare
        c0t2d0s6          330     No     Resyncing
        c0t3d0s6          330     No     Okay
        c0t4d0s6          330     No     Last Erred
Wed Mar 23 12:18:33
...and interestingly, I could not get the RAID 5 volume out of "Needs
Maintenance" state by using "metareplace -e d6 c0t2d0s6" or even
"metareplace -f -e d6 c0t2d0s6". It's beginning to sound like
clearing the write error is important.
SVM and hotspares
What to do...
To set up a hot spare pool from SCSI targets 1 and 2 (using their s2 slices):
metainit hsp001 c0t1d0s2 c0t2d0s2
To associate a hot spare pool with a RAID 5:
metaparam -h hsp001 d10
To associate a hot spare pool with two submirrors:
metaparam -h hsp001 d21
metaparam -h hsp001 d22
My testing
I'm starting with disks with the following numbers of (half-K)
sectors, so t2 will be the SRCDISK, t1 and t4 will be in the RAID 5
initially (DSTDISKS), and t5 will be a hot spare.
/dev/rdsk/c0t1d0s2: 17682083 (SOE internal, "disk 5", does
have a disk light, but it's hard to see. It reflects off of
t0 a bit (t0 is on the bottom, t1 is on the top), or you can
lift an Ultra 1's case a bit to see it directly)
/dev/rdsk/c0t2d0s2: 8886314 (lowest DCS external, has a
useful disk light)
/dev/rdsk/c0t3d0s2: (installed but not used, because it's
2G or less, second lowest DCS external)
/dev/rdsk/c0t4d0s2: 71681511 (second from top DCS external,
no useful disk light?)
/dev/rdsk/c0t5d0s2: 35836799 (top DCS external)
Changed values in SVM-RAID-5 to use the new devices on bingy.
Running script... Waiting for RAID 5 to initialize itself
using newly-extended notify-when-up, like:
notify-when-up -f 'metastat | grep "Resync in progress"'
This gives a display of the percentage on the tty, and sends
e-mail and starts up a *dialog window when the Resync is done.
Created hot spare pool using t5, using the instructions above.
Pulling out t1... Expecting t5 to be immediately set up using
parity... though maybe I will have to generate some disk
activity... I've made 0 effort to duplicate the same
partitioning onto t5 as on t1, t2 and t4, but the hot spare
functionality appears to be coping fine. SVM noticed the
problem in less than 5 minutes, but not immediately.
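To confirm the hot spare really kicked in, metastat can be pointed at the pool or at the RAID 5 volume (using the names from this test):
metastat hsp001         # the t5 slice should show up as "In use"
metastat d6             # the failed component should list the hot spare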
This command let me move off of the hot spare, and back to t1:
bingy-root> metareplace -e d6 c0t1d0s6
d6: device c0t1d0s6 is enabled
I'm putting some test load on the RAID 5 volume, as it replaces
the hot spare with t1... Worked fine.
Now I'm trying a run-after, followed by a reboot, to see if
the RAID gets messed up. This was a problem with mirroring.
SVM and soft partitioning (getting more partitions out of a disk)
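The gist (a hedged example; d7 and c0t5d0s0 are placeholders, and 1g is an arbitrary size) is one metainit -p per soft partition:
metainit d7 -p c0t5d0s0 1g      # carve a 1 gig soft partition out of the slice
newfs /dev/md/rdsk/d7           # then use d7 like any other metadevice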
Checking which disk you're actually booting from, when you have an SVM-based system disk mirror:
The easiest way to do this is with the prtpicl command.
prtpicl -v | grep disk
:disk-write-fix
disk-label (disk-label, 3d00000032)
:devfs-path /packages/disk-label
:binding-name disk-label
:_class disk-label
:name disk-label
:bootpath /pci@1c,600000/scsi@2/disk@0,0:a
:boot-device disk:a disk1:a
:diag-device disk0:f
:disk /pci@1c,600000/scsi@2/disk@0,0
:disk0 /pci@1c,600000/scsi@2/disk@0,0
:disk1 /pci@1c,600000/scsi@2/disk@1,0
:disk2 /pci@1c,600000/scsi@2/disk@2,0
:disk3 /pci@1c,600000/scsi@2/disk@3,0
:SlotType disk-slot
disk (fru, 3d000006fe)
:name disk
:SlotType disk-slot
disk (fru, 3d00000701)
:name disk
:SlotType disk-slot
:SlotType disk-slot
Thanks to Pascal Grostabussiat [pascal@azoria.com] for pointing this
out!
SVM and power failures, from a message to sun-managers:
From: David Graves
To: sunmanagers@sunmanagers.org
Subject: SUMMARY: D1000 power failure with Disksuite: how to restore to running state?
Date: Mon, 27 Feb 2006 22:20:16 -0500 (19:20 PST)
Many thanks to all who replied. This story has a happy ending.
In a situation where an array loses power, and the server does not, each
disk that the system attempts to read will fail. In a mirror situation, it
is possible, then, to have multiple read failures. For each failure that is
not fatal (i.e. the server still thinks there's an available mirror or
slice), it marks the disk in 'maintenance' mode. When a read is attempted
from the last available slice with a failed result, then that disk is
placed into the 'last erred' mode.
In a RAID 5 system, only 1 disk will enter the 'maintenance' mode. The next
failed read places the failed disk into the 'last erred' mode, and the
entire metadevice is taken offline. Further attempts at reads result in IO
errors.
As dersmythe pointed out, the Disksuite user manual makes reference to a power
failure like this. The procedure is to use the metareplace command with the
-e switch on the disk in 'maintenance' mode. An example is: metareplace -e
dx cxtxdxsx (replacing the x's with the metadevice and slice that have
failed).
It is important to use this command first on the 'maintenance' disk before
attempting to enable the 'last erred' disk.
I personally ran into a problem with this method: execution of metareplace
-e failed and reported that I must use the -f (force) switch. Feeling
uncomfortable with proceeding, I held off to do more research.
A SECOND method of recovery is available as well: it is possible to CLEAR
the metadevice with the metaclear command, and rebuild it, as long as the
slices/disks are ordered just as they were prior to failure. The order is
revealed with the metastat -p command. The manual recommends using this
method only when the 1st method is unsuccessful. All metadevices (mirrors,
raids, concats, etc) can be rebuilt in this fashion according to the manual.
There are valuable documents on docs.sun.com as well as an article on
sunsolve.sun.com referring to the second method of recovery. Sunsolve
requires a subscription.
As it happens, I employed neither of these procedures (unintentionally). In
my case, a coincidental memory error caused a panic (I'm wondering if this
was due to the original power outage as this machine is quite stable).
Upon reboot, the metadevice was online and the 'last erred' disk was
cleared. I used the metareplace -e command to clear the 'maintenance' disk
and all was well.
Thanks again to Prabir Sarkar, dersmythe, Michael T Pins, and Damian Wiest
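For what it's worth, a sketch of that second (metaclear and rebuild) method, using the 3-disk RAID 5 from my experiments above (d6 on c0t2d0s6, c0t3d0s6 and c0t4d0s6); the -k flag tells metainit the components already contain data, so only use it when the slice order exactly matches what metastat -p recorded:
metastat -p d6 > /var/tmp/d6.md.cf      # record the exact component order first
metaclear d6
metainit d6 -r c0t2d0s6 c0t3d0s6 c0t4d0s6 -k -i 32b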
Fixing a bad boot block in an SVM-mirrored root disk, from a summary on sun-managers by Andy Malato:
It turns out that installboot is the proper method to use to reinstall a
corrupted boot sector even when disksuite is used.
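On SPARC, that boils down to something like the following (assuming the mirror half needing a new boot block is c0t2d0, and that root is on slice 0):
installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c0t2d0s0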