Note: This web page was automatically created from a PalmOS "pedit32" memo.

Usenix talk: Next Generation Storage Networking

2005-04-10 Usenix in Anaheim

Session starts, 1:30PM

S10 Next generation storage networking

Jacob Farmer presenting
CTO of Cambridge Computer Services, Inc.


traditional backups doomed: primary storage growing faster than secondary

ease of use
appropriate cost
hardware independence

The storage industry wants more functionality loaded into net switches,
beyond just passing packets. The speaker's not a fan of this approach,
although he's moderated his opinion on this a bit since writing a
published article that was very opposed.

HSM: Hierarchical Storage Management - popularity comes and goes.
CAS: content addressable storage - sounds like looking up a file by its hash

salient distinctions in storage architectures often appear in how they
handle metadata

The questions to ask when a vendor starts talking about virtualization,
are "what are you virtualizing?" and "where are you virtualizing it?"

"near line storage" implies
removable media?  I'm a bit surprised by this.,,sid5_gci944832,00.html
confirms it though - it's on-site removable storage.

NAS: traditional file server, but it may be packaged up in an "easy to use" way

ms windows NAS edition differs from just using your run of the mill
windows version with filesharing, in that there are no client access
licenses involved, but also, when you lose the hardware, you lose the
software too

Netapp: tends to manhandle their buyers on software licenses - they
tend to make you rebuy software licenses frequently, eg on hardware
upgrades and even when you want to sell your netapp box (EG, on E-bay).
Other than that, "they're good".  John W of NACS indicates that they
don't do this though.

supposed to have ease of use
performance, but not really high
low cost of ownership
fault tolerance

You might want to get your backup solution from the same vendor as your
primary storage solution.
NDMP: remote control protocol related to backups?  "necessary evil"
for Netapp-like boxen to do backups in a better way than backing up
via NFS/CIFS/etc.

Single namespace across fileservers is nice

"life cycle management" mostly a matter of buzzword-compliance, but EMC
bought a company that really grokked LCM

A closed-off NAS solution makes backups harder

block-level or volume virtualization
mirroring or striping imply volume management

hardware raid
might have two volume managers, one in the hardware array, one in the
operating system

SAN based disk array - external volume manager

SNIA provides definitions of storage-related terminology; the name stands for Storage Networking Industry Association
tries to be vendor neutral, but ends up focusing on terminology of large,
established vendors

dedicated gigabit backup: san

"SAN" is a topology, not a refrigerator full of disks
great mechanism for filesharing

san: bunch of disks and storage devices all plugged in together

fibre is basically scsi in a star topology, not a bus
same evolution as network: bus to star

some storage vendors will use a software layer to 'make you a customer
for life', vendor goal is to lock you in

fibre uses a persistent 64 bit identifier for each device, analogous to
a mac address

initiator - host

The SCSI protocol is kind of dumb, not much "are you sure-ing" built into it

out of band: different medium, like ethernet just for metadata, often
makes use of a special driver

zone switches: like vlans
layer two

iSCSI: new terminology, no LUN's anymore

SAN failures are almost always due to a factory technician error
low ease of use, high cost, quirky

interoperability: not really
auto provisioning, not truly automatic

SAN: serves blocks
NAS: serves files

SCSI fcp (scsi over fibre) ATA/ NFS CIFS

Traditional SCSI is now referred to as "parallel SCSI"
FC-AL: fibre channel arbitrated loop, SCSI over fibre
SAS: serial attached scsi, very new, may have even started shipping this week
ATA: also known as IDE

parallel SCSI: at high rates, gets harder to make signal timing close
enough together across multiple wires in a parallel cable, hence Serial
Attached SCSI, which is analogous to SATA

parallel vs serial, SCSI vs IDE: no big deal.  What matters is "does it work?"

serial technologies are good for star topologies, more scalable
can add logic at center of star
cabling simpler

SATA1 is the current generation of "SATA"
SATA2 about to start shipping

SAN: multiple hosts can talk to same storage

shared SCSI array - Lecturer thinks of this as a mini-SAN, but the
storage industry wants to call this Direct Attached Storage or DAS

lecturer calls any "SAN" a "DAS"

SAN backup: doesn't really have to be fibre to take advantage of SAN

Examples of serial SCSI:
fibre channel
firewire aka IEEE 1394
serial storage architecture (IBM SSA product)

Some things that are nice about fibre channel:
good error correction
lots of device id's
flexible fanout, much better than PSCSI

Not so good things about fibre channel:
cost coming down, but still high
interoperability is still poor
$5-10k just to get a connection, without any disk

snapshots are possible without SAN

SAS - serial attached SCSI
why do you want SAS?  Because it's point to point, not daisy chaining
really a SAN topology
With PSCSI, you can get a little 8 port SCSI switch, which can fan out
much more flexible at lower costs

SAS can tunnel SATA through SAS, to mix drive types

ATA: means "AT attached"
It's a two-device bus, 1 master, 1 slave
If the master dies, you (may?) lose the slave - if on the same bus

ATA assumes the filesystem above it will map away bad blocks, so ATA
RAID must look for and remap bad blocks, otherwise the RAID is flakey

SATA: nice but overhyped

42x400G == 16 terabytes possible with a single high storage density box
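
The arithmetic behind that figure, as a quick sketch.  It assumes decimal
units (1 TB = 1000 GB) and ignores RAID and filesystem overhead; the
function name is made up for illustration.

```python
def raw_capacity_tb(drive_count: int, drive_size_gb: int) -> float:
    """Raw capacity in decimal terabytes, before RAID or filesystem overhead."""
    return drive_count * drive_size_gb / 1000


if __name__ == "__main__":
    # 42 drives of 400 GB each, as in the note above
    print(raw_capacity_tb(42, 400))
```

So the "16 terabytes" above is a rounded-down raw figure; usable space
after parity and spares would be lower.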

don't back up to different disks in same storage framework - because if
the storage infrastructure dies, you've lost both copies

no active-active controllers
best practice is active-passive

ATA is lighter weight than alternatives (fewer pounds/kilograms), less
subject to vibration due to rotation when assembling your own RAID

recommends keeping drive types the same within a given raid cabinet,
IE not mixing PSCSI with SATA, etc.

if you tape three drives together, the one in the middle will perform
more slowly than the outer drives, due to more vibration

low cost disk helps with:
tape streaming
eliminate or reduce tape usage

ATA attributes:
lower rotation speeds
can wiggle without probs
can power off when unneeded
wider platters
less robust servo mechanisms, which may die

fibre channel disk tends to assume a cooled environment, ATA more tolerant
of heat

RAID-6: can handle a dual failure

a better RAID controller means a faster RAID rebuild after a disk failure,
which in turn means less degraded mode time.

RAID-n: n disk failure ok without data loss

PSCSI: 320 MB/s
fibre channel: 200 MB/s
SATA: 150 MB/s
SA-SCSI: 150 MB/s
PATA: 133 MB/s
...but this isn't the be-all end-all measure of overall performance

fibre channel with a single loop means you have to divide the bandwidth
through that loop by the number of active drives
ATA storage solutions tend to be 1 drive per controller so may outperform
fibre channel with adequate striping!
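
A rough sketch of that comparison, using the interface figures listed
above (200 MB/s for a fibre channel loop, 150 MB/s per SATA link).  These
are link ceilings, not what a spindle actually sustains, and the function
names are invented for the example.

```python
FC_LOOP_MB_S = 200   # fibre channel figure from the notes
SATA_MB_S = 150      # SATA figure from the notes


def fc_loop_per_drive(active_drives: int, loop_mb_s: float = FC_LOOP_MB_S) -> float:
    """All active drives on one loop share that loop's bandwidth."""
    return loop_mb_s / active_drives


def ata_aggregate(drives: int, link_mb_s: float = SATA_MB_S) -> float:
    """One drive per controller: each drive has its own link (upper bound)."""
    return drives * link_mb_s


if __name__ == "__main__":
    for n in (1, 4, 14):
        print(n, "drives:",
              "FC per drive", round(fc_loop_per_drive(n), 1), "MB/s,",
              "ATA aggregate ceiling", ata_aggregate(n), "MB/s")
```

The FC loop's total stays at 200 MB/s however many drives are active,
while the striped ATA ceiling grows with the drive count, which is the
point being made above.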

It's a common belief that ATA is poor for database applications, but in
truth it depends on the storage design

ATA may allow more drives (due to lower cost per drive), and hence more
speed at same cost (DRS: due to striping across more spindles?)

ATA uses 3.5" platters, which means more good spindles, but lower average
spindle performance
fibre channel uses 2.5" platters

QOS based on spindle banding, uses outer disk bands for high speed data
requirements, inner for lower cost stuff

EMC has a storage option that just ignores inner, slower bands (cylinders)

ATA -can- rival fibre channel raid!

E-ATA means "enterprise ATA" which in turn just means "better ATA drives"
western digital sells only ATA drives, no SCSI drives, no fibre channel
drives, so they don't mind shaking up the SCSI/FC markets with E-ATA

The magic is in research and development on storage controllers

content is "more than data", and implies:
data life cycle
data accumulation
long term value of data
history/versioning of data
data redundancy (fault tolerance)

ILM: information lifecycle management
suddenly "ILM" applies to everything
EMC does get ILM though due to the purchase of a company that "got" ILM

HSM has an 8 year cycle where it's popular, then not, then popular, then not...

most HSM solutions today are a "tack on" to an existing filesystem,
and are usually only two tier

HSM good if data can move in both directions.  If you can't pull stuff
back out the same way you put it in, it's not so good.

In HSM, vendors often use "stub files", which represent a file that
has been migrated to slower storage, and points at the real file.
The problem is when you do backups, are the backups making a copy of the
real file, or just the stub file?  It's hard to tell, and you may want
to get your backup solution from the same vendor, if you go with HSM,
to avoid this problem.  DRS inference: Also, if you're not backing up
just the stub files on migrated data, are you going to cause the HSM
solution to do heavy migrating in order to do a backup?
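
A toy model of the stub-file problem, to make the two backup behaviours
concrete.  Everything here (class names, the "arch:" key scheme, the recall
counter) is hypothetical and not taken from any HSM product.

```python
from dataclasses import dataclass


@dataclass
class Stub:
    """Placeholder left on fast storage; points at the migrated file."""
    archive_key: str


class HsmVolume:
    def __init__(self):
        self.online = {}    # path -> bytes, or a Stub once migrated
        self.archive = {}   # archive_key -> bytes on the slower tier
        self.recalls = 0    # how many reads had to go to the slower tier

    def write(self, path, data):
        self.online[path] = data

    def migrate(self, path):
        key = "arch:" + path
        self.archive[key] = self.online[path]
        self.online[path] = Stub(key)

    def read(self, path):
        entry = self.online[path]
        if isinstance(entry, Stub):
            self.recalls += 1
            return self.archive[entry.archive_key]
        return entry


def naive_backup(vol):
    """Reads every file through the filesystem, so each stub causes a recall."""
    return {path: vol.read(path) for path in vol.online}


def stub_aware_backup(vol):
    """Copies stubs as stubs; the archive tier then needs its own protection."""
    return dict(vol.online)


if __name__ == "__main__":
    vol = HsmVolume()
    vol.write("hot.txt", b"recent")
    vol.write("cold.txt", b"old")
    vol.migrate("cold.txt")
    print(stub_aware_backup(vol)["cold.txt"], vol.recalls)
    print(naive_backup(vol)["cold.txt"], vol.recalls)
```

The naive backup gets the real data but pays one recall per migrated file;
the stub-aware backup is cheap but only captures the pointer.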

Backup time may not decrease with HSM solutions, due to dominating term
in backup time equation being hordes of tiny files
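
One way to picture that "backup time equation" is a fixed cost per file
plus streaming time for the bytes.  This is a simplified model, not
something the speaker presented, and every number in the example is an
assumption chosen only to show the shape of the result.

```python
def backup_seconds(file_count, total_bytes, per_file_ms, throughput_bytes_s):
    """Return (per-file term, streaming term) of a rough backup-time model."""
    per_file_term = file_count * per_file_ms / 1000
    streaming_term = total_bytes / throughput_bytes_s
    return per_file_term, streaming_term


if __name__ == "__main__":
    # Assumed: 10 million files, 40 GB total, 5 ms overhead per file, 50 MB/s
    per_file, streaming = backup_seconds(10_000_000, 40_000_000_000, 5, 50_000_000)
    print("per-file overhead:", per_file, "s")
    print("streaming the data:", streaming, "s")
```

With those assumed numbers the per-file term dwarfs the streaming term, so
moving bytes to another tier while leaving the same number of files (or
stubs) to walk barely changes the total.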

Exchange stores some of the most meaningful data in the wrong place,
when you do not delete an email - it's stored in "microsoft access with
a twist"

CAS or "Content Addressable Storage": describes functionality, not how it works
It uses a modified md5; that number becomes the file identifier.  DRS
inference: Clearly they need to be sure that the odds of an accidental
hash collision are astronomically low by using an effective hash with
a large number of bits.
Nice thing about CAS is that it is expandable to the moon easily
Often there'll be a need for a wrapper to make it look like a traditional
filesystem
CAS can do write once filesystems (WORM)
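
A minimal sketch of the idea, assuming an in-memory store.  It uses SHA-256
from Python's hashlib rather than the "modified md5" mentioned above, and
the class and method names are made up for the example.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the hash of the data is its address."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        # Write-once behaviour: identical content maps to the same address,
        # so storing it again changes nothing.
        self._blobs.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        return self._blobs[address]


if __name__ == "__main__":
    store = ContentStore()
    first = store.put(b"quarterly report")
    second = store.put(b"quarterly report")
    print(first == second, len(store._blobs))
    print(store.get(first))
```

Because the address is derived from the content, an object cannot be
modified in place; changed content simply gets a different address, which
is why CAS fits write-once use.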

2005-04-10 3:30PM: Back from our break

Fibre Channel is here to stay
But ethernet can still do some magic

focus on host connections
switching, routing, wide area are all easy and mature with IP SAN's

don't focus on bandwidth

"TCP offload engine" (TOE) means TCP handling is done in hardware,
which should give you pretty much wire speed

There exists no nice SCSI driver for microsoft nt 4

don't rely on a product that you can only get from a single vendor -
they may lag on updating part of their solution, which means higher
costs down the road

CAS is usually a "write once" solution

Netapp does CAS, and published how they did it as a way to beat EMC,
because the industry started doing CAS Netapp's way, due to the good
documentation on how to do it the Netapp way.

iSCSI - lots of big vendors like it
It's an industry standard
EMC, Dell, HP like it

Can be done 3 ways 
1) In software only
2) run iSCSI on a TOE card
3) all SCSI -and- TCP logic on card (SNIC stack)

Netapp poo poo's half hardware and pure hardware solutions - because
they do it all in software :)

10-20 meg's a second
50-80 meg's a second

Speaker feels this is "just something that should be done in hardware"
for enterprise, but pure software is fine for lower end stuff

soon pretty much all operating systems will be able to do iSCSI
iSCSI-HBA - iSCSI "host bus adapter"
Systems will be bootable from SNIC (which is an HBA and often, a network
card too) and iSCSI-HBA

Most folks buying iSCSI are setting up
VLAN's and sometimes CHAP for security
Also, sometimes folks use IPSEC for security with iSCSI
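
For reference, the CHAP exchange boils down to hashing a shared secret
together with a random challenge.  The sketch below follows the MD5 form
from RFC 1994 (identifier, then secret, then challenge); the function names
and the sample secret are illustrative, and a real iSCSI initiator or
target handles this internally.

```python
import hashlib
import hmac
import os


def chap_response(identifier: int, secret: bytes, challenge: bytes) -> bytes:
    """CHAP with MD5: hash of identifier octet + secret + challenge."""
    return hashlib.md5(bytes([identifier]) + secret + challenge).digest()


def verify(identifier: int, secret: bytes, challenge: bytes, response: bytes) -> bool:
    """The authenticator recomputes the hash and compares in constant time."""
    expected = chap_response(identifier, secret, challenge)
    return hmac.compare_digest(expected, response)


if __name__ == "__main__":
    secret = b"example-shared-secret"   # assumed; configured on both ends
    challenge = os.urandom(16)          # sent by the authenticator
    response = chap_response(1, secret, challenge)
    print(verify(1, secret, challenge, response))
    print(verify(1, b"wrong-secret", challenge, response))
```

The secret itself never crosses the wire, only the challenge and the hash.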

iSNS: Internet Storage Name Service.  It pertains to discovery,
management and configuration of iSCSI and Fibre Channel storage.
DRS question: Is this related to SIP?

ethernet and fibre channel are switched differently
fibre channel is like a "vulcan mind meld"

iSCSI storage arrays:
EMC DMX - has iSCSI option
EMC CLARiiON has the option now too

iSCSI (SAN) addon may undermine usability as a (NAS) fileserver

filer may be a bottleneck, since it may end up having to do IP for many clients

iSCSI target software available for Windows, Linux, Mac OS

storage bridges and routers: not much of an issue anymore.  They can
be a bit of a management problem, but their use is still common in
tape applications.

iSCSI storage pools
20 minutes to (set up an?) iSCSI SAN (with some products?)
getting staggering results from some of them

EMC claims Fibre Channel best for almost everything, then iSCSI for
little stuff
lecturer does not agree with EMC on this - believes iSCSI can be very

Fibre Channel cannot have multiple controllers for a given disk.
iSCSI can, which clearly could help throughput
An IP SAN could have 30 controllers or something if you pile up lots of
"little" iSCSI boxes

"storage virtualization": an overhyped phrase, implies block level but
external to host, there exist iSCSI virtual appliances that just serve
up a bunch of disk via iSCSI

You can have iSCSI and Fibre Channel on same platform

blade-oriented solutions available
higher port data

vendors: McData, Brocade

problem with central disk arrays
eventually you outgrow it, and need to migrate
often you're locked in to a single vendor by your initial choice
At some point something will get maxed out

allocation (LUN's)

No frills:
provisioning (dread disk: bepp)


zones in FC switches
hosts, FC switch, disk, disk controllers
as many controllers, disks, host channels, disk channels

add more processors, more ram, etc.

Lsi - same 
Nstore - FC but not feature rich

commodity hardware, better fault tolerance

catastrophic data loss usually due to vendor action

new HBA comes out with new implementation, cheap speed boost!

replication appliance, using zoning in switch, easy migration, sits in
between clients and previous server, until data is moved - so proxy for
a while, then pure server.

virtual storage solution allows more flexible designs, "enterprise arrays"
more rigid

like (Compaq?  Now HP - and may or may not be the same thing now)
Proliant product line from few years ago

volume-level snapshots

falconstor (be sure to leave off the "e")
solid state disk - memory device
no latency
hot spot to solid state
can have fault tolerant solid state

remote replication

dynamic capacity provisioning
get more when needed
allocate up to a given max i assume
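
A small sketch of what "get more when needed, up to a max" could look
like: a volume advertises a maximum size but only draws physical blocks
from a shared pool on first write.  The classes are hypothetical and not
modelled on any vendor's product.

```python
class Pool:
    """Shared physical capacity, counted in blocks."""

    def __init__(self, physical_blocks: int):
        self.free = physical_blocks


class ThinVolume:
    def __init__(self, pool: Pool, max_blocks: int):
        self.pool = pool
        self.max_blocks = max_blocks   # size the host sees
        self.mapped = {}               # virtual block number -> data

    def write(self, block_no: int, data: bytes) -> None:
        if not 0 <= block_no < self.max_blocks:
            raise IndexError("beyond the volume's advertised size")
        if block_no not in self.mapped:
            # First write to this block: take one block from the pool
            if self.pool.free == 0:
                raise RuntimeError("pool exhausted")
            self.pool.free -= 1
        self.mapped[block_no] = data

    def read(self, block_no: int) -> bytes:
        # Never-written blocks read back as zeros and cost nothing
        return self.mapped.get(block_no, b"\0")


if __name__ == "__main__":
    pool = Pool(physical_blocks=10)
    vol = ThinVolume(pool, max_blocks=100)
    vol.write(50, b"x")
    vol.write(50, b"y")   # overwrite: no new allocation
    print(pool.free, len(vol.mapped))
```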

disk virtualization


Lecturer does not like "intelligent" storage switches.  Infostor, April, 2004

What some vendors are selling is an inexpensive switch with linux in it
api's on switch, including virtualization

latency in block level storage can be an issue for iSCSI, not so much for
FC, but hotspotting to solid state disk will often handle iSCSI latency

replication software may sometimes require software on clients
(above-mentioned proxy-to-transition).  Avoid that

Say an array fails
At the top tier of the storage market, if it breaks, it's their
responsibility, not yours
EMC, IBM, Hitachi, more

low cost HP MSA or Proliant

out of band virtualization
aka asymmetrical
data path separate from metadata path
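
A sketch of the split, assuming a host-side driver that asks a metadata
server where a virtual block lives and then reads the device directly.
The class names and the in-memory "devices" are stand-ins for
illustration only.

```python
class MetadataServer:
    """Control path: answers 'where does this virtual block live?'"""

    def __init__(self, mapping):
        self.mapping = mapping   # virtual block -> (device name, physical block)
        self.lookups = 0

    def locate(self, vblock):
        self.lookups += 1
        return self.mapping[vblock]


class HostDriver:
    """Data path: reads go straight to the device, not through the server."""

    def __init__(self, metadata, devices):
        self.metadata = metadata
        self.devices = devices   # device name -> list of blocks
        self.cache = {}          # remembered locations

    def read(self, vblock):
        if vblock not in self.cache:
            self.cache[vblock] = self.metadata.locate(vblock)
        device, pblock = self.cache[vblock]
        return self.devices[device][pblock]


if __name__ == "__main__":
    devices = {"arrayA": [b"a0", b"a1"], "arrayB": [b"b0"]}
    meta = MetadataServer({0: ("arrayA", 1), 1: ("arrayB", 0)})
    host = HostDriver(meta, devices)
    print(host.read(0), host.read(1), host.read(0), meta.lookups)
```

Only the lookups touch the metadata server; the block data never passes
through it.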

don't put a different I/O chain in and expect the original vendor to
support it - but with OOB (out of band), they should, data has same path,
there's just different volume management overtop

monosphere - block level hsm
hottest blocks on solid state, then fc, then something else
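
A guess at how such block-level tiering might be decided: rank blocks by
access count and fill the fastest tier first.  This is only a sketch; the
function, the tier labels and the slot counts are invented, and nothing
here describes how the product actually works.

```python
def place_blocks(access_counts, solid_state_slots, fc_slots):
    """Rank blocks by access count and fill the fastest tier first."""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    placement = {}
    for rank, block in enumerate(ranked):
        if rank < solid_state_slots:
            placement[block] = "solid state"
        elif rank < solid_state_slots + fc_slots:
            placement[block] = "fc"
        else:
            placement[block] = "other"
    return placement


if __name__ == "__main__":
    counts = {0: 5, 1: 900, 2: 40, 3: 300, 4: 1}   # block -> accesses (made up)
    print(place_blocks(counts, solid_state_slots=1, fc_slots=2))
```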

OOB implies host based software driver

replication can also be done with out of band replication, also meaning
not disrupting the preexisting I/O path

Cisco - hook in switch, asymmetric not in host, but in switch.
Good place for it

convergence of SAN and NAS
EMC, Hitachi, IBM are all doing it now, but were reluctant earlier on

Many vendors are saying "we can manage their stuff too"

virtualization tech "throws sand up in the air again".  Free for all again

should be able to have both SAN and NAS in one solution

falconstor: linux based, turnkey

bluearc DRS: "The world's fastest network storage server"

block storage over ip: less compelling in big uses

A 1000-5000 node compute cluster divides the bandwidth greatly with high
end SAN.  Better off with NAS

NAS clusters: multiple fileservers talking to backend SAN, and sending
out to hosts on lan

metadata server could be a bottleneck

Cambridge Computer Services: lecturer's company, and wants to bid on
the DDSS (ESMF storage system)

IBRIX: By a professor out of Yale, who's a personal friend of the
lecturer.  Each IBRIX is a metadata frontend for a bunch of backend
SAN boxes

NFS could be deal breaker

move filesystem logic out to clients, what if clients could have direct
into SAN backend

IBRIX is just metadata routing

enormously scalable

ms DFS - no additional cost, moves around from server to server
AFS - common root, all sorts of caching, not many corporations running
it, many universities are using it though

UNC paths

Netapp does (some of) what ms DFS does

wide area file systems
not true distributed filesystem, caching gateways
WAN fileservers

variety of products that optimize this back and forthing of chatty
protocols like NFS, CIFS, MAPI (DRS: ms exchange protocol).  These
products spoof the original protocol.  Open file on fileserver, tcp
window accel.  Bits cached on both ends 100 x perf boost sometimes. DRS:
Sounds very NoMachine NX-like.

Originally, the intent was that Infiniband would be to PCI as fibre is to SCSI.
It did not quite evolve that way.  PCI-X happened instead, and then
Infiniband host "channel" adapter (not bus).  Vulcan mind meld,
drops stuff right into your memory.  Really high bandwidth: 10-30
gigabit Infiniband common, whereas 10 gigabit ethernet remains uncommon.
Compares well with myrinet.  Super low latency in Infiniband.  Has RDMA or
"remote DMA".  Memory to memory transfer "without I/O".  Infiniband almost
died, but then a couple of vendors got traction.

RDMA NIC card or "RNIC": plug into PCI or PCI-X, gives ultra low latency
ethernet.  $400/card.

iSER: iSCSI Extensions for RDMA.  Patented?

socket extension for rdma for tcp

modules exist to implement protocol emulation from infiniband to Fibre
Channel or gigabit ethernet

costs as much as fibre or myrinet,  but if you need both, it might
be cheaper with infiniband + conversion to FC and GigE.  Provides Low
latency ethernet and fibre channel

The lecturer's company allows him to give storage technology presentations
to universities at no charge.

