2005-04-10, Usenix in Anaheim. Session starts 1:30 PM.
S10: Next Generation Storage Networking
Jacob Farmer presenting, CTO of Cambridge Computer Services, Inc. jacobf@

Traditional backups are doomed: primary storage is growing faster than secondary storage.
What we want: ease of use, appropriate cost, hardware independence.

The storage industry wants more functionality loaded into network switches, beyond just passing packets. The speaker is not a fan of this approach, although he has moderated his opinion a bit since writing a published article that was strongly opposed.

HSM: Hierarchical Storage Management - popularity comes and goes.
CAS: content addressable storage - sounds like looking up a file by its hash. (A tiny sketch of that idea follows this block of notes.)
Salient distinctions in storage architectures often appear in how they handle metadata.
The questions to ask when a vendor starts talking about virtualization are "what are you virtualizing?" and "where are you virtualizing it?"

"Near-line storage" implies removable media? I'm a bit surprised by this. http://searchstorage.techtarget.com/sDefinition/0,,sid5_gci944832,00.html confirms it though - it's on-site removable storage.

NAS: a traditional file server, but it may be packaged up in an "easy to use" way.
MS Windows NAS edition differs from a run-of-the-mill Windows version with file sharing in that there are no client access licenses involved, but also, when you lose the hardware, you lose the software too.
NetApp tends to manhandle its buyers on software licenses - they make you rebuy licenses frequently, e.g. on hardware upgrades and even when you want to sell your NetApp box (e.g. on eBay). Other than that, "they're good". John W of NACS indicates that they don't do this, though.
NAS is supposed to offer: ease of use; performance (but not really high); low cost of ownership; fault tolerance; ...
You might want to get your backup solution from the same vendor as your primary storage solution.
NDMP: remote control protocol related to backups? A "necessary evil" for NetApp-like boxen to do backups in a better way than backing up via NFS/CIFS/etc.
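The notes above only say that CAS looks files up by a hash of their contents; here is a minimal Python sketch of that idea. The class, the method names, and the choice of SHA-256 are my own illustration (the lecture only credits a "modified MD5" to EMC's product), not anything a vendor actually ships.

import hashlib

class ContentStore:
    """Toy content-addressable store: a blob's address is a hash of its bytes."""

    def __init__(self):
        self._blobs = {}  # address (hex digest) -> content

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        self._blobs[address] = data  # identical content always lands at the same address
        return address

    def get(self, address: str) -> bytes:
        return self._blobs[address]

store = ContentStore()
addr = store.put(b"quarterly report, final version")
assert store.get(addr) == b"quarterly report, final version"
# Storing the same bytes again yields the same address - natural deduplication - and
# the address can never change, which is why CAS lends itself to write-once (WORM) use.
assert store.put(b"quarterly report, final version") == addr

This only works if the hash is wide enough that accidental collisions are astronomically unlikely, which is the same point the DRS inference makes later in these notes.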
Single namespace across fileservers is nice.
"Life cycle management" is mostly a matter of buzzword compliance, but EMC bought a company that really grokked LCM.
A closed-off NAS solution makes backups harder.

Block-level or volume virtualization: mirroring or striping implies volume management. With hardware RAID you might have two volume managers, one in the hardware array and one in the operating system. A SAN-based disk array means an external volume manager.

SNIA provides definitions of storage-related terminology: www.snia.org (Storage Networking Industry Association). It tries to be vendor neutral, but ends up focusing on the terminology of large, established vendors.

A dedicated gigabit backup network: that's a SAN.
"SAN" is a topology, not a refrigerator full of disks. Great mechanism for filesharing. A SAN is a bunch of disks and storage devices all plugged in together.
Fibre is basically SCSI in a star topology rather than a bus - the same evolution networking went through, bus to star.
Some storage vendors will use a software layer to "make you a customer for life"; the vendor's goal is lock-in.
Fibre Channel uses a persistent 64-bit identifier for each device, analogous to a MAC address.
Initiator = host; target = storage device.
The SCSI protocol is kind of dumb; there's not much "are you sure"-ing built into it.
Out of band: a different medium, like Ethernet just for metadata; often makes use of a special driver.
Zoning in FC switches: like VLANs, at layer two.
iSCSI: new terminology; no LUNs anymore.
SAN failures are almost always due to a factory technician error.
Low ease of use, high cost, quirky interoperability; provisioning is not really automatic.

SAN serves blocks: SCSI, FCP (SCSI over fibre), ATA. NAS serves files: NFS, CIFS.

Traditional SCSI is now referred to as "parallel SCSI".
FC-AL: Fibre Channel Arbitrated Loop - SCSI over fibre.
SAS: Serial Attached SCSI - very new, may have even started shipping this week.
ATA: also known as IDE. Also SATA, iSCSI, InfiniBand.
Parallel SCSI: at high rates it gets harder to keep signal timing close enough across the multiple wires in a parallel cable - hence Serial Attached SCSI, which is analogous to SATA. (Rough arithmetic on this after these notes.)
Parallel vs serial, SCSI vs IDE: no big deal. What matters is "does it work?"
Serial technologies are good for star topologies and more scalable: you can add logic at the center of the star, and cabling is simpler.
SATA 1 is the current generation of "SATA"; SATA 2 is about to start shipping.

SAN: multiple hosts can talk to the same storage.
Shared SCSI array: the lecturer thinks of this as a mini-SAN, but the storage industry wants to call it Direct Attached Storage, or DAS. The lecturer calls any "SAN" a "DAS".
SAN backup: it doesn't really have to be fibre to take advantage of a SAN.

Examples of serial SCSI: Fibre Channel; FireWire, aka IEEE 1394; Serial Storage Architecture (IBM's SSA product).

Nice things about Fibre Channel: good error correction, lots of device IDs, flexible fanout - much better than parallel SCSI.
Not so good about Fibre Channel: cost is coming down but still high; interoperability is still poor; $5-10k just to get a connection, without any disk.
Snapshots are possible without a SAN.

SAS - Serial Attached SCSI. Why do you want SAS? Because it's point to point, not daisy-chaining - really a SAN topology. With parallel SCSI you can get a little 8-port SCSI switch, which can fan out - much more flexible at lower cost. SAS can tunnel SATA through SAS, to mix drive types.

ATA means "AT Attachment". It's a two-device bus, 1 master, 1 slave. If the master dies, you may lose the slave too, if it's on the same bus.
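Rough arithmetic behind that "signal timing across multiple wires" point - my own back-of-the-envelope figures, using the Ultra-320 numbers quoted later in these notes, not anything the lecturer worked through:

# Ultra-320 parallel SCSI moves 320 MB/s across a 16-bit-wide bus.
bus_width_bytes = 2
throughput_bytes_per_s = 320e6
transfers_per_s = throughput_bytes_per_s / bus_width_bytes  # 160 million transfers/s
window_ns = 1e9 / transfers_per_s                            # ~6.25 ns per transfer
print(f"{window_ns:.2f} ns per parallel transfer")
# All 16 data lines (plus parity) must arrive within a fraction of that ~6 ns window,
# so the cable-skew budget shrinks as rates climb. A serial link (SAS, SATA, FC)
# clocks a single lane and sidesteps the multi-wire skew problem entirely.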
ATA assumes the filesystem above it will map away bad blocks, so an ATA RAID must look for and remap bad blocks itself; otherwise the RAID is flaky.
SATA: nice but overhyped. 42 x 400 GB drives, roughly 16 terabytes, is possible in a single high-storage-density box.
Don't back up to different disks in the same storage framework - if the storage infrastructure dies, you've lost both copies.
No active-active controllers; best practice is active-passive.
ATA drives are lighter weight than the alternatives (fewer pounds/kilograms) and less subject to vibration from rotation.
When assembling your own RAID, he recommends keeping drive types the same within a given RAID cabinet, i.e. not mixing parallel SCSI with SATA, etc. If you tape three drives together, the one in the middle will perform more slowly than the outer drives, due to more vibration.
Low-cost disk helps with: tape streaming; eliminating or reducing tape usage; ...
ATA attributes: lower rotation speeds; can wiggle without problems; can power off when unneeded; wider platters; less robust servo mechanisms, which may die. Fibre Channel disk tends to assume a cooled environment; ATA is more tolerant of heat.
RAID-6: can handle a dual failure. A better RAID controller means a faster rebuild after a disk failure, which in turn means less time in degraded mode. RAID-n: n disk failures OK without data loss.
Interface bandwidths: parallel SCSI 320 MB/s; Fibre Channel 200 MB/s; SATA 150 MB/s; SAS 150 MB/s; PATA 133 MB/s. ...but this isn't the be-all, end-all measure of overall performance.
Fibre Channel with a single loop means you have to divide the bandwidth through that loop by the number of active drives. ATA storage solutions tend to be one drive per controller, so they may outperform Fibre Channel with adequate striping!
It's a common belief that ATA is poor for database applications, but in truth it depends on the storage design. ATA may allow more drives (due to lower cost per drive), and hence more speed at the same cost (DRS: due to striping across more spindles?).
ATA uses 3.5" platters, which means more good spindles but lower average spindle performance; Fibre Channel uses 2.5" platters.
QoS based on spindle banding: use the outer disk bands for high-speed data requirements, the inner bands for lower-cost stuff. EMC has a storage option that just ignores the inner, slower bands (cylinders).
ATA -can- rival Fibre Channel RAID!
E-ATA means "enterprise ATA", which in turn just means "better ATA drives". Western Digital sells only ATA drives - no SCSI, no Fibre Channel - so they don't mind shaking up the SCSI/FC markets with E-ATA. The magic is in research and development on storage controllers.

Content is "more than data", and implies: data life cycle; data accumulation; long-term value of data; history/versioning of data; data redundancy (fault tolerance).
ILM: information lifecycle management. Suddenly "ILM" applies to everything. EMC does get ILM, though, due to the purchase of a company that "got" ILM.
HSM has an 8-year cycle where it's popular, then not, then popular, then not...
Most HSM solutions today are a "tack on" to an existing filesystem, and are usually only two-tier. HSM is good if data can move in both directions; if you can't pull stuff back out the same way you put it in, it's not so good. www.archivebuilders.com
In HSM, vendors often use "stub files", which represent a file that has been migrated to slower storage and point at the real file. The problem is, when you do backups, are the backups making a copy of the real file, or just the stub file? It's hard to tell, and you may want to get your backup solution from the same vendor if you go with HSM, to avoid this problem. (Toy illustration of the stub problem after these notes.)
DRS inference: Also, if you're not backing up just the stub files on migrated data, are you going to cause the HSM solution to do heavy migrating in order to do a backup?
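A toy illustration of that stub-file trap - entirely my own sketch; real HSM products use filesystem reparse points or extended attributes rather than little JSON files, and the paths and names here are invented:

import json, shutil, tempfile
from pathlib import Path

def migrate(path: Path, archive_dir: Path) -> None:
    """Move a file to slower storage and leave a small stub pointing at the real copy."""
    target = archive_dir / path.name
    shutil.move(str(path), str(target))
    path.write_text(json.dumps({"hsm_stub": True, "migrated_to": str(target)}))

def naive_backup(path: Path, backup_dir: Path) -> None:
    """A backup tool that copies whatever it sees grabs the stub, not the data."""
    shutil.copy2(path, backup_dir / path.name)

root = Path(tempfile.mkdtemp())
primary, archive, backup = root / "primary", root / "archive", root / "backup"
for d in (primary, archive, backup):
    d.mkdir()

big = primary / "scans.dat"
big.write_bytes(b"x" * 1_000_000)            # pretend this is a large, rarely used file
migrate(big, archive)
naive_backup(big, backup)
print((backup / "scans.dat").stat().st_size)  # a few dozen bytes of stub, not 1 MB of data

A backup product that understands the stub format can save the pointer and the migrated copy together; one that doesn't either archives a useless placeholder or forces a recall of every migrated file, which is the heavy-migration worry in the DRS inference above.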
Backup time may not decrease with HSM solutions, because the dominating term in the backup-time equation is hordes of tiny files.
Exchange stores some of the most meaningful data in the wrong place when you do not delete an email - it's stored in "Microsoft Access with a twist".

CAS, or "Content Addressable Storage": the name describes functionality, not how it works. It uses a modified MD5; that number becomes the file identifier. DRS inference: Clearly they need to be sure the odds of an accidental hash collision are astronomically low, by using an effective hash with a large number of bits. A nice thing about CAS is that it is easily expandable to the moon. Often there will be a need for a wrapper to make it look like a traditional filesystem. CAS can do write-once filesystems (WORM).

2005-04-10, 3:30 PM: Back from our break.

IP SANs. Fibre Channel is here to stay, but Ethernet can still do some magic. Focus on host connections; switching, routing, and wide area are all easy and mature with IP SANs. Don't focus on bandwidth.
"TCP offload engine" (TOE) means TCP handling is done in hardware, which should give you pretty much wire speed.
There exists no nice SCSI driver for Microsoft NT 4.
Don't rely on a product that you can only get from a single vendor - they may lag on updating part of their solution, which means higher costs down the road.
CAS is usually a "write once" solution. NetApp does CAS, and published how they did it as a way to beat EMC - the industry started doing CAS NetApp's way because of the good documentation on how to do it the NetApp way.

iSCSI: lots of big vendors like it. It's an industry standard; EMC, Dell, and HP like it.
iSCSI can be done 3 ways: 1) in software only; 2) run iSCSI on a TOE card; 3) all SCSI -and- TCP logic on the card (SNIC stack). NetApp pooh-poohs the half-hardware and pure-hardware solutions - because they do it all in software :) 10-20 MB/s; 50-80 MB/s. The speaker feels this is "just something that should be done in hardware" for the enterprise, but pure software is fine for lower-end stuff. Soon pretty much all operating systems will be able to do iSCSI.
iSCSI HBA: an iSCSI "host bus adapter". Systems will be bootable from a SNIC (which is an HBA and often a network card too) and from an iSCSI HBA.
Most folks buying iSCSI are setting up VLANs, and sometimes CHAP for security: http://www.webopedia.com/TERM/C/CHAP.html (a tiny sketch of the CHAP exchange follows these notes). Also, sometimes folks use IPsec for security with iSCSI.
iSNS: Internet Storage Name Service. According to http://www.networksorcery.com/enp/protocol/isns.htm , it pertains to discovery, management and configuration of iSCSI and Fibre Channel storage. DRS question: Is this related to SIP?
Ethernet and Fibre Channel are switched differently; Fibre Channel is like a "vulcan mind meld".
iSCSI storage arrays: the EMC DMX has an iSCSI option; the EMC CLARiiON has the option now too. An iSCSI (SAN) add-on may undermine usability as a (NAS) fileserver - the filer may become a bottleneck, since it may end up having to do IP for many clients.
iSCSI target software is available for Windows, Linux, Mac OS, e.g. http://sourceforge.net/projects/linux-iscsi
Storage bridges and routers: not much of an issue anymore. They can be a bit of a management problem, but their use is still common in tape applications.
iSCSI storage pools: 20 minutes to set up an iSCSI SAN with some products; getting staggering results from some of them.
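Since CHAP appears above only as a link, here is a hedged sketch of the exchange it performs (one-way CHAP in the style of RFC 1994, which is what iSCSI borrows). The secret, the variable names, and the idea of running it by hand are all illustrative; real iSCSI initiators and targets negotiate this for you.

import hashlib, os

def chap_response(identifier: int, secret: bytes, challenge: bytes) -> bytes:
    # CHAP's one-way function: MD5 over (identifier byte + shared secret + challenge).
    return hashlib.md5(bytes([identifier]) + secret + challenge).digest()

secret = b"example-shared-secret"   # configured on both initiator and target beforehand
challenge = os.urandom(16)          # target sends a fresh random challenge
ident = 1                           # identifier byte for this exchange

response = chap_response(ident, secret, challenge)          # initiator computes and sends this
assert response == chap_response(ident, secret, challenge)  # target recomputes and compares
# The secret itself never crosses the wire; only the challenge and the digest do,
# which is why CHAP (often plus VLANs or IPsec, as noted above) is the usual iSCSI baseline.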
EMC claims Fibre Channel is best for almost everything, with iSCSI for the little stuff. The lecturer does not agree with EMC on this - he believes iSCSI can be very competitive.
Fibre Channel cannot have multiple controllers for a given disk; iSCSI can, which clearly could help throughput. An IP SAN could have 30 controllers or something if you pile up lots of "little" iSCSI boxes.
"Storage virtualization": an overhyped phrase. It implies block level but external to the host. There exist iSCSI virtual appliances that just serve up a bunch of disk via iSCSI.
You can have iSCSI and Fibre Channel on the same platform; blade-oriented solutions are available. Higher port-count vendors: McDATA, Brocade.
The problem with central disk arrays: eventually you outgrow one and need to migrate, and often you're locked in to a single vendor by your initial choice. At some point something will get maxed out: allocation (LUNs), caching, striping/redundancy, abstraction, ...
No frills: stripe, cache, provisioning (dread disk: bepp) ...
Subtract: zones in FC switches. Hosts, FC switch, disks, disk controllers; as many controllers, disks, host channels, and disk channels as needed (SCSI, FC, ATA); add more processors, more RAM, etc.
LSI - same. Nstore - FC but not feature-rich. Commodity hardware, better fault tolerance. Catastrophic data loss is usually due to vendor action. A new HBA comes out with a new implementation: cheap speed boost!
Replication appliance, using zoning in the switch, for easy migration: it sits in between the clients and the previous server until the data is moved - so it's a proxy for a while, then a pure server.
A virtual storage solution allows more flexible designs; "enterprise arrays" are more rigid - like the (Compaq? now HP - and may or may not be the same thing now) ProLiant product line from a few years ago.
Volume-level snapshots: FalconStor (be sure to leave off the "e").
Solid state disk - a memory device, no latency. Hot-spot to solid state; you can have fault-tolerant solid state. Remote replication. Dynamic capacity provisioning: get more when needed, allocate up to a given max - I assume disk virtualization. FalconStor again.
The lecturer does not like "intelligent" storage switches. (Infostor, April 2004.) What some vendors are selling is an inexpensive switch with Linux in it, with APIs on the switch, including virtualization.
Latency in block-level storage can be an issue for iSCSI, not so much for FC, but hot-spotting to solid state disk will often handle iSCSI latency.
Replication software may sometimes require software on the clients (the above-mentioned proxy-to-transition). Avoid that.
Say an array fails: at the top tier of the storage market, if it breaks, it's their responsibility, not yours - EMC, IBM, Hitachi. More low-cost: HP MSA or ProLiant.
Out-of-band virtualization, aka asymmetrical: the data path is separate from the metadata path. Don't put a different I/O chain in and expect the original vendor to support it - but with OOB (out of band) they should, since the data takes the same path and there's just different volume management on top.
Monosphere - block-level HSM: hottest blocks on solid state, then FC, then something else. (Toy sketch of that tiering idea after these notes.) OOB implies a host-based software driver.
Replication can also be done out of band, again meaning it doesn't disrupt the preexisting I/O path.
Cisco - hook in the switch: asymmetric, not in the host but in the switch. A good place for it.
Convergence of SAN and NAS: EMC, Hitachi, IBM are all doing it now, but were reluctant earlier on. Many vendors are saying "we can manage their stuff too". Virtualization tech "throws sand up in the air again" - a free-for-all again.
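A toy sketch of that hottest-blocks-first placement - my own illustration, not how Monosphere or any other vendor actually implements block-level HSM; the tier names and capacities are invented:

from collections import Counter

# (tier name, capacity in blocks), fastest first -- invented numbers for illustration
TIERS = [("solid-state", 2), ("fibre-channel", 4), ("sata", 1_000_000)]

def place_blocks(access_counts: Counter) -> dict:
    """Assign the most frequently accessed blocks to the fastest tier with room left."""
    placement = {}
    ranked = [block for block, _ in access_counts.most_common()]  # hottest first
    start = 0
    for tier_name, capacity in TIERS:
        for block in ranked[start:start + capacity]:
            placement[block] = tier_name
        start += capacity
    return placement

counts = Counter({7: 5000, 42: 4800, 9: 12, 1000: 3})  # block number -> recent accesses
print(place_blocks(counts))
# Blocks 7 and 42 land on solid state, 9 and 1000 fall through to fibre channel, and
# anything colder would end up on SATA. A real product does this continuously and moves
# the data without disturbing the host's I/O path (the out-of-band point above).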
You should be able to have both SAN and NAS in one solution. FalconStor: Linux based, turnkey. BlueArc - DRS: "The world's fastest network storage server".
Block storage over IP is less compelling at large scale: a 1000-5000 node compute cluster divides the bandwidth greatly with a high-end SAN; you're better off with NAS.
NAS clusters: multiple fileservers talking to a backend SAN and sending out to hosts on the LAN. The metadata server could be a bottleneck.
Cambridge Computer Services: the lecturer's company, and it wants to bid on the DDSS (ESMF storage system).
IBRIX: by a professor out of Yale, who is a personal friend of the lecturer. Each IBRIX node is a metadata frontend for a bunch of backend SAN boxes. NFS could be a deal breaker. Move filesystem logic out to the clients - what if clients could have direct access into the SAN backend? IBRIX is just metadata routing; enormously scalable.
MS DFS: no additional cost; moves around from server to server. UNC paths.
AFS: common root, all sorts of caching. Not many corporations run it, but many universities are using it.
NetApp does (some of) what MS DFS does.
Wide-area file systems: not a true distributed filesystem - caching gateways for WAN fileservers. A variety of products optimize the back-and-forthing of chatty protocols like NFS, CIFS, and MAPI (DRS: the MS Exchange protocol). These products spoof the original protocol: open the file on the fileserver, TCP window acceleration, bits cached on both ends. 100x performance boost sometimes. DRS: Sounds very NoMachine NX-like.
Originally the intent was that InfiniBand would be to PCI as fibre is to SCSI. It did not quite evolve that way; PCI-X happened instead, and then the InfiniBand host "channel" adapter (not bus). Vulcan mind meld: drops stuff right into your memory. Really high bandwidth: 10-30 gigabit InfiniBand is common, whereas 10 gigabit Ethernet remains uncommon. Compares well with Myrinet. Super low latency. Has RDMA, or "remote DMA": memory-to-memory transfer "without I/O". InfiniBand almost died, but then a couple of vendors got traction.
RDMA NIC card, or "RNIC": plugs into PCI or PCI-X, gives ultra-low-latency Ethernet. $400/card.
iSER: iSCSI Extensions for RDMA. Patented? www.rdmaconsortium.com. Also sockets extensions for RDMA over TCP.
Modules exist to implement protocol emulation from InfiniBand to Fibre Channel or gigabit Ethernet. It costs as much as fibre or Myrinet, but if you need both, it might be cheaper to use InfiniBand plus conversion to FC and GigE. Provides low-latency Ethernet and Fibre Channel.
The lecturer's company allows him to give storage technology presentations to universities at no charge.