It's now time to talk about the bad part of my friend's experience building a software iSCSI SAN...
Many software vendors in this space are wary of customers doing their own installs, and insist on bringing in an integrator with experience of the product to perform the installation. That's a fine plan in theory, but is totally reliant on the abilities of the integrator, which is where the plan frequently falls apart. Small software vendors tend to get stuck with small integrators who are basically "two men and a dog" outfits, and frequently the dog is the brains of the organization!
Anyway, enough preamble; on to the gory details of the implementation. The consultant (recommended by the vendor as one of the best) came in with a plan to use dual GigE NICs in an adapter teaming configuration to provide load balancing and fail-over. Unfortunately, adapter teaming requires switches that support the Link Aggregation Control Protocol (LACP), which meant buying a new GigE switch at around $5K on top of the original price, in addition to upgrading firmware on the servers.
Implementing adapter teaming also required the entire VLAN architecture of the data center to be redesigned with the addition of two new VLANs (one for iSCSI traffic and one for everything else), which is where things started to go wrong. The Broadcom-based GigE interfaces used in Dell servers assign the same MAC address to each VLAN created on a particular NIC. Sharing a single MAC address causes problems with iSCSI because iSCSI requires the MAC addresses to be unique; otherwise it can't work out which VLAN to bind to.
The integrator insisted that they could work around this problem by creating the iSCSI VLAN, configuring iSCSI, and then creating the second VLAN, and that this approach had worked for them before. Unfortunately, it relies on the system creating the VLANs in the same order every time it boots, and at least for Windows Server 2003 this is not the case. After some days spent troubleshooting the problems this caused, a support call to the software vendor revealed that this was an unsupported network configuration. In the end it was decided to use the NICs as a simple fail-over pair with no load balancing or link aggregation, which worked correctly.
Strike #1 for the integrator - they'd recommended a network architecture that would never work, and cost my friend $5K extra for a switch, plus several days redesigning VLANs and troubleshooting iSCSI problems, none of which would have been necessary had the integrator known what they were doing.
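For what it's worth, the shared-MAC condition is easy to spot before any iSCSI configuration starts. Here is a minimal Python sketch of such a check. The interface names and addresses are made up for illustration, and the inventory is typed in by hand rather than read from the operating system, so treat it as an outline rather than a tool.

```python
from collections import defaultdict


def find_duplicate_macs(interfaces):
    """Return {mac: [interface names]} for every MAC used by more than one interface."""
    by_mac = defaultdict(list)
    for name, mac in interfaces.items():
        # Normalize case and separators so "00-1A-..." and "00:1a:..." compare equal.
        by_mac[mac.lower().replace("-", ":")].append(name)
    return {mac: sorted(names) for mac, names in by_mac.items() if len(names) > 1}


# Hypothetical inventory: two VLAN interfaces on one NIC reporting the same MAC.
interfaces = {
    "nic1-vlan-iscsi": "00-10-18-AA-BB-01",
    "nic1-vlan-data": "00:10:18:aa:bb:01",
    "nic2": "00:10:18:aa:bb:02",
}

duplicates = find_duplicate_macs(interfaces)
for mac, names in duplicates.items():
    print(f"{mac} is shared by: {', '.join(names)}")
```

Any entry in `duplicates` is a hint that binding by MAC address will be ambiguous, which is exactly the situation my friend ended up debugging for days.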
The next hurdle related to the software licensing: the system was initially installed with 30-day evaluation licenses. Four days before the licenses were set to expire (a Monday), my friend sent emails to the vendor and integrator asking whether the licenses would expire at 12pm Thursday or 12pm Friday. He got no response, so he emailed them again on the Tuesday, and on the Wednesday, and on the Thursday! He finally got a response on the Thursday from the integrator, who assured him that the licenses would expire on Friday night, and that they would install the full licenses during business hours on the Friday. I'm sure you can guess what happened next: around 12pm on the Thursday the mail system went down because the licenses for the iSCSI SAN expired and the iSCSI volumes disappeared! My friend spent the Thursday night rebuilding everything, while entertaining himself with thoughts of boiling oil, integrators, and excruciating agony.
Strike #2 for the integrator - they didn't check the correct expiration date for the licenses, causing significant downtime for mail servers.
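The Thursday-or-Friday question is just date arithmetic, provided you know when the clock started. The sketch below assumes the evaluation window is exactly 30 days from the install timestamp. That is my assumption, not the vendor's documented behavior, and the install time and function names are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: the evaluation period is a fixed 30-day window from install time.
EVAL_PERIOD = timedelta(days=30)


def license_expiry(installed_at, period=EVAL_PERIOD):
    """Return the moment an evaluation license lapses."""
    return installed_at + period


def time_remaining(now, installed_at, period=EVAL_PERIOD):
    """Return the time left before expiry (negative once the license has lapsed)."""
    return license_expiry(installed_at, period) - now


# Hypothetical install time, chosen only for illustration.
installed_at = datetime(2007, 1, 2, 23, 30)
expires_at = license_expiry(installed_at)
print("Expires:", expires_at.strftime("%A %d %B %Y at %H:%M"))

# How much time is left when checked on the Monday morning of that week.
monday = datetime(2007, 1, 29, 9, 0)
print("Time left:", time_remaining(monday, installed_at))
```

Thirty days is four weeks plus two days, so a license installed on a Tuesday lapses on a Thursday. Anyone who counted from the following day would have been expecting Friday.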
The final job for the integrator was to move the mail store volume over to thin provisioning. The plan was to break the original mirror, associate one half of that original mirror with a thin-provisioned volume, synchronize the new mirror, then repeat the process for the other half of the mirror. Not rocket science, you'd think, but the integrator fouled it up with consummate skill. Instead of breaking the mirror as my friend suggested, he unmapped the entire mirrored pair from the application server, with entirely predictable results and more downtime for the mail system.
Strike #3 for the integrator - they clearly didn't understand how the software worked, despite being strongly recommended by the software vendor!
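To spell out the intended sequence, here is a toy Python model of the plan. It only generates the ordered steps and tracks which volumes make up the mirror. The volume names and the function are hypothetical, and nothing here talks to real storage software.

```python
def thin_migration_steps(legs, thin_targets):
    """Return (steps, final_legs) for swapping each mirror leg for a thin-provisioned volume."""
    active = list(legs)
    steps = []
    for index, target in enumerate(thin_targets):
        # The other leg stays in place and keeps serving the application server.
        survivor = active[1 - index]
        steps.append(f"break mirror: detach {active[index]} (host keeps running on {survivor})")
        steps.append(f"pair {survivor} with thin volume {target}")
        steps.append(f"synchronize {target} from {survivor}")
        active[index] = target
    return steps, active


# Hypothetical volume names for the two original legs and their replacements.
steps, final_legs = thin_migration_steps(["thick-a", "thick-b"], ["thin-a", "thin-b"])
for number, step in enumerate(steps, start=1):
    print(f"{number}. {step}")
print("Mirror now consists of:", final_legs)
```

Note that one leg remains attached at every step and nothing in the sequence unmaps the volume from the host, which is what should have kept the mail system online.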
Posted by: Nik Simpson