The last time I blogged about SSD performance I had an Intel MLC-based SSD, intended mostly for laptop or read-intensive applications. Looking back at that blog, I reported pretty decent performance numbers with the X25-M Intel drive.
Christmas came early this year for me - I recently received several Intel X25-E enterprise SLC SSDs for evaluation. As an analyst I normally don't get the chance or have the time to get down and dirty, but this opportunity was too good to pass up. Besides, my career has been spent developing products and diving into details. It's hard to leave that legacy behind while looking at a box of SSDs just begging to be run through their paces. As, apparently, the only analyst to receive these drives, I felt obligated to take them for a ride and see if my previous enthusiasm was justified.
Let's get right to it:
I ran tests using IOMeter on a 2.66GHz Intel Core 2 Quad CPU (45nm, 12MB L2 cache) with a 1333MHz front side bus, 4GB of memory, and SATA II 300MB/s ports. The tests were run on a single SSD, both the M and E versions, as well as the E version in a 4-drive RAID 5 configuration. Unfortunately, I don't have a decent RAID adapter (hint) so I used the onboard NVidia MediaShield RAID function.
While I have more data, for simplicity I've plotted IO transaction rates for 512-, 4096- and 32768-byte block sizes for random reads and writes. Using all random reads and writes places significant stress on the SSD and is a good reference point for comparison to HDD performance.
Take a look at the graph:
In the graph, I plot the transactional performance of the X25-M, X25-E, X25-E in RAID 5 and a SATA HDD as a function of block size.
It's worth pointing out that the tests I ran are far from real world, but they do highlight performance under extreme conditions. Measuring performance can be a tricky business, but I believe the tests I’ve run are a good reference point and easily repeatable – except for a weirdness that I’ll point out in a few…
Take a close look at the results. The performance for the X25-E is very compelling. For random reads, the X25-E (both as a single drive and in RAID) tops out around 12,000 IOPs, as does the X25-M. You'll need to look closely to see the plotted lines as they overlap at the top of the graph. I suspect that the drives are capable of much more and are bottlenecked by the upstream motherboard and driver stack. I didn't spend much time tuning my system, so the read number could be far higher. In any case, the values leave the poor SATA HDD in the dust.
The random write performance is equally compelling for the X25-E, operating far faster than the X25-M and making the HDD look like a stone.
The "X25-E RAID5 - Write" test, using 4 disks, stands out like a turkey in a chicken ranch. The RAID performance is actually worse than a single disk. Hmm, why is that?
When doing writes in a RAID 5 configuration, an XOR operation is required (not so when reading). Since the RAID function on my motherboard is driver based, no doubt my system is the bottleneck. This limitation does point out the stress placed on RAID adapters when dealing with high transaction rate devices. Most RAID adapters are best suited to dealing with single threaded devices (e.g. hard disks) operating at hundreds of IOPs not thousands of IOPs as SSDs can do. I'll have to wait to get my hands on a decent RAID adapter (hint number 2) before this can be explored further.
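To make the XOR step concrete, here is a minimal Python sketch of how a RAID 5 small write can update parity. The function names and the 4-byte blocks are purely illustrative; this is not taken from MediaShield or any real driver, just the textbook read-modify-write idea.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def updated_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """Read-modify-write update: XOR out the old data, XOR in the new data."""
    return xor_blocks(old_parity, old_data, new_data)

# Hypothetical stripe on 4 drives: three data blocks plus one parity block.
d0, d1, d2 = b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"
parity = xor_blocks(d0, d1, d2)

# Overwrite d1. The new parity needs only the old parity and the old/new data.
new_d1 = b"\xff\x00\xff\x00"
new_parity = updated_parity(parity, d1, new_d1)

# It matches a full recomputation across the stripe...
assert new_parity == xor_blocks(d0, new_d1, d2)
# ...and any one block can be rebuilt from the others plus parity.
assert xor_blocks(new_d1, d2, new_parity) == d0
```

A plain read needs none of this, which is consistent with the reads not showing the same penalty in my results.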
But there is some weirdness. Look at the following graph:
As I prepared to collect performance data, I ran the random 4k block write test a few times. I noticed that the result varied over time and depended on the state of the SSD before the test was run. That's weird. With a hard disk, performance is very predictable and constant over time. Apparently not so for an SSD. I think we knew this but the graph proves the point. Before the test, I had conditioned the X25-E with 64K random block writes. Not scientific, but the results shown in the above graph are curious nonetheless. The random write performance varied four to one over the period of 30 minutes where I collected performance data at 5 minute intervals.
While much more performance testing and analysis is needed, such as the examination of latency values, I'll leave that to others with more time on their hands....
The performance of the Intel X25-E is remarkable compared to a hard disk. Unfortunately, the unexpected performance variability was a surprise and adds a new dimension to interpreting performance data.
Oh, and btw, the X25-E hardly got warm to the touch throughout the testing. So while I don't have a way to measure power, the X25-E clearly uses far less power than my SATA HDD that I can use as a donut warmer.
So this brings up a good point, and I'll end the blog on this note:
The industry needs a standard way to test SSDs. Period.
Please feel free to comment.
Posted by Gene Ruth


Gene,
I suspect that your RAID-5 results have little or nothing to do with your RAID controller. Instead, you join a small but growing list of astute observers who are uncovering an intrinsic problem with Flash, a fundamental problem at the lowest levels of flash device architecture.
A problem that no one at the Flash SSD hype party seems to want to acknowledge. The proverbial "turd in the punchbowl".
In -isolation-, Flash reads are much faster than spinning disks. Likewise, -in isolation- Flash writes are also faster, though not by as wide a margin. This asymmetry of read vs. write performance is a big problem, but it's not a deal breaker for the Flash value proposition.
What you discovered in your RAID-5 test however IS the deal-breaker -- the thing about flash that no one seems to see. Those who have seen it seem to chalk it up to some other system problem. After all, it's simply not possible that a Flash device, with no moving parts, can be substantially SLOWER than disk. Certainly we understand that Flash may not always, in every application, be 100x faster than disk...but if a test shows Flash substantially slower than spinning disk...well...it's got to be the test setup, right? Must be in the RAID controller or some other system gremlin (as you observe).
The truth is that this really and truly is a fundamental limit of Flash technology, and IMO it is the reason why NAND Flash will not find widespread use in ANY brick-and-mortar enterprise applications.
Whenever Flash devices (either SLC or MLC) are performing BOTH reads and writes simultaneously, they are MUCH slower than when workloads are either 100% read or 100% write. Check the Intel spec sheets for the details. Intel reports that a 70/30 mix of reads and writes (IOmeter) yields far fewer IOPS than what should be expected based on a simple formula of 0.7x read-only IOPS plus 0.3x write-only IOPS. Yet the performance is still faster than disk, so this isn't (quite) the deal killer either.
The Achilles heel of flash is synchronous IO.
Whenever synchronous I/O is involved, such as in all forms of parity-based RAID and...oh-by-the-way -Transaction Processing- and just about every other performance-critical enterprise application, reads cannot happen (wait in queue) until previous writes are completed.
The combination of the three factors: (a) performance asymmetry, (b) slowdown when reads and writes happen simultaneously, and (c) the impact on application behavior when these are encountered in synchronous IO workloads, results in Flash being substantially slower than spinning disk...as it was in your test of RAID5.
In case you were wondering, this is why no one has yet published a TPC benchmark with Flash. This is why you haven't seen ANY audited enterprise application benchmarks with Flash SSD.
Real applications subject storage to synchronous IO, something that benchmarks don't do. Nor do I/O "traces" taken from applications, even if the I/O workloads were originally synchronous in nature.
Of course...with all the cash flowing into Flash SSD startups and the hype-cycle (profitable in its own right), people have been ignoring this interesting little "feature" of Flash silicon.
Don't chalk it up to your RAID controller. Get in touch with your contact at Intel and ask them directly. Afterward, you may be interested enough to dig in to the underlying issue at the silicon level and see whether or when this problem can be solved.
On the other hand, nobody wants to be the one that points out the turd in the punchbowl.
Enjoy!
J Scouter
Posted by: jscouter | January 13, 2009 at 09:48 AM
J,
I do agree that performance does seem to vary; however, we have played with an E series drive and get different results in some cases.
The read results that Gene sees are pretty good.... about 11K to 12K IOps. I work in an enterprise, so my interest lies in the 8K to 128K block size.
The E series drives are extremely fast. In fact, we can put as much as 14K IOps into an E series drive with 8K blocks.
As far as the M series drives, these are almost as fast; however, they have a rate limiter that kicks in at a rate higher than the average user, but potentially below an enterprise. (Think whole disk raid rebuild...). If I had a power system at home or the office, I'd buy an M series (160GB is out now).
There are two flaws to this test. First of all, even a decent raid controller will flame out in the 12,000 IOps range for reads. For writes on a RAID 5, it's worse in two ways. Each random write requires the controller to perform two reads (data and parity) and then a data and parity write. So that's a multiplication of four. If you have battery cache, your application won't notice, but your IOps goes up on the back end. This is probably why the RAID 5 write performance was worse than one drive.
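As a back-of-the-envelope sketch of that multiplication of four: the code below assumes every host write is a small random write that needs the full read-modify-write cycle, that no cache absorbs any of it, that the I/O spreads evenly across the drives, and that a drive handles reads and writes at the same rate (not really true for an SSD). The function names and the 3,000 IOps figure are made up for illustration.

```python
# Back-end I/Os per small random host write on RAID 5:
# read old data, read old parity, write new data, write new parity.
RAID5_IOS_PER_SMALL_WRITE = 4

def raid5_backend_iops(host_write_iops: float) -> float:
    """Back-end I/O rate the drives see for a given host write rate."""
    return host_write_iops * RAID5_IOS_PER_SMALL_WRITE

def raid5_host_write_ceiling(per_drive_iops: float, drives: int) -> float:
    """Rough upper bound on host write IOps for the whole array."""
    return per_drive_iops * drives / RAID5_IOS_PER_SMALL_WRITE

print(raid5_backend_iops(3000))           # 12000 back-end I/Os per second
print(raid5_host_write_ceiling(3000, 4))  # 3000.0, no better than one drive
```

Under those assumptions a 4 drive RAID 5 can at best match a single drive on small random writes, before any controller or driver overhead is counted.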
Finally, RAID controllers tend to flame out at a range of IOps. Since each SSD can push as many or more IOps than a RAID card, I think that this is a problem that is brewing in the RAID industry.
I would propose the following two tests:
--Repeat the test on a mirror and see if read and write iops improve. It may not if you're at a small number of threads.
--Repeat the test on a SATA bus directly connected to the system. Avoid a RAID controller port, and find out if it supports SATA II.
If it were a workstation and I had the money, I'd rather buy some M drives and mirror them than raid together E's. Still, with that many IOps, you're already doing great!
Posted by: Ken | January 14, 2009 at 07:21 PM
The expected performance of a 70/30 blend of actions that take different times is not 0.7 * A + 0.3 * B.
That would be a straight weighted average. The correct prediction is calculated in the time domain, which is a weighted geometric average in the iops domain.
Example:
You drive 3 miles at 10 MPH, and 70 miles at 100 MPH. Your average speed is NOT 0.3 * 10 + 0.7 * 100 = 73MPH. It's 1/(0.3 * (1/10) + 0.7 * (1/100)) = 1/.037 =~ 27 MPH.
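A minimal sketch of the time-domain calculation applied to IOPS, assuming operations are serviced one at a time and that the 70/30 split is a fraction of operations. The function names and the 3,000 write IOPS figure are hypothetical; 12,000 read IOPS is roughly what Gene measured.

```python
def naive_mix(read_fraction: float, read_iops: float, write_iops: float) -> float:
    """Straight weighted average of the two rates."""
    return read_fraction * read_iops + (1.0 - read_fraction) * write_iops

def time_domain_mix(read_fraction: float, read_iops: float, write_iops: float) -> float:
    """Average the time per operation, then invert to get a rate."""
    seconds_per_op = read_fraction / read_iops + (1.0 - read_fraction) / write_iops
    return 1.0 / seconds_per_op

print(round(naive_mix(0.7, 12000, 3000)))        # 9300
print(round(time_domain_mix(0.7, 12000, 3000)))  # 6316
```

The same function reproduces the 27 figure when the weights are fractions of the total: time_domain_mix(0.7, 100, 10) comes out at about 27.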
I have tested Intel's drives and under most read/write mixes the performance is actually BETTER than this expectation.
For my application, random read performance (latency) while concurrent random and sequential reads are taking place is critical.
Every manufacturer we tried failed at that test other than Intel. We did not test MTron, Samsung, or the other of the more expensive "enterprise" SLC vendors. All "cheap" consumer grade devices failed miserably. Even writing at only 10% of the peak rate of these devices caused huge spikes in read latency.
Intel's drives can become read-starved after the write load goes past about 97% of the peak write load. Write performance will vary depending on the difference in the current write pattern from the historical one (that affects the time it takes to 'garbage collect' a new erase page and keep ahead of the write demand).
J Scouter, your comments have many fallacies. sync() calls don't necessarily block all reads on a controller with full NCQ -- that is an OS level thing. Write barriers will have an effect on writes (and their use is OS and file system dependent).
Although an individual flash DEVICE may have many of the characteristics you mention, these newer PRODUCTS have many parallel devices, so while one bank's data may be inaccessible while a page is erasing, there are many other devices that can respond to reads or writes concurrently. It's all in the controller design, the LBA <> physical block remapping algorithms, and how much flash is overallocated (a 64GB drive likely has a lot more actual flash on it than what is exposed).
I'm using hundreds of these things in production mission critical servers with no issues, only radically better performance. I plan to use more in the future.
Last comment:
I will confirm Ken's excellent points. Putting a RAID controller in front of one of Intel's drives actually SLOWS it down, especially in high IOPS situations and worse, mixed read/write situations.
A raid controller will favor writes over reads, and starve the reads out. It will in particular favor sequential writes over all else. For hard drives, this is a good thing. For SSDs with good controllers it makes the most sense to get out of the way, mix reads and writes together, and slightly favor reads over writes near the saturation point.
I have seen the best SSD performance when directly attached to a SAS or SATA non-raid port.
Posted by: Scott | February 19, 2009 at 10:30 PM
We have actually tested our real-world CRM database on these X25-E's (in RAID-0 and RAID-5 configuration).
Without having the time to go into details, I can tell you that the performance of our database transactions using the X25-E drives (4 drives in RAID-0) compared to standard 15K rpm drives (RAID-1) improved by 6 - 20 times, depending on the complexity of the transaction [the more complex, the more improvement we noticed].
Obviously YMMV, and we noticed CPU suddenly becoming a bottleneck, where it was previously flat-lining because of disk constraint.
While it is hard to confirm any vendor's claim about performance, we were actually very surprised to find Intel's claims matched with reality. (This was on a DELL R900, connected as DAS to a PERC6 controller)
Posted by: Wolfgang | March 30, 2009 at 11:24 PM
I want to add that J Scouter's observations about SSD performance numbers don't ring true to me.
Like Wolfgang, I have taken a real world database application and put it on SSD drives for testing.
The results were astounding, with performance improvements across the board. As expected, read heavy operations were through the roof, but all our operations, even those with a mixture of writes, were still faster than on the old setup.
I have a different speculation why you don't see SSD on the TPC-C benchmarks...it's just too new.
Storage vendors are going with drives that still cost $100 per gigabyte. Not the $15 per gigabyte drives that are being tested here.
And for whatever reason, only Pillar is going with X25-E, the other storage vendors are dabbling in the impossibly expensive...
Just a marketing decision based on what vendors think will extract the most dollars from their customers...nothing to do with technology.
Posted by: Jason | May 29, 2009 at 08:10 AM
Scott, your critique of my post here is wildly inaccurate. To begin with, your math was WAY wrong when you said:
"The expected performance of a 70/30 blend of actions that take different times is not 0.7 * A + 0.3 * B." and "The correct prediction is calculated in the time domain, which is a weighted geometric average in the iops domain."
This is nonsense, though it is well adorned with improperly applied math terminology.
The mistake you made in your example was to substitute "miles" for IOPS, as the latter ALREADY INCORPORATES the time domain, while "miles" does not. IOPS is "I/O's Per Second", remember?
Of course you don't need to actually DO the math to see how far off you are when you said:
"(3 miles at 10MPH) plus (70 miles at 100MPH) = average speed 27 MPH"
You are off by a factor of almost three. Average speed over that distance is in fact 73MPH because 73 miles were travelled in one hour. Likewise, the rest of my math was also correct.
The fact that the performance results you measured on the X25E DO, however, match up with your erroneous math underscores my point. X25E is WAY underperforming when workloads include both reads and writes -- almost exactly according to the analysis I provided.
Your critique went downhill from there. For example, you fail (fundamentally) to understand the "synchronous" I/O problem as it relates to transactional database operations and parity based RAID. This has NOTHING whatsoever to do with "sync() calls at the controller" or NCQ.
You may wish to refer to Jim Gray's original papers on transactional database theories (ACID).
In any case, here we are seven months later and still no TPC, nor ANY audited APPLICATION benchmarks of ANY flash SSD devices.
WHY ON EARTH NOT??? Flash SSD is supposedly changing the world, right????
Lots of "theories" as to why not, but no results. For that matter, where are the SPC results?
I saw an interesting e-mail response from a benchmarking organization that did a performance test for Intel (under contract). The paper is out on the Intel website.
They compared an array of X-25Es to a much larger array of SAS disks and concluded that they were able to replace 24 spinning disks with six X25E's. While four-to-one is nowhere near the hype-cycle numbers of dozens/hundreds of spinners per SSD, it still sounds pretty good. Turns out though that even THESE numbers were "cooked"!
Intel's "independent" lab did NOT disclose the hardware setup in terms of devices per interface channel. In the e-mail I saw, when pressed explicitly for the configuration, this third-party testing organization admits that they put all 24 spinning disks on a SINGLE 3GBPS SAS channel, while they connected EACH of the SSDs to its OWN independent 3GBPS SATA channel.
That's 18Gbps of bandwidth provided to the six SSDs, while the 24 spinning disks were starved with only 3Gbps total ON A SINGLE ARBITRATED SAS CHANNEL.
Now in this setup, it's not the bandwidth so much as the >>channel contention<< that bottlenecked the disks. NOBODY puts more than 5-6 disks (spinners or Flash) on a single arbitrated bus/channel when small-block IO performance is the goal!!!
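A rough sketch of how lopsided that setup is, using only the channel counts and nominal link rates above. It assumes an even split of each link and ignores protocol overhead and arbitration cost; the function names are illustrative.

```python
def devices_per_channel(devices: int, channels: int) -> float:
    """How many devices have to share each channel."""
    return devices / channels

def per_device_gbps(devices: int, channels: int, gbps_per_channel: float) -> float:
    """Nominal link bandwidth available to each device, split evenly."""
    return channels * gbps_per_channel / devices

# Six SSDs, each on its own 3Gbps SATA channel.
print(devices_per_channel(6, 6), per_device_gbps(6, 6, 3.0))    # 1.0 3.0
# 24 spinning disks sharing one 3Gbps SAS channel.
print(devices_per_channel(24, 1), per_device_gbps(24, 1, 3.0))  # 24.0 0.125
```

The bandwidth share is only part of the story; the 24-to-1 device count per channel is what drives the contention.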
This is among the oldest and lamest "benchmarketing" tricks in the book. Shockingly deceptive, especially for a company like Intel.
Worse, then they ran some obscure "transactional workload" that was impossible to define/duplicate. I couldn't find anywhere that this workload had previously been used as a benchmark.
All of this deception to get to the point where Intel can assert that a single X25E can replace a measly 4 spinning disks.
Anybody wanna know what the numbers would be if they put the 24 spinners on six 3Gbps channels, same as the SSDs?
Intel, are you listening?
Meanwhile we still have no audited benchmark from any vendor on Flash SSD, and we only have theories as to why not. Not one SSD vendor has publicly commented as to why they won't publish a TPC or SPC benchmark.
Another theory is that the Flash SSDs are just too new.
Ok, it's now June, eight months after the introduction of the X25E. So much for that theory.
IBM and Fusion-IO built a HUGE TPC-class system around Flash SSD in 2008, and did a few press releases about it. It was called "quicksilver" if I remember correctly. IBM is the No. 1 publisher of TPC benchmark results, and they are experts on the benchmark. Does anybody here think they DIDN'T run the TPC suite on that system?
Sure they did. So where are the results? The silence has become deafening.
We're hearing lots of anonymous posters talk about their amazing performance numbers -- why aren't we seeing these numbers published by the boatload in vendor "case studies"???
The best remaining theory? The reason we have not seen TPC, SPC, etc. is because people will be shocked to discover what the REAL application performance is, once Flash SSD technology is exposed to AUDITED benchmarks.
We will see cost-per-transaction-per-minute at more than 3x that of spinning disk. We will see vendors struggle to get to three-to-one replacement ratios.
And these ridiculously expensive "enterprise class" SSDs from any one of a number of array manufacturers (starting with EMC and all the startups)? I expect we'll NEVER see TPC numbers on that class of devices, because nobody wants to pay 20x more dollars-per-transaction-per-minute, and none of these vendors wants those numbers published.
Posted by: jscouter | June 24, 2009 at 07:48 AM
Well, it's a start.
Dell and Fusion-IO have published the first audited database benchmark using a Flash-memory device for storage. The submission is still under review -- no telling yet that the results will pass the audit, but as I said...it's a start.
The benchmark is the TPC-H, which is a "decision support" benchmark. This is important because TPC-H entails an almost exclusively "read-only" workload pattern -- perfect context for SSD as it completely avoids the write performance problem.
In the scenarios painted for us by the Flash SSD hypesters, this application represents a soft-pitch, a gentle lob right over the plate and an easy "home run" for a Flash-SSD database application.
The results? The very first audited benchmark for a flash SSD based database application, and a READ ONLY one at that, resulted in (drum roll here)....a fifth place finish among comparable (100GB) systems tested.
This Flash SSD system produced about 30% as many Queries-per-hour as a disk-based system result published by Sun...in 2007.
Surely though it will be in price/performance that this Flash-based system will win, right?
At $1.46 per Query-per-hour, the Dell/Fusion-IO system finishes in third place. More than twice the cost-per-query-per-hour of the best disk-based system.
And this is for a read-only application.
Can't wait to see TPC-C, TPC-E, or SPC-1
Posted by: jscouter | June 24, 2009 at 08:43 AM
I think with a hard disk, performance is very predictable and constant over time.
Posted by: cheap computers | January 20, 2010 at 11:43 AM