
November 18, 2008




I suspect that your RAID-5 results have little or nothing to do with your RAID controller. Instead, you join a small but growing list of astute observers who are uncovering an intrinsic problem with Flash, a fundamental problem at the lowest levels of flash device architecture.

A problem that no one at the Flash SSD hype party seems to want to acknowledge. The proverbial "turd in the punchbowl".

In -isolation-, Flash reads are much faster than spinning disks. Likewise, -in isolation- Flash writes are also faster, though not by as wide a margin. This asymmetry of read vs. write performance is a big problem, but it's not a deal breaker for the Flash value proposition.

What you discovered in your RAID-5 test, however, IS the deal-breaker -- the thing about flash that no one seems to see. Those who have seen it seem to chalk it up to some other system problem. After all, it's simply not possible that a Flash device, with no moving parts, can be substantially SLOWER than disk. Certainly we understand that Flash may not always, in every application, be 100x faster than disk...but if a test shows Flash substantially slower than spinning disk...well...it's got to be the test setup, right? Must be the RAID controller or some other system gremlin (as you observe).

The truth is that this really and truly is a fundamental limit of Flash technology, and IMO it is the reason why NAND Flash will not find widespread use in ANY brick-and-mortar enterprise applications.

Whenever Flash devices (either SLC or MLC) are performing BOTH reads and writes simultaneously, they are MUCH slower than when workloads are either 100% read or 100% write. Check the Intel spec sheets for the details. Intel reports that a 70/30 mix of reads and writes (IOmeter) yields far fewer IOPS than what should be expected based on a simple formula of 0.7x read-only IOPS plus 0.3x write-only IOPS. Yet the performance is still faster than disk, so this isn't (quite) the deal killer either.

The Achilles heel of flash is synchronous I/O.

Whenever synchronous I/O is involved, such as in all forms of parity-based RAID and...oh-by-the-way -Transaction Processing- and just about every other performance-critical enterprise application, reads cannot happen (wait in queue) until previous writes are completed.

The combination of three factors: (a) performance asymmetry, (b) slowdown when reads and writes happen simultaneously, and (c) the impact on application behavior when these are encountered in synchronous I/O workloads, results in Flash being substantially slower than spinning disk...as it was in your test of RAID-5.

In case you were wondering, this is why no one has yet published a TPC benchmark with Flash. This is why you haven't seen ANY audited enterprise application benchmarks with Flash SSD.

Real applications subject storage to synchronous IO, something that benchmarks don't do. Nor do I/O "traces" taken from applications, even if the I/O workloads were originally synchronous in nature.

Of course...with all the cash flowing into Flash SSD startups and the hype-cycle (profitable in its own right), people have been ignoring this interesting little "feature" of Flash silicon.

Don't chalk it up to your RAID controller. Get in touch with your contact at Intel and ask them directly. Afterward, you may be interested enough to dig into the underlying issue at the silicon level and see whether or when this problem can be solved.

On the other hand, nobody wants to be the one that points out the turd in the punchbowl.


J Scouter


I do agree that performance does seem to vary, however, we have played with an E series drive and get different results in some cases.
The read results that Gene sees are pretty good.... about 11K to 12K IOps. I work in an enterprise, so my interest lies in the 8K to 128K block size.
The E series drives are extremely fast. In fact, we can put as much as 14K IOps into an E series drive with 8K blocks.
As far as the M series drives, these are almost as fast; however, they have a rate limiter that kicks in at a rate higher than the average user would generate, but potentially below an enterprise workload. (Think whole-disk RAID rebuild...) If I had a power system at home or the office, I'd buy an M series (160GB is out now).
There are two flaws in this test. First of all, even a decent RAID controller will flame out in the 12,000 IOps range for reads. For writes on a RAID 5, it's worse. Each random write requires the controller to perform two reads (data and parity) and then a data and parity write. So that's a four-fold multiplication. If you have battery-backed cache, your application won't notice, but your IOps goes up on the back end. This is probably why the RAID 5 write performance was worse than one drive.
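The read-modify-write arithmetic described above can be sketched in a few lines. This is a back-of-the-envelope model only (the function name and the 3,000-write example are illustrative, not from the original test); it assumes the classic RAID-5 small-write path of two reads plus two writes per logical random write, with no full-stripe or cache optimizations.

```python
def raid5_backend_iops(frontend_writes: float, frontend_reads: float = 0.0) -> float:
    """Total back-end I/Os per second implied by a random RAID-5 workload.

    Each random front-end write triggers a read-modify-write cycle:
    read old data + read old parity, then write new data + new parity.
    """
    WRITE_MULTIPLIER = 4  # 2 reads + 2 writes per logical random write
    return frontend_reads + WRITE_MULTIPLIER * frontend_writes

# Illustrative: 3,000 random front-end writes/s become 12,000 back-end
# I/Os/s -- right around the range where the commenter says controllers
# "flame out".
print(raid5_backend_iops(3000))  # -> 12000.0
```

This is why a battery-backed write cache hides the penalty from the application while the back-end load quietly quadruples.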
Finally, RAID controllers tend to flame out at a range of IOps. Since each SSD can push as many or more IOps than a RAID card, I think that this is a problem that is brewing in the RAID industry.
I would propose the following tests:
--Repeat the test on a mirror and see if read and write iops improve. It may not if you're at a small number of threads.
--Repeat the test on a SATA bus directly connected to the system. Avoid a RAID controller port, and find out if it supports SATA II.
--If it were a workstation and I had the money, I'd rather buy some M drives and mirror them than raid together E's. Still, with that many IOps, you're already doing great!


The expected performance of a 70/30 blend of actions that take different times is not 0.7 * A + 0.3 * B.
That would be a straight weighted average. The correct prediction is calculated in the time domain, which is a weighted geometric average in the iops domain.

You drive 3 miles at 10 MPH, and 70 miles at 100 MPH. Your average speed is NOT 0.3 * 10 + 0.7 * 100 = 73 MPH. It's 1/(0.3 * (1/10) + 0.7 * (1/100)) = 1/0.037 =~ 27 MPH.
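The time-domain calculation the comment describes can be written out explicitly. Serving a mix serially means the total time is the weighted sum of per-operation times, so the blended rate is a weighted harmonic mean of the two rates (the comment calls it "geometric", but the formula shown is the harmonic mean). The 10,000/1,000 IOPS figures below are made-up round numbers for illustration, not Intel's measured rates:

```python
def blended_iops(read_iops: float, write_iops: float, read_frac: float = 0.7) -> float:
    """Weighted-harmonic-mean prediction for a serially served read/write mix."""
    write_frac = 1.0 - read_frac
    return 1.0 / (read_frac / read_iops + write_frac / write_iops)

# Illustrative rates: 10,000 read IOPS, 1,000 write IOPS, 70/30 mix.
naive = 0.7 * 10000 + 0.3 * 1000   # 7300 -- straight weighted average
timed = blended_iops(10000, 1000)  # ~2703 -- time-domain prediction
print(round(naive), round(timed))  # -> 7300 2703
```

The gap between the two predictions is exactly the gap being argued over in the surrounding comments: the slower operation class dominates total time, so it dominates the blended rate.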

I have tested Intel's drives and under most read/write mixes the performance is actually BETTER than this expectation.
For my application, random read performance (latency) while concurrent random and sequential writes are taking place is critical.

Every manufacturer we tried failed at that test other than Intel. We did not test MTron, Samsung, or the other more expensive "enterprise" SLC vendors. All "cheap" consumer-grade devices failed miserably. Even writing at only 10% of the peak rate of these devices caused huge spikes in read latency.
Intel's drives can become read-starved after the write load goes past about 97% of the peak write load. Write performance will vary depending on the difference in the current write pattern from the historical one (that affects the time it takes to 'garbage collect' a new erase page and keep ahead of the write demand).

J Scouter, your comments have many fallacies. sync() calls don't necessarily block all reads on a controller with full NCQ -- that is an OS-level thing. Write barriers will have an effect on writes (and their use is OS- and file-system-dependent).
Although an individual flash DEVICE may have many of the characteristics you mention, these newer PRODUCTS have many parallel devices, so while one bank's data may be inaccessible while a page is erasing, there are many other devices that can respond to reads or writes concurrently. It's all in the controller design, the LBA <> physical block remapping algorithms, and how much flash is overallocated (a 64GB drive likely has a lot more actual flash on it than what is exposed).

I'm using hundreds of these things in production mission critical servers with no issues, only radically better performance. I plan to use more in the future.

Last comment:
I will confirm Ken's excellent points. Putting a RAID controller in front of one of Intel's drives actually SLOWS it down, especially in high IOPS situations and worse, mixed read/write situations.
A RAID controller will favor writes over reads, starving the reads out. It will in particular favor sequential writes over all else. For hard drives, this is a good thing. For SSDs with good controllers it makes the most sense to get out of the way, mix reads and writes together, and slightly favor reads over writes near the saturation point.

I have seen the best SSD performance when directly attached to a SAS or SATA non-raid port.


We have actually tested our real-world CRM database on these X25-E's (in RAID-0 and RAID-5 configuration).
Without having the time to go into details, I can tell you that the performance of our database transactions using the X25-E drives (4 drives in RAID-0), compared to standard 15K rpm drives (RAID-1), improved by 6 to 20 times, depending on the complexity of the transaction [the more complex, the more improvement we noticed].
Obviously YMMV, and we noticed the CPU suddenly becoming a bottleneck, where it was previously flat-lining because of the disk constraint.
While it is hard to confirm any vendor's claim about performance, we were actually very surprised to find Intel's claims matched with reality. (This was on a DELL R900, connected as DAS to a PERC6 controller)


I want to add that J Scouter's observations about SSD performance numbers don't ring true to me.

Like Wolfgang, I have taken a real world database application and put it on SSD drives for testing.

The results were astounding: performance improvements across the board. As expected, read-heavy operations were through the roof, but all our operations, even those with a mixture of writes, were still faster than on the old setup.

I have a different speculation as to why you don't see SSD on the TPC-C benchmarks...it's just too new.

Storage vendors are going with drives that still cost $100 per gigabyte. Not the $15 per gigabyte drives that are being tested here.

And for whatever reason, only Pillar is going with X-25E, the other storage vendors are dabbling in the impossibly expensive...

Just a marketing decision based on what vendors think will extract the most dollars from their customers...nothing to do with technology.


Scott, your critique of my post here is wildly inaccurate. To begin with, your math was WAY wrong when you said:

"The expected performance of a 70/30 blend of actions that take different times is not 0.7 * A + 0.3 * B." and "The correct prediction is calculated in the time domain, which is a weighted geometric average in the iops domain."

This is nonsense, though it is well adorned with improperly applied math terminology.

The mistake you made in your example was to substitute "miles" for IOPS, as the latter ALREADY INCORPORATES the time domain, while "miles" does not. IOPS is "I/O's Per Second", remember?

Of course you don't need to actually DO the math to see how far off you are when you said:

"(3 miles at 10MPH) plus (70 miles at 100MPH) = average speed 27 MPH"

You are off by a factor of almost three. Average speed over that distance is in fact 73MPH because 73 miles were travelled in one hour. Likewise, the rest of my math was also correct.

The fact that the performance results you measured on the X25E DO, however, match up with your erroneous math underscores my point. The X25E is WAY underperforming when workloads include both reads and writes -- almost exactly according to the analysis I provided.

Your critique went downhill from there. For example, you fail (fundamentally) to understand the "synchronous" I/O problem as it relates to transactional database operations and parity based RAID. This has NOTHING whatsoever to do with "sync() calls at the controller" or NCQ.

You may wish to refer to Jim Gray's original papers on transactional database theories (ACID).

In any case, here we are seven months later and still no TPC, nor ANY audited APPLICATION benchmarks of ANY flash SSD devices.

WHY ON EARTH NOT??? Flash SSD is supposedly changing the world, right????

Lots of "theories" as to why not, but no results. For that matter, where are the SPC results?

I saw an interesting e-mail response from a benchmarking organization that did a performance test for Intel (under contract). The paper is out on the Intel website.

They compared an array of X-25Es to a much larger array of SAS disks and concluded that they were able to replace 24 spinning disks with six X25E's. While four-to-one is nowhere near the hype-cycle numbers of dozens/hundreds of spinners per SSD, it still sounds pretty good. Turns out though that even THESE numbers were "cooked"!

Intel's "independent" lab did NOT disclose the hardware setup in terms of devices per interface channel. In the e-mail I saw, when pressed explicitly for the configuration, this third-party testing organization admits that they put all 24 spinning disks on a SINGLE 3GBPS SAS channel, while they connected EACH of the SSD's to it's OWN independent 3GBPS SATA channel.

That's 18Gbps of bandwidth provided to the six SSDs, while the 24 spinning disks were starved with only 3Gbps total ON A SINGLE ARBITRATED SAS CHANNEL.

Now in this setup, it's not the bandwidth so much as the >>channel contention<< that bottlenecked the disks. NOBODY puts more than 5-6 disks (spinners or Flash) on a single arbitrated bus/channel when small-block IO performance is the goal!!!
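Even setting channel contention aside, the raw bandwidth asymmetry in the setup described above is easy to quantify. This is illustrative arithmetic only, taking the 3 Gbps per-channel figure and the device counts from the comment at face value:

```python
# Per-device bus bandwidth implied by the benchmark configuration described
# above: 6 SSDs each on a dedicated 3 Gbps channel vs. 24 spinning disks
# sharing a single 3 Gbps channel.
GBPS_PER_CHANNEL = 3.0

ssd_per_device = (6 * GBPS_PER_CHANNEL) / 6   # 3.0 Gbps per SSD
hdd_per_device = GBPS_PER_CHANNEL / 24        # 0.125 Gbps per disk

print(ssd_per_device / hdd_per_device)  # -> 24.0
```

So each SSD was granted 24x the bus bandwidth of each spinning disk before a single I/O was issued, on top of the arbitration overhead of 24 devices contending for one channel.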

This is among the oldest and lamest "benchmarketing" tricks in the book. Shockingly deceptive, especially for a company like Intel.

Worse, then they ran some obscure "transactional workload" that was impossible to define/duplicate. I couldn't find anywhere that this workload had previously been used as a benchmark.

All of this deception to get to the point where Intel can assert that a single X25E can replace a measly 4 spinning disks.

Anybody wanna know what the numbers would be if they put the 24 spinners on six 3Gbps channels, same as the SSDs?

Intel, are you listening?

Meanwhile we still have no audited benchmark from any vendor on Flash SSD, and we only have theories as to why not. Not one SSD vendor has publicly commented as to why they won't publish a TPC or SPC benchmark.

Another theory is that the Flash SSDs are just too new.

Ok, it's now June, eight months after the introduction of the X25E. So much for that theory.

IBM and Fusion-IO built a HUGE TPC-class system around Flash SSD in 2008, and did a few press releases about it. It was called "quicksilver" if I remember correctly. IBM is the No. 1 publisher of TPC benchmark results, and they are experts on the benchmark. Does anybody here think they DIDN'T run the TPC suite on that system?

Sure they did. So where are the results? The silence has become deafening.

We're hearing lots of anonymous posters talk about their amazing performance numbers -- why aren't we seeing these numbers published by the boatload in vendor "case studies"???

The best remaining theory? The reason we have not seen TPC, SPC, etc. is because people will be shocked to discover what the REAL application performance is, once Flash SSD technology is exposed to AUDITED benchmarks.

We will see cost-per-transaction-per-minute at more than 3x that of spinning disk. We will see vendors struggle to get to three-to-one replacement ratios.

And these ridiculously expensive "enterprise class" SSDs from any one of a number of array manufacturers (starting with EMC and all the startups)? I expect we'll NEVER see TPC numbers on that class of devices, because nobody wants to pay 20x more dollars-per-transaction-per-minute, and none of these vendors wants those numbers published.


Well, it's a start.

Dell and Fusion-IO have published the first audited database benchmark using a Flash-memory device for storage. The submission is still under review -- no telling yet whether the results will pass the audit, but as I said...it's a start.

The benchmark is the TPC-H, which is a "decision support" benchmark. This is important because TPC-H entails an almost exclusively "read-only" workload pattern -- perfect context for SSD as it completely avoids the write performance problem.

In the scenarios painted for us by the Flash SSD hypesters, this application represents a soft-pitch, a gentle lob right over the plate and an easy "home run" for a Flash-SSD database application.

The results? The very first audited benchmark for a flash-SSD-based database application, and a READ-ONLY one at that, resulted in (drum roll here)...a fifth-place finish among comparable (100GB) systems tested.

This Flash SSD system produced about 30% as many Queries-per-hour as a disk-based system result published by Sun...in 2007.

Surely though it will be in price/performance that this Flash-based system will win, right?

At $1.46 per Query-per-hour, the Dell/Fusion-IO system finishes in third place. More than twice the cost-per-query-per-hour of the best disk-based system.

And this is for a read-only application.

Can't wait to see TPC-C, TPC-E, or SPC-1.


I think with a hard disk, performance is very predictable and constant over time.


