Pause for Testing

To recap, the major components of the ZFS storage server were:

  • PCI Case's IPC-C3E-BAR65-XP-SAS 3U chassis w/16 hot-swap SAS/SATA2 bays
  • 2× AMD Opteron 275 processors (dual-core, 2.2GHz)
  • Tyan S3892 (K8HM) motherboard
  • 8GB ECC memory
  • 2× 80GB boot drives (connected to the motherboard)
  • 2× Supermicro AOC-SAT2-MV8 8-port SATA RAID controllers
  • 16× 500 GB Seagate enterprise-class SATA hard drives

This configuration is intended to act as a streaming multimedia recorder/server. The most demanding disk I/O workload is expected to be writing a stream of data arriving over two GigE links. Imagine what it might take to store uncompressed high-definition video. There are likely to be other demanding tasks, but this was the most extreme.

Tyan's S3892 suffers from some ambiguous documentation. The text of the PDF specification sheet says there are two 133MHz PCI-X slots and one 100MHz slot, whereas the block diagram shows them all running at 133MHz. The manual says nothing on the subject, and Tyan support never answered my email. Using the block diagram as my guide, I split the two Supermicro controllers between the two PCI-X busses, and decided to test the hardware configuration to be sure.
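
To put numbers on the 133MHz versus 100MHz question, here is a back-of-the-envelope sketch of theoretical PCI-X peak bandwidth. It assumes a 64-bit bus and ignores all protocol overhead, so real-world throughput will be noticeably lower; the function name is mine, purely for illustration.

```python
def pcix_peak_mb_per_s(clock_mhz, width_bits=64):
    """Theoretical peak of a parallel PCI-X bus in MB/s (decimal megabytes).

    Assumes one transfer of `width_bits` per clock and no protocol overhead.
    """
    return clock_mhz * width_bits / 8


fast = pcix_peak_mb_per_s(133)  # slot clocked at 133MHz
slow = pcix_peak_mb_per_s(100)  # slot clocked at 100MHz

print(f"133MHz bus: {fast:.0f} MB/s")
print(f"100MHz bus: {slow:.0f} MB/s")
print(f"slow/fast ratio: {slow / fast:.2f}")
```

If one controller really does sit on a 100MHz bus, its eight disks share roughly three quarters of the peak bandwidth available to the other eight.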

Basically, I wanted to test the performance model I previously blogged about against real measurements:

    For mirrored configurations:
  • Small, random reads scale linearly with the number of disks; writes scale linearly with the number of mirror sets.
  • Sequential read throughput scales linearly with the number of disks; write throughput scales linearly with the number of mirror sets.
    For parity (RAID-Z, RAID-Z2) configurations:
  • Small, random I/O reads and writes scale linearly with the number of RAID sets.
  • Sequential read and write throughput scales linearly with the number of data (non-parity) disks.
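
The model above can be written down as a small function. This is only a sketch of the scaling rules as stated, expressed in multiples of a single disk's performance; the function and key names are my own, and nothing here is measured.

```python
def predicted_scaling(layout, vdevs, disks_per_vdev, parity=0):
    """Return the model's predicted scaling factors, relative to one disk.

    layout: "mirror" or "raidz"
    vdevs: number of mirror sets or RAID-Z sets in the pool
    disks_per_vdev: disks in each set (including parity disks for RAID-Z)
    parity: parity disks per RAID-Z set (1 for RAID-Z, 2 for RAID-Z2)
    """
    if layout == "mirror":
        total_disks = vdevs * disks_per_vdev
        return {
            "random_read": total_disks,   # reads can be served by any disk
            "random_write": vdevs,        # each write goes to a whole mirror set
            "seq_read": total_disks,
            "seq_write": vdevs,
        }
    if layout == "raidz":
        data_disks = vdevs * (disks_per_vdev - parity)
        return {
            "random_read": vdevs,         # small I/O scales with RAID sets
            "random_write": vdevs,
            "seq_read": data_disks,       # streaming scales with data disks
            "seq_write": data_disks,
        }
    raise ValueError(f"unknown layout: {layout!r}")


# Seven two-way mirrors versus three 4+1 RAID-Z sets
print(predicted_scaling("mirror", vdevs=7, disks_per_vdev=2))
print(predicted_scaling("raidz", vdevs=3, disks_per_vdev=5, parity=1))
```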

Bonnie-64 was designed to turn up performance bottlenecks, which is precisely what I was looking for. Could I tell that one controller was on a bus running at 75% of the speed of the other? I could, in fact, though the overall combined performance was very decent. The tests did show limitations in my hardware, however.

I compared configurations from two to fifteen disks (I always want to have a hot spare running), with 2+1 RAIDZ vdevs, 4+1 RAIDZ vdevs, and (single) mirror vdevs. So each graph reflects fifteen runs of Bonnie-64:

  • With the 2+1 RAID-Z: 3, 6, 9, 12, or 15 disks in the zpool,
  • With the 4+1 RAID-Z: 5, 10, or 15 disks in a zpool, and
  • With the mirror configuration: 2, 4, 6, 8, 10, 12, or 14 disks in the zpool.

All of the graphs are plotted against the number of data disks (i.e., the total number of mirror disks, but only the number of non-parity disks for the RAID-Z configurations) or, in the case of random seeks, the number of RAID/mirror sets. All of the tests were performed with 32GB test files, four times the machine's 8GB of memory; from the results, it's pretty clear that caching isn't inflating the numbers.
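
As a cross-check on the run count, the test matrix can be enumerated in a few lines. This is an illustrative sketch only: the layout labels and the function name are mine, and "data disks" follows the convention used for the graphs (all disks for mirrors, non-parity disks for RAID-Z).

```python
def bonnie_run_matrix(max_disks=15):
    """List (layout, vdevs, total_disks, data_disks) for every pool tested."""
    # layout name -> (disks per vdev, data disks counted per vdev)
    layouts = {
        "raidz 2+1": (3, 2),
        "raidz 4+1": (5, 4),
        "mirror": (2, 2),
    }
    runs = []
    for name, (width, data_per_vdev) in layouts.items():
        for vdevs in range(1, max_disks // width + 1):
            runs.append((name, vdevs, vdevs * width, vdevs * data_per_vdev))
    return runs


runs = bonnie_run_matrix()
for layout, vdevs, total, data in runs:
    print(f"{layout:10s} vdevs={vdevs} disks={total:2d} data_disks={data:2d}")
print(len(runs), "runs per graph")
```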

Block Writes, MB/sec

Block writes were always going to be the metric I was most sensitive to, because of the above-described workflow. You can see that there is a strong levelling-off of block write performance just below 390 MBytes/sec. The mirrored configurations increase their write speed at half the rate of the RAID-Z configurations, as we would expect from the slower writes indicated in the model. The "ideal" line is fairly arbitrary, as it's an extrapolation of performance from fairly few data points. It is, however, indicative of what the performance model might predict.
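
For context, here is a rough calculation of how that plateau compares with the target workload. It assumes both GigE links run at full line rate (1 Gbit/s each) with no Ethernet or IP overhead, so it overstates the real ingest rate; the 390 figure is the approximate plateau read off the graph, and a single Bonnie-64 stream is not the same thing as two concurrent network streams.

```python
GIGE_BITS_PER_S = 1_000_000_000  # nominal line rate of one GigE link


def ingest_ceiling_mb_per_s(links=2, bits_per_s=GIGE_BITS_PER_S):
    """Upper bound on incoming data in MB/s (decimal), ignoring overhead."""
    return links * bits_per_s / 8 / 1_000_000


required = ingest_ceiling_mb_per_s()  # worst case the disks must absorb
plateau = 390                         # approximate block-write plateau, MB/s
headroom = plateau / required

print(f"ingest ceiling: {required:.0f} MB/s")
print(f"write plateau:  {plateau} MB/s ({headroom:.2f}x the ceiling)")
```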

Block Reads, MB/sec

Sustained read performance is much less limited than write performance. The "ideal" line also has a 33% steeper slope than it does for writes: it appears we consistently achieve four block reads in the time it takes to do three block writes. Strangely, the 4+1 RAID-Z groups underperform by a fair bit (I can't comment on the statistical significance at the moment, but it seems fairly consistent). The 7×2 mirror configuration tops out at 735 MB/s on reads, which seems fairly decent.

Random Seeks /sec

I'll admit that the random seek performance figures baffle me a bit. Everything I've read so far suggested that random seek performance would scale linearly with the number of vdevs (or disks in the mirror). Instead, the numbers line up fairly well with a logarithmic graph. Am I running into lots of vibration? Am I hitting an unexpected bottleneck that's unrelated to data transfer over the bus?
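
One way to make "lines up fairly well with a logarithmic graph" less of an eyeball judgement is to fit both a straight line and a logarithmic curve and compare the residuals. The sketch below does that with plain least squares; the helper names are mine, and the numbers in the usage example are synthetic, generated from a log curve purely to show the mechanics. They are not my measurements.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, rss)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, rss


def compare_models(vdevs, seeks):
    """Residual sum of squares for a linear fit and a logarithmic fit."""
    _, _, rss_linear = fit_line(vdevs, seeks)
    _, _, rss_log = fit_line([math.log(v) for v in vdevs], seeks)
    return {"linear_rss": rss_linear, "log_rss": rss_log}


# Synthetic example only: points placed exactly on a log curve
vdevs = [1, 2, 3, 4, 5, 6, 7]
seeks = [100 + 50 * math.log(v) for v in vdevs]
result = compare_models(vdevs, seeks)
print(result)
```

Feeding in the real seeks/sec figures per number of vdevs would show which model leaves the smaller residual, though with so few points neither fit proves much.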

This is a beast of a post already. I'll push this out to the world, and start writing up the next installment, wherein I note that one of the SATA controllers is, in fact, on a slower PCI-X bus, and what I do to fix it.

edit: All of this was on Solaris Express Community Edition, Nevada 70, with the ZFS boot patch applied.

Related Entries:
Further benchmarks, and a step back for consideration
Install 2 of N. Continue?
Install 1 of N. Begin?
More storage desires
ZFS performance models for a streaming server

Comments

mrb @ 17.09.2007 19:15 London/GMT
The I/O throughput difference between reads and writes (on raidz) is explained by the fact that when reading from raidz, zfs doesn't read the parity blocks. So reads are faster.

With a 2+1 raidz, the read throughput is theoretically 50% higher than the write throughput (3/2). This is close to what you observe with 4 data disks: ~280 MB/s vs. ~200 MB/s.

With a 4+1 raidz, the read throughput is theoretically 25% higher (5/4). Again this is close to your 4-data-disk observation: 250 MB/s vs. 200 MB/s.
mrb @ 17.09.2007 19:30 London/GMT
Your server config has been very well chosen by the way. You got an incredible bang-for-the-buck with these CPUs, this SATA card and the disks.
adam @ 17.09.2007 23:40 London/GMT
@mrb: Thanks! Bang for buck was really a primary concern of mine when speccing things out. I saw ZFS as being a way of whipping a collection of big, dumb disks into respectable shape.

I need to look at my numbers again to see if I can square what you say with my test results. That 4/3 ratio comes out very strongly in further numbers (not published yet) when I run into strong PCI bus contention (to come in my next zfs-related entry).
mrb @ 18.09.2007 04:33 London/GMT
Don't forget that the more disks you have, the more you will see your experimental results diverge from these ratios. Because other factors come into play: CPU usage, I/O scheduling, etc...

The max theoretical bandwidth of a 64-bit 100-MHz PCI-X bus is 800 MByte/s (100*64/8). In practice, expect 75-80% of this: ~600 MByte/s. Even if your 2 cards were on 2 100-MHz buses, this would be *way* sufficient to handle the 735 MByte/s read throughput you notice in the 14-disk mirror case (367.5 MByte/s per bus).

Try reporting the output of "vmstat 1" in your next blog entry, to try to guess what are the bottlenecks (if any).