Pause for Testing
To recap, the major components of the ZFS storage server were:
- PCI Case's IPC-C3E-BAR65-XP-SAS 3U chassis w/16 hot-swap SAS/SATA2 bays
- 2× AMD Opteron 275 processors (dual-core, 2.2GHz)
- Tyan S3892 (K8HM) motherboard
- 8GB ECC memory
- 2× 80GB boot drives (connected to the motherboard)
- 2× Supermicro AOC-SAT2-MV8 8-port SATA RAID controllers
- 16× 500GB Seagate enterprise-class SATA hard drives
Tyan's S3892 suffers from some ambiguous documentation. The text of the PDF specification sheet states that there are two 133MHz PCI-X slots and one 100MHz slot, whereas the block diagram shows them all running at 133MHz. The manual says nothing about it, and an email to Tyan support went unanswered. Using the block diagram as my guide, I split the two Supermicro controllers between the two PCI-X buses, and decided to start testing the hardware configuration to be sure.
I wanted to test the hardware against the performance model I previously blogged about:
- For mirrored configurations:
  - Small, random reads scale linearly with the number of disks; writes scale linearly with the number of mirror sets.
  - Sequential read throughput scales linearly with the number of disks; write throughput scales linearly with the number of mirror sets.
- For parity (RAID-Z, RAID-Z2) configurations:
  - Small, random I/O reads and writes scale linearly with the number of RAID sets.
  - Sequential read and write throughput scales linearly with the number of data (non-parity) disks.
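As a rough sketch of that model (an illustration written for this post, not code from the earlier entry), the functions below return relative scaling factors in units of a single disk's performance. The function names and dictionary keys are made up for the example, and the mirror function assumes two-way mirrors unless told otherwise.

```python
def mirror_scaling(n_disks, way=2):
    """Relative scaling for a pool built from `way`-way mirror sets.

    Reads can be served by any disk; writes must hit every disk in a
    mirror set, so they scale with the number of sets.
    """
    n_sets = n_disks // way
    return {
        "random_read": n_disks,
        "random_write": n_sets,
        "seq_read": n_disks,
        "seq_write": n_sets,
    }


def raidz_scaling(n_sets, data_disks_per_set):
    """Relative scaling for a pool of RAID-Z sets.

    Random I/O scales with the number of sets; sequential throughput
    scales with the total number of data (non-parity) disks.
    """
    data_disks = n_sets * data_disks_per_set
    return {
        "random_read": n_sets,
        "random_write": n_sets,
        "seq_read": data_disks,
        "seq_write": data_disks,
    }


if __name__ == "__main__":
    print("7x2 mirror:   ", mirror_scaling(14))
    print("3x(4+1) RAID-Z:", raidz_scaling(3, 4))
    print("5x(2+1) RAID-Z:", raidz_scaling(5, 2))
```

These are only the model's predictions; the measured results below are what put them to the test.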
Bonnie-64 was designed to turn up performance bottlenecks, which is precisely what I was looking for. Could I tell whether one controller was on a bus running at 75% of the speed of the other? I could, in fact, but the overall combined performance was very decent. The tests did show limitations in my hardware, however.
I compared configurations from two to fifteen disks (I always want to have a hot spare running), with 2+1 RAID-Z vdevs, 4+1 RAID-Z vdevs, and (single) mirror vdevs. So each graph reflects fifteen runs of Bonnie-64:
- With the 2+1 RAID-Z: 3, 6, 9, 12, or 15 disks in the zpool,
- With the 4+1 RAID-Z: 5, 10, or 15 disks in the zpool, and
- With the mirror configuration: 2, 4, 6, 8, 10, 12, or 14 disks in the zpool.
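The pool sizes listed above follow directly from the vdev widths and the 15-disk limit. A minimal sketch that enumerates them and confirms the total of fifteen runs (`pool_sizes` is a hypothetical helper, not part of any benchmark tooling):

```python
def pool_sizes(vdev_width, max_disks=15):
    """Pool sizes reachable by adding whole vdevs of `vdev_width` disks."""
    return list(range(vdev_width, max_disks + 1, vdev_width))


# vdev widths: 2+1 RAID-Z uses 3 disks, 4+1 RAID-Z uses 5, a mirror uses 2
layouts = {"2+1 RAID-Z": 3, "4+1 RAID-Z": 5, "mirror": 2}

runs = {name: pool_sizes(width) for name, width in layouts.items()}
total_runs = sum(len(sizes) for sizes in runs.values())

if __name__ == "__main__":
    for name, sizes in runs.items():
        print(f"{name}: {sizes}")
    print("total runs:", total_runs)
```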
Block writes were always going to be the metric I was most sensitive to, because of the above-described workflow. You can see that there is a strong levelling-off of block write performance just below 390 MBytes/sec. The mirrored configurations increase their write speed at half the rate of the RAID-Z configurations, as we would expect from the slower writes indicated in the model. The "ideal" line is fairly arbitrary, as it's an extrapolation of performance from fairly few data points. It is, however, indicative of what the performance model might predict.
Sustained read performance is much less limited than with writes. The "ideal" line also has a 33% steeper slope than with the writes: it appears we consistently achieve four block reads in the same time as it takes to do three block writes. Strangely, the 4+1 RAID-Z groups underperform by a fair bit (I can't comment as to the statistical significance, at the moment, but it seems fairly consistent). The 7×2 mirror configuration tops out at 735 MB/s on reads, which seems fairly decent.
I'll admit that the random seek performance figures baffle me a bit. Everything I've read so far suggested that random seek performance would scale linearly with the number of vdevs (or disks in the mirror). Instead, the numbers line up fairly well with a logarithmic graph. Am I running into lots of vibration? Am I hitting an unexpected bottleneck that's unrelated to data transfer over the bus?
This is a beast of a post already. I'll push this out to the world, and start writing up the next installment, wherein I note that one of the SATA controllers is, in fact, on a slower PCI-X bus, and what I do to fix it.
edit: All of this was on Solaris Express Community Edition, Nevada 70, with the ZFS boot patch applied.
Comments
The I/O throughput difference between reads and writes (on raidz) is explained by the fact that when reading from raidz, zfs doesn't read the parity blocks. So reads are faster.
With a 2+1 raidz, the read throughput is theoretically 50% higher than the write throughput (3/2). This is close to what you observe with 4 data disks: ~280 MB/s vs. ~200 MB/s.
With a 4+1 raidz, the read throughput is theoretically 25% higher (5/4). Again this is close to your 4-data-disk observation: 250 MB/s vs. 200 MB/s.
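The ratios in this comment can be reproduced with a short calculation. This is a sketch of the commenter's arithmetic only, assuming writes touch data plus parity disks while reads touch only data disks; the throughput figures are the approximate ones quoted above, and the function name is made up for the example.

```python
def raidz_read_write_ratio(data_disks, parity_disks=1):
    """Theoretical read/write throughput ratio when reads skip parity."""
    return (data_disks + parity_disks) / data_disks


# (layout, approximate read MB/s, approximate write MB/s) as quoted above
observations = [
    ("2+1", 2, 280, 200),
    ("4+1", 4, 250, 200),
]

if __name__ == "__main__":
    for name, data_disks, read_mb, write_mb in observations:
        theory = raidz_read_write_ratio(data_disks)
        observed = read_mb / write_mb
        print(f"{name} raidz: theoretical {theory:.2f}, observed {observed:.2f}")
```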
Your server config has been very well chosen by the way. You got an incredible bang-for-the-buck with these CPUs, this SATA card and the disks.
@mrb: Thanks! Bang for buck was really a primary concern of mine when speccing things out. I saw ZFS as being a way of whipping a collection of big, dumb disks into respectable shape.
I need to look at my numbers again to see if I can square what you say with my test results. That 4/3 ratio comes out very strongly in further numbers (not published yet) when I run into strong PCI bus contention (to come in my next zfs-related entry).
Don't forget that the more disks you have, the more you will see your experimental results diverge from these ratios, because other factors come into play: CPU usage, I/O scheduling, etc.
The max theoretical bandwidth of a 64-bit 100-MHz PCI-X bus is 800 MByte/s (100*64/8). In practice, expect 75-80% of this: ~600 MByte/s. Even if your 2 cards were on 2 100-MHz buses, this would be *way* sufficient to handle the 735 MByte/s read throughput you notice in the 14-disk mirror case (367.5 MByte/s per bus).
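The bus arithmetic above, written out as a small sketch. The 75-80% efficiency range is the rule of thumb from this comment, not a measured figure, and the function name is made up for the example.

```python
def pcix_peak_mbytes_per_sec(clock_mhz, width_bits=64):
    """Theoretical peak of a PCI-X bus: one transfer of `width_bits` per clock."""
    return clock_mhz * width_bits / 8


peak_100 = pcix_peak_mbytes_per_sec(100)   # 64-bit bus at 100 MHz
practical_low = 0.75 * peak_100            # pessimistic end of the rule of thumb
practical_high = 0.80 * peak_100           # optimistic end of the rule of thumb

observed_total = 735                       # MByte/s, 14-disk mirror reads
per_bus = observed_total / 2               # two controllers, one per bus

fits_on_slow_bus = per_bus < practical_low

if __name__ == "__main__":
    print(f"peak: {peak_100:.0f} MByte/s")
    print(f"practical: {practical_low:.0f}-{practical_high:.0f} MByte/s")
    print(f"per-bus load: {per_bus} MByte/s, fits: {fits_on_slow_bus}")
```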
Try reporting the output of "vmstat 1" in your next blog entry, to help identify the bottlenecks (if any).