
Recent musings

Further benchmarks, and a step back for consideration

So, when last I blogged about the ZFS RAID server, I may have ended on a down note, suggesting disappointment. I hope readers will understand that's not the case.

When I started this project, I sat down and examined what was most important to me in the server.

    My requirements:
  • Saturate an aggregated 2× GigE link for sustained reads and writes
  • Do it cheaply
    My strong desires:
  • ZFS for its reliability, redundancy, flexibility, and ease of use
  • Maximise the amount of usable space
ZFS wasn't a requirement. It couldn't be: it's a solution, and defining requirements in terms of pre-ordained solutions is, at best, compromised. Maximising IOPS wasn't a first priority; sustained write performance was. Still, I want decent random seek performance, because there will always be cases where performance falls to that level.

I ran some additional Bonnie-64 tests to see the difference between the SATA controllers and their PCI-X buses. There was a small (a couple percent) but consistent difference between the two controllers. I believe one ran at 133MHz, and the other ran at 100MHz (but, to be honest, I don't know what tools I would use to verify such a thing). So I moved a disk controller from the 100MHz bus to a PCI-X slot on the shared 133MHz bus, and ran the same tests as before.

The results are as follows:

Block Writes, MB/sec

I perceive a strong levelling-off of streaming write performance, even lower than with the previous test. The peak for three 4+1 RAID-Z groups is 387.5 MB/s, while the peak for five 2+1 groups is 354.5 MB/sec. The mirrored scenario's limits are even lower, at 258.5 MB/sec.

Block Reads, MB/sec

The continuous read performance is even more interesting. Now, it's clear that the two controllers are maxing out a single, contended PCI-X bus where they hadn't before. The read limit is at 520 MB/sec. That, to me, sounds very much like one half of the throughput of a 64-bit, 133MHz bus (1064 MB/s). (It's within 2.5% of half that figure.) One conclusion could be that ZFS performs two reads for every block requested from disk, whether it be RAID-Z or mirror.

Taking a step back, should we find significance in the fact that one-third of our PCI-X bus throughput is 354.67 MB/s, while the most we could squeeze out of the 2+1 RAID-Z configuration was 354.5? It would certainly square with what commenter "mrb" stated: for 2+1 RAID-Z sets, expect 50% higher throughput on reads than writes.
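
The bus arithmetic above is easy to check. Here is a minimal sketch, assuming the same round 133MHz figure used in the text (not 133.33MHz) and ignoring protocol overhead on the bus; the variable and function names are mine, purely for illustration:

```python
# Theoretical peak of a 64-bit PCI-X bus at a nominal 133 MHz.
# Assumption: round 133 MHz as in the text, no bus protocol overhead.
BUS_WIDTH_BYTES = 64 // 8
BUS_CLOCK_MHZ = 133

bus_peak = BUS_WIDTH_BYTES * BUS_CLOCK_MHZ  # MB/s


def shortfall(measured, ceiling):
    """Fraction by which `measured` falls short of `ceiling`."""
    return (ceiling - measured) / ceiling


half_bus = bus_peak / 2
third_bus = bus_peak / 3

print(f"bus peak:  {bus_peak} MB/s")
print(f"half bus:  {half_bus:.2f} MB/s "
      f"(520 MB/s read limit is {shortfall(520, half_bus):.1%} short)")
print(f"third bus: {third_bus:.2f} MB/s "
      f"(354.5 MB/s 2+1 write peak is {shortfall(354.5, third_bus):.2%} short)")
```

Both measured plateaus land just under a simple fraction of the bus peak, which is what makes the coincidence hard to ignore.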

Random Seeks /sec

The random seek performance doesn't yet tell me much, other than that the theory of IOPS scaling linearly with the number of vdevs or mirror disks simply does not hold on my system. Frankly, I'm stumped as to why it only increases logarithmically. Well, at least it increases monotonically.

Let's pit theory against practice. I originally posted a crude, back-of-the-envelope model of read/write/random ZFS performance for 14 or 15 disks in 2+1, 4+1 or 2× mirror configurations. What happens to our experimental results when I factor out the (estimated) base performance of a single drive/vdev set?

config            Random Reads   Sequential Read   Sequential Write   Capacity
RAIDZ: 3×(4+1)    3y             12z               12z                6.0TB
RAIDZ: 5×(2+1)    5y             10z               10z                5.0TB
mirror: 7×2       14y            14z               7z                 3.5TB

config            Random Reads   Sequential Read   Sequential Write   Capacity
                  (×90/sec)      (×72MB/sec)       (×72MB/sec)
RAIDZ: 3×(4+1)    2.3y           7.2z              5.3z               6.0TB
RAIDZ: 5×(2+1)    3.3y           8.8z              5.2z               5.0TB
mirror: 7×2       3.8y           10.2z             4.8z               3.5TB
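
One way to read the two sets of figures side by side is to divide each measured multiple (second set) by the multiple the model predicted (first set). The sketch below does only that; the dictionary layout and the `efficiency` helper are my own illustrative names, and the numbers are copied straight from the figures above:

```python
# Each tuple is (random reads in y, sequential read in z, sequential write in z).
predicted = {
    "RAIDZ 3x(4+1)": (3, 12, 12),
    "RAIDZ 5x(2+1)": (5, 10, 10),
    "mirror 7x2": (14, 14, 7),
}
measured = {
    "RAIDZ 3x(4+1)": (2.3, 7.2, 5.3),
    "RAIDZ 5x(2+1)": (3.3, 8.8, 5.2),
    "mirror 7x2": (3.8, 10.2, 4.8),
}


def efficiency(config):
    """Measured multiple divided by predicted multiple, per column."""
    return tuple(m / p for m, p in zip(measured[config], predicted[config]))


for config in predicted:
    seeks, read, write = efficiency(config)
    print(f"{config:14s} random {seeks:4.0%}  read {read:4.0%}  write {write:4.0%}")
```

Seen this way, the mirrored pool's random reads are the furthest from the model, at a bit over a quarter of the predicted figure.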

It's clear that ZFS is demanding enough that it can hit the limits of the PCI-X bus on a poorly thought-out system. I can sketch out those limits on my own system, in some cases. It's also true that I could have chosen another motherboard with 2 independent 133MHz PCI-X buses, or gone with a PCIe solution that would have eliminated any concerns about bus bandwidth. In theory, with this many disks, I could be seeing twice the performance in some situations. However, I should look at the numbers: 390MB/s far exceeds my ability to get data into or out of the machine via the network.
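
To put that last point in numbers, here is a quick sketch of the network ceiling. It assumes the raw line rate of 1000 Mbit/s per GigE link and ignores Ethernet and TCP/IP overhead, so the real ceiling is somewhat lower; the names are illustrative:

```python
# Raw ceiling of two aggregated GigE links, ignoring all protocol overhead.
GIGE_MBIT_PER_SEC = 1000
LINKS = 2

network_ceiling = LINKS * GIGE_MBIT_PER_SEC / 8  # MB/s

# Approximate write plateau quoted in the text above.
disk_write_plateau = 390  # MB/s

headroom = disk_write_plateau / network_ceiling
print(f"network ceiling: {network_ceiling:.0f} MB/s")
print(f"disk write plateau is {headroom:.2f}x the network ceiling")
```

Even the bus-limited write figure leaves a comfortable margin over what two GigE links can deliver.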

The machine does what it is supposed to, and surprisingly affordably, too. Any "disappointment" I have is purely theoretical.


As a postscript, I should make a call out to anyone who would like further data with the 133+100MHz controller configuration. The server is leaving the workshop and going into the rack now, but the system will be under test for a few weeks more. Contact me via the comments, the contact form on this site, or the zfs-discuss list if you have a particular scenario you'd like me to run.


Pause for Testing

To recap, the major components of the ZFS storage server were:

  • PCI Case's IPC-C3E-BAR65-XP-SAS 3U chassis w/16 hot-swap SAS/SATA2 bays
  • 2× AMD Opteron 275 processors (dual-core, 2.2GHz)
  • Tyan S3892 (K8HM) motherboard.
  • 8GB ECC memory
  • 2× 80GB boot drives (connected to the motherboard)
  • 2× Supermicro AOC-SAT2-MV8 SATA 8-port RAID controllers
  • 16× 500GB Seagate enterprise-class SATA hard drives
This configuration is intended to act as a streaming multimedia recorder/server, with the most demanding disk I/O workflow intended to be writing a stream of data coming in from two GigE links. Imagine what it might take to store uncompressed high-definition video. There are likely to be other demanding tasks, but this was the most extreme.

Tyan's S3892 suffers from some ambiguous documentation. The PDF specification sheet states in the text that there are two 133MHz PCI-X slots and one 100MHz slot, whereas the block diagram says they all run at 133MHz. The manual says nothing about it, and an email to Tyan support went unanswered. Using the block diagram as my guide, I split the two Supermicro controllers between the two PCI-X buses, and decided to start testing the hardware configuration to be sure.

Basically, I wanted to test and compare the model I previously blogged about.

    For mirrored configurations:
  • Small, random reads scale linearly with the number of disks; writes scale linearly with the number of mirror sets.
  • Sequential read throughput scales linearly with the number of disks; write throughput scales linearly with the number of mirror sets.
    For parity (RAID-Z, RAID-Z2) configurations:
  • Small, random I/O reads and writes scale linearly with the number of RAID sets.
  • Sequential read and write throughput scales linearly with the number of data (non-parity) disks.
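
The model above can be written down as a small function. This is only a sketch of the scaling rules as stated, returning multiples of a single drive's performance; the `Pool` type and its field names are hypothetical, chosen for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    kind: str                 # "mirror" or "raidz"
    sets: int                 # number of mirror sets or RAID-Z sets
    disks_per_set: int        # all disks in one set, parity included
    parity_per_set: int = 0   # 1 for RAID-Z, 2 for RAID-Z2, 0 for mirrors


def predict(pool):
    """Return (random read, random write, sequential read, sequential write)
    as multiples of one drive's performance, per the model above."""
    if pool.kind == "mirror":
        disks = pool.sets * pool.disks_per_set
        # Reads scale with disks; writes scale with mirror sets.
        return (disks, pool.sets, disks, pool.sets)
    if pool.kind == "raidz":
        data_disks = pool.sets * (pool.disks_per_set - pool.parity_per_set)
        # Random I/O scales with sets; streaming scales with data disks.
        return (pool.sets, pool.sets, data_disks, data_disks)
    raise ValueError(f"unknown pool kind: {pool.kind!r}")


for name, pool in [
    ("3 x (4+1) RAID-Z", Pool("raidz", 3, 5, 1)),
    ("5 x (2+1) RAID-Z", Pool("raidz", 5, 3, 1)),
    ("7 x 2 mirror", Pool("mirror", 7, 2)),
]:
    print(name, predict(pool))
```

Multiplying these by a measured single-drive seek rate and streaming rate gives the "ideal" figures the benchmarks are compared against.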

Bonnie-64 was designed to turn up performance bottlenecks. That is precisely what I was looking for. Can I tell that one controller is on a bus that runs 75% the speed of the other? I could, in fact, but the overall combined performance was very decent. The tests did show limitations in my hardware, however.

I compared configurations from two to fifteen disks (I always want to have a hot spare running), with 2+1 RAIDZ vdevs, 4+1 RAIDZ vdevs, and (single) mirror vdevs. So each graph reflects fifteen runs of Bonnie-64:

  • With the 2+1 RAID-Z: 3, 6, 9, 12, or 15 disks in the zpool,
  • With the 4+1 RAID-Z: 5, 10, or 15 disks in a zpool, and
  • With the mirror configuration: 2, 4, 6, 8, 10, 12, or 14 disks in the zpool.
All of the graphs are plotted against the number of data disks (i.e., the total number of mirror disks, but only the number of non-parity disks for the RAID-Z configurations), or in the case of random seeks, the number of RAID/mirror sets. All of the tests were performed with 32GB test files, four times the machine's 8GB of memory: from the results, it's pretty clear that caching isn't inflating the numbers.
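
The test matrix described above can be enumerated mechanically, which also confirms the count of fifteen runs. A minimal sketch, with illustrative names, assuming a cap of fifteen disks (sixteen drives less the hot spare):

```python
from collections import namedtuple

Run = namedtuple("Run", "layout vdevs total_disks data_disks")

MAX_DISKS = 15  # sixteen drives, one kept back as a hot spare

# layout name -> (disks per vdev, disks counted as "data" per vdev)
# For mirrors, both disks count towards the graphs' data-disk axis.
LAYOUTS = {
    "2+1 RAID-Z": (3, 2),
    "4+1 RAID-Z": (5, 4),
    "mirror": (2, 2),
}


def test_matrix(max_disks=MAX_DISKS):
    """List every pool size tested for each layout, smallest first."""
    runs = []
    for layout, (per_vdev, data_per_vdev) in LAYOUTS.items():
        for vdevs in range(1, max_disks // per_vdev + 1):
            runs.append(Run(layout, vdevs, vdevs * per_vdev, vdevs * data_per_vdev))
    return runs


runs = test_matrix()
for run in runs:
    print(f"{run.layout:11s} vdevs={run.vdevs}  disks={run.total_disks:2d}  data={run.data_disks:2d}")
print(len(runs), "runs in total")
```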

Block Writes, MB/sec

Block writes were always going to be the metric I was most sensitive to, because of the above-described workflow. You can see that there is a strong levelling-off of block write performance just below 390 MBytes/sec. The mirrored configurations increase their write speed at half the rate of the RAID-Z configurations, as we would expect from the slower writes indicated in the model. The "ideal" line is fairly arbitrary, as it's an extrapolation of performance from fairly few data points. It is, however, indicative of what the performance model might predict.

Block Reads, MB/sec

Sustained read performance is much less limited than with writes. The "ideal" line also has a 33% steeper slope than with the writes: it appears we consistently achieve four block reads in the same time as it takes to do three block writes. Strangely, the 4+1 RAID-Z groups underperform by a fair bit (I can't comment as to the statistical significance, at the moment, but it seems fairly consistent). The 7×2 mirror configuration tops out at 735 MB/s on reads, which seems fairly decent.

Random Seeks /sec

I'll admit that the random seek performance figures baffle me a bit. Everything I've read so far suggested that random seek performance would scale linearly with the number of vdevs (or disks in the mirror). Instead, the numbers line up fairly well with a logarithmic graph. Am I running into lots of vibration? Am I hitting an unexpected bottleneck that's unrelated to data transfer over the bus?

This is a beast of a post already. I'll push this out to the world, and start writing up the next installment, wherein I note that one of the SATA controllers is, in fact, on a slower PCI-X bus, and what I do to fix it.

edit: All of this was on Solaris Express Community Edition, Nevada 70, with the ZFS boot patch applied.


Install 2 of N. Continue?

This is a followup to the first in the series.

After an encouraging comment to my last OpenSolaris blog entry, I decided to do what was necessary to make a patched ZFS boot installer. I used the Netinstall script/procedure from Lori Alt. The name is a bit misleading, as I was able to run the install from a DVD, as I'm usually comfortable doing.

The actual install was much easier than expected, and – because of the pfinstall procedure – much more efficient than the usual rigamarole I go through. Creating the modified boot DVD was the hard part, and really, that was mostly in the logistics of moving DVD images back and forth. I would:

  • download the DVD segments to my desktop,
  • assemble the pieces (really, I have to complain about Sun's download policy: it gets in the way when trying to do this kind of work),
  • upload to the Solaris box (and try again with newly split files because of a 2GB limit),
  • mount the ISO image,
  • copy the image to a working directory,
  • apply the patch, and include a draft install profile on disk,
  • create a bootable ISO image, and
  • move it (in pieces) from the server room to the MacBook Pro so I can actually burn it.
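
The "in pieces" steps above boil down to cutting a large image into parts that stay under a size limit and gluing them back together at the other end. This is not what I actually ran, just a minimal Python sketch of the idea; the function names and the `.partNNN` naming scheme are made up for illustration:

```python
import shutil
from pathlib import Path


def split_file(source: Path, max_part_bytes: int, buf: int = 1 << 20) -> list[Path]:
    """Split `source` into numbered parts of at most `max_part_bytes` each."""
    parts = []
    with source.open("rb") as src:
        index = 0
        while True:
            chunk = src.read(min(buf, max_part_bytes))
            if not chunk:
                break  # nothing left to copy
            part = source.with_name(f"{source.name}.part{index:03d}")
            written = 0
            with part.open("wb") as dst:
                while chunk:
                    dst.write(chunk)
                    written += len(chunk)
                    remaining = max_part_bytes - written
                    if remaining == 0:
                        break  # this part is full
                    chunk = src.read(min(buf, remaining))
            parts.append(part)
            index += 1
    return parts


def join_files(parts: list[Path], target: Path) -> None:
    """Concatenate `parts`, in the order given, into `target`."""
    with target.open("wb") as dst:
        for part in parts:
            with part.open("rb") as src:
                shutil.copyfileobj(src, dst)
```

For a 2GB transfer limit, something like `split_file(Path("image.iso"), 2 * 1024**3 - 1)` keeps every part just under the limit (the file name here is a placeholder).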

Google led me to Tom Haynes' blog, and the linked entry, along with its followups in sequence, gives enough magic sauce to get a bootable/installable ISO image. With the disk burned, installation was a snap. And a ZFS boot disk… simply works. (I do need to dig deeper into arranging a ZVOL or some such as a dump volume. At the moment, it's on one of the unused array disks.)

When I last saw the machine, I had started running Bonnie-64 on it, and it looked good so far. I hope to have very comprehensive test results to post.

In my previous entry, I vowed to give more details about the hardware.

My requirements were to come up with high-density, high-throughput, easy-to-manage storage on a budget. This will be doing multimedia streaming all the way up to (potentially uncompressed) Hi-Def. It's not your typical mailserver, in other words. My interest in ZFS has been in its reliability, ease-of-administration, and conceptual simplicity: disks are dumb and all too prone to failure. If an all-software solution allows me not to worry about the disks and not put the workload onto another point of failure between the application and the hardware (read: RAID cards die, get old, and obsolete), then I'm all for it.

A local systems integrator (with whom my department has had a long relationship) provided the system, and collaborated a fair bit on the specifications (but he readily admits that Solaris is not his speciality).

We started with the chassis: Steve works fairly exclusively with PCICase, and recommended their 16× SATA drive chassis in a 3U rackmount. It's a bit anonymously black, but it certainly looks like it will do the job of high-density storage on a budget.

After going back and forth, we settled on a motherboard from Tyan. The S2892 seemed just the ticket, and got a thumbs-up from the ZFS list. Unfortunately, Steve couldn't find the board from any of his suppliers, because it's apparently been end-of-lifed. He suggested the S3892 (Tyan Thunder K8HM) in its place. Having seen some measure of support for the southbridge controllers on the HCL (thanks to Paul Richards) from a kissing-cousin relative of the new motherboard, I agreed.

Originally, Steve proposed a 3.0GHz dual-core Opteron. I wanted something cheaper with more cores (cos Solaris tends to handle that rather well), so we ended up agreeing on two dual-core AMD Opteron 275 (2.2GHz) processors.

Combine this with 8GB RAM, dual boot drives (internal to the case, connected to the motherboard), 2× Supermicro's AOC-SAT2-MV8 SATA 8 port RAID controller, 16× 500GB Seagate SATA hard drives, and we have a pretty serious 8 Terabyte system for under £5000. The question at this point is how serious. I'm waiting for benchmark results to tell me just that.


Install 1 of N. Begin?

I took delivery of my project's RAID server yesterday. It's a heady bit of hardware, and I'm sure I'll blog about that part soon (once I know what does and doesn't work). The idea was to do some serious RAID (16 spindles), but using the end-to-end software approach from ZFS.

I was debating whether or not I would try to tackle the somewhat baroque instructions for the much more experimental ZFS boot support. I was pretty put off by the instructions, however, so I set them aside. Instead, I jumped in and decided to install using the more familiar UFS/slice method.

So I just got back from what I'm sure will be the first of many attempts to get the server configuration just right. And the experience of setting up the fdisk and pdisk slices, with their permanent choice of size, and with the involved UFS mirroring procedure still ahead of me… Well, I think I will give ZFS boot installation a try.

(next in series)