
Recent musings

Notes on using Time Machine to a ZFS backing store

These are the steps I took to get Time Machine backing up to an AFP store on top of ZFS running on OpenSolaris. That just happens to be the combination I'm running. Not every step will apply to your setup: I just want to make a note of what I had to do.

First, I installed netatalk, mostly working from the instructions listed at confessions of a unix junkie. Since Leopard strongly deprecates sending AFP passwords in the clear, I had to build with OpenSSL. Since I used pkgsrc to build the dependencies, my ./configure command looked more like:
LDFLAGS=-R/opt/local/lib RANLIB=echo CC=gcc ./configure --prefix=/opt/local --without-pam --disable-ddp --disable-tcp-wrappers --disable-srvloc --with-bdb=/opt/local/include/db4 --with-cnid-dbd-backend --with-ssl-dir=/opt/local

The Diffie-Hellman Exchange UAM (uams_dhx.so) is fairly critical to use with Leopard, so the "unix junkie" recommendations for the cleartext password should be ignored.

With that accomplished, set up the backing store. For me, it's a ZFS dataset (filesystem) on a centralised mount point. Make sure that the owner points to the user who will be using the AFP share (with a chown). Make sure the AppleVolumes.default entry for each share points to the right directory and user, such as:
/Storage/atl-backup "TimeMachine" rwlist:atl allow:atl

Once the mac could connect securely, there were only two steps needed. Allow Time Machine to deal with unsupported AFP shares:
defaults write com.apple.systempreferences TMShowUnsupportedNetworkVolumes 1
and tickle the mounted share into being recognised for this unsupported hack:
touch /Volumes/TimeMachine/.com.apple.timemachine.supported

That was enough to get the backup started. As I'm still running the initial backup, I have no idea of the stability of this solution, but I hope to be able to report back. I also notice that reports say you must mount the "sparsebundle" disk image manually in order for the Time Machine recovery GUI to be of use.


More storage desires

I've written before about what a ZFS-based storage appliance might look like. I think I've glimpsed a first step in that direction. Yesterday, by chance, I came across Norco's recently-announced DS-520, a build-your-own barebones storage server. You provide the operating system, disks, and RAM. I hadn't seen such an open, x86-based platform in that form factor before, so naturally my mind wandered towards OpenSolaris.

My first concern was the chipset compatibility. Norco emailed me back quite promptly saying that the Marvell 88sx6081 was the SATA chipset they used. Cool. That's what goes into the Thumper, so Solaris support seems trivial, at least for the one component that might cause the most trouble.

It sounds like the makings of a very nice home-server platform:

  • $729 for the 1GHz / 2×GigE unit
  • add 1 GB RAM,
  • a fast CF card for the OS, and
  • five 750GB disks (looks like they're on the right side of the price/capacity "knee" now), topped off with
  • OpenSolaris stripped down to a fairly minimal set

Not cheap, or trivially easy at this point, but it could make for an interesting project, and a great piece of kit to sit on the shelf.


Further benchmarks, and a step back for consideration

So, when last I blogged about the ZFS RAID server, I may have ended on a down note, suggesting disappointment. I hope readers will understand that's not the case.

When I started this project, I sat down and examined what was most important to me in the server.

    My requirements:
  • Saturate an aggregated 2× GigE link for sustained reads and writes
  • Do it cheaply
    My strong desires:
  • ZFS for its reliability, redundancy, flexibility, and ease of use
  • Maximise the amount of usable space
ZFS wasn't a requirement. It couldn't be: it's a solution, and defining requirements in terms of pre-ordained solutions is, at best, compromised. Maximising IOPS wasn't a first priority; sustained write performance was. Still, I want to have decent random seek performance, because there will always be a case where performance falls to that level.

I ran some additional Bonnie-64 tests to see the difference between the SATA controllers and their PCI-X buses. There was a small (a couple percent) but consistent difference between the two controllers. I believe one ran at 133MHz, and the other ran at 100MHz (but, to be honest, I don't know what tools I would use to verify such a thing). So I moved a disk controller from the 100MHz bus to a PCI-X slot on the shared 133MHz bus, and ran the same tests as before.

The results are as follows:

Block Writes, MB/sec

I perceive a strong levelling-off of streaming write performance, even lower than with the previous test. The peak for three 4+1 RAID-Z groups is 387.5 MB/s, while the peak for five 2+1 groups is 354.5 MB/sec. The mirrored scenario's limits are even lower, at 258.5 MB/sec.

Block Reads, MB/sec

The continuous read performance is even more interesting. Now, it's clear that the two controllers are maxing out a single, contended PCI-X bus where they hadn't before. The read limit is at 520 MB/sec. That, to me, sounds very much like one half of the throughput of a 64-bit, 133MHz bus (1064 MB/s). (It's within 2.5% of half that figure.) One conclusion could be that ZFS performs two reads for every block requested from disk, whether it be RAID-Z or mirror.

Taking a step back, should we find significance in the fact that one-third of our PCI-X bus throughput is 354.67 MB/s, while the most we could squeeze out of the 2+1 RAID-Z configuration was 354.5? It would certainly square with what commenter "mrb" stated: for 2+1 RAID-Z sets, expect 50% higher throughput on reads than writes.
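The bus arithmetic in the last two paragraphs can be checked with a minimal sketch. It assumes the nominal 133MHz clock and decimal megabytes, and takes the 520 and 354.5 MB/sec figures from the results above; the variable names are only illustrative.

```python
# Peak PCI-X bandwidth: bus width in bytes times the (nominal) clock rate.
BUS_WIDTH_BITS = 64
BUS_CLOCK_MHZ = 133          # assumption: nominal clock, not the exact 133.33

peak_mb_s = BUS_WIDTH_BITS / 8 * BUS_CLOCK_MHZ   # 1064.0 MB/s

half_bus = peak_mb_s / 2     # candidate ceiling for reads
third_bus = peak_mb_s / 3    # candidate ceiling for 2+1 RAID-Z writes

measured_read = 520.0        # MB/s, read limit reported above
measured_write_2p1 = 354.5   # MB/s, peak for five 2+1 RAID-Z groups

# Relative shortfall of each measurement against its candidate ceiling.
read_gap = (half_bus - measured_read) / half_bus
write_gap = (third_bus - measured_write_2p1) / third_bus

print(f"peak {peak_mb_s:.0f} MB/s, half {half_bus:.0f}, third {third_bus:.2f}")
print(f"reads fall {read_gap:.1%} short of half the bus")
print(f"2+1 writes fall {write_gap:.2%} short of a third of the bus")
```

How closely both measurements sit to simple fractions of the bus figure is what makes the bus, rather than the disks, look like the limit here.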

Random Seeks /sec

The random seek performance doesn't yet tell me much, other than that the theory of IOPS scaling linearly with the number of vdevs or mirror disks simply does not hold on my system. Frankly, I'm stumped at how it only increases logarithmically. Well, at least it increases monotonically.

Let's pit theory against practice. I originally posted a crude, back-of-the-envelope model of read/write/random ZFS performance for 14 or 15 disks in 2+1, 4+1 or 2× mirror configurations. What happens to our experimental results when I factor out the (estimated) base performance of a single drive/vdev set?

                                 Sequential I/O
config            Random Reads   Read    Write   Capacity
RAIDZ: 3×(4+1)    3y             12z     12z     6.0TB
RAIDZ: 5×(2+1)    5y             10z     10z     5.0TB
mirror: 7×2       14y            14z     7z      3.5TB

                  Random Reads   Sequential I/O
                  ×90/sec        ×72MB/sec
config                           Read    Write   Capacity
RAIDZ: 3×(4+1)    2.3y           7.2z    5.3z    6.0TB
RAIDZ: 5×(2+1)    3.3y           8.8z    5.2z    5.0TB
mirror: 7×2       3.8y           10.2z   4.8z    3.5TB
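One way to read the two tables together is as the fraction of the model's prediction that each configuration actually reached. The sketch below simply divides the measured multiples by the modelled ones. The numbers are copied from the tables; the dictionary names are only for illustration.

```python
# Each tuple is (random reads in y, sequential read in z, sequential write in z).
model = {
    "RAIDZ 3x(4+1)": (3, 12, 12),
    "RAIDZ 5x(2+1)": (5, 10, 10),
    "mirror 7x2": (14, 14, 7),
}
measured = {
    "RAIDZ 3x(4+1)": (2.3, 7.2, 5.3),
    "RAIDZ 5x(2+1)": (3.3, 8.8, 5.2),
    "mirror 7x2": (3.8, 10.2, 4.8),
}

# Fraction of the modelled figure reached in practice, per metric.
fraction = {
    name: tuple(m / p for m, p in zip(measured[name], model[name]))
    for name in model
}

for name, (rand, read, write) in fraction.items():
    print(f"{name:14s} random {rand:.0%}  read {read:.0%}  write {write:.0%}")
```

Read this way, the sequential figures fall short by broadly similar proportions across configurations, while the random-read figure for the mirror configuration is the furthest from its prediction.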

It's clear that ZFS is demanding enough that it can hit the limits of the PCI-X bus on a poorly thought-out system. I can sketch out those limits on my own system, in some cases. It's also true that I could have chosen another motherboard with 2 independent 133MHz PCI-X buses, or gone with a PCIe solution that would have eliminated any concerns about bus bandwidth. In theory, with this many disks, I could be seeing twice the performance in some situations. However, I should look at the numbers: 390MB/s far exceeds my ability to get data into or out of the machine via the network.
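To put the network comparison in numbers, here is a rough sketch. It assumes the raw line rate of two aggregated GigE links with no protocol overhead, so the real ceiling would be somewhat lower than the figure computed here.

```python
# Raw ceiling of the aggregated network link, ignoring protocol overhead.
GIGE_MBIT_S = 1000           # one GigE link, megabits per second
LINKS = 2                    # aggregated 2x GigE, as in the requirements

network_mb_s = LINKS * GIGE_MBIT_S / 8    # 250.0 MB/s

disk_write_mb_s = 390.0      # approximate write ceiling quoted above

# How many times over the disks can feed the network.
headroom = disk_write_mb_s / network_mb_s

print(f"network ceiling {network_mb_s:.0f} MB/s, disk/network ratio {headroom:.2f}")
```

Even against the raw line rate, the array has more than half again as much write throughput as the network can deliver.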

The machine does what it is supposed to, and surprisingly affordably, too. Any "disappointment" I have is purely theoretical.


As a postscript, I should make a call out to anyone who would like further data with the 133+100MHz controller configuration. The server is leaving the workshop and going into the rack now, but the system will be under test for a few weeks more. Contact me via the comments, the contact form on this site, or the zfs-discuss list if you have a particular scenario you'd like me to run.


Pause for Testing

To recap, the major components of the ZFS storage server were:

  • PCI Case's IPC-C3E-BAR65-XP-SAS 3U chassis w/16 hot-swap SAS/SATA2 bays
  • 2× AMD Opteron 275 processors (dual-core, 2.2GHz)
  • Tyan S3892 (K8HM) motherboard.
  • 8GB ECC memory
  • 2× 80GB boot drives (connected to the motherboard)
  • 2× Supermicro AOC-SAT2-MV8 SATA 8 port RAID controller
  • 16× 500GB Seagate enterprise class SATA Hard drives
This configuration is intended to act as a streaming multimedia recorder/server, with the most demanding disk I/O workflow intended to be writing a stream of data coming in from two GigE links. Imagine what it might take to store uncompressed high-definition video. There are likely to be other demanding tasks, but this was the most extreme.

Tyan's S3892 suffers from some ambiguous documentation. The PDF specification sheet states in the text that there are two 133MHz PCI-X slots, and one 100MHz slot, whereas the block diagram says they all run at 133MHz. The manual says nothing about it. There was no answer from emailing Tyan support. Using the block diagram as my guide, I split the two Supermicro controllers amongst the two PCI busses, and decided to start testing the hardware configuration to be sure.

Basically, I wanted to test and compare the model I previously blogged about.

    For mirrored configurations:
  • Small, random reads scale linearly with the number of disks; writes scale linearly with the number of mirror sets.
  • Sequential read throughput scales linearly with the number of disks; write throughput scales linearly with the number of mirror sets.
    For parity (RAID-Z, RAID-Z2) configurations:
  • Small, random I/O reads and writes scale linearly with the number of RAID sets.
  • Sequential read and write throughput scales linearly with the number of data (non-parity) disks.

Bonnie-64 was designed to turn up performance bottlenecks. That is precisely what I was looking for. Can I tell that one controller is on a bus that runs 75% the speed of the other? I could, in fact, but the overall combined performance was very decent. The tests did show limitations in my hardware, however.

I compared configurations from two to fifteen disks (I always want to have a hot spare running), with 2+1 RAIDZ vdevs, 4+1 RAIDZ vdevs, and (single) mirror vdevs. So each graph reflects fifteen runs of Bonnie-64:

  • With the 2+1 RAID-Z: 3, 6, 9, 12, or 15 disks in the zpool,
  • With the 4+1 RAID-Z: 5, 10, or 15 disks in a zpool, and
  • With the mirror configuration: 2, 4, 6, 8, 10, 12, or 14 disks in the zpool.
All of the graphs measure the number of data disks (i.e., the total number of mirror disks, but only the number of non-parity disks for the RAID-Z configurations), or in the case of random seeks, the number of RAID/mirror sets. All of the tests were performed with 32GB test files: from the results, it's pretty clear we're exceeding any cache issues.
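The test matrix described above can be enumerated with a short sketch. The helper name and the 15-disk cap (16 drives less the hot spare) are written out here as assumptions for illustration.

```python
def pool_sizes(disks_per_vdev, max_disks=15):
    """Disk counts for pools built from whole vdevs, up to max_disks."""
    return list(range(disks_per_vdev, max_disks + 1, disks_per_vdev))

# Total disks per vdev for each layout under test.
layouts = {
    "2+1 RAID-Z": 3,
    "4+1 RAID-Z": 5,
    "mirror": 2,
}

runs = {name: pool_sizes(width) for name, width in layouts.items()}
total_runs = sum(len(sizes) for sizes in runs.values())

for name, sizes in runs.items():
    print(f"{name:11s} {sizes}")
print(f"total Bonnie-64 runs per graph: {total_runs}")
```

The three lists have five, three, and seven entries, which is where the fifteen runs per graph come from.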

Block Writes, MB/sec

Block writes were always going to be the metric I was most sensitive to, because of the above-described workflow. You can see that there is a strong levelling-off of block write performance just below 390 MBytes/sec. The mirrored configurations increase their write speed at half the rate of the RAID-Z configurations, as we would expect from the slower writes indicated in the model. The "ideal" line is fairly arbitrary, as it's an extrapolation of performance from fairly few data points. It is, however, indicative of what the performance model might predict.

Block Reads, MB/sec

Sustained read performance is much less limited than with writes. The "ideal" line also has a 33% steeper slope than with the writes: it appears we consistently achieve four block reads in the same time as it takes to do three block writes. Strangely, the 4+1 RAID-Z groups underperform by a fair bit (I can't comment as to the statistical significance, at the moment, but it seems fairly consistent). The 7×2 mirror configuration tops out at 735 MB/s on reads, which seems fairly decent.

Random Seeks /sec

I'll admit that the random seek performance figures baffle me a bit. Everything I've read so far suggested that random seek performance would scale linearly with the number of vdevs (or disks in the mirror). Instead, the numbers line up fairly well with a logarithmic graph. Am I running into lots of vibration? Am I hitting an unexpected bottleneck that's unrelated to data transfer over the bus?

This is a beast of a post already. I'll push this out to the world, and start writing up the next installment, wherein I note that one of the SATA controllers is, in fact, on a slower PCI-X bus, and what I do to fix it.

edit: All of this was on Solaris Express Community Edition, Nevada 70, with the ZFS boot patch applied.


Install 2 of N. Continue?

This is a followup to the first in the series.

After an encouraging comment to my last OpenSolaris blog entry, I decided to do what was necessary to make a patched ZFS boot installer. I used the Netinstall script/procedure from Lori Alt. The name is a bit misleading, as I was able to run the install from a DVD, as I'm usually comfortable doing.

The actual install was much easier than expected, and – because of the pfinstall procedure – much more efficient than the usual rigamarole I go through. Creating the modified boot DVD was the hard part, and really, that was mostly in the logistics of moving DVD images back and forth. I would

  • download the DVD segments to my desktop,
  • assemble the pieces, (really, I have to complain about Sun's download policy: it really gets in the way when trying to do this kind of work.)
  • upload to the Solaris box, (and try again with newly split files because of a 2GB limit),
  • mount the ISO image,
  • copy the image to a working directory,
  • apply the patch, and include a draft install profile on disk,
  • create a bootable ISO image, and
  • move it (in pieces) from the server room to the MacBook Pro so I can actually burn it.

Google led me to Tom Haynes' blog, and the linked entry along with the followups that follow in sequence give enough magic sauce to get a bootable/installable ISO image. With the disk burned, installation was a snap. And a ZFS boot disk… simply works. (I do need to dig deeper into arranging a ZVOL or some such as a dump volume. At the moment, it's on one of the unused array disks.)

When I last saw the machine, I had started running Bonnie-64 on it, and it looked good so far. I hope to have very comprehensive test results to post.

In my previous entry, I vowed to give more details about the hardware.

My requirements were to come up with high-density, high-throughput, easy-to-manage storage on a budget. This will be doing multimedia streaming all the way up to (potentially uncompressed) Hi-Def. It's not your typical mailserver, in other words. My interest in ZFS has been in its reliability, ease-of-administration, and conceptual simplicity: disks are dumb and all too prone to failure. If an all-software solution allows me not to worry about the disks and not put the workload onto another point of failure between the application and the hardware (read: RAID cards die, get old, and obsolete), then I'm all for it.

A local systems integrator (with whom my department has had a long relationship) provided the system, and collaborated a fair bit on the specifications (but he readily admits that Solaris is not his speciality).

We started with the chassis: Steve works fairly exclusively with PCICase, and recommended their 16× SATA drive chassis in a 3U rackmount. It's a bit anonymously black, but it certainly looks like it will do the job of high-density storage on a budget.

After going back and forth, we settled on a motherboard from Tyan. The S2892 seemed just the ticket, and got a thumbs-up from the ZFS list. Unfortunately, Steve couldn't find the board from any of his suppliers, because it's apparently been end-of-lifed. He suggested the S3892 (Tyan Thunder K8HM) in its place. Having seen some measure of support for the southbridge controllers on the HCL (thanks to Paul Richards) from a kissing-cousin relative of the new motherboard, I agreed.

Originally, Steve proposed a 3.0GHz dual-core Opteron. I wanted something cheaper with more cores (cos Solaris tends to handle that rather well), so we ended up agreeing on two dual-core AMD Opteron 275 (2.2GHz) processors.

Combine this with 8GB RAM, dual boot drives (internal to the case, connected to the motherboard), 2× Supermicro's AOC-SAT2-MV8 SATA 8 port RAID controller, 16× 500GB Seagate SATA hard drives, and we have a pretty serious 8 Terabyte system for under £5000. The question at this point is how serious. I'm waiting for benchmark results to tell me just that.


Install 1 of N. Begin?

I took delivery of my project's RAID server yesterday. It's a heady bit of hardware, and I'm sure I'll blog about that part soon (once I know what does and doesn't work). The idea was to do some serious RAID (16 spindles), but using the end-to-end software approach from ZFS.

I was debating whether or not I would try to tackle the somewhat baroque instructions for the much more experimental ZFS boot support. I was pretty put off by the instructions, however, so I set them aside. Instead, I jumped in and decided to install using the more familiar UFS/slice method.

So I just got back from setting up one of what I'm sure will be the first of many attempts to get the server configuration just right. And the experience of setting up the fdisk and pdisk slices, with their permanent choice of size, and with the involved UFS mirroring procedure still ahead of me… Well, I think I will give ZFS boot installation a try.

(next in series)

A lifer, again

Damn Joyent for continually being able to come up with products I want to buy. It's clear that they tap into a psychological type, and I fit that profile. I've written about the freedom offered by Joyent's pricing model for the "lifetime" plans, and all I can say is they've done it again: their virtual private server offering, the OpenSolaris-based "Accelerator," has a limited-customer offer for "lifetime" hosting.

Never mind that Joyent have repeatedly denied that they would (or even could) offer lifetime hosting on this utility-computing-style product. They did it. And halved the price for existing customers like me.

Never mind that while I had been intrigued and tempted by the Accelerators, I set the idea of signing up aside for two reasons: I have numerous containers at my disposal with my OpenSolaris servers at work, anyway, and the extra month's setup fee for an Accelerator didn't make it worth it for me. I bought it.

Absolutely brilliant of them: provide a product that I wouldn't pay $150 to try, but would pay $500 more to "buy outright." I'm freed up from the pressure of having to take advantage of it right away. What surprises me most right now is that there is no news of the offer ending, yet. I would have sworn that there would be 200 takers within hours.


ZFS performance models for a streaming server

I've been spending a fair bit of the last week puzzling through the various postings on ZFS performance. While Richard Elling's blog posts were informative, they didn't really tell me much about the workflow that interested me most: high-throughput multimedia streaming. I eventually took the question to the ZFS-discuss list, and got a lot of knowledgeable feedback. The essence of what I learned about various ZFS configurations, for my purposes (once I understood the superficial size/reliability/performance tradeoffs), gets boiled down to one choice:


Which is more important to your ZFS workflow:
random access or write performance?


I suppose this is old hat to people who are very familiar with RAID systems and/or ZFS, but it took some digging for me to find out. I'll set it out in words, in contrast with the pretty but dizzying graphs at relling's site. These guidelines obviously set aside any other bottlenecks, but as was consistently pointed out, media speed is usually the bounding factor with performance.

    For mirrored configurations:
  • Small, random reads scale linearly with the number of disks; writes scale linearly with the number of mirror sets.
  • Sequential read throughput scales linearly with the number of disks; write throughput scales linearly with the number of mirror sets.
    For parity (RAID-Z, RAID-Z2) configurations:
  • Small, random I/O reads and writes scale linearly with the number of RAID sets.
  • Sequential read and write throughput scales linearly with the number of data (non-parity) disks.

In other words, mirrors suffer on writes, collapsing to the number of mirrors, essentially. RAID-Z groups suffer most with random I/O, collapsing to the number of RAID groups, performance-wise, in those situations. A hypothetical table with two different configurations of 12 disks (four 3-way mirror sets vs. two RAID-Z2 sets) helps show the strong contrast:

                   Random I/O       Sequential I/O
config             Read    Write    Read    Write
mirror: 4*3        12y     4y       12z     4z
RAIDZ2: 2*(4+2)    2y      2y       8z      8z

where y is the number of random, short IOPS and z is the sustained media throughput on the drives
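The scaling rules above can be written down as a small sketch that reproduces the table. The function names are made up for illustration, and the model deliberately ignores every bottleneck other than the disks themselves.

```python
def mirror_model(sets, ways):
    """Multipliers for `sets` mirror vdevs, each holding `ways` copies.

    Returns (random read, random write, sequential read, sequential write),
    the first two in units of y and the last two in units of z.
    """
    disks = sets * ways
    # Reads can be served by any disk; writes must land on every copy.
    return (disks, sets, disks, sets)


def raidz_model(groups, data_disks, parity_disks):
    """Multipliers for `groups` RAID-Z vdevs of data_disks + parity_disks."""
    data = groups * data_disks
    # Random I/O collapses to the number of groups; streaming scales with
    # data disks only, so parity_disks does not enter the model directly.
    return (groups, groups, data, data)


print("mirror 4*3     ", mirror_model(4, 3))
print("RAIDZ2 2*(4+2) ", raidz_model(2, 4, 2))
```

Note that parity only matters here through the disks it takes away from data: a 4+2 group streams like a 4+1 group in this model, but uses one more drive to do it.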


My case is fairly clear: I want to both read and write multimedia streams fairly equally, so I favour RAID-Z groups. I don't need the same sort of long data life that others do, so I set RAID-Z2 aside for now.


In my particular case, I have 16 500GB SATAII drives to work with for the RAID. I am committed to one hot spare, so I'm down to 15 drives. Once I get my server, I need to know how much—and when—performance degrades when excessive numbers of streams are added and/or Random I/O requests are added to the mix.


For a long time, I had assumed I would use three sets of five-drive RAID sets. Looking at three-drive sets, I have to consider whether a ~17% drop in peak streaming performance is worth a 67% improvement in baseline small, random I/O (essentially the worst-case scenario).


How do I arrive at that? Five 2+1 RAIDZ groups have 10 data disks compared to the 12 in three 4+1 RAIDZ groups. If I go from 4+1 to 2+1 groups, I lose 2 disks worth of data storage and the equivalent amount of max streaming capacity, but gain two more RAIDZ groups for working on the seeking for random I/O. Actually, another table really makes the picture quite clear. I throw in a set of mirrored drives as further food for thought.



                   Random I/O       Sequential I/O
config             Read    Write    Read    Write    Capacity
RAIDZ: 3*(4+1)     3y      3y       12z     12z      6.0TB
RAIDZ: 5*(2+1)     5y      5y       10z     10z      5.0TB
mirror: 7*2        14y     7y       14z     7z       3.5TB
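The ~17% and 67% figures quoted earlier, and the capacity column, follow from simple arithmetic. The sketch below assumes 0.5TB of usable space per data disk and ignores filesystem overhead; the variable names are illustrative.

```python
DISK_TB = 0.5                # 500GB drives, filesystem overhead ignored

# (groups, data disks per group) for the two RAID-Z layouts.
raidz_wide = (3, 4)          # 3*(4+1)
raidz_narrow = (5, 2)        # 5*(2+1)
mirror_sets = 7              # 7*2: one disk's worth of data per mirror set


def data_disks(layout):
    groups, per_group = layout
    return groups * per_group


# Peak streaming scales with data disks; random I/O with the group count.
stream_drop = 1 - data_disks(raidz_narrow) / data_disks(raidz_wide)
random_gain = raidz_narrow[0] / raidz_wide[0] - 1

capacity_tb = {
    "3*(4+1)": data_disks(raidz_wide) * DISK_TB,
    "5*(2+1)": data_disks(raidz_narrow) * DISK_TB,
    "7*2 mirror": mirror_sets * DISK_TB,
}

print(f"streaming drop {stream_drop:.0%}, random I/O gain {random_gain:.0%}")
print(capacity_tb)
```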
