After setting up a Solaris 10 machine with ZFS as the new NFS server, I'm stumped by some serious performance problems. Here are the details:
The machine in question is a dual-AMD64 box with 2 GB RAM and two
Broadcom gigabit NICs; the Broadcom BRCMbcme package was installed to drive
the interfaces. The OS is Solaris 10 6/06, and the filesystem consists of
a single zpool striped across the two halves of an Apple XRaid (each half
configured as RAID-5), providing a pool of 5.4 TB. On the pool, I've
created a total of 60 filesystems, each of them shared via NFS, each of
them with compression turned on. The clients (NetBSD) mount the
filesystems with '
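The server-side setup described above can be sketched roughly like this. The pool name and device paths are invented for illustration; the real pool striped across the two XRaid LUNs, and the `compression` and `sharenfs` properties are inherited by the child filesystems:

```shell
# Sketch only: "tank", c4t0d0 and c4t1d0 are invented names.
zpool create tank c4t0d0 c4t1d0    # stripe across the two RAID-5 halves
zfs set compression=on tank        # children inherit compression=on
zfs set sharenfs=on tank           # ... and the NFS share property
zfs create tank/home01             # one of the ~60 filesystems
```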
All of this is perfectly acceptable. Compared with the old NFS server (which runs on IRIX), we get:
Alright, so what's my beef? Well, here's the fun part: when I try to actually use this NFS share as my home directory (as I do with the IRIX NFS mount), then somehow performance plummets. Reading my inbox (~/.mail) will take around 20 seconds (even though it has only 60 messages in it).
When I try to run mutt under ktrace, it takes even longer to complete.
Neither the ktrace nor the mutt command can be killed right away -- they're blocking on I/O.
Alright, so after it finally finished, I try something a bit simpler: creating a few thousand directories in a loop.
On the IRIX NFS share, this takes about 60 seconds.
On the Solaris NFS share, this takes... forever. (I interrupted it after 10 minutes, when it had managed to create 2500 directories.)
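A minimal version of that directory-creation test might look like this; the target directory and the count are placeholders, since the post doesn't show the original command. Run it as `time sh mkdirs.sh` on the mount in question:

```shell
# Hypothetical reproduction: create COUNT directories and see how long it takes.
DIR=${DIR:-/tmp/mkdir-bench.$$}
COUNT=${COUNT:-1000}
mkdir -p "$DIR"
i=0
while [ "$i" -lt "$COUNT" ]; do
    mkdir "$DIR/d$i"
    i=$((i + 1))
done
echo "created $COUNT directories under $DIR"
```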
tcpdump and snoop show me that traffic zips by as it should for the operations described above ((1), (2) and (3)), but becomes very "bursty" when doing reads and writes simultaneously or when creating the directories. That is, instead of a constant stream of packets zipping by, tcpdump gives me only about 15 lines every second, yet I can't find any packet loss.
I've tried to see if this is a problem with ZFS itself: I ran the same tests locally on the file server, directly on the ZFS, and everything works just fine there.
I've tried to mount the filesystem over TCP and with different read/write sizes, with NFSv2 and NFSv3 (the clients don't support NFSv4).
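On NetBSD, those variations look something like this (the server and path names are invented; only the flags matter here):

```shell
# Hypothetical server/path names; each line is one variant that was tried.
mount_nfs -T zfs-server:/tank/home /home                  # force TCP
mount_nfs -r 32768 -w 32768 zfs-server:/tank/home /home   # different r/w sizes
mount_nfs -2 zfs-server:/tank/home /home                  # NFSv2 instead of v3
```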
I've tried to see if it's the NIC or the network by testing raw network throughput, connecting the machine to a different switch, etc., all to no avail.
I've played with every setting in
Alright, so in my next attempt to see if I'm crazy or not, I installed Solaris 10 6/06 on another workstation. From there, mounting a ZFS over NFS works just dandy; all the above tests are fast.
So I reinstall the other machine. After importing the old zpool, nothing has changed. I destroy the zpool and recreate it. Still the same problem.
To ensure that it's not the SAN switch, I connect the Solaris machine directly to the XRaid, and again, no change.
I destroy the RAID-5 config on the XRaid and build a RAID-0 across 7 of the disks. Creating a zpool on only this one (striped) device also does not change performance at all. Creating a regular UFS on this device, however, immediately fixes the problems! So it's not the fibre channel switch, not the fibre channel cables, not the fibre channel card, not the gigabit card, not the machine, and not the mount options -- it simply appears to be ZFS. ZFS on an Apple XRaid, to be precise. (Maybe it's ZFS on fibre channel, I don't know; it's not ZFS per se, since the other freshly installed machine with ZFS on a local SATA disk worked fine.)
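That final isolation step boils down to something like the following (device names invented; the point is that the very same LUN behaves fine under UFS):

```shell
# Hypothetical device name for the RAID-0 LUN.
zpool destroy tank                       # drop the ZFS pool on the LUN
newfs /dev/rdsk/c4t0d0s2                 # plain UFS on the same LUN
mount /dev/dsk/c4t0d0s2 /export/test
share -F nfs -o rw /export/test          # NFS performance is fine again
```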
ZFS on XRaid; somewhat of a bummer, since I'd waited for this release to finally make use of ZFS.
July 25, 2006