Running ZFS over NFS as a VMware Store


NFS is a well-rounded, high-performance file storage protocol, and it successfully serves VMware stores over many storage products. Recently one of my subscribers asked me why my blogs were more centric to iSCSI. The question was probing for an answer to something many of us ask ourselves: is NFS superior to block-based iSCSI, and which one should I choose for VMware? The answer is not which protocol is superior, but which protocol provisions the features and functions you require most effectively. I use both protocols and find they each have desirable capability and functionality, and conversely some negative points as well.

NFS is generally more accessible because it is a file-level protocol and sits higher up on the network stack. This makes it very appealing when working with VMware virtual disks, aka VMDKs, simply because they also exist at the same layer. NFS is ubiquitous across NAS vendors and can be provisioned by multiple vendor-agnostic implementation endpoints. An NFS service can also be virtualized and encapsulated within any hypervisor instance, either clustered or standalone. The network file locking and share semantics of NFS grant it a multitude of configurable elements which can serve a wide range of applications.

In this blog entry we will explore how to implement an NFS share for VMware ESX using OpenSolaris and ZFS. We will also explore a new way of accelerating the server's I/O performance with a new product called the DDRdrive X1.

OpenSolaris is an excellent choice for provisioning NFS storage volumes for VMware. It hosts many advanced, desirable storage features that set it far ahead of other Unix flavors. We can use its advanced networking features and ZFS, including the newly integrated dedup functionality, to craft the best NFS functionality available today.

Let's start by examining the overall NAS storage architecture.


NFS OpenSolaris/VMware Architecture by Mike La Spina



In this architecture we are defining a fault-tolerant configuration using two physical 1Gbe switches with a quad or dual Ethernet adapter(s). On the OpenSolaris storage head we are using IPMP, aka IP multipathing, to establish a single IP address to serve our NFS store endpoint. A single IP is more appropriate for VMware environments as they do not support multiple NFS IP targets per NFS mount point. IPMP provisions layer 3 load balancing and interface fault tolerance. IPMP commonly uses ICMP and default routes to determine interface failure states, thus it is well suited for a NAS protocol service layer. In an effort to reduce excessive ICMP rates we will aggregate the two dual interfaces into a single channel connection to each switch. This will allow us to define two test IP addresses for the IPMP service and keep our logical interface count down to a minimum. We are also defining a 2 port trunk/aggregate between the two physical switches, which provides more path availability and reduces switch failure detection times.

On the ESX host side we are defining one interface per switch. This type of configuration requires that only one of the VMware interfaces is an active team member vmnic within a single vSwitch definition. If it is not configured this way the ESX host will fail to detect and activate the second nic under some failure modes. This is not a bandwidth constraint issue since the vmkernel IP interface will only actively use one nic.

With an architecture set in place, let's now explore some of the pros and cons of running VMware on OpenSolaris NFS.

Some of the obvious pros are:

  • VMware uses NFS in a thin provisioned format.
  • VMDKs are stored as files and are mountable over a variety of hosts.
  • Simple backup and recovery.
  • Simple cloning and migration.
  • Scalable storage volumes.

And some of the less obvious pros:

  • IP based transports can be virtualized and encapsulated for disaster recovery.
  • No vendor lock-in
  • ZFS retains NFS share properties within the ZFS filesystem.
  • ZFS will dedup VMDKs files at the block level.

And there are the cons:

  • Every write I/O from VMware is an O_SYNC write.
  • Firewall setups are complex.
  • Limited in its application. Only NFS clients can consume NFS file systems.
  • General protocol security challenges (RPC).
  • VMware kernel constraints
  • High CPU overhead.
  • Bursty data flow.

Before we break out into the configuration detail, let's examine some of the VMware and NFS behaviors to gain some insight into the reason I primarily use iSCSI for most VMware implementations.

I would like to demonstrate some characteristics that are primarily a VMware client-side behavior, and it's important that you are aware of them when you're considering NFS as a datastore.

This VMware performance chart of an IOMeter generated load reveals the bursty nature of the NFS protocol. The VMware NFS client exclusively uses an O_SYNC flag on write operations, which requires a committed response from the NFS server. At some point the storage system will not be able to complete every request and thus a pause in transmission will occur. The same occurs on reads when the network component buffers reach saturation. In this example chart we are observing a single 1Gbe interface at saturation from a read stream.


NFS VMware Network I/O Behavior by Mike La Spina


In this output we are observing a read stream across vh0, one of two active ESX 4 hosts loading our OpenSolaris NFS store, and we can see the maximum network throughput achieved is ~81MB/s. If you examine the average value of 78MB/s you can see the burst events do not have a significant impact; at ~3MB/s of loss this is not a bandwidth concern.
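For context on that ~81MB/s ceiling, a quick calculation shows the raw payload limit of a single 1Gbe link and what fraction of it the observed stream actually achieves; the remainder is consumed by Ethernet/IP/TCP/RPC framing and the synchronous NFS request/response cycle:

```shell
# Theoretical payload ceiling of one 1Gbe link: 1000 Mbit/s over 8 bits per byte
awk 'BEGIN { printf "%.0f MB/s\n", 1000 / 8 }'
# -> 125 MB/s

# Efficiency of the observed ~81 MB/s read stream against that ceiling
awk 'BEGIN { printf "%.0f%%\n", 81 / (1000 / 8) * 100 }'
# -> 65%
```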


NFS VMware Network Read I/O Limit Behavior by Mike La Spina


At the same time we are recording this write stream chart on vh3, a second ESX 4 host loading the same OpenSolaris NFS store. As I would expect, it's very similar to the read stream, except that the write performance is lower, and that's to be expected with any write operation. We can also identify that we are using a full duplex transmission path across to our OpenSolaris NFS host, since vh0 is reading (receiving) and vh3 is writing (transmitting).


NFS VMware Network Write I/O Limit Behavior by Mike La Spina


In this chart we are observing a limiting characteristic of the VMware vmkernel NFS client process. We have introduced a read stream in combination with a preexisting active write stream on a single ESX host. As you can see the transmit and receive packet rates are both reduced and now sum to a maximum of ~75MB/s.



NFS VMware Network Mixed Read Write I/O Limit Behavior by Mike La Spina


Transitioning from read to write active streams confirms the transmission is limited to ~75MB/s regardless of the full duplex interface capability. This demonstrates that a host using 1Gbe Ethernet connections will be constrained based on its available resources. This is an important element to consider when using NFS as a VMware datastore.


NFS VMware Network Mixed Read Write I/O Flip Limit Behavior by Mike La Spina


Another important element to consider is the CPU load impact of running the vmkernel NFS client. There is a significant CPU cycle cost on VMware hosts and this is very apparent under heavier loads. The following screen shot depicts a running IOMeter load test against our OpenSolaris NFS store. The important elements are as follows: IOMeter is performing 32KB reads in a 100% sequential access mode, which drives a CPU load on the VM of ~35%; however, this is not the only CPU activity that occurs for this VM.


NFS IOMeter ZFS Throughput 32KB-Seq


When we examine the ESX host resource summary for the running VM we can observe the resulting overhead load in the Consumed Host CPU value. The VM in this case is granted 2 CPUs, each a 3.2Ghz Intel hypervisor resource. We can see that the ESX host is running at 6.6Ghz to drive the vmkernel NFS I/O load.


NFS VMware ESX 4 CPU Load


Let's see the performance chart results when we svMotion the actively loaded running VM on the same ESX host to an iSCSI VMFS based store on the same OpenSolaris storage host. The only element changing in this test is the underlying storage protocol. Here we can clearly see CPU object 0 is the ESX host CPU load. During the svMotion activity we begin to see some I/O drop off due to the additional background disk load. Finally we observe the VM transition at the idle point and the resulting CPU load of the iSCSI I/O. We clearly see the ESX host CPU load drop from 6.6Ghz to 3.5Ghz, which makes it very apparent that NFS requires substantially more CPU than iSCSI.


VM Transitioned with svMotion from NFS to iSCSI on same ZFS Storage host


With the svMotion completed we now observe a retake of the same IOMeter screen shot, and it's very obvious that our throughput and IOPS have increased significantly while the VM's granted CPU load has not changed significantly. A decrease of ESX host CPU load in the order of ~55%, along with an increase of ~32% in IOPS and 45% in throughput, shows us there are some negative behaviors to be cognizant of. Keep in mind that this is not the case when the I/O type is small and random, like that of a database; in those cases NFS is normally the winner. However, VMware normally hosts mixed loads and thus we need to consider this negative effect at design time and when targeting VM I/O characteristics.


iSCSI IOMeter ZFS X1DDR Cache Throughput 32KB-Seq Mike La Spina

iSCSI ESX 4 CPU Load by Mike La Spina


With a clear understanding of some important negative aspects of implementing NFS for VMware ESX hosts, we can proceed to the storage system build detail. The first order of business is the hardware configuration. This build is simply one of my generic white boxes and it hosts the following hardware:


GA-EP45-DS3L Mobo with an Intel 3.2Ghz E8500 Core Duo

1 x 70GB OS Disk

2 x 500GB SATA II ST3500320AS disks

2GB of Ram

1 x Intel Pro 1000 PT Quad Network Adapter


As a very special treat on this configuration I am also privileged to run a DDRdrive X1 cache accelerator, for which I am currently testing some newly developed beta drivers for OpenSolaris. Normally I would use 4GB of RAM as a minimum, but I needed to constrain this build in an effort to load down the dedicated X1 LOG drive and the physical SATA disks, thus this instance is running only 2GB of RAM. In this blog entry I will not be detailing the OpenSolaris install process; we will begin from a Live CD installed OS.

OpenSolaris defaults to a dynamic network service configuration named nwam; this needs to be disabled and the physical:default service enabled.

root@uss1:~# svcadm disable svc:/network/physical:nwam
root@uss1:~# svcadm enable svc:/network/physical:default

Before establishing an aggregation we need to un-configure any interfaces that were previously configured.

root@uss1:~# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
e1000g0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 10.1.0.1 netmask ffff0000 broadcast 10.255.255.255
ether 0:50:56:bf:11:c3
lo0: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
inet6 ::1/128

root@uss1:~# ifconfig e1000g0 unplumb

Once cleared, the assignment of the physical devices is possible using the following commands:

dladm create-aggr -d e1000g0 -d e1000g1 -P L2,L3 1
dladm create-aggr -d e1000g2 -d e1000g3 -P L2,L3 2

Here we have set the policy allowing layer 2 and 3 load distribution and defined two aggregates, aggr1 and aggr2. We can now define the VLAN based interfaces, shown here as VLAN 500 instances 1 and 2, respective of the aggr instances. You just need to apply the following formula when naming the VLAN interface.

(Adaptor Name) + (VLAN ID × 1000 + Adaptor Instance)

ifconfig aggr500001 plumb up 10.1.0.1 netmask 255.0.0.0
ifconfig aggr500002 plumb up 10.1.0.2 netmask 255.0.0.0
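A quick sanity check of the naming formula against the interfaces plumbed above (this is just shell arithmetic, not a dladm operation):

```shell
# Solaris VLAN link PPA = (VLAN ID * 1000) + device instance
vlan=500
instance=1
echo "aggr$((vlan * 1000 + instance))"
# -> aggr500001
```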

Each pair of interfaces needs to be attached to a trunk definition on its switch path. Typically this will be a Cisco or HP switch in most environments. Here is a sample of how to configure each brand.

Cisco:

configure terminal
interface port-channel 1
interface ethernet 1/1
channel-group 1
interface ethernet 1/2
channel-group 1
interface port-channel 1
switchport mode trunk
switchport trunk allowed vlan 500
exit

HP Procurve:

trunk 1-2 trk1 trunk
vlan 500
name "eSAN1"
tagged trk1

 

Once we have our two physical aggregates set up we can define the IP multipathing interface components. As a best practice we should define the IP addresses in our hosts file and then refer to those names in the remaining configuration tasks.

Edit /etc/hosts to have the following host entries.

::1 localhost
127.0.0.1 uss1.local localhost loghost
10.0.0.1 uss1 uss1.domain.name
10.1.0.1 uss1.esan.data1
10.1.0.2 uss1.esan.ipmpt1
10.1.0.3 uss1.esan.ipmpt2

Here we have named the IPMP data interface, aka the public IP, uss1.esan.data1; this IP will be the active connection for our VMware storage consumers. The other two, named uss1.esan.ipmpt1 and uss1.esan.ipmpt2, are beacon probe IP test addresses and will not be available to external connections.

IPMP functionality is included with OpenSolaris and is configured with the ifconfig utility. The following sets up the first aggregate with a real public IP and a test address. The deprecated keyword defines the IP as a test address and the failover keyword defines whether the IP can be moved in the event of interface failure.

ifconfig aggr500001 plumb uss1.esan.ipmpt1 netmask + broadcast + group ipmpg1 deprecated -failover up addif uss1.esan.data1 netmask + broadcast + failover up
ifconfig aggr500002 plumb uss1.esan.ipmpt2 netmask + broadcast + group ipmpg1 deprecated -failover up

To persist the IPMP network configuration on boot you will need to create hostname files matching the interface names, with the IPMP configuration statement stored in them. The following will address it.

echo uss1.esan.ipmpt1 netmask + broadcast + group ipmpg1 deprecated -failover up addif uss1.esan.data1 netmask + broadcast + failover up > /etc/hostname.aggr500001

echo uss1.esan.ipmpt2 netmask + broadcast + group ipmpg1 deprecated -failover up > /etc/hostname.aggr500002

The resulting interfaces will look like the following:

root@uss1:~# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
aggr1: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 2
inet 10.1.0.2 netmask ff000000 broadcast 10.255.255.255
groupname ipmpg1
ether 0:50:56:bf:11:c3
aggr2: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 3
inet 10.1.0.3 netmask ff000000 broadcast 10.255.255.255
groupname ipmpg1
ether 0:50:56:bf:6e:2f
ipmp0: flags=8001000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,IPMP> mtu 1500 index 5
inet 10.1.0.1 netmask ff000000 broadcast 10.255.255.255
groupname ipmpg1
lo0: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
inet6 ::1/128

In order for IPMP to detect failures in this configuration you will need to define target probe addresses for IPMP use. For example, I use multiple ESX hosts as probe targets on the storage network.

e.g.

root@uss1:~# route add -host 10.1.2.1 10.1.2.1 -static
root@uss1:~# route add -host 10.1.2.2 10.1.2.2 -static

This network configuration yields two 2Gbe aggregate paths bound to a single logical active IP address on 10.1.0.1. On interfaces aggr1 and aggr2, the deprecated keyword directs the IPMP mpathd service daemon to prevent the establishment of application session connections, and the -failover keyword instructs mpathd not to allow the bound IP to fail over to any other interface in the IPMP group.
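Depending on your OpenSolaris build, the group state can be verified with the ipmpstat observability tool (introduced with the Clearview IPMP project); a sketch of the checks I would run, with output varying by build:

```shell
# Show IPMP group health -- both aggregates should report ok
ipmpstat -g
# Show data address binding -- 10.1.0.1 should be active on the group
ipmpstat -a
# Show the probe targets in.mpathd is using
ipmpstat -t
```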

There are many other possible configurations but I prefer this method because it remains logically easy to diagnose and does not introduce unnecessary complexity.

Now that we have layer 3 network connectivity we should establish the other essential OpenSolaris static TCP/IP configuration elements. We need to ensure we have a persistent default gateway and our DNS client resolution enabled.

The persistent default gateway is very simple to define and is done with the route utility as follows.

root@uss1:~# route -p add default 10.1.0.254
add persistent net default: gateway

When using NFS I prefer provisioning name resolution as an additional layer of access control. If we use names to define NFS shares and clients, we can externally validate the incoming IP with a static file or DNS based name lookup. An OpenSolaris NFS implementation inherently grants this methodology. When a client IP requests access to an NFS share we can define a forward lookup to ensure the IP maps to a name which is granted access to the targeted share. We can simply define the desired FQDNs against the NFS shares.

In small configurations static files are acceptable, as is the case here. For large host farms the use of a DNS service instance would ease the admin cycle. You would just have to be careful that your cached Time To Live (TTL) value is greater than 2 hours, thus preventing excessive name resolution traffic. The TTL value controls how long the name is cached, and this prevents constant external DNS lookups.

To configure name resolution for both file and DNS we simply copy the predefined config file named nsswitch.dns to the active config file nsswitch.conf as follows:

root@uss1:~# cp /etc/nsswitch.dns /etc/nsswitch.conf

Enabling DNS will require the configuration of our /etc/resolv.conf file which defines our name servers and namespace.

e.g.

root@ss1:~# cat /etc/resolv.conf
domain laspina.ca
nameserver 10.1.0.200
nameserver 10.1.0.201

You can also use the static /etc/hosts file to define any resolvable name to IP mapping.

With OpenSolaris you should always define your NFS share properties using the ZFS administrative tools. When this method is used we can take advantage of keeping the NFS share properties inside of ZFS. This is really useful when you replicate or clone the ZFS file system to an alternate host, as all the share properties will be retained. Here are the basic elements of an NFS share configuration for use with VMware storage consumers.

zfs create -p sp1/nas/vol1
zfs set mountpoint=/export/uss1-nas-vol1 sp1/nas/vol1
zfs set sharenfs=rw,nosuid,root=vh3-nas:vh2-nas:vh1-nas:vh0-nas sp1/nas/vol1

The NFS share ACL property rw sets the entire share as read-write; you could alternately use rw=hostname for each host but that seems redundant to me. The nosuid option prevents any incoming connection from switching user ids, for example from a non-root value to 0. Finally the root=hostname property grants the named incoming host access to the share with root permissions; any files created by the host will be owned by the root id. While these steps provide some level of access control, they fall well short of secure, thus I also keep the NAS subnets fully isolated or firewalled to prevent external network access to the NFS share hosts.
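To confirm the share is defined and exported as intended, a quick check from the storage head (using the example dataset above):

```shell
# Confirm the share properties travel with the ZFS file system
zfs get sharenfs,mountpoint sp1/nas/vol1
# Confirm the NFS server is actually exporting the path
share
```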

Once our NFS share is up and running we can proceed to configure the VMware network components and share connection properties. VMware requires a vmkernel network interface definition to provision NFS connectivity. You should dedicate a vmnic team and a vSwitch to your storage network.

Here is a visual example of a vmkernel configuration with a teamed pair of vmnics:

vmkernel eNAS-Interface by Mike La Spina

As you can see, we have dedicated the vSwitch and vmnics to VLAN 500; no other traffic should be permitted on this network. You should also set the default vmkernel gateway to its own address. This will promote better performance as there is no need to leave this network.

For eNAS-Interface1 you should define one active and one standby vmnic. This will ensure proper interface fail-over in all failure modes. The VMware NFS kernel instance will only use a single vmnic so you're not losing any bandwidth. The vmnic team only serves as a fault-tolerant connection and is not a load balanced configuration.

VMkernel Team Standby by Mike La Spina
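If you prefer to script this from the classic ESX service console rather than the vSphere client, the equivalent build looks roughly like the following sketch; vSwitch1, vmnic2 and vmnic3 are hypothetical names for your environment, and the active/standby teaming order itself still needs to be set in the client:

```shell
# Create a dedicated vSwitch for storage and link the teamed vmnics
esxcfg-vswitch -a vSwitch1
esxcfg-vswitch -L vmnic2 vSwitch1
esxcfg-vswitch -L vmnic3 vSwitch1
# Add the vmkernel port group and tag it on VLAN 500
esxcfg-vswitch -A eNAS-Interface1 vSwitch1
esxcfg-vswitch -v 500 -p eNAS-Interface1 vSwitch1
# Bind the vmkernel IP interface to the port group
esxcfg-vmknic -a -i 10.1.2.1 -n 255.0.0.0 eNAS-Interface1
```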


At this point you should validate your network connectivity by pinging the vmkernel IP address from the OpenSolaris host. If you choose to ping from ESX, use vmkping instead of ping, otherwise you will not get a response.

Provided your network connectivity is good you can define your vmkernel NFS share properties. Here is a visual example.

Add an NFS share by Mike La Spina

And if you prefer an ESX command line method:

esxcfg-nas -a -o uss1-nas -s /export/uss1-nas-vol1 uss1-nas-vol1

In this example we are using a DNS based name of uss1-nas. This allows you to change the host IP without having to reconfigure the VMware hosts. You will want to make sure the DNS name cache TTL is not a small value, for two reasons: a DNS outage would impact IP resolution, and you do not want excessive resolution traffic on the eSAN subnet(s).
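To verify the result from the ESX command line, esxcfg-nas can also list what is configured:

```shell
# List the configured NAS datastores; output is one line per share
# naming the label, export path, source host and mount state
esxcfg-nas -l
```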

The NFS share configuration info is maintained in the /etc/vmware/esx.conf file and looks like the following example.

/nas/uss1-nas-vol1/enabled = "true"
/nas/uss1-nas-vol1/host = "uss1-nas"
/nas/uss1-nas-vol1/readOnly = "false"
/nas/uss1-nas-vol1/share = "/export/uss1-nas-vol1"

If you're trying to change NFS share parameters and the NFS share is not available after a successful configuration, you may have run into a messed-up vmkernel NFS state, and you'll receive the following message:

Unable to get Console path for Mount

You will need to reboot the ESX server to clean it up, so don't mess with anything else until that is performed. (I've wasted a few hours on that buggy VMware kernel NFS client behavior.)

Once the preceding steps are successful, the result will be a NAS based NFS share which is now available, like this example.

Running NFS shares by Mike La Spina

With a working NFS storage system we can now look at optimizing the I/O capability of ZFS and NFS.

VMware performs write operations over NFS using an O_SYNC control flag. This forces the storage system to commit all write operations to disk to ensure VM file integrity. This can be very expensive when it comes to high performance IOPS, especially on a SATA architecture. We could disable our ZIL, aka the ZFS Intent Log, but this could result in severe corruption in the event of a system fault or environmental issue. A much better alternative is to use a non-volatile ZIL device. In this case we have a DDRdrive X1, a 4GB high speed externally powered DRAM bank with a high speed SCSI interface, which also hosts 4GB of flash for long term shutdowns. The DDRdrive X1 I/O capability reaches the 200,000/sec range and up. By using an external UPS power source we can economically prevent ZFS corruption and reap the high speed benefits of DRAM even when unexpected system interruptions occur.

In this blog our storage host is using Seagate ST3500320AS disks, which are challenged to achieve ~180 IOPS, and that I/O rate is under ideal sequential read/write loads. With a cache we can expect that these disks will deliver no greater than 360 IOPS under ideal conditions.
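For some context on why those numbers are so low, here is a rough random-service-time estimate for a 7200 RPM SATA spindle; the ~8.5ms average seek is my assumption, not a measured value for this drive:

```shell
# Rotational latency = half a revolution at 7200 RPM, in ms
# Random IOPS ~= 1000 / (avg seek + rotational latency)
awk 'BEGIN {
  rot = 60000 / 7200 / 2;   # ~4.17 ms rotational latency
  seek = 8.5;               # assumed average seek time, ms
  printf "%.0f random IOPS per spindle\n", 1000 / (seek + rot)
}'
# -> 79 random IOPS per spindle
```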

Now let's see if this is true based on some load tests using Microsoft's SQLIO tool. First we will disable our ZFS ZIL caching DDRdrive X1, shown here as device c9t0d0.

NAME        STATE     READ WRITE CKSUM
sp1         DEGRADED     0     0     0
  mirror-0  ONLINE       0     0     0
    c6t1d0  ONLINE       0     0     0
    c6t2d0  ONLINE       0     0     0
logs
  c9t0d0    OFFLINE      0     0     0

Now let's run the SQLIO test for 5 minutes with random 8K I/O write requests, which are simply brutal for any SATA disk to keep up with. We have defined a file size of 32GB to ensure we hit the disks by exceeding our 2GB cache memory footprint. As you can see from the output we achieve 227 IOs/sec, which is below the mirrored drive pair capability.

C:\Program Files\SQLIO>sqlio -kW -s300 -frandom -o4 -b8 -LS -Fparam.txt
sqlio v1.5.SG
using system counter for latency timings, 3579545 counts per second
parameter file used: param.txt
file c:\testfile.dat with 2 threads (0-1) using mask 0x0 (0)
2 threads writing for 300 secs to file c:\testfile.dat
using 8KB random IOs
enabling multiple I/Os per thread with 4 outstanding
using specified size: 32768MB for file: c:\testfile.dat
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec:   227.76
MBs/sec:     1.77

latency metrics:
Min_Latency(ms): 8
Avg_Latency(ms): 34
Max_Latency(ms): 1753
histogram:
ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%:  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1 29  7  3  2  1  1  1 54

new  name   name  attr  attr lookup rddir  read read  write write
file remov  chng   get   set    ops   ops   ops bytes   ops bytes
0     0     0   300     0      0     0     3   16K   146 1.12M /export/uss1-nas-vol1
0     0     0   617     0      0     0     0     0   309 2.39M /export/uss1-nas-vol1
0     0     0   660     0      0     0     0     0   329 2.52M /export/uss1-nas-vol1
0     0     0   677     0      0     0     0     0   338 2.63M /export/uss1-nas-vol1
0     0     0   638     0      0     0     0     0   321 2.46M /export/uss1-nas-vol1
0     0     0   496     0      0     0     0     0   246 1.88M /export/uss1-nas-vol1
0     0     0    44     0      0     0     0     0    21  168K /export/uss1-nas-vol1
0     0     0   344     0      0     0     0     0   172 1.32M /export/uss1-nas-vol1
0     0     0   646     0      0     0     0     0   323 2.51M /export/uss1-nas-vol1
0     0     0   570     0      0     0     0     0   285 2.20M /export/uss1-nas-vol1
0     0     0   695     0      0     0     0     0   350 2.72M /export/uss1-nas-vol1
0     0     0   624     0      0     0     0     0   309 2.38M /export/uss1-nas-vol1
0     0     0   562     0      0     0     0     0   282 2.15M /export/uss1-nas-vol1
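The throughput figures are internally consistent: 227.76 IOPS of 8KB writes works out to the reported ~1.77MB/s, confirming we are bound by the spindles rather than the wire:

```shell
# 227.76 IOs/sec x 8 KB per IO, converted to MB/s
awk 'BEGIN { printf "%.2f MB/s\n", 227.76 * 8 / 1024 }'
# -> 1.78 MB/s (SQLIO reports 1.77 due to rounding)
```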


Now let's enable the DDRdrive X1 ZIL cache and see where that takes us.

NAME        STATE     READ WRITE CKSUM
sp1         ONLINE       0     0     0
  mirror-0  ONLINE       0     0     0
    c6t1d0  ONLINE       0     0     0
    c6t2d0  ONLINE       0     0     0
logs
  c9t0d0    ONLINE       0     0     0

Again we run the identical SQLIO test and the results are dramatically different; we immediately see a 4X improvement in IOPS, but what's much more important is the reduction in latency, which will make any database workload fly.

C:\Program Files\SQLIO>sqlio -kW -s300 -frandom -o4 -b8 -LS -Fparam.txt
sqlio v1.5.SG
using system counter for latency timings, 3579545 counts per second
parameter file used: param.txt
file c:\testfile.dat with 2 threads (0-1) using mask 0x0 (0)
2 threads writing for 300 secs to file c:\testfile.dat
using 8KB random IOs
enabling multiple I/Os per thread with 4 outstanding
using specified size: 32768 MB for file: c:\testfile.dat
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec:   865.75
MBs/sec:     6.76

latency metrics:
Min_Latency(ms): 0
Avg_Latency(ms): 8
Max_Latency(ms): 535
histogram:
ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%: 56 13  9  3  1  0  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  7

new  name   name  attr  attr lookup rddir  read read  write write
file remov  chng   get   set    ops   ops   ops bytes   ops bytes
0     0     0   131     0      0     0     0     0    66  516K /export/uss1-nas-vol1
0     0     0 3.23K     0      0     0     0     0 1.62K 12.8M /export/uss1-nas-vol1
0     0     0    95     0      0     0     2    8K    43  324K /export/uss1-nas-vol1
0     0     0 2.62K     0      0     0     0     0 1.31K 10.3M /export/uss1-nas-vol1
0     0     0   741     0      0     0     0     0   369 2.78M /export/uss1-nas-vol1
0     0     0 1.99K     0      0     0     0     0  1019 7.90M /export/uss1-nas-vol1
0     0     0 1.34K     0      0     0     0     0   687 5.32M /export/uss1-nas-vol1
0     0     0   937     0      0     0     0     0   468 3.62M /export/uss1-nas-vol1
0     0     0 2.60K     0      0     0     0     0 1.30K 10.3M /export/uss1-nas-vol1
0     0     0 2.02K     0      0     0     0     0 1.01K 7.84M /export/uss1-nas-vol1
0     0     0 1.91K     0      0     0     0     0   978 7.58M /export/uss1-nas-vol1
0     0     0 1.94K     0      0     0     0     0   992 7.67M /export/uss1-nas-vol1
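Putting the two runs side by side quantifies the claim: roughly a 4X IOPS gain, and better than a 4X cut in average latency:

```shell
# IOPS improvement: ZIL device online vs offline
awk 'BEGIN { printf "%.1fx IOPS\n", 865.75 / 227.76 }'
# -> 3.8x IOPS

# Average latency improvement: 34 ms down to 8 ms
awk 'BEGIN { printf "%.2fx lower latency\n", 34 / 8 }'
# -> 4.25x lower latency
```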

DDRdrive X1 Performance Chart by Mike La Spina


NFSStat Chart I/O DB Cache Compare by Mike La Spina


When we look at ZFS ZIL caching devices there are some important elements to consider. For most provisioned VMware storage systems you do not require large volumes of ZIL cache to generate good I/O performance. What you need to do is carefully determine the active data write footprint size. Remember that the ZIL is a write-only world and that those writes will be relocated to slower disk at some point. These relocation functions are processed in batches, or as Ben Rockwood likes to say, in a regular breathing cycle. This means that random I/O operations can be queued up and converted to a more sequential behavior characteristic. Random synchronous write operations can be safely acknowledged immediately and then the ZFS DMU can process them more efficiently in the background. Thus if we provision cache devices that are closer to the system bus and have lower latency, the back end core compute hardware will be able to move the data ahead of the bursting I/O ramp-ups, and we deliver higher IOPS with significantly less cache. Devices like the DDRdrive X1 are a good example of implementing this strategy.
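A back-of-napkin sizing rule falls out of this: the SLOG only needs to absorb the sync-write footprint between transaction group commits, so capacity requirements are modest. Assuming sync writes arrive at the full ~110MB/s a 1Gbe path can deliver, a txg commit interval on the order of 10 seconds, and two txg groups holding ZIL blocks in flight (all assumed figures for illustration):

```shell
# SLOG capacity ~= sync write bandwidth x txg interval x in-flight txgs
awk 'BEGIN {
  bw = 110;       # MB/s, assumed worst-case sync write rate over 1Gbe
  txg = 10;       # seconds per transaction group commit (assumed)
  inflight = 2;   # txg groups that can hold ZIL blocks concurrently
  printf "%.1f GB\n", bw * txg * inflight / 1024
}'
# -> 2.1 GB
```

Roughly 2GB of in-flight sync writes is the worst case under those assumptions, which is why a 4GB device like the X1 is ample for this class of workload.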

I hope you found this blog entry to be interesting and useful.

Regards,

Mike


Site Contents: © 2010  Mike La Spina

30 Comments

  • Chris says:

    Those devices are nice, but what happens in power outages? You don't want to lose your log device. Here is the one I use; it is a little more expensive I think, but it backs up the data and has a battery backup.

    http://www.acard.com.tw/english/fb01-product.jsp?idno_no=270&prod_no=ANS-9010&type1_title=%20Solid%20State%20Drive&type1_idno=13

  • PiroNet says:

    Great post, thx for that!
    For those who are interested, you can enable dedup with the following commands:
    zfs set dedup=on tank/
    For example zfs set dedup=on tank/myvmfs

    Cheers,
    Didier

  • Hey Chris,

    Thanks for your comment.

    I am aware of the Acard product line, but I think I would choose otherwise as they don't really compare to a bus driven SCSI device. They are publishing performance results that are very questionable; for example, how does a SATA interface that can only produce 20,000 IOPS deliver 210,000? I would say the performance results are not based on trustworthy co-validated results. Additionally, they're also recommending an external UPS power source, which is what you would use for the DDRdrive X1. The DDRdrive X1 could be set up to back up the DRAM to flash on power loss. But I would think that you would provide your complete system a UPS power source in addition to the DDRdrive X1's UPS and thus shut down properly in an environmental power loss state. Also what happens if the system is power cycled accidentally? Will the flash backup operation prevent the SLOG from being available? That event would render it failed and offline, and that's an even more serious state to experience.

    Just my thoughts on it.

    Regards,

    Mike

  • Niklas says:

    Thanks for the very valuable information.

  • Thanks for the info, Mike. Attacking latency with high performance I/O is a perfect combination for the DDRDrive, ZFS, and VMWare.
    — richard

  • Luiz Ozaki says:

    Does IOMeter make SYNC IO operations ?

  • To the best of my knowledge it does not set the sync flag on.

    Don’t forget that the ZIL cache is only used on small block I/O and all larger block patterns are delivered directly to the disks. It is important to keep this in mind when considering the economics of ZIL-based acceleration of ZFS storage pools. Likewise, the implication is that for large I/O nothing replaces fast rust. That said, the use of ZIL in mixed IO environments like VMware (iSCSI or NFS) reduces the IO pressure on your spinning rust by making small IO patterns look like big ones. The theory goes: If you can significantly shift IO from 4-8K block patterns to 32-64K patterns through coalescence in the ZIL, the aggregate will be higher IOPS and BW overall (like shifting the torque curve on a small block to realize more power.)

    The issue is more obvious in NFS than iSCSI under VMware because of the sync issue. However, some tradeoffs still exist beyond the sync-write problem, or pool raid/mirror group and numbers, and the chief one (IMHO) would be response time related to queuing. Too much latency can be a killer too, and trading BW for response time is another compromise that can be remedied (somewhat) with ZIL and ARC/L2ARC caching. Queue depth tunables are available both on ESX and ZFS but require a sound understanding of workload and storage makeup (hard to optimize a 2-disk SATA mirror…)

    As for IOmeter, the direct I/O tunable in IOmeter attempts to defeat local write caching, but does nothing to change the way NFS operates. There are tunables on the VMware side that should be considered in NFS environments, recapped by Scott Lowe at http://blog.scottlowe.org/2010/01/31/emc-celerra-optimizations-for-vmware-on-nfs/ since ESX does not assume an NFS storage solution and is only minimally configured for it.

    Great article and depth of content, Mike!

  • Andrew says:

    I’m seeing different results from yours. Out of curiosity, what was the hardware you used to test the ESX box in this article?

  • Andrew,
    The ESX 4.x VM instance was running on an IBM x346 host with an OpenSolaris storage head running on the same system hardware listed in the blog.

    GA-EP45-DS3L Mobo with an Intel 3.2Ghz E8500 Core Duo
    DDRdrive X1
    1 x 70GB OS Disk
    2GB of Ram
    1 x Intel Pro 1000 PT Quad Network Adapter

  • Huy says:

    Hi Mike, thanks for the great articles, they’ve provided tremendous insight.

    I was wondering if you’ve ever come across ESXi 4 throwing lost connection errors.
    I have Solaris 10 update 8 on a Thumper x4500 connected to two ESXi 4 servers, a PE2900 and an HP DL360.

    I only seem to run across this issue right after major zfs operations like zfs send, or zfs destroy.

    In most cases the connection is restored within a few seconds, although sometimes it takes longer, in which case my VMs start to crash.

    Any help would be greatly appreciated. Thanks!

  • Hi Huy,

    I run OpenSolaris snv_101 and snv_134 (Indiana) and neither exhibit that behavior.
    Which storage protocol are you running, NFS or iSCSI? Are you running a SLOG?
    Where are your VMware memory swap files pointing to? Local VMware disk or the shared store?
    How many snaps are you keeping?

    If you’re running NFS then make sure you set NFS.MaxVolumes to 64 on VMware, which will also allocate a larger buffer.
    As a workaround for Windows hosts, increase the disk timeout registry value HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Disk\TimeoutValue.
    Set it to 180 decimal.
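
    For reference, those two tweaks can be applied roughly as follows; run the first on the ESX host’s service console and the second inside each Windows guest. Treat this as a sketch and verify the option names against your ESX version:

```shell
# On the ESX host: raise the NFS volume limit (a host reboot is needed
# for the change to take effect)
esxcfg-advcfg -s 64 /NFS/MaxVolumes

# Inside each Windows guest: raise the disk I/O timeout to 180 seconds
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Disk" /v TimeoutValue /t REG_DWORD /d 180
```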

    I would guess you’re seeing write delays on sync-based disk I/O calls.

    Regards,

    Mike

  • Huy says:

    Hi Mike,

    Wow thanks for the quick reply!

    I’m running NFS and I don’t have a SLOG. VMware memory swapfiles are located with the VMs on the NFS datastore. Since I’m using free ESXi and vMotion is not available, do you recommend moving the swap files to the local datastore?

    Most of my VMs have no snaps but one or two have a few. How does this impact performance?
    For backups I do zfs snapshots daily and zfs send them to an external usb drive once a week. It seemed the cheapest fastest route given the slow transfer over the vi client.

    I didn’t realize changing the max allowed NFS store would increase the buffer as well. I had only planned to use three NFS datastores at most so didn’t think that step was necessary. I will try that once things slow down at my company. The lost connection issues don’t pop up as much if I limit the zfs operations and make sure not too much is happening at once.

    Would you recommend switching over to OpenSolaris from Solaris 10? There shouldn’t be any problems with exporting the ZFS pool created in Solaris 10 and then importing it into an OpenSolaris installation, would there?

    Again much thanks for your insights!

    Regards,
    Huy

  • Huy,

    You’re welcome!

    Since you’re using NFS, your best course would be to add a mirrored SLOG to the system. When you take a ZFS snapshot there is a point at which all in-flight write requests are committed, thus allowing a consistent static state for the snapshot to reference. Without a SLOG this must be committed to the slow physical disks. I highly suspect that is the suspend state you’re experiencing, and moving to OpenSolaris would not resolve it. There are no issues exporting and importing ZFS pools provided that the source version of ZFS is supported on the target.
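
    Adding a mirrored SLOG is a one-line pool operation. A sketch, with hypothetical pool and device names:

```shell
# Attach a mirrored log device pair to the pool (confirm your actual
# device paths with `format` before running this)
zpool add tank log mirror c4t1d0 c4t2d0

# Verify: the mirrored vdev should now be listed under "logs"
zpool status tank
```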

    Snapshots at the VMware level cause significant increases in VMware disk I/O and lock activity and thus kill your performance.

    For your VMware swaps you need to monitor whether there is swap activity first. If swapping is occurring I would suggest creating an iSCSI LUN for VMFS and placing the swaps on that store at the VMware host level. iSCSI block storage will not use sync-based I/O unless the application specifies it with a SCSI sync flag, and thus you get the benefit of cached I/O on that shared swap store.

    Using the local disk store is an option as well. I prefer shared stores because it allows fast failure recovery from a hypervisor instance outage.

    Regards,

    Mike

  • Jason says:

    Hi Mike,

    We’re using a pair of x4500s and the ZFS/NFS combo as part of our ESX 3.5 environment. We are currently planning a migration to vSphere and I notice that the x4500 is only supported on 3.5. Have you heard anything about why this is and more importantly if it will change in the near future?

    Thanks in advance,

    Jason

  • Hi Jason,

    I have not installed ESX4 on the x4500 but I see no reason for it to be an issue. The supported list only covers servers that are certified, and while the Sun/Oracle merger is going to delay those efforts, it will not stop them.
    Also, I have run the x4500 as an NFS/iSCSI target with ESX4 consumers since its release and with the current patch levels. No issues to date.

    Regards,
    Mike

  • Jason says:

    Thanks Mike for the quick response and your insight!

    Do you have a sense of what position VMware takes when dealing with support issues? Does one piece of your environment not being on the supported list cause them to not support the entire setup, or make them more likely to blame the piece they don’t “support”?

    I find it unlikely that they are going to go deep into the bowels of Solaris 10 to help us with issues regardless of the gear being supported or not. :-)

    Thanks again,

    Jason

  • Jason,

    I have not seen that behavior from VMware; however, I rarely need to call them. They tend to help first, and if they hit a wall they would ask that you rule out the unknown before they proceed.
    They also would not go deep into any other vendor’s system bowels. EMC, NetApp, various Engenio flavors, it’s all the same: they will point you to the vendor for a solution. You see the disclaimers on multipathing and round robin in every VMware certified solution doc.

    Regards,

    Mike

  • Chris Twa says:

    Excellent post! Thank you very much for contributing your experiences

  • Gil Vidals says:

    I’ve read this blog through a couple of times, but one thing still doesn’t make sense to me:

    For eNAS-Interface1 you should define one active and one standby vmnic. This will ensure proper interface fail-over in all failure modes. The VMware NFS kernel instance will only use a single vmnic so you’re not losing any bandwidth. The vmnic team only serves as a fault-tolerant connection and is not a load-balanced configuration.

    Since the NAS is configured with IPMP and is connected to two switches, but the VM only has one NIC that is active (the other is standby), how does this topology work when traffic flows from the NAS to the switch that the standby cable is connected to?

  • Gil,

    Since the switches are bridged, data can flow across the active VM NIC path to either IPMP virtual binding. IPMP uses a ping probe to determine if a path is up. Should a path fail (e.g. a switch failure on either the active or standby side), the result is a transition to the standby side if the active end dies, and no change if the standby path fails. IPMP detects if either side is down and transfers to the only link which responds. The VM traffic is never active on the standby path while the VM is bound to the team and no failure is present, because the standby interface presents no response to the IPMP ping probes.
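
    For readers wiring this up, a minimal probe-based IPMP group on the OpenSolaris side looks roughly like this (the classic pre-Crossbow syntax; interface names and addresses are hypothetical, so adapt them to your network):

```shell
# Fixed test addresses, one per NIC, marked so data never fails over to them
ifconfig e1000g0 plumb 10.1.0.11/24 group nas-ipmp deprecated -failover up
ifconfig e1000g1 plumb 10.1.0.12/24 group nas-ipmp deprecated -failover up

# The floating data address that migrates between group members on failure
ifconfig e1000g0 addif 10.1.0.10/24 up
```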

    Regards,

    Mike

  • [...] Running ZFS over NFS as a VMware store [...]

  • martijn says:

    Hey,
    We have about the same configuration up and running. What we want now is offsite backups of our vmdk files.
    I was wondering if you, or anyone else, have experience with getting the .vmdk disk images to a server on a colocation with zfs send/receive?

  • Hi Martijn,

    Many of us use ZFS send/receive to replicate vmdk’s to remote hosts, local devices, or ZFS file streams. If you describe your desired result, we may have someone who can share their experience with it.

    For some examples of remote host ZFS replication have a look at this entry.
    http://blog.laspina.ca/ubiquitous/encapsulating-vt-d-accelerated-zfs-storage-within-esxi
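
    A typical remote replication cycle looks like this (dataset, snapshot, and host names here are hypothetical):

```shell
# Initial full replication of the VM store to a remote host
zfs snapshot tank/vmstore@rep-1
zfs send tank/vmstore@rep-1 | ssh backuphost zfs receive -F backup/vmstore

# Subsequent runs ship only the incremental delta between snapshots
zfs snapshot tank/vmstore@rep-2
zfs send -i rep-1 tank/vmstore@rep-2 | ssh backuphost zfs receive backup/vmstore
```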

    Regards,
    Mike

  • Gil Vidals says:

    Hi Mike,
    I’d like to know if you still believe that “all failure modes” applies with vSphere 5. I know many changes have been made in vSphere since you wrote this blog.

    For eNAS-Interface1 you should define one active and one standby vmnic. This will ensure proper interface fail-over in all failure modes.

    Thanks,
    Gil

  • Hi Gil,

    vSphere 5 may have changed the observation I encountered on 4.1; however, I have not tested it as of yet. The issue discovered was layer 2 VLAN loss failure detection, e.g. someone or something changes the VLAN tag on one trunk and the link stays up, thus no physical link signal loss is acted on. It may be possible to resolve it using beacon probing at the team port level.

    Regards,
    Mike

  • Anonymous says:

    So, given that NFS clients use the O_SYNC flag to ensure that writes are actually committed to storage, how is that different from the behavior of iSCSI clients (initiators)? I would hope that iSCSI clients use similar semantics to ensure that writes are actually committed to storage. Shouldn’t write performance be similar, given a choice between NFS and iSCSI?

  • There are significant differences between block storage and file storage systems. In the case of iSCSI it’s just a transport, and we look to the initiator-to-target block storage processing for the handling. VMware’s VMFS driver can choose whether to write asynchronously or not based on the upstream virtual machine SCSI command semantics: if the command descriptor block (CDB) sync flag is on from the VM, it will wait for a commit CDB response and pass it back to the caller; otherwise it is free to buffer and write asynchronously. NFS does not map to a VM CDB, it must emulate, and it cannot map to those semantics without integrity risk and performance penalties, thus it uses the safer of the two modes, which is everything over O_SYNC. If async I/O were allowed over NFS you would eventually have serious corruption issues.
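
    The cost of that O_SYNC behaviour is easy to feel locally. As an illustration only (GNU dd on any local filesystem; the paths are throwaway), compare a buffered write run against one that commits every block synchronously, much like an NFS client must:

```shell
# Buffered (async) writes: the kernel is free to cache and coalesce
dd if=/dev/zero of=/tmp/async.bin bs=4k count=1000 2>/dev/null

# oflag=sync forces an O_SYNC-style commit per block; on spinning disks
# this is typically an order of magnitude slower than the run above
dd if=/dev/zero of=/tmp/sync.bin bs=4k count=1000 oflag=sync 2>/dev/null
```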
