Running ZFS over NFS as a VMware Store


NFS is definitely a very well rounded high performance file storage system and it certainly serves VMware Stores successfully over many storage products. Recently one of my subscribers asked me if there was a reason why my blogs were more centric to iSCSI. Thus the question was probing for a answer to a question many of us ask ourselves. Is NFS superior to block based iSCSI and which one should I choose for VMware. The answer to this question is not which protocol is superior but which protocol serves to provision the features and function you require most effectively. I use both protocols and find they both have desirable capability and functionality and conversely have some negative points as well.

NFS typically is generally more accessible because its a file level protocol and sits higher up on the network stack. This makes it very appealing when working with VMware virtual disks aka vmdk’s simply because they also exist at the same layer. NFS is ubiquitous across NAS vendors and can be provisioned by multiple agnostic implementation endpoints.  An NFS protocol hosts the capability to be virtualized and encapsulated within any Hypevisor instance either clustered or standalone. The network file locking and share semantics of NFS grant it a multitude of configurable elements which can serve a wide range of applications.

In this blog entry we will explore how to implement an NFS share for VMware ESX using OpenSolaris and ZFS. We will also explore a new way of accelerating the servers I/O performance with a new product called the DDRdrive X1.

OpenSolaris is an excellent choice for provisioning NFS storage volumes on VMware.  It hosts many advanced desirable storage features that set it far ahead of other Unix flavors. We can use the advanced networking features and ZFS including the newly integrated dedup functionality to craft the best NFS functionality available today.

Let start by examining the overall NAS storage architecture.


NFS OpenSolaris/VMware Architecture by Mike La Spina



In this architecture we are defining a fault tolerant configuration using two physical 1Gbe switches with a quad or dual Ethernet adapter(s). On the OpenSolaris storage head we are using IPMP aka IP Multipathing to establish a single IP address to serve our NFS store endpoint. A single IP is more appropriate for VMware environments as they do not support multiple NFS IP targets per NFS mount point.  IPMP provisions layer 3 load balancing and interface fault tolerance. IPMP commonly uses ICMP and default routes to determine interface failure states thus it well suited for a NAS protocol service layer. In a effort to reduce excessive ICMP rates we will aggregate the two dual interfaces into a single channel connection to each switch. This will allow us to define two test IP addresses for the IPMP service and keep our logical interface count down to a minimum. We are also defining a 2 port trunk/aggregate between the two physical switches which provides more path availability and reduces  switch failure detection times.

On the ESX host side we are defining 1 interface per switch. This type of configuration requires that only one of the VMware interfaces is an active team member vmnic within a single vSwitch definition. If this is not configured this way the ESX host will fail to detect and activate the second nic under some failure modes. This is not a bandwidth constraint issue since the vmkernel IP interface will only activity use one nic.

With an architecture set in place let now explore some of the pros and cons of running VMware on Opensolaris NFS.

Some of the obvious pros are:

  • VMware uses NFS in a thin provisioned format.
  • VMDKs are stored as files and are mountable over a variety of hosts.
  • Simple backup and recovery.
  • Simple cloning and migration.
  • Scalable storage volumes.

And some of the less obvious pros:

  • IP based transports can be virtualized and encapsulated for disaster recovery.
  • No vendor lock-in
  • ZFS retains NFS share properties within the ZFS filesystem.
  • ZFS will dedup VMDKs files at the block level.

And there are the cons:

  • Every write I/O from VMware is an O_SYNC write.
  • Firewall setups are complex.
  • Limited in its application. Only NFS clients can consume NFS file systems.
  • General  protocol security challenges. (RPC)
  • VMware kernel constraints
  • High CPU overhead.
  • Bursty data flow.

Before we break out into the configuration detail level lets examine some of the VMware and NFS behaviors so as to gain some in site into the reason I primarily use iSCSI for most VMware implementations.

I would like demonstrate some characteristics that are primarily a VMware client side behavior and it’s important that you are aware of them when your considering NFS as a Datastore.

This VMware performance chart of an IOMeter generated load reveals the burst nature of the NFS protocol. The VMware NFS client exclusively uses a O_SYNC flag on write operations which requires a committed response for the NFS server. At some point the storage system will not be able to complete every request and thus a pause in transmission will occur. The same occurs on reads when the network component buffers reach saturation. In this example chart we are observing a single 1Gbe interface at saturation from a read stream.


NFS VMware Network I/O Behavior by Mike La Spina


In this output we are observing a read stream across vh0 which is one of two active ESX4 host VMs loading our OpenSolaris NFS store and we can see the maximum network throughput is achieved which is ~81MB/s. If you examine the average value of 78MB/s you can see the burst events do not have significant impact and is not a bandwidth concern with ~3MB/s of loss.


NFS VMware Network Read I/O Limit Behavior by Mike La Spina


At the same time we are recording this write stream chart on vh3 a second ESX 4 host loading the same NFS OpenSolaris store.  As I would expect, its very similar to the read stream except that we can see the write performance is lower and that’s to be expected with any write operations. We can also identify that we are using a full duplex path transmission across to our OpenSolaris NFS host since vh0 is reading (recieving) and vh3 is writing(transmitting).


NFS VMware Network Write I/O Limit Behavior by Mike La Spina


In this chart we are observing a limiting characteristic of the VMware vmkernel NFS client process. We have introduced a read stream in combination with a preexisting active write stream on a single ESX host. As you can see the transmit and receive packet rates are both reduced and now sum to a maximum of ~75MB/s.



NFS VMware Network Mixed Read Write I/O Limit Behavior by Mike La Spina


Transitioning from read to write active streams confirms the transmission is limit to ~75Mb/s regardless the full duplex interface capability.  This information demonstrates that a host using 1Gbe ethernet connections will be constrained based on its available resources. This is a important element to consider when using NFS as a VMware datastore.


NFS VMware Network Mixed Read Write I/O Flip Limit Behavior by Mike La Spina


Another important element to consider is the CPU load impact of running the vmkernel NFS client. There is a significant CPU cycle cost on VMware hosts and this is very apparent under heavier loads. The following screen shot depicts a running IOmeter load test against our OpenSolaris NFS store. The important elements are as follows. IOMeter is performing 32KB reads in a 100% sequential access mode which drives a CPU load on the VM of ~35% however this is not the only CPU activity that occurs for this VM.


NFS IOMeter ZFS Throughput 32KB-Seq


When we examine the ESX host resource summary for the running VM we can now observe the resulting overhead load which is realized by viewing the Consumed Host CPU value. The VM in this case is granted 2 CPUs each are a 3.2Ghz Intel hypervisor resource. We can see that the ESX host is running at 6.6Ghz to drive the vmkernel NFS I/O load.


NFS VMware ESX 4 CPU Load


Lets see the performance chart results when we svMotion the activily loaded running VM on the same ESX host to an iSCSI VMFS based store on the same OpenSolaris storage host. The only elements changing in this test are the underlying storage protocols. Here we can clearly see CPU object 0 is the ESX host CPU load. During the svMotion activity we begin to see some I/O drop off due to the addition background disk load. Finally we observe the VM transition at the idle point and the resultant CPU load of iSCSI I/O impact. We clearly see the ESX host CPU load drop from 6.6Ghz to 3.5Ghz which makes it very apparent the NFS requires substantially higher CPU that iSCSI.


VM Trasitioned with vMotion from NFS to iSCSI on same ZFS Storage host


With the svMotion completed we now observe the same IOMeter screen shot retake and its very obvious that our throughput and IOPS have increased significantly and the VM granted CPU load has not changed significantly.   A decrease of ESX host CPU load in the order of  ~55% and and increase of ~32% in IOPS and 45% of throughput shows us there are some negative behaviors to be cognizant of. Keep in mind that this is not that case when the I/O type is small and random like that of a Database in those cases  NFS is normally the winner, however VMware normally hosts mixed loads and thus we need to consider this negative effect at design time and when targeting VM I/O characteristics.


iSCSI IOMeter ZFS X1DDR Cache Throughput 32KB-Seq Mike La Spina

iSCSI ESX 4 CPU Load by Mike La Spina


With a clear understanding of some important negative aspects to implementing NFS for VMware ESX hosts we can proceed to the storage system build detail. The first order of business is the hardware configuration detail. This build is simply one of my generic white boxes and it hosts the following hardware:


GA-EP45-DS3L Mobo with an Intel 3.2Ghz E8500 Core Duo

1 x 70GB OS Disk

2 x 500GB SATA II ST3500320AS disks

2GB of Ram

1 x Intel Pro 1000 PT Quad Network Adapter


As a very special treat on this configuration I am also privileged to run an DDRDrive X1 Cache Accelerator which I am currently testing some newly developed beta drivers for OpenSoalris. Normally I would use 4GB of ram as a minimum but I needed to constraint this build in a effort to load down the dedicated X1 LOG drive and the physical SATA disks thus this instance is running only 2GB of ram. In this blog entry I will not be detailing the OpenSolaris install process, we will begin from a Live CD installed OS.

OpenSolaris will default to a dynamic network service configuration named nwam, this needs to be disabled and the physical:default service enabled.

root@uss1:~# svcadm disable svc:/network/physical:nwam
root@uss1:~# svcadm enable svc:/network/physical:default

To establish an aggregation we need to un-configure any interfaces that we previously configured before proceeding.

root@uss1:~# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
e1000g0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 10.1.0.1 netmask ffff0000 broadcast 10.255.255.255
ether 0:50:56:bf:11:c3
lo0: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
inet6 ::1/128

root@uss1:~# ifconfig e1000g0 unplumb

Once cleared the assignment of the physical devices is possible using the following commands

dladm create-aggr –d e1000g0 –d e1000g1 –P L2,L3 1
dladm create-aggr –d e1000g2 –d e1000g3 –P L2,L3 2

Here we have set the policy allowing layer 2 and 3 and defined two aggregates aggr1 and aggr2. We can now define the VLAN based interface shown here as VLAN 500 instances 1 are 2 respective of the aggr instances. You just need to apply the following formula for defining the VLAN interface.

(Adaptor Name) + vlan * 1000 + (Adaptor Instance)

ifconfig aggr500001 plumb up 10.1.0.1 netmask 255.0.0.0
ifconfig aggr500002 plumb up 10.1.0.2 netmask 255.0.0.0

Each pair of interfaces needs to be attached to a trunk definition on its switch path. Typically this will be a Cisco or HP switch in most environments. Here is a sample of how to configure each brand.

Cisco:

configure terminal
interface port-channel 1
interface ethernet 1/1
channel-group 1
interface ethernet 1/2
channel-group 1
interface ethernet po1
switchport mode trunk allowed vlan 500
exit

HP Procurve:

trunk 1-2 trk1 trunk
vlan 500
name “eSAN1″
tagged trk1

 

Once we have our two physical aggregates setup we can define the IP multipathing interface components.  As a best practice we should define the IP addresses in our hosts file and then refer to those names in the remaining configuration tasks.

Edit /etc/hosts to have the following host entries.

::1 localhost
127.0.0.1 uss1.local localhost loghost
10.0.0.1 uss1 uss1.domain.name
10.1.0.1 uss1.esan.data1
10.1.0.2 uss1.esan.ipmpt1
10.1.0.3 uss1.esan.ipmpt2

Here we have named the IPMP data interface aka a public IP as uss1.esan-data1 this ip will be the active connection for our VMware storage consumers.  The other two named uss1.esan-ipmpt1 and uss1.esan-ipmpt2 are beacon probe  IP test addresses and will not be available to external connections.

IPMP functionallity is included with OpenSolaris and is configured with the ifconfig utility. The follow sets up the first aggregate with a real public IP and a test address. The deprecated keyword defines the IP as a test address and the failover keyword defines if the IP can be moved in the event of interface failure.

ifconfig aggr500001 plumb uss1.esan.ipmpt1 netmask + broadcast + group ipmpg1 deprecated -failover up addif uss1.esan.data1 netmask + broadcast + failover up
ifconfig aggr500002 plumb uss1.esan.ipmpt1 netmask + broadcast + group ipmpg1 deprecated -failover up

To persist the IPMP network configuration on boot you will need to create hostname files matching the interface names with the IPMP configuration statement store in them. The following will address it.

echo uss1.esan.ipmpt1 netmask + broadcast + group ipmpg1 deprecated -failover up addif uss1.esan.data1 netmask + broadcast + failover up > /etc/hostname.aggr500001

echo uss1.esan.ipmpt1 netmask + broadcast + group ipmpg1 deprecated -failover up > /etc/hostname.aggr500002

The resulting interfaces will look like the following:

root@uss1:~# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
aggr1: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 2
inet 10.1.0.2 netmask ff000000 broadcast 10.255.255.255
groupname ipmpg1
ether 0:50:56:bf:11:c3
aggr2: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 3
inet 10.1.0.3 netmask ff000000 broadcast 10.255.255.255
groupname ipmpg1
ether 0:50:56:bf:6e:2f
ipmp0: flags=8001000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,IPMP> mtu 1500 index 5
inet 10.1.0.1 netmask ff000000 broadcast 10.255.255.255
groupname ipmpg1
lo0: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
inet6 ::1/128

This network configuration yields 2, 2Gbe aggregate paths bound to a single logical active IP address on 10.1.0.1, with  interfaces aggr1 and aggr2 the keyword deprecated directs the IPMP mpathd service daemon to prevent application session connection packets establishment and the nofailover keyword instructs mpathd not to allow the bound IP to failover to any other interface in the IPMP group.

There are many other possible configurations but I prefer this method because it remains logically easy to diagnose and does not introduce unnecessary complexity.

Now that we have layer 3 network connectivity we should establish the other essential OpenSolaris static TCP/IP configuration elements. We need to ensure we have a persistent default gateway and our DNS client resolution enabled.

The persistent default gateway is very simple to define as is done with the route utility command as follows.

root@uss1:~# route -p add default 10.1.0.254
add persistent net default: gateway

When using NFS I prefer provisioning name resolution as a additional layer of access control. If we use names to define NFS shares and clients we can externally validate the incoming IP  with a static file or DNS based name lookup. An OpenSolaris NFS implementation inherently grants this methodology.  When a client IP requests access to an NFS share we can define a forward lookup to ensure the IP maps to a name which is granted access to the targeted share. We can simply define the desired FQDNs against the NFS shares.

In small configurations static files are acceptable as is in the case here. For large host farms the use of a DNS service instance would ease the admin cycle. You would just have to be careful that your cached TimeToLive (TTL) value is greater that 2 hours thus preventing excessive name resolution traffic. The TTL value will control how long the name is cached and this prevents constant external DNS lookups.

To configure name resolution for both file and DNS we simply copy the predefined config file named nsswitch.dns to the active config file nsswitch.conf as follows:

root@uss1:~# cp /etc/nsswitch.dns /etc/nsswitch.conf

Enabling DNS will require the configuration of our /etc/resolv.conf file which defines our name servers and namespace.

e.g.

root@ss1:~# cat /etc/resolv.conf
domain laspina.ca
nameserver 10.1.0.200
nameserver 10.1.0.201

You can also use the static /etc/hosts file to define any resolvable name to IP mapping.

With OpenSolaris you should always define your NFS share properties using the ZFS administrative tools. When this method is used we can the take advantage of keeping the NFS share properties inside of ZFS. This is really useful when you replicate or clone the ZFS file system to an alternate host as all the share properties will be retained. Here are the basic elements of an NFS share configuration for use on VMware storage consumers.

zfs create -p sp1/nas/vol1
zfs set mountpoint=/export/uss1-nas-vol1 sp1/nas/vol1
zfs set sharenfs=rw,nosuid,root=vh3-nas:vh2-nas:vh1-nas:vh0-nas sp1/nas/vol1

The ACL NFS share property of rw sets the entire share as read write, you could alternately use rw=hostname for each host but it seems redundant to me.  The nosuid prevents any incoming connection from switching user ids for example from a non-root value to 0. Finally the root=hostname property grants the incoming host name access to the share with root access permissions.  Any files created by the host will be as the root id. While these steps are some level of access control it falls well short of secure thus I also keep the NAS subnets fully isolated or firewalled to prevent external network access to the NFS share hosts.

Once our NFS share is up and running we can proceed to configure the VMware network components and share connection properties. VMware requires a vmkernel network interface definition to provision NFS connectivity. You should dedicate a vmnic team and a vswitch for your storage network.

Here is a visual  example of a vmkernel configuration with a teamed pair of vmnics

vmkernel eNAS-Interface by Mike La Spina

As you can see we have dedicated the vSwitch and vmnics on VLAN 500, no other traffic should be permitted on this network. You should also set the default vmkernel gateway to its own address. This will promote better performance as there is no need to leave this network.

For eNAS-Interface1 you should define one active and one standby vmnic. This will ensure proper interface fail-over in all failure modes.  The VMware NFS kernel instance will only use a single vmnic so your not loosing any bandwidth. The vmnic team only serves as a fault tolerant connection and is not a load balanced configuration.

VMkernel Team Stanby by Mike La Spina


At this point you should validate your network connectivity by pinging the vmkernel IP address from the OpenSolaris host. If you chose to ping from ESX use vmkping instead of ping otherwise you will not get a response.

Provided your network connectivity is good you can define your vmkernel NFS share properties. Here is a visual example.

Add an NFS share by Mike La Spina

And if you prefer an ESX command line method:

esxcfg-nas -a -o uss1-nas -s /export/uss1-nas-vol1 uss1-nas-vol1

In this example we are using a DNS based name of uss1-nas. This would allow you to change the host IP without having to reconfigure VMware hosts. You will want to make sure the DNS name cache TTL in not a small value for two reasons. One an DNS outage would impact the IP resolution and as well you do not want excessive resolution traffic on the eSAN subnet(s).

The NFS share configuration info is maintained in the /etc/vmware/esx.conf file and looks like the following example.

/nas/uss1-nas-vol1/enabled = “true”
/nas/uss1-nas-vol1/host = “uss1-nas”
/nas/uss1-nas-vol1/readOnly = “false”
/nas/uss1-nas-vol1/share = “/export/uss1-nas-vol1″

If your trying to change NFS share parameters and the NFS share is not available after a successful configuration you could run into a messed up vmkernel NFS state and you’ll receive the following message:

Unable to get Console path for Mount

You will need to reboot the ESX server to clean it up so don’t mess with anything else until that is performed. (I’ve wasted a few hours on that buggy VMware kernel NFS client behavior).

Once the preceeding steps are successful the result will be a NAS based NFS share which is now available like this example.

Running NFS shares by Mike La Spina

With a working NFS storage system we can now look at optimizing the I/O capability of ZFS and NFS.

VMware performs write operations over NFS using an O_SYNC control flag. This will force the storage system to commit all write operations to disk to ensure VM file integrity. This can be very expensive when it comes to high performance IOPS especially when using SATA architecture. We could disable our ZIL aka ZFS Intent Log but this could result in severe corruption in the event of a systems fault or environmental issue. A much better alternative is to use a non-volatile ZIL device. In this case we have an DDRdrive X1 which is a 4GB high speed externally powered dram bank with a high speed SCSI interface and also hosts 4GB of flash for long term shutdowns.  The DDRdrive X1 IO capability reaches the 200,000/sec range and up. By using an external UPS power source we can economically prevent ZFS corruption and reap the high speed benefits of dram even when unexpected system interruptions occur.

In this blog our storage host is using Seagate ST3500320AS disk which are challenged to achieve ~180 IOPS. And that IO rate is under ideal sequential read write loads. With a cache we can expect that these disks will deliver no greater than 360 IOPS under ideal conditions.

Now lets see if this is true based on some load tests using Microsoft’s SQLIO tool. First we will disable our ZFS ZIL caching DDRdrive X1 show here as device c9t0d0

NAME        STATE     READ WRITE CKSUM
sp1         DEGRADED     0     0     0
mirror-0  ONLINE       0     0     0
c6t1d0  ONLINE       0     0     0
c6t2d0  ONLINE       0     0     0
logs
c9t0d0  OFFLINE      0     0     0

No lets run the SQLIO test for 5 minutes with random 8K I/O write requests which are simply brutal for any SATA disk to keep up with.  We have defined a file size of 32GB to ensure we hit the disk by exceeding our 2GB cache memory foot print. As you can see from the output we achieve 227 IOs/sec which is below the mirrored drive pair capability.

C:\Program Files\SQLIO>sqlio -kW -s300 -frandom -o4 -b8 -LS -Fparam.txt
sqlio v1.5.SG
using system counter for latency timings, 3579545 counts per second
parameter file used: param.txt
file c:\testfile.dat with 2 threads (0-1) using mask 0×0 (0)
2 threads writing for 300 secs to file c:\testfile.dat
using 8KB random IOs
enabling multiple I/Os per thread with 4 outstanding
using specified size: 32768MB for file: c:\testfile.dat
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec:   227.76
MBs/sec:     1.77

latency metrics:
Min_Latency(ms): 8
Avg_Latency(ms): 34
Max_Latency(ms): 1753
histogram:
ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%:  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1 29  7  3  2  1  1  1 54

new  name   name  attr  attr lookup rddir  read read  write write
file remov  chng   get   set    ops   ops   ops bytes   ops bytes
0     0     0   300     0      0     0     3   16K   146 1.12M /export/uss1-nas-vol1
0     0     0   617     0      0     0     0     0   309 2.39M /export/uss1-nas-vol1
0     0     0   660     0      0     0     0     0   329 2.52M /export/uss1-nas-vol1
0     0     0   677     0      0     0     0     0   338 2.63M /export/uss1-nas-vol1
0     0     0   638     0      0     0     0     0   321 2.46M /export/uss1-nas-vol1
0     0     0   496     0      0     0     0     0   246 1.88M /export/uss1-nas-vol1
0     0     0    44     0      0     0     0     0    21  168K /export/uss1-nas-vol1
0     0     0   344     0      0     0     0     0   172 1.32M /export/uss1-nas-vol1
0     0     0   646     0      0     0     0     0   323 2.51M /export/uss1-nas-vol1
0     0     0   570     0      0     0     0     0   285 2.20M /export/uss1-nas-vol1
0     0     0   695     0      0     0     0     0   350 2.72M /export/uss1-nas-vol1
0     0     0   624     0      0     0     0     0   309 2.38M /export/uss1-nas-vol1
0     0     0   562     0      0     0     0     0   282 2.15M /export/uss1-nas-vol1


Now lets enable the DDRdrive X1 ZIL cache and see where that takes us.

NAME        STATE     READ WRITE CKSUM
sp1         ONLINE       0     0     0
mirror-0  ONLINE       0     0     0
c6t1d0  ONLINE       0     0     0
c6t2d0  ONLINE       0     0     0
logs
c9t0d0  ONLINE       0     0     0

Again we run the identical SQLIO test and results are dramatically different, we immediately see a 4X improvement in IOPS but whats much more important is the reduction in latency which will make any database workload fly.

C:\Program Files\SQLIO>sqlio -kW -s300 -frandom -o4 -b8 -LS -Fparam.txt
sqlio v1.5.SG
using system counter for latency timings, 3579545 counts per second
parameter file used: param.txt
file c:\testfile.dat with 2 threads (0-1) using mask 0×0 (0)
2 threads writing for 300 secs to file c:\testfile.dat
using 8KB random IOs
enabling multiple I/Os per thread with 4 outstanding
using specified size: 32768 MB for file: c:\testfile.dat
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec:   865.75
MBs/sec:     6.76

latency metrics:
Min_Latency(ms): 0
Avg_Latency(ms): 8
Max_Latency(ms): 535
histogram:
ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%: 56 13  9  3  1  0  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  7

new  name   name  attr  attr lookup rddir  read read  write write
file remov  chng   get   set    ops   ops   ops bytes   ops bytes
0     0     0   131     0      0     0     0     0    66  516K /export/uss1-nas-vol1
0     0     0 3.23K     0      0     0     0     0 1.62K 12.8M /export/uss1-nas-vol1
0     0     0    95     0      0     0     2    8K    43  324K /export/uss1-nas-vol1
0     0     0 2.62K     0      0     0     0     0 1.31K 10.3M /export/uss1-nas-vol1
0     0     0   741     0      0     0     0     0   369 2.78M /export/uss1-nas-vol1
0     0     0 1.99K     0      0     0     0     0  1019 7.90M /export/uss1-nas-vol1
0     0     0 1.34K     0      0     0     0     0   687 5.32M /export/uss1-nas-vol1
0     0     0   937     0      0     0     0     0   468 3.62M /export/uss1-nas-vol1
0     0     0 2.60K     0      0     0     0     0 1.30K 10.3M /export/uss1-nas-vol1
0     0     0 2.02K     0      0     0     0     0 1.01K 7.84M /export/uss1-nas-vol1
0     0     0 1.91K     0      0     0     0     0   978 7.58M /export/uss1-nas-vol1
0     0     0 1.94K     0      0     0     0     0   992 7.67M /export/uss1-nas-vol1

DDRdrive X1 Performance Chart by Mike La Spina


NFSStat Chart I/O DB Cache Compare by Mike La Spina


When we look at ZFS ZIL caching devices there are some important elements to consider. For most provisioned VMware storage systems you do not require large volumes of ZIL cache to generate good I/O performance.  What you need to do is carefully determine the active data write footprint size. Remember that ZIL is a write only world and that those writes will be relocated to a slower disk at some point. These relocation functions are processed in batches or as Ben Rockwood likes to say in a regular breathing cycle.  This means that random I/O operations can queued up and converted to a more sequential like behavior characteristic. Random synchronous write operations can be safely acknowledged immediately and then the ZFS DMU can process them more efficiently in the background. This means that if we provision cache devices that are closer to the system bus and have lower latency the back end core compute hardware will be able to move the data ahead of the bursting I/O peak up ramps and thus we deliver higher IOPS with significantly less cache requirements. Devices like the DDRdrive X1 are a good example of implementing this strategy.

I hope you found this blog entry to be interesting and useful.

Regards,

Mike

Site Contents: © 2010  Mike La Spina

Securing COMSTAR and VMware iSCSI connections

Connecting VMware iSCSI sessions to COMSTAR or any iSCSI target provider securely is required to maintain a reliable system. Without some level of initiator to target connection gate keeping we will eventually encounter a security event. This can happen from a variety of sources, for example a non-cluster aware OS can connect to an unsecured VMware shared storage LUN and cause severe damage to it since the OS has no shared LUN access knowledge.  All to often we make assumptions that security is about confidentiality when it is actually more commonly about data availability and integrity which will both be compromised if an unintentional connection were to write on a shared LUN.

At the very minimum security level we should apply non-authenticated named initiator access grants to our targets. This low security method defines initiator to target connection states for lower security tolerant environments. This security method is applicable when confidentiality is not as important and security is maintained with the physical access control realm. As well it should also coincide with SAN fabric isolation and be strictly managed by the Virtual System or Storage Administrators. Additionally we can increase access security control by enabling CHAP authentication which is a serious improvement over named initiators. I will demonstrate both of these security methods using COMSTAR iSCSI Providers and VMware within this blog entry.

Before we dive into the configuration details lets examine how LU’s are exposed. COMSTAR controls iSCSI target access using several combined elements. One of these elements is within the COMSTAR STMF facility where we can assign membership of host and target groups. By default if we do not define a host or target group any created target will belong to an implied ALL group. This group as we would expect grants any connecting initiator membership to the ALL group assigned LUN’s. These assignments are called views in the STMF state machine and are a mapping function of the Storage Block Driver service (SBD) to the STMF IT_nexus state tables.

This means that if we were to create an initiator without assigning a host group or host/target group combination, an initiator would be allowed unrestricted connectivity to any ALL group LUN views and possibly without any authentication at all. Allowing this to occur would of course be very undesirable from a security perspective in almost all cases. Conversely if we use a target group definition then only the initiators that connect to the respective target will see the LUN views which are mapped on that target definition instance.

While target groups do not significantly improve access security it does provide a means controlling accessibility based on the definition of interface connectivity classes which in turn can be mapped out on respective VLAN priority groups, bandwidth availability and applicable path fault tolerance capabilities which are all important aspects of availability and unfortunately are seldom considered security concepts in many architectures.

Generally on most simple storage configurations the use of target groups is not a requirement. However they do provide a level of access control with LUN views. For example we can assign LUN views to a target group which in turn frees us from having to add the LUN view to each host group within shared LUN configurations like VMware stores. With combination’s of host and target groups we can create more flexible methods in respect to shared LUN visibility. With the addition of simple CHAP authentication we can more effectively insulate target groups. This is primarily due to the ability to assign separate CHAP user and password values for each target.

Lets look at this visual depiction to help see the effect of using target and host groups.

COMSTAR host and target view depiction

In this depiction any initiator that connects to the target group prod-tg1 will by default see the views that are mapped to that target groups interfaces. Additionally if the initiator is also a member of the host group prod-esx1 those view mapping will also be visible.

One major difference with target groups verses the all group is that you can define LU views on mass to an entire class of initiator connections e.g. a production class. This becomes an important control element in a unified media environment where the use of VLANs separates visibility. Virtual interfaces can be created at the storage server and attached to VLANs respectively. Target groups become a very desirable as a control within a unified computing context.

Named Initiator Access

Enabling named initiator to target using unauthenticated access with COMSTAR and VMware iSCSI services is a relatively simple operation. Let’s examine how this method controls initiator access.

We will define two host groups, one for production esx hosts and one for test esx hosts.

# stmfadm create-hg prod-esx1

# stmfadm create-hg test-esx1

With these host groups defined we individually assign LU’s views to the host groups and then we define any initiator to be a member of one of the host groups to which it would only see the views which belong to the host group and additionally any views assigned to the default all group.

To add a host initiator to a host group, we must first create it in the port provider of choice which in this case is the iSCSI port provider.

# itadm create-initiator iqn.1998-01.com.vmware:vh1.1

Once created the defined initiator can be added to a host group.

# stmfadm add-hg-member -g prod-esx1 iqn.1998-01.com.vmware:vh1.1

An ESX host initiator with this iqn name can now attach to our COMSTAR targets and will see any LU views that are added to the prod-esx1 host group. But there are still some issues here, for example any ESX host with this initiator name will be able to connect to our targets and see the LUs. This is where CHAP can help to improve access control.

Adding CHAP Authentication on the iSCSI Target

Adding CHAP authentication is very easy to accomplish, we simply need to set a chap user name and secret on the respective iSCSI target. Here is an example of its application.

# itadm modify-target -s -u tcuid1 iqn.2009-06.target.ss1.1

Enter CHAP secret:
Re-enter secret:

The CHAP secret must be between 12 and 255 characters long. The secret is sent from the initiator to the target in clear text and this is one of the reasons why separate iSCSI networks are a best practice. The addition of CHAP allows us to further reduce any risks of a potential storage security event. We can define an additional target and they can have a different chap user names and or secrets.

CHAP is more secure when used in a mutual authentication back to the source initiator which is my preferred way to implement it on ESX 4 (ESX 3 does not support mutual chap). This mode does not stop a successful one-way authentication from an initiator to the target, it allows the initiator to request that the target host system iSCSI services must authenticate back to the initiator which provides validation that the target is indeed the correct one. Here is an example of the target side initiator definition that would provide this capability.

# itadm modify-initiator -s -u icuid1 iqn.1998-01.com.vmware:vh1.1

Enter CHAP secret:
Re-enter secret:

Configuring the ESX 4 Software iSCSI Initiator

On the ESX 4 host side we need to enter our initiator side CHAP values.

ESX 4 iSCSI Mutual CHAP

 

Be careful here, there are three places we can configure CHAP elements. The general tab allows a global point of admin where any target will inherit those entered values by default where applicable e.g. target chap settings. The the dynamic tab can override the global settings and as well the static tab overrides the global and dynamic ones. In this example we are configuring a dynamically discovered target to use mutual (aka bidirectional) authentication.

In closing CHAP is a reasonable method to ensure that we correctly grant initiator to target connectivity assignments in an effort to promote better integrity and availability. It does not however provide much on the side of confidentially for that we need more complex solutions like IPSec.

Hope you found this blog interesting.

Regards,

Mike

Site Contents: © 2009  Mike La Spina

VMware Kernel Return Codes

From time to time I find myself reaming through VMware log files in an effort to diagnose various failure events. This is certainly not my favorite task so to make the process a little less painful I decided to extract the vmkernel return codes from the VMware open source libraries and place an easily accessible tabled version of them on my blog. While it’s not a very interesting blog entry it does have a useful purpose. I promise to get back to some more interesting entries soon.You can also use the console command vmkerrcode -l if it’s handy.

This posted list is only for ESX 3.5.x, vSphere has different error codes and they do not map to ESX 3.5.x

Regards,

Mike

 

0         Success
0xbad0001 Failure
0xbad0002 Would block
0xbad0003 Not found
0xbad0004 Busy
0xbad0005 Already exists
0xbad0006 Limit exceeded
0xbad0007 Bad parameter
0xbad0008 Metadata read error
0xbad0009 Metadata write error
0xbad000a I/O error
0xbad000b Read error
0xbad000c Write error
0xbad000d Invalid name
0xbad000e Invalid handle
0xbad000f No such SCSI adapter
0xbad0010 No such target on adapter
0xbad0011 No such partition on target
0xbad0012 No filesystem on the device
0xbad0013 Memory map mismatch
0xbad0014 Out of memory
0xbad0015 Out of memory (ok to retry)
0xbad0016 Out of resources
0xbad0017 No free handles
0xbad0018 Exceeded maximum number of allowed handles
0xbad0019 No free pointer blocks (deprecated)
0xbad001a No free data blocks (deprecated)
0xbad001b Corrupt RedoLog
0xbad001c Status pending
0xbad001d Status free
0xbad001e Unsupported CPU
0xbad001f Not supported
0xbad0020 Timeout
0xbad0021 Read only
0xbad0022 SCSI reservation conflict
0xbad0023 File system locked
0xbad0024 Out of slots
0xbad0025 Invalid address
0xbad0026 Not shared
0xbad0027 Page is shared
0xbad0028 Kseg pair flushed
0xbad0029 Max async I/O requests pending
0xbad002a Minor version mismatch
0xbad002b Major version mismatch
0xbad002c Already connected
0xbad002d Already disconnected
0xbad002e Already enabled
0xbad002f Already disabled
0xbad0030 Not initialized
0xbad0031 Wait interrupted
0xbad0032 Name too long
0xbad0033 VMFS volume missing physical extents
0xbad0034 NIC teaming master valid
0xbad0035 NIC teaming slave
0xbad0036 NIC teaming regular VMNIC
0xbad0037 Abort not running
0xbad0038 Not ready
0xbad0039 Checksum mismatch
0xbad003a VLan HW Acceleration not supported
0xbad003b VLan is not supported in vmkernel
0xbad003c Not a VLan handle
0xbad003d Couldn’t retrieve VLan id
0xbad003e Connection closed by remote host, possibly due to timeout
0xbad003f No connection
0xbad0040 Segment overlap
0xbad0041 Error parsing MPS Table
0xbad0042 Error parsing ACPI Table
0xbad0043 Failed to resume VM
0xbad0044 Insufficient address space for operation
0xbad0045 Bad address range
0xbad0046 Network is down
0xbad0047 Network unreachable
0xbad0048 Network dropped connection on reset
0xbad0049 Software caused connection abort
0xbad004a Connection reset by peer
0xbad004b Socket is not connected
0xbad004c Can’t send after socket shutdown
0xbad004d Too many references: can’t splice
0xbad004e Connection refused
0xbad004f Host is down
0xbad0050 No route to host
0xbad0051 Address already in use
0xbad0052 Broken pipe
0xbad0053 Not a directory
0xbad0054 Is a directory
0xbad0055 Directory not empty
0xbad0056 Not implemented
0xbad0057 No signal handler
0xbad0058 Fatal signal blocked
0xbad0059 Permission denied
0xbad005a Operation not permitted
0xbad005b Undefined syscall
0xbad005c Result too large
0xbad005d Pkts dropped because of VLAN (support) mismatch
0xbad005e Unsafe exception frame
0xbad005f Necessary module isn’t loaded
0xbad0060 No dead world by that name
0xbad0061 No cartel by that name
0xbad0062 Is a symbolic link
0xbad0063 Cross-device link
0xbad0064 Not a socket
0xbad0065 Illegal seek
0xbad0066 Unsupported address family
0xbad0067 Already connected
0xbad0068 World is marked for death
0xbad0069 No valid scheduler cell assignment
0xbad006a Invalid cpu min
0xbad006b Invalid cpu minLimit
0xbad006c Invalid cpu max
0xbad006d Invalid cpu shares
0xbad006e Cpu min outside valid range
0xbad006f Cpu minLimit outside valid range
0xbad0070 Cpu max outside valid range
0xbad0071 Cpu min exceeds minLimit
0xbad0072 Cpu min exceeds max
0xbad0073 Cpu minLimit less than cpu already reserved by children
0xbad0074 Cpu max less than cpu already reserved by children
0xbad0075 Admission check failed for cpu resource
0xbad0076 Invalid memory min
0xbad0077 Invalid memory minLimit
0xbad0078 Invalid memory max
0xbad0079 Memory min outside valid range
0xbad007a Memory minLimit outside valid range
0xbad007b Memory max outside valid range
0xbad007c Memory min exceeds minLimit
0xbad007d Memory min exceeds max
0xbad007e Memory minLimit less than memory already reserved by children
0xbad007f Memory max less than memory already reserved by children
0xbad0080 Admission check failed for memory resource
0xbad0081 No swap file
0xbad0082 Bad parameter count
0xbad0083 Bad parameter type
0xbad0084 Dueling unmaps (ok to retry)
0xbad0085 Inappropriate ioctl for device
0xbad0086 Mmap changed under page fault (ok to retry)
0xbad0087 Operation now in progress
0xbad0088 Address temporarily unmapped
0xbad0089 Invalid buddy type
0xbad008a Large page info not found
0xbad008b Invalid large page info
0xbad008c SCSI LUN is in snapshot state
0xbad008d SCSI LUN is in transition
0xbad008e Transaction ran out of lock space or log space
0xbad008f Lock was not free
0xbad0090 Exceed maximum number of files on the filesystem
0xbad0091 Migration determined a failure by the VMX
0xbad0092 VSI GetList handler overflow
0xbad0093 Invalid world
0xbad0094 Invalid vmm
0xbad0095 Invalid transaction
0xbad0096 Transient file system condition, suggest retry
0xbad0097 Number of running VCPUs limit exceeded
0xbad0098 Invalid metadata
0xbad0099 Invalid page number
0xbad009a Not in executable format
0xbad009b Unable to connect to NFS server
0xbad009c The NFS server does not support MOUNT version 3 over TCP
0xbad009d The NFS server does not support NFS version 3 over TCP
0xbad009e The mount request was denied by the NFS server. Check that the export exists and that the client is permitted to mount it
0xbad009f The specified mount path was not a directory
0xbad00a0 Unable to query remote mount point’s attributes
0xbad00a1 NFS has reached the maximum number of supported volumes
0xbad00a2 Out of nice memory
0xbad00a3 VMotion failed to start due to lack of cpu or memory resources
0xbad00a4 Cache miss
0xbad00a5 Error induced when stress options are enabled
0xbad00a6 Maximum number of concurrent hosts are already accessing this resource
0xbad00a7 Host doesn’t have a journal
0xbad00a8 Lock rank violation detected
0xbad00a9 Module failed
0xbad00aa Unable to open slave if no master pty
0xbad00ab Not IOAble
0xbad00ac No free inodes
0xbad00ad No free memory for file data
0xbad00ae No free space to expand file or meta data
0xbad00af Unable to open writer if no fifo reader
0xbad00b0 No underlying device for major,minor
0xbad00b1 Memory min exceeds memSize
0xbad00b2 No virtual terminal for number
0xbad00b3 Too many elements for list
0xbad00b4 VMM<->VMK shared are mismatch
0xbad00b5 Failure during exec while original state already lost
0xbad00b6 vmnixmod kernel module not loaded
0xbad00b7 Invalid module
0xbad00b8 Address is not aligned on page boundary
0xbad00b9 Address is not mapped in address space
0xbad00ba No space to record a message
0xbad00bb No space left on PDI stack
0xbad00bc Invalid exception handler
0xbad00bd Exception not handled by exception handler
0xbad00be Can’t open sparse/TBZ files in multiwriter mode
0xbad00bf Transient storage condition, suggest retry
0xbad00c0 Storage initiator error
0xbad00c1 Timer initialization failed
0xbad00c2 Module not found
0xbad00c3 Socket not owned by cartel
0xbad00c4 No VSI handler found for the requested node
0xbad00c5 Invalid mmap protection flags
0xbad00c6 Invalid chunk size for contiguous mmap
0xbad00c7 Invalid MPN max for contiguous mmap
0xbad00c8 Invalid mmap flag on contiguous mmap
0xbad00c9 Unexpected fault on pre-faulted memory region
0xbad00ca Memory region cannot be split (remap/unmap)
0xbad00cb Cache Information not available
0xbad00cc Cannot remap pinned memory
0xbad00cd No cartel group by that name
0xbad00ce SPLock stats collection disabled
0xbad00cf Boot image is corrupted
0xbad00d0 Branched file cannot be modified
0xbad00d1 Name is reserved for branched file
0xbad00d2 Unlinked file cannot be branched
0xbad00d3 Maximum kernel-level retries exceeded
0xbad00d4 Optimistic lock acquired by another host
0xbad00d5 Object cannot be mmapped
0xbad00d6 Invalid cpu affinity
0xbad00d7 Device does not contain a logical volume
0xbad00d8 No space left on device
0xbad00d9 Invalid vsi node ID
0xbad00da Too many users accessing this resource
0xbad00db Operation already in progress
0xbad00dc Buffer too small to complete the operation
0xbad00dd Snapshot device disallowed
0xbad00de LVM device unreachable
0xbad00df Invalid cpu resource units
0xbad00e0 Invalid memory resource units
0xbad00e1 IO was aborted
0xbad00e2 Memory min less than memory already reserved by children
0xbad00e3 Memory min less than memory required to support current consumption
0xbad00e4 Memory max less than memory required to support current consumption
0xbad00e5 Timeout (ok to retry)
0xbad00e6 Reservation Lost
0xbad00e7 Cached metadata is stale
0xbad00e8 No fcntl lock slot left
0xbad00e9 No fcntl lock holder slot left
0xbad00ea Not licensed to access VMFS volumes
0xbad00eb Transient LVM device condition, suggest retry
0xbad00ec Snapshot LV incomplete
0xbad00ed Medium not found
0xbad00ee Maximum allowed SCSI paths have already been claimed
0xbad00ef Filesystem is not mountable
0xbad00f0 Memory size exceeds memSizeLimit
0xbad00f1 Disk lock acquired earlier, lost
0×2bad0000 Generic service console error

Site Contents: © 2009  Mike La Spina

Protecting ESX VMFS Stores with Automation

Some time ago I shared some interesting information about VMFS volumes that I found using direct analysis in my blog named Understanding VMFS volumes. This spawned some discussions on the VMware Community forums and it became apparent that an automated backup of the critical VMFS info could be useful in the event of an undesirable security event that impacts our system availability. By creating a simple backup script process we can provide the ability to recover much more quickly from such events. In this howto guide we will enable this process with a cron job using the existing /etc/cron.daily/ job location directory. We simply need to copy an automation script to this location and it will run daily. Or if your change rate is less frequent maybe the /etc/cron.weekly location is more suitable.    

I have provided a simple script named vmfs-bu linked here which creates VMFS header file backups using dd for all the VMFS volumes visible on an ESX host. The script will output it’s backups to the /var/log directory. Here is a sample of what the backups look like from one my lab hosts.

-rw-r–r–    1 root     root      2097152 Mar 28 21:01 vmfs-header-backup-sda.hex
-rw-r–r–    1 root     root      2097152 Mar 28 21:01 vmfs-header-backup-sdb.hex
-rw-r–r–    1 root     root      2097152 Mar 28 21:01 vmfs-header-backup-sdc.hex
-rw-r–r–    1 root     root      2097152 Mar 28 21:01 vmfs-header-backup-sdd.hex
-r——–    1 root     root      8388608 Mar 28 21:01 vmfs-metadata-sda-vh.sf.bu
-r——–    1 root     root      8388608 Mar 28 21:01 vmfs-metadata-sdb-vh.sf.bu
-r——–    1 root     root      8388608 Mar 28 21:01 vmfs-metadata-sdc-vh.sf.bu
-r——–    1 root     root      8388608 Mar 28 21:01 vmfs-metadata-sdd-vh.sf.bu

Also be sure that the script permissions are configured as executable using the follwing.

chmod 755 /etc/cron.daily/vmfs-bu

With these simple easy steps we can create a proactive failsafe for VMFS recovery in the event of those undesirable moments we hope will never occur.

In addition to this basic method we can extend it’s functionally to recovery from undesirable VM deletions by automating the experimental vmfs-undelete tool. Before I demonstrate this functionally I must advise you that this tool may have some negative impacts to certain configurations. The vmfs-undelete tool is a set of python scripts that basically read vmks block allocations and outputs them to an archive file. In order to read the vmdks the scripts will invoke VM snapshots and this will not work if a current snapshot exists. This is not a big issue since you should not be leaving any snapshots around anyway as a best practice. However this action can negatively impact VCB or other backup agents that require snapshoting to grant vmdk access, Thus if you do not carefully time the script invocation when we have other snapshot dependant activities it could cause failures of those functions.

With this important information considered we can proceed to look at how to invoke the additional protection feature. The scripts that were originally developed by VMware are designed to be user interactive and cannot be used as originally coded therefore I have modified them in order to provision some basic automation. You can access the modified scripts named vmfs-undelete-auto-script and menuauto.py here.

The modified menuauto.py script needs to be placed within the /usr/lib/vmware/python2.2/site-packages/vmware/undeletemods directory. While I could have just modified the existing menu.py script it is subject to change so this method prevents potential conflict issues. The vmfs-undelete-auto-script script location is optional and can be placed where ever you find apropriate. I chose to place it in the /root directory. The script requires a single argument which is to direct it output location with a path and file name. Since there is a potention for conflict with other snapshot based services the script should be invoked using a cron job outside of the daily or other predefined jobs. This cron job can be implemented using the crontab facility. Here is an example of how to create it while logged in as root.

crontab -e

This command will create a cron entry in /var/spool/cron/root and will load vi as a editor you will need to create the following cron details to invoke the vmfs-undelete script. You can find more info on cron editing at http://en.wikipedia.org/wiki/Cron.

05 22 * * * python /root/vmfs-undelete-auto-script /var/log/vmfs-undelete-archive

This entry will execute at 10:05PM every day. Again be sure to set the security of the script file as executable as shown here.

chmod 755 /root/vmfs-undelete-auto-script

When the script executes it can take several minutes to complete.

Hope you found this information to be useful.

Regards,

Mike

Site Contents: © 2009  Mike La Spina

Awarded VMware vExpert Status

While I was away at the OpenSolaris Storage Summit in San Francisco a really nice email appeared in my inbox which put a grateful smile on my face. John Troyer and his team felt that I was worthy of the vExperts award. I’m honoured to receive the recognition and thank-you to the judging panel members 

Congrats to all the other recipients as well.

Site Contents: © 2009  Mike La Spina

Next Page »