Running ZFS over iSCSI as a VMware vmfs store

The first time I looked at ZFS it totally floored me. This is a file system that has changed the rules of storage systems as we know them and continues to do so. It is without doubt the best architecture to date, and now you can use it for your VMware stores.

Previously I had explored using it for a VMware store but ran into many issues which were real show stoppers, like the VPD page response issue which made VMware see only one usable iSCSI store. But things will soon be very different once Sun releases snv_93 or above to all. I am currently using the unreleased snv_93 iscsitgt code and it works with VMware in all the ways you would want. Many thanks to the Sun engineers for adding NAA support to the iSCSI target service. With that said, let me divulge the details and behaviors of the first successful X4500 ZFS iSCSI VMware implementation in the real world.

Let’s look at the architectural view first.

X4500 iSCSI Architecture by Mike La Spina

The architecture uses a best practice approach consisting of completely separated physical networks for the iSCSI storage data plane. All components have redundant power and network connectivity. The iSCSI storage backplane is configured with an aggregate and is VLAN’d off from the server management network. Between the physical HP 2900s an inter-switch link (ISL) is defined but is not critical. It allows for more available data paths if additional interfaces are assigned on the ESX host side.
The OpenSolaris aggregate and network components are configured as follows:

For those of you using Indiana: by default nwam is enabled on Indiana, and it needs to be disabled and the default physical network service enabled.

svcadm disable svc:/network/physical:nwam
svcadm enable svc:/network/physical:default

The aggregate is defined using the dladm (data link administration) utility, but first any bindings need to be cleared by unplumbing the interfaces.

e.g. ifconfig e1000g0 unplumb

Once cleared the assignment of the physical devices is possible using the following commands

dladm create-aggr -d e1000g0 -d e1000g1 -P L2,L3 1
dladm create-aggr -d e1000g2 -d e1000g3 -P L2,L3 2

Here we have set the load-balancing policy to layers 2 and 3 and defined two aggregates, aggr1 and aggr2. We can now define the VLAN-based interfaces, shown here as VLAN 500 on instances 1 and 2, matching the aggr instances. You just need to apply the following formula to name the VLAN interface.

(Adaptor Name) + (VLAN ID * 1000 + Adaptor Instance), e.g. aggr + (500 * 1000 + 1) = aggr500001

ifconfig aggr500001 plumb up netmask
ifconfig aggr500002 plumb up netmask
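
The naming formula above can be sketched as a small shell function. This is only an illustration: vlan_ifname is a hypothetical helper name, not a Solaris command, and it does nothing more than the arithmetic and string concatenation.

```shell
#!/bin/sh
# vlan_ifname is a hypothetical helper, not part of Solaris.
# It applies the formula: (adaptor name) + (VLAN ID * 1000 + adaptor instance)
vlan_ifname() {
    driver=$1     # adaptor (driver) name, e.g. aggr or e1000g
    vlan=$2       # VLAN ID, e.g. 500
    instance=$3   # adaptor instance, e.g. 1 for aggr1
    echo "${driver}$(( vlan * 1000 + instance ))"
}

vlan_ifname aggr 500 1   # prints aggr500001
vlan_ifname aggr 500 2   # prints aggr500002
```

The two results match the interface names used in the ifconfig commands above.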

To persist the network configuration on boot you will need to create hostname files and hosts entries for the services to apply on startup.

echo ss1.iscsi1 > /etc/hostname.aggr500001
echo ss1.iscsi2 > /etc/hostname.aggr500002

Edit /etc/hosts to have the following host entries.

::1 localhost ss1.local localhost loghost ss1 ss1.iscsi1 ss1.iscsi2

On the HP switches it’s a simple static trunk definition on ports 1 and 2 using the following at the CLI.

trunk 1-2 trk1 trunk 

Once all the networking components are up, running and persistent, it’s time to define the ZFS store and iSCSI targets. I chose to include both mirrored and raidz pools. I needed to find and organize the cxtxdx device names using the cfgadm command; if you’re not using an X4500 you could also issue a format command to see the controller, target and disk names. I placed the raidz devices across controllers to improve I/O and distribute the load. It would not be prudent to place one array on a single SATA controller. So here is what it ends up looking like from the ZFS command view.

zpool create -f rp1 raidz1 c4t0d0 c4t6d0 c5t4d0 c8t2d0 c9t1d0 c10t1d0
zpool add rp1 raidz1 c4t1d0 c4t7d0 c5t5d0 c8t3d0 c9t2d0 c10t2d0
zpool add rp1 raidz1 c4t2d0 c5t0d0 c5t6d0 c8t4d0 c9t3d0 c10t3d0
zpool add rp1 raidz1 c4t3d0 c5t1d0 c5t7d0 c8t5d0 c9t5d0 c11t0d0
zpool add rp1 raidz1 c4t4d0 c5t2d0 c8t0d0 c8t6d0 c9t6d0 c11t1d0
zpool add rp1 raidz1 c4t5d0 c5t3d0 c8t1d0 c8t7d0 c10t0d0 c11t2d0
zpool add rp1 spare c11t3d0
zpool create -f mp1 mirror c10t4d0 c11t4d0
zpool add mp1 mirror c10t5d0 c11t5d0
zpool add mp1 mirror c10t6d0 c11t6d0
zpool add mp1 spare c9t7d0
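
When laying out groups by hand it is easy to lose track of how many disks in a group share a controller. The sketch below is one way to check that before running zpool; max_per_controller is a hypothetical helper that only parses the cXtYdZ names and counts, and it never touches the disks.

```shell
#!/bin/sh
# max_per_controller is a hypothetical helper: given cXtYdZ device names,
# print the largest number of devices that sit on the same controller.
max_per_controller() {
    for dev in "$@"; do
        echo "${dev%%t*}"    # keep only the controller part: c4t0d0 -> c4
    done | sort | uniq -c | sort -rn | awk 'NR == 1 { print $1 }'
}

# First raidz1 group of rp1 from the listing above.
max_per_controller c4t0d0 c4t6d0 c5t4d0 c8t2d0 c9t1d0 c10t1d0   # prints 2

# First mirror of mp1.
max_per_controller c10t4d0 c11t4d0                              # prints 1
```

A result of 1 means every disk in the group is on its own controller; the first raidz1 group has two disks on c4.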

It only takes seconds to create terabytes of storage. Wow, it truly is a thing of beauty (geek!). Now it’s time to define a few file systems and volumes in preparation for the creation of the iSCSI targets. I chose to create units of 750G since VMware would not perform well with much more than that. This is somewhat dependent on the size of the VMs and the type of I/O, but generally an ESX host will serve a wide mix, so I try to keep it to a reasonable size or it ends up with SCSI reservation issues (that’s a bad thing, chief).

You must also consider the I/O block size before creating a ZFS volume. It cannot be changed later, so now is the time. It’s done by adding -b 64K to the zfs create command. I chose 64K for the block size, which aligns with the VMware default allocation size and thus optimizes performance. The -s option creates a sparse volume, aka thin provisioning. In this case the space was available, but it is my favorite way to allocate storage.

zfs create rp1/iscsi
zfs create -s -b 64K -V 750G rp1/iscsi/lun0
zfs create -s -b 64K -V 750G rp1/iscsi/lun1
zfs create -s -b 64K -V 750G rp1/iscsi/lun2
zfs create -s -b 64K -V 750G rp1/iscsi/lun3
zfs create mp1/iscsi
zfs create -s -b 64K -V 750G mp1/iscsi/lun0
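
Because sparse volumes reserve nothing up front, it helps to keep a running total of what has been promised from each pool. A minimal sketch of that arithmetic follows; provisioned_gb is a hypothetical helper and the sizes are simply the volume sizes used above, in GB.

```shell
#!/bin/sh
# provisioned_gb is a hypothetical helper: add up the volume sizes (in GB)
# that have been handed out from one pool.
provisioned_gb() {
    total=0
    for size in "$@"; do
        total=$(( total + size ))
    done
    echo "$total"
}

# rp1: four 750G data LUNs
provisioned_gb 750 750 750 750   # prints 3000

# mp1: one 750G data LUN
provisioned_gb 750               # prints 750
```

Compare the total with the AVAIL column of zfs list for the pool to see how far it is overcommitted, if at all.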

Originally I wanted to build the ESX hosts using a local disk, but thanks to some bad IBM x346 engineering I could not use the QLA4050C and an integrated Adaptec controller on the ESX host server hardware. So I decided to give boot from iSCSI a go; here is the boot LUN definition that I used for it. The original architectural design calls for local disk to prevent an ESX host failure in the event of an iSCSI path outage.

zfs create rp1/iscsi/boot
zfs create -s -V 16G rp1/iscsi/boot/esx1

Now that the ZFS stores are complete we can create the iSCSI targets for the ESX hosts to use. I have named each target alias to reflect something about the storage system, which makes it easier to work with. I also created an iSCSI configuration store so the iSCSI targets persist across reboots. (This may now be included with OpenSolaris Indiana but I have not tested it.)

mkdir /etc/iscsi/config
iscsitadm modify admin --base-directory /etc/iscsi/config
iscsitadm create target -u 0 -b /dev/zvol/rdsk/rp1/iscsi/lun0 ss1-zrp1
iscsitadm create target -u 1 -b /dev/zvol/rdsk/rp1/iscsi/lun1 ss1-zrp1
iscsitadm create target -u 2 -b /dev/zvol/rdsk/rp1/iscsi/lun2 ss1-zrp1
iscsitadm create target -u 3 -b /dev/zvol/rdsk/rp1/iscsi/lun3 ss1-zrp1
iscsitadm create target -b /dev/zvol/rdsk/mp1/iscsi/lun0 ss1-zmp1
iscsitadm create target -b /dev/zvol/rdsk/rp1/iscsi/boot/esx1 ss1-esx1-boot
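
With more than a handful of LUNs, typing each volume and target line by hand gets error prone. The sketch below is a dry run that only prints the commands for review. print_lun_commands is a hypothetical helper, and it assumes the same pool/iscsi/lunN naming, 750G size and 64K block size used above.

```shell
#!/bin/sh
# print_lun_commands is a hypothetical helper. It prints (does not run)
# one zfs create and one iscsitadm create target line per LUN number.
print_lun_commands() {
    pool=$1      # pool name, e.g. rp1
    target=$2    # target alias, e.g. ss1-zrp1
    count=$3     # number of LUNs, numbered from 0
    lun=0
    while [ "$lun" -lt "$count" ]; do
        echo "zfs create -s -b 64K -V 750G ${pool}/iscsi/lun${lun}"
        echo "iscsitadm create target -u ${lun} -b /dev/zvol/rdsk/${pool}/iscsi/lun${lun} ${target}"
        lun=$(( lun + 1 ))
    done
}

print_lun_commands rp1 ss1-zrp1 4
```

Once the printed lines look right they can be pasted into a root shell or piped to sh.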

Most blog examples of enabling targets show the ZFS command line method, shareiscsi=on. This works well for a new iqn, but if you want to allocate additional LUNs under that iqn then you need to use this -b backing store method.

Now that we have some targets you should be able to list them using:

iscsitadm list target

Notice that we only see one iqn for ss1-zrp1. You can use the -v option to show all the LUNs if required.

Target: ss1-zrp1
iSCSI Name:
Connections: 2
Target: ss1-zmp1
iSCSI Name:
Connections: 2
Target: ss1-esx1-boot
iSCSI Name:
Connections: 2

It would be prudent to create some initiator entries to control which initiator iqns can connect to a particular target.
This is an important step. It makes it possible to use CHAP, or at least to allow only named iqns to connect to that target. iSNS also provides a similar service.

iscsitadm create initiator --iqn esx1.1
iscsitadm create initiator --iqn esx1.2

Now we can assign these initiators to a target, and the target will then only accept those initiators. You can also add CHAP authentication, but that’s beyond the scope of this blog.

iscsitadm modify target --acl esx1.1 ss1-esx1-boot
iscsitadm modify target --acl esx1.2 ss1-esx1-boot
iscsitadm modify target --acl esx1.1 ss1-zrp1
iscsitadm modify target --acl esx1.2 ss1-zrp1
iscsitadm modify target --acl esx1.1 ss1-zmp1
iscsitadm modify target --acl esx1.2 ss1-zmp1
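
The ACL list grows as initiators times targets, so it is worth generating. The following is a dry-run sketch: print_acl_commands is a hypothetical helper that prints one iscsitadm modify target --acl line for every initiator and target pair, using the local initiator and target names defined above.

```shell
#!/bin/sh
# print_acl_commands is a hypothetical helper. It prints (does not run)
# the ACL command for every initiator/target combination.
print_acl_commands() {
    initiators=$1   # space-separated local initiator names
    shift           # remaining arguments are target aliases
    for target in "$@"; do
        for init in $initiators; do
            echo "iscsitadm modify target --acl ${init} ${target}"
        done
    done
}

print_acl_commands "esx1.1 esx1.2" ss1-esx1-boot ss1-zrp1 ss1-zmp1
```

For the two initiators and three targets here it prints the same six lines as above, in the same order.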

In order to boot from the target LUN we need to configure the QLA4050C boot feature. You must do this from the ESX host using the Ctrl-Q sequence during the boot cycle. It is simply a matter of entering the primary boot target IP, setting the mode to manual and entering the iqn exactly as it was listed by the iscsitadm list target command. e.g.

Once the iqn is entered the ESX host software can be installed and configured.
Till next time….

Site Contents: © 2008  Mike La Spina

iSCSI Security Basics

With iSCSI’s growing popularity, a better understanding of iSCSI security is becoming very important. Multiple issues arise when we choose to transport storage over our networks. The fundamental security areas of availability, confidentiality and integrity are all at risk when iSCSI best practices are not implemented. For example, a single attachment error can corrupt an iSCSI attached device at the speed of electrons. An entire data volume can be connected to an unauthorized user’s system. Access can be interrupted by a simple network problem. The possible outage issues are many and far reaching, thus we must carefully assess our risks and mitigate them where appropriate.

The most important step to a best practice is good old planning. It never fails to amaze me when I see ad hoc implementations of highly sensitive storage system architectures. Without fail the ad hoc, unplanned method will result in unstable systems, primarily due to forced system changes required to correct a multitude of issues like misconfiguration, poor performance, mismatched firmware etc. Planning is one of the most difficult parts of the process, where we almost need to be a fortune teller. You have to ask yourself the proverbial 1000 questions and come up with the right answers. So where do you begin? Well, a good start would be to use a risk-based approach. How much would a failure cost, what data is most critical, how much is the data worth, etc. We need to walk through the possible events and determine what makes sense to prevent them and when not to overdo it.

Some of the more obvious ways to prevent security events are to build on fundamental best practices like the following.


  • Build a physically separate network to service iSCSI
  • Zone the services with VLANs
  • Provide power redundancy to all devices (Separate UPS etc.)
  • Provide a minimum of two data paths to all critical components
  • Disable unused ports
  • Physically secure all the iSCSI network access points
  • Use mutual authentication
  • Where appropriate use TOE (TCP Offload Engine) hardware but be careful of complexity
  • Encrypt data crossing a WAN

The physical separation of the iSCSI network domain is important in many ways. We can prevent unpredictable capacity demands and provide much greater security by reducing the unauthorized physical connectivity. Change control is simplified and can be isolated. External network undesirables are completely eliminated. The dedicated physical network segregation practice is probably the single most important step of building a stable secure iSCSI environment.

Just as we segregate zones on traditional FCP-based SANs, the same practice is beneficial in the iSCSI world. We can filter and control OS-based access and data flow, isolate authentication and name services, and observe performance in a VLAN context. Priority can also be controlled per VLAN.

I can never forget the day one of our circuit breakers popped in the DC. Alarms were TAPped out and a feeling of panic stirred. But that’s when planning pays off: the system continued to run on the secondary power line. Simply powering on a server can pop a weak breaker and, without a second feed, kill the entire system. So when it comes to power supplies, I take them very seriously. Use TRUE power redundancy on all devices that are part of a critical SAN system, iSCSI or not. It will likely save your bacon in a multitude of scenarios.

No matter how hard we try to be perfect, it is in our nature to make mistakes. And when you are staring at a bundle of cables they really do all look the same. Moving the wrong cable or a port failure can completely take out your iSCSI functionality. Worse yet, a switch module dies on you. For the cost of an additional port we can avoid all of these ugly scenarios. Besides, multipathing gives you more peak bandwidth, and you gain the ability to do firmware updates without a full outage. Data path redundancy is a must for critical systems.

My neighbor had an interesting story the other day. He kept popping breakers when only two car block heaters were in use. (Block heater? Yes, up north when it’s cold you have to keep your cars plugged in. And no, we don’t live in igloos.) So what does this have to do with iSCSI? Well, my neighbor discovered that if you have a power receptacle available, other people will use it. They didn’t ask to use it and he did not expect them to, but you get the point. Your IT neighbors will at some point do the same, so disable accessible ports by default and enable only what is supposed to be in use.

If my neighbor could have locked the receptacle he would not need to worry about disabling it, but security is about layers of prevention. The more you have the less likely that a breach will happen and preventing physical access is usually easy.

The fastest way to corrupt an iSCSI target is to allow unauthenticated access. It’s just too easy to connect to the wrong node. I decided to do a quick test the other day which involved connecting two Windows initiators to a target. On one I created a directory with jpg photos and on the other I connected and browsed to the photo directory. Bam, corrupt target! The second node wrote a thumbnail while the first node copied more photos to the target. It’s just completely crazy to run without authentication, and one-way authentication is not going to prevent this in many cases. Thus we should use mutual authentication. No, it does not make the system less hackable, but it does help prevent the more probable human error event. For example, most IT shops will not want separate passwords for every iSCSI CHAP pair; they will likely use one for many nodes to reduce the administrative overhead. Using the same password essentially allows any correctly configured initiator to connect to any correctly configured target. If you couple a single password config with mutual authentication then you must define the return target access on the respective initiator, and this reduces accidental connections to the wrong node. A small amount of effort will prevent this human error and it costs nothing to set up. Of course you can have unique passwords on every iSCSI host; it will however require a central password store and far more administration in larger configurations. There are other solutions but those are out of scope for this basic security blog.

If you anticipate heavy traffic or plan to use IPSec you will want to consider TCP Offload Engine (TOE) adaptors. This does not mean that every server should use TOE. On the contrary, you should evaluate the iSCSI network traffic of the service and the CPU utilization level on that server. You can anticipate that 1 Gb/s of sustained iSCSI traffic will consume about 25% of a dual CPU 3 GHz server for straight iSCSI work. IPSec will vary depending on the key size and the OS implementation of the IPSec protocol, so I can’t even ballpark it. An example where it would make a lot of sense to use a TOE is a VMware ESX host, where you want as much CPU as possible available to VMs. The TOE adaptor will do all the iSCSI and IPSec work and the server CPU is all left to the VMs. It does not make sense if the host will be underutilized and only a medium amount of network traffic is expected. You will also want to keep your IPSec implementation as simple as possible. When a problem occurs, things can get very difficult to analyze and correct with IPSec, and setting up IPSec can be painful if things do not go as planned. (Blog on that subject to come later)
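
To put a rough number on a host before deciding on TOE hardware, the rule of thumb above can be scaled to the expected traffic. The sketch below assumes the cost scales linearly with throughput, which is only an assumption for illustration and not a measurement; iscsi_cpu_pct is a hypothetical helper name.

```shell
#!/bin/sh
# iscsi_cpu_pct is a hypothetical helper. It scales the rule of thumb
# (1000 Mb/s sustained is about 25% of a dual CPU 3 GHz server) linearly.
# Linear scaling is an assumption; integer division rounds down.
iscsi_cpu_pct() {
    mbps=$1    # expected sustained iSCSI traffic in Mb/s
    echo $(( mbps * 25 / 1000 ))
}

iscsi_cpu_pct 1000   # prints 25
iscsi_cpu_pct 400    # prints 10
```

A host expected to sit near the low end of that range is a weak candidate for a TOE adaptor; one near the top, such as a busy ESX host, is a stronger one.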

This PDF link provides some excellent information on IPSec and iSCSI performance.

If you have a DR centre or other multisite iSCSI functions across a WAN you will need to consider whether data encryption is required. For example, if you are using an external network provider then you should definitely be encrypting the data. If securing the data is not critical then consider using compression of some type at the bare minimum. You will promote better availability and, as an added bonus, the data will not cross the WAN in the clear for your neighbors to view. If encryption is important then IPSec is a good, solid, inexpensive way of ensuring confidentiality, integrity and authentication. There are many options available to secure the transmission over the WAN. Whatever you choose should be easy to manage and conform to industry standards like 128-bit AES symmetric ciphers, which are reasonably fast and extremely secure when managed correctly.

Till Next Time


