Running ZFS over iSCSI as a VMware vmfs store
The first time I looked at ZFS it totally floored me. This is a file system that has changed the storage rules as we know them and continues to do so. It is without a doubt the best storage architecture to date, and now you can use it for your VMware stores.
Previously I had explored using it for a VMware store but ran into several show-stopping issues, like the VPD page response issue which made VMware see only one usable iSCSI store. Things are soon to be very different when Sun releases snv_93 or above to all. I am currently using the unreleased snv_93 iscsitgt code, and it works with VMware in all the ways you would want. Many thanks to the Sun engineers for adding NAA support to the iSCSI target service. With that said, let me divulge the details and behaviors of the first successful X4500 ZFS iSCSI VMware implementation in the real world.
Let's look at the architectural view first.
The architecture uses a best-practice approach consisting of a completely separate physical network for the iSCSI storage data plane. All components have redundant power and network connectivity. The iSCSI storage backplane is configured with a link aggregate and is VLAN'd off from the server management network. Within the physical HP 2900s an ISL connection is defined between the switches; it is not critical, but it allows for more available data paths if additional interfaces are assigned on the ESX host side.
The Opensolaris aggregate and network components are configured as follows:
For those of you using Indiana: by default NWAM is enabled, and it needs to be disabled and the physical network service enabled.
svcadm disable svc:/network/physical:nwam
svcadm enable svc:/network/physical:default
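If you want to confirm the switch took effect, the two service instances can be checked with svcs; this is just a sanity check and not part of the original procedure.
svcs svc:/network/physical:default svc:/network/physical:nwam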
The aggregates are defined using the dladm utility, but first any existing bindings need to be cleared by unplumbing the interfaces.
e.g. ifconfig e1000g0 unplumb
Once cleared, the physical devices can be assigned to the aggregates using the following commands:
dladm create-aggr -d e1000g0 -d e1000g1 -P L2,L3 1
dladm create-aggr -d e1000g2 -d e1000g3 -P L2,L3 2
Here we have set the policy to hash on layer 2 and layer 3 and defined two aggregates, aggr1 and aggr2. We can now define the VLAN-based interfaces, shown here as VLAN 500, instances 1 and 2 respectively of the aggr instances. You just need to apply the following formula for the VLAN interface name:
(adapter name) + (VLAN ID * 1000 + adapter instance)
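For example, applying the formula to VLAN 500 on aggr instances 1 and 2 gives the interface names used in the ifconfig commands below:
aggr + (500 * 1000 + 1) = aggr500001
aggr + (500 * 1000 + 2) = aggr500002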
ifconfig aggr500001 plumb up 10.1.0.1 netmask 255.255.0.0
ifconfig aggr500002 plumb up 10.1.0.2 netmask 255.255.0.0
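At this point the links and interfaces can be verified with the standard tools; this is shown only as an optional sanity check.
dladm show-aggr
ifconfig aggr500001
ifconfig aggr500002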
To persist the network configuration across reboots, you will need to create hostname files and hosts entries for the services to apply on startup.
echo ss1.iscsi1 > /etc/hostname.aggr500001
echo ss1.iscsi2 > /etc/hostname.aggr500002
Edit /etc/hosts to have the following host entries.
::1 localhost
127.0.0.1 ss1.local localhost loghost
10.0.0.1 ss1 ss1.domain.name
10.1.0.1 ss1.iscsi1
10.1.0.2 ss1.iscsi2
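A quick way to confirm the new entries resolve as expected is getent; this is an optional check and not from the original steps.
getent hosts ss1.iscsi1
getent hosts ss1.iscsi2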
On the HP switches it's a simple static trunk definition on ports 1 and 2 using the following at the CLI.
trunk 1-2 trk1 trunk
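On the ProCurve CLI the resulting trunk can be confirmed with the following command; it is included here only as an illustrative check.
show trunks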
Once all the networking components are up, running, and persistent, it's time to define the ZFS stores and iSCSI targets. I chose to include both mirrored and raidz pools. I needed to find and organize the cxtxdx device names using the cfgadm command; you could also issue a format command to see the controller, target, and disk names if you're not using an X4500. I placed the raidz devices across controllers to improve I/O and distribute the load; it would not be prudent to place one array on a single SATA controller. Here is what it ends up looking like from the ZFS command view.
zpool create -f rp1 raidz1 c4t0d0 c4t6d0 c5t4d0 c8t2d0 c9t1d0 c10t1d0
zpool add rp1 raidz1 c4t1d0 c4t7d0 c5t5d0 c8t3d0 c9t2d0 c10t2d0
zpool add rp1 raidz1 c4t2d0 c5t0d0 c5t6d0 c8t4d0 c9t3d0 c10t3d0
zpool add rp1 raidz1 c4t3d0 c5t1d0 c5t7d0 c8t5d0 c9t5d0 c11t0d0
zpool add rp1 raidz1 c4t4d0 c5t2d0 c8t0d0 c8t6d0 c9t6d0 c11t1d0
zpool add rp1 raidz1 c4t5d0 c5t3d0 c8t1d0 c8t7d0 c10t0d0 c11t2d0
zpool add rp1 spare c11t3d0
zpool create -f mp1 mirror c10t4d0 c11t4d0
zpool add mp1 mirror c10t5d0 c11t5d0
zpool add mp1 mirror c10t6d0 c11t6d0
zpool add mp1 spare c9t7d0
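Once the pools are built, it's worth confirming that each raidz1 vdev really does span multiple controllers and that the spares are attached; this verification step is an addition, not part of the original listing.
zpool status rp1
zpool status mp1
zpool list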
It only takes seconds to create terabytes of storage; wow, it truly is a thing of beauty (geek!). Now it's time to define a few stores and volumes in preparation for the creation of the iSCSI targets. I chose to create units of 750G since VMware would not perform well with much more than that. This is somewhat dependent on the size of the VMs and the type of I/O, but generally an ESX host will serve a wide mix, so try to keep it to a reasonable size or you end up with SCSI reservation issues (that's a bad thing, chief).
You must also consider the I/O block size before creating a ZFS volume; this is not something that can be changed later, so now is the time. It's done by adding -b 64K to the zfs create command. I chose 64K for the block size, which aligns with the VMware default allocation size and thus optimizes performance. The -s option enables the sparse volume feature, a.k.a. thin provisioning. In this case the space was available, but it is my favorite way to allocate storage.
zfs create rp1/iscsi
zfs create -s -b 64K -V 750G rp1/iscsi/lun0
zfs create -s -b 64K -V 750G rp1/iscsi/lun1
zfs create -s -b 64K -V 750G rp1/iscsi/lun2
zfs create -s -b 64K -V 750G rp1/iscsi/lun3
zfs create mp1/iscsi
zfs create -s -b 64K -V 750G mp1/iscsi/lun0
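To confirm that the 64K block size and the sparse (thin) provisioning took effect, the volume properties can be queried; a sparse volume will report its refreservation as none. This is an illustrative check against one of the LUNs above.
zfs get volblocksize,volsize,refreservation rp1/iscsi/lun0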
Originally I wanted to build the ESX hosts using a local disk, but thanks to some bad IBM x346 engineering I could not use the QLA4050C and the integrated Adaptec controller together on the ESX host server hardware. So I decided to give boot from iSCSI a go; here is the boot LUN definition that I used for it. Note that the original architectural design calls for local disk to prevent an ESX host failure in the event of an iSCSI path outage.
zfs create rp1/iscsi/boot
zfs create -s -V 16G rp1/iscsi/boot/esx1
Now that the ZFS stores are complete we can create the iSCSI targets for the ESX hosts to use. I named the target aliases to reflect something about the storage system, which makes them easier to work with. I also created an iSCSI configuration store so the iSCSI targets persist across reboots. (This may now be included with OpenSolaris Indiana, but I have not tested it.)
mkdir /etc/iscsi/config
iscsitadm modify admin --base-directory /etc/iscsi/config
iscsitadm create target -u 0 -b /dev/zvol/rdsk/rp1/iscsi/lun0 ss1-zrp1
iscsitadm create target -u 1 -b /dev/zvol/rdsk/rp1/iscsi/lun1 ss1-zrp1
iscsitadm create target -u 2 -b /dev/zvol/rdsk/rp1/iscsi/lun2 ss1-zrp1
iscsitadm create target -u 3 -b /dev/zvol/rdsk/rp1/iscsi/lun3 ss1-zrp1
iscsitadm create target -b /dev/zvol/rdsk/mp1/iscsi/lun0 ss1-zmp1
iscsitadm create target -b /dev/zvol/rdsk/rp1/iscsi/boot/esx1 ss1-esx1-boot
Most blog examples of enabling targets show the ZFS command line method of setting shareiscsi=on. This works well for a new iqn, but if you want to allocate additional LUNs under that iqn then you need to use this -b backing store method.
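For comparison, the ZFS property method mentioned above looks like the following; it is shown only for contrast, since this build uses the iscsitadm backing-store method instead.
zfs set shareiscsi=on mp1/iscsi/lun0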
Now that we have some targets you should be able to list them using:
iscsitadm list target
Notice that we only see one iqn for ss1-zrp1; you can use the -v option to show all the LUNs if required.
Target: ss1-zrp1
iSCSI Name: iqn.1986-03.com.sun:02:eb9c3683-9b2d-ccf4-8ae0-85c7432f3ef6.ss1-zrp1
Connections: 2
Target: ss1-zmp1
iSCSI Name: iqn.1986-03.com.sun:02:36fd5688-7521-42bc-b65e-9f777e8bfbe6.ss1-zmp1
Connections: 2
Target: ss1-esx1-boot
iSCSI Name: iqn.1986-03.com.sun:02:d1ecaed7-459a-e4b1-a875-b4d5df72de40.ss1-esx1-boot
Connections: 2
It would be prudent to create some initiator entries to allow authorization control of which initiator iqn's can connect to a particular target.
This is an important step: it creates the ability to use CHAP, or at the very least restricts a target to accepting only the named iqn's. iSNS also provides a similar service.
iscsitadm create initiator --iqn iqn.2000-04.com.qlogic:qla4050c.esx1.1 esx1.1
iscsitadm create initiator --iqn iqn.2000-04.com.qlogic:qla4050c.esx1.2 esx1.2
Now we can assign these initiators to a target and then the target will only accept those initiators. You can also add CHAP authentication as well, but that’s beyond the scope of this blog.
iscsitadm modify target --acl esx1.1 ss1-esx1-boot
iscsitadm modify target --acl esx1.2 ss1-esx1-boot
iscsitadm modify target --acl esx1.1 ss1-zrp1
iscsitadm modify target --acl esx1.2 ss1-zrp1
iscsitadm modify target --acl esx1.1 ss1-zmp1
iscsitadm modify target --acl esx1.2 ss1-zmp1
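The ACL assignments can be verified by listing a target in verbose mode; this is an optional check added here, not part of the original post.
iscsitadm list target -v ss1-esx1-boot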
In order to boot from the target LUN we need to configure the QLA4050C boot feature. You must do this from the ESX host using the Ctrl-Q sequence during the boot cycle. It is simply a matter of entering the primary boot target IP, setting the mode to manual, and entering the iqn exactly as it was listed by the iscsitadm list target command, e.g.
iqn.1986-03.com.sun:02:d1ecaed7-459a-e4b1-a875-b4d5df72de40.ss1-esx1-boot
Once the iqn is entered the ESX host software can be installed and configured.
Till next time….
Tags: aggregate, iscsi, opensolaris, san, Storage, vmfs, VMware, Volumes, zfs
Site Contents: © 2008 Mike La Spina
Mike, this is great, I’ve been using it as a template for about two days now. I have an x2250 and J4200 which I’m using as my iSCSI shelf. Now, in my situation, I’ve got 4 GigE NICs on the x2250 and used two in aggregation – they are nge NICs. Turned on jumbo frames, have an ESX server, a Dell 2900 with twin 1G e1000 NICs, aggregated the VMware way, all going through a Nortel switch, jumbo frames ready. My max speed read is around 72mb/sec and write is 122mb/sec. These seem to be rather hard limits of the 1G NICs. What I’m wondering is this: what kind of throughput are you getting off your X4500 – and do the QLogic cards make that much difference? It seems that you are using e1000g cards in that X4500; can they be doing that much better? I see you showing 800mb/sec on the next entry – I’d like to get at least 400… possible?
Hi,
Looks like you are at wire speed now. You can get better performance with QLogic cards, but I would suggest 10GbE if you really need more. Software iSCSI is limited to a maximum of one kernel IP connection session at a time, so each ESX host will be bound to a single 1GbE full-duplex wire speed.
Mike, found this a year and a half after you posted it and it’s still helpful.
With Win2k3, I managed to get things booting from the target LUN just using vmware itself… raw disk setup and LSI Logic SAS adapter using the SAS1068 driver from the LSI Logic site. Bonus points: I don’t have to enter the IQN manually. It’s dramatically outperforming VMFS on my setup, which is an x4250 (300GB disks and SSD for ZIL) with an attached J4400 (1TB disks).
[…] than the machines that are on a 400GB VMFS. A good basic tutorial with iSCSI and ZFS is here: Running ZFS Over iSCSI as a vmware VMFS store — but note that I’m using raw LUNs after not being happy with the VMFS performance with […]
Thanks Karl,
I’m working on an update to it using COMSTAR; there are lots of new features and performance gains when using the kernel-based iSCSI target.
Regards,
Mike
Mike,
Thanks for the great thought leadership here, much appreciated. A question I have for you: in your disk layout strategy you create your pool back to front instead of side to side, e.g. I would normally do c0-c7 on t0 but you do t0-t3 on c0. Can you elaborate on your thinking there, please? Thanks!
Hi Eff,
The thought process was to randomize the I/O across the 3 PCI-X bus interfaces which attach the Marvell SATA II controllers and avoid pattern I/O hot spots.
We cannot use cxt0d0 and cxt1d0, where cx is the enumerated boot disk controller; this will vary and could be enumerated as c4 or c5, etc. Thus I focused on the target instead of the controller. The same would work using the controller as well; however, t0 and t1 would need to be shifted out at some point in the array map.
Regards,
Mike
Mike thank you! What strategy would you advise if I wanted to consume all drives with 2 hot spares and RAID-Z2 as a requirement? If I enumerate from 1-4 (4×11’s + 2 boot + 2 spare) using your method, I end up with even 1-4 almost across the entire system. Surely there must be a better strategy? The system in question is primarily for backup and secondarily for VMware iSCSI.
Thank you for the step by step instructions. It’s been very helpful, especially for a newbie in the unix world. I was hoping you can help me with an issue I’m having.
I was able to create targets and 2 tpgts, however, I’m only able to connect to tpgt 1 on my esxi4 host. Is this a limitation?
# iscsitadm modify target -p 1 zfs201-array300
# iscsitadm modify target -p 2 zfs201-array146
# iscsitadm modify target -p 2 zfs201-internal
# iscsitadm list target
Target: zfs201-array300
iSCSI Name: iqn.1986-03.com.sun:02:xadd460d-246b-c442-8fc1-bf9bd582c08f.zfs201-array300
Connections: 5
Target: zfs201-array146
iSCSI Name: iqn.1986-03.com.sun:02:xcf61291-876a-c969-b85f-b6916b879830.zfs201-array146
Connections: 0
Target: zfs201-internal
iSCSI Name: iqn.1986-03.com.sun:02:x8f889af-b2b3-c0a7-bd33-bf64b20a3892.zfs201-internal
Connections: 0
# iscsitadm list tpgt -v
TPGT: 1
IP Address: 192.168.10.39
TPGT: 2
IP Address: 192.168.20.39
Anthony,
There are a variety of possibilities. You have not indicated what versions are running or how the networking components are defined so it would be difficult to see what your issue could be. Regardless of that there is no limiting reason that you would not be able to at least connect to the second portal. Keep in mind that you will only have one active connection from a VMware ESX software iSCSI stack.
What are you trying to achieve using two separate portals?
Mike,
Thanks for the response.
My intention was to separate the DB2 storage and the log directory to two separate portals for better performance. Is this statement correct? “I can not have both portals actively connected on a single vmware host iSCSI stack”.
On a different topic, have you ever created a RDM of local storage and attach it to a solaris vm for zfs use? Any suggestions?
Anthony,
You can actively connect both portals over a vmnic team; however, only one path will serve as the active connection to any given VM.
You would need to define two vmkernels on the switch and manually override the vmnic assignment to allow only one vmnic per vmkernel interface. The interfaces must be on separate subnets.
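As a rough sketch of what that looks like from the classic ESX service console (the portgroup names and IPs here are hypothetical, and the per-portgroup vmnic override itself is done in the vSphere client NIC teaming settings):
esxcfg-vswitch -A iSCSI-A vSwitch1
esxcfg-vswitch -A iSCSI-B vSwitch1
esxcfg-vmknic -a -i 10.1.0.10 -n 255.255.0.0 iSCSI-A
esxcfg-vmknic -a -i 10.2.0.10 -n 255.255.0.0 iSCSI-B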
I have defined shared SAN RDMs to Opensolaris VMs, but not with local disks. I see no real benefit to this practice and no longer use that method. vmdks on VMFS are more flexible and are within 97-99% of the performance capability of the raw device.
Regards,
Mike
Mike,
You referenced a 64k allocation unit size for VMFS3 in the article? Has this number changed with vSphere 5 and the now universal block size of 1MB?
Thank you,
Cooper
Hey Cooper,
VMFS-5 writes out in 1MB blocks, and it also supports 8K sub-block allocation. A 64K ZFS block size still holds true for better performance in the majority of operations.