Ubiquitous Talk

Encapsulating VT-d Accelerated ZFS Storage within ESXi

Some time ago I found myself conceptually provisioning ESXi hosts that could transition local storage in a distributed manner within an array of hypervisors. The architectural model resembles an amorphous cluster of servers that share a common VM client service, which self-provisions shared storage to its parent hypervisor or even to other external hypervisors. This concept originally became a reality in one of my earlier blog entries named Provisioning Disaster Recovery with ZFS, iSCSI and VMware. With that previous success in a DR scope we can now explore more adventurous applications of storage encapsulation and further coin the phrase “rampant layering violations of storage provisioning”, thanks to Jeff Bonwick, Jim Moore and the many other brilliant creative minds behind the ZFS storage technology advancements. One of the main barriers to success for this concept was the serious issue of circular latency within the self-provisioning storage VM. In practice this means a long wait cycle for the storage VM to ready the requested storage, since it must wait for the hypervisor to schedule access to the raw storage blocks for the virtualized shared target, which it then re-provisions to other VMs. This issue is acceptable for a DR application but it’s a major show stopper for applications that require normal performance levels.

This major issue now has a solution with the introduction of Intel’s VT-d technology. VT-d allows us to accelerate storage I/O functionality directly inside a VM served by the VMware ESX and ESXi hypervisors. VMware has leveraged Intel’s VT-d technology on ESXi 4.x (AMD’s I/O virtualization technology, IOMMU, is also supported) as part of the feature named VMDirectPath. This feature allows us to insert high speed devices inside a VM, which can then host a device that operates at the hardware speed of the PCI bus, and that, my friend, allows virtualized ZFS storage provisioning VMs to dramatically reduce or eliminate the hypervisor’s circular latency issue.

Very exciting indeed, so let’s use a visual diagram of this amorphous server cluster concept to better capture what this vision actually entails.

Encapsulated Accelerated ZFS Architecture

The concept depicted here sets a multipoint NFS share strategy. Each ESXi host provisions its own NFS share from its local storage, which can be accessed by any of the other hosts including itself. Additionally, each encapsulated storage VM incorporates ZFS replication to a neighboring storage VM in a ring pattern, thus allowing for crash based recovery in the event of a host failure. Each ESXi instance hosts a DDRdrive X1 PCIe card which is presented to its storage VM over VT-d and VMDirectPath, a.k.a. PCI pass-through. When managed via vCenter this solution allows us to svMotion VMs across the cluster, allowing rolling upgrades or hardware servicing.

The ZFS replication cycle works as a background ZFS send/receive script process that incrementally updates the target storage VM. One very useful feature of the ZFS send/receive capability is the include-properties flag -p. When this flag is used, any NFS share properties that are defined using “sharenfs=” will be sent to the target host. Thus the only action required to enable access to the replicated NFS share is to add it as an NFS storage target on our ESXi host. Of course we would also need to stop replication if we wish to use the backup, or clone it to a new share for testing. Testing the backup without cloning will result in a modified ZFS target file system and this could force a complete ZFS resend of the file system in some cases.

Within this architecture our storage VM is built with OpenSolaris snv_134, thus we have the ability to engage in ZFS deduplication. This not only improves the storage capacity, it also grants improved performance when we allocate sufficient memory to the storage VM. The ZFS ARC cache needs to cache a deduplicated block only once, which accelerates all access requests for that block. For example, if this cluster served a Virtual Desktop Infrastructure (VDI) environment we would see all the OS file allocation blocks enter the ZFS ARC cache, and thus all VMs that reference the same OS file blocks would be cache accelerated. Dedup also grants a benefit to ZFS replication through the ZFS send -D flag. This flag instructs ZFS send to generate the stream in dedup format, and this dramatically reduces replication bandwidth and time consumption in a VMware environment.
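As a rough illustration of how the -p and -D flags fit into an incremental replication pass, here is a minimal dry-run sketch. It only prints the pipeline rather than running it, and the snapshot names (daily-1, daily-2), the target host uss2 and the target pool sp2 are assumptions chosen for the example, not values taken from a real run.

```shell
#!/bin/sh
# Dry-run sketch: prints the replication pipeline instead of executing it.
# Snapshot names, target host and target pool are hypothetical examples.
build_send_cmd() {
    fs=$1; prev=$2; curr=$3; host=$4; tpool=$5; dedup=$6
    # -p includes the file system properties (such as sharenfs) in the stream.
    flags="-p"
    # -D asks zfs send for a deduplicated stream.
    if [ "$dedup" = "yes" ]; then
        flags="$flags -D"
    fi
    # ${fs#*/} drops the source pool name so sp1/nas/vol0 lands under the target pool.
    echo "zfs send $flags -i ${fs}@${prev} ${fs}@${curr} | ssh ${host} zfs receive -F ${tpool}/${fs#*/}"
}

build_send_cmd sp1/nas/vol0 daily-1 daily-2 uss2 sp2 yes
```

Printing the command first makes it easy to review what would be sent before any data moves; the parent file system on the target (sp2/nas in this example) is assumed to exist already.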

With VT-d we now have the ability to add a non-volatile disk device as a dedicated ZIL accelerator commonly called a SLOG or Separate Intent Log. In this proof of concept architecture I have defined the DDRdrive X1 as a SLOG disk over VMware VMDirectPath to our storage VM. This was a challenge to accomplish as VT-d is just emerging and has many unknown behaviors with system PCI BUS timing and IRQ handling. Coaxing VT-d to work correctly proved to be the most technically difficult component of this proof of concept, however success is at hand using a reasonably cost effective ASUS motherboard in my home lab environment.

Let’s begin with the configuration of VT-d and VMware VMDirectPath.

VT-d requires system BIOS support and this function is available on the ASUS P6X58D series of motherboards. The feature is not enabled by default, so you must change it in the BIOS. I have found that enabling VT-d does impact how ESXi behaves. For example, some local storage devices that were available prior to enabling VT-d may not be accessible after enabling it, which can result in messages like “cannot retrieve extended partition information”.

The following screen shots demonstrate where you would find the VT-d BIOS setting on the P6X58D mobo.

VT-d-BIOS-Enable1

If you’re using an AMD 890FX based ASUS Crosshair IV mobo then look for the IOMMU setting as depicted here:

Thanks go to Stu Radnidge over at http://vinternals.com/ for the screen shot!

IOMMU on AMD 890FX Based Mobos

Once VT-d or IOMMU is enabled, ESXi VMDirectPath can be enabled from the VMware vSphere client under the host Configuration -> Advanced menu. A reboot is required before any further PCI sharing configuration can be completed.

One challenge I encountered was PCIe bus timing issues. Fortunately the ASUS P6X58D overclocking capability grants us the ability to align the clock timing on the PCIe bus by tuning the frequency and voltage, and thus I was able to stabilize the PCIe interface running the DDRdrive X1. Here are the original values I used that worked. Since that time I have pushed the i7 CPU to 4.0GHz, but that can be risky since you need to raise the CPU and DRAM voltages, so I will leave the safe values for public consumption.

P6X58D-Overclock-Tuning1

P6X58D-Overclock-Tuning2

ESXi-Console_Shot

Once VT-d is active you will be able to edit the check boxes in the enumerated PCI device list and allow pass-through for the device of your choice. There are three important PCI values to note: the Device ID, the Vendor ID and the Class ID. You can Google them or take this shortcut, http://www.pcidatabase.com/, to discover who owns the device and what class it belongs to. In this case I needed to identify the DDRdrive X1, and I know by the class ID 0100 that it is a SCSI device.
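To show how a class ID such as 0100 breaks down, here is a small sketch that splits the four hex digits into the PCI base class and subclass and names a few common mass storage subclasses. The function name decode_pci_class is hypothetical and the table is deliberately partial; it is not a replacement for a full PCI ID database.

```shell
#!/bin/sh
# Sketch: decode a 4-digit PCI class ID into base class and subclass.
# Only a handful of well-known values are listed; everything else is
# reported numerically.
decode_pci_class() {
    class=$1
    base=$(echo "$class" | cut -c1-2)   # first two hex digits: base class
    sub=$(echo "$class" | cut -c3-4)    # last two hex digits: subclass
    case "$base" in
        01) case "$sub" in
                00) echo "mass storage: SCSI controller" ;;
                01) echo "mass storage: IDE controller" ;;
                04) echo "mass storage: RAID controller" ;;
                06) echo "mass storage: SATA controller" ;;
                *)  echo "mass storage: other (subclass $sub)" ;;
            esac ;;
        02) echo "network controller (subclass $sub)" ;;
        *)  echo "base class $base, subclass $sub" ;;
    esac
}

decode_pci_class 0100
```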

VMDirectPath Enabled

Once our DDRdrive X1 device is added to the encapsulated OpenSolaris VM, its shared IRQ mode will need to be adjusted such that no other IRQs are chained to it. This is adjusted by adding a custom VM config parameter named pciPassthru0.msiEnabled and setting its value to false.

VMPassThru-msiEnabled=false

In this proof of concept the storage VM is assigned 4GB of memory, which is reasonable for non-deduped storage. If you plan to dedup the storage I would suggest significantly more memory to allow the block hash table to be held in memory. This is important for performance and is also needed if you have to delete a ZFS file system. The amount will vary depending on the total storage provisioned; I would roughly estimate about 8GB of memory for each 1TB of used storage. We also have two network interfaces, one of which will carry the storage traffic only. Keep in mind that dedup is still developing and should be heavily tested, so you should expect some issues.
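The 8GB per 1TB rule of thumb above is easy to turn into a quick calculation. The sketch below takes the used storage in GB and rounds the result up to a whole GB; the function name dedup_mem_gb is hypothetical, and the figure it returns is only the rough estimate from the text, not a measured requirement.

```shell
#!/bin/sh
# Rule of thumb from the text: about 8GB of memory per 1TB (1024GB) of
# used, deduped storage. Integer arithmetic, rounded up to a whole GB.
dedup_mem_gb() {
    used_gb=$1
    echo $(( (used_gb * 8 + 1023) / 1024 ))
}

dedup_mem_gb 1024   # 1TB of used storage
dedup_mem_gb 254    # a pool the size of the one in this proof of concept
```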

VM Settings

If you have read my previous blog entry Running ZFS Over NFS as a VMware Store you will find the next section to be very similar. This is essentially many of the same steps but excludes aggregation and IPMP capability.

Using a basic OpenSolaris Indiana completed install we can proceed to configure a shared NFS store so let’s begin with the IP interface. We don’t need a complex network configuration for this storage VM and therefore we will just setup simple static IP interfaces, one to manage the OpenSolaris storage VM and one to provision the NFS store. Remember that you should normally separate storage networks from other network types from both a management and security perspective.

OpenSolaris defaults to a dynamic network service configuration named nwam. This needs to be disabled and the physical:default service enabled.

root@uss1:~# svcadm disable svc:/network/physical:nwam
root@uss1:~# svcadm enable svc:/network/physical:default

To persistently configure the interfaces we can store the IP address in the local hosts file. The file will be referenced by the physical:default service to define the network IP address of the interfaces when the service starts up.

Edit /etc/hosts to have the following host entries.

::1 localhost
127.0.0.1 uss1.local localhost loghost
10.0.0.1 uss1 uss1.domain.name
10.1.0.1 uss1.esan.data1

As an option if you don’t normally use vi you can install nano.

root@uss1:~# pkg install SUNWgnu-nano

When an OpenSolaris host starts up, the physical:default service will reference the /etc directory and match any plumbed network device to a file whose name has a prefix of “hostname” and an extension matching the interface name. For example, in this VM we have defined two Intel e1000 interfaces which will be plumbed using the following commands.

root@uss1:~# ifconfig e1000g0 plumb
root@uss1:~# ifconfig e1000g1 plumb

Once plumbed these network devices will be enumerated by the physical:default service and if a file exists in the /etc directory named hostname.e1000g0 the service will use the content of this file to configure this interface in the format that ifconfig uses. Here we have created the file using echo, the “uss1.esan.data1” name will be looked up in the hosts file and maps to IP 10.1.0.1, the network mask and broadcast will be assigned as specified.

root@uss1:~# echo uss1.esan.data1 netmask 255.255.0.0 broadcast 10.1.255.255 > /etc/hostname.e1000g0

One important note: if your /etc/hostname.e1000g0 file has blank lines you may find that persistence fails for any interface entry after the blank line, so a sanity check that the file contains no blank lines is advised.
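A quick way to run that sanity check is sketched below. The function name check_no_blank_lines is hypothetical; it uses awk to report any empty or whitespace-only line and returns a non-zero status when one is found.

```shell
#!/bin/sh
# Sketch: report blank (or whitespace-only) lines in an interface file.
# Returns 0 when the file is clean and 1 when a blank line was found.
check_no_blank_lines() {
    awk 'NF == 0 { printf "%s: blank line at line %d\n", FILENAME, NR; bad = 1 }
         END { exit bad + 0 }' "$1"
}

# Example usage (path taken from the text above):
#   check_no_blank_lines /etc/hostname.e1000g0 && echo "no blank lines"
```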

One important requirement is the default gateway or route. Here we will assign a default route through the management network, using gateway 10.0.0.254 as shown in the commands below. We also need to add a route for network 10.1.0.0. Normally the routing function will dynamically assign the route for 10.1.0.0, so assigning a static one ensures that no undesired discovered gateways are found and used, which may cause poor performance.

root@uss1:~# route -p add default 10.0.0.254
root@uss1:~# route -p add 10.1.0.0 10.1.0.1

When using NFS I prefer provisioning name resolution as an additional layer of access control. If we use names to define NFS shares and clients we can externally validate the incoming IP with a static file or DNS based name lookup. An OpenSolaris NFS implementation inherently grants this methodology. When a client IP requests access to an NFS share we can define a forward lookup to ensure the IP maps to a name which is granted access to the targeted share. We can simply define the desired FQDNs against the NFS shares.

In small configurations static files are acceptable, as is the case here. For large host farms the use of a DNS service instance would ease the admin cycle. You would just have to be careful that your cached TimeToLive (TTL) value is greater than 2 hours, thus preventing excessive name resolution traffic. The TTL value controls how long the name is cached and this prevents constant external DNS lookups.

To configure name resolution for both file and DNS we simply copy the predefined config file named nsswitch.dns to the active config file nsswitch.conf as follows:

root@uss1:~# cp /etc/nsswitch.dns /etc/nsswitch.conf

Enabling DNS will require the configuration of our /etc/resolv.conf file which defines our name servers and namespace.

e.g.

root@ss1:~# cat /etc/resolv.conf
domain laspina.ca
nameserver 10.1.0.200
nameserver 10.1.0.201

You can also use the static /etc/hosts file to define any resolvable name to IP mapping, which is my preferred method, but since we are using ESXi I will use DNS to ease the administration cycle and avoid the unsupported console hack of ESXi.

It is now necessary to define a zpool using our VT-d enabled PCI DDRdrive X1 and a VMDK. The VMDK can be located on any suitable VT-d compatible adapter. There is a good chance that some HBA devices will not work correctly with VT-d and your system BIOS. As a tip, I suggest you use a USB disk to provision the ESXi installation as it almost always works and is easy to back up and transfer to other hardware. In this POC I used a 500GB SATA disk attached over an ICH10 AHCI interface. Obviously there are other, better performing disk subsystems available, however this is a POC and not for production consumption.

To establish the zpool we need to identify the PCI to CxTxDx device mappings. There are two ways that I am aware of to find these names. You can comb through the output of the prtconf -v command and look for disk instances and dev_links, or do it the easy way and use the format command as follows.

root@uss1:~# format
Searching for disks…done

AVAILABLE DISK SELECTIONS:
0. c8t0d0 <DEFAULT cyl 4093 alt 2 hd 128 sec 32>
/pci@0,0/pci15ad,1976@10/sd@0,0
1. c8t1d0 <VMware-Virtual disk-1.0-256.00GB>
/pci@0,0/pci15ad,1976@10/sd@1,0
2. c11t0d0 <DDRDRIVE-X1-0030-3.87GB>
/pci@0,0/pci15ad,7a0@15/pci19e3,8@0/sd@0,0
Specify disk (enter its number): ^C
root@uss1:~#

With the device link info handy we can define the zpool with the DDRdrive X1 as a ZIL using the following command:

root@uss1:~# zpool create sp1 c8t1d0 log c11t0d0

root@uss1:~# zpool status
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c8t0d0s0  ONLINE       0     0     0

errors: No known data errors

  pool: sp1
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        sp1         ONLINE       0     0     0
          c8t1d0    ONLINE       0     0     0
        logs
          c11t0d0   ONLINE       0     0     0

errors: No known data errors

With a functional IP interface and ZFS pool complete you can define the NFS share and ZFS file system. Always define NFS properties using zfs set sharenfs=. The share parameters are then stored as part of the ZFS file system, which is ideal for system failure recovery or ZFS relocation.

zfs create -p sp1/nas/vol0
zfs set mountpoint=/export/uss1-nas-vol0 sp1/nas/vol0
zfs set sharenfs=rw,nosuid,root=vh3-nas:vh2-nas:vh1-nas:vh0-nas sp1/nas/vol0

To connect a VMware ESXi host to this NFS store(s) we need to define a vmkernel network interface which I like to name eSAN-Interface1. This interface should only connect to the storage network vSwitch. The management network and VM network should be on another separate vSwitch.

vmkernel eSAN-Interface1

Since we are encapsulating the storage VM on the same server we also need to connect the VM to the storage interface over a VM network port group as shown above. At this point we have all the base NFS services ready, and we can now connect our ESXi host to the newly defined NAS storage target.

Add NFS Store

Thus we now have an encapsulated NFS storage VM provisioning an NFS share to its parent hypervisor.

Encapsulated NFS Share

You may have noticed that the capacity of this share is ~390GB, however we only granted a 256GB vmdk to this storage VM. The capacity anomaly is the result of ZFS deduplication on the shared file system. There are 10 16GB Windows XP hosts and 2 32GB Linux hosts located on this file system, which would normally require 224GB of storage. Obviously dedup is a serious benefit in this case, however you need to be aware of the costs. In order to sustain performance levels similar to non-deduped storage you MUST grant the ZFS code sufficient memory to hold the block hash table in memory. If this memory is not provisioned in sufficient amounts, your storage VM will be relegated to what appears to be a permanent storage bottleneck; in other words you will enter a “processing time vortex”. (Thus, as I have cautioned in the past, ZFS dedup is maturing and needs some code changes before trusting it with mission critical loads. Always test, test, test and repeat until your head spins.)

Here’s the result of using dedup within the encapsulated storage VM.

root@uss1:~# zpool list
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
rpool  7.94G  3.64G  4.30G    45%  1.00x  ONLINE  -
sp1     254G  24.9G   229G     9%  6.97x  ONLINE  -

And here’s a look at what it’s serving.

Encapsulated VM

Incredibly, the IO performance is simply jaw-droppingly fast. Here we are observing a grueling 100% random read load at 512 bytes per request. Yes, that’s correct, we are reaching 40,420 IOs per second.

Sample IOMeter IOPS

Even more incredible is the IO performance with a 100% random write load at 512 bytes per request. It’s simply unbelievable seeing 38,491 IOs per second inside a VM which is served from a peer VM, all on the same hypervisor.

Sample IOMeter IOPS 100% Random 512 Byte Writes

With a successfully configured and operational NFS share provisioned, the next logical task is to define and automate the replication of this share, and any other shares we may wish to add, to a neighboring encapsulated storage VM or for that matter any OpenSolaris host.

The basic elements of this functionality are as follows:

Define a dedicated secured user to execute the replication functions.
Grant the appropriate permissions to this user to access cron and ZFS.
Assign an RSA Key pair for automated ssh authentication.
Define a snapshot replication script using ZFS send/receive calls.
Define a cron job to regularly invoke the script.

Let’s define the dedicated replication user. In this example I will use the name zfsadm.

First we need to create the zfsadm user on all of our storage VMs.

root@uss1:~# useradd -s /bin/bash -d /export/home/zfsadm -P 'ZFS File System Management' zfsadm
root@uss1:~# mkdir /export/home/zfsadm
root@uss1:~# cp /etc/skel/* /export/home/zfsadm
root@uss1:~# echo PATH=/bin:/sbin:/usr/ucb:/etc:. > /export/home/zfsadm/.profile
root@uss1:~# echo export PATH >> /export/home/zfsadm/.profile
root@uss1:~# echo PS1=$'${LOGNAME}@$(/usr/bin/hostname)'~#' ' >> /export/home/zfsadm/.profile

root@uss1:~# chown -R zfsadm /export/home/zfsadm
root@uss1:~# passwd zfsadm

In order to use an RSA key for authentication we must first generate an RSA private/public key pair on the storage head. This is performed using ssh-keygen while logged in as the zfsadm user. You must set the passphrase as blank otherwise the session will prompt for it.

root@uss1:~# su - zfsadm

zfsadm@uss1~#ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/export/home/zfsadm/.ssh/id_rsa):
Created directory '/export/home/zfsadm/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /export/home/zfsadm/.ssh/id_rsa.
Your public key has been saved in /export/home/zfsadm/.ssh/id_rsa.pub.
The key fingerprint is:
0c:82:88:fa:46:c7:a2:6c:e2:28:5e:13:0f:a2:38:7f zfsadm@uss1
zfsadm@uss1~#

The id_rsa file should not be exposed outside of this directory as it contains the private key of the pair. Only the public key file id_rsa.pub needs to be exported. Now that our key pair is generated we need to append the public portion of the key pair to a file named authorized_keys2.

# cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys2

Repeat all the crypto key steps on the target VM as well.

We will use the Secure Copy command to place the public key file in the home directory of the zfsadm user on the target host. It’s very important that the private key is secured properly. It is not necessary to back it up, as you can regenerate the key pair if required.

From the local server here named uss1 (The remote server is uss2)

zfsadm@uss1~# scp $HOME/.ssh/id_rsa.pub uss2:$HOME/.ssh/uss1.pub
Password:
id_rsa.pub 100% |**********************************************| 603 00:00
zfsadm@uss1~# scp uss2:$HOME/.ssh/id_rsa.pub $HOME/.ssh/uss2.pub
Password:
id_rsa.pub 100% |**********************************************| 603 00:00
zfsadm@uss1~# cat $HOME/.ssh/uss2.pub >> $HOME/.ssh/authorized_keys2

And on the remote server uss2

# ssh uss2
password:
zfsadm@uss2~# cat $HOME/.ssh/uss1.pub >> $HOME/.ssh/authorized_keys2
# exit

Now that we are able to authenticate without a password prompt we need to define the automated replication launch using cron. Rather than using the /etc/cron.allow file to grant permissions to the zfsadm user, we are going to use a finer instrument and grant the user access at the user properties level as shown here. Keep in mind you cannot use both methods simultaneously.

root@uss1~# usermod -A solaris.jobs.user zfsadm
root@uss1~# crontab -e zfsadm
59 23 * * * ./zfs-daily-rpl.sh zfs-daily.rpl

Hint: crontab uses vi – http://www.kcomputing.com/kcvi.pdf “vi cheat sheet”

The key sequence is: hit “i”, key in the line, then hit “esc :wq” to save, or “esc :q!” to abort.

Be aware of the timezone the cron service runs under. You should check it and adjust it if required. Here is an example of what’s required to check and set it.

root@uss1~# pargs -e `pgrep -f /usr/sbin/cron`

8550: /usr/sbin/cron
envp[0]: LOGNAME=root
envp[1]: _=/usr/sbin/cron
envp[2]: LANG=en_US.UTF-8
envp[3]: PATH=/usr/sbin:/usr/bin
envp[4]: PWD=/root
envp[5]: SMF_FMRI=svc:/system/cron:default
envp[6]: SMF_METHOD=start
envp[7]: SMF_RESTARTER=svc:/system/svc/restarter:default
envp[8]: SMF_ZONENAME=global
envp[9]: TZ=PST8PDT

Let’s change it to CST6CDT

root@uss1~# svccfg -s system/cron:default setenv TZ CST6CDT

Also the default environment path for cron may cause some script “command not found” issues, check for a path and adjust it if required.

root@uss1~# cat /etc/default/cron
#
# Copyright 1991 Sun Microsystems, Inc. All rights reserved.
# Use is subject to license terms.
#
#pragma ident “%Z%%M% %I% %E% SMI”
CRONLOG=YES

This one has no default path, so append one using echo. Note the use of >> so that the existing CRONLOG=YES entry is preserved.

root@uss1~# echo PATH=/usr/bin:/usr/sbin:/usr/ucb:/etc:. >> /etc/default/cron
# svcadm refresh cron
# svcadm restart cron

The final part of the replication process is a script that will handle the ZFS send/recv invocations. I have written a script in the past that can serve this task with some very minor changes.

Here is the link for the modified zfs-daily-rpl.sh replication script. You will need to grant exec rights to this file, e.g.

# chmod 755 zfs-daily-rpl.sh

This script requires that a zpool named sp2 exists on the target system, as this is shamefully hard coded in the script.

A file containing the file system to replicate and the target are required as well.

e.g.

zfs-daily-rpl.sh filesystems.lst

Where filesystems.lst contains:

sp1/nas/vol0 uss2
sp1/nas/vol1 uss2
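To make the list file format concrete, here is a minimal dry-run sketch of how such a file could be walked. This is not the zfs-daily-rpl.sh script referenced above; it is a hypothetical illustration that only prints the commands a replication pass might issue. The function name plan_replication and the snapshot names passed to it are assumptions, and TARGET_POOL simply mirrors the hard coded sp2 mentioned earlier. It assumes a POSIX style shell.

```shell
#!/bin/sh
# Dry-run sketch, NOT the zfs-daily-rpl.sh script: it only prints the
# commands a replication pass could run for each "filesystem host" line.
# TARGET_POOL mirrors the hard coded sp2; snapshot names are hypothetical.
TARGET_POOL=sp2

plan_replication() {
    list=$1; prev=$2; curr=$3
    while read -r fs host; do
        # Skip blank lines and comment lines in the list file.
        case "$fs" in ''|'#'*) continue ;; esac
        echo "zfs snapshot ${fs}@${curr}"
        # ${fs#*/} drops the source pool name so the file system lands under TARGET_POOL.
        echo "zfs send -p -i ${fs}@${prev} ${fs}@${curr} | ssh ${host} zfs receive -F ${TARGET_POOL}/${fs#*/}"
    done < "$list"
}

# Example usage:
#   plan_replication filesystems.lst daily-1 daily-2
```

Reviewing the printed plan before wiring anything into cron is a cheap way to confirm that each source file system maps to the intended host and target path.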

With any ZFS replicated file system that you wish to invoke on a remote host, it is important to remember not to make changes to the active replication stream. You must take a clone of this replication stream, which avoids forcing a complete resend or other replication issues when you wish to test or validate that it’s operating as you expect.

For example:

We take a clone of one of the snapshots and then share it via NFS:

root@uss2~# zfs clone sp2/nas/vol0@31-04-10-23:59 sp2/clones/uss1/nas/vol0
root@uss2~# zfs set mountpoint=/export/uss1-nas-vol0 sp2/clones/uss1/nas/vol0
root@uss2~# zfs set sharenfs=rw,nosuid,root=vh3-nas:vh2-nas:vh1-nas:vh0-nas sp2/clones/uss1/nas/vol0

Well I hope you found this entry interesting.

Regards,

Mike

May 6th, 2010 | Tags: Acceleration, Dedup, Encapsulation, IOMMU, NFS, Storage, VMware, VT-d, zfs.
Categories: Storage, VMware | Comments: 42 Comments |

OpenSolaris Storage Summit 2009

The OpenSolaris Storage Summit was really cool to attend this year. Mike Shapiro presented an interesting view of what is transpiring in storage hardware and where storage vendors need to focus in order to be successful in the next few years. As always his presentation was a pleasure to follow. He talked about the s7000 series development and where it fits in terms of the current commodity hardware advancements. It was exciting to hear that we will see COMSTAR integration in the next firmware release coming in the second quarter of 2009. With the inclusion of COMSTAR we will have a very comprehensive storage provisioning solution that is fully supported by Sun.

I also had the pleasure of hearing Don MacAskill speak on his experiences with OpenSolaris and the voyage that brought him to success on the s7000 product. His content was brilliant as usual and hopefully he will share more on SmugMug’s blog site.

I presented in the afternoon and talked about using COMSTAR to re-provision existing storage systems in an effort to enhance the performance and capacity of these aging products and retain their value, as well as to bring some desired features to them like compression, snapshots and replication without the high cost of licenses on the native systems. I also created a couple of video frame stop demos. The first one demonstrated the ability to attach existing storage systems with Fiber Channel and re-provision LUs which can then be transitioned from one storage head to another without impacting a storage consumer connection. In this case the consumer was VMware and was attached over both Fiber Channel and iSCSI in a multi path, multi protocol configuration. The second demo revealed the cool world of encapsulation by virtualizing an OpenSolaris storage server within VMware and then replicating ZFS from an X4500 to the virtualized OpenSolaris VM. Once in a virtual state we exposed replicated iSCSI targets to the underlying VMware ESXi server and attached to the VMFS volumes presented on the LUs.

Ben Rockwood also presented in the afternoon and it was a great pleasure to see. He discussed his knowledge of ZFS, specifically some of the things he has discovered as best practices and the use of tools. It was very informative and I wish he had had much more time because the content was exceptional. All of the presenters, both mentioned and not, were really great. I would like to thank all of them for giving us their valuable time in the OpenSolaris community efforts.

If you’re interested in the content please visit OpenSolaris Storage Summit

Regards,

Mike

February 25th, 2009 | Tags: opensolaris, presentations, Storage, summit, sun.
Categories: General, Storage | Comments: No Comments |

Multi Protocol Storage Provisioning with COMSTAR

COMSTAR is a new breed of open source storage product available to the world. What was traditionally a closed and proprietary storage capability is now available to our open source communities. With OpenSolaris and COMSTAR the ability to freely provision virtual storage services over very mature high end protocols on standard commodity server hardware is now a reality. High performance transports are integral within the feature sets of COMSTAR and Sun’s open source portfolio of projects. The COMSTAR product is revolutionary in its method of provisioning storage virtualization and transport services to storage resource consumers.

COMSTAR provisions virtualized SCSI block storage over multiple SCSI transport protocols. While this function class is not new to us the ease of implementation using COMSTAR certainly is. All the complexities of using a multi protocol target services platform are cleaned up. It is simple to use and facilitates advanced high performance storage provisioning at the block level.

The services within this product have multiple common storage provisioning applications. One very interesting application is a storage gateway server, and this blog demonstrates how to build a Fiber Channel (FC) storage gateway using the COMSTAR service layers, as well as how to provision some additional features using the target services.

COMSTAR FC Gateway Architecture by Mike La Spina

In this example instance we are re-provisioning an existing storage system with an OpenSolaris COMSTAR configuration running on a commodity white box which functions as a storage server head that can compress, scrub, thin provision, replicate, snapshot and clone the existing block storage attachments. The example FC based storage could also be comprised of a JBOD FC array directly attached to the OpenSolaris storage head if we so desired or many other commonly available SCSI attachment methods. The objective here is to extend and enhance any block storage system with high performance transports and virtualization features. Of course we could also formalize the white box to an industrial strength host once we are satisfied that the proof of concept is mature and optimal.

The reality is that many older existing FC storage systems are installed without these features, primarily due to their excessive licensing costs. And even when these features are available, their use is probably restricted to like proprietary systems, thus obsolescing the entire lot of any useful future functionality. But what if you could re-purpose an older storage system to act as a DR store, a backup cache system or maybe a test and development environment? With today’s economy this is, from a cost perspective, very attractive and can be accomplished with very little risk on the investment side.

One of the possible applications for this flexible storage service is the re-provisioning of existing LUNs from an existing system to newer, more flexible SCSI transport protocols. This is particularly useful when we need to re-target the existing storage system from FC to iSCSI or the like. We can begin by exploring this functionality and explaining how COMSTAR can provide us with this service.

First we need to understand the high level functionality of the COMSTAR service layers. Virtual LUN’s on COMSTAR are provisioned with a service layer named the LU provider. This layer maps backing stores of various types to a storage GUID assignment and additionally defines other properties like the LUN ID and size dimensions. This layer allows us to carve out the available block storage devices that are accessible on our OpenSolaris storage host. For example if we attached an FC Initiator to an external storage system we can then map the accessible SCSI block devices to the LU provider layer and then present this virtualized LUN to the other COMSTAR service layers for further processing.

Once we have defined the LU’s we can present this storage resource to the SCSI Target Mode Framework Service (STMF) layer which acts as the storage gate keeper. At this layer we define which clients (initiators) can connect to the LU’s based on Membership of Target Groups and Host Groups that are assigned logical views of the LU(s). The STMF layer routes the defined LU(s) as SCSI targets over a multiprotocol interface connection pool to a Port Provider. Port Providers are the protocol connection service instances which can be the likes of FC, iSCSI, SAS, iSER, FCoE and so on.
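As a rough sketch of what the STMF gate keeper role looks like in practice, the dry-run function below prints the stmfadm calls that would create a host group, add an initiator to it and give that group a view of an LU. Nothing is executed, and the group name, the initiator WWN and the LU GUID in the example are hypothetical placeholders rather than values from a real system.

```shell
#!/bin/sh
# Dry-run sketch: prints the stmfadm commands instead of running them.
# The host group name, initiator name and LU GUID are placeholders.
plan_lu_view() {
    hg=$1; initiator=$2; lun=$3; guid=$4
    # Create a host group and add the client initiator to it.
    echo "stmfadm create-hg ${hg}"
    echo "stmfadm add-hg-member -g ${hg} ${initiator}"
    # Present the LU to that host group as the given LUN number.
    echo "stmfadm add-view -h ${hg} -n ${lun} ${guid}"
}

plan_lu_view esx-hosts wwn.210000E08B000001 0 600144F00000000000000000000000AA
```

Only initiators that are members of the host group would see the LU through this view, which is the access control described above.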

With these COMSTAR basics in mind let us begin by diving into some of the details of how this can be applied.

Sun has detailed how to set up COMSTAR at dlc.sun.com, so there is no need to re-invent the wheel here.

Just as a note SXCE snv_103 and up integrate the COMSTAR FC and iSCSI port provider code. With the COMSTAR software components and FC target setup we can demonstrate the re-provisioning of an existing FC based storage server. Since I don’t have the luxury of having a proprietary storage server at home I will emulate this storage using an additional COMSTAR white box to act as the FC storage target to be re-provisioned.

On the existing FC target system we need to create Raid0 arrays of three disks each, for a total of six trios. We will use these six non-fault tolerant disk groups as vdevs for a ZFS raidz2 group. This will allow us to create fault tolerant arrays from the existing storage server. The reasons for using sets of three-disk Raid0 groupings are to reduce the possibility of reaching the LUN maximums of the proprietary storage system and to avoid eroding performance by layering Raid 5 groups. As well, we can tolerate a disk failure in two of the trios since we have Raidz2 across the Raid0 trio groups. Additionally, using these Raid0 disk groups actually lowers the array failure probability rate. For example, if a second disk were to fail within the same Raid0 set there would be no additional impact on the other trios, thus reducing the overall failure rate.

To create the emulated FC storage system I have defined the following 16G ZFS sparse volumes respectively named trio1 through trio6 each as a representation of the 3 disk Raid0 spanned LUN on a source storage host named ss1.

root@ss1:~# zfs create sp1/gw
root@ss1:~# zfs create -s -V 16G sp1/gw/trio1
root@ss1:~# zfs create -s -V 16G sp1/gw/trio2
root@ss1:~# zfs create -s -V 16G sp1/gw/trio3
root@ss1:~# zfs create -s -V 16G sp1/gw/trio4
root@ss1:~# zfs create -s -V 16G sp1/gw/trio5
root@ss1:~# zfs create -s -V 16G sp1/gw/trio6

Once these mockup volumes are created they are then defined as backing stores using the sbdadm utility as follows.

root@ss1:~# sbdadm create-lu /dev/zvol/rdsk/sp1/gw/trio1

Created the following LU:

              GUID                    DATA SIZE           SOURCE
--------------------------------  -------------------  ----------------
600144f01eb3862c0000494b55cd0001      17179803648      /dev/zvol/rdsk/sp1/gw/trio1

All the backing stores were added to the LU provider service layer, which in turn made them available to the STMF service layer. Here we can see the automatically generated GUID’s that are assigned to the ZFS backing stores.

root@ss1:~# sbdadm list-lu

Found 6 LU(s)

              GUID                    DATA SIZE           SOURCE
--------------------------------  -------------------  ----------------
600144f01eb3862c0000494b56000006      17179803648      /dev/zvol/rdsk/sp1/gw/trio6
600144f01eb3862c0000494b55fd0005      17179803648      /dev/zvol/rdsk/sp1/gw/trio5
600144f01eb3862c0000494b55fa0004      17179803648      /dev/zvol/rdsk/sp1/gw/trio4
600144f01eb3862c0000494b55f80003      17179803648      /dev/zvol/rdsk/sp1/gw/trio3
600144f01eb3862c0000494b55f50002      17179803648      /dev/zvol/rdsk/sp1/gw/trio2
600144f01eb3862c0000494b55cd0001      17179803648      /dev/zvol/rdsk/sp1/gw/trio1
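As a quick sanity check on the DATA SIZE column, the arithmetic below compares the 16G requested for each volume with the size sbdadm reports. The two differ by exactly 64 KiB. My assumption, which the listing itself does not state, is that the LU provider reserves that space in the backing store for its own metadata.

```shell
# Size requested with "zfs create -V 16G", in bytes.
zvol_bytes=$((16 * 1024 * 1024 * 1024))
# DATA SIZE reported by sbdadm for each trio LU.
lu_bytes=17179803648
# Difference between the two, in bytes.
overhead=$((zvol_bytes - lu_bytes))
echo "$overhead"    # prints 65536
```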

A host group was defined named GW1 and respectively these LU GUID’s were added to the GW1 host group as LU views assigning LUN 0 to 5.

Just as a note the group names are case sensitive.

root@ss1:~# stmfadm create-hg GW1

Here we assigned the GUID’s a LUN value on the GW1 host group with the -n parameter.

root@ss1:~# stmfadm add-view -h GW1 -n 0 600144F01EB3862C0000494B55CD0001
root@ss1:~# stmfadm add-view -h GW1 -n 1 600144F01EB3862C0000494B55F50002
root@ss1:~# stmfadm add-view -h GW1 -n 2 600144F01EB3862C0000494B55F80003
root@ss1:~# stmfadm add-view -h GW1 -n 3 600144F01EB3862C0000494B55FA0004
root@ss1:~# stmfadm add-view -h GW1 -n 4 600144F01EB3862C0000494B55FD0005
root@ss1:~# stmfadm add-view -h GW1 -n 5 600144F01EB3862C0000494B56000006
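With more than a handful of LU’s the view commands are easier to generate than to type. The following is only a sketch: gen_add_views is a hypothetical helper name of mine, and it prints the stmfadm commands for review rather than running them. LUN numbers are assigned in the order the GUID’s are passed.

```shell
# Print (do not run) one add-view command per GUID, numbering LUNs from 0.
# Usage: gen_add_views HOSTGROUP GUID...
gen_add_views() {
    hg=$1
    shift
    lun=0
    for guid in "$@"; do
        echo "stmfadm add-view -h $hg -n $lun $guid"
        lun=$((lun + 1))
    done
}

gen_add_views GW1 \
    600144F01EB3862C0000494B55CD0001 \
    600144F01EB3862C0000494B55F50002
```

Once the printed commands look right they can be pasted into the root shell on the target host.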

With the LU’s now available in a host group view we can add the COMSTAR re-provisioning gateway server FC wwn’s to this host group and it will become available as a storage resource on the re-provisioning gateway server named ss2. We need to obtain the wwn from the gateway server using the fcinfo hba-port command.
root@ss2:~# fcinfo hba-port
HBA Port WWN: 210000e08b100163
        Port Mode: Initiator
        Port ID: 10300
        OS Device Name: /dev/cfg/c8
        Manufacturer: QLogic Corp.
        Model: QLA2300
        Firmware Version: 03.03.27
        FCode/BIOS Version: BIOS: 1.47;
        Serial Number: not available
        Driver Name: qlc
        Driver Version: 20080617-2.30
        Type: N-port
        State: online
        Supported Speeds: 1Gb 2Gb
        Current Speed: 2Gb
        Node WWN: 200000e08b100163
        NPIV Not Supported

Using the stmfadm utility we add the gateway server’s wwn address to the GW1 host group.
root@ss1:~# stmfadm add-hg-member -g GW1 wwn.210000e08b100163

Once added to ss1 we can see that it is indeed available and online.
root@ss1:~# stmfadm list-target -v

Target: wwn.2100001B320EFD58
    Operational Status: Online
    Provider Name     : qlt
    Alias             : qlt2,0
    Sessions          : 1
        Initiator: wwn.210000E08B100163
            Alias: :qlc1
            Logged in since: Fri Dec 19 01:47:07 2008

The cfgadm command will scan for the newly available LUN’s and now we can access the emulated (aka boat anchor) storage system using our gateway server ss2. Of course we could also set up more initiators and access it over a multipath connection.

root@ss2:~# cfgadm -a

root@ss2:~# format
Searching for disks…done
AVAILABLE DISK SELECTIONS:
       0. c0t600144F01EB3862C0000494B55CD0001d0 <DEFAULT cyl 2086 alt 2 hd 255 sec 63>
          /scsi_vhci/disk@g600144f01eb3862c0000494b55cd0001
       1. c0t600144F01EB3862C0000494B55F50002d0 <DEFAULT cyl 2086 alt 2 hd 255 sec 63>
          /scsi_vhci/disk@g600144f01eb3862c0000494b55f50002
       2. c0t600144F01EB3862C0000494B55F80003d0 <DEFAULT cyl 2086 alt 2 hd 255 sec 63>
          /scsi_vhci/disk@g600144f01eb3862c0000494b55f80003
       3. c0t600144F01EB3862C0000494B55FA0004d0 <DEFAULT cyl 2086 alt 2 hd 255 sec 63>
          /scsi_vhci/disk@g600144f01eb3862c0000494b55fa0004
       4. c0t600144F01EB3862C0000494B55FD0005d0 <DEFAULT cyl 2086 alt 2 hd 255 sec 63>
          /scsi_vhci/disk@g600144f01eb3862c0000494b55fd0005
       5. c0t600144F01EB3862C0000494B56000006d0 <DEFAULT cyl 2086 alt 2 hd 255 sec 63>
          /scsi_vhci/disk@g600144f01eb3862c0000494b56000006

Now that we have some FC LUN connections configured to the storage system to be re-provisioned, we can create a ZFS based pool which grants us the ability to carve out the block storage in a virtual manner. As discussed previously we will use double parity raid, a.k.a. raidz2, to provide a higher level of availability by passing the raidz2 option to the zpool create command.

root@ss2:~# zpool create gwrp1 raidz2 c0t600144F01EB3862C0000494B55CD0001d0 c0t600144F01EB3862C0000494B55F50002d0 c0t600144F01EB3862C0000494B55F80003d0 c0t600144F01EB3862C0000494B55FA0004d0 c0t600144F01EB3862C0000494B55FD0005d0 c0t600144F01EB3862C0000494B56000006d0

A quick status check reveals all is well with the ZFS pool.

root@ss2:~# zpool status gwrp1
pool: gwrp1
state: ONLINE
scrub: none requested
config:

        NAME                                       STATE     READ WRITE CKSUM
        gwrp1                                      ONLINE       0     0     0
          raidz2                                   ONLINE       0     0     0
            c0t600144F01EB3862C0000494B55CD0001d0 ONLINE       0     0     0
            c0t600144F01EB3862C0000494B55F50002d0 ONLINE       0     0     0
            c0t600144F01EB3862C0000494B55F80003d0 ONLINE       0     0     0
            c0t600144F01EB3862C0000494B55FA0004d0 ONLINE       0     0     0
            c0t600144F01EB3862C0000494B55FD0005d0 ONLINE       0     0     0
            c0t600144F01EB3862C0000494B56000006d0 ONLINE       0     0     0

Let’s carve out some of this newly created pool as a 32GB sparse volume. The -p option creates the full path if it does not currently exist.

root@ss2:~# zfs create -p -s -V 32G gwrp1/stores/lun0

root@ss2:~# zfs list
NAME                         USED AVAIL REFER MOUNTPOINT
gwrp1                        220K 62.6G 38.0K /gwrp1
gwrp1/stores                67.9K 62.6G 36.0K /gwrp1/stores
gwrp1/stores/lun0           32.0K 62.6G 32.0K -

With a slice of the pool created we can now assign a GUID within the LU Provider layer using the sbdadm utility.

root@ss2:~# sbdadm create-lu /dev/zvol/rdsk/gwrp1/stores/lun0

Created the following LU:

              GUID                    DATA SIZE           SOURCE
--------------------------------  -------------------  ----------------
600144f07ed404000000496813070001      34359672832      /dev/zvol/rdsk/gwrp1/stores/lun0

The LU Provider layer can also provision sparse based storage. However in this case the ZFS backing store is already thin provisioned. If this were a physical disk backing store it would be prudent to use the LU Provider’s sparse/thin provisioning feature. At this point we are ready to create the STMF Host Group and View that will be used to demonstrate a real world example of the multi protocol capability with the COMSTAR OpenStorage ss2 host. In this case I will use VMware ESX as a storage consumer. To reflect the host group type we will name it ESX1 and then we need to add a view for the LU GUID of the virtualized storage.

root@ss2:~# stmfadm create-hg ESX1

root@ss2:~# stmfadm add-view -h ESX1 -n 1 600144f07ed404000000496813070001

root@ss2:~# stmfadm list-view -l 600144F07ED404000000496813070001
View Entry: 0
    Host group   : ESX1
    Target group : All
    LUN          : 1

With a view defined for the VMware hosts let’s add an ESX host FC HBA wwn membership to the defined ESX1 host group. We need to retrieve the wwn from the VMware server using either the console or a Virtual Infrastructure Client GUI. Personally I like the console esxcfg-info tool, however if it’s an ESXi host then the GUI will serve the info just as well.

VMware Screen shot WWN by Mike La Spina

[root@vh1 root]# esxcfg-info -s | grep 'Adapter WWN'
|----Adapter WWNN............................20:00:00:e0:8b:01:f7:e2

root@ss2:~# stmfadm add-hg-member -g ESX1 wwn.210000e08b01f7e2
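ESX prints WWN’s with colon separators while stmfadm takes a wwn. prefix followed by the bare hex digits. A small conversion helper can be sketched as below; to_stmf_wwn is a hypothetical name of mine. Note that the grep output above shows the node name (20:00:...), whereas the member added to the host group starts with 21:00, the same node/port pattern seen in the earlier fcinfo output, so the input here is assumed to be the port name.

```shell
# Convert a colon-separated WWN into the wwn.<hex> form used with stmfadm.
to_stmf_wwn() {
    printf 'wwn.%s\n' "$(printf '%s' "$1" | tr -d ':' | tr 'A-F' 'a-f')"
}

to_stmf_wwn 21:00:00:e0:8b:01:f7:e2    # prints wwn.210000e08b01f7e2
```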

After we issue a rescan on vmhba1 and create a VMFS volume named ss2-cstar-zs0.0 on the re-provisioned storage, the result of this change is reflected here.

VMware Screen shot VMFS volume by Mike La Spina

This crafted storage is now a thinly provisioned VMFS store that can deliver replication, snapshots, cloning and advanced error detection, and it can also be re-platformed to a new storage system at a later date thanks to ZFS’s hardware autonomy. The storage server is very attractive as it creates a level of future proofing and insulates the storage consumers from proprietary vendor lock in. But that’s not the best part of this example. Let’s say you wish to provide different tiers of connectivity services for your storage consumers. For example we could attach a development or test environment using the iSCSI protocol while the more critical environments use an FC or FCoE based protocol.

So let’s look at how we can add a second SCSI transport protocol to this interesting configuration.

Just as a note the new iSCSI port provider is a kernel based implementation and has superior performance to its predecessor iscsitgt user land implementation.

To add the iSCSI protocol we need to enable the iscsi/target port provider service.

root@ss2:~# svcadm enable iscsi/target

Now we need to create an iSCSI target and iSCSI initiator definition so that we can add the iSCSI initiator to the ESX1 host group. As well we should define a target portal group so we can control what host IP(s) will service this target.

root@ss2:~# itadm create-tpg 2 10.0.0.1

root@ss2:~# itadm create-target

root@ss2:~# itadm create-target -n iqn.1986-03.com.sun:02:ss2.0 -t 2
Target iqn.1986-03.com.sun:02:ss2.0 successfully created

By default the iqn will be created as a member of the All targets group.

If we left out the parameters the itadm utility would generate an iqn with a GUID-based name and use the default target portal group of 1. And yes, for those familiar with the predecessor iscsitadm utility, we can now specify an iqn name at the command line.

At this point we need to define the initiator iqn to the iSCSI port provider service and, if required, additionally secure it using CHAP. We need to retrieve the VMware initiator iqn name from either the Virtual Infrastructure Client GUI or the console command line. Just as a note, if we had not specified a host group when we defined our view, the default would allow any initiator, FC, iSCSI or otherwise, to connect to the LU. This may have a purpose, but it is generally a bad practice in most configurations. Once created, the initiator is added to the ESX1 host group, which enables our second access protocol to the same LU.

[root@vh1 root]# esxcfg-info -s | grep 'iqn'
|----ISCSI Name........................................iqn.1998-01.com.vmware:vh1.1
|----ISCSI Alias.......................................iqn.1998-01.com.vmware:vh1.1

root@ss2:~# itadm create-initiator iqn.1998-01.com.vmware:vh1.1

root@ss2:~# stmfadm add-hg-member -g ESX1 iqn.1998-01.com.vmware:vh1.1

After adding the ss2 iSCSI interface IP to VMware’s Software iSCSI initiator we now have a multipath multiprotocol connection to our COMSTAR storage host.

VMware iqn example By Mike La Spina

VMware mpath example by Mike La Spina

This is simply the most functional and advanced Open Source storage product in the world today. Here we have commodity white boxes serving advanced storage protocols in my home lab; can you imagine what could be done with Data Center class server hardware and Fishworks? You can begin to see the advantages of this future proof platform. With protocols like FCoE, Infiniband and iSER (iSCSI without the TCP session overhead) already working in COMSTAR, the Sun Software Engineers and OpenSolaris community are crafting outstanding storage products.

Hope you found this blog to be interesting.

Regards,

Mike

January 26th, 2009 | Tags: clones, comstar, fc, fcoe, gateway, iscsi, iser, multi, opensolaris, protocol, provision, replication, reprovision, server, snapshot, sparse, Storage, target, thin, vmfs, VMware, zfs.
Categories: Storage, VMware

Understanding VMFS volumes

Understanding VMFS volumes is an important element within VMware ESX environments. When storage issues surface we need to correctly evaluate the VMFS volume states and apply the appropriate corrective actions to remediate undesirable storage events. The VMFS architecture is not publicly documented, and this certainly adds to the challenge when we need to correct a volume configuration or change issue. So let’s begin to look at the components of a VMFS volume from what I have been able to decipher using direct analysis.

All VMFS volume partitions will have a partition ID value of fb. Running fdisk can identify any partitions that are flagged as VMFS as shown here.

[root@vh1 ]# fdisk -lu /dev/sdc

Disk /dev/sdc: 274.8 GB, 274877889536 bytes
255 heads, 63 sectors/track, 33418 cylinders, total 536870878 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1             128   536860169   268430021   fb  Unknown

What’s important to note here is the sector size = 512 and the starting/ending blocks.
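The numbers in the fdisk listing can be cross-checked with a little shell arithmetic. This is just a worked example using the values printed above; the variable names are mine.

```shell
sector_size=512
total_sectors=536870878
start=128
end=536860169

# Total device size in bytes, as printed on the "Disk /dev/sdc" line.
disk_bytes=$((total_sectors * sector_size))
# Partition length in sectors (both endpoints inclusive).
part_sectors=$((end - start + 1))
# The Blocks column counts 1 KiB units, i.e. two 512-byte sectors each.
part_blocks=$((part_sectors / 2))

echo "$disk_bytes $part_sectors $part_blocks"
# prints 274877889536 536860042 268430021
```

If you ever have to recreate the partition by hand, these are the start and end values that must come out the same.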

Many VMFS volume configuration elements are visible in the /vmfs mount folder. Within this directory there are two subdirectories, the volumes directory and the devices directory. The volumes directory provides the mount point locations and the devices directory holds configuration elements. Within the devices directory there are several subdirectories, of which I can explain the disks and lvm folders; the others are known to me only in theory.

A key part of a VMFS volume is its UUID (aka Universally Unique Identifier) and, as the name suggests, it is used to ensure uniqueness when more than one volume is in use. The UUID is generated on the initial ESX host that created the VMFS volume, based on the UUID creation standards. You can determine which ESX host the VMFS volume was initially created on by referring to the last 6 bytes of the UUID. This value is the same as the last six bytes of the ESX host’s system UUID found in the /etc/vmware/esx.conf file.
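Pulling out that host portion is a one-liner. The sketch below uses a hypothetical helper name, uuid_host_part, and the two volume UUID’s that appear in the listings later in this post; the differing results suggest the two volumes were created on different hosts.

```shell
# Return the last dash-separated field of a VMFS UUID (six bytes, 12 hex digits).
uuid_host_part() {
    echo "${1##*-}"
}

uuid_host_part 49716cd8-ebcbbf9a-6792-000d60d46e2e    # prints 000d60d46e2e
uuid_host_part 48a3b0f3-736b896e-af8f-00025567144e    # prints 00025567144e
```

The result can then be compared by eye against the tail of the system UUID in esx.conf on each candidate host.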

By far one of the most critical elements on a VMFS volume is the GUID. The GUID is integral within the volume because it is used to form the vml path (aka virtual multipath link). The GUID is stored within the VMFS volume header and begins at address 0x10002E.

The format of the GUID can vary based on different implementations of SCSI transport protocols, and you will generally see obvious length variances in the vml path identifiers which stem from the use of T11 and T10 standard SCSI address formats like EUI-64 and NAA 64. Regardless of those variables, there are components outside of the GUID within the vml that we should take notice of. The vml construct contains references to the LUN and partition values and these are useful to know about. The following illustrates where these elements appear in some real examples.

When we issue an ls -l from the /vmfs/devices/disks directory the following info is observed.

vmhba#:Target:LUN:Partition -> vml:??_LUN_??_GUID:Partition

                         LUN   GUID                                         PARTITION
                         ^     ^                                            ^
vmhba0:1:0:0  -> vml.02000000005005076719d163d844544e313436
vmhba0:1:0:1  -> vml.02000000005005076719d163d844544e313436:1
vmhba32:1:3:0 -> vml.0200030000600144f07ed404000000496ff8cd0003434f4d535441
vmhba32:1:3:1 -> vml.0200030000600144f07ed404000000496ff8cd0003434f4d535441:1
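To make the layout concrete, here is a rough parser for these identifiers. The field positions are inferred from the two examples in this listing only and are not a documented format: I am assuming four leading hex digits, two hex digits of LUN, four more hex digits, the GUID, and finally twelve hex digits that look like six ASCII characters of the target model (434f4d535441 decodes to COMSTA). parse_vml is a hypothetical helper name.

```shell
# Split a vml identifier into LUN, GUID and partition.
# Assumed layout (inferred from the examples above, not documented):
#   vml. + 4 hex digits + 2 hex digits of LUN + 4 hex digits + GUID
#   + 12 hex digits (six ASCII characters) + optional :partition
parse_vml() {
    id=${1#vml.}                      # drop the "vml." prefix
    case $id in
        *:*) part=${id##*:}; id=${id%%:*} ;;
        *)   part=0 ;;                # no suffix means the whole device
    esac
    lun_hex=$(printf '%s' "$id" | cut -c5-6)
    guid_end=$(( ${#id} - 12 ))       # GUID stops before the last 12 digits
    guid=$(printf '%s' "$id" | cut -c11-"$guid_end")
    echo "lun=$((0x$lun_hex)) guid=$guid partition=$part"
}

parse_vml vml.0200030000600144f07ed404000000496ff8cd0003434f4d535441:1
# prints lun=3 guid=600144f07ed404000000496ff8cd0003 partition=1
```

Because the GUID length is derived from the total length, the same function copes with both the 8 byte and the 16 byte GUID’s shown above.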

As well, issuing ls -l on /vmfs/volumes lists the VMFS UUID’s and the link names, which are what we see displayed in the GUI client. In this example we will follow the UUID of the volume named ss2-cstar-zs0.2.

ss2-cstar-zs0.2 -> 49716cd8-ebcbbf9a-6792-000d60d46e2e

Additionally we can use esxcfg-vmhbadevs -m to list the vmhba, dev and UUID associations.

[root@vh1 ]# esxcfg-vmhbadevs -m
vmhba0:1:0:1 /dev/sdd1 48a3b0f3-736b896e-af8f-00025567144e
vmhba32:1:3:1 /dev/sdf1 49716cd8-ebcbbf9a-6792-000d60d46e2e

As you can see we indeed have different GUID lengths in this example. We can also see that the vmhba device is linked to a vml construct, and this is how the kernel defines paths to a visible SCSI LUN. The vml path hosts the LUN ID, GUID and partition number information and this is also stored in the volume’s VMFS header. As well, the header contains a UUID signature but this is not the VMFS UUID.

If we use hexdump as illustrated below we can see these elements in the VMFS header directly.

[root@vh1 root]# hexdump -C -s 0x100000 -n 800 /dev/sdf1
00100000 0d d0 01 c0 03 00 00 00 10 00 00 00 02 16 03 00 |                | <- LUN ID
00100010 00 06 53 55 4e 20 20 20 20 20 43 4f 4d 53 54 41 | SUN     COMSTA| <- Target Label
00100020 52 20 20 20 20 20 20 20 20 20 31 2e 30 20 60 01 |R         1.0 ` | <- LUN GUID
00100030 44 f0 7e d4 04 00 00 00 49 6f f8 cd 00 03 43 4f |D ~     Io    CO|
00100040 4d 53 54 41 00 00 00 00 00 00 00 00 00 00 00 00 |MSTA            |
00100050 00 00 00 00 00 00 00 00 00 00 00 02 00 00 00 fc |                | <- Volume Size
00100060 e9 ff 18 00 00 00 01 00 00 00 8f 01 00 00 8e 01 |                |
00100070 00 00 91 01 00 00 00 00 00 00 00 00 10 01 00 00 |                |
00100080 00 00 d8 6c 71 49 b0 aa 97 9b 6c 2f 00 0d 60 d4 |   lqI    l/ ` |
00100090 6e 2e 6e 89 19 fb a6 60 04 00 a7 ce 20 fb a6 60 |n n    `       `|
001000a0 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |                |
001000b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |                |
*
00100200 00 00 00 f0 18 00 00 00 90 01 00 00 00 00 00 00 |                |
00100210 01 00 00 00 34 39 37 31 36 63 64 38 2d 36 30 37 |    49716cd8-607| <- SEG UUID in ASCII
00100220 35 38 39 39 61 2d 61 64 31 63 2d 30 30 30 64 36 |5899a-ad1c-000d6|
00100230 30 64 34 36 65 32 65 00 00 00 00 00 00 00 00 00 |0d46e2e         |
00100240 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |                |
00100250 00 00 00 00 d8 6c 71 49 9a 89 75 60 1c ad 00 0d |     lqI u`    | <- SEG UUID
00100260 60 d4 6e 2e 01 00 00 00 e1 9c 19 fb a6 60 04 00 |` n          ` |
00100270 00 00 00 00 8f 01 00 00 00 00 00 00 00 00 00 00 |                |
00100280 8e 01 00 00 00 00 00 00 64 cc 20 fb a6 60 04 00 |        d    ` |
00100290 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |                |
001002a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |                |

In addition to the VMFS header block we have the hidden metadata files of the volume which you can list using ls -al. The vh.sf contains the UUID of the VMFS store and any member segments info. (I would presume the name vh stands for Volume Header … ;D)

[root@vh1 ]# hexdump -C -s 0x200000 -n 256 /vmfs/volumes/49716cd8-ebcbbf9a-6792-000d60d46e2e/.vh.sf
00200000 5e f1 ab 2f 04 00 00 00 1f d8 6c 71 49 9a bf cb |^   / lqI      | <- VMFS UUID
00200010 eb 92 67 00 0d 60 d4 6e 2e 02 00 00 00 73 73 32 | g ` n     ss2| <- Volume Name
00200020 2d 63 73 74 61 72 2d 7a 73 30 2e 32 00 00 00 00 |-cstar-zs0.2    |
00200030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |                |
*
00200090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02 00 |                |
002000a0 00 00 00 10 00 00 00 00 00 d8 6c 71 49 01 00 00 |          lqI   |
002000b0 00 d8 6c 71 49 9a 89 75 60 1c ad 00 0d 60 d4 6e | lqI u`    ` n| <- SEG UUID
002000c0 2e 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |                |
002000d0 00 00 00 01 00 20 00 00 00 00 00 01 00 00 00 00 |                |
002000e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |                |
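The UUID bytes in these dumps are not stored in the same order as the text form we see under /vmfs/volumes. From comparing the dumps with the known UUID’s, the layout appears to be two little-endian 32-bit fields, one little-endian 16-bit field and then six bytes in stored order. That is my reading of the data rather than a documented format, and vmfs_uuid_from_bytes is a hypothetical helper name.

```shell
# Rebuild the textual UUID from 16 header bytes given as hex arguments.
# Assumed layout: two little-endian 32-bit fields, one little-endian
# 16-bit field, then six bytes in stored order.
vmfs_uuid_from_bytes() {
    printf '%s%s%s%s-%s%s%s%s-%s%s-%s%s%s%s%s%s\n' \
        "$4" "$3" "$2" "$1" "$8" "$7" "$6" "$5" \
        "${10}" "$9" "${11}" "${12}" "${13}" "${14}" "${15}" "${16}"
}

# The 16 bytes starting at offset 0x200009 of the .vh.sf dump.
vmfs_uuid_from_bytes d8 6c 71 49 9a bf cb eb 92 67 00 0d 60 d4 6e 2e
# prints 49716cd8-ebcbbf9a-6792-000d60d46e2e
```

Feeding it the bytes marked SEG UUID gives 49716cd8-6075899a-ad1c-000d60d46e2e, which matches the ASCII copy in the VMFS header dump.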

And of course we can not leave out the partition entry block data for the device.

hexdump -C -n 256 /dev/sdf1

00000000 fa b8 00 10 8e d0 bc 00 b0 b8 00 00 8e d8 8e c0 |                |
00000010 fb be 00 7c bf 00 06 b9 00 02 f3 a4 ea 21 06 00 |   |         ! |
00000020 00 be be 07 38 04 75 0b 83 c6 10 81 fe fe 07 75 |    8 u        u|
00000030 f3 eb 16 b4 02 b0 01 bb 00 7c b2 80 8a 74 01 8b |         |   t |
00000040 4c 02 cd 13 ea 00 7c 00 00 eb fe 00 00 00 00 00 |L     |         |
00000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |                |
*
000001b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02 |                |
000001c0 03 00 fb fe ff ff 80 00 00 00 72 ef bf 5d 00 00 |          r ]  | Type Start End
000001d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |                |
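The first partition entry sits at offset 0x1be of this dump and follows the standard 16 byte MBR entry layout: status, CHS start, type, CHS end, then the starting LBA and the sector count as little-endian 32-bit values. A small decoder can be sketched as follows; le32 is a helper name of my own.

```shell
# Decode a little-endian 32-bit value from four hex bytes.
le32() {
    echo $((0x$4$3$2$1))
}

# First partition entry, starting at offset 0x1be of the dump:
#   00 | 02 03 00 | fb | fe ff ff | 80 00 00 00 | 72 ef bf 5d
part_type=fb
start_lba=$(le32 80 00 00 00)
sector_count=$(le32 72 ef bf 5d)
echo "type=$part_type start=$start_lba sectors=$sector_count"
# prints type=fb start=128 sectors=1572859762
```

The type byte fb and the start sector of 128 agree with what fdisk reported for the VMFS partition earlier.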

With this detailed information it is possible to solve some common security issues with VMware stores like volume deletion and unintentional LUN ID changes.

Recently VMware added a somewhat useful command line tool named vmfs-undelete, which exports metadata to a recovery log file that can be used to restore vmdk block addresses in the event of deletion. It’s a simple tool; at present it’s experimental and unsupported, and it is not available on ESXi. The tool of course demands that you were proactive and ran its backup function in order to use it. Well, I think this falls well short of what we need here. What if you have no previous backups of the VMFS configuration? We really need to know what to look for and how to correct it, and that’s exactly why I created this blog.

The volume deletion event is quite easy to fix, and that’s simply because the VMFS volume header is not actually deleted. The partition block data is what gets trashed and you can just about get away with murder when it comes to recreating that part. The piece we need to fix lies within the first 128 sectors. One method is to create a new store on the same storage volume and then block copy its partition data to a file, which can in turn be block copied to the partition data blocks of the deleted device, and this will fix the issue.

For example, say we create a new VMFS store on the same storage backing with the same LUN size as the original, and it shows up as a LUN with a device name of /dev/sdd. We can use esxcfg-vmhbadevs -m to find it if required.

The deleted device name was /dev/sdc

We use the dd command to do a block copy from the new partition to a file or even directly in this case.

Remember to back it up first!

dd if=/dev/sdc of=/var/log/part-backup-sdc-1st.hex bs=512 count=1

then issue either the direct copy

dd if=/dev/sdd of=/dev/sdc bs=512 count=1

or the copy through an intermediate file

dd if=/dev/sdd of=/var/log/part-backup-sdd.hex bs=512 count=1

dd if=/var/log/part-backup-sdd.hex of=/dev/sdc bs=512 count=1

I personally like using a file to perform this function as it becomes a future backup element which you can move to a safe location. The file can also be edited with other utilities, e.g. hexedit, to provide more flexibility. Additionally you could use fdisk to directly edit the partition table and provide the correct start and end addresses. This is something you should only do if you are well versed in its usage.
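Before pointing dd at a real device it does no harm to rehearse the sequence on scratch files. The sketch below is only a rehearsal under assumed names: new.img and deleted.img are hypothetical stand-ins for /dev/sdd and /dev/sdc, and the marker string is just there to make the copied sector visible.

```shell
work=$(mktemp -d)

# Two 4 KiB scratch images standing in for the devices (hypothetical names).
dd if=/dev/zero of="$work/new.img" bs=512 count=8 2>/dev/null
dd if=/dev/zero of="$work/deleted.img" bs=512 count=8 2>/dev/null
# Mark sector 0 of the "new" image so the copy is visible.
printf 'PARTITION-TABLE' | dd of="$work/new.img" conv=notrunc 2>/dev/null

# 1. Back up the first sector of the damaged image.
dd if="$work/deleted.img" of="$work/deleted-sector0.bak" bs=512 count=1 2>/dev/null
# 2. Save the first sector of the good image to a file.
dd if="$work/new.img" of="$work/new-sector0.bak" bs=512 count=1 2>/dev/null
# 3. Write that file over the first sector of the damaged image.
#    conv=notrunc matters for regular files; without it dd truncates the target.
dd if="$work/new-sector0.bak" of="$work/deleted.img" bs=512 count=1 conv=notrunc 2>/dev/null

# Show the first bytes of the repaired image.
head -c 15 "$work/deleted.img"; echo
```

On a real block device the target is not truncated, so conv=notrunc is only needed for this file-based rehearsal.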

As an additional level of protection we could even include making backups of the vh.sf metadata file and the VMFS header.

cp /vmfs/volumes/49716cd8-ebcbbf9a-6792-000d60d46e2e/.vh.sf /var/log/vh.sf.bu

dd if=/dev/sdc of=/var/log/vmfsheader-bu-sdc.hex bs=512 count=4096

This would grant the ability for support to examine the exact details of the VMFS configuration and potentially allow recovery from more complex issues.

One of the most annoying security events is when a VMFS LUN ID gets changed inadvertently. If a VMFS volume LUN ID changes and the volume is presented to an ESX host, the presented volume will be treated as a potential snapshot LUN. If this event occurs and the ESX server’s advanced LVM parameter settings are at their defaults, the ESX host will not mount the volume. This behaviour is to prevent the possibility of corruption and downing the host, since it cannot determine which VM metadata inventory is correct.

If you are aware that the LUN ID has changed then the best course of action is to re-establish the correct LUN ID at the storage server first and rescan the affected vmhba’s. This is important because if you need to resignature the VMFS volume it will also require that the VM’s be imported back into inventory. Virtual Center logging and other various settings will be lost when this action is performed. This is a result of now having an incorrect UUID between the metadata, mount location and the vmx file UUID value.

If the storage change cannot be reverted back then a VMFS resignature method is the only option for reprovisioning a VMFS volume mount.

This is invoked by setting LVM.DisallowSnapshotLun = 0 and LVM.EnableResignature = 1, and these should be reverted back once the VMFS resignature operation is complete.

Regards,

Mike

January 21st, 2009 | Tags: deleted, esx, esxi, fdisk, guid, header, hex, lun, lvm, partition, path, recover, resignature, sector, Storage, uuid, vml, VMware, volume.
Categories: Security, Storage, VMware

Running ZFS over iSCSI as a VMware vmfs store

The first time I looked at ZFS it totally floored me. This is a file system that has changed storage system rules as we currently know them and continues to do so. It is with no doubt the best architecture to date and now you can use it for your VMware stores.

Previously I had explored using it for a VMware store but ran into many issues which were real show stoppers. Like the VPD page response issue which made VMware see only one usable iSCSI store. But things are soon to be very different when Sun releases the snv_93 or above to all. I am currently using the unreleased snv_93 iscsitgt code and it works with VMware in all the ways you would want. Many thanks to the Sun engineers for adding NAA support on the iSCSI target service. With that being said let me divulge the details and behaviors of the first successful X4500 ZFS iSCSI VMware implementation in the real world.

Let’s look at the architectural view first.

X4500 iSCSI Architecture by Mike La Spina

The architecture uses a best practice approach consisting of completely separated physical networks for the iSCSI storage data plane. All components have redundant power and network connectivity. The iSCSI storage backplane is configured with an aggregate and is VLAN’d off from the server management network. Within the physical HP 2900’s an inter-switch ISL connection is defined but is not critical. This allows for more available data paths if additional interfaces were assigned on the ESX host side.
The Opensolaris aggregate and network components are configured as follows:

For those of you using Indiana: by default nwam is enabled on Indiana, and this needs to be disabled and the default physical network service enabled.

svcadm disable svc:/network/physical:nwam
svcadm enable svc:/network/physical:default

The aggregate is defined using the dladm data link administration utility, but first any bindings need to be cleared by unplumbing the interfaces.

e.g. ifconfig e1000g0 unplumb

Once cleared the assignment of the physical devices is possible using the following commands

dladm create-aggr -d e1000g0 -d e1000g1 -P L2,L3 1
dladm create-aggr -d e1000g2 -d e1000g3 -P L2,L3 2

Here we have set the policy allowing layer 2 and 3 and defined two aggregates, aggr1 and aggr2. We can now define the VLAN based interfaces, shown here as VLAN 500 instances 1 and 2 respective of the aggr instances. You just need to apply the following formula for defining the VLAN interface name.

(Adaptor Name) + vlan * 1000 + (Adaptor Instance)

ifconfig aggr500001 plumb up 10.1.0.1 netmask 255.255.0.0
ifconfig aggr500002 plumb up 10.1.0.2 netmask 255.255.0.0
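The naming formula is easy to get wrong by a zero, so here it is as a tiny shell function. vlan_ifname is a hypothetical helper name of mine; it only builds the name and does not touch the network configuration.

```shell
# Build the VLAN interface name: adaptor name + (VLAN ID * 1000 + instance).
vlan_ifname() {
    echo "$1$(( $2 * 1000 + $3 ))"
}

vlan_ifname aggr 500 1    # prints aggr500001
vlan_ifname aggr 500 2    # prints aggr500002
```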

To persist the network configuration on boot you will need to create hostname files and hosts entries for the services to apply on startup.

echo ss1.iscsi1 > /etc/hostname.aggr500001
echo ss1.iscsi2 > /etc/hostname.aggr500002

Edit /etc/hosts to have the following host entries.

::1 localhost
127.0.0.1 ss1.local localhost loghost
10.0.0.1 ss1 ss1.domain.name
10.1.0.1 ss1.iscsi1
10.1.0.2 ss1.iscsi2

On the HP switches it’s a simple static trunk definition on ports 1 and 2 using the following at the CLI.

trunk 1-2 trk1 trunk

Once all the networking components are up and running and persistent, it’s time to define the ZFS store and iSCSI targets. I chose to include both mirrored and raidz pools. I needed to find and organize the cxtxdx device names using the cfgadm command; you could also issue a format command to see the controller, target and disk names if you’re not using an X4500. I placed the raidz devices across controllers to improve I/O and distribute the load. It would not be prudent to place one array on a single SATA controller. So here is what it ends up looking like from the ZFS command view.

zpool create -f rp1 raidz1 c4t0d0 c4t6d0 c5t4d0 c8t2d0 c9t1d0 c10t1d0
zpool add rp1 raidz1 c4t1d0 c4t7d0 c5t5d0 c8t3d0 c9t2d0 c10t2d0
zpool add rp1 raidz1 c4t2d0 c5t0d0 c5t6d0 c8t4d0 c9t3d0 c10t3d0
zpool add rp1 raidz1 c4t3d0 c5t1d0 c5t7d0 c8t5d0 c9t5d0 c11t0d0
zpool add rp1 raidz1 c4t4d0 c5t2d0 c8t0d0 c8t6d0 c9t6d0 c11t1d0
zpool add rp1 raidz1 c4t5d0 c5t3d0 c8t1d0 c8t7d0 c10t0d0 c11t2d0
zpool add rp1 spare c11t3d0
zpool create -f mp1 mirror c10t4d0 c11t4d0
zpool add mp1 mirror c10t5d0 c11t5d0
zpool add mp1 mirror c10t6d0 c11t6d0
zpool add mp1 spare c9t7d0

It only takes seconds to create terabytes of storage; wow, it truly is a thing of beauty (geek!). Now it’s time to define a few ZFS file systems and volumes in preparation for the creation of the iSCSI targets. I chose to create units of 750G since VMware would not perform well with much more than that. This is somewhat dependent on the size of the VMs and type of I/O, but generally an ESX host will serve a wide mix, so I try to keep it to a reasonable size or it ends up with SCSI reservation issues (that’s a bad thing chief).

You must also consider the I/O block size before creating a ZFS store; this is not something that can be changed later, so now is the time. It’s done by adding -b 64K to the zfs create command. I chose to use 64K for the block size, which aligns with the VMware default allocation size, thus optimizing performance. The -s option enables the sparse volume feature, aka thin provisioning. In this case the space was available, but it is my favorite way to allocate storage.

zfs create rp1/iscsi
zfs create -s -b 64K -V 750G rp1/iscsi/lun0
zfs create -s -b 64K -V 750G rp1/iscsi/lun1
zfs create -s -b 64K -V 750G rp1/iscsi/lun2
zfs create -s -b 64K -V 750G rp1/iscsi/lun3
zfs create mp1/iscsi
zfs create -s -b 64K -V 750G mp1/iscsi/lun0
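
Because sparse volumes reserve nothing up front, it helps to keep a running total of what has been promised to the initiators. The sketch below is an illustration only: provisioned_gib is a made-up helper name, and it only understands sizes written with a G or T suffix, treated as powers of two the way ZFS does.

```shell
#!/bin/sh
# Hypothetical helper: add up -V sizes (G or T suffix only) and print GiB.

provisioned_gib() {
    total=0
    for size in "$@"; do
        case $size in
            *G) n=${size%G} ;;
            *T) n=$(( ${size%T} * 1024 )) ;;
            *)  echo "unsupported size: $size" >&2; return 1 ;;
        esac
        total=$((total + n))
    done
    echo "$total"
}

# The four 750G sparse volumes created on rp1 above
provisioned_gib 750G 750G 750G 750G
```

The result can be compared with the capacity that zpool list reports for the pool. To confirm what was actually set on a volume, zfs get volblocksize,volsize rp1/iscsi/lun0 should show the values.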

Originally I wanted to build the ESX hosts using a local disk, but thanks to some bad IBM x346 engineering I could not use the QLA4050C alongside the integrated Adaptec controller on the ESX host server hardware. So I decided to give boot from iSCSI a go, and here is the boot LUN definition that I used for it. The original architectural design called for local disk to prevent an ESX host failure in the event of an iSCSI path outage.

zfs create rp1/iscsi/boot
zfs create -s -V 16G rp1/iscsi/boot/esx1

Now that the ZFS stores are complete we can create the iSCSI targets for the ESX hosts to use. I have named the target aliases to reflect something about the storage system, which makes them easier to work with. I also created an iSCSI configuration store so the iSCSI targets persist across reboots. (This may now be included with OpenSolaris Indiana, but I have not tested it.)

mkdir /etc/iscsi/config
iscsitadm modify admin --base-directory /etc/iscsi/config
iscsitadm create target -u 0 -b /dev/zvol/rdsk/rp1/iscsi/lun0 ss1-zrp1
iscsitadm create target -u 1 -b /dev/zvol/rdsk/rp1/iscsi/lun1 ss1-zrp1
iscsitadm create target -u 2 -b /dev/zvol/rdsk/rp1/iscsi/lun2 ss1-zrp1
iscsitadm create target -u 3 -b /dev/zvol/rdsk/rp1/iscsi/lun3 ss1-zrp1
iscsitadm create target -b /dev/zvol/rdsk/mp1/iscsi/lun0 ss1-zmp1
iscsitadm create target -b /dev/zvol/rdsk/rp1/iscsi/boot/esx1 ss1-esx1-boot
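
With more than a handful of LUNs, the target definitions can be derived from the volume names instead of being typed by hand. This is a sketch under stated assumptions: emit_target_cmds is a made-up helper, it relies on the lunN naming convention used above, and it only prints the commands so they can be reviewed before being run.

```shell
#!/bin/sh
# Hypothetical helper: read zvol names on stdin and print (not run) one
# "iscsitadm create target" line per volume named .../lunN, using N as the LUN.

emit_target_cmds() {
    tgt=$1
    while read -r vol; do
        case $vol in
            */lun[0-9]*) ;;        # only volumes that follow the lunN convention
            *) continue ;;
        esac
        lun=${vol##*/lun}          # rp1/iscsi/lun3 -> 3
        echo "iscsitadm create target -u $lun -b /dev/zvol/rdsk/$vol $tgt"
    done
}

# On the storage host the names could come from:
#   zfs list -H -o name -t volume -r rp1/iscsi
printf '%s\n' rp1/iscsi/lun0 rp1/iscsi/lun1 rp1/iscsi/boot/esx1 |
    emit_target_cmds ss1-zrp1
```

The boot volume is skipped here because it does not match the lunN pattern, so it would still get its own target as shown above.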

Most blog examples of enabling targets show the ZFS command line method, shareiscsi=on. This works well for a new iqn, but if you want to allocate additional LUNs under that iqn then you need to use this -b backing store method.

Now that we have some targets you should be able to list them using:

iscsitadm list target

Notice that we only see one iqn for ss1-zrp1. You can use the -v option to show all the LUNs if required.

Target: ss1-zrp1
iSCSI Name: iqn.1986-03.com.sun:02:eb9c3683-9b2d-ccf4-8ae0-85c7432f3ef6.ss1-zrp1
Connections: 2
Target: ss1-zmp1
iSCSI Name: iqn.1986-03.com.sun:02:36fd5688-7521-42bc-b65e-9f777e8bfbe6.ss1-zmp1
Connections: 2
Target: ss1-esx1-boot
iSCSI Name: iqn.1986-03.com.sun:02:d1ecaed7-459a-e4b1-a875-b4d5df72de40.ss1-esx1-boot
Connections: 2
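
Since the iqn strings are long and have to be retyped elsewhere, it can help to pull a single one out of the listing by its alias. The sketch below assumes the output layout shown above; target_iqn is a made-up helper name, not an iscsitadm feature.

```shell
#!/bin/sh
# Hypothetical helper: print the iSCSI name for one target alias.
# Reads "iscsitadm list target" style output on stdin.

target_iqn() {
    awk -v name="$1" '
        $1 == "Target:" { current = $2 }
        $1 == "iSCSI" && $2 == "Name:" && current == name { print $3 }
    '
}

# Sample input copied from the listing above. On the storage host use:
#   iscsitadm list target | target_iqn ss1-esx1-boot
target_iqn ss1-esx1-boot <<'EOF'
Target: ss1-zmp1
iSCSI Name: iqn.1986-03.com.sun:02:36fd5688-7521-42bc-b65e-9f777e8bfbe6.ss1-zmp1
Connections: 2
Target: ss1-esx1-boot
iSCSI Name: iqn.1986-03.com.sun:02:d1ecaed7-459a-e4b1-a875-b4d5df72de40.ss1-esx1-boot
Connections: 2
EOF
```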

It would be prudent to create some target initiator entries to control which initiator iqns are authorized to connect to a particular target.
This is an important step. It makes it possible to use CHAP, or at least to allow only named iqns to connect to that target. iSNS also provides a similar service.

iscsitadm create initiator --iqn iqn.2000-04.com.qlogic:qla4050c.esx1.1 esx1.1
iscsitadm create initiator --iqn iqn.2000-04.com.qlogic:qla4050c.esx1.2 esx1.2

Now we can assign these initiators to a target, and the target will then only accept those initiators. You can also add CHAP authentication, but that’s beyond the scope of this blog.

iscsitadm modify target --acl esx1.1 ss1-esx1-boot
iscsitadm modify target --acl esx1.2 ss1-esx1-boot
iscsitadm modify target --acl esx1.1 ss1-zrp1
iscsitadm modify target --acl esx1.2 ss1-zrp1
iscsitadm modify target --acl esx1.1 ss1-zmp1
iscsitadm modify target --acl esx1.2 ss1-zmp1
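
The ACL list grows as initiators times targets, so with more ESX hosts a small loop keeps it consistent. This is a sketch only: emit_acl_cmds is a made-up helper that prints the commands for review and does not run them.

```shell
#!/bin/sh
# Hypothetical helper: print one ACL command per initiator/target pair.

emit_acl_cmds() {
    initiators=$1    # space-separated initiator aliases
    targets=$2       # space-separated target aliases
    for tgt in $targets; do
        for ini in $initiators; do
            echo "iscsitadm modify target --acl $ini $tgt"
        done
    done
}

emit_acl_cmds "esx1.1 esx1.2" "ss1-esx1-boot ss1-zrp1 ss1-zmp1"
```

Once the output looks right it can be piped to sh on the storage host.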

In order to boot from the target LUN we need to configure the QLA4050C boot feature. You must do this from the ESX host using the Ctrl-Q sequence during the boot cycle. It is simply a matter of entering the primary boot target IP, setting the mode to manual and entering the iqn exactly as it was listed by the iscsitadm list target command, e.g.

iqn.1986-03.com.sun:02:d1ecaed7-459a-e4b1-a875-b4d5df72de40.ss1-esx1-boot
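
Since the iqn has to be typed by hand into the adapter BIOS, a quick sanity check on the string you wrote down can catch a dropped digit or a stray space. The sketch below is a loose check of the general iqn.YYYY-MM.reversed-domain:suffix shape, not a full validator, and is_valid_iqn is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: loose shape check for an iqn-format iSCSI name.
# Succeeds for iqn.YYYY-MM.<reversed domain>[:<suffix without whitespace>]

is_valid_iqn() {
    printf '%s\n' "$1" |
        grep -Eq '^iqn\.[0-9]{4}-(0[1-9]|1[0-2])\.[a-z0-9]([a-z0-9.-]*[a-z0-9])?(:[^[:space:]]+)?$'
}

iqn="iqn.1986-03.com.sun:02:d1ecaed7-459a-e4b1-a875-b4d5df72de40.ss1-esx1-boot"
if is_valid_iqn "$iqn"; then
    echo "looks like an iqn: $iqn"
else
    echo "check the string: $iqn" >&2
fi
```

It only checks the shape, so a well-formed but mistyped iqn will still pass. Comparing it against the iscsitadm list target output remains the real test.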

Once the iqn is entered the ESX host software can be installed and configured.
Till next time….

July 10th, 2008 | Tags: aggregate, iscsi, opensolaris, san, Storage, vmfs, VMware, Volumes, zfs.
Categories: Storage, VMware | Comments: 16 Comments |
