Understanding VMFS volumes
Understanding VMFS volumes is an important element within VMware ESX environments. When storage issues surface we need to correctly evaluate the VMFS volume states and apply the appropriate corrective actions to remediate undesirable storage events. The VMFS architecture is not publicly available, which certainly adds to the challenge when we need to correct a volume configuration or change issue. So let's begin to look at the components of a VMFS from what I have been able to decrypt using direct analysis.
All VMFS volume partitions will have a partition ID value of fb. Running fdisk can identify any partitions that are flagged as VMFS as shown here.
[root@vh1 ]# fdisk -lu /dev/sdc
Disk /dev/sdc: 274.8 GB, 274877889536 bytes
255 heads, 63 sectors/track, 33418 cylinders, total 536870878 sectors
Units = sectors of 1 * 512 = 512 bytes
Device Boot Start End Blocks Id System
/dev/sdc1 128 536860169 268430021 fb Unknown
What's important to note here is the 512-byte sector size and the partition's start and end values.
Many VMFS volume configuration elements are visible in the /vmfs mount folder. Within that directory there are two subdirectories: the volumes directory, which provisions the mount point locations, and the devices directory, which holds configuration elements. The devices directory in turn contains several subdirectories, of which I can explain the disks and lvm folders; the others are known to me only in theory.
A key part of a VMFS volume is its UUID (Universally Unique Identifier), and as the name suggests it is used to ensure uniqueness when more than one volume is in use. The UUID is generated, following the UUID creation standards, on the ESX host that initially created the VMFS volume. You can determine which ESX host that was by referring to the last 6 bytes of the UUID. This value is the same as the last six bytes of the ESX host's system UUID found in the /etc/vmware/esx.conf file.
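For example, a quick check from the service console will show the comparison; this is a hedged sketch, as the key appears as /system/uuid on my hosts but the exact name may vary by release.
grep uuid /etc/vmware/esx.conf
ls -l /vmfs/volumes
Compare the last six bytes of each volume UUID in the listing with the last six bytes of the system UUID reported by the first command; a match means the volume originated on that host.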
By far one of the most critical elements on a VMFS volume is the GUID. The GUID is integral within the volume because it is used to form the vml path (aka virtual multipath link). The GUID is stored within the VMFS volume header and begins at address 0x10002E.
The format of the GUID can vary based on different implementations of SCSI transport protocols, but generally you will see some obvious length variances in the vml path identifiers, which stem from the use of T11 and T10 standard SCSI address formats like EUI-64 and NAA 64. Regardless of those variables, there are components outside of the GUID within the vml path that we should take note of. The vml construct contains references to the LUN and partition values, and these are useful to know about. The following illustrates where these elements appear in some real examples.
When we issue an ls -l from the /vmfs/devices/disks directory the following info is observed.
vmhba#:Target:LUN:Partition -> vml.<??>_<LUN ID>_<??>_<GUID>:<Partition>
vmhba0:1:0:0 -> vml.02000000005005076719d163d844544e313436
vmhba0:1:0:1 -> vml.02000000005005076719d163d844544e313436:1
vmhba32:1:3:0 -> vml.0200030000600144f07ed404000000496ff8cd0003434f4d535441
vmhba32:1:3:1 -> vml.0200030000600144f07ed404000000496ff8cd0003434f4d535441:1
As well, issuing ls -l on /vmfs/volumes lists the VMFS UUIDs and the link names, which are what we see displayed in the GUI client. In this example we will follow the UUID below and the volume named ss2-cstar-zs0.2.
ss2-cstar-zs0.2 -> 49716cd8-ebcbbf9a-6792-000d60d46e2e
Additionally we can use esxcfg-vmhbadevs -m to list the vmhba, device and UUID associations.
[root@vh1 ]# esxcfg-vmhbadevs -m
vmhba0:1:0:1 /dev/sdd1 48a3b0f3-736b896e-af8f-00025567144e
vmhba32:1:3:1 /dev/sdf1 49716cd8-ebcbbf9a-6792-000d60d46e2e
As you can see, we indeed have different GUID lengths in this example. We can also see that the vmhba device is linked to a vml construct, and this is how the kernel defines paths to a visible SCSI LUN. The vml path hosts the LUN ID, GUID and partition number information, and this is also stored in the volume's VMFS header. The header also contains a UUID signature, but this is not the VMFS UUID.
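As a hedged aside, based purely on the example names above rather than any published layout, a few lines of bash can pull these pieces out of a vml name:
vml=vml.0200030000600144f07ed404000000496ff8cd0003434f4d535441:1
body=${vml#vml.}
part=${body##*:}                   # a trailing :n is the partition number
[ "$part" = "$body" ] && part=0    # no suffix means the whole device
body=${body%%:*}
echo "LUN  $((0x${body:4:2}))"     # the LUN ID appears to sit in hex chars 4-5
echo "GUID ${body:10}"             # the GUID appears to start at char 10
echo "PART $part"
Run against the examples above, this reports LUN 3, the long NAA-style GUID and partition 1; the field offsets are inferred from observation only.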
If we use hexdump as illustrated below we can see these elements in the VMFS header directly.
[root@vh1 root]# hexdump -C -s 0x100000 -n 800 /dev/sdf1
00100000 0d d0 01 c0 03 00 00 00 10 00 00 00 02 16 03 00 | | <- LUN ID
00100010 00 06 53 55 4e 20 20 20 20 20 43 4f 4d 53 54 41 | SUN COMSTA| <- Target Label
00100020 52 20 20 20 20 20 20 20 20 20 31 2e 30 20 60 01 |R 1.0 ` | <- LUN GUID
00100030 44 f0 7e d4 04 00 00 00 49 6f f8 cd 00 03 43 4f |D ~ Io CO|
00100040 4d 53 54 41 00 00 00 00 00 00 00 00 00 00 00 00 |MSTA |
00100050 00 00 00 00 00 00 00 00 00 00 00 02 00 00 00 fc | | <- Volume Size
00100060 e9 ff 18 00 00 00 01 00 00 00 8f 01 00 00 8e 01 | |
00100070 00 00 91 01 00 00 00 00 00 00 00 00 10 01 00 00 | |
00100080 00 00 d8 6c 71 49 b0 aa 97 9b 6c 2f 00 0d 60 d4 | lqI l/ ` |
00100090 6e 2e 6e 89 19 fb a6 60 04 00 a7 ce 20 fb a6 60 |n n ` `|
001000a0 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | |
001000b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | |
*
00100200 00 00 00 f0 18 00 00 00 90 01 00 00 00 00 00 00 | |
00100210 01 00 00 00 34 39 37 31 36 63 64 38 2d 36 30 37 | 49716cd8-607| <- SEG UUID in ASCII
00100220 35 38 39 39 61 2d 61 64 31 63 2d 30 30 30 64 36 |5899a-ad1c-000d6|
00100230 30 64 34 36 65 32 65 00 00 00 00 00 00 00 00 00 |0d46e2e |
00100240 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | |
00100250 00 00 00 00 d8 6c 71 49 9a 89 75 60 1c ad 00 0d | lqI u` | <- SEG UUID
00100260 60 d4 6e 2e 01 00 00 00 e1 9c 19 fb a6 60 04 00 |` n ` |
00100270 00 00 00 00 8f 01 00 00 00 00 00 00 00 00 00 00 | |
00100280 8e 01 00 00 00 00 00 00 64 cc 20 fb a6 60 04 00 | d ` |
00100290 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | |
001002a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | |
[root@vh1 ]# hexdump -C -s 0x200000 -n 256 /vmfs/volumes/49716cd8-ebcbbf9a-6792-000d60d46e2e/.vh.sf
00200000 5e f1 ab 2f 04 00 00 00 1f d8 6c 71 49 9a bf cb |^ / lqI | <- VMFS UUID
00200010 eb 92 67 00 0d 60 d4 6e 2e 02 00 00 00 73 73 32 | g ` n ss2| <- Volume Name
00200020 2d 63 73 74 61 72 2d 7a 73 30 2e 32 00 00 00 00 |-cstar-zs0.2 |
00200030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | |
*
00200090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02 00 | |
002000a0 00 00 00 10 00 00 00 00 00 d8 6c 71 49 01 00 00 | lqI |
002000b0 00 d8 6c 71 49 9a 89 75 60 1c ad 00 0d 60 d4 6e | lqI u` ` n| <- SEG UUID
002000c0 2e 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | |
002000d0 00 00 00 01 00 20 00 00 00 00 00 01 00 00 00 00 | |
002000e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | |
And of course we can not leave out the partition entry block data for the device.
hexdump -C -n 256 /dev/sdf1
00000000 fa b8 00 10 8e d0 bc 00 b0 b8 00 00 8e d8 8e c0 | |
00000010 fb be 00 7c bf 00 06 b9 00 02 f3 a4 ea 21 06 00 | | ! |
00000020 00 be be 07 38 04 75 0b 83 c6 10 81 fe fe 07 75 | 8 u u|
00000030 f3 eb 16 b4 02 b0 01 bb 00 7c b2 80 8a 74 01 8b | | t |
00000040 4c 02 cd 13 ea 00 7c 00 00 eb fe 00 00 00 00 00 |L | |
00000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | |
*
000001b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02 | |
000001c0 03 00 fb fe ff ff 80 00 00 00 72 ef bf 5d 00 00 | r ] | Type Start End
000001d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | |
With this detailed information it is possible to solve some common security issues with VMware stores like volume deletion and unintentional LUN ID changes.
Recently VMware added a somewhat useful command line tool named vmfs-undelete, which exports metadata to a recovery log file that can restore vmdk block addresses in the event of deletion. It's a simple tool; at present it's experimental, unsupported and not available on ESXi. The tool of course demands that you were proactive and ran its backup function before you needed it. I think this falls well short of what we need here. What if you have no previous backups of the VMFS configuration? We really need to know what to look for and how to correct it, and that's exactly why I created this blog.
The volume deletion event is quite easy to fix, and that's simply because the VMFS volume header is not actually deleted. The partition block data is what gets trashed, and you can just about get away with murder when it comes to recreating that part. The piece we need to fix lives within the first 128 sectors. One method is to create a store on the same storage volume and then block copy the partition data to a file, which can in turn be block copied over the deleted device's partition data blocks, and this will fix the issue.
For example, we create a new VMFS store on the same storage backing with the same LUN size as the original. It shows up as a LUN with a device name of /dev/sdd; we can use esxcfg-vmhbadevs -m to find it if required.
The deleted device name was /dev/sdc
We use the dd command to do a block copy from the new partition to a file or even directly in this case.
Remember to back it up first!
dd if=/dev/sdc of=/var/log/part-backup-sdc-1st.hex bs=512 count=1
then issue
dd if=/dev/sdd of=/dev/sdc bs=512 count=1
or
dd if=/dev/sdd of=/var/log/part-backup-sdd.hex bs=512 count=1
dd if=/var/log/part-backup-sdd.hex of=/dev/sdc bs=512 count=1
I personally like using a file to perform this function, as the file becomes a future backup element which you can move to a safe location. The file can also be edited with other utilities, e.g. hexedit, to provide more flexibility. Additionally you could use fdisk to directly edit the partition table and provide the correct start and end addresses. This is something you should only do if you are well versed in its usage.
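If you go the fdisk route, the interactive session looks roughly like the sketch below; the start sector of 128 and the fb type come from the fdisk -lu output shown earlier, and the end sector must be the value originally reported for your own volume.
fdisk -u /dev/sdc
n      <- new primary partition, number 1
128    <- first sector, matching the original layout
<end>  <- last sector as originally reported by fdisk -lu
t      <- change the partition type
fb     <- the VMFS partition ID
w      <- write the table and exit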
As an additional level of protection we could even include making backups of the .vh.sf metadata file and the VMFS header.
cp /vmfs/volumes/49716cd8-ebcbbf9a-6792-000d60d46e2e/.vh.sf /var/log/vh.sf.bu
dd if=/dev/sdc of=/var/log/vmfsheader-bu-sdc.hex bs=512 count=4096
This would grant the ability for support to examine the exact details of the VMFS configuration and potentially allow recovery from more complex issues.
One of the most annoying security events is when a VMFS LUN ID gets changed inadvertently. If a VMFS volume's LUN ID changes and the volume is presented to an ESX host, the presented volume will be treated as a potential snapshot LUN. If this occurs and the ESX server's advanced LVM parameter settings are at their defaults, the ESX host will not mount the volume. This behaviour prevents possible corruption and downing the host, since it cannot determine which VM metadata inventory is correct.
If you are aware that the LUN ID has changed, the best course of action is to re-establish the correct LUN ID at the storage server first and rescan the affected vmhbas. This is important because if you instead need to resignature the VMFS volume it will also require that the VMs be imported back into inventory. Virtual Center logging and various other settings will be lost when this action is performed, because the UUID will no longer match between the metadata, the mount location and the UUID value in the vmx files.
If the storage change cannot be reverted back then a VMFS resignature method is the only option for reprovisioning a VMFS volume mount.
This is invoked by setting LVM.DisallowSnapshotLun = 0 and LVM.EnableResignature = 1, and these should be reverted back once the VMFS resignature operation is complete.
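From the service console the same settings can be toggled with esxcfg-advcfg; the following is a hedged sketch (verify the exact option names on your build with the -g switch before relying on it), followed by a rescan and the revert back to the defaults.
esxcfg-advcfg -s 0 /LVM/DisallowSnapshotLun
esxcfg-advcfg -s 1 /LVM/EnableResignature
esxcfg-rescan vmhba32
esxcfg-advcfg -s 1 /LVM/DisallowSnapshotLun
esxcfg-advcfg -s 0 /LVM/EnableResignature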
Regards,
Mike
Site Contents: © 2009 Mike La Spina
ZFS Snapshot Rollup Bash Script
As a follow-on to my blog entry Provisioning Disaster Recovery with ZFS, iSCSI and VMware, I created this snapshot rollup script to help maintain the growing snapshots and minimize disk consumption. The script is an add-on to the zfsadm account cron jobs and runs under the security privileges of the zfsadm user detailed in that blog. An input text file is used to specify which ZFS paths will be rolled up to a Grandfather-Father-Son backup scheme. All out-of-scope snapshots are destroyed, leaving the current day's and week's snapshots, the Friday weekly snapshots of the current month, each month's end and, in time, the year-end snapshots. The cron job needs to run at minimum on the target host, but it would be prudent to run it on both systems. The script is aware of the possibility that a snapshot may be cloned and will detect and log it. To add the job is simply a matter of adding it to the zfsadm user's crontab.
# crontab -e zfsadm
0 3 * * * ./zfsgfsrollup.sh zfsrollup.lst
Hint: crontab uses vi; see the vi cheat sheet at http://www.kcomputing.com/kcvi.pdf
The key sequence would be: hit "i", key in the line, then hit "esc :wq" to save; to abort, hit "esc :q!".
The job detailed here will run once a day at 3:00 AM, which may need to be extended if you have a very slow link between the servers. If you intend to use this script as shown you should follow the additional details for adding a cron job found in the original blog; items like the time zone setting are discussed there.
As well, the script expects the GNU-based versions of date and expr.
Here are the two files that are required
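For illustration only, and purely as an assumption about its shape, the input list is nothing more than the ZFS paths to roll up, one per line, for example:
sp1/iscsi/lun0
sp1/iscsi/lun1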
Hopefully you will find it to be useful.
Regards,
Mike
Site Contents: © 2008 Mike La Spina
A centrally based method for patching ESX3 VMWare Servers
I have updated my ESX servers manually many times and I find the process, to say the least, "annoying", so I decided to change it to an HTTP-based method with a modified patch configuration. I found that it really works well.
I did some searching prior to settling on this method and found this blog, which is quite good.
http://virtrix.blogspot.com/2007/03/vmware-autopatching-your-esx-host.html
I felt it had some issues, though, and I wanted to avoid running custom scripts on the server side. So here is what I built for my patch management solution.
Using the standard tools on the ESX3 server of cron jobs and esxupdate I created the following on my servers.
Define a new cron entry for running esxupdate every first Sunday of the month.
I chose to create a separate entry avoiding the esx installed ones for safety reasons.
Edit the file /etc/crontab and add the last line shown below (the esxupdate entry).
[root@yourserver /]nano /etc/crontab
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
HOME=/
# run-parts
01 * * * * root run-parts /etc/cron.hourly
02 4 * * * root run-parts /etc/cron.daily
22 4 * * 0 root run-parts /etc/cron.weekly
42 4 1 * * root run-parts /etc/cron.monthly
30 3 1 * 0 root run-parts /etc/cron.esxupdate
Create a new directory for the esxupdate cron run part as follows.
[root@yourserver /]mkdir /etc/cron.esxupdate
Create a new command file for the esxupdate cron run part and add the text lines as follows.
[root@yourserver /]nano /etc/cron.esxupdate/esxpatch
esxupdate -r http://yourhttpserver.fq.dns:8088/esx3.all update
Set the file as executable using chmod
chmod 755 /etc/cron.esxupdate/esxpatch
Allow outgoing traffic from the server using the esxcfg-firewall script as follows.
[root@yourserver /]esxcfg-firewall --allowOutgoing
Create an http server to host the patch repository.
I built an IIS server on my Virtual Center 3 server on port 8088 as follows
Install IIS (Be sure to secure it).
Create a patch folder named C:\VMWarePatches and set your IIS home directory as follows.
You will need to add the following content types from a right mouse click -> Properties of the IIS server definition within the Computer Management MMC, as shown.
Content types .hdr and .info need to be present.
Download the patches you require and extract them to c:\vmwarepatches in the usual manner for repo-based deployment. For example I downloaded the available esx-upgrade-from-esx3-3.0.2-61618.tar.gz upgrade file and extracted it to c:\vmwarepatches, which leaves a directory named 61818 in the http distro folder.
Rename this directory to esx3.all
Now this is where things get interesting, as we are going to customize this patch package to include the latest additional patches that are available.
VMware is using the installation and update functionality of the Red Hat rpm, yum and esxupdate scripts, which means we can modify the repo source by conforming to the existing configuration rules. The repo configuration is based on xml tags, of which I modified three tag element types.
1) info
The descriptor xml file, found in the root of the update package, contains the descriptions of the various patches in the package and looks like the following.
<descriptor version="1.0">
<vendor>VMware, Inc.</vendor>
<product>VMware ESX Server</product>
<release>3.0.2-62488</release>
<releasedate>Fri Nov 2 05:01:31 PDT 2007</releasedate>
<summary>3.0.2 Update 1 of VMware ESX Server</summary>
<description>This is 3.0.2 Update 1 rollup to Nov 2nd,2007 of VMware ESX Server.
It contains:
3.0.2-52542 Full 3.0.2 release of VMware ESX Server
ESX-1001724 Security bugs fixed in vmx rpm.
ESX-1001735 To update tzdata rpm.
…
ESX-1002424 VMotion RARP broadcast to multiple vmnic.
ESX-1002425 VMware-hostd-esx 3.0.2-62488
ESX-1002429 Path failback issue with EMC iSCSI array.
</description>
This xml file is used by the esxupdate script and can be modified to support additional patches. The parts we need to add are some descriptor text element data so that we know what it covers. I added ESX-1002424, 25 and 29 to the descriptor tag and edited the heading's date to reflect its current value.
2) rpms
The second element type I changed was the rpmlist tag. It looks like the following, which is only a partial listing.
<rpmlist>
<rpm arch="i386" rel="8.37.15" ver="2.4">kernel-utils</rpm>
<rpm arch="i686" rel="47.0.1.EL.62488" ver="2.4.21">kernel-vmnix</rpm>
<rpm arch="i386" rel="66" ver="1.2.7">krb5-libs</rpm>
<rpm arch="i386" rel="11" ver="1.1.1">krbafs</rpm>
<rpm arch="i386" rel="3vmw" ver="1.1.22.15">kudzu</rpm>
<rpm arch="i386" rel="70RHEL3" ver="0.1">laus-libs</rpm>
<rpm arch="i386" rel="12" ver="378">less</rpm>
<rpm arch="i386" rel="52542" ver="3.0.2">VMware-esx-perftools</rpm>
<rpm arch="i386" rel="52542" ver="3.0.2">VMware-esx-uwlibs</rpm>
<rpm arch="i386" rel="62488" ver="3.0.2">VMware-esx-vmx</rpm>
<rpm arch="i386" rel="61818" ver="3.0.2">VMware-esx-vmkernel</rpm>
<rpm arch="i386" rel="52542" ver="3.0.2">VMware-esx-vmkctl</rpm>
<rpm arch="i386" rel="55869" ver="3.0.2">VMware-esx-tools</rpm>
<rpm arch="i386" rel="52542" ver="3.0.2">VMware-esx-srvrmgmt</rpm>
<rpm arch="i386" rel="52542" ver="3.0.2">VMware-esx-scripts</rpm>
</rpmlist>
When you extract a newer esx3 patch it will contain its own descriptor, rpmlist tags and so on, and you can update this major upgrade list to include those new elements. You need to replace the old tag info if it exists, which it usually will. For example the entry above for VMware-esx-vmkernel needs to be completely updated by removing the old tag values like rel="61818" and inserting the new value of rel="62488". I simply searched for the tag name VMware-esx-vmkernel, deleted the line and inserted the updated one. I will eventually write a script to do the config edits, but for now these hacks will do the trick.
3) nodeps
The third element I changed will not normally be necessary, but in this case there is a bug in this upgrade package that did not account for the rpm option value of -U, which means upgrade an rpm if present. The descriptor included a tag named nodeps, which I don't think works well when an rpm of the same version already exists, so we need to remove the tag altogether for previous base installs above version ESX 3.0.0.
I deleted these tags for my config.
<nodeps>
<rpm>kbd</rpm>
<rpm>nfs-utils</rpm>
</nodeps>
In addition to the descriptor xml file edits, we will need to copy the new rpm files into the esx3.all folder, remove the old ones, edit the header.info file found in the headers subfolder, copy the new hdr files to the headers folder and remove the old ones from the headers folder.
For example in adding ESX-1002424 I copied
From c:\vmwarepatches\esx-1002424
Files
VMware-esx-apps-3.0.2-62488.i386.rpm
VMware-esx-vmkernel-3.0.2-62488.i386.rpm
VMware-esx-vmx-3.0.2-62488.i386.rpm
To c:\vmwarepatches\esx3.all
From c:\vmwarepatches\esx-1002424\headers
Files
VMware-esx-apps-0-3.0.2-62488.i386.hdr
VMware-esx-vmkernel-0-3.0.2-62488.i386.hdr
VMware-esx-vmx-0-3.0.2-62488.i386.hdr
To c:\vmwarepatches\esx3.all\headers
You should delete any older file versions matching the prefix portions of the newer files to avoid confusion. It will still function if you leave the old ones, but it's not a best practice.
The last step is to edit the header.info file: search for the text lines with the rpm file prefix and replace the info lines with the newer ones found in the updated patch's header.info file.
For example search for text VMware-esx-vmx in the master header.info file which looks like the following partial example.
0:openssh-3.6.1p2-33.30.13vmw.i386=openssh-3.6.1p2-33.30.13vmw.i386.rpm
0:net-snmp-libs-5.0.9-2.30E.20.i386=net-snmp-libs-5.0.9-2.30E.20.i386.rpm
0:libtermcap-2.0.8-35.i386=libtermcap-2.0.8-35.i386.rpm
0:VMware-esx-vmx-3.0.2-62488.i386=VMware-esx-vmx-3.0.2-62488.i386.rpm
0:acl-2.2.3-1.i386=acl-2.2.3-1.i386.rpm
0:dev-3.3.12.3-1.i386=dev-3.3.12.3-1.i386.rpm
0:ethtool-1.8-3.3.i386=ethtool-1.8-3.3.i386.rpm
Replace it completely with the newer header.info text entry.
I tested my configuration on several base install versions and it was nice to see that on an ESX 3.0.2-61818 base it only downloaded and installed the additional patches that were added to the master repo esx3.all.
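If you want to validate the repo before trusting the cron job, a hedged approach is to run the same esxupdate command by hand on one host and then list what ended up installed; esxupdate query reports the installed bundles on an ESX 3.0.x service console.
esxupdate -r http://yourhttpserver.fq.dns:8088/esx3.all update
esxupdate query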
You should keep in mind that, based on the cron schedule, your VMs may need to go down on that host; you must plan those outage times and gracefully shut down where applicable.
Till next time … happy esx patching.
Regards,
Mike
Site Contents: © 2008 Mike La Spina
VMware image customization in progress issue
I just came across this very annoying VMware guest customization artifact. I neglected to update the sysprep directories at
C:\<ALLUSERSPROFILE>\Application Data\VMware\VMware VirtualCenter\sysprep
with a current version, and it resulted in a failed image customization that would not go away. On each system restart a boot-time run of sysprepDecryptor.exe would occur, causing a continual loop of execution on every boot.
To correct the issue I removed the sysprepDecryptor.exe registry entry from the Session Manager key, rebooted, and it ran for the last time. I manually ran sysprep afterwards to ensure that the SID was unique.
Here is the location to delete the sysprepDecryptor.exe registry entry.
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager
Value name = BootExecute
Default Value = autocheck autochk *
Site Contents: © 2008 Mike La Spina
Provisioning Disaster Recovery with ZFS, iSCSI and VMware
OpenSolaris, ZFS, iSCSI and VMware are a great combination for provisioning Disaster Recovery (DR) systems at exceptionally low cost. There are some fundamentally well suited features of ZFS and VMFS volumes that provide a relatively simple and very efficient recovery process for VMware-hosted, non-zero RPO, crash-consistent recovery environments. In this weblog I will demonstrate this capability and provide some step-by-step how-tos for replicating a ZFS, iSCSI and VMFS VMware based environment securely over a WAN, or whatever you may have, to a single remote ESXi server hosting a child OpenSolaris VM which provisions ZFS and iSCSI VMFS LUNs back to the parent ESXi host. The concept is to wrap the DR services into a single low-cost, self-contained DR box that can be expanded out in the event of an actual DR incident, while allowing for regular testing and validation processing without the high costs normally associated with standby DR systems. As one would expect, this method becomes a very appealing solution for small to medium businesses who would normally abstain from DR provisioning activity due to the inherently high cost and complexity of DR.
The following diagram illustrates this DR system architecture.
When we have VMFS volumes backed by iSCSI-based ZFS targets we gain the powerful replication capability of the ZFS send and receive commands. This ZFS feature procures the ability to send an entire VMFS volume by way of a raw iSCSI target ZFS backing store. Once sent initially, we can base all subsequent sends on a delta of change from a previous send snapshot; these are respectively referred to as snapshot deltas. Thus if we initially snapshot an iSCSI backing store and send the stream to a remote ZFS file system, we can then send all the changed object data from that previous snapshot point to the current snapshot point, and whatever else may be in between those snapshots. The result is a constant update of VMFS changes from the source ZFS file system to the remote ZFS file system, which can be completely different hardware. This ZFS hardware autonomy gift allows us to provision a much lower cost system on the remote DR side to host the VMFS volumes. For example the target system presented in this weblog is an IBM x3500 and the source is a SUN X4500 detailed in a previous blog.
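As a hedged illustration of the send and receive mechanics only (the pool, dataset and host names here are examples that line up with the configuration built later in this article, where the zfsadm account and its pfexec rights are defined), the first two lines are the initial full send and the last two carry just the delta:
# zfs snapshot sp1/iscsi/lun0@initial
# zfs send sp1/iscsi/lun0@initial | ssh zfsadm@ss2 pfexec /usr/sbin/zfs receive rp1/iscsi/lun0
# zfs snapshot sp1/iscsi/lun0@daily
# zfs send -i sp1/iscsi/lun0@initial sp1/iscsi/lun0@daily | ssh zfsadm@ss2 pfexec /usr/sbin/zfs receive -F rp1/iscsi/lun0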
There are some important considerations that should be kept in mind when we use snapshots to create any DR process. One of the most important areas to consider is the data change rate on the VMFS volumes that are to be included in the DR send/receive process. When we have VMware servers or VMs that have low memory allocations (a.k.a. over-committed memory) or application behaviors that swap to disk frequently, we will observe high volumes of what I call disk noise, or disk data change that has no permanent value. High amounts of disk noise will consume more storage and bandwidth on both systems when snapshots are present. In cases where the disk noise reaches a rate of 1GB/day or more per volume it would be prudent to isolate the noise sources on a VMFS volume that will not be part of the replication strategy. You could for example create a VMFS LUN for swap and temp files on the local ESX host which can be ignored in the replication scope. Another important area is the growth rate of the overall storage, which may require routine pruning of older snapshots to reduce the total consumption of disk. For example, if we have high change rates from database sources which cannot be isolated, we can at monthly intervals destroy all but one of the last month's snapshots to conserve the available storage on both systems. This method still provisions a good DR process, provides a level of continuous data protection (CDP) and is similar to a grandfather/father/son preservation scheme.
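Pruning is just a matter of listing the snapshots and destroying the ones that fall outside the preservation scheme; a hedged example follows, and you must never destroy the snapshot that the next incremental send will be based on.
# zfs list -t snapshot -r sp1/iscsi/lun0
# zfs destroy sp1/iscsi/lun0@10-10-2008-23:45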
Since we are handling valuable information we must use secure methods to access and handle the data transfers. This can be provisioned by using ssh and dedicated service accounts that will perform this one specific function. ZFS send and receive functions use an inherently secure approach by employing ssh as a transport tunnel when transmitting storage data to the target ZFS file system. This is just what we need to provision a secure exchange to a DR environment. Conversely we could use IPSec, but this would be significantly more complex to achieve, and complexity is not a good thing when short implementation time is a priority. With this explanation wrapped up in our minds, let's begin some of the detailed tasks that are required to procure this real world DR solution.
ESXi Server
The DR VMware server is based on the free ESXi product, and with this we can encapsulate the entire DR functionality on one server hardware platform. Within the ESXi engine we need to install and configure an OpenSolaris 5.11 snv_98 or higher VM using VMFS as the storage provider. This ESXi server configuration consists of a single SATA boot LUN, and this LUN also stores the OpenSolaris iSCSI VM. In addition to the boot LUN we will create the ZFS iSCSI stores on several additional SATA disks that are presented to the OpenSolaris VM as separate VMFS datastores, which we will use to create large vmdk's. The virtual vmdk disks will be assigned as vdevs for our receiving ZFS zpool. Talk about rampant layering. At this point we have an OpenSolaris VM defined on a hardware platform that OpenSolaris would normally never work with natively in this example. You have got to love what you can do with VMware virtualization. By the way, when SUN's xVM product is more mature it could provision the same functionality with native ZFS provisioning, and that alone really is worth a talk, but let's continue our focus on this platform for now.
There are many configuration options available on the network provisioning side of our ESXi host. In this case VLANs are definitely a solid choice for this application and are my preferred approach to controlling iSCSI data flow. Initially we only need to provide iSCSI access for the local OpenSolaris VM, as this will provision a virtual SAN to the parent ESXi host. The parent ESXi host needs to be able to mount the iSCSI target LUNs that were available in the production environment and validate that the DR process works. In the event of DR activation we would need to add external ESXi hosts; VLANs will provide both locally isolated iSCSI networks and easy expansion if these services are required externally, all without the need to purchase external switch hardware for the system until it is required. Thus within the ESXi host we need to define a VLAN for the iSCSI SAN and an isolated VLAN for production VM validations, and finally we need to define the replication and management network, which can optionally use a VLAN or be untagged depending on your environment.
This virtualized DR environment grants advanced capabilities to perform rich system tests at commodity prices. Very attractive indeed. For example you can now clone the replicated VMFS LUNs on the DR engine and, with a little Solaris SMF iSCSI target service magic, provision the clone as a duplicated ESX environment which does not impact the ongoing replication. As well we have network isolation and virtualization that allows the environment to exist in a closed, fully functional, remotely accessible world. This world can also be extended out as a production mirror test environment with dynamic revert-back-in-time and repeat functionality.
There are many possible ESXi network and disk configurations that would meet the DR server’s requirements. At the minimum we should provision the following elements.
- Provision a bootable single separate SATA disk with a minimum of 16G available for the VMFS LUN that will store the OpenSolaris iSCSI VM.
- Provision a minimum of three (optimally six) additional SATA disks or more if required as VMFS LUN’s to host the ZFS zpool vdev’s with vmdk’s.
- Provision a minimum of two 1Gb Ethernet adaptors, teamed would be preferable if more are available.
- Define vSwitch0 with a VLAN tagged VM Network portgroup to connect the replication side of the OpenSolaris iSCSI VM and a Service Console portgroup to manage the ESXi host.
- Define vSwitch1 with a VLAN tagged iSCSI VM kernel portgroup to service the iSCSI data plane and also define a VM Network portgroup on the same VLAN to connect with the target interface of the OpenSolaris iSCSI VM.
- Define the required isolated VLAN tagged identically named portgroups as production on vSwitch0 and use a separated VLAN numbering set for them for isolation.
- Define the OpenSolaris VM with one adapter connected to the production network portgroup and one adapter attached to the iSCSI data plane portgroup to serve the iSCSI target IP.
Here is an example of what the VM disk assignments should look like.
Once the ESXi server is successfully built and the Opensolaris iSCSI VM is installed and functional we can create the required elements for enabling ZFS replication.
Create Service Accounts
On the systems that will act as replication partners create zfsadm ID’s as service accounts using the provided commands.
# useradd -s /usr/bin/bash -d /export/home/zfsadm -P 'ZFS File System Management' zfsadm
# mkdir /export/home/zfsadm
# mkdir /export/home/zfsadm/backup
# cp /etc/skel/* /export/home/zfsadm
# echo PATH=/usr/bin:/usr/sbin:/usr/ucb:/etc:. > /export/home/zfsadm/.profile
# echo export PATH >> /export/home/zfsadm/.profile
# chown -R zfsadm /export/home/zfsadm
# passwd zfsadm
Note the parameter -P 'ZFS File System Management'. This will grant the account an RBAC profile association to administratively manage our ZFS file system, unlike root which is much too powerful and is all too often used by many of us.
The next step is to generate some crypto keys for ssh connectivity. We start this with a login as the newly created zfsadm user and run a secure shell locally to ensure you have a .ssh directory and key files created in the home directory for the zfsadm user. Note this directory is normally hidden.
# ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is 0c:aa:64:72:84:b5:04:1c:a2:d0:42:8e:9f:4e:09:9d.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Password:
# exit
Now that we have the .ssh directory we can create a crypto key pair and configure a relatively secure login without the need to enter a password for the remote host using this account.
Do not enter a passphrase, it needs to be blank.
# ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (/export/home/zfsadm/.ssh/id_dsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /export/home/zfsadm/.ssh/id_dsa.
Your public key has been saved in /export/home/zfsadm/.ssh/id_dsa.pub.
The key fingerprint is:
bf:58:7b:97:8d:b5:d2:31:26:14:4c:9f:ce:72:a7:20 zfsadm@ss1
The id_dsa file should not be exposed outside of this directory as it contains the private key of the pair, only the public key file id_dsa.pub needs to be exported. Now that our key pair is generated we need to append the public portion of the key pair to a file named authorized_keys2.
# cat $HOME/.ssh/id_dsa.pub >> $HOME/.ssh/authorized_keys2
Repeat all the Create Service Accounts steps and crypto key steps on the remote server as well.
We will use the Secure Copy command to place the public key file in each opposing host's zfsadm user home directory, so that when the ssh tunnel is started the remote host can decrypt the encrypted connection request, completing the tunnel which is generated with the private part of the pair. This is why we must protect the private part of the pair from exposure. Granted, we have also defined an additional layer of security here by using a dedicated user for the ZFS send activity, but it is still very important that the private key is secured properly. It is not necessary to back it up, as you can regenerate the pair if required.
From the local server here named ss1 (The remote server is ss2)
# scp $HOME/.ssh/id_dsa.pub ss2:$HOME/.ssh/ss1.pub
Password:
id_dsa.pub 100% |**********************************************| 603 00:00
# scp ss2:$HOME/.ssh/id_dsa.pub $HOME/.ssh/ss2.pub
Password:
id_dsa.pub 100% |**********************************************| 603 00:00
# cat $HOME/.ssh/ss2.pub >> $HOME/.ssh/authorized_keys2
And on the remote server ss2
# ssh ss2
password:
# cat $HOME/.ssh/ss1.pub >> $HOME/.ssh/authorized_keys2
# exit
This completes the trusted key secure login configuration, and you should be able to secure shell from either system to the other without a password prompt using the zfsadm account. To further limit security exposure we could employ IP address restrictions and enable a firewall, but this is beyond the scope of this blog.
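A quick check from ss1 confirms the trust is working; the remote host name should come back with no password prompt.
# ssh ss2 hostname
ss2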
Target Pool and ZFS rights
As a prerequisite you need to create the receiving zpool on the target to allow the zfs sends to occur. The receiving zpool name should be the same as the source to allow ease in the re-serving of iSCSI targets. Earlier we granted the "ZFS File System Management" profile to this zfsadm user. This RBAC profile allows us to run the pfexec command, which pre-checks what profiles the user is assigned and then executes appropriately based on this assignment. The bonus here is you do not have to create granular rights assignments to the ZFS file system.
On the target server create your receiving zpool.
# zpool create rp1 <your vdevs>
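With the profile in place the zfsadm account does not need root; as a hedged example, the ZFS commands are simply run through pfexec.
# pfexec zfs create rp1/iscsi
# pfexec zfs list -r rp1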
Create a Cron Job
Using a cron job we will invoke our ZFS snapshot and send tasks to the target host with the execution of a bash script named zfs-daily.sh. We need to use the crontab command to create a job that will execute it as the zfsadm user; no other user except root can access this job, and that's a good thing considering it has the ability to shell to another host!
As root, add the zfsadm user name to the /etc/cron.d/cron.allow file.
# echo zfsadm >> /etc/cron.d/cron.allow
# crontab -e zfsadm
59 23 * * * ./zfs-daily.sh zfs-daily.rpl
Hint: crontab uses vi; see the vi cheat sheet at http://www.kcomputing.com/kcvi.pdf
The key sequence would be: hit "i", key in the line, then hit "esc :wq" to save; to abort, hit "esc :q!".
Be aware of the time zone the cron service runs under; you should check it and adjust it if required. Here is an example of what's required to set it.
# pargs -e `pgrep -f /usr/sbin/cron`
8550: /usr/sbin/cron
envp[0]: LOGNAME=root
envp[1]: _=/usr/sbin/cron
envp[2]: LANG=en_US.UTF-8
envp[3]: PATH=/usr/sbin:/usr/bin
envp[4]: PWD=/root
envp[5]: SMF_FMRI=svc:/system/cron:default
envp[6]: SMF_METHOD=start
envp[7]: SMF_RESTARTER=svc:/system/svc/restarter:default
envp[8]: SMF_ZONENAME=global
envp[9]: TZ=PST8PDT
Let’s change it to CST6CDT
# svccfg -s system/cron:default setenv TZ CST6CDT
Also the default environment path for cron may cause some script “command not found” issues, check for a path and adjust it if required.
# cat /etc/default/cron
#
# Copyright 1991 Sun Microsystems, Inc. All rights reserved.
# Use is subject to license terms.
#
#pragma ident "%Z%%M% %I% %E% SMI"
CRONLOG=YES
This one has no default path, add the path using echo.
# echo PATH=/usr/bin:/usr/sbin:/usr/ucb:/etc:. > /etc/default/cron
# svcadm refresh cron
# svcadm restart cron
Create Snapshot Replication Script
Here is the link for the zfs-daily.sh replication script. You will need to grant exec rights to this file, e.g.
# chmod 755 zfs-daily.sh
The replication script needs to live in the zfsadm home directory /export/home/zfsadm. At this point I only have the one script built, but other ones are in the works, like a grandfather/father/son snapshot rollup script. The first run of the script can take a considerable amount of time depending on the available bandwidth and size of the VMFS LUNs. This cron job runs at midnight; it took 6 hours over 100MB's of bandwidth the first time and less than 5 min thereafter. A secondary script that runs hourly and is rolled up at day's end would be beneficial. I will get around to that one and the grandfather/father/son script later.
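Since the script itself is only linked above, here is a minimal sketch of the idea and nothing more; the snapshot naming, the target host variable and the incremental bookkeeping below are simplified assumptions rather than the actual zfs-daily.sh contents.
#!/usr/bin/bash
# Sketch: snapshot each dataset listed in the file passed as $1 and send
# the changes to the DR host over the zfsadm ssh trust built earlier.
TARGET=ss2
NOW=`date +%m-%d-%Y-%H:%M`
while read ds; do
        [ -z "$ds" ] && continue
        LAST=`zfs list -H -t snapshot -o name -s creation -r $ds | tail -1`
        pfexec zfs snapshot $ds@$NOW
        if [ -n "$LAST" ]; then
                pfexec zfs send -i $LAST $ds@$NOW | ssh $TARGET pfexec /usr/sbin/zfs receive -F $ds
        else
                pfexec zfs send $ds@$NOW | ssh $TARGET pfexec /usr/sbin/zfs receive $ds
        fi
done < $1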
At this point we have an automated DR process that provides a form of CDP, but we do not have a way to access it, so we need to perform some additional steps. In order for VMware to use the relocated VMFS iSCSI targets we need to reinstate some critical configuration info that was stored in the source host's Service Management Facility (SMF) repository. Within the iscsitgt service properties we have the Network Address Authority (NAA) value, which is named GUID in the properties list. This value is very important: when a VMFS volume is initialized the NAA is written to the VMFS volume header, and it will need to be redefined on the DR target so that VMware will recognize the data store as available. If the NAA in the header and on the target do not match, the volume will not be visible to the DR VMware ESXi host. To protect this configuration info we need to export it from the source host and send it to the target host.
Export SMF iSCSI configuration
The iscsitgt service configuration elements can be easily exported using the following commands.
# svccfg export iscsitgt > /export/home/zfsadm/backup/ss1-iscsitgt.xml
# scp ss1:/export/home/zfsadm/backup/* ss2:/export/home/zfsadm/backup/
SMF iscsitgt import and iSCSI configuration details
To import the production service we would issue the following commands.
# svcadm disable iscsitgt
# svccfg delete iscsitgt
# svccfg import /export/home/zfsadm/backup/ss1-iscsitgt.xml
Importing the iscsitgt service configuration is a simple task, but it does have some elements that will be problematic if they are left unchecked. For example, iSCSI Target Portal Group Tag values are included with the export/import functions, and thus you may need to change the portal group values to correct discovery failures when the IP addresses are different on the target system. Another potential issue is leaving the existing SMF config in place and then importing the new one on top of it. This is not a best practice, as you may create an invalid SMF configuration for the iscsitgt service with elements that are orphaned out, etc. The SMF properties will have the backing store path from the source server, and if the target server does not have the same zpool name this will need to be fixed. And lastly, make sure you have the same iscsitgt version on each end, since the configuration can change between versions.
You will also need to add the ESXi software initiator to the iSCSI target(s) on the receiving server and grant access with an ACL entry and CHAP info if used.
# iscsitadm create initiator --iqn iqn.1998-01.com.vmware:vh0.0 vh0.0
# iscsitadm modify target --acl vh0.0 ss1-zstore0
To handle a TPGT configuration change, it's simply a matter of re-adding the entries with the iscsitadm utility as demonstrated here, or possibly deleting the ones that are not correct.
# iscsitadm create tpgt 1
# iscsitadm modify tpgt -i 10.10.0.1 1
# iscsitadm modify tpgt -i 10.10.0.2 1
# iscsitadm modify target -p 1 ss1-zstore0
To delete a tpgt that is not correct is very straightforward.
# iscsitadm delete target -p 1 ss1-zstore0
# iscsitadm delete tpgt -A 1
Where 10.10.0.1 and .2 are the target interfaces that should participate in portal group 1 and ss1-zstore0 is the target alias. In some cases you may have to remove the tpgt altogether. The backing store is editable, as are many other SMF properties. To change a backing store value in the SMF we use the svccfg command as follows.
Here is an example of listing all the backing stores and then changing /dev/zvol/rdsk/sp2/iscsi/lun0 so it is on zpool sp1 instead of sp2.
# svcadm enable iscsitgt
# svccfg -s iscsitgt listprop | grep backing-store
param_dr-zstore0_0/backing-store astring /dev/zvol/rdsk/sp2/iscsi/lun0
param_dr-zstore0_1/backing-store astring /dev/zvol/rdsk/sp1/iscsi/lun1
# svccfg -s iscsitgt setprop param_dr-zstore0_0/backing-store=/dev/zvol/rdsk/sp1/iscsi/lun0
# svccfg -s iscsitgt listprop | grep backing-store
param_dr-zstore0_0/backing-store astring /dev/zvol/rdsk/sp1/iscsi/lun0
param_dr-zstore0_1/backing-store astring /dev/zvol/rdsk/sp1/iscsi/lun1
Changing the backing store value is instrumental if you wish to mount the VMFS LUNs to provision system validation or online testing. However, do not attach the file system from the active replicated ZFS backing store to the ESXi server for validation or testing, as it will fail any additional replications once it is modified outside of the active replication stream. You must first create a clone of a chosen snapshot and then modify the backing store to use this new backing store path. This method will present a read/write clone through the iscsitgt service with the same iqn names, so no reconfiguration is required to create different time windows into the data stores or to revert to a previous point.
Here is an example of how this would be accomplished.
# zfs create sp1/iscsi/clones
# zfs clone sp1/iscsi/lun0@10-10-2008-23:45 sp1/iscsi/clones/lun0
# svcadm refresh iscsitgt
# svcadm restart iscsitgt
To change to a different snapshot time you would simply need to destroy or rename the current clone and replace it with a new or renamed clone of an existing snapshot on the same clone backing store path.
# zfs destroy sp1/iscsi/clones/lun0
# zfs clone sp1/iscsi/lun0@10-11-2008-23:45 sp1/iscsi/clones/lun0
# svccfg -s iscsitgt setprop param_dr-zstore0_0/backing-store=/dev/zvol/rdsk/sp1/iscsi/clones/lun0
# svcadm refresh iscsitgt
# svcadm restart iscsitgt
VMware Software iSCSI configuration
The ESXi iSCSI software configuration is quite straightforward. In this architecture we need to place an interface of the OpenSolaris iSCSI target host on vSwitch1, which is where we defined the iSCSI-Net0 VM kernel network. To do this we create a VM Network portgroup on the same VLAN ID as the iSCSI VM kernel interface.
Here is an example of what this configuration looks like.
For more detail on how to configure the iSCSI VM interfaces see this blog: http://blog.laspina.ca/ubiquitous/running_zfs_over_iscsi_as; in this case you would not need to define an aggregate since there is only one interface for the iSCSI vSAN.
The final step in the configuration is to define a discovery target on the iSCSI software configuration panel and then rescan the vmhba for new devices.
Hopefully this blog was of interest for you.
Til next time….
Mike
Site Contents: © 2008 Mike La Spina