Ubiquitous Talk

Exploring an IBM v7000 Storage Engine

Within the storage landscape today we have a multitude of products available to solve the never-ending storage challenge. Every storage product has its own set of features and characteristics which deliver certain elements of value to storage architects like myself. Recently an opportunity surfaced in which an IBM v7000 storage engine was available for me to review. So with the IBM v7000 in hand I started a process to evaluate which of its features and characteristics could be used to create some interesting solutions to common storage challenges. Before breaking into the main content of this blog I would like to state what this blog isn’t about. This is not a benchmark of everything the v7000 can do. Good performance is very subjective, and with enough resources and tuning we can meet most load requirements with most products. Nor is it a feature comparison exercise with other products in the same class. What I’m looking for is value in the product and how we can apply that value to solve storage problems.

If you’re not familiar with the v7000 it does not take long to understand and navigate it; the web-based management GUI is logical and straightforward. It is beneficial to understand how the v7000 presents storage when working in the GUI. Historically the v7000 is a version of the original IBM SAN Volume Controller, more commonly known as the IBM SVC, with an updated set of management interfaces and some useful new features. The primary concept of the SVC is to provide a virtually abstracted instance of any block storage architecture the SVC can interface with. The design uses a model in which we define units of managed disk, or MDisks. MDisks are block-level devices or LUs, such as a single disk or an array of disks, which can be located internally or externally to the SVC’s clustered engines. MDisks can be placed into tiered storage resource pools or served directly as a block image. As the SVC virtually re-provisions its storage attachments to initiators it creates multi-pathed, nv-cache accelerated, active/active connections for its storage consumers even if the original MDisk lacks this function. This definitely adds value as a tool to solve storage problems, especially where you may need to migrate or restructure an active SAN environment.

The setup phase of this evaluation proved to be very easy. The v7000 ships with a USB Flash Drive pre-loaded with a simple Windows-based initialization tool. This tool allows anyone to reach the browser management subsystem in less than 5 minutes. Once the system is configured the v7000 only accepts a small set of functions from the tool that require physical access, such as resetting the superuser password. This limits serious security events like connecting the USB Flash Drive to an incorrect host and disruptively changing its interface configuration.

Here are some screen shots of what brought me to the GUI in less than 5 minutes.

IBM v7000 Init Tool

v7000 InitTool IP Address

v7000 InitTool Completion

For me this easy-to-use tool was actually not the most significant element of value in the initial system configuration. Under the covers of the graphical tool we can see that the v7000 provides the ability to configure a system from a text file when it’s brought up from the default factory state. A subset of the system command line interface functions is available via the USB-based satask.txt configuration file. The InitTool itself generates that subset, or you can write your own specific commands once you know the appropriate statements. This means we can create an automated configuration process with it. I find this useful mainly in the context of disaster recovery and test lab provisioning. For example, if the alternate environment hosted a ZFS replicated set of LUNs on a commodity server, we could easily create a configuration driven from command line scripts that serves a cloned set of those LUNs in image mode using the v7000 or its lower cost v3700 counterpart. This would allow us to cost-effectively mirror the functionality of a source environment on demand while also allowing us to return to a test lab or other previous state, all in the same day. In other words we gain the agility to repeatedly restructure, on demand, what a set of VMware ESXi or other hypervisor clusters can view.

In order to evaluate the performance behaviors of the v7000 I decided it needed to go through what I like to call a little storage server hell. It’s a work exercise where we push the server to perform a 100% random IO load at 50% read and 50% write using 512 byte requests. In this case it is a test of what a specific client can drive under specific conditions. The client driving this load is a Windows 7 based VMware VM with 2 x 3.0GHz Intel i7 vCPUs running IOMeter with 64 worker threads. The VMware hypervisor host is a vSphere 5.1 based ESXi instance running over a dual 2Gb Fiber Channel HBA. The loading VM resembles a very heavy storage consumer in a virtualized context. Almost all of these evaluations are exercised against 2 defined pools, dp1 and dp2. dp1 is a pool hosted by the internal storage resources of the v7000 evaluation unit; mdisk0 is mapped as a RAID 5 array and mdisk4 is our mirrored SSD array. The SSD tier is not used in this specific test. We will also be exploring some externally configured FC attached storage which is assigned to storage pool dp2, so more on that will follow later in this blog entry.

v7000 Test Base Pools

Let’s explore the test configuration and its results.

The specifics for the workload are as follows:

Storage Server Stress Scale (1 – 10) = 10
Workers(Threads) = 64
Disk Size in Sectors = 24,000,000
IO Size in Bytes = 512
Outstanding Requests per Worker = 1
Random Operation Percentage = 100
Read Operation Percentage = 50
Write Operation Percentage = 50
Fiber Channel Paths = 2
Fiber Channel Speed in Gb/s = 2
Settling Time in Seconds = 240

Test 1 = Random IO Storage Provisioning Hell
Tool = IOMeter
Volume Name = svc1-san-vol0
Raid Mode = 5
Thick Block Map Mode = 0
Compression = 0
SSD Cache = 0
NV Ram Cache in GB = 8

As we can observe, the virtual disk sector map of 24M sectors (roughly 12.3 GB at 512 bytes per sector) exceeds the storage server’s 8 GB non-volatile cache, and this means the server will need to exercise random seeks for a significant portion of the requests. Let’s look at the results graphically.
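To put numbers on the cache argument, here is a minimal sketch of the arithmetic. It assumes 512-byte sectors (matching the IO size in the test) and treats the 8 GB cache figure as decimal gigabytes. The function names are my own, and the hit-ratio figure is only a best-case bound for uniformly random access, not something measured on the v7000.

```python
SECTOR_BYTES = 512  # assumed sector size, matching the test's IO size


def working_set_bytes(sectors: int, sector_bytes: int = SECTOR_BYTES) -> int:
    """Size of the IOMeter test region in bytes."""
    return sectors * sector_bytes


def max_cache_hit_ratio(working_set: int, cache_bytes: int) -> float:
    """Best-case fraction of uniformly random requests a cache could serve."""
    return min(1.0, cache_bytes / working_set)


if __name__ == "__main__":
    ws = working_set_bytes(24_000_000)
    cache = 8 * 10**9  # 8 GB, taken from the test parameters
    print(f"working set: {ws / 1e9:.2f} GB")
    print(f"best-case cache hit ratio: {max_cache_hit_ratio(ws, cache):.2f}")
```

For a region of 1,000,000 sectors the same function gives 512 MB, which would fit entirely in an 8 GB cache.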

v7000 Performance iops under 100% rnd 50% read/write

Obviously the v7000 system deals with the punishment very efficiently. Specifically we can see the latency is very low even under significant stress. This demonstrates that the system performs well even under the most demanding completely random workloads. The v7000 design maturity is evident in this load test. I’m sure we could push it further than this, but we must keep the test results in context as the evaluation system only hosts an 8 drive 10k SAS RAID 5 array. The constraint is an effective way to observe how well the SAN Volume Controller cluster software performs.

For comparison I collapsed the 24M-sector virtual disk down to 1M assigned sectors in an effort to observe what the storage cache based IO response would look like under a 100% random 50% read/write IO load.

v7000 Performance iops under 100% rnd 50% read/write

Well, it was very apparent that the Windows 7 VM is the limiting factor here; however, it still demonstrates the value in the engineering. The reported latency is zero even with 23500 random IOPS hitting the storage cluster.
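A quick conversion from IOPS to throughput helps show why the Fiber Channel links are unlikely to be the constraint at this request size. This is a sketch only: the figure of roughly 200 MB/s of usable payload per 2 Gb/s path is a commonly quoted number that I am assuming here, not something measured in this test, and the function names are my own.

```python
def throughput_mb_s(iops: float, io_bytes: int) -> float:
    """Payload throughput in decimal megabytes per second."""
    return iops * io_bytes / 1_000_000


def link_utilization(mb_s: float, paths: int, mb_s_per_path: float = 200.0) -> float:
    """Fraction of the aggregate link payload rate in use (assumed per-path rate)."""
    return mb_s / (paths * mb_s_per_path)


if __name__ == "__main__":
    mb_s = throughput_mb_s(23_500, 512)
    print(f"{mb_s:.3f} MB/s, {link_utilization(mb_s, 2):.1%} of two paths")
```

At 512 byte requests the payload is tiny, so the per-request overhead on the client side dominates long before link bandwidth does.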

I found the compression feature to be effective for almost any type of VM IO load. There are some basic rules of engagement one should follow when using compression on the v7000. The first rule concerns the load type; specifically, be wary of very high intensity random write loads. This is not because the v7000 will not perform well for this load; it’s actually rooted in the system’s CPU load factor when other CPU demands are present. You do not want to push the normal running CPU load over 60% for a sustained period as it will increase the possibility of creating excessive peak loading events. The second rule addresses the desire to engage the Easy Tier functionality. The issue is that compression will not produce a proper heat map pattern, since writing compressed data always shifts the pattern and thus the data will not remain at the last heat map hit point. You can still drive a compressed volume, but you would need to move the entire LUN to the SSD tier to be fully effective.

In order to gauge how well the v7000’s compression algorithm responds I chose to drive the engine with a typical Windows random 4k 70% read, 30% write load. Let’s observe the graphical result of a sustained 4 minute run.

The specifics for the workload are as follows:
Storage Server Stress Scale (1 – 10) = 6
Workers(Threads) = 32
Disk Size in Sectors = 24,000,000
IO Size in Bytes = 4096
Outstanding Requests per Worker = 1
Random Operation Percentage = 100
Read Operation Percentage = 70
Write Operation Percentage = 30
Fiber Channel Paths = 2
Fiber Channel Speed in Gb/s = 2
Settling Time in Seconds = 240

Test 2 = Typical Random IO Compressed Storage Provisioning
Tool = IOMeter
Volume Name = svc1-san-vol0
Raid Mode = 5
Thick Block Map Mode = 0
Compression = 1
SSD Cache = 0
NV Ram Cache in GB = 8

v7000 Compression Performance iops 4k 70-30 read-write mix

The results are very interesting; we can observe a substantial drop in the SAS backplane IOPS in the interface metrics section. This is an excellent result as it will reduce the mdisk load and thus increase the data throughput. Another important element is the greatly reduced IO load at the mdisk layer. The input side volume is receiving a total of 2516 IOPS while the output side only requires 1452 IOPS, a reduction of roughly 42%. This is certainly a valued performance enhancing behavior when employing the compression feature. The final element I see as noteworthy is the very low latency at the provisioned volume, which never exceeds 3ms during the entire load run.
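The back-end saving can be expressed as a simple ratio of the two observed rates. The sketch below uses only the 2516 and 1452 IOPS figures quoted above; the function name is my own, and the ratio does not separate the effect of compression from the effect of the write cache.

```python
def backend_io_reduction(front_iops: float, back_iops: float) -> float:
    """Fraction of front-end IOPS that never reaches the mdisk layer."""
    if front_iops <= 0:
        raise ValueError("front_iops must be positive")
    return 1.0 - back_iops / front_iops


if __name__ == "__main__":
    saved = backend_io_reduction(2516, 1452)
    print(f"mdisk layer sees {saved:.1%} fewer IOPS than the volume")
```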

As a bonus we only consumed 17% of the CPU and achieved the following compression capacity gain:

v7000 Compression Gain on a Windows 7 VM

Even with a completely random data footprint workload for this performance test case, we gained 32% in capacity when using the compression feature and I’m quite happy with that result.

From the IOMeter client side the performance results do correlate, as demonstrated by this screen shot.

VM to v7000 IOMeter IOPS on compressed volume no ssd

Storage tiering is one of the more important elements the v7000 Storage Server can provision. All the marketing noise for this product emphasizes that it’s easy to use, and I would concur: it was very easy to use and it works without any effort. The v7000 provisions tiering by granting the storage administrator the ability to define performance classes of mdisk arrays within a pool. IBM engineers make use of IO activity heat maps to determine which block extents within a defined volume should be migrated to a higher performance tier. You do have control of the extent size when you create the pool itself. Once the pool is created you cannot change the extent size, nor should you. The default extent size is 256MB, which I ran a series of performance checks on, and the IBM engineers have chosen a very good default. Over a range from 32MB to 1GB, 256MB suits general use VM provisioning best and gives the best performance. The v7000 engineers chose a 24 hour cycle of activity time within the heat map data to determine which extents should move to a higher tier and I agree with this methodology. Many discussions about using shorter sampling time algorithms do circulate the web. I find that if the sampling window is too short or the extent size is too small the results are not favorable. When we move data too quickly we begin to thrash it around and this is not efficient. Too much movement generates fragmentation, and it unnecessarily uses too much backplane bandwidth and other system resources like cache. It also does not allow the system an opportunity to move the extent when the system is most idle.
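To illustrate the idea of a heat map with a sampling window, here is a toy model in Python. It is not IBM’s algorithm: the log format, the function name and the "promote the hottest extents that fit" policy are all my own assumptions, used only to show how a 24 hour window turns IO activity into a promotion list.

```python
from collections import Counter
from typing import Iterable, List, Tuple

WINDOW_SECONDS = 24 * 60 * 60  # 24 hour sampling cycle described in the text


def hot_extents(
    io_log: Iterable[Tuple[float, int]],
    now: float,
    ssd_extent_slots: int,
    window: float = WINDOW_SECONDS,
) -> List[int]:
    """Return extent ids to promote, hottest first (hypothetical policy).

    io_log holds (timestamp_seconds, extent_id) pairs. Only IOs inside the
    trailing window count toward an extent's heat.
    """
    heat = Counter(
        extent for ts, extent in io_log if now - window <= ts <= now
    )
    # most_common sorts by count, highest first, and caps the list at the
    # number of extents the SSD tier can hold.
    return [extent for extent, _ in heat.most_common(ssd_extent_slots)]
```

With a very short window the promotion list changes every time the workload shifts, which is the thrashing behavior described above; a longer window smooths that out at the cost of reacting more slowly.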

To observe the benefits of running tiered on the v7000 I chose to perform a before and after workload run using the same typical Windows 4k random IO with 70% read and 30% write. The specifics of the run are as follows and we will observe the results at the IOMeter client side.

Storage Server Stress Scale (1 – 10) = 6
Workers(Threads) = 32
Disk Size in Sectors = 24,000,000
IO Size in Bytes = 4096
Outstanding Requests per Worker = 1
Random Operation Percentage = 100
Read Operation Percentage = 70
Write Operation Percentage = 30
Fiber Channel Paths = 2
Fiber Channel Speed in Gb/s = 2
Settling Time in Seconds = 240

Test 3 = Typical Random IO Tiered Storage Provisioning
Tool = IOMeter
Volume Name = svc1-san-vol0
Raid Mode = 5
Thick Block Map Mode = 0
Compression = 0
SSD Cache = 1
NV Ram Cache in GB = 8

VM to v7000 IOMeter IOPS Tiered mode before heat map move

And after 24 hours the same test parameters yielded the following result.

VM to v7000 IOMeter IOPS Tiered mode after heat map move

Obviously the result demonstrates significant IOPS performance gains. In this test the first workload run executed for 4 minutes and the system was then left idle for a period of 24 hours. Subsequently the second run was performed for the same 4 minute length. Within the IOMeter results I did find the throughput gain quite remarkable. I was not expecting to see such a significant increase in the Total MB/s value. It’s actually 22 times greater than the original run. I did have to run it a second time just to verify that it was not an anomaly in the original test run. After tearing down the volume, recreating it, rerunning the workload and waiting the required 24 hours it again presented the same result. It’s something I will have to investigate further as the reason eludes me for the moment. Nonetheless the numbers speak for themselves.

One of the most important features that the v7000 hosts, for me, is the ability to virtualize external storage systems that are presented via Fiber Channel. The reason I find value in this feature is that it grants the ability to move significant amounts of storage around without major impact to the primary external storage consumer. In addition to the migration capability, one can also front-end an external storage host and synchronously mirror the data to a second external storage host.

The v7000 officially supports a significant number of FC based external storage systems. Personally I wanted to investigate whether the v7000 could handle an open source based product such as OpenSolaris, which today means any Illumos based engine. There is a synergy to be gained between the world of the SVC and the commodity open source world. With that idea in mind I built the required elements and did some very interesting tests, provisioning some OpenIndiana FC based LUNs to the v7000.

Let’s walk through some of the build elements.

The source storage host hardware consisted of off the shelf white box commodity components as follows:

1 – Antec Case
1 – LSI SAS3442 Adapter
1 – QL2462 Dual FC Adapter
8GB – DRAM
1 – Intel i7 930 CPU
4 – Seagate NL SAS ST32000645SS
1 – USB Flash
1 – X58 Gigatech Mobo

The USB Flash Drive was loaded with an OpenIndiana USB based install using version oi_151a7.
The basic OpenIndiana storage configuration elements are as follows:

~# zpool create -f sp1 raidz1 c7t13d0 c7t14d0 c7t15d0 c7t16d0

(4 Disk Raidz1 array)

~# update_drv -a -i '"pciex1077,2432"' qlt

(FC Target Mode Driver Binding For COMSTAR)

~# zfs create -b 64K -s -V 256G sp1/zfs1-san-vol1

~# zfs create -b 64K -s -V 256G sp1/zfs1-san-vol2

(Some ZFS Posix Block Devices)

~# stmfadm create-lu /dev/zvol/rdsk/sp1/zfs1-san-vol1

~# stmfadm create-lu /dev/zvol/rdsk/sp1/zfs1-san-vol2

(Some COMSTAR exposed LUNs)

~# stmfadm create-hg svc1

~# stmfadm add-view -n 0 -h svc1 600144F0F5644400000050BD35750001

~# stmfadm add-view -n 1 -h svc1 600144F0F5644400000050BD35750002

(Some COMSTAR host groups and views to the LUNs now assigned with a GUID)

~# stmfadm add-hg-member -g svc1 wwn.500507680220146B

~# stmfadm add-hg-member -g svc1 wwn.500507680210146B

(Add the v7000 to the COMSTAR svc1 host group)

~# zfs create -s -b 8K -V 32G sp1/zfs1-san-vol3

~# stmfadm create-hg esx1

~# stmfadm add-hg-member -g esx1 wwn.210000e08b83cef2

~# stmfadm create-lu /dev/zvol/rdsk/sp1/zfs1-san-vol3

~# stmfadm add-view -n 4 -h esx1 600144F0F5644400000050CD19240001

(Create a volume to test the v7000 image mode)
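The COMSTAR steps above follow a repeatable pattern: create the LU, create a host group, add the initiator WWNs and then add a view. As a sketch, a small Python helper can generate that command list for review before anything is run on the storage host. The helper name and its parameters are my own; it only uses the stmfadm sub-commands already shown above, and the LU GUID has to be passed in because it is only known after create-lu has run.

```python
from typing import List, Sequence


def export_zvol_commands(
    zvol_path: str,
    host_group: str,
    initiator_wwns: Sequence[str],
    lun: int,
    lu_guid: str,
) -> List[str]:
    """Build the stmfadm command sequence used above (commands are not executed)."""
    cmds = [
        f"stmfadm create-lu {zvol_path}",
        f"stmfadm create-hg {host_group}",
    ]
    # One add-hg-member per initiator port, in the wwn.<hex> form used above.
    cmds += [
        f"stmfadm add-hg-member -g {host_group} wwn.{wwn}"
        for wwn in initiator_wwns
    ]
    # lu_guid is reported by create-lu, so it must be supplied by the caller.
    cmds.append(f"stmfadm add-view -n {lun} -h {host_group} {lu_guid}")
    return cmds


if __name__ == "__main__":
    for line in export_zvol_commands(
        "/dev/zvol/rdsk/sp1/zfs1-san-vol3",
        "esx1",
        ["210000e08b83cef2"],
        4,
        "600144F0F5644400000050CD19240001",
    ):
        print(line)
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere and lets the output be pasted into a session on the OpenIndiana host once it has been checked.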

During the initial testing I found that the v7000 does indeed successfully connect to the open source based OpenIndiana storage host and the LUNs are identified as generic targets. After discovering the LUNs they were added to the dp2 pool. As a comparison I chose to perform the typical Windows 4k 70/30 workload run on a newly created volume from the v7000.

Let’s observe the metrics presented on the v7000 performance console.

v7000 Performance iops 4k 70-30 read-write comstar

The performance is impressive for a 4 disk array mdisk presentation. There is an interesting v7000 caching effect revealed at the mdisks metric panel where we can observe the write load is only 420 IOPS versus the virtual volume write rate of 1650 IOPS. This is definitely a beneficial impact of the non-volatile cache in the v7000 cluster. We can also see that the disk latency at the external side is considerably higher for write operations than that of the virtual volume layer. As well, we can see the external storage host is also optimizing the operations, demonstrated by the gradual increase in FC operations on the Interface metrics panel. I’m very pleased to see that the v7000 can successfully serve an open source based storage target and that there are valuable optimizations gained from this configuration.

One element I was very interested in exploring was the image mode feature of the v7000 which gives us the ability to present a volume in passthrough mode. In other words the v7000 acts as the target presenting the external storage content as a block for block image. The same caching benefits observed in the above test are also presented when using the image mode. In this next test we will first present some storage from the external OpenIndiana host to a VMware ESXi 5.1 hypervisor and create a VMFS volume with it. We will then place the IOMeter client VM on the volume and run a load test using the Windows 4k 70/30 run. Then we will shut the VM down, remove the volume from the ESXi host and present the OpenIndiana LUN to the v7000 for import. Once imported into the v7000 in image mode we will present and add the LUN back to the ESXi host. Finally we will rescan the FC adapter for VMFS volumes and observe the result.

Let’s walk through the operation graphically.

TZVM IOMeter VM on OpenIndiana

TZVM running on the OpenIndiana ZFS backed VMFS volume named zfs1-san-vol3.

TZVM paths on OpenIndiana

Observing the VMware presented Fiber Channel paths for zfs1-san-vol3. Note the policy storage array type.

TZVM zfs1-san-vol3 Pre Image Mode IOMeter test

The pre-image mode migration IOMeter results are now presented for a 4 minute run. This is a 4K random 70/30 read/write mix. At this point we need to shut down the VM and we will also remove the zfs1-san-vol3 datastore from the ESXi host prior to re-introducing the same volume over the v7000 SVC engine. We simply remove the ESXi FC initiator member definition from the COMSTAR esx1 group and this will prevent any connectivity to the original datastore instance. We do this to prevent any VMware snapshot detection issues. At the same time we will add the zfs1-san-vol3 LUN view to the COMSTAR svc1 group.

~# stmfadm remove-hg-member -g esx1 wwn.210000e08b83cef2

~# stmfadm add-view -n 4 -h svc1 600144F0F5644400000050CD19240001

v7000 SVC Image Mode View of zfs1-san-vol3 as mdisk7 No pool placement for the Image is correct

At this point we have run the v7000 mdisk detection and imported the newly discovered mdisk7 LUN which is the zfs1-san-vol3 datastore. We do not add the image to a pool.

TZVM VMware Datastore zfs1-san-vol3 SVC Image Mode Add

We can now proceed to add the SVC image of zfs1-san-vol3 back to the ESXi host and we can observe it’s now exposed as an IBM Fiber Channel presentation on LUN 4.

ESXi Resignaturing zfs1-san-vol3

When the datastore is added VMware does notice the NAA value is different and it needs to confirm that we do want the current datastore volume to be mounted with the same signature as before. This is a typical response to a changed NAA. If this were not the correct LUN for this signature, accepting this NAA would introduce instability to this VMware VMFS clustered datastore on all other ESXi hosts.

VMware does indeed identify the external v7000 image mode presentation and we can observe zfs1-san-vol3 is completely intact.

v7000 Path Observation After Migration

We can observe the newly defined paths and take note of the policy mode storage array type as it’s now in SVC mode. We also now have 4 paths available.

TZVM IOMeter Result after v7000 Image

With the ESXi host now up and running with the re-established zfs1-san-vol3 datastore in image mode over the v7000, we can now rerun the 4k random 70/30 read/write mix. We can observe an immediate gain of 500 IOPS, which is the write load hitting the v7000 nv-cache, 4 minutes into the test and we can see the synergy working. I let the test run for an additional 6 minutes for a total run of 10 minutes to observe the full cache benefit of the v7000 as the storage virtualization head.

TZVM IOMeter Result after v7000 Image Full 10 Min

Obviously we can see the benefit of the ZFS ARC cache and v7000 nv-cache working together to improve our system latency and IOPS flow. The information presented in this exploration does demonstrate that the v7000 brings value through many unique attributes and specifically drives a high degree of agility within the storage solutions scope.

Well this brings a close to this blog entry and I must say the results were very interesting and enlightening.

I hope you enjoyed the post.

Regards,

Mike

February 7th, 2013 | Tags: IBM, SVC, Tier, v7000.
Categories: Storage

Updated ZFS Replication and Snapshot Rollup Script

Thanks to the efforts of Ryan Kernan we have an updated ZFS replication and snapshot rollup script. Ryan’s OpenIndiana/Solaris/Illumos community contribution improves the script to allow for a more dynamic source to target pool replication and changes the snapshot retention method to a specific number of snapshots rather than a Grandfather Father Son method.

zfs-replication.sh

Regards,

Mike

October 4th, 2011 | Categories: Security, Storage

Encapsulating VT-d Accelerated ZFS Storage within ESXi

Some time ago I found myself conceptually provisioning ESXi hosts that could transition local storage in a distributed manner within an array of hypervisors. The architectural model likens itself to an amorphous cluster of servers which share a common VM client service that self-provisions shared storage to its parent hypervisor or even other external hypervisors. This concept originally became a reality in one of my earlier blog entries named Provisioning Disaster Recovery with ZFS, iSCSI and VMware. With this previous success in a DR scope we can now explore more adventurous applications of storage encapsulation and further coin the phrase “rampant layering violations of storage provisioning” thanks to Jeff Bonwick, Jim Moore and many other brilliant creative minds behind the ZFS storage technology advancements. One of the main barriers to success for this concept was the serious issue of circular latency from within the self-provisioning storage VM. What this commonly means is we have a long wait cycle for the storage VM to ready the requested storage, since it must wait for the hypervisor to schedule access to the raw storage blocks for the virtualized shared target, which then re-provisions it to other VMs. This issue is acceptable for a DR application but it’s a major show stopper for applications that require normal performance levels.

This major issue now has a solution with the introduction of Intel’s VT-d technology. VT-d allows us to accelerate storage I/O functionality directly inside a VM served by VMware ESX and ESXi hypervisors. VMware has leveraged Intel’s VT-d technology on ESXi 4.x (AMD I/O Virtualization Technology (IOMMU) is also supported) as part of the feature named VMDirectPath. This feature allows us to insert high speed devices inside a VM, which can then host a device that operates at the hardware speed of the PCI bus, and that, my friend, allows virtualized ZFS storage provisioning VMs to dramatically reduce or eliminate the hypervisor’s circular latency issue.

Very exciting indeed, so let’s leverage a visual diagram of this amorphous server cluster concept to better capture what this vision actually entails.

Encapsulated Accelerated ZFS Architecture

The concept depicted here sets a multipoint NFS share strategy. Each ESXi host provisions its own NFS share from its local storage, which can be accessed by any of the other hosts including itself. Additionally each encapsulated storage VM incorporates ZFS replication to a neighboring storage VM in a ring pattern, thus allowing for crash based recovery in the event of a host failure. Each ESXi instance hosts a DDRdrive X1 PCIe card which is presented to its storage VM over VT-d and VMDirectPath, a.k.a. PCI pass-through. When managed via vCenter this solution allows us to svMotion VMs across the cluster, allowing rolling upgrades or hardware servicing.
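The ring replication pattern can be sketched in a few lines of Python. The helper below simply pairs each storage VM with its next neighbor and wraps the last one back to the first. The function name and the host names uss2 and uss3 are hypothetical, chosen only to illustrate the topology.

```python
from typing import List, Sequence, Tuple


def ring_replication_pairs(storage_vms: Sequence[str]) -> List[Tuple[str, str]]:
    """Return (source, target) pairs so each VM replicates to its neighbor."""
    n = len(storage_vms)
    if n < 2:
        return []  # a single storage VM has no neighbor to replicate to
    return [(storage_vms[i], storage_vms[(i + 1) % n]) for i in range(n)]


if __name__ == "__main__":
    for source, target in ring_replication_pairs(["uss1", "uss2", "uss3"]):
        print(f"{source} -> {target}")
```

Because every VM is both a source and a target exactly once, losing any single host still leaves a replicated copy of its share on the next host in the ring.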

The ZFS replication cycle works as a background ZFS send/receive script process that incrementally updates the target storage VM. One very useful feature of the ZFS send/receive capability is the include-properties flag, -p. When this flag is used any NFS share properties that are defined using “sharenfs= ” will be sent to the target host. Thus the only required action to enable access to the replicated NFS share is to add it as an NFS storage target on our ESXi host. Of course we would also need to stop replication if we wish to use the backup, or clone it to a new share for testing. Testing the backup without cloning will result in a modified ZFS target file system and this could force a complete ZFS resend of the file system in some cases.

Within this architecture our storage VM is built with OpenSolaris snv_134, thus we have the ability to engage ZFS deduplication. This not only improves the storage capacity, it also grants improved performance when we allocate sufficient memory to the storage VM. The ZFS ARC needs to cache a deduplicated block only once, which accelerates all dedup access requests. For example if this cluster served a Virtual Desktop Infrastructure (VDI) environment we would see all the OS file allocation blocks enter the ZFS ARC cache and thus all VMs that reference the same OS file blocks would be cache accelerated. Dedup also grants a benefit with ZFS replication through the ZFS send -D flag. This flag instructs ZFS send to generate the stream in dedup format and this can dramatically reduce replication bandwidth and time consumption in a VMware environment.
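As a sketch of how the -p and -D flags fit into a replication step, the following Python helper builds the send/receive pipeline as a string. It is not the replication script itself. The use of ssh as the transport, the receive -F option, the dataset name sp1/nfs1 and the host name uss2 are all my assumptions for illustration.

```python
import shlex
from typing import Optional


def zfs_send_receive_command(
    dataset: str,
    new_snap: str,
    target_host: str,
    target_dataset: str,
    prev_snap: Optional[str] = None,
    dedup: bool = False,
) -> str:
    """Build a 'zfs send | ssh host zfs receive' pipeline (not executed)."""
    send = ["zfs", "send", "-p"]  # -p carries properties such as sharenfs
    if dedup:
        send.append("-D")  # dedup stream format
    if prev_snap:
        send += ["-i", f"{dataset}@{prev_snap}"]  # incremental from prev_snap
    send.append(f"{dataset}@{new_snap}")
    # Assumption: ssh transport and a forced receive on the target side.
    recv = ["ssh", target_host, "zfs", "receive", "-F", target_dataset]
    quote = lambda parts: " ".join(shlex.quote(p) for p in parts)
    return f"{quote(send)} | {quote(recv)}"


if __name__ == "__main__":
    print(zfs_send_receive_command(
        "sp1/nfs1", "snap2", "uss2", "sp1/nfs1", prev_snap="snap1", dedup=True
    ))
```

Leaving prev_snap unset produces a full send, which is what the first replication cycle to a new target would need.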

With VT-d we now have the ability to add a non-volatile disk device as a dedicated ZIL accelerator commonly called a SLOG or Separate Intent Log. In this proof of concept architecture I have defined the DDRdrive X1 as a SLOG disk over VMware VMDirectPath to our storage VM. This was a challenge to accomplish as VT-d is just emerging and has many unknown behaviors with system PCI BUS timing and IRQ handling. Coaxing VT-d to work correctly proved to be the most technically difficult component of this proof of concept, however success is at hand using a reasonably cost effective ASUS motherboard in my home lab environment.

Let’s begin with the configuration of VT-d and VMware VMDirectPath.

VT-d requires system BIOS support and this function is available on the ASUS P6X58D series of motherboards. The feature is not enabled by default; you must change it in the BIOS. I have found that enabling VT-d does impact how ESXi behaves. For example, some local storage devices that were available prior to enabling VT-d may not be accessible after enabling it, which could result in messages like “cannot retrieve extended partition information”.

The following screen shots demonstrate where you would find the VT-d BIOS setting on the P6X58D mobo.

VT-d-BIOS-Enable1

If you’re using an AMD 890FX based ASUS Crosshair IV mobo then look for the IOMMU setting as depicted here:

Thanks go to Stu Radnidge over at http://vinternals.com/ for the screen shot!

IOMMU on AMD 890FX Based Mobos

Once VT-d or IOMMU is enabled ESXi VMDirectPath can be enabled from the VMware vSphere client host configuration-> advanced menu and will require a reboot to complete any further PCI sharing configurations.

One challenge I encountered was PCIe bus timing issues. Fortunately the ASUS P6X58D overclocking capability grants us the ability to align our clock timing on the PCIe bus by tuning the frequency and voltage, and thus I was able to stabilize the PCIe interface running the DDRdrive X1. Here are the original values I used that worked. Since that time I have pushed the i7 CPU to 4.0GHz, but that can be risky since you need to raise the CPU and DRAM voltages, so I will leave the safe values for public consumption.

P6X58D-Overclock-Tuning1

P6X58D-Overclock-Tuning2

ESXi-Console_Shot

Once VT-d is active you will be able to edit the enumerated PCI device list check boxes and allow pass through for the device of your choice. There are three important PCI values to note: the Device ID, Vendor ID and Class ID. You can Google them or take this short cut http://www.pcidatabase.com/ to discover who owns the device and what class it belongs to. In this case I needed to ID the DDRdrive X1 and I know by the class ID 0100 that it is a SCSI device.

VMDirectPath Enabled

Once our DDRdrive X1 device is added to the encapsulated OpenSolaris VM its shared IRQ mode will need to be adjusted such that no other IRQs are chained to it. This is adjusted by adding a custom VM config parameter named pciPassthru0.msiEnabled and setting its value to false.

VMPassThru-msiEnabled=false

In this proof of concept the storage VM is assigned 4GB of memory, which is reasonable for non-deduped storage. If you plan to dedup the storage I would suggest significantly more memory to allow the block hash table to be held in memory; this is important for performance and is also needed if you have to delete a ZFS file system. The amount will vary depending on the total storage provisioned. As a rough estimate I would allow about 8GB of memory for each 1TB of used storage. As well, we have two network interfaces, one of which will carry the storage traffic only. Keep in mind that dedup is still developing and should be heavily tested; you should expect some issues.
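The rule of thumb above is easy to turn into a sizing helper. This is only the rough 8GB per 1TB estimate expressed as code, not a measurement of actual dedup table size; the function name is my own, and real requirements depend on block size and how well the data deduplicates.

```python
def dedup_memory_gb(used_tb: float, gb_per_tb: float = 8.0) -> float:
    """Rough memory estimate for dedup, using the 8GB per 1TB rule of thumb."""
    if used_tb < 0:
        raise ValueError("used_tb must not be negative")
    return used_tb * gb_per_tb


if __name__ == "__main__":
    for tb in (0.5, 1, 2.5):
        print(f"{tb} TB used -> about {dedup_memory_gb(tb):.0f} GB of memory")
```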

VM Settings

If you have read my previous blog entry Running ZFS Over NFS as a VMware Store you will find the next section to be very similar. This is essentially many of the same steps but excludes aggregation and IPMP capability.

Using a basic OpenSolaris Indiana completed install we can proceed to configure a shared NFS store so let’s begin with the IP interface. We don’t need a complex network configuration for this storage VM and therefore we will just setup simple static IP interfaces, one to manage the OpenSolaris storage VM and one to provision the NFS store. Remember that you should normally separate storage networks from other network types from both a management and security perspective.

OpenSolaris will default to a dynamic network service configuration named nwam. This needs to be disabled and the physical:default service enabled.

root@uss1:~# svcadm disable svc:/network/physical:nwam
root@uss1:~# svcadm enable svc:/network/physical:default

To persistently configure the interfaces we can store the IP address in the local hosts file. The file will be referenced by the physical:default service to define the network IP address of the interfaces when the service starts up.

Edit /etc/hosts to have the following host entries.

::1 localhost
127.0.0.1 uss1.local localhost loghost
10.0.0.1 uss1 uss1.domain.name
10.1.0.1 uss1.esan.data1

As an option if you don’t normally use vi you can install nano.

root@uss1:~# pkg install SUNWgnu-nano

When an OpenSolaris host starts up, the physical:default service will reference the /etc directory and match any plumbed network device to a file named with a prefix of “hostname” and an extension of the interface name. For example, in this VM we have defined two Intel e1000 interfaces which will be plumbed using the following commands.

root@uss1:~# ifconfig e1000g0 plumb
root@uss1:~# ifconfig e1000g1 plumb

Once plumbed, these network devices will be enumerated by the physical:default service, and if a file exists in the /etc directory named hostname.e1000g0 the service will use the content of this file to configure this interface in the format that ifconfig uses. Here we have created the file using echo. The “uss1.esan.data1” name will be looked up in the hosts file and maps to IP 10.1.0.1, and the network mask and broadcast will be assigned as specified.

root@uss1:~# echo uss1.esan.data1 netmask 255.255.0.0 broadcast 10.1.255.255 > /etc/hostname.e1000g0

One important note: if your /etc/hostname.e1000g0 file has blank lines you may find that persistence fails on any interface after the blank line, so check that the file contains no blank lines.
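A blank line check like this is easy to script. The following is a small sketch, assuming a POSIX-style grep is first in the PATH; has_blank_lines is a hypothetical helper name.

```shell
# Succeeds (exit status 0) if the file has an empty or whitespace-only line.
# has_blank_lines is a hypothetical helper, assuming a POSIX-style grep.
has_blank_lines() {
    grep '^[[:space:]]*$' "$1" > /dev/null
}

# Example use against the interface file from the text:
# has_blank_lines /etc/hostname.e1000g0 && echo "remove the blank lines first"
```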

One important requirement is the default gateway or route. Here we will assign a default route through 10.0.0.254, which is the gateway on the management network. We also need to add a route for network 10.1.0.0 using the following commands. Normally the routing function will dynamically assign the route for 10.1.0.0, so assigning a static one will ensure that no undesired discovered gateways are found and used, which may cause poor performance.

root@uss1:~# route -p add default 10.0.0.254
root@uss1:~# route -p add 10.1.0.0 10.1.0.1

When using NFS I prefer provisioning name resolution as an additional layer of access control. If we use names to define NFS shares and clients we can externally validate the incoming IP with a static file or DNS based name lookup. An OpenSolaris NFS implementation inherently grants this methodology. When a client IP requests access to an NFS share we can define a name lookup to ensure the IP maps to a name which is granted access to the targeted share. We can simply define the desired FQDNs against the NFS shares.

In small configurations static files are acceptable, as is the case here. For large host farms the use of a DNS service instance would ease the admin cycle. You would just have to be careful that your cached TimeToLive (TTL) value is greater than 2 hours, thus preventing excessive name resolution traffic. The TTL value will control how long the name is cached and this prevents constant external DNS lookups.

To configure name resolution for both file and DNS we simply copy the predefined config file named nsswitch.dns to the active config file nsswitch.conf as follows:

root@uss1:~# cp /etc/nsswitch.dns /etc/nsswitch.conf

Enabling DNS will require the configuration of our /etc/resolv.conf file which defines our name servers and namespace.

e.g.

root@uss1:~# cat /etc/resolv.conf
domain laspina.ca
nameserver 10.1.0.200
nameserver 10.1.0.201

You can also use the static /etc/hosts file to define any resolvable name to IP mapping, which is my preferred method, but since we are using ESXi I will use DNS to ease the administration cycle and avoid the unsupported console hack of ESXi.

It is now necessary to define a zpool using our VT-d enabled PCI DDRdrive X1 and VMDK. The VMDK can be located on any suitable VT-d compatible adapter. There is a good chance that some HBA devices will not work correctly with VT-d and your system BIOS. As a tip I suggest you use a USB disk to provision the ESXi installation as it almost always works and is easy to back up and transfer to other hardware. In this POC I used a 500GB SATA disk attached over an ICH10 AHCI interface. Obviously there are other better performing disk subsystems available, however this is a POC and not for production consumption.

To establish the zpool we need to ID the PCI to CxTxDx device mappings. There are two ways that I am aware of to find these names. You can read through the output of the prtconf -v command and look for disk instances and dev_links, or do it the easy way and use the format command like the following.

root@uss1:~# format
Searching for disks…done

AVAILABLE DISK SELECTIONS:
0. c8t0d0 <DEFAULT cyl 4093 alt 2 hd 128 sec 32>
/pci@0,0/pci15ad,1976@10/sd@0,0
1. c8t1d0 <VMware-Virtual disk-1.0-256.00GB>
/pci@0,0/pci15ad,1976@10/sd@1,0
2. c11t0d0 <DDRDRIVE-X1-0030-3.87GB>
/pci@0,0/pci15ad,7a0@15/pci19e3,8@0/sd@0,0
Specify disk (enter its number): ^C
root@uss1:~#

With the device link info handy we can define the zpool with the DDRdrive X1 as a ZIL using the following command:

root@uss1:~# zpool create sp1 c8t1d0 log c11t0d0

root@uss1:~# zpool status
pool: rpool
state: ONLINE
scrub: none requested

config:
NAME        STATE     READ WRITE CKSUM
rpool       ONLINE       0     0     0
c8t0d0s0    ONLINE       0     0     0

errors: No known data errors

pool: sp1
state: ONLINE
scrub: none requested

config:
NAME        STATE     READ WRITE CKSUM
sp1         ONLINE       0     0     0
c8t1d0      ONLINE       0     0     0
logs
c11t0d0     ONLINE       0     0     0
errors: No known data errors

With a functional IP interface and ZFS pool complete you can define the NFS share and ZFS file system. Always define NFS properties using zfs set sharenfs=; the share parameters will be stored as part of the ZFS file system, which is ideal for a system failure recovery or ZFS relocation.

zfs create -p sp1/nas/vol0
zfs set mountpoint=/export/uss1-nas-vol0 sp1/nas/vol0
zfs set sharenfs=rw,nosuid,root=vh3-nas:vh2-nas:vh1-nas:vh0-nas sp1/nas/vol0

To connect a VMware ESXi host to this NFS store(s) we need to define a vmkernel network interface which I like to name eSAN-Interface1. This interface should only connect to the storage network vSwitch. The management network and VM network should be on another separate vSwitch.

vmkernel eSAN-Interface1

Since we are encapsulating the storage VM on the same server we also need to connect the VM to the storage interface over a VM network port group as shown above. At this point we have all the base NFS services ready, and we can now connect our ESXi host to the newly defined NAS storage target.

Add NFS Store

Thus we now have an encapsulated NFS storage VM provisioning an NFS share to its parent hypervisor.

Encapsulated NFS Share

You may have noticed that the capacity of this share is ~390GB, however we only granted a 256GB vmdk to this storage VM. The capacity anomaly is the result of ZFS deduplication on the shared file system. There are 10 16GB Windows XP hosts and 2 32GB Linux hosts located on this file system, which would normally require 224GB of storage. Obviously dedup is a serious benefit in this case, however you need to be aware of the costs. In order to sustain performance levels similar to non-deduped storage you MUST grant the ZFS code sufficient memory to hold the block hash table in memory. If this memory is not provisioned in sufficient amounts, your storage VM will be relegated to what appears to be a permanent storage bottleneck; in other words you will enter a “processing time vortex”. (Thus, as I have cautioned in the past, ZFS dedup is maturing and needs some code changes before trusting it to mission critical loads. Always test, test, test and repeat until your head spins.)

Here’s the result of using dedup within the encapsulated storage VM.

root@uss1:~# zpool list
NAME    SIZE ALLOC   FREE    CAP DEDUP HEALTH ALTROOT
rpool 7.94G 3.64G 4.30G    45% 1.00x ONLINE –
sp1     254G 24.9G   229G     9% 6.97x ONLINE –
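As a rough reading of this output, multiplying ALLOC by the DEDUP ratio gives an estimate of the logical data held in the pool before deduplication. This is only a sketch under that assumption; logical_gb is a hypothetical helper, and it assumes an awk that accepts -v (nawk or GNU awk).

```shell
# Estimate logical (pre-dedup) data as ALLOC x DEDUP ratio from zpool list.
# This is an approximation only; logical_gb is a hypothetical helper name.
logical_gb() {
    awk -v alloc="$1" -v ratio="$2" 'BEGIN { printf "%.1f\n", alloc * ratio }'
}

logical_gb 24.9 6.97    # sp1 values from the zpool list output; prints 173.6
```

In other words, about 25GB of allocated pool space is standing in for roughly 174GB of guest data in this test.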

And here’s a look at what it’s serving.

Encapsulated VM

Incredibly the IO performance is simply jaw dropping fast, here we are observing a grueling 100% random read load at 512 bytes per request. Yes that’s correct we are reaching 40,420 IOs per second.

Sample IOMeter IOPS

Even more incredible is the IO performance with a 100% random write load at 512 bytes per request. It’s simply unbelievable seeing 38,491 IOs per second inside a VM which is served from a peer VM all on the same hypervisor.

Sample IOMeter IOPS 100% Random 512 Byte Writes

With a successfully configured and operational NFS share provisioned, the next logical task is to define and automate the replication of this share, and any other shares we may wish to add, to a neighboring encapsulated storage VM or for that matter any OpenSolaris host.

The basic elements of this functionality are as follows:

Define a dedicated secured user to execute the replication functions.
Grant the appropriate permissions to this user to access cron and ZFS.
Assign an RSA Key pair for automated ssh authentication.
Define a snapshot replication script using ZFS send/receive calls.
Define a cron job to regularly invoke the script.

Let’s define the dedicated replication user. In this example I will use the name zfsadm.

First we need to create the zfsadm user on all of our storage VMs.

root@uss1:~# useradd -s /bin/bash -d /export/home/zfsadm -P 'ZFS File System Management' zfsadm
root@uss1:~# mkdir /export/home/zfsadm
root@uss1:~# cp /etc/skel/* /export/home/zfsadm
root@uss1:~# echo PATH=/bin:/sbin:/usr/ucb:/etc:. > /export/home/zfsadm/.profile
root@uss1:~# echo export PATH >> /export/home/zfsadm/.profile
root@uss1:~# echo PS1=$'${LOGNAME}@$(/usr/bin/hostname)'~#' ' >> /export/home/zfsadm/.profile

root@uss1:~# chown -R zfsadm /export/home/zfsadm
root@uss1:~# passwd zfsadm

In order to use an RSA key for authentication we must first generate an RSA private/public key pair on the storage head. This is performed using ssh-keygen while logged in as the zfsadm user. You must set the passphrase as blank otherwise the session will prompt for it.

root@uss1:~# su - zfsadm

zfsadm@uss1~#ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/export/home/zfsadm/.ssh/id_rsa):
Created directory ‘/export/home/zfsadm/.ssh’.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /export/home/zfsadm/.ssh/id_rsa.
Your public key has been saved in /export/home/zfsadm/.ssh/id_rsa.pub.
The key fingerprint is:
0c:82:88:fa:46:c7:a2:6c:e2:28:5e:13:0f:a2:38:7f zfsadm@uss1
zfsadm@uss1~#

The id_rsa file should not be exposed outside of this directory as it contains the private key of the pair. Only the public key file id_rsa.pub needs to be exported. Now that our key pair is generated we need to append the public portion of the key pair to a file named authorized_keys2.

# cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys2

Repeat all the crypto key steps on the target VM as well.

We will use the Secure Copy command to place the public key file in the zfsadm user’s home directory on the target host. It’s very important that the private key is secured properly; it is not necessary to back it up as you can regenerate the key pair if required.

From the local server here named uss1 (The remote server is uss2)

zfsadm@uss1~# scp $HOME/.ssh/id_rsa.pub uss2:$HOME/.ssh/uss1.pub
Password:
id_rsa.pub 100% |**********************************************| 603 00:00
zfsadm@uss1~# scp uss2:$HOME/.ssh/id_rsa.pub $HOME/.ssh/uss2.pub
Password:
id_rsa.pub 100% |**********************************************| 603 00:00
zfsadm@uss1~# cat $HOME/.ssh/uss2.pub >> $HOME/.ssh/authorized_keys2

And on the remote server uss2

# ssh uss2
password:
zfsadm@uss2~# cat $HOME/.ssh/uss1.pub >> $HOME/.ssh/authorized_keys2
# exit

Now that we are able to authenticate without a password prompt we need to define the automated replication launch using cron. Rather than using the /etc/cron.allow file to grant permissions to the zfsadm user we are going to use a finer instrument and grant the user access at the user properties level shown here. Keep in mind you cannot use both ways simultaneously.

root@uss1~# usermod -A solaris.jobs.user zfsadm
root@uss1~# crontab -e zfsadm
59 23 * * * ./zfs-daily-rpl.sh zfs-daily.rpl

Hint: crontab uses vi – http://www.kcomputing.com/kcvi.pdf “vi cheat sheet”

The key sequence is: press “i”, key in the line, then press “esc :wq” to save, or “esc :q!” to abort.

Be aware of the timezone the cron service runs under; you should check it and adjust it if required. Here is an example of what’s required to set it.

root@uss1~# pargs -e `pgrep -f /usr/sbin/cron`

8550: /usr/sbin/cron
envp[0]: LOGNAME=root
envp[1]: _=/usr/sbin/cron
envp[2]: LANG=en_US.UTF-8
envp[3]: PATH=/usr/sbin:/usr/bin
envp[4]: PWD=/root
envp[5]: SMF_FMRI=svc:/system/cron:default
envp[6]: SMF_METHOD=start
envp[7]: SMF_RESTARTER=svc:/system/svc/restarter:default
envp[8]: SMF_ZONENAME=global
envp[9]: TZ=PST8PDT

Let’s change it to CST6CDT

root@uss1~# svccfg -s system/cron:default setenv TZ CST6CDT

Also the default environment path for cron may cause some script “command not found” issues, check for a path and adjust it if required.

root@uss1~# cat /etc/default/cron
#
# Copyright 1991 Sun Microsystems, Inc. All rights reserved.
# Use is subject to license terms.
#
#pragma ident “%Z%%M% %I% %E% SMI”
CRONLOG=YES

This one has no default path, add the path using echo.

root@uss1~# echo PATH=/usr/bin:/usr/sbin:/usr/ucb:/etc:. >> /etc/default/cron
# svcadm refresh cron
# svcadm restart cron

The final part of the replication process is a script that will handle the ZFS send/recv invocations. I have written a script in the past that can serve this task with some very minor changes.

Here is the link for the modified zfs-daily-rpl.sh replication script. You will need to grant exec rights to this file, e.g.

# chmod 755 zfs-daily-rpl.sh

This script will require that a zpool named sp2 exists on the target system; this is shamefully hard coded in the script.

A file containing the file system to replicate and the target are required as well.

e.g.

zfs-daily-rpl.sh filesystems.lst

Where filesystems.lst contains:

sp1/nas/vol0 uss2
sp1/nas/vol1 uss2
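To make the list file format concrete, here is a dry-run sketch that reads such a file and prints the snapshot and send/receive commands a replication pass might issue. It is not the zfs-daily-rpl.sh script itself: plan_replication and the snapshot names passed to it are hypothetical, and the sp2 target pool is taken from the note above.

```shell
# Dry run only: print (do not execute) the commands for each
# "filesystem host" line in the list file. plan_replication is a
# hypothetical helper and is not the author's zfs-daily-rpl.sh.
plan_replication() {
    list_file=$1      # e.g. filesystems.lst
    prev_snap=$2      # placeholder name of the last replicated snapshot
    new_snap=$3       # placeholder name of the snapshot to create and send
    target_pool=sp2   # target pool name taken from the text
    while read -r fs host; do
        [ -z "$fs" ] && continue
        rel=${fs#*/}  # sp1/nas/vol0 -> nas/vol0
        echo "zfs snapshot $fs@$new_snap"
        echo "zfs send -i $fs@$prev_snap $fs@$new_snap | ssh $host zfs receive -F $target_pool/$rel"
    done < "$list_file"
}

# Example: plan_replication filesystems.lst <previous-snapshot> <new-snapshot>
```

Printing the commands first makes it easy to review what would be sent where before anything touches the target pool.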

With any ZFS replicated file system that you wish to invoke on a remote host it is important to remember not to make changes to the active replication stream. You must take a clone of this replication stream; this will avoid forcing a complete resend or other replication issues when you wish to test or validate that it’s operating as you expect.

For example:

We take a clone of one of the snapshots and then share it via NFS:

root@uss2~# zfs clone sp2/nas/vol0@31-04-10-23:59 sp2/clones/uss1/nas/vol0
root@uss2~# zfs set mountpoint=/export/uss1-nas-vol0 sp2/clones/uss1/nas/vol0
root@uss2~# zfs set sharenfs=rw,nosuid,root=vh3-nas:vh2-nas:vh1-nas:vh0-nas sp2/clones/uss1/nas/vol0

Well I hope you found this entry interesting.

Regards,

Mike

May 6th, 2010 | Tags: Acceleration, Dedup, Encapsulation, IOMMU, NFS, Storage, VMware, VT-d, zfs.
Categories: Storage, VMware

Running ZFS over NFS as a VMware Store

NFS is definitely a very well rounded high performance file storage system and it certainly serves VMware Stores successfully over many storage products. Recently one of my subscribers asked me if there was a reason why my blogs were more centric to iSCSI. The question was probing for an answer to something many of us ask ourselves: is NFS superior to block based iSCSI, and which one should I choose for VMware? The answer is not which protocol is superior but which protocol serves to provision the features and functions you require most effectively. I use both protocols and find they both have desirable capability and functionality and conversely have some negative points as well.

NFS is generally more accessible because it’s a file level protocol and sits higher up on the network stack. This makes it very appealing when working with VMware virtual disks, aka vmdk’s, simply because they also exist at the same layer. NFS is ubiquitous across NAS vendors and can be provisioned by multiple agnostic implementation endpoints. The NFS protocol hosts the capability to be virtualized and encapsulated within any hypervisor instance, either clustered or standalone. The network file locking and share semantics of NFS grant it a multitude of configurable elements which can serve a wide range of applications.

In this blog entry we will explore how to implement an NFS share for VMware ESX using OpenSolaris and ZFS. We will also explore a new way of accelerating the server’s I/O performance with a new product called the DDRdrive X1.

OpenSolaris is an excellent choice for provisioning NFS storage volumes on VMware. It hosts many advanced desirable storage features that set it far ahead of other Unix flavors. We can use the advanced networking features and ZFS including the newly integrated dedup functionality to craft the best NFS functionality available today.

Let’s start by examining the overall NAS storage architecture.

NFS OpenSolaris/VMware Architecture by Mike La Spina

In this architecture we are defining a fault tolerant configuration using two physical 1Gbe switches with a quad or dual Ethernet adapter(s). On the OpenSolaris storage head we are using IPMP, aka IP Multipathing, to establish a single IP address to serve our NFS store endpoint. A single IP is more appropriate for VMware environments as they do not support multiple NFS IP targets per NFS mount point. IPMP provisions layer 3 load balancing and interface fault tolerance. IPMP commonly uses ICMP and default routes to determine interface failure states, thus it is well suited for a NAS protocol service layer. In an effort to reduce excessive ICMP rates we will aggregate the two dual interfaces into a single channel connection to each switch. This will allow us to define two test IP addresses for the IPMP service and keep our logical interface count down to a minimum. We are also defining a 2 port trunk/aggregate between the two physical switches, which provides more path availability and reduces switch failure detection times.

On the ESX host side we are defining 1 interface per switch. This type of configuration requires that only one of the VMware interfaces is an active team member vmnic within a single vSwitch definition. If it is not configured this way the ESX host will fail to detect and activate the second nic under some failure modes. This is not a bandwidth constraint issue since the vmkernel IP interface will only actively use one nic.

With an architecture set in place let’s now explore some of the pros and cons of running VMware on OpenSolaris NFS.

Some of the obvious pros are:

VMware uses NFS in a thin provisioned format.
VMDKs are stored as files and are mountable over a variety of hosts.
Simple backup and recovery.
Simple cloning and migration.
Scalable storage volumes.

And some of the less obvious pros:

IP based transports can be virtualized and encapsulated for disaster recovery.
No vendor lock-in
ZFS retains NFS share properties within the ZFS filesystem.
ZFS will dedup VMDK files at the block level.

And there are the cons:

Every write I/O from VMware is an O_SYNC write.
Firewall setups are complex.
Limited in its application. Only NFS clients can consume NFS file systems.
General protocol security challenges. (RPC)
VMware kernel constraints
High CPU overhead.
Bursty data flow.

Before we break out into the configuration detail level let’s examine some of the VMware and NFS behaviors so as to gain some insight into the reason I primarily use iSCSI for most VMware implementations.

I would like to demonstrate some characteristics that are primarily a VMware client side behavior, and it’s important that you are aware of them when you’re considering NFS as a Datastore.

This VMware performance chart of an IOMeter generated load reveals the bursty nature of the NFS protocol. The VMware NFS client exclusively uses an O_SYNC flag on write operations, which requires a committed response from the NFS server. At some point the storage system will not be able to complete every request and thus a pause in transmission will occur. The same occurs on reads when the network component buffers reach saturation. In this example chart we are observing a single 1Gbe interface at saturation from a read stream.

NFS VMware Network I/O Behavior by Mike La Spina

In this output we are observing a read stream across vh0, which is one of two active ESX4 host VMs loading our OpenSolaris NFS store, and we can see the maximum network throughput achieved is ~81MB/s. If you examine the average value of 78MB/s you can see the burst events do not have a significant impact and are not a bandwidth concern with ~3MB/s of loss.

NFS VMware Network Read I/O Limit Behavior by Mike La Spina

At the same time we are recording this write stream chart on vh3, a second ESX 4 host loading the same NFS OpenSolaris store. As I would expect, it’s very similar to the read stream except that we can see the write performance is lower, and that’s to be expected with any write operations. We can also identify that we are using a full duplex path transmission across to our OpenSolaris NFS host since vh0 is reading (receiving) and vh3 is writing (transmitting).

NFS VMware Network Write I/O Limit Behavior by Mike La Spina

In this chart we are observing a limiting characteristic of the VMware vmkernel NFS client process. We have introduced a read stream in combination with a preexisting active write stream on a single ESX host. As you can see the transmit and receive packet rates are both reduced and now sum to a maximum of ~75MB/s.

NFS VMware Network Mixed Read Write I/O Limit Behavior by Mike La Spina

Transitioning from read to write active streams confirms the transmission is limited to ~75MB/s regardless of the full duplex interface capability. This information demonstrates that a host using 1Gbe ethernet connections will be constrained based on its available resources. This is an important element to consider when using NFS as a VMware datastore.

NFS VMware Network Mixed Read Write I/O Flip Limit Behavior by Mike La Spina

Another important element to consider is the CPU load impact of running the vmkernel NFS client. There is a significant CPU cycle cost on VMware hosts and this is very apparent under heavier loads. The following screen shot depicts a running IOMeter load test against our OpenSolaris NFS store. The important elements are as follows. IOMeter is performing 32KB reads in a 100% sequential access mode, which drives a CPU load on the VM of ~35%. However, this is not the only CPU activity that occurs for this VM.

NFS IOMeter ZFS Throughput 32KB-Seq

When we examine the ESX host resource summary for the running VM we can now observe the resulting overhead load, which is realized by viewing the Consumed Host CPU value. The VM in this case is granted 2 CPUs, each a 3.2Ghz Intel hypervisor resource. We can see that the ESX host is running at 6.6Ghz to drive the vmkernel NFS I/O load.

NFS VMware ESX 4 CPU Load

Let’s see the performance chart results when we svMotion the actively loaded running VM on the same ESX host to an iSCSI VMFS based store on the same OpenSolaris storage host. The only elements changing in this test are the underlying storage protocols. Here we can clearly see CPU object 0 is the ESX host CPU load. During the svMotion activity we begin to see some I/O drop off due to the additional background disk load. Finally we observe the VM transition at the idle point and the resultant CPU load of iSCSI I/O impact. We clearly see the ESX host CPU load drop from 6.6Ghz to 3.5Ghz, which makes it very apparent that NFS requires substantially more CPU than iSCSI.

VM Transitioned with vMotion from NFS to iSCSI on same ZFS Storage host

With the svMotion completed we now observe the same IOMeter screen shot retake, and it’s very obvious that our throughput and IOPS have increased significantly while the VM granted CPU load has not changed significantly. A decrease of ESX host CPU load in the order of ~55% and an increase of ~32% in IOPS and 45% in throughput shows us there are some negative behaviors to be cognizant of. Keep in mind that this is not the case when the I/O type is small and random like that of a database; in those cases NFS is normally the winner. However, VMware normally hosts mixed loads and thus we need to consider this negative effect at design time and when targeting VM I/O characteristics.

iSCSI IOMeter ZFS X1DDR Cache Throughput 32KB-Seq Mike La Spina

iSCSI ESX 4 CPU Load by Mike La Spina

With a clear understanding of some important negative aspects to implementing NFS for VMware ESX hosts we can proceed to the storage system build detail. The first order of business is the hardware configuration detail. This build is simply one of my generic white boxes and it hosts the following hardware:

GA-EP45-DS3L Mobo with an Intel 3.2Ghz E8500 Core 2 Duo

1 x 70GB OS Disk

2 x 500GB SATA II ST3500320AS disks

2GB of RAM

1 x Intel Pro 1000 PT Quad Network Adapter

As a very special treat on this configuration I am also privileged to run a DDRdrive X1 Cache Accelerator, for which I am currently testing some newly developed beta drivers for OpenSolaris. Normally I would use 4GB of RAM as a minimum but I needed to constrain this build in an effort to load down the dedicated X1 LOG drive and the physical SATA disks, thus this instance is running only 2GB of RAM. In this blog entry I will not be detailing the OpenSolaris install process; we will begin from a Live CD installed OS.

OpenSolaris will default to a dynamic network service configuration named nwam. This needs to be disabled and the physical:default service enabled.

root@uss1:~# svcadm disable svc:/network/physical:nwam
root@uss1:~# svcadm enable svc:/network/physical:default

To establish an aggregation we need to un-configure any interfaces that we previously configured before proceeding.

root@uss1:~# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
e1000g0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 10.1.0.1 netmask ffff0000 broadcast 10.255.255.255
ether 0:50:56:bf:11:c3
lo0: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
inet6 ::1/128

root@uss1:~# ifconfig e1000g0 unplumb

Once cleared the assignment of the physical devices is possible using the following commands

dladm create-aggr -d e1000g0 -d e1000g1 -P L2,L3 1
dladm create-aggr -d e1000g2 -d e1000g3 -P L2,L3 2

Here we have set the policy allowing layer 2 and 3 and defined two aggregates, aggr1 and aggr2. We can now define the VLAN based interfaces, shown here as VLAN 500 instances 1 and 2, respective to the aggr instances. You just need to apply the following formula for defining the VLAN interface.

(Adaptor Name) + vlan * 1000 + (Adaptor Instance)

ifconfig aggr500001 plumb up 10.1.0.1 netmask 255.0.0.0
ifconfig aggr500002 plumb up 10.1.0.2 netmask 255.0.0.0
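The naming formula can be checked with a few lines of shell arithmetic. This is a minimal sketch; vlan_ifname is a hypothetical helper name used only to illustrate the formula.

```shell
# Build a VLAN interface name: (adaptor name) + (vlan * 1000 + instance).
# vlan_ifname is a hypothetical helper, shown only to illustrate the formula.
vlan_ifname() {
    adaptor=$1
    vlan=$2
    instance=$3
    echo "${adaptor}$(( vlan * 1000 + instance ))"
}

vlan_ifname aggr 500 1    # prints aggr500001
vlan_ifname aggr 500 2    # prints aggr500002
```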

Each pair of interfaces needs to be attached to a trunk definition on its switch path. Typically this will be a Cisco or HP switch in most environments. Here is a sample of how to configure each brand.

Cisco:

configure terminal
interface port-channel 1
interface ethernet 1/1
channel-group 1
interface ethernet 1/2
channel-group 1
interface port-channel 1
switchport mode trunk
switchport trunk allowed vlan 500
exit

HP Procurve:

trunk 1-2 trk1 trunk
vlan 500
name "eSAN1"
tagged trk1

Once we have our two physical aggregates setup we can define the IP multipathing interface components. As a best practice we should define the IP addresses in our hosts file and then refer to those names in the remaining configuration tasks.

Edit /etc/hosts to have the following host entries.

::1 localhost
127.0.0.1 uss1.local localhost loghost
10.0.0.1 uss1 uss1.domain.name
10.1.0.1 uss1.esan.data1
10.1.0.2 uss1.esan.ipmpt1
10.1.0.3 uss1.esan.ipmpt2

Here we have named the IPMP data interface, aka a public IP, as uss1.esan.data1; this IP will be the active connection for our VMware storage consumers. The other two, named uss1.esan.ipmpt1 and uss1.esan.ipmpt2, are beacon probe IP test addresses and will not be available to external connections.

IPMP functionality is included with OpenSolaris and is configured with the ifconfig utility. The following sets up the first aggregate with a real public IP and a test address. The deprecated keyword defines the IP as a test address and the failover keyword defines if the IP can be moved in the event of interface failure.

ifconfig aggr500001 plumb uss1.esan.ipmpt1 netmask + broadcast + group ipmpg1 deprecated -failover up addif uss1.esan.data1 netmask + broadcast + failover up
ifconfig aggr500002 plumb uss1.esan.ipmpt2 netmask + broadcast + group ipmpg1 deprecated -failover up

To persist the IPMP network configuration on boot you will need to create hostname files matching the interface names with the IPMP configuration statement stored in them. The following will address it.

echo uss1.esan.ipmpt1 netmask + broadcast + group ipmpg1 deprecated -failover up addif uss1.esan.data1 netmask + broadcast + failover up > /etc/hostname.aggr500001

echo uss1.esan.ipmpt2 netmask + broadcast + group ipmpg1 deprecated -failover up > /etc/hostname.aggr500002

The resulting interfaces will look like the following:

root@uss1:~# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
aggr1: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 2
inet 10.1.0.2 netmask ff000000 broadcast 10.255.255.255
groupname ipmpg1
ether 0:50:56:bf:11:c3
aggr2: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 3
inet 10.1.0.3 netmask ff000000 broadcast 10.255.255.255
groupname ipmpg1
ether 0:50:56:bf:6e:2f
ipmp0: flags=8001000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,IPMP> mtu 1500 index 5
inet 10.1.0.1 netmask ff000000 broadcast 10.255.255.255
groupname ipmpg1
lo0: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
inet6 ::1/128

In order for IPMP to detect failures in this configuration you will need to define target probe addresses for IPMP use. For example I use multiple ESX hosts as probe targets on the storage network.

e.g.

root@uss1:~# route add -host 10.1.2.1 10.1.2.1 -static
root@uss1:~# route add -host 10.1.2.2 10.1.2.2 -static
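With more than a couple of probe targets it can help to generate these commands from a list. The following dry-run sketch only prints the route commands in the same form as above; probe_routes is a hypothetical helper name.

```shell
# Dry run: print a static host route command for each IPMP probe target,
# in the same form used in the text. probe_routes is a hypothetical helper.
probe_routes() {
    for ip in "$@"; do
        echo "route add -host $ip $ip -static"
    done
}

probe_routes 10.1.2.1 10.1.2.2
```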

This network configuration yields two 2Gbe aggregate paths bound to a single logical active IP address on 10.1.0.1. On interfaces aggr1 and aggr2 the keyword deprecated directs the IPMP mpathd service daemon to prevent the establishment of application session connections on the test addresses, and the nofailover keyword instructs mpathd not to allow the bound IP to failover to any other interface in the IPMP group.

There are many other possible configurations but I prefer this method because it remains logically easy to diagnose and does not introduce unnecessary complexity.

Now that we have layer 3 network connectivity we should establish the other essential OpenSolaris static TCP/IP configuration elements. We need to ensure we have a persistent default gateway and our DNS client resolution enabled.

The persistent default gateway is very simple to define and is done with the route utility command as follows.

root@uss1:~# route -p add default 10.1.0.254
add persistent net default: gateway

To configure name resolution for both files and DNS we simply copy the predefined config file named nsswitch.dns to the active config file nsswitch.conf as follows:

root@uss1:~# cp /etc/nsswitch.dns /etc/nsswitch.conf

Enabling DNS will require the configuration of our /etc/resolv.conf file which defines our name servers and namespace.

e.g.

root@ss1:~# cat /etc/resolv.conf
domain laspina.ca
nameserver 10.1.0.200
nameserver 10.1.0.201

You can also use the static /etc/hosts file to define any resolvable name to IP mapping.

With OpenSolaris you should always define your NFS share properties using the ZFS administrative tools. When this method is used we can then take advantage of keeping the NFS share properties inside of ZFS. This is really useful when you replicate or clone the ZFS file system to an alternate host as all the share properties will be retained. Here are the basic elements of an NFS share configuration for use on VMware storage consumers.

zfs create -p sp1/nas/vol1
zfs set mountpoint=/export/uss1-nas-vol1 sp1/nas/vol1
zfs set sharenfs=rw,nosuid,root=vh3-nas:vh2-nas:vh1-nas:vh0-nas sp1/nas/vol1

The NFS share property rw sets the entire share as read-write; you could alternately use rw=hostname for each host but it seems redundant to me. The nosuid property prevents any incoming connection from switching user IDs, for example from a non-root value to 0. Finally, the root=hostname property grants the named hosts access to the share with root permissions, so any files created by those hosts will be owned by the root ID. While these steps provide some level of access control they fall well short of secure, thus I also keep the NAS subnets fully isolated or firewalled to prevent external network access to the NFS share hosts.
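
When the list of ESX hosts changes often, the sharenfs value can be generated rather than typed by hand. The following is a minimal sketch; build_sharenfs is a hypothetical helper, and it only assembles the same rw, nosuid and root= options used above.

```python
def build_sharenfs(root_hosts, read_write=True, nosuid=True):
    """Build a sharenfs property value like the one used above."""
    if not root_hosts:
        raise ValueError("at least one root host is required")
    options = ["rw" if read_write else "ro"]
    if nosuid:
        options.append("nosuid")
    # Host names in an access list are separated with colons.
    options.append("root=" + ":".join(root_hosts))
    return ",".join(options)


hosts = ["vh3-nas", "vh2-nas", "vh1-nas", "vh0-nas"]
value = build_sharenfs(hosts)
print(f"zfs set sharenfs={value} sp1/nas/vol1")
```

The printed line matches the zfs set command shown earlier, so the output can be reviewed before it is run on the storage head.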

Once our NFS share is up and running we can proceed to configure the VMware network components and share connection properties. VMware requires a vmkernel network interface definition to provision NFS connectivity. You should dedicate a vmnic team and a vswitch for your storage network.

Here is a visual example of a vmkernel configuration with a teamed pair of vmnics:

vmkernel eNAS-Interface by Mike La Spina

As you can see we have dedicated the vSwitch and vmnics on VLAN 500, no other traffic should be permitted on this network. You should also set the default vmkernel gateway to its own address. This will promote better performance as there is no need to leave this network.

For eNAS-Interface1 you should define one active and one standby vmnic. This will ensure proper interface fail-over in all failure modes. The VMware NFS kernel instance will only use a single vmnic so you’re not losing any bandwidth. The vmnic team only serves as a fault tolerant connection and is not a load balanced configuration.

VMkernel Team Standby by Mike La Spina

At this point you should validate your network connectivity by pinging the vmkernel IP address from the OpenSolaris host. If you choose to ping from ESX, use vmkping instead of ping; otherwise you will not get a response.

Provided your network connectivity is good you can define your vmkernel NFS share properties. Here is a visual example.

And if you prefer an ESX command line method:

esxcfg-nas -a -o uss1-nas -s /export/uss1-nas-vol1 uss1-nas-vol1

In this example we are using a DNS based name of uss1-nas. This would allow you to change the host IP without having to reconfigure VMware hosts. You will want to make sure the DNS name cache TTL is not a small value for two reasons: a DNS outage would impact the IP resolution, and you do not want excessive resolution traffic on the eSAN subnet(s).

The NFS share configuration info is maintained in the /etc/vmware/esx.conf file and looks like the following example.

/nas/uss1-nas-vol1/enabled = "true"
/nas/uss1-nas-vol1/host = "uss1-nas"
/nas/uss1-nas-vol1/readOnly = "false"
/nas/uss1-nas-vol1/share = "/export/uss1-nas-vol1"
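
For auditing several ESX hosts it can help to pull these entries into a structure. The sketch below assumes the key = "value" layout shown above with plain double quotes; parse_nas_entries is a hypothetical helper and it reads a sample string rather than the live file.

```python
import re

# Lines in the style of the /etc/vmware/esx.conf excerpt above.
SAMPLE = '''\
/nas/uss1-nas-vol1/enabled = "true"
/nas/uss1-nas-vol1/host = "uss1-nas"
/nas/uss1-nas-vol1/readOnly = "false"
/nas/uss1-nas-vol1/share = "/export/uss1-nas-vol1"
'''

ENTRY = re.compile(r'^/nas/([^/]+)/(\w+)\s*=\s*"(.*)"\s*$')


def parse_nas_entries(text):
    """Group /nas/<label>/<key> lines into a dict keyed by datastore label."""
    stores = {}
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            label, key, value = match.groups()
            stores.setdefault(label, {})[key] = value
    return stores


stores = parse_nas_entries(SAMPLE)
for label, props in stores.items():
    print(label, props["host"], props["share"])
```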

If you’re trying to change NFS share parameters and the NFS share is not available after a successful configuration, you could run into a messed up vmkernel NFS state and you’ll receive the following message:

Unable to get Console path for Mount

You will need to reboot the ESX server to clean it up so don’t mess with anything else until that is performed. (I’ve wasted a few hours on that buggy VMware kernel NFS client behavior).

Once the preceding steps are successful the result will be a NAS based NFS share which is now available like this example.

Running NFS shares by Mike La Spina

With a working NFS storage system we can now look at optimizing the I/O capability of ZFS and NFS.

VMware performs write operations over NFS using an O_SYNC control flag. This will force the storage system to commit all write operations to disk to ensure VM file integrity. This can be very expensive when it comes to high performance IOPS, especially when using SATA architecture. We could disable our ZIL, aka the ZFS Intent Log, but this could result in severe corruption in the event of a system fault or environmental issue. A much better alternative is to use a non-volatile ZIL device. In this case we have a DDRdrive X1, which is a 4GB high speed externally powered DRAM bank with a high speed SCSI interface that also hosts 4GB of flash for long term shutdowns. The DDRdrive X1 I/O capability reaches the 200,000 IOPS range and up. By using an external UPS power source we can economically prevent ZFS corruption and reap the high speed benefits of DRAM even when unexpected system interruptions occur.

In this blog our storage host is using Seagate ST3500320AS disks, which are challenged to achieve ~180 IOPS, and that I/O rate is under ideal sequential read/write loads. With a cache we can expect that these disks will deliver no greater than 360 IOPS under ideal conditions.

Now let’s see if this is true based on some load tests using Microsoft’s SQLIO tool. First we will disable our ZFS ZIL caching DDRdrive X1, shown here as device c9t0d0.

NAME          STATE     READ WRITE CKSUM
sp1           DEGRADED     0     0     0
  mirror-0    ONLINE       0     0     0
    c6t1d0    ONLINE       0     0     0
    c6t2d0    ONLINE       0     0     0
logs
  c9t0d0      OFFLINE      0     0     0

Now let’s run the SQLIO test for 5 minutes with random 8K I/O write requests, which are simply brutal for any SATA disk to keep up with. We have defined a file size of 32GB to ensure we hit the disk by exceeding our 2GB cache memory footprint. As you can see from the output we achieve 227 IOs/sec, which is below the mirrored drive pair capability.

C:\Program Files\SQLIO>sqlio -kW -s300 -frandom -o4 -b8 -LS -Fparam.txt
sqlio v1.5.SG
using system counter for latency timings, 3579545 counts per second
parameter file used: param.txt
file c:\testfile.dat with 2 threads (0-1) using mask 0x0 (0)
2 threads writing for 300 secs to file c:\testfile.dat
using 8KB random IOs
enabling multiple I/Os per thread with 4 outstanding
using specified size: 32768MB for file: c:\testfile.dat
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec: 227.76
MBs/sec: 1.77
latency metrics:
Min_Latency(ms): 8
Avg_Latency(ms): 34
Max_Latency(ms): 1753
histogram:
ms: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 29 7 3 2 1 1 1 54

new name   name attr attr lookup rddir read read write write
file remov chng   get   set    ops   ops   ops bytes   ops bytes
0     0     0   300     0      0     0     3   16K   146 1.12M /export/uss1-nas-vol1
0     0     0   617     0      0     0     0     0   309 2.39M /export/uss1-nas-vol1
0     0     0   660     0      0     0     0     0   329 2.52M /export/uss1-nas-vol1
0     0     0   677     0      0     0     0     0   338 2.63M /export/uss1-nas-vol1
0     0     0   638     0      0     0     0     0   321 2.46M /export/uss1-nas-vol1
0     0     0   496     0      0     0     0     0   246 1.88M /export/uss1-nas-vol1
0     0     0    44     0      0     0     0     0    21 168K /export/uss1-nas-vol1
0     0     0   344     0      0     0     0     0   172 1.32M /export/uss1-nas-vol1
0     0     0   646     0      0     0     0     0   323 2.51M /export/uss1-nas-vol1
0     0     0   570     0      0     0     0     0   285 2.20M /export/uss1-nas-vol1
0     0     0   695     0      0     0     0     0   350 2.72M /export/uss1-nas-vol1
0     0     0   624     0      0     0     0     0   309 2.38M /export/uss1-nas-vol1
0     0     0   562     0      0     0     0     0   282 2.15M /export/uss1-nas-vol1

Now let’s enable the DDRdrive X1 ZIL cache and see where that takes us.

NAME          STATE     READ WRITE CKSUM
sp1           ONLINE       0     0     0
  mirror-0    ONLINE       0     0     0
    c6t1d0    ONLINE       0     0     0
    c6t2d0    ONLINE       0     0     0
logs
  c9t0d0      ONLINE       0     0     0

Again we run the identical SQLIO test and the results are dramatically different. We immediately see a nearly 4X improvement in IOPS (865 versus 227), but what’s much more important is the reduction in latency, with the average dropping from 34 ms to 8 ms, which will make any database workload fly.

C:\Program Files\SQLIO>sqlio -kW -s300 -frandom -o4 -b8 -LS -Fparam.txt
sqlio v1.5.SG
using system counter for latency timings, 3579545 counts per second
parameter file used: param.txt
file c:\testfile.dat with 2 threads (0-1) using mask 0x0 (0)
2 threads writing for 300 secs to file c:\testfile.dat
using 8KB random IOs
enabling multiple I/Os per thread with 4 outstanding
using specified size: 32768 MB for file: c:\testfile.dat
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec: 865.75
MBs/sec: 6.76
latency metrics:
Min_Latency(ms): 0
Avg_Latency(ms): 8
Max_Latency(ms): 535
histogram:
ms: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%: 56 13 9 3 1 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 7

new name   name attr attr lookup rddir read read write write
file remov chng   get   set    ops   ops   ops bytes   ops bytes
0     0     0   131     0      0     0     0     0    66 516K /export/uss1-nas-vol1
0     0     0 3.23K     0      0     0     0     0 1.62K 12.8M /export/uss1-nas-vol1
0     0     0    95     0      0     0     2    8K    43 324K /export/uss1-nas-vol1
0     0     0 2.62K     0      0     0     0     0 1.31K 10.3M /export/uss1-nas-vol1
0     0     0   741     0      0     0     0     0   369 2.78M /export/uss1-nas-vol1
0     0     0 1.99K     0      0     0     0     0 1019 7.90M /export/uss1-nas-vol1
0     0     0 1.34K     0      0     0     0     0   687 5.32M /export/uss1-nas-vol1
0     0     0   937     0      0     0     0     0   468 3.62M /export/uss1-nas-vol1
0     0     0 2.60K     0      0     0     0     0 1.30K 10.3M /export/uss1-nas-vol1
0     0     0 2.02K     0      0     0     0     0 1.01K 7.84M /export/uss1-nas-vol1
0     0     0 1.91K     0      0     0     0     0   978 7.58M /export/uss1-nas-vol1
0     0     0 1.94K     0      0     0     0     0   992 7.67M /export/uss1-nas-vol1
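
The two SQLIO runs can be cross-checked with a little arithmetic. This is a small sketch using only the figures reported above; the 8 KB block size comes from the -b8 switch and the variable names are illustrative.

```python
BLOCK_KB = 8  # sqlio -b8: 8 KB per I/O


def mb_per_sec(iops, block_kb=BLOCK_KB):
    """Throughput implied by an IOPS figure at a fixed block size."""
    return iops * block_kb / 1024


without_zil = 227.76  # IOs/sec with the log device offline
with_zil = 865.75     # IOs/sec with the log device online

speedup = with_zil / without_zil
latency_ratio = 34 / 8  # average latency in ms, before versus after

print(f"IOPS improvement: {speedup:.2f}x")
print(f"Implied MB/s without log device: {mb_per_sec(without_zil):.2f}")
print(f"Implied MB/s with log device: {mb_per_sec(with_zil):.2f}")
print(f"Average latency reduction: {latency_ratio:.2f}x")
```

The implied throughput agrees with the MBs/sec lines SQLIO printed to within rounding, which is a useful sanity check that both runs really used 8 KB I/Os.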

DDRdrive X1 Performance Chart by Mike La Spina

NFSStat Chart I/O DB Cache Compare by Mike La Spina

When we look at ZFS ZIL caching devices there are some important elements to consider. For most provisioned VMware storage systems you do not require large volumes of ZIL cache to generate good I/O performance. What you need to do is carefully determine the size of the active data write footprint. Remember that the ZIL is a write-only world and that those writes will be relocated to slower disk at some point. These relocation functions are processed in batches or, as Ben Rockwood likes to say, in a regular breathing cycle. This means that random I/O operations can be queued up and converted to a more sequential-like behavior. Random synchronous write operations can be safely acknowledged immediately and then the ZFS DMU can process them more efficiently in the background. It follows that if we provision cache devices that are closer to the system bus and have lower latency, the back-end core compute hardware will be able to move the data ahead of the bursting I/O peaks, and thus we deliver higher IOPS with significantly smaller cache requirements. Devices like the DDRdrive X1 are a good example of implementing this strategy.
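
The benefit of batching can be illustrated with a toy model. The sketch below is not how ZFS allocates or flushes blocks; it simply assumes a single disk head and a hypothetical set of random write offsets, and compares the head travel when writes are committed in arrival order against the travel when the same writes are held in a log and flushed as one sorted batch.

```python
import random


def head_travel(offsets, start=0):
    """Total distance a single head moves visiting offsets in the given order."""
    travel, position = 0, start
    for offset in offsets:
        travel += abs(offset - position)
        position = offset
    return travel


random.seed(42)  # fixed seed so the run is repeatable
# 1,000 synchronous writes to random block offsets on a hypothetical disk.
writes = [random.randrange(1_000_000) for _ in range(1000)]

arrival_order = head_travel(writes)          # commit each write as it arrives
batched_order = head_travel(sorted(writes))  # acknowledge from the log, flush sorted

print(f"arrival order travel: {arrival_order}")
print(f"batched order travel: {batched_order}")
print(f"reduction: {arrival_order / batched_order:.0f}x")
```

In this model the sorted batch needs only a single sweep across the disk, which is the intuition behind acknowledging from a fast log device and letting the slower disks catch up in the background.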

I hope you found this blog entry to be interesting and useful.

Regards,

Mike

February 14th, 2010 | Tags: aggregate, cache, Dedup, NFS, performance, VMware, zfs, ZIL.
Categories: General, Storage, VMware | Comments: 30 Comments |

Protecting Active Directory with Snapshot Strategies

Using snapshots to protect Active Directory (AD) without careful planning will most definitely end up in a complete disaster. AD is a loosely consistent distributed multi-master database and it must not be treated as a static system. Without carefully addressing how AD works with Time Stamps, Version Stamps, Update Sequence Numbers (USNs), Globally Unique Identification numbers (GUIDs), Relative Identification numbers (RIDs), Security Identifiers (SIDs) and restoration requirements, the system could quickly become unusable or severely damaged in the event of an incorrectly invoked snapshot reversion.

There are many negative scenarios that can occur if we were to re-introduce an AD replica to service from a snapshot instance without special handling. In the event of a snapshot based re-introduction the RID functional component is seriously impacted. In any AD system RIDs are created in range blocks and assigned for use to a participating Domain Controller (DC) by the RID master DC AD role. RIDs are used to create SIDs for all AD objects like Group or User objects and they must all be unique. Let’s take a closer look at the SID to understand why RIDs are such a critical function.

A SID is composed using the following symbolic format, S-R-IA-SA-RID:

S: Indicates the type of value is a SID.
R: Indicates the revision of the SID.
IA: Indicates the issuing authority. Most are the NT Authority identity number 5.
SA: Indicates the sub-authority aka domain identifier.
RID: Indicates the Relative ID.

Now looking at some real SID example values we see that on a DC instance only the RID component of the SID, the final field, is unique as shown here.

DS0\User1 = S-1-5-21-3725033245-1308764377-180088833-3212
DS0\UserGroup1 = S-1-5-21-3725033245-1308764377-180088833-7611
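
The field layout can be made concrete by splitting the two example SIDs. This is a minimal sketch: parse_sid and the Sid tuple are illustrative names, and the SIDs are written with plain hyphens as separators.

```python
from typing import NamedTuple, Tuple


class Sid(NamedTuple):
    revision: int
    authority: int
    sub_authorities: Tuple[int, ...]  # the domain identifier fields
    rid: int


def parse_sid(text):
    """Split an S-R-IA-SA-RID string into its numeric components."""
    parts = text.split("-")
    if len(parts) < 4 or parts[0] != "S":
        raise ValueError(f"not a SID: {text!r}")
    numbers = [int(p) for p in parts[1:]]
    return Sid(numbers[0], numbers[1], tuple(numbers[2:-1]), numbers[-1])


user = parse_sid("S-1-5-21-3725033245-1308764377-180088833-3212")
group = parse_sid("S-1-5-21-3725033245-1308764377-180088833-7611")

# Everything except the RID is identical for objects in the same domain.
same_domain = user[:3] == group[:3]
print(user.rid, group.rid, same_domain)
```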

When an older snapshot image of a DC is reintroduced its assigned RID range will likely have RID entries that were already used to generate SIDs. Those SIDs would have replicated to the other DCs in the AD forest. When the reintroduced DC starts up it will try to participate in replication and servicing authentications of accounts. Depending on the age and configuration of its secure channel the DC could be successfully connected. This snapshot reintroduction event should be avoided since any RID usage from the aged DC will very likely result in duplicated SID creations, which is obviously very undesirable.
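
The duplication risk can be shown with a toy model. The class below is not an AD implementation; it assumes a hypothetical RID range of 3000 to 3499 and only tracks the next RID to issue, which is the piece of state that a snapshot reversion winds back.

```python
DOMAIN = "S-1-5-21-3725033245-1308764377-180088833"


class ToyDomainController:
    """Toy DC that hands out SIDs from an assigned RID range."""

    def __init__(self, rid_start, rid_end):
        self.next_rid = rid_start
        self.rid_end = rid_end

    def create_sid(self):
        if self.next_rid > self.rid_end:
            raise RuntimeError("RID pool exhausted")
        sid = f"{DOMAIN}-{self.next_rid}"
        self.next_rid += 1
        return sid

    def snapshot(self):
        """Capture the only state this toy model tracks."""
        return self.next_rid

    def revert(self, saved_next_rid):
        """Wind the DC back to the captured state."""
        self.next_rid = saved_next_rid


dc = ToyDomainController(3000, 3499)  # hypothetical RID range
dc.create_sid()
saved = dc.snapshot()

# SIDs issued after the snapshot; in a real forest these replicate to other DCs.
issued_after_snapshot = [dc.create_sid() for _ in range(3)]

dc.revert(saved)  # the old snapshot image is brought back online
issued_after_revert = [dc.create_sid() for _ in range(3)]

duplicates = set(issued_after_snapshot) & set(issued_after_revert)
print(sorted(duplicates))
```

Every SID issued after the reversion collides with one that was already handed out, which is exactly the condition the restore procedure has to prevent.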

Under normal AD recovery methods we would either need to restore AD or build a new server, perform a DC promo on it and possibly seize DC roles if required. The most important element of a normal AD restore process is the DC GUID reinitialization function. The DC GUID value reinitialization operation allows the restoration of an AD DC to occur correctly. A newly generated GUID becomes part of the Domain Identifier and thus the DC can create SIDs that are unique despite the fact that the RID assignment range it holds may be from a previously used one.

When we use a snapshot image of a guest DC VM none of the required Active Directory restore requirements will occur on system startup, and thus we must manually bring the host online in DSRM mode without a network connection and then set the NTDS restore mode. I see this as a serious security risk as there is a significant probability that the host could be brought online without these steps occurring and potentially create integrity issues.

One mitigation to this identified risk is to perform the required changes before a snapshot is captured and once the capture is complete revert the change back to the non-restore state. This action will completely prevent a snapshot image of a DC from coming online from a past time reference.

In order to achieve this level of server state and snapshot automation we would need to provision a service channel from our storage head to the involved VMs or for that matter any storage consumer. A service channel can provide other functionality beyond the NTDS state change as well. One example is the ability to flush I/O using VSS or sync etc.

We can now look at a practical example of how to implement this strategy on OpenSolaris based storage heads and W2K3 or W2K8 servers.

The first part of the process is to create the service channel on a VM or any other Windows host which can support VBScript or PowerShell etc. In this specific case we need to provision an SSH server daemon that will allow us to issue commands directed towards the storage consuming guest VM from the providing storage head. There are many possible products available that can provide this service. I personally like MobaSSH, which I will use in this example. Since this is a Domain Controller we need to use the Pro version, which supports domain based user authentication from our service channel VM.

We need to create a dedicated user that is a member of the domain’s BUILTIN\Administrators group. This poses a security risk and thus you should mitigate it by restricting this account to only the machines it needs to service.

e.g. in AD restrict it to the DCs or possibly any involved VMs to be managed and the Service Channel system itself.

Restricting user machine logins

A dedicated user allows us to define authentication from the storage head to the service channel VM using a trusted ssh RSA key that is mapped to the user instance on both the VM and OpenSolaris storage host. This user will launch any execution process that is issued from the OpenSolaris storage head.

In this example I will use the name scu, which is short for Service Channel User.

First we need to create the scu user on our OpenSolaris storage head.

root@ss1:~# useradd -s /bin/bash -d /export/home/scu -P 'ZFS File System Management' scu
root@ss1:~# mkdir /export/home/scu
root@ss1:~# cp /etc/skel/* /export/home/scu
root@ss1:~# echo PATH=/bin:/sbin:/usr/ucb:/etc:. > /export/home/scu/.profile
root@ss1:~# echo export PATH >> /export/home/scu/.profile
root@ss1:~# echo PS1=$'${LOGNAME}@$(/usr/bin/hostname)'~#' ' >> /export/home/scu/.profile

root@ss1:~# chown -R scu /export/home/scu
root@ss1:~# passwd scu

In order to use an RSA key for authentication we must first generate an RSA private/public key pair on the storage head. This is performed using ssh-keygen while logged in as the scu user. You must set the passphrase as blank otherwise the session will prompt for it.

root@ss1:~# su - scu

scu@ss1~#ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/export/home/scu/.ssh/id_rsa):
Created directory '/export/home/scu/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /export/home/scu/.ssh/id_rsa.
Your public key has been saved in /export/home/scu/.ssh/id_rsa.pub.
The key fingerprint is:
0c:82:88:fa:46:c7:a2:6c:e2:28:5e:13:0f:a2:38:7f scu@ss1
scu@ss1~#

We now have the public key available in the file named id_rsa.pub. The content of this file must be copied to the target ssh instance file named .ssh/authorized_keys. The private key file named id_rsa MUST NOT be exposed to any other location and should be secured. You do not need to store the private key anywhere else as you can regenerate the pair anytime if required.

Before we can continue we must install and configure the target Service Channel VM with MobaSSH.

It’s a simple setup, just download MobaSSH Pro to the target local file system.

Execute it.

Click install.

Configure only the scu domain based user and clear all others from accessing the host.

e.g.

Moba Domain Users

Once MobaSSH is installed and restarted we can connect to it and finalize the secured ssh session. Don’t forget to add the scu user to your AD domain’s BUILTIN\Administrators group before proceeding. Also, you need to perform an initial NT login to the Service Channel Windows VM using the scu user account prior to using the SSH daemon; this is required to create its home directories.

In this step we are using PuTTY to establish an ssh session to the Service Channel VM and then secure shelling to the storage server named ss1. Then we transfer the public key back to ourselves using scp and exit host ss1. Finally we use cat to append the public key file content to our .ssh/authorized_keys file in the scu user’s profile. Once these steps are complete we can establish an automated, promptless, encrypted session from ss1 to the Service Channel Windows NT VM.

[Fri Dec 18 – 19:47:24] ~
[scu.ws0] $ ssh ss1
The authenticity of host ‘ss1 (10.10.0.1)’ can’t be established.
RSA key fingerprint is 5a:64:ea:d4:fd:e5:b6:bf:43:0f:15:eb:66:99:63:6b.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added ‘ss1,10.10.0.1’ (RSA) to the list of known hosts.
Password:
Last login: Fri Dec 18 19:47:28 2009 from ws0.laspina.ca
Sun Microsystems Inc. SunOS 5.11 snv_128 November 2008

scu@ss1~#scp .ssh/id_rsa.pub ws0:/home/scu/.ssh/ss1-rsa-scu.pub
scu@ws0’s password:
id_rsa.pub 100% |*****************************| 217 00:00
scu@ss1~#exit

[Fri Dec 18 – 19:48:09]
[scu.ws0] $ cat .ssh/ss1-rsa-scu.pub >> .ssh/authorized_keys

With our automated RSA key authentication completed we can proceed to customize the MobaSSH service instance to run as the scu user. We need to perform this modification in order to enable VBScript WMI DCOM impersonate caller rights when instantiating objects. In this case we are calling a remote registry object over WMI and modifying the NTDS service registry start up values, and thus this can only be performed by an administrator account. This modification essentially extends the storage host’s capabilities to reach any Windows host that needs integral system management function calls.

On our OpenSolaris Storage head we need to invoke a script which will remotely change the NTDS service state and then locally snapshot the provisioned storage and lastly return the NTDS service back to a normal state. To accomplish this function we will define a cron job. The cron job needs some basic configuration steps as follows.

The solaris.jobs.user authorization is required to submit a cron job; it allows us to create the job but not administer the cron service.
If an /etc/cron.d/cron.allow file exists then this RBAC setting will be overridden by the file’s existence and you will need to add the user to that file or convert to the best practice methods of RBAC.

root@ss1~# usermod -A solaris.jobs.user scu
root@ss1~# crontab -e scu
59 23 * * * ./vol1-snapshot.sh

Hint: crontab uses vi – http://www.kcomputing.com/kcvi.pdf “vi cheat sheet”

The key sequence is: hit “i”, key in the line, then hit “esc :wq” to save, or “esc :q!” to abort.

Be aware of the timezone the cron service runs under; you should check it and adjust it if required. Here is an example of what’s required to check and set it.

root@ss1~# pargs -e `pgrep -f /usr/sbin/cron`

Let’s change it to CST6CDT

root@ss1~# svccfg -s system/cron:default setenv TZ CST6CDT

Also the default environment path for cron may cause some script “command not found” issues, check for a path and adjust it if required.

This one has no default path, add the path using echo.

root@ss1~# echo PATH=/usr/bin:/usr/sbin:/usr/ucb:/etc:. > /etc/default/cron
# svcadm refresh cron
# svcadm restart cron

With a cron job defined to run the script named vol1-snapshot.sh in the default home directory of the scu user we are now ready to create the script content. Our OpenSolaris storage host needs to call a batch file on the remote Service Channel VM and it will execute a VBScript from there to set the NTDS start up mode. To do this from a unix bash script we will use the following statements in the vol1-snapshot.sh file.

ssh -t ws0 NTDS-PreSnapshot.bat
snap_date="$(date +%d-%m-%y-%H:%M)"
pfexec zfs snapshot rp1/san/vol1@$snap_date
ssh -t ws0 NTDS-PostSnapshot.bat
exit
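
One property worth preserving in any variant of this script is that the post-snapshot step always runs, even when the snapshot itself fails, so the DC is never left flagged for restore by accident. The following is a hedged Python sketch of the same three-step sequence; protected_snapshot and snapshot_name are hypothetical helpers, the command runner is injectable so the flow can be dry-run, and the host, batch file and dataset names are the ones used above.

```python
import subprocess
from datetime import datetime


def snapshot_name(dataset, now=None):
    """Build dataset@dd-mm-yy-HH:MM, matching the date format used above."""
    now = now or datetime.now()
    stamp = now.strftime("%d-%m-%y-%H:%M")
    return f"{dataset}@{stamp}"


def protected_snapshot(dataset, host="ws0", run=subprocess.run, now=None):
    """Set NTDS restore mode, snapshot, then always clear restore mode."""
    run(["ssh", "-t", host, "NTDS-PreSnapshot.bat"], check=True)
    try:
        run(["pfexec", "zfs", "snapshot", snapshot_name(dataset, now)], check=True)
    finally:
        # Runs even if the snapshot fails, so the DC is not left flagged.
        run(["ssh", "-t", host, "NTDS-PostSnapshot.bat"], check=True)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    protected_snapshot("rp1/san/vol1", run=lambda cmd, check: print(" ".join(cmd)))
```

Passing a printing function as run shows the exact commands without touching the storage head or the Service Channel VM.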

Here we are running a secure shell call to the MobaSSH daemon with the -t option, which forces a pseudo-terminal to be allocated and allows us to issue an “exit” from the remote calling script, closing the secure shell. On the Service Channel VM the following batch file VBScript calls are executed using the pre and post batch files illustrated as follows.

scu Batch Files

NTDS-PreSnapshot.bat
cscript NTDS-SnapshotRestoreModeOn.vbs DS0
exit

NTDS-PostSnapshot.bat
cscript NTDS-SnapshotRestoreModeOff.vbs DS0
exit

NTDS-SnapshotRestoreModeOn.vbs

strComputer = Wscript.Arguments(0)
const HKLM=&H80000002
Set oregService=GetObject("WinMgmts:{impersonationLevel=impersonate}!\\" & strComputer & "\root\default:stdRegProv")
oregService.SetDWordValue HKLM, "SYSTEM\CurrentControlSet\Services\ntds\parameters", "Database restored from backup", 1
Set oregService=Nothing

NTDS-SnapshotRestoreModeOff.vbs

We now have Windows integrated storage volume snapshot functionality that allows an Active Directory domain controller to be securely protected using a snapshot strategy. In the event we need to fail back to a previous point in time there will be no danger that the snapshot will cause AD corruption. The integration process has other highly desirable capabilities such as the ability to call VSS snapshots and any other application backup preparatory function calls. We could also branch out using more sophisticated PowerShell calls to VMware hosts in a fully automated recovery strategy using ZFS replication and remote sites.

Hope you enjoyed this entry.

Seasons Greetings to All.

Regards,

Mike

December 23rd, 2009 | Tags: Active Directory, best practice, opensolaris, Restore, Security, snapshot, zfs.
Categories: Security, Storage, VMware | Comments: 5 Comments |
