X4500 ZFS and iSCSI Performance Characteristics
Benchmarks are useful in many ways, and they are particularly effective when you wish to validate an architectural design: in this case, a Sun X4500 acting as an iSCSI target for VMware ESX 3.5 servers with QLA4050c initiators. This benchmark is not a definitive measure of the architectural maximums of the X4500 or the other components; it is a validation that the whole system performs as expected in its current configuration. Within this configuration several components have expected limitations, such as the Ethernet switch buffering and flow control rates. We also need to recognize that any iSCSI configuration has a characteristic collapsing point and inherent latency caused by packet saturation peaks. In this architecture we can expect specific limits, such as roughly 60% effective utilization of a 1Gb Ethernet connection before latency issues become prevalent. Additionally, with SATA disk interfaces we can expect that sustained or very frequent high rates of small I/O will deliver less than effective performance.
The design details are in the blog entry http://blog.laspina.ca/ubiquitous/running_zfs_over_iscsi_as
What becomes important to consider with this architecture is the cost-to-performance ratio, and in that respect this design is very attractive. The combined components of this system perform very well in this context, with some pleasant surprises in the capabilities delivered. When we look at the performance of this design, some elements seem to escape the traditional behaviors of the involved subcomponents. Within the current limiting parameters we would expect it to underperform in more use cases than not, yet that is not what happens. There are several reasons for this result. ZFS is a big factor, since it batches writes into transactional groups, which is complementary to SATA interface behavior: SATA interfaces work well with larger transfer segments rather than many small, short transfer operations, so ZFS optimizes performance over SATA disk arrays. Another factor is the virtualization layer on the VMware hosts, which consolidates many of the smaller I/O operations and issues larger read and write transfer requests, provided that we use vmdk files when provisioning virtual disk devices.
In the first graph we are observing the results of a locally executed dd command collected with iostat as follows:
dd if=/dev/zero of=/rp1/iscsi/iotest count=1024k bs=64k
iostat -x 15 13 (only the last 12 outputs are used for the graph plot)
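As a complementary check (not part of the original collection method), aggregate pool bandwidth can also be watched directly while the dd run is in progress; a minimal sketch, assuming the same rp1 pool and 15 second interval:

zpool iostat rp1 15 13

This reports pool-level read and write bandwidth per interval, which should roughly match the totals plotted from the per-device iostat columns.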
The graph reveals excellent write performance from the 36-disk RAID-Z configuration, sustaining roughly 600MB/s over a 3 minute period.
Other RAID modes can provide significantly better performance, such as a 24-pair RAID 1 mirror; however, that does not make optimal use of the available disk capacity and is not required for this application.
This next graph reveals excellent read performance, sustaining roughly 800MB/s over a 3 minute period. A dd command was again used as follows:
dd if=/rp1/iscsi/iotest of=/dev/null count=1024k bs=64k
This final graph plots the iostat collection values from the X4500 while the ESX 3.5 initiators were running real-time production application loads over the iSCSI network for a 3 minute period. Additionally, 8 virtual machines were added to the production load and were executing Microsoft's SQLIO tool against eight 2GB files, sustaining 100% writes at a block size of 64k.
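For reference, a SQLIO invocation along the following lines generates that kind of load. The exact parameters and file paths were not recorded here, so the values below are assumptions, and the 2GB test file is presumed to be pre-created in each VM:

sqlio -kW -s180 -b64 -o8 -t1 -fsequential -BN c:\iotest\testfile.dat

Here -kW selects writes, -b64 sets a 64k block size, -s180 runs for 180 seconds, -o8 keeps 8 outstanding I/Os, and -BN disables buffering so the writes actually reach the iSCSI store.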
We can observe 220MB/s sustained I/O while both read and write activity was present, and we also find a surprising 320MB/s peak of final write activity. While this is not the maximum attainable level for the possible configurations, it certainly validates the performance as excellent and definitely meets the cost-to-performance design objectives.
There are some small improvements available to optimize this design's performance. The use of jumbo frames on the network side will provide better performance for TCP stack operations, especially when using software iSCSI initiators on VMware. More importantly, using a RAID-Z array of 44 drives plus two spares would improve I/O performance by 15-20% at zero additional cost. As well, upgrading to 10Gb Ethernet is a possible next step if required, as the X4500 can deliver much more than the current 4Gb aggregate.
Regards,
Mike
Running ZFS over iSCSI as a VMware vmfs store
The first time I looked at ZFS it totally floored me. This is a file system that has changed the storage system rules as we currently know them, and it continues to do so. It is without a doubt the best architecture to date, and now you can use it for your VMware stores.
Previously I had explored using it for a VMware store but ran into many issues which were real show stoppers, like the VPD page response issue which made VMware see only one usable iSCSI store. But things are soon to be very different when Sun releases snv_93 or above to all. I am currently using the unreleased snv_93 iscsitgt code and it works with VMware in all the ways you would want. Many thanks to the Sun engineers for adding NAA support to the iSCSI target service. With that said, let me divulge the details and behaviors of the first successful X4500 ZFS iSCSI VMware implementation in the real world.
Let's look at the architectural view first.
The architecture uses a best practice approach consisting of a completely separated physical network for the iSCSI storage data plane. All components have redundant power and network connectivity. The iSCSI storage backplane is configured with an aggregate and is VLAN'd off from the server management network. Within the physical HP 2900s an inter-switch link (ISL) is defined but is not critical; it allows for more available data paths if additional interfaces were assigned on the ESX host side.
The Opensolaris aggregate and network components are configured as follows:
For those of you using Indiana: by default nwam is enabled, and it needs to be disabled and the physical network service enabled.
svcadm disable svc:/network/physical:nwam
svcadm enable svc:/network/physical:default
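A quick check with svcs confirms that nwam is now disabled and the default physical network service is online:

svcs network/physical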
The aggregate is defined using the dladm utility, but first any bindings need to be cleared by unplumbing the interfaces.
e.g. ifconfig e1000g0 unplumb
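For the four interfaces used in the aggregates below, that means:

ifconfig e1000g0 unplumb
ifconfig e1000g1 unplumb
ifconfig e1000g2 unplumb
ifconfig e1000g3 unplumb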
Once cleared, the assignment of the physical devices is possible using the following commands:
dladm create-aggr -d e1000g0 -d e1000g1 -P L2,L3 1
dladm create-aggr -d e1000g2 -d e1000g3 -P L2,L3 2
Here we have set the policy to allow layer 2 and layer 3 hashing and defined two aggregates, aggr1 and aggr2. We can now define the VLAN based interfaces, shown here as VLAN 500 instances 1 and 2, respective of the aggr instances. You just need to apply the following formula when naming the VLAN interface:
(adapter name) + (VLAN ID * 1000 + adapter instance)
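For example, aggr1 on VLAN 500 works out to 500 * 1000 + 1 = 500001, which gives the interface name aggr500001 used below; aggr2 likewise becomes aggr500002.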
ifconfig aggr500001 plumb up 10.1.0.1 netmask 255.255.0.0
ifconfig aggr500002 plumb up 10.1.0.2 netmask 255.255.0.0
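At this point the links and interfaces can be verified before moving on; dladm show-aggr should list both aggregation keys with their member ports, and ifconfig will confirm the VLAN interfaces are plumbed:

dladm show-aggr
ifconfig aggr500001
ifconfig aggr500002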
To persist the network configuration on boot you will need to create hostname files and hosts entries for the services to apply on startup.
echo ss1.iscsi1 > /etc/hostname.aggr500001
echo ss1.iscsi2 > /etc/hostname.aggr500002
Edit /etc/hosts to have the following host entries.
::1 localhost
127.0.0.1 ss1.local localhost loghost
10.0.0.1 ss1 ss1.domain.name
10.1.0.1 ss1.iscsi1
10.1.0.2 ss1.iscsi2
On the HP switches it's a simple static trunk definition on ports 1 and 2 using the following at the CLI:
trunk 1-2 trk1 trunk
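On the ProCurve CLI the trunk membership can then be sanity checked with something like the following (command name per the ProCurve firmware I have used; treat it as an assumption for other releases):

show trunks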
Once all the networking components are up, running and persistent, it's time to define the ZFS store and iSCSI targets. I chose to include both mirrored and raidz pools. I needed to find and organize the cxtxdx device names using the cfgadm command; you could also issue a format command to see the controller, target and disk names if you're not using an X4500. I placed the raidz devices across controllers to improve I/O and distribute the load; it would not be prudent to place one array on a single SATA controller. So here is what it ends up looking like from the ZFS command view.
zpool create -f rp1 raidz1 c4t0d0 c4t6d0 c5t4d0 c8t2d0 c9t1d0 c10t1d0
zpool add rp1 raidz1 c4t1d0 c4t7d0 c5t5d0 c8t3d0 c9t2d0 c10t2d0
zpool add rp1 raidz1 c4t2d0 c5t0d0 c5t6d0 c8t4d0 c9t3d0 c10t3d0
zpool add rp1 raidz1 c4t3d0 c5t1d0 c5t7d0 c8t5d0 c9t5d0 c11t0d0
zpool add rp1 raidz1 c4t4d0 c5t2d0 c8t0d0 c8t6d0 c9t6d0 c11t1d0
zpool add rp1 raidz1 c4t5d0 c5t3d0 c8t1d0 c8t7d0 c10t0d0 c11t2d0
zpool add rp1 spare c11t3d0
zpool create -f mp1 mirror c10t4d0 c11t4d0
zpool add mp1 mirror c10t5d0 c11t5d0
zpool add mp1 mirror c10t6d0 c11t6d0
zpool add mp1 spare c9t7d0
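A quick status check confirms the pool layout and that the spares are attached; these are standard ZFS commands:

zpool status rp1
zpool status mp1
zpool list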
It only takes seconds to create terabytes of storage; wow, it truly is a thing of beauty (geek!). Now it's time to define a few pools and stores in preparation for the creation of the iSCSI targets. I chose to create units of 750G since VMware would not perform well with much more than that. This is somewhat dependent on the size of the VMs and the type of I/O, but generally an ESX host will serve a wide mix, so I try to keep it to a reasonable size or it ends up with SCSI reservation issues (that's a bad thing, chief).
You must also consider the I/O block size before creating a ZFS store; this is not something that can be changed later, so now is the time. It's done by adding -b 64K to the zfs create command. I chose 64k for the block size, which aligns with the VMware default allocation size and thus optimizes performance. The -s option enables the sparse volume feature, aka thin provisioning. In this case the space was available, but it is my favorite way to allocate storage.
zfs create rp1/iscsi
zfs create -s -b 64K -V 750G rp1/iscsi/lun0
zfs create -s -b 64K -V 750G rp1/iscsi/lun1
zfs create -s -b 64K -V 750G rp1/iscsi/lun2
zfs create -s -b 64K -V 750G rp1/iscsi/lun3
zfs create mp1/iscsi
zfs create -s -b 64K -V 750G mp1/iscsi/lun0
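The volume properties can be checked afterwards; volsize, volblocksize and refreservation are the relevant ones (on these builds a sparse volume should show refreservation as none):

zfs get volsize,volblocksize,refreservation rp1/iscsi/lun0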
Originally I wanted to build the ESX hosts using a local disk, but thanks to some bad IBM x346 engineering I could not use the QLA4050C and the integrated Adaptec controller together on the ESX host server hardware. So I decided to give boot from iSCSI a go; thus here is the boot LUN definition that I used for it. The original architectural design calls for local disk to prevent an ESX host failure in the event of an iSCSI path outage.
zfs create rp1/iscsi/boot
zfs create -s -V 16G rp1/iscsi/boot/esx1
Now that the ZFS stores are complete, we can create the iSCSI targets for the ESX hosts to use. I have named the target alias to reflect something about the storage system, which makes it easier to work with. I also created an iSCSI configuration store so the iSCSI targets persist across reboots. (This may now be included with OpenSolaris Indiana but I have not tested it.)
mkdir /etc/iscsi/config
iscsitadm modify admin --base-directory /etc/iscsi/config
iscsitadm create target -u 0 -b /dev/zvol/rdsk/rp1/iscsi/lun0 ss1-zrp1
iscsitadm create target -u 1 -b /dev/zvol/rdsk/rp1/iscsi/lun1 ss1-zrp1
iscsitadm create target -u 2 -b /dev/zvol/rdsk/rp1/iscsi/lun2 ss1-zrp1
iscsitadm create target -u 3 -b /dev/zvol/rdsk/rp1/iscsi/lun3 ss1-zrp1
iscsitadm create target -b /dev/zvol/rdsk/mp1/iscsi/lun0 ss1-zmp1
iscsitadm create target -b /dev/zvol/rdsk/rp1/iscsi/boot/esx1 ss1-esx1-boot
Most blog examples of enabling targets show the ZFS command line method, shareiscsi=on. This works well for a new iqn, but if you want to allocate additional LUNs under that iqn then you need to use this -b backing store method.
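For contrast, the ZFS property method referred to above looks like this; it shares each volume as its own target iqn rather than as multiple LUNs under one iqn (shown for illustration only, it is not used in this build):

zfs set shareiscsi=on rp1/iscsi/lun0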
Now that we have some targets you should be able to list them using:
iscsitadm list target
Notice that we only see one iqn for ss1-zrp1; you can use the -v option to show all the LUNs if required.
Target: ss1-zrp1
iSCSI Name: iqn.1986-03.com.sun:02:eb9c3683-9b2d-ccf4-8ae0-85c7432f3ef6.ss1-zrp1
Connections: 2
Target: ss1-zmp1
iSCSI Name: iqn.1986-03.com.sun:02:36fd5688-7521-42bc-b65e-9f777e8bfbe6.ss1-zmp1
Connections: 2
Target: ss1-esx1-boot
iSCSI Name: iqn.1986-03.com.sun:02:d1ecaed7-459a-e4b1-a875-b4d5df72de40.ss1-esx1-boot
Connections: 2
It would be prudent to create some target initiator entries to allow authorization control over which initiator iqns can connect to a particular target.
This is an important step: it creates the ability to use CHAP, or at least to only allow named iqns to connect to that target. iSNS also provides a similar service.
iscsitadm create initiator --iqn iqn.2000-04.com.qlogic:qla4050c.esx1.1 esx1.1
iscsitadm create initiator --iqn iqn.2000-04.com.qlogic:qla4050c.esx1.2 esx1.2
Now we can assign these initiators to a target, and the target will then only accept those initiators. You can also add CHAP authentication, but that's beyond the scope of this blog.
iscsitadm modify target --acl esx1.1 ss1-esx1-boot
iscsitadm modify target --acl esx1.2 ss1-esx1-boot
iscsitadm modify target --acl esx1.1 ss1-zrp1
iscsitadm modify target --acl esx1.2 ss1-zrp1
iscsitadm modify target --acl esx1.1 ss1-zmp1
iscsitadm modify target --acl esx1.2 ss1-zmp1
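You can confirm the assignments by listing a target verbosely; the ACL entries should appear under each target:

iscsitadm list target -v ss1-zrp1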
In order to boot from the target LUN we need to configure the QLA4050C boot feature. You must do this from the ESX host using the Ctrl-Q key sequence during the boot cycle. It is simply a matter of entering the primary boot target IP, setting the mode to manual, and entering the iqn exactly as it was listed by the iscsitadm list target command, e.g.
iqn.1986-03.com.sun:02:d1ecaed7-459a-e4b1-a875-b4d5df72de40.ss1-esx1-boot
Once the iqn is entered the ESX host software can be installed and configured.
Till next time….
Automated VMware Tools upgrades using your VC and Perl
I was upgrading a VI 2.0.2/3.0.2 system to VI 2.5/3.5 and found that upgrading VMware Tools was becoming a real pain, so I decided to look for a better way. I first did the usual search hoping to find a quickie, but no luck. I searched the VMware forums to find that everyone was using the MSI package in a manner external to the VI system. This did not address any Linux or Unix systems, so here is what I came up with to address the lack of automation.
First I installed the VMware VI Perl Toolkit on the VC.
I changed the NTFS rights on the Toolkit directory to a dedicated admin user ID and SYSTEM only, and then populated the visdk.rc config file in the same directory with the following parameters:
VI_PROTOCOL=https
VI_SERVER=theVCserverFQDN
VI_SERVICEPATH=/sdk
VI_USERNAME=adminid
VI_PASSWORD=thepassword
I also added the following ENV variable to the SYSTEM Context.
I wrote a short, simple Perl script to make use of the power of the VC and have it take care of the task using its own built-in capability. Frankly, I just don't see why VMware did not include a scheduled task item in the VC to handle this job.
#!/usr/bin/perl -w
#
# Requires a text file named VMToolsUpgradeList.txt
# containing one VM name per line.
# Author: Mike La Spina
# Date: May 24, 2008
# upgradetools.pl --host TheVC
use strict;
use warnings;
use VMware::VIRuntime;

my %opts = (
    'host' => {
        type => "=s",
        help => "Host is the VIM API service target",
        required => 0,
    },
);
Opts::add_options(%opts);
Opts::parse();
Opts::validate();

Util::connect();

# Read the list of VM names that are eligible for a tools upgrade.
open FILE, "<", "VMToolsUpgradeList.txt" or die $!;
my @lines = <FILE>;
close FILE;
chomp @lines;

# Get all VMs with out of date tools.
my $vm_views = Vim::find_entity_views(view_type => 'VirtualMachine',
                                      filter => { 'guest.toolsStatus' => 'toolsOld' });

foreach (@$vm_views) {
    my $nameVM = $_->name;
    foreach my $vmname (@lines) {
        if ($nameVM eq $vmname) {
            # Ask the VC to run its own tools upgrade task on this VM.
            $_->UpgradeTools_Task();
            print "Issued VMware Tools upgrade task on " . $nameVM . "\n";
        }
    }
}
print "Tools Upgrade scan completed\n";
Util::disconnect();
The script requires a text file, noted in the script, which I populated with all of the VM names that I wanted to upgrade VMware Tools on.
I created a Windows scheduled task to run the Perl script at 3:30AM and set it up to log in with the dedicated admin ID and password.
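A minimal sketch of creating that task from the command line; the task name and script path are assumptions, the account is the dedicated admin ID mentioned above, and the exact /st time format varies by Windows version:

schtasks /create /tn "VMToolsUpgrade" /tr "perl C:\scripts\upgradetools.pl" /sc daily /st 03:30 /ru adminid /rp thepassword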
I can now add any VM names to the VMToolsUpgradeList.txt and schedule the event to run after hours.
There are no issues with MSI deployments or Linux installs etc.
Till next time…
ESX 3.5.0 Upgrade Pegasus Issue
I was upgrading some ESX 3.5.0 systems the other day and noticed a strange error during the Pegasus CIM server startup.
The error was as follows.
Error from the log file /var/pegasus/vmware/install_queue/3.log:
Parsing error: parse error: Error adding class VMware_IdentityMemberOfCollection
to the repository: CIM_ERR_NOT_FOUND: The requested object could not be found: "VMware_Identity"
After digging around a bit I found a couple of errors in the MOF compiler directives, and a file missing from the shared provider components.
To correct the problem, which appears to be widespread, I performed the following adjustments.
Edit the roleauth-schema compiler directives to include the VMware_Identity class definition:
nano /var/pegasus/vmware/install_queue/3_files/mofs/root/PG_Interop/roleauth-schema.mof
Add the VMware_Identity include line above the pre-existing VMware_IdentityMemberOfCollection directive.
#pragma include ("VMware_Identity.mof")
#pragma include ("VMware_IdentityMemberOfCollection.mof")
It also needs to be added in the standard cimv2 path.
nano /var/pegasus/vmware/install_queue/3_files/mofs/root/cimv2/roleauth-schema.mof
#pragma include ("VMware_Identity.mof")
#pragma include ("VMware_IdentityMemberOfCollection.mof")
Copy the missing file from the standard cimv2 path to the shared path.
cp /var/pegasus/vmware/install_queue/3_files/mofs/root/cimv2/VMware_Identity.mof \
/var/pegasus/vmware/install_queue/3_files/mofs/root/PG_Interop/
Stop and start the service with these commands.
/etc/init.d/pegasus stop
/etc/init.d/pegasus start
Once the scripts complete, the install_queue will be empty and the service will start much more quickly.
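A quick look at the queue directory confirms it has been processed:

ls /var/pegasus/vmware/install_queue/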