About

This is a technical blog site which shares information primarily about Virtualization, Storage and Security.

Site Contents: © 2009 Mike La Spina

19 Comments

  • NiTRo says:

Hi Mike, I was using your great VMFS metadata backup script with ESX 3, but now I'm working on ESXi 4 and I would still like to use it. I would also like to be able to restore in case of corruption, but the dd command gives me an error when I try (dd: can't open '/dev/disks/mpx.vmhba1:C0:T1:L0:1': Function not implemented). Can you help me? Regards.

  • Hi Raphael,

I believe the "function not implemented" message is a bit misleading. I suspect the real cause is most likely a read-only file system mount on /dev; you cannot create any files on that mount point. Changing this behavior would be very challenging and completely unsupported. I would suggest using an automated build of an ESX 4 host to recover any ESXi 4 vmfs partition deletion. You could then access the vmfs partition over the /dev/sdxx devices there. No need to license it, since it would only be a temporary instance.
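To make that concrete, here is a rough sketch of the idea, assuming the temporary ESX 4 host sees the LUN as /dev/sdb with the VMFS partition at /dev/sdb1 (device names, the dump file, and the 2 MB size are placeholders — adjust them to what your backup script actually saved):

```shell
# On the temporary (unlicensed) ESX 4 build, the same LUN appears as a
# standard SCSI block device, so dd can read and write it directly.
# Back up the start of the VMFS partition (the metadata lives at the front):
dd if=/dev/sdb1 of=/tmp/vmfs_header.bin bs=1M count=2

# ...and write it back after an accidental partition deletion/corruption:
dd if=/tmp/vmfs_header.bin of=/dev/sdb1 bs=1M
```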

    Regards,

    Mike

  • NiTRo says:

Mike, thanks a lot for your answer. Do you think there is no chance of finding the real device path to use with dd?

  • Frederik says:

    Hi Mike,

thank you for the nice blog. I liked the DDRdrive X1 review. It's a nice device and it seems to do its job well. Unfortunately, its rather steep price keeps it out of the SOHO realm. Which brings me to my question: ESX(i) and OpenSolaris-based storage. I currently run ESXi-hosted VMs from an NFS b104-based backend. Works like a charm. Most importantly, I understand why there is no problem (data loss or consistency) in case of a storage failure: all data is committed safely to disk.
Now to a recent (b125 or later) iSCSI COMSTAR-based backend. I know you prefer iSCSI for your ESX setups. My concern is possible data loss or corruption in case of a storage failure (power, panic, loss of paths, etc.) when using iSCSI, due to caching. Recent discussions on the storage and zfs mailing lists didn't lead to a clear recommendation. To be clear, I don't have nor want to disable the ZIL, and at the moment I don't have a slog added to my pools.
I tested two iSCSI scenarios: a vmfs3 datastore based on a single LUN, and a Windows 2003 VM using two RDM LUNs. The same VM was also used on the vmfs3 datastore. The LUs were created from the /dev/rdsk paths (the raw devices, not the cached /dev/dsk versions). Checking the LUs showed that the writeback cache was enabled.
I ran some tests with a Java program that used SQL Server and did a lot of file reading and writing apart from the SQL Server IOs. When using the RDM-backed VM I saw ZIL usage (with zilstat from Richard Elling) when the SQL part got executed. When using the vmfs3-backed VM I never saw ZIL usage; what was clearly visible was the default 30-second flush.
What can go wrong with this setup? Can the database in the VM lose transactions if the power is cut before the 30 seconds elapse? Should the writeback cache be disabled? How big is the risk of losing or corrupting the vmfs3 in case of a storage failure?

These are questions a lot of people have on their minds, but no clear answers can be found in the mailing lists or the various blogs.

    Thank you for your time,

    Frederik

  • Hi Frederik,

    Thanks for the acknowledgment.
You have a good question, one many people are concerned about and very few will answer confidently.

For the SOHO world there is no real risk, since the data volume movement is so low that the probability of data loss is not a concern. All modern databases have the ability to recover from incomplete transactions, and some can even deal with transactions that are missing but flagged as committed. I do allow write caching at the LUN level for iSCSI volumes in my SOHO/lab environment. I have seen multiple power outages with MSSQL 2005 running on a VM against my iSCSI targets with caching on, and have not dealt with a corruption event to date.

Completely disabling the ZIL is something I would only do if I had a UPS protecting the storage head. With NFS as a VMware shared store you will have more writes that require ZIL protection, since VMware always writes with the O_SYNC flag when using NFS, so more caution may be warranted. I would say that vmfs3 will survive almost all crash events when the IO load is low and your iSCSI target LUN caching is enabled. Keep in mind that the ZIL is active even when the LUN cache is on, which is very safe as far as the SOHO world goes. I have hard-failed ZFS-based NFS heads multiple times and have not seen any failure that rendered a share unusable, though I have killed some data in that testing process, and it was very minor.
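As a point of reference, the relevant knobs can be inspected from the storage head. These are illustrative admin commands only: the pool/volume name and LU GUID below are placeholders, and the zfs `sync` property only exists on builds recent enough to include it:

```shell
# Check whether synchronous write semantics are honored for a zvol
# (sync=standard is the safe default; sync=disabled turns the ZIL off
# for that dataset and should only be used behind a UPS, if at all):
zfs get sync mpool/iscsivol

# COMSTAR: show LU properties, including the writeback cache state:
stmfadm list-lu -v

# Disable the writeback cache on a specific LU for the conservative
# setting (wcd = write cache disable; the GUID is a placeholder):
stmfadm modify-lu -p wcd=true 600144F0C0C0DE0000004D0000010001
```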

If a crash occurs and the cache flush event was not complete, then that data will be dropped; however, this is no different than the network ring buffers being dropped at a crash event.

    Of course in a business environment it would not be advisable to disable your ZIL functionality in any scenario.

In closing, always remember that "a ZFS snapshot is your friend" when it comes to a recovery job.

    Regards,

    Mike

  • NiTRo says:

Hi Mike, I found the trick for the "function not implemented" issue in this KB: http://kb.vmware.com/kb/1008886 — you need the "conv=notrunc" parameter to get dd to work :)
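For anyone landing here later, the working restore per that KB looks roughly like this; the dump file location is a placeholder and the device path is from the error message above:

```shell
# On ESXi, dd cannot truncate the target device node, so a plain
# 'dd of=/dev/disks/...' fails with "Function not implemented".
# conv=notrunc overwrites in place without truncating (VMware KB 1008886):
dd if=/vmfs/volumes/backup/vmfs_header.bin \
   of=/dev/disks/mpx.vmhba1:C0:T1:L0:1 \
   conv=notrunc
```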

Great find NiTRo!
    Thanks for the info.

  • John says:

I have visited your blog and I think it is relevant for my visitors. My website is oriented around diskless iSCSI boot software. You may check it here: http://www.sandeploy.com. We are a PageRank 3 site with page views growing every day.

As you may see for yourself, we produce high quality diskless iSCSI boot software that allows users to perform network diskless boot of any operating system.
It benefits businesses of any size by allowing them to cut costs on hard drives in client machines, while also enabling better performance and noise reduction.

    If you would like to know more about our company and our products, please let me know.

    We would like to feature your site on a special page that you may view here:

    http://www.sandeploy.com/Interesting-sites.php

    There will be a link to your site, as well as short description of it.

  • Frederik says:

    Hi Mike,

have you by any chance used OpenIndiana 151a or the latest Illumos bits as a VMware datastore on one machine serving a different ESXi machine? I'm experiencing a serious regression in write performance, both NFS and iSCSI. I see this on two test servers, AMD and Intel. Once I boot back into oi148, the last OpenSolaris bits before the gate closed, performance is as expected. When I reboot into later, Illumos-based bits, write performance is roughly halved. Since this is such a common setup I'm wondering why nobody else has reported this, except one other user who confirmed by mail that he still had the write performance drop. And because you use ESXi daily and did a lot with OpenSolaris, I thought if anybody would have seen this it would be you. Before filing a bug report I would like to double-check with some known "power users".
    Thank you for your time.

The performance issue you have described is probably related to a known DMA issue in certain network adapter drivers, which has since been fixed. You would need to point to the experimental repo and update the code to verify it on your install.

    Regards,
    Mike

  • Frederik says:

I'm aware of that issue. I even installed an updated (unpublished) e1000g driver a couple of days ago on top of my latest Illumos bits, but that did not resolve the issue. To further exclude the DMA issue and narrow it down a bit more, I compiled an iperf for the ESXi console and can reach 934 Mbit both ways between the ESXi service console and the storage servers. I can FTP at wire speed, on a single socket, to the ZFS storage from other clients. So basically the storage is OK EXCEPT when writing from ESXi. This can be observed from within the guest and from the service console, when doing a copy from a local fast SSD to the NFS datastore. As soon as I reboot back into oi148 the writes are back to the expected performance. The second test server was a totally fresh install with minimal configuration, and the same behaviour can be observed on that machine as well.
Do you know how to simulate the behavior of ESX when writing to vmdks on an NFS-backed store, on a Linux client for example? Mounting the store from Linux with the sync option gave totally different behavior. Does ESX perhaps mount the store async but open the vmdk file with the sync flag?
Any other testing or debugging strategy would be more than welcome before I file a bug report.
Again, thank you for your time.
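One way to attempt that simulation on a Linux client (server, mount points, and file names below are placeholders) is to keep the default async mount and force synchronous I/O per write with GNU dd's oflag=sync, which opens the output file O_SYNC and is closer to how ESXi writes than a sync mount is:

```shell
# Default (async) NFS mount, as ESXi itself uses:
mount -t nfs storage:/mpool/nfsds /mnt/ds

# Per-write synchronous I/O, approximating an O_SYNC file open:
dd if=/dev/zero of=/mnt/ds/test.bin bs=32K count=4K oflag=sync

# For comparison, a fully synchronous mount (different semantics --
# it also affects metadata operations):
mount -o sync -t nfs storage:/mpool/nfsds /mnt/ds-sync
```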

  • Hi Frederik,

I see … I will do some benchmarks on my end and see if it's something I have overlooked.

    Regards,
    Mike

  • Frederik says:

Much appreciated. Very coarse stats from zilstat show a drop from 3100-3400 IOs per txg to 1900-2300 on anything later than oi148. And I'm using a fast dedicated slog, RAM-based, on both machines, so no bottlenecks there. When inspecting the snoop files I found out that when copying, ESXi opened the new file (proc 7) on the NFS datastore with the fsync flag, as expected. Everything from mount, fsstat, etc. looks normal. I want to simulate the NFS RPCs issued by the ESXi shell cp command on a Linux client to see if I can reproduce the behavior on another platform.
    TIA,
    Frederik

  • Hi Frederik,

A quick bench using ESXi 4.1 and an XP VM running on an NFS share served by an OI VM (148 -> 151.1) shows no performance loss. It actually results in a 5%-10% improvement in IOPS over NFS. The bench was a simple IOMeter and SQLIO load using the same variables on both 148 and 151. Can you describe your OI host configuration, specifically the disk controller hardware and motherboard you're running? This sounds like a driver issue.

    Regards,
    Mike

  • Frederik says:

    Hi Mike,

thank you for investigating this. The regression only happens when using a dedicated slog on anything later than oi148, and, at least for the moment, doesn't involve NFS. It can be easily reproduced locally with dd:
# dd if=/dev/zero of=/mpool/4g.bin [oflag=sync] bs=32K count=128K
This more or less simulates the ESXi NFS writes, sync and async. I tried various block sizes, but that didn't make much of a difference.
For example, a 2-disk pool with and without slog:
    oi148 with slog async dd=218MB/s
    oi148 with slog sync dd=50MB/s zilstat iops/per txg=7500
    oi148 NO slog async dd=216MB/s
    oi148 NO slog sync dd=5MB/s zilstat iops/per txg=750

    illumos with slog async dd=223MB/s
    illumos with slog sync dd=13MB/s zilstat iops/per txg= first=4400 second=675 third=750 this pattern repeats
    illumos NO slog async dd=221MB/s
    illumos NO slog sync dd=5MB/s zilstat iops/per txg=770
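For anyone wanting to repeat the comparison, the async and sync cases above can be scripted in one loop ("mpool" is a placeholder pool name; run zilstat in a second terminal to watch ZIL activity, and add/remove the slog between runs):

```shell
# Write 4 GB async, then sync, reporting dd's throughput line for each case.
for flag in "" "oflag=sync"; do
  echo "=== dd ${flag:-async} ==="
  dd if=/dev/zero of=/mpool/4g.bin bs=32K count=128K $flag 2>&1 | tail -1
  rm -f /mpool/4g.bin            # drop the file between runs
done
```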

Do you still have the DDRdrive? It would be interesting to see if the same behaviour manifests itself using that device, since it isn't connected to an HBA.

I'm very curious whether your tests involved a slog. Sorry to have wasted your time benching ESXi while it appears to be a local problem, but I never suspected that. On the brighter side, one can remove the slog nowadays without recreating the pool.

    Again many thanks and I hope you’ve some more time and a dedicated slog to verify my latest finding,

    Frederik

  • Hi Frederik,

I do not have the DDRdrive; however, I can simply use a VT-d attached disk as a slog. If this is a slog txg code issue, it should show up just the same. I am doing the unit testing now; the results will be available shortly.

    Regards,
    Mike

  • Hi Frederik,

    Here are the numbers from using dd in an OI VM:

    148 slog
    4294967296 bytes (4.3 GB) copied, 80.0237 s, 53.7 MB/s async
    4294967296 bytes (4.3 GB) copied, 559.775 s, 7.7 MB/s sync

    148 no slog
    4294967296 bytes (4.3 GB) copied, 76.7171 s, 56.0 MB/s async
    4294967296 bytes (4.3 GB) copied, 226.254 s, 19.0 MB/s sync

    151 slog
    4294967296 bytes (4.3 GB) copied, 76.3189 s, 56.3 MB/s async
    4294967296 bytes (4.3 GB) copied, 559.104 s, 7.7 MB/s sync

    151 no slog
    4294967296 bytes (4.3 GB) copied, 79.7127 s, 53.9 MB/s async
    4294967296 bytes (4.3 GB) copied, 174.932 s, 24.6 MB/s sync

I think this rules out the ZIL code direction.
I think it is more likely that you're dealing with a driver issue.

Keep in mind these tests are not using a fast slog; it's just an external disk over VT-d on an LSI 1068 SAS adapter, and thus is not accelerated.
    The other disk is a local vmdk and it is cached at the ESXi host.

    Regards,
    Mike

  • Frederik says:

Mmmm, not what I expected. Still, many thanks. I'm using dedicated machines, AMD and Intel, SAS and SATA, and see it on both. I did an analysis of the commits and made a shortlist of commits possibly responsible for this. Based on the list I'm rebuilding the code and testing at what point the regression was introduced. Hopefully a single commit can be pinpointed. A pity you don't have the DDRdrive anymore…
I'll keep you posted about any progress.
Again, thank you for your time,
    Frederik

  • Mark Gibbons says:

    Hi

Frederik mentions in this post that he built an iperf binary for the ESX console. I was wondering if he could share that with me. I believe this is a blog rather than a forum, so I do not know if it is possible to contact a user via email. If it's possible, perhaps the moderator/blogger could forward this message to Frederik with my email address so that he could get in touch.

    Many Thanks

    KR’s

    Mark
