Controlling Snapshot Noise

The ability to perform file system, database and volume snapshots grants us many data protection benefits. However there are some serious problems that can occur if we do not carefully architect snapshot based storage infrastructures. This blog entry will discuss some of the issues with data noise induction and data integrity when using point in time data snapshot activities  and how we can reduce the negative aspects of these data protection methods.

With the emergence of snapshot technology in the data center data noise induction is an unwanted by product and needs to addressed. Active data within a file store or raw volume will have significant amounts of temporary data like  memory swaps and application temp files. This type of data is required for system operations but unfortunately it is completely useless within a point in time snapshot and simply consumes valuable storage space with no permanent value within the scope of system data protection. There are many sources of this undesirable data noise that we need to consider and define strategies to isolate and eliminate them where possible.

In some cases using raw stores eg. iSCSI, FC etc. we will have duplicate snapshot functionality points in the storage stream and this further complicates how we approach a solution to noise induction issues. One common example of snapshot functionality duplication is Microsoft Windows 2003 Volume Snapshot Service aka. VSS. If we enable VSS and an external snapshot service is employed then we are now provisioning snapshots of snapshots which of course is less than optimal because much of the delta between points in time are just redundant encapsulated data. There are some higher level advantages to allowing this to occur like provisioning self service end user restores and VSS aware file system quiescence but for the most part it is not optimal from a space consumption or performance and efficiency  perspective.

If we perform snapshots at multiple points in the data storage stream using VSS we will have three points of data delta. The changed data elements on the source store of the primary files, a copy-on-write set of  changed blocks of the same primary store including it’s meta data and finally the external snapshot delta and it’s meta data. As well if the two snapshot events were to occur at the same time it creates a non-integral copy of a snapshot and meta data which is just pure wasted space since it is inconsistent.

With the co-existing use of VSS we need to define what functionality is important to us. For example VSS is limited to 512 snapshots and 64 versions so if we need to exceed these limits we have to employ an external snapshot facility. Or perhaps we need to allow a user self service file restore functionality from a shared folder. In many cases this functionality is provided by the external snapshot provisioning point. OpenSolaris, EMC and NetApp are some examples of storage products that can provide such functionality. Of course my preference is Custom OpenSolaris storage servers or the S7000 series of storage product which is based on OpenSolaris and is well suited for the formally supported side of things.

Solely provisioning the snapshots externally verses native MS VSS can significantly reduce induced data noise if the external provider supports VSS features or provides tools to control VSS. VSS copy on write snapshot capability should not be active when using this strategy so as to eliminate the undesirable snapshot echo noise. Most environments will find that they require the use of snapshot services that exceed the native MS VSS capabilities.  Provisioning the snapshot function directly on a shared storage system is a significantly better strategy verses allowing a distributed deployment of storage management points across your infrastructure.

OpenSolaris and ZFS provides superior depth in snapshot provisioning than Microsoft shared folder snapshot copy services. Implementing ZFS dramatically reduces space consumption and allows snapshots up to the maximum capacity of the storage system and OpenSolaris provides MS SMB client access to the snapshots which users can manage recovery as a self service. By employing ZFS as a backing store, snapshot management is simplified and snapshots are  available for export to alternate share points by cloning and provisioning the point in time to data consumers performing a multitude of desirable tasks such as audits, validation, analysis and testing.

If we need to employ MS VSS snapshot services provisioned on a storage server that uses snapshot based data protection strategies  then we will need prevent  continuous snapshots on the storage server. We can use features like snap mirror and zfs replication to provision a replica of the data however this would need to be strictly limited in count e.g. 2 or 3 versions and timed to avoid a multiple system snapshot time collision. Older snapshots should be purged and we only allow the MS VSS snapshot provisioning to keep the data deltas.

Another common source of snapshot noise is temporary file data or memory swaps. Fortunately with this type of noise the solution is relatively easy to solve as we simply isolate this type of storage onto storage volumes or shares that are explicitly excluded from a snapshot service function. For example if we are using VMFS stores we can place vswp files on a designated VMFS volume and conversely within an operating system we can create a separate vmdk disk that maps to a VMFS volume which we also exclude from the snapshot function. This strategy requires that we ensure that any replication scheme incorporates the creation or one time replication of these volumes. Unfortunately this methodology does not play well with storage vmotion so one must ensure that the a relocation does not move the noisy vmdk’s back into the snapshot service provisioned stores.

VMware VMFS volume VM snapshots is a significant source of data noise. When a snapshot is initiated from within VMware all data writes are placed on delta file instances. These delta files will be captured on the external storage systems snapshot points and will remain there after the VM snapshot is removed. Significant amount of data delta are produced by VM based snapshots and sometimes mulitple deltas can exceed original vmdk size. An easy way to prevent this undesirable impact is to clone the VM to a store outside the snapshot provisioned stores rather than invoking snapshots.

Databases are probably the most challenging source of snapshot noise data and requires a different strategy than isolation because the data within a specific snapshot is all required to provide system integrity. For example we cannot isolate SQL log data because it is required to do a crash recovery or to roll forward etc.  We can isolate temp database stores since any data in those date stores would not be valid in a crash recovery state.

One strategy that I use as both a blanket method when we are not are able to use other methods and in concert with the previously discussed isolation methods is a snapshot roll-up function. This strategy simply reduces the number of long term snapshot copies that are kept. The format is based on a Grand Father, Father and Son (GFS) retention chain of the snapshot copies and is well suited for a variety of data types. The effect is to provide a reasonable amount of data protection points to satisfy most computing environments and keep the captured noise to a manageable value. For example if we were to snapshot without any management cycle every 15 minutes we would accumulate ~35,000 delta points of data over the period of 1 year. Conversely if we employ the GFS method we will accumulate 294 delta points of data over the period of 1 year. Obviously the consumption of storage resource is so greatly reduced that  we could keep many additional key points in time if we wished and still maintain a balance of recovery point verses consumption rate.

Let’s take a look at a simple real example of how snapshot noise can impact our storage system using VMware, OpenSolaris and ZFS block based iSCSI volume snapshots. In this example we have a simple Windows Vista VM that is sitting idle, in other words only the OS is loaded and it is power on and running.

First we take a ZFS snapshot of the VMFS ZFS iSCSI volume.

zfs snapshot sp1/ss1-vol0@beforevmsnap

Now we invoke a VMware based snapshot and have a look at the result.

root@ss1:~# zfs list -t snapshot
NAME                               USED  AVAIL  REFER  MOUNTPOINT
sp1/ss1-vol0@beforevmsnap          228M      -  79.4G  -

Keep in mind that we are not modifying any data files in this VM if we were to change data files the deltas would be much larger and in many cases with multiple VMware snapshots could exceed the VMs original size if it is allowed to remain for long periods of weeks and longer. The backing store snapshot initially consumes 228MB which will continue to grow as changes occurs on the volume. A significant part of the 228MB is the VMs memory image in this case and of course it has no permanent storage value.

sp1/ss1-vol0@after1stvmsnap          1.44M      -  79.5G  -

After the initial VMware snapshot occurs we create a new point in time ZFS snapshot and here we observe some noise in the next snapshot and again we have not changed any data files in the last minute or so.

sp1/ss1-vol0@after2ndvmsnap       1.78M      -  79.5G  -

And yet another ZFS snapshot a couple of minutes later shows more snapshot noise accumulation. This is one of the many issues that are present when we allow non-discretionary placement of files and temporary storage on snapshot based systems.

Now lets see the impact of destroying a snapshot that was created after we delete the VMware based snapshot.

root@ss1:~# zfs destroy sp1/ss1-vol0@beforevmsnap
root@ss1:~# zfs list -t snapshot
NAME                               USED  AVAIL  REFER  MOUNTPOINT
sp1/ss1-vol0@after1stvmsnap       19.6M      -  79.2G  -
sp1/ss1-vol0@after2ndvmsnap       1.78M      -  79.2G  -

Here we observe the reclamation of more than 200MB of storage. And this is why GFS based snapshot rollups can provide some level of noise control.

Well I hope you found this entry to be useful.

Til next time..





Site Contents: © 2009  Mike La Spina