Question
· Sep 17, 2019

Data Innovations Instrument Manager Backups on a VM

We are currently implementing the Data Innovations Instrument Manager product. In setting up our backup process, we want to use Veeam snapshots. The application runs in a Caché 2016.1/Windows Server 2016 instance. We are running an HA primary/secondary/arbiter config. The statement below is from DI. I am curious to see what others who have implemented DI Instrument Manager in the same or a similar config have in place for backup.

"DI is recommending that we not perform snapshots, but if you do choose to do so, here is some important information to consider.

Sometimes customers who use mirroring and VM snapshots to do their backup do run into mirror connectivity issues and failovers. This is usually due to network interruption when the VM becomes “stunned” during part of the snapshot backup process and the stun time lasts longer than the time determined by the mirror QoS timeout. 

When you run an ExternalFreeze it freezes updates to the database file on disk but does not stop system activity. Activity continues and updates will be stored in memory.

However, it is possible that system activity will be suspended if one of the following occurs:

1) The system runs out of global buffers for processes to write to. 

2) The length of the suspension is longer than the system default (currently 600 seconds/10 minutes). "
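
The two timing conditions in the quoted statement can be turned into a rough sanity check. This is only a sketch under assumptions: the function name and parameters are made up for illustration, the stun and freeze durations would have to come from your own measurements, and it does not model the global buffer exhaustion case at all.

```python
def snapshot_risks(stun_seconds, qos_timeout_seconds,
                   freeze_seconds, freeze_limit_seconds=600):
    """Return the timing risks described in the DI statement.

    Hypothetical helper, not part of Cache or Veeam. The 600-second
    default mirrors the "system default" quoted above.
    """
    risks = []
    # A stun longer than the mirror QoS timeout can look like a lost
    # mirror member and lead to connectivity issues or a failover.
    if stun_seconds > qos_timeout_seconds:
        risks.append("stun exceeds mirror QoS timeout")
    # A freeze that outlasts the limit can suspend system activity.
    if freeze_seconds > freeze_limit_seconds:
        risks.append("freeze exceeds suspension limit")
    return risks


if __name__ == "__main__":
    # Example: 10 s stun against an 8 s QoS timeout, 45 s frozen.
    print(snapshot_risks(10, 8, 45))
```

Feeding it the stun times reported in the VM logs after a few test snapshots gives a quick indication of whether the current QoS timeout leaves any headroom.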

Discussion (4)

Using Veeam backup/snapshot is very common with Caché and IRIS, and when using the snapshot process there are a few things to be aware of:

1. Make sure you are NOT including the VM's memory state in the snapshot, as this significantly lengthens VM stun times.

2. Make sure you are current with VMware vSphere patches, as there are some known issues with snapshot performance and data consistency in older versions of vSphere. I would recommend being on vSphere 6.7 or later.

3. You need to make sure your journal disk is on a different VMDK than any of your CACHE.DATs and CACHE.WIJ. This matters especially after you thaw the instance, because a large burst of writes may happen and flood/serialize IO on the device, potentially blocking or slowing down journal writes (and triggering a premature mirror failover because of it).

4. You definitely need to use the ExternalFreeze/Thaw APIs to ensure the CACHE.DATs within the snapshot are "clean". 

5. Confirm your current QoS timeout value, as some earlier versions of Caché had a very low QoS value, and with snapshots I believe it should be set to 8 seconds and should not exceed 30 seconds.
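
To illustrate the ordering that point 4 implies, here is a minimal sketch of a freeze/snapshot/thaw wrapper. Everything in it is an assumption for illustration: `freeze`, `take_snapshot` and `thaw` are placeholder callables standing in for however you invoke ExternalFreeze, the Veeam/VMware snapshot, and ExternalThaw in your environment (typically pre-freeze and post-thaw scripts), and the 600-second limit is the default quoted in the question.

```python
import time


def snapshot_with_freeze(freeze, take_snapshot, thaw, max_freeze_seconds=600):
    """Freeze, snapshot, then always thaw; report how long we were frozen.

    All three callables are hypothetical hooks supplied by the caller.
    Returns (elapsed_seconds, within_limit).
    """
    freeze()  # if this raises, no snapshot is attempted
    started = time.monotonic()
    try:
        take_snapshot()
    finally:
        # Thaw even if the snapshot fails, so the instance is never
        # left frozen by a broken backup job.
        thaw()
    elapsed = time.monotonic() - started
    return elapsed, elapsed <= max_freeze_seconds


if __name__ == "__main__":
    steps = []
    elapsed, ok = snapshot_with_freeze(
        lambda: steps.append("freeze"),
        lambda: steps.append("snapshot"),
        lambda: steps.append("thaw"),
    )
    print(steps, ok)
```

The `try`/`finally` is the important part: the thaw has to run no matter what happens during the snapshot, and logging the elapsed time makes it easy to see how close each backup comes to the freeze limit.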

Also the links that Peter mentioned are very good links to reference as well for more details.

After re-reading the excellent articles referenced above, it seems that:
1) Too low a QoS value can be incompatible with VM stun time.
2) Too high a value can be inappropriate as well, for other reasons. E.g., it can postpone a failover when one is really needed because the primary has crashed or become isolated.
So, why not stop bothering about the QoS value and just set No Failover during the snapshot phase? The documentation describes how to do it manually, and it should be possible programmatically as well.