These are some of the things to look out for with MetroCluster; treat them as best practices and recommendations.
Disable change_fsid
One very important configuration change on MetroCluster controllers is to disable the change_fsid option immediately. If it is left enabled, the file system IDs (FSIDs) of all volumes and aggregates change during a site takeover, so clients can no longer reference their volumes without remounting, and LUNs go offline. This is especially critical for LUNs.
To avoid the FSID change in the case of a site takeover, you can set the change_fsid option to off (the default is on). Setting this option to off has the following results if a site takeover is initiated by the cf forcetakeover -d command:
- Data ONTAP refrains from changing the FSIDs of volumes and aggregates.
- Users can continue to access their volumes after site takeover without remounting.
- LUNs remain online.
If you don’t disable the change_fsid option in a MetroCluster configuration, the following happens when the cf forcetakeover -d command is run:
- Data ONTAP changes the file system IDs (FSIDs) of volumes and aggregates because ownership changes.
- Because of the FSID change, clients must remount their volumes if a takeover occurs.
- If using Logical Units (LUNs), the LUNs must also be brought back online after the takeover.
options cf.takeover.change_fsid off
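To audit this setting, one option is to check captured console output with a small script. The following is only a minimal sketch: it assumes the output of the options command has been saved as text and that each line has the form "option name, whitespace, value". The function name and the sample text are hypothetical and used for illustration only.

```python
def change_fsid_is_off(options_output: str) -> bool:
    """Return True only if cf.takeover.change_fsid is reported as 'off'.

    Assumption: each line looks like '<option name> <whitespace> <value>'.
    """
    for line in options_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "cf.takeover.change_fsid":
            return parts[1].lower() == "off"
    # Option not present in the captured text: treat it as not verified.
    return False


# Hypothetical captured output, for illustration only.
sample = "cf.takeover.change_fsid      off"
print(change_fsid_is_off(sample))
```

Returning False when the option is missing from the text is a deliberate choice: an unverified controller is treated the same as a misconfigured one.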
MetroCluster RC file
MetroCluster controllers can use a file called /etc/mcrc to configure partner addresses on different subnets. To do this, you must create a separate /etc/mcrc file and enable the cf.takeover.use_mcrc_file option.
When taking over its partner, the node uses the partner’s /etc/mcrc file to configure partner addresses locally instead of /etc/rc.
For the most part this is not used, but it’s good to know about in case it is required.
Fabric MetroCluster Latency Considerations
A dedicated fiber (dark fiber) link has a round-trip time (RTT) of approximately 1ms for every 100km (~60 miles). Devices en route (for example, multiplexers) might add a small, almost imperceptible amount of latency. Realistically, your SAN switches will be able to handle higher latency than that, but to ensure performance and uptime it’s recommended to keep the latency as close to 1ms as possible.
Generally speaking, as the distance between sites increases, so too does the latency (assuming 100km = 1ms link latency):
- Storage response time increases by the link latency. For example, if storage has a response time of 1.5ms for local access, then over 100km the response time increases by 1ms to 2.5ms.
- Applications may be latency sensitive, and this should be factored into planning for MetroCluster. For example, an application with 5ms latency will see an additional 1ms added for the distance at 100km. While not huge, it may have a knock-on effect on application performance, particularly for highly transactional applications.
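The arithmetic above can be sketched as a quick planning calculation. This is a rough estimate only, assuming the 1ms per 100km rule of thumb and ignoring any latency added by devices en route. The function and constant names are made up for illustration.

```python
# Rule of thumb from the text: ~1 ms round-trip time per 100 km of dark fiber.
RTT_MS_PER_100_KM = 1.0


def link_rtt_ms(distance_km: float) -> float:
    """Estimated round-trip link latency for a given site distance."""
    return distance_km / 100.0 * RTT_MS_PER_100_KM


def remote_response_ms(local_response_ms: float, distance_km: float) -> float:
    """Local response time plus the estimated link latency."""
    return local_response_ms + link_rtt_ms(distance_km)


print(remote_response_ms(1.5, 100))  # 2.5 (storage example from the text)
print(remote_response_ms(5.0, 100))  # 6.0 (application example from the text)
```

Treat the result as a lower bound for planning purposes, since multiplexers and other devices on the path add their own delay.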
MetroCluster Tie-Breaker
MetroCluster Tie-Breaker (MCTB) software sits in observer mode by default. This means that it monitors the sites and alerts on a disaster; it can also send SNMP alerts. MCTB does not perform a failover automatically unless it is configured to do so. The advantage of MCTB is that it can monitor the status of all sites within the MetroCluster and, when its definition of a disaster is met (an entire site is unavailable), run the takeover command automatically. This reduces the time to failover.
However, as an engineer you ideally want to check that it is an actual disaster before a failover occurs, because failback is the trickiest part of MetroCluster and involves a lot of work. You want to be sure before deciding to call a disaster and implement the DR plan. For some people this requires sign-off from someone in the business before failover occurs, which is not possible if Tie-Breaker has already failed over automatically. Also, Tie-Breaker cannot perform a failback.
For these reasons, Tie-Breaker is rarely implemented in practice.