Manual Failover Activity in a NetApp MetroCluster Environment
Today I want to write about manually performing takeover and giveback in a NetApp MetroCluster environment.
In a MetroCluster environment, a site takeover does not work by simply issuing the cf takeover command.
Takeover process.
We need to manually fail the ISL link.
We then need to issue the "cf forcetakeover -d" command.
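A minimal sketch of the sequence, assuming NetAppSiteA is the surviving node (failing the ISL by blocking switch ports is detailed in the step-by-step procedure below):
(fail the ISL between the sites, e.g. by blocking the ISL ports on the site A fabric switches)
NetAppSiteA> cf forcetakeover -d
NetAppSiteA(takeover)>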
Giveback process.
aggr status -r: Validate that you can access the remote storage. If the remote shelves do not show up, check connectivity.
partner: Go into partner mode on the surviving node.
aggr status -r: Determine which aggregates are at the surviving site and which are at the disaster site. Aggregates at the disaster site show plexes in a failed state with an out-of-date status; aggregates at the surviving site show their plexes online.
Note: If aggregates at the disaster site are online, take them offline by entering the following command for each online aggregate: aggr offline disaster_aggr (disaster_aggr is the name of the aggregate at the disaster site). An error message appears if the aggregate is already offline.
Recreate the mirrored aggregates by entering the following command for each aggregate that was split: aggr mirror aggr_name -v disaster_aggr (aggr_name is the aggregate on the surviving site's node; disaster_aggr is the aggregate on the disaster site's node). The aggr_name aggregate rejoins the disaster_aggr aggregate to re-establish the MetroCluster configuration. Caution: Make sure that resynchronization is complete on each aggregate before attempting the next step.
partner: Return to the command prompt of the surviving node.
cf giveback: The node at the disaster site reboots. The full sequence is sketched below.
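Putting the list above together, a minimal sketch of the giveback command sequence on the surviving node, assuming a single split aggregate whose disaster-site half appears as aggr0(1) (the names match the transcript later in this post):
NetAppSiteA(takeover)> aggr status -r
NetAppSiteA(takeover)> partner
NetAppSiteB/NetAppSiteA> aggr status -r
NetAppSiteB/NetAppSiteA> aggr offline aggr0(1)
(only needed if the disaster aggregate is still online)
NetAppSiteB/NetAppSiteA> aggr mirror aggr0 -v aggr0(1)
(wait for the raid.mirror.resync.done message before continuing)
NetAppSiteB/NetAppSiteA> partner
NetAppSiteA(takeover)> cf giveback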
Step-by-step procedure
Description
To test Disaster Recovery, you must restrict access to the disaster site node to prevent the node from resuming service. If you do not, you risk data corruption.
Procedure
Access to the disaster site node can be restricted in the following ways:
- Turn off the power to the disaster site node.
- Use "manual fencing" (disconnect the VI interconnects and Fibre Channel cables).
However, both of these solutions require physical access to the disaster site node, which is not always possible (or practical) for testing purposes. Proceed with the steps below to "fence" the fabric MetroCluster without power loss and to test Disaster Recovery without physical access.
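As an overview, the software fence comes down to disabling the ISL E-Port on both site A fabric switches before the takeover and re-enabling it before resynchronization. Assuming the ISL sits on port 8 of each switch (verify with switchshow, as shown below), the commands are:
Fence (before cf forcetakeover -d):
SITEA02:admin> portdisable 8
SITEA03:admin> portdisable 8
Unfence (before aggregate resynchronization):
SITEA02:admin> portenable 8
SITEA03:admin> portenable 8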
Note: Site A is the takeover site. Site B is the disaster site.
Takeover procedure
- Stop ISL connections between sites.
- Connect to both fabric MetroCluster switches on site A and block all ISL ports. First, retrieve the ISL port number.
SITEA02:admin> switchshow
switchName: SITEA02
switchType: 34.0
switchState: Online
switchMode: Native
switchRole: Principal
switchDomain: 2
switchId: fffc02
switchWwn: 10:00:00:05:1e:05:ca:b1
zoning: OFF
switchBeacon: OFF
Area Port Media Speed State Proto
=====================================
0 0 id N4 Online F-Port 21:00:00:1b:32:1f:ff:66
1 1 id N4 Online F-Port 50:0a:09:82:00:01:d7:40
2 2 id N4 Online F-Port 50:0a:09:80:00:01:d7:40
3 3 id N4 No_Light
4 4 id N4 No_Light
5 5 id N2 Online L-Port 28 public
6 6 id N2 Online L-Port 28 public
7 7 id N2 Online L-Port 28 public
8 8 id N4 Online LE E-Port 10:00:00:05:1e:05:d0:39 "SITEB02" (downstream)
9 9 id N4 No_Light
10 10 id N4 No_Light
11 11 id N4 No_Light
12 12 id N4 No_Light
13 13 id N2 Online L-Port 28 public
14 14 id N2 Online L-Port 28 public
15 15 id N4 No_Light
- Check the fabric before blocking the ISL port. In the switchshow output above, port 8 shows an E-Port to "SITEB02"; this is the ISL port to block.
SITEA02:admin> fabricshow
Switch ID Worldwide Name Enet IP Addr FC IP Addr Name
-------------------------------------------------------------------------
1: fffc01 10:00:00:05:1e:05:d0:39 44.55.104.20 0.0.0.0 "SITEB02"
2: fffc02 10:00:00:05:1e:05:ca:b1 44.55.104.10 0.0.0.0 >"SITEA02"
The Fabric has 2 switches
- Disable the ISL port.
SITEA02:admin> portdisable 8
- Check the split of the fabric.
SITEA02:admin> fabricshow
Switch ID Worldwide Name Enet IP Addr FC IP Addr Name
-----------------------------------------------------------------------
2: fffc02 10:00:00:05:1e:05:ca:b1 44.55.104.10 0.0.0.0 >"SITEA02"
- Do the same thing on the second switch.
SITEA03:admin> switchshow
switchName: SITEA03
switchType: 34.0
switchState: Online
switchMode: Native
switchRole: Principal
switchDomain: 4
switchId: fffc04
switchWwn: 10:00:00:05:1e:05:d2:90
zoning: OFF
switchBeacon: OFF
Area Port Media Speed State Proto
=====================================
0 0 id N4 Online F-Port 21:01:00:1b:32:3f:ff:66
1 1 id N4 Online F-Port 50:0a:09:83:00:01:d7:40
2 2 id N4 Online F-Port 50:0a:09:81:00:01:d7:40
3 3 id N4 No_Light
4 4 id N4 No_Light
5 5 id N2 Online L-Port 28 public
6 6 id N2 Online L-Port 28 public
7 7 id N2 Online L-Port 28 public
8 8 id N4 Online LE E-Port 10:00:00:05:1e:05:d1:c3 "SITEB03" (downstream)
9 9 id N4 No_Light
10 10 id N4 No_Light
11 11 id N4 No_Light
12 12 id N4 No_Light
13 13 id N2 Online L-Port 28 public
14 14 id N2 Online L-Port 28 public
15 15 id N4 No_Light
SITEA03:admin> fabricshow
Switch ID Worldwide Name Enet IP Addr FC IP Addr Name
-----------------------------------------------------------------------
3: fffc03 10:00:00:05:1e:05:d1:c3 44.55.104.21 0.0.0.0 "SITEB03"
4: fffc04 10:00:00:05:1e:05:d2:90 44.55.104.11 0.0.0.0 >"SITEA03"
The Fabric has 2 switches
SITEA03:admin> portdisable 8
SITEA03:admin> fabricshow
Switch ID Worldwide Name Enet IP Addr FC IP Addr Name
-----------------------------------------------------------------------
4: fffc04 10:00:00:05:1e:05:d2:90 44.55.104.11 0.0.0.0 >"SITEA03"
- Check the NetApp controller console for missing disks.
Tue Feb 5 16:21:37 CET [NetAppSiteA: raid.config.spare.disk.missing:info]: Spare Disk SITEB03:6.23 Shelf 1 Bay 7 [NETAPP X276_FAL9E288F10 NA02] S/N [DH07P7803V7L] is missing.
- Check that all aggregates are split.
NetAppSiteA> aggr status -r
Aggregate aggr0 (online, raid_dp, mirror degraded) (block checksums)
Plex /aggr0/plex0 (online, normal, active, pool0)
RAID group /aggr0/plex0/rg0 (normal)
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
---------------------------------------------------------------------------------------
dparity SITEA03:5.16 0b 1 0 FC:B 0 FCAL 10000 272000/557056000 280104/573653840
parity SITEA02:5.32 0c 2 0 FC:A 0 FCAL 10000 272000/557056000 280104/573653840
data SITEA03:6.16 0d 1 0 FC:B 0 FCAL 10000 272000/557056000 280104/573653840
Plex /aggr0/plex1 (offline, failed, inactive, pool1)
RAID group /aggr0/plex1/rg0 (partial)
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
---------------------------------------------------------------------------------
dparity FAILED N/A 272000/557056000
parity FAILED N/A 272000/557056000
data FAILED N/A 272000/557056000
Raid group is missing 3 disks.
NetAppSiteB> aggr status -r
Aggregate aggr0 (online, raid_dp, mirror degraded) (block checksums)
Plex /aggr0/plex0 (online, normal, active, pool0)
RAID group /aggr0/plex0/rg0 (normal)
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
------------------------------------------------------------------------------------------
dparity SITEB03:13.17 0d 1 1 FC:B 0 FCAL 10000 272000/557056000 280104/573653840
parity SITEB03:13.32 0b 2 0 FC:B 0 FCAL 10000 272000/557056000 280104/573653840
data SITEB02:14.16 0a 1 0 FC:A 0 FCAL 10000 272000/557056000 280104/573653840
Plex /aggr0/plex1 (offline, failed, inactive, pool1)
RAID group /aggr0/plex1/rg0 (partial)
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------------------------------------------------------------------------------------
dparity FAILED N/A 272000/557056000
parity FAILED N/A 272000/557056000
data FAILED N/A 272000/557056000
Raid group is missing 3 disks.
- Connect to the Remote LAN Management (RLM) console on site B. Stop and power off the NetApp controller.
NetAppSiteB> halt
Boot Loader version 1.2.3
Copyright (C) 2000,2001,2002,2003 Broadcom Corporation.
Portions Copyright (C) 2002-2006 NetApp Inc.
CPU Type: Dual Core AMD Opteron(tm) Processor 265
LOADER>
- Power off the NetApp controller.
LOADER>
Ctrl-d
RLM NetAppSiteB> system power off
This will cause a dirty shutdown of your appliance. Continue? [y/n]
RLM NetAppSiteB> system power status
Power supply 1 status:
Present: yes
Turned on by Agent: no
Output power: no
Input power: yes
Fault: no
Power supply 2 status:
Present: yes
Turned on by Agent: no
Output power: no
Input power: yes
Fault: no
4. Now you can test Disaster Recovery.
NetAppSiteA> cf forcetakeover -d
----
NetAppSiteA(takeover)>
NetAppSiteA(takeover)> aggr status -v
Aggr State Status Options
aggr0 online raid_dp, aggr root, diskroot, nosnap=off,
mirror degraded raidtype=raid_dp, raidsize=16,
ignore_inconsistent=off,
snapmirrored=off,
resyncsnaptime=60,
fs_size_fixed=off,
snapshot_autodelete=on,
lost_write_protect=on
Volumes: vol0
Plex /aggr0/plex0: online, normal, active
RAID group /aggr0/plex0/rg0: normal
Plex /aggr0/plex1: offline, failed, inactive
NetAppSiteB/NetAppSiteA> aggr status -v
Aggr State Status Options
aggr0 online raid_dp, aggr root, diskroot, nosnap=off,
raidtype=raid_dp, raidsize=16,
ignore_inconsistent=off,
snapmirrored=off,
resyncsnaptime=60,
fs_size_fixed=off,
snapshot_autodelete=on,
lost_write_protect=on
Volumes: vol0
Plex /aggr0/plex1: online, normal, active
RAID group /aggr0/plex1/rg0: normal
Giveback procedure
5. After testing Disaster Recovery, unblock all ISL ports.
SITEA03:admin> portenable 8
- Wait a while for fabric initialization.
SITEA03:admin> fabricshow
Switch ID Worldwide Name Enet IP Addr FC IP Addr Name
-----------------------------------------------------------------------------------------------------------
3: fffc03 10:00:00:05:1e:05:d1:c3 44.55.104.21 0.0.0.0 "SITEB03"
4: fffc04 10:00:00:05:1e:05:d2:90 44.55.104.11 0.0.0.0 >"SITEA03"
The Fabric has 2 switches
SITEA02:admin> portenable 8
- Wait a while for fabric initialization.
SITEA02:admin> fabricshow
Switch ID Worldwide Name Enet IP Addr FC IP Addr Name
-------------------------------------------------------------------------
1: fffc01 10:00:00:05:1e:05:d0:39 44.55.104.20 0.0.0.0 "SITEB02"
2: fffc02 10:00:00:05:1e:05:ca:b1 44.55.104.10 0.0.0.0 >"SITEA02"
The Fabric has 2 switches
6. Synchronize all aggregates.
NetAppSiteB/NetAppSiteA> aggr status -v
Aggr State Status Options
aggr0(1) failed raid_dp, aggr diskroot, raidtype=raid_dp,
out-of-date raidsize=16, resyncsnaptime=60,
lost_write_protect=off
Volumes:
Plex /aggr0(1)/plex0: offline, normal, out-of-date
RAID group /aggr0(1)/plex0/rg0: normal
Plex /aggr0(1)/plex1: offline, failed, out-of-date
aggr0 online raid_dp, aggr root, diskroot, nosnap=off,
raidtype=raid_dp, raidsize=16,
ignore_inconsistent=off,
snapmirrored=off,
resyncsnaptime=60,
fs_size_fixed=off,
snapshot_autodelete=on,
lost_write_protect=on
Volumes: vol0
Plex /aggr0/plex1: online, normal, active
RAID group /aggr0/plex1/rg0: normal
- Launch an aggregate mirror for each split aggregate.
NetAppSiteB/NetAppSiteA> aggr mirror aggr0 -v aggr0(1)
- Wait a while for all aggregates to synchronize.
[NetAppSiteB/NetAppSiteA: raid.mirror.resync.done:notice]: /aggr0: resynchronization completed in 0:03.36
NetAppSiteB/NetAppSiteA> aggr status -v
Aggr State Status Options
aggr0 online raid_dp, aggr root, diskroot, nosnap=off,
mirrored raidtype=raid_dp, raidsize=16,
ignore_inconsistent=off,
snapmirrored=off,
resyncsnaptime=60,
fs_size_fixed=off,
snapshot_autodelete=on,
lost_write_protect=on
Volumes: vol0
Plex /aggr0/plex1: online, normal, active
RAID group /aggr0/plex1/rg0: normal
Plex /aggr0/plex3: online, normal, active
RAID group /aggr0/plex3/rg0: normal
7. After resynchronization is done, power on and boot the NetApp controller on site B.
RLM NetAppSiteB> system power on
RLM NetAppSiteB> system console
Type Ctrl-D to exit.
Boot Loader version 1.2.3
Copyright (C) 2000,2001,2002,2003 Broadcom Corporation.
Portions Copyright (C) 2002-2006 NetApp Inc.
NetApp Release 7.2.3: Sat Oct 20 17:27:02 PDT 2007
Copyright (c) 1992-2007 NetApp, Inc.
Starting boot on Tue Feb 5 15:37:40 GMT 2008
Tue Feb 5 15:38:31 GMT [ses.giveback.wait:info]: Enclosure Services will be unavailable while waiting for giveback.
Press Ctrl-C for Maintenance menu to release disks.
Waiting for giveback
- On site A, execute cf giveback.
NetAppSiteA(takeover)> cf status
NetAppSiteA has taken over NetAppSiteB.
NetAppSiteB is ready for giveback.
NetAppSiteA(takeover)> cf giveback
please make sure you have rejoined your aggr before giveback.
Do you wish to continue [y/n] ?? y
NetAppSiteA> cf status
Tue Feb 5 16:41:00 CET [NetAppSiteA: monitor.globalStatus.ok:info]: The system's global status is normal.
Cluster enabled, NetAppSiteB is up.