vipulvajpayee blog: Manually Failover activity in NetApp Metro cluster Environment

Manually Failover activity in NetApp Metro cluster Environment

Today I want to write about manually performing the takeover and giveback activity in a NetApp MetroCluster environment.

In a MetroCluster environment, the takeover does not work just by issuing the cf takeover command.

Takeover process.

We need to manually fail the ISL link.

We need to issue the "cf forcetakeover -d" command.

Giveback process.

aggr status -r : Validate that you can access the remote storage. If the remote shelves don’t show up, check connectivity.

partner : Go into partner mode on the surviving node.

aggr status -r : Determine which aggregates are at the surviving site and which aggregates are at the disaster site.

Aggregates at the disaster site show plexes that are in a failed state with an out-of-date status. Aggregates at the surviving site show the plex online.

Note: If aggregates at the disaster site are online, take them offline by entering the following command for each online aggregate: aggr offline disaster_aggr (disaster_aggr is the name of the aggregate at the disaster site).

Note: An error message appears if the aggregate is already offline.

Recreate the mirrored aggregates by entering the following command for each aggregate that was split: "aggr mirror aggr_name -v disaster_aggr" (aggr_name is the aggregate on the surviving site’s node; disaster_aggr is the aggregate on the disaster site’s node). The aggr_name aggregate rejoins the disaster_aggr aggregate to reestablish the MetroCluster configuration.

Caution: Make sure that resynchronization is complete on each aggregate before attempting the following step.

partner : Return to the command prompt of the remote node.

cf giveback : The node at the disaster site reboots.
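The rejoin part of the giveback process above can be laid out as an ordered command list. The Python sketch below is only an illustration: the function name rejoin_commands and the idea of building the commands from (surviving, disaster) aggregate pairs are assumptions of this example, not something Data ONTAP provides. It only builds strings and does not connect to a filer.

```python
def rejoin_commands(split_aggregates):
    """Build the partner-mode commands that rejoin split aggregates.

    split_aggregates is a list of (surviving_aggr, disaster_aggr) name
    pairs, as identified by hand with "aggr status -r".
    """
    commands = ["partner"]  # enter partner mode on the surviving node
    for surviving_aggr, disaster_aggr in split_aggregates:
        commands.append(f"aggr mirror {surviving_aggr} -v {disaster_aggr}")
    commands.append("partner")  # return to the normal prompt
    return commands


# Example pair: aggr0 on the surviving side rejoins the out-of-date aggr0(1).
for command in rejoin_commands([("aggr0", "aggr0(1)")]):
    print(command)
```

"cf giveback" is deliberately left out of the generated list, because it should only be entered once resynchronization has completed on every aggregate.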

Step by step Procedure

Description

To test Disaster Recovery, you must restrict access to the disaster site node to prevent the node from resuming service. If you do not, you risk the possibility of data corruption.

Procedure

Access to the disaster site node can be restricted in the following ways:

Turn off the power to the disaster site node.

Or
Use "manual fencing" (Disconnect VI interconnects and fiber channel cables).

However, both of these solutions require physical access to the disaster site node, which is not always possible (or practical) for testing purposes.

Use the steps below to "fence" the fabric MetroCluster without power loss and to test Disaster Recovery without physical access.

Note: Site A is the takeover site. Site B is the disaster site.

Takeover procedure

1. Stop ISL connections between sites.

Connect to both fabric MetroCluster switches on site A and block all ISL ports. First, retrieve the ISL port number.

SITEA02:admin> switchshow
switchName:     SITEA02
switchType:     34.0
switchState:    Online
switchMode:     Native
switchRole:     Principal
switchDomain:   2
switchId:       fffc02
switchWwn:      10:00:00:05:1e:05:ca:b1
zoning:         OFF
switchBeacon:   OFF

Area Port Media Speed State     Proto
=====================================
0   0   id    N4   Online    F-Port 21:00:00:1b:32:1f:ff:66
1   1   id    N4   Online    F-Port 50:0a:09:82:00:01:d7:40
2   2   id    N4   Online    F-Port 50:0a:09:80:00:01:d7:40
3   3   id    N4   No_Light
4   4   id    N4   No_Light
5   5   id    N2   Online    L-Port 28 public
6   6   id    N2   Online    L-Port 28 public
7   7   id    N2   Online    L-Port 28 public
8   8   id    N4   Online    LE E-Port 10:00:00:05:1e:05:d0:39 "SITEB02" (downstream)
9   9   id    N4   No_Light
10 10   id    N4   No_Light
11 11   id    N4   No_Light
12 12   id    N4   No_Light
13 13   id    N2   Online    L-Port 28 public
14 14   id    N2   Online    L-Port 28 public
15 15   id    N4   No_Light
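On a switch with many ports, the ISL is the port whose line shows an E-Port. As a rough sketch, the Python snippet below picks those port numbers out of saved switchshow text. The helper name find_isl_ports and the regular expression are assumptions made for this example, based only on the column layout shown above; other Fabric OS versions may print the columns differently.

```python
import re

# Area, Port, Media, Speed, State, then the rest of the line (Proto column).
PORT_LINE = re.compile(r"^\s*(\d+)\s+(\d+)\s+\S+\s+\S+\s+(\S+)\s*(.*)$")


def find_isl_ports(switchshow_text):
    """Return the port numbers of online ports reported as E-Port."""
    ports = []
    for line in switchshow_text.splitlines():
        match = PORT_LINE.match(line)
        if not match:
            continue  # header lines, separators, switch attributes
        port = int(match.group(2))
        state = match.group(3)
        proto = match.group(4)
        if state == "Online" and "E-Port" in proto:
            ports.append(port)
    return ports


sample = """\
Area Port Media Speed State     Proto
=====================================
1   1   id    N4   Online    F-Port 50:0a:09:82:00:01:d7:40
3   3   id    N4   No_Light
5   5   id    N2   Online    L-Port 28 public
8   8   id    N4   Online    LE E-Port 10:00:00:05:1e:05:d0:39 "SITEB02" (downstream)
"""
print(find_isl_ports(sample))
```

For the sample lines this reports port 8, which is the port disabled in the next steps.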

Check fabric before blocking the ISL port.

SITEA02:admin> fabricshow
Switch ID   Worldwide Name         Enet IP Addr   FC IP Addr Name
-------------------------------------------------------------------------
1: fffc01 10:00:00:05:1e:05:d0:39 44.55.104.20   0.0.0.0     "SITEB02"
2: fffc02 10:00:00:05:1e:05:ca:b1 44.55.104.10   0.0.0.0   >"SITEA02"

The Fabric has 2 switches

Disable the ISL port.

SITEA02:admin> portdisable 8

Check split of the fabric.

SITEA02:admin> fabricshow
Switch ID Worldwide Name Enet IP Addr FC IP Addr Name
-----------------------------------------------------------------------
2: fffc02 10:00:00:05:1e:05:ca:b1 44.55.104.10 0.0.0.0 >"SITEA02"

Do the same thing on the second switch.

SITEA03:admin> switchshow
switchName:     SITEA03
switchType:     34.0
switchState:    Online
switchMode:     Native
switchRole:     Principal
switchDomain:   4
switchId:       fffc04
switchWwn:      10:00:00:05:1e:05:d2:90
zoning:         OFF
switchBeacon:   OFF

Area Port Media Speed State     Proto
=====================================
0   0   id    N4   Online     F-Port 21:01:00:1b:32:3f:ff:66
1   1   id    N4   Online     F-Port 50:0a:09:83:00:01:d7:40
2   2   id    N4   Online     F-Port 50:0a:09:81:00:01:d7:40
3   3   id    N4   No_Light
4   4   id    N4   No_Light
5   5   id    N2   Online     L-Port 28 public
6   6   id    N2   Online     L-Port 28 public
7   7   id    N2   Online     L-Port 28 public
8   8   id    N4   Online     LE E-Port 10:00:00:05:1e:05:d1:c3 "SITEB03" (downstream)
9   9   id    N4   No_Light
10 10   id    N4   No_Light
11 11   id    N4   No_Light
12 12   id    N4   No_Light
13 13   id    N2   Online     L-Port 28 public
14 14   id    N2   Online     L-Port 28 public
15 15   id    N4   No_Light

SITEA03:admin> fabricshow
Switch ID   Worldwide Name          Enet IP Addr  FC IP Addr Name
-----------------------------------------------------------------------
3: fffc03 10:00:00:05:1e:05:d1:c3 44.55.104.21  0.0.0.0    "SITEB03"
4: fffc04 10:00:00:05:1e:05:d2:90 44.55.104.11  0.0.0.0    >"SITEA03"

The Fabric has 2 switches

SITEA03:admin> portdisable 8
SITEA03:admin> fabricshow
Switch ID   Worldwide Name          Enet IP Addr  FC IP Addr Name
-----------------------------------------------------------------------
4: fffc04 10:00:00:05:1e:05:d2:90 44.55.104.11  0.0.0.0    >"SITEA03"
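The fabric counts as split when fabricshow on a site A switch lists only that switch. A small Python sketch of that check follows; the function names fabric_members and fabric_is_split are hypothetical, and the row pattern is an assumption based on the fabricshow output shown in this post.

```python
import re

# Row such as:  2: fffc02 10:00:00:05:1e:05:ca:b1 44.55.104.10 0.0.0.0 >"SITEA02"
ROW = re.compile(
    r'^\s*\d+:\s+\S+\s+(?:[0-9a-f]{2}:){7}[0-9a-f]{2}\s+\S+\s+\S+\s+>?"([^"]+)"'
)


def fabric_members(fabricshow_text):
    """Return the switch names listed in fabricshow output, in order."""
    names = []
    for line in fabricshow_text.splitlines():
        match = ROW.match(line)
        if match:
            names.append(match.group(1))
    return names


def fabric_is_split(fabricshow_text, local_switch):
    """True when the local switch is the only member left in its fabric."""
    return fabric_members(fabricshow_text) == [local_switch]


before = """\
3: fffc03 10:00:00:05:1e:05:d1:c3 44.55.104.21  0.0.0.0    "SITEB03"
4: fffc04 10:00:00:05:1e:05:d2:90 44.55.104.11  0.0.0.0    >"SITEA03"
"""
after = """\
4: fffc04 10:00:00:05:1e:05:d2:90 44.55.104.11  0.0.0.0    >"SITEA03"
"""
print(fabric_is_split(before, "SITEA03"), fabric_is_split(after, "SITEA03"))
```

The same check can be repeated with the output of the other site A switch.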

Check the NetApp controller console for missing-disk messages.

Tue Feb 5 16:21:37 CET [NetAppSiteA: raid.config.spare.disk.missing:info]: Spare Disk SITEB03:6.23 Shelf 1 Bay 7 [NETAPP X276_FAL9E288F10 NA02] S/N [DH07P7803V7L] is missing.

2. Check that all aggregates are split.

NetAppSiteA> aggr status -r
Aggregate aggr0 (online, raid_dp, mirror degraded) (block checksums)
Plex /aggr0/plex0 (online, normal, active, pool0)
RAID group /aggr0/plex0/rg0 (normal)

RAID Disk Device     HA SHELF BAY CHAN Pool Type RPM Used (MB/blks)   Phys (MB/blks)
---------------------------------------------------------------------------------------
dparity SITEA03:5.16 0b  1     0   FC:B 0 FCAL 10000 272000/557056000 280104/573653840
parity  SITEA02:5.32 0c  2     0   FC:A 0 FCAL 10000 272000/557056000 280104/573653840
data    SITEA03:6.16 0d  1     0   FC:B 0 FCAL 10000 272000/557056000 280104/573653840

Plex /aggr0/plex1 (offline, failed, inactive, pool1)
RAID group /aggr0/plex1/rg0 (partial)

RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
---------------------------------------------------------------------------------
dparity   FAILED                     N/A            272000/557056000
parity    FAILED                     N/A            272000/557056000
data      FAILED                     N/A            272000/557056000
Raid group is missing 3 disks.

NetAppSiteB> aggr status -r
Aggregate aggr0 (online, raid_dp, mirror degraded) (block checksums)
Plex /aggr0/plex0 (online, normal, active, pool0)
RAID group /aggr0/plex0/rg0 (normal)

RAID Disk Device        HA SHELF BAY CHAN Pool Type RPM Used (MB/blks)   Phys (MB/blks)
------------------------------------------------------------------------------------------
dparity   SITEB03:13.17 0d   1   1   FC:B 0  FCAL 10000 272000/557056000 280104/573653840
parity    SITEB03:13.32 0b   2   0 FC:B 0  FCAL 10000 272000/557056000 280104/573653840
data      SITEB02:14.16 0a   1   0   FC:A 0  FCAL 10000 272000/557056000 280104/573653840

Plex /aggr0/plex1 (offline, failed, inactive, pool1)
RAID group /aggr0/plex1/rg0 (partial)

RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
---------------------------------------------------------------------------------
dparity   FAILED                     N/A            272000/557056000
parity    FAILED                     N/A            272000/557056000
data      FAILED                     N/A            272000/557056000
Raid group is missing 3 disks.
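With more than one aggregate, reading the plex states by eye gets tedious. The following Python sketch reads saved "aggr status -r" text and reports whether an aggregate looks split, meaning it is flagged "mirror degraded" with exactly one plex online and at least one plex failed. The function names parse_plexes and is_split, and that definition of "split", are assumptions made for this example.

```python
import re

AGGR_LINE = re.compile(r"^\s*Aggregate (\S+) \(([^)]*)\)")
PLEX_LINE = re.compile(r"^\s*Plex (\S+) \(([^)]*)\)")


def parse_plexes(aggr_status_text):
    """Map each aggregate name to its flags and the flags of its plexes."""
    aggregates = {}
    current = None
    for line in aggr_status_text.splitlines():
        aggr_match = AGGR_LINE.match(line)
        if aggr_match:
            current = aggr_match.group(1)
            flags = [flag.strip() for flag in aggr_match.group(2).split(",")]
            aggregates[current] = {"flags": flags, "plexes": {}}
            continue
        plex_match = PLEX_LINE.match(line)
        if plex_match and current is not None:
            flags = [flag.strip() for flag in plex_match.group(2).split(",")]
            aggregates[current]["plexes"][plex_match.group(1)] = flags
    return aggregates


def is_split(aggregate):
    """True for a degraded mirror with one online and one failed plex."""
    plex_flags = list(aggregate["plexes"].values())
    online = [flags for flags in plex_flags if "online" in flags]
    failed = [flags for flags in plex_flags if "failed" in flags]
    return (
        "mirror degraded" in aggregate["flags"]
        and len(online) == 1
        and len(failed) >= 1
    )


sample = """\
Aggregate aggr0 (online, raid_dp, mirror degraded) (block checksums)
Plex /aggr0/plex0 (online, normal, active, pool0)
RAID group /aggr0/plex0/rg0 (normal)
Plex /aggr0/plex1 (offline, failed, inactive, pool1)
RAID group /aggr0/plex1/rg0 (partial)
"""
for name, aggregate in parse_plexes(sample).items():
    print(name, "split" if is_split(aggregate) else "not split")
```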

3. Connect to the Remote LAN Module (RLM) console on site B, then stop and power off the NetApp controller.

NetAppSiteB> halt
Boot Loader version 1.2.3
Copyright (C) 2000,2001,2002,2003 Broadcom Corporation.
Portions Copyright (C) 2002-2006 NetApp Inc.

CPU Type: Dual Core AMD Opteron(tm) Processor 265
LOADER>

Power off the NetApp controller.

LOADER>
Ctrl-d
RLM NetAppSiteB> system power off
This will cause a dirty shutdown of your appliance. Continue? [y/n]

RLM NetAppSiteB> system power status
Power supply 1 status:
   Present: yes
   Turned on by Agent: no
   Output power: no
   Input power: yes
   Fault: no
Power supply 2 status:
   Present: yes
   Turned on by Agent: no
   Output power: no
   Input power: yes
   Fault: no
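Before forcing the takeover, it is worth confirming that every power supply reports no output power. As an illustration only, the Python sketch below extracts the "Output power" value per supply from saved "system power status" text. The function names are hypothetical, and the parsing assumes the layout shown above.

```python
def output_power_states(power_status_text):
    """Map each 'Power supply N' heading to its 'Output power' value."""
    states = {}
    current = None
    for line in power_status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Power supply") and stripped.endswith("status:"):
            current = stripped[: -len(" status:")]
        elif current is not None and stripped.startswith("Output power:"):
            states[current] = stripped.split(":", 1)[1].strip()
    return states


def controller_is_off(power_status_text):
    """True when at least one supply is listed and none delivers output power."""
    states = output_power_states(power_status_text)
    return bool(states) and all(value == "no" for value in states.values())


sample = """\
Power supply 1 status:
   Present: yes
   Turned on by Agent: no
   Output power: no
   Input power: yes
   Fault: no
Power supply 2 status:
   Present: yes
   Turned on by Agent: no
   Output power: no
   Input power: yes
   Fault: no
"""
print(output_power_states(sample))
print(controller_is_off(sample))
```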

4. Now you can test Disaster Recovery.

NetAppSiteA> cf forcetakeover -d
----
NetAppSiteA(takeover)>

NetAppSiteA(takeover)> aggr status -v
Aggr State     Status          Options
aggr0 online raid_dp, aggr    root, diskroot, nosnap=off,
               mirror degraded  raidtype=raid_dp, raidsize=16,
                                ignore_inconsistent=off,
                                snapmirrored=off,
                                resyncsnaptime=60,
                                fs_size_fixed=off,
                                snapshot_autodelete=on,
                                lost_write_protect=on
                Volumes: vol0

                Plex /aggr0/plex0: online, normal, active
                    RAID group /aggr0/plex0/rg0: normal

                Plex /aggr0/plex1: offline, failed, inactive

NetAppSiteB/NetAppSiteA> aggr status -v
Aggr State      Status            Options
aggr0 online    raid_dp, aggr     root, diskroot, nosnap=off,
                                  raidtype=raid_dp, raidsize=16,
                                  ignore_inconsistent=off,
                                  snapmirrored=off,
                                  resyncsnaptime=60,
                                  fs_size_fixed=off,
                                  snapshot_autodelete=on,
                                  lost_write_protect=on
                Volumes: vol0

                Plex /aggr0/plex1: online, normal, active
                    RAID group /aggr0/plex1/rg0: normal

Giveback procedure

5. After testing Disaster Recovery, unblock all ISL ports.

SITEA03:admin> portenable 8

Wait a while for the fabric to initialize.

SITEA03:admin> fabricshow
Switch ID   Worldwide Name          Enet IP Addr  FC IP Addr Name
-----------------------------------------------------------------------
3: fffc03 10:00:00:05:1e:05:d1:c3 44.55.104.21  0.0.0.0    "SITEB03"
4: fffc04 10:00:00:05:1e:05:d2:90 44.55.104.11  0.0.0.0    >"SITEA03"

The Fabric has 2 switches

SITEA02:admin> portenable 8

Wait a while for the fabric to initialize.

SITEA02:admin> fabricshow
Switch ID   Worldwide Name           Enet IP Addr    FC IP Addr      Name
-------------------------------------------------------------------------
1: fffc01 10:00:00:05:1e:05:d0:39 44.55.104.20    0.0.0.0      "SITEB02"
2: fffc02 10:00:00:05:1e:05:ca:b1 44.55.104.10    0.0.0.0     >"SITEA02"
The Fabric has 2 switches

6. Synchronize all aggregates.

NetAppSiteB/NetAppSiteA> aggr status -v
Aggr State      Status            Options
aggr0(1) failed raid_dp, aggr     diskroot, raidtype=raid_dp,
                out-of-date       raidsize=16, resyncsnaptime=60,
                                  lost_write_protect=off
                Volumes:

                Plex /aggr0(1)/plex0: offline, normal, out-of-date
                    RAID group /aggr0(1)/plex0/rg0: normal

                Plex /aggr0(1)/plex1: offline, failed, out-of-date

aggr0 online    raid_dp, aggr     root, diskroot, nosnap=off,
                                  raidtype=raid_dp, raidsize=16,
                                  ignore_inconsistent=off,
                                  snapmirrored=off,
                                  resyncsnaptime=60,
                                  fs_size_fixed=off,
                                  snapshot_autodelete=on,
                                  lost_write_protect=on
                Volumes: vol0

                Plex /aggr0/plex1: online, normal, active
                    RAID group /aggr0/plex1/rg0: normal

Start the aggregate mirror for each aggregate.

NetAppSiteB/NetAppSiteA> aggr mirror aggr0 -v aggr0(1)

Wait a while for all aggregates to synchronize.

[NetAppSiteB/NetAppSiteA: raid.mirror.resync.done:notice]: /aggr0: resynchronization completed in 0:03.36
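Since giveback must wait until every aggregate has resynchronized, it helps to compare the aggregates you mirrored against the completion messages on the console. The Python sketch below does that on saved console text. The function names and the list of expected aggregates are assumptions for this example (aggr1 is a made-up second aggregate), and the pattern relies on the raid.mirror.resync.done message format shown above.

```python
import re

RESYNC_DONE = re.compile(
    r"raid\.mirror\.resync\.done:notice\]: /(\S+): resynchronization completed"
)


def resynced_aggregates(console_text):
    """Return the set of aggregate names with a resync-done message."""
    return {match.group(1) for match in RESYNC_DONE.finditer(console_text)}


def pending_resync(console_text, expected_aggregates):
    """Return the expected aggregates that have not reported completion yet."""
    done = resynced_aggregates(console_text)
    return sorted(set(expected_aggregates) - done)


console = (
    "[NetAppSiteB/NetAppSiteA: raid.mirror.resync.done:notice]: "
    "/aggr0: resynchronization completed in 0:03.36\n"
)
# aggr1 is hypothetical, included to show what a pending aggregate looks like.
print(pending_resync(console, ["aggr0", "aggr1"]))
```

An empty list means every expected aggregate has reported completion, which is the condition for moving on to the next step.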

NetAppSiteB/NetAppSiteA> aggr status -v
Aggr State      Status            Options
aggr0 online    raid_dp, aggr     root, diskroot, nosnap=off,
                mirrored          raidtype=raid_dp, raidsize=16,
                                  ignore_inconsistent=off,
                                  snapmirrored=off,
                                  resyncsnaptime=60,
                                  fs_size_fixed=off,
                                  snapshot_autodelete=on,
                                  lost_write_protect=on
                Volumes: vol0

                Plex /aggr0/plex1: online, normal, active
                    RAID group /aggr0/plex1/rg0: normal

                Plex /aggr0/plex3: online, normal, active
                    RAID group /aggr0/plex3/rg0: normal

7. After re-synchronization is done, power on and boot the NetApp controller on site B.

RLM NetAppSiteB> system power on
RLM NetAppSiteB> system console
Type Ctrl-D to exit.

Boot Loader version 1.2.3
Copyright (C) 2000,2001,2002,2003 Broadcom Corporation.
Portions Copyright (C) 2002-2006 NetApp Inc.

NetApp Release 7.2.3: Sat Oct 20 17:27:02 PDT 2007
Copyright (c) 1992-2007 NetApp, Inc.
Starting boot on Tue Feb 5 15:37:40 GMT 2008
Tue Feb 5 15:38:31 GMT [ses.giveback.wait:info]: Enclosure Services will be unavailable while waiting for giveback.
Press Ctrl-C for Maintenance menu to release disks.
Waiting for giveback

On site A, execute cf giveback.

NetAppSiteA(takeover)> cf status
NetAppSiteA has taken over NetAppSiteB.
NetAppSiteB is ready for giveback.

NetAppSiteA(takeover)> cf giveback
please make sure you have rejoined your aggr before giveback.
Do you wish to continue [y/n] ?? y

NetAppSiteA> cf status
Tue Feb 5 16:41:00 CET [NetAppSiteA: monitor.globalStatus.ok:info]: The system's global status is normal.
Cluster enabled, NetAppSiteB is up.
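The "cf status" check before giveback can also be expressed as a small test on the command output. The Python sketch below is only an illustration: the function name ready_for_giveback is hypothetical, and it matches the exact wording of the two status lines shown in the transcript above, which may differ in other Data ONTAP releases.

```python
def ready_for_giveback(cf_status_text, local_node, partner_node):
    """True when the local node holds the takeover and the partner is waiting."""
    lines = [line.strip() for line in cf_status_text.splitlines()]
    taken_over = f"{local_node} has taken over {partner_node}." in lines
    partner_ready = f"{partner_node} is ready for giveback." in lines
    return taken_over and partner_ready


sample = """\
NetAppSiteA has taken over NetAppSiteB.
NetAppSiteB is ready for giveback.
"""
print(ready_for_giveback(sample, "NetAppSiteA", "NetAppSiteB"))
```

If the second line is missing, the partner has not yet reached the "Waiting for giveback" state and the giveback should be postponed.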

Tuesday, 17 January 2012