Tuesday 17 January 2012

Manual Failover Activity in a NetApp MetroCluster Environment


Today I want to write about manually performing the takeover and failback (giveback) activity in a NetApp MetroCluster environment.
In a MetroCluster environment, a disaster takeover does not work just by issuing the cf takeover command.
Takeover process.
  1. Manually fail the ISL link.
  2. Issue the "cf forcetakeover -d" command.
Giveback process.
  1. aggr status -r : Validate that you can access the remote storage. If the remote shelves do not show up, check connectivity.
  2. partner : Go into partner mode on the surviving node.
  3. aggr status -r : Determine which aggregates are at the surviving site and which are at the disaster site. Aggregates at the disaster site show plexes that are in a failed state with an out-of-date status. Aggregates at the surviving site show the plex online.
     Note: If aggregates at the disaster site are online, take them offline by entering the following command for each online aggregate: "aggr offline disaster_aggr" (disaster_aggr is the name of the aggregate at the disaster site). An error message appears if the aggregate is already offline.
  4. aggr mirror aggr_name -v disaster_aggr : Recreate the mirrored aggregates by entering this command for each aggregate that was split (aggr_name is the aggregate on the surviving site's node; disaster_aggr is the aggregate on the disaster site's node). The aggr_name aggregate rejoins the disaster_aggr aggregate to reestablish the MetroCluster configuration.
     Caution: Make sure that resynchronization is complete on each aggregate before attempting the following step.
  5. partner : Return to the command prompt of the remote node.
  6. cf giveback : The node at the disaster site reboots.
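
The giveback commands above follow the same pattern for every aggregate pair, so the sequence can be generated for review instead of being typed by hand. The Python sketch below is only an illustration: the function name giveback_commands and its inputs are hypothetical, and it only builds the command strings from this post. It does not connect to a filer or run anything.

```python
def giveback_commands(pairs, online_disaster_aggrs=()):
    """Return the console commands for the giveback steps, in order.

    pairs: list of (surviving_aggr, disaster_aggr) names (hypothetical input).
    online_disaster_aggrs: disaster-site aggregates that are still online
    and therefore must be taken offline before they are re-mirrored.
    """
    cmds = ["partner"]  # enter partner mode on the surviving node
    for aggr in online_disaster_aggrs:
        cmds.append(f"aggr offline {aggr}")
    for surviving, disaster in pairs:
        cmds.append(f"aggr mirror {surviving} -v {disaster}")
    cmds.append("partner")  # leave partner mode again
    # cf giveback stays last: run it only once resynchronization has finished.
    cmds.append("cf giveback")
    return cmds


if __name__ == "__main__":
    for line in giveback_commands([("aggr0", "aggr0(1)")]):
        print(line)
```

Keeping cf giveback as the final entry mirrors the caution above: the list is meant to be worked through by an operator who checks resynchronization before the last command.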

Step-by-Step Procedure



Description
To test Disaster Recovery, you must restrict access to the disaster site node to prevent the node from resuming service.  If you do not, you risk the possibility of data corruption.
Procedure
Access to the disaster site node can be restricted in the following ways:
  • Turn off the power to the disaster site node.

    Or
  • Use "manual fencing" (Disconnect VI interconnects and fiber channel cables).

However, both of these solutions require physical access to the disaster site node, which is not always possible (or practical) for testing purposes.
Follow the steps below to "fence" the fabric MetroCluster without power loss and to test Disaster Recovery without physical access.

Note: Site A is the takeover site. Site B is the disaster site.

Takeover procedure
  1. Stop ISL connections between sites.
  • Connect to both fabric MetroCluster switches on site A and block all ISL ports.  First, retrieve the ISL port number.

    SITEA02:admin> switchshow
    switchName:     SITEA02
    switchType:     34.0
    switchState:    Online
    switchMode:     Native
    switchRole:     Principal
    switchDomain:   2
    switchId:       fffc02
    switchWwn:      10:00:00:05:1e:05:ca:b1
    zoning:         OFF
    switchBeacon:   OFF

    Area Port Media Speed State     Proto
    =====================================
      0   0    id    N4   Online    F-Port  21:00:00:1b:32:1f:ff:66
      1   1    id    N4   Online    F-Port  50:0a:09:82:00:01:d7:40
      2   2    id    N4   Online    F-Port  50:0a:09:80:00:01:d7:40
      3   3    id    N4   No_Light
      4   4    id    N4   No_Light
      5   5    id    N2   Online    L-Port  28 public
      6   6    id    N2   Online    L-Port  28 public
      7   7    id    N2   Online    L-Port  28 public
      8   8    id    N4   Online    LE E-Port 10:00:00:05:1e:05:d0:39 "SITEB02" (downstream)
      9   9    id    N4   No_Light
     10  10    id    N4   No_Light
     11  11    id    N4   No_Light
     12  12    id    N4   No_Light
     13  13    id    N2   Online    L-Port  28 public
     14  14    id    N2   Online    L-Port  28 public
     15  15    id    N4   No_Light
  • Check fabric before blocking the ISL port.   
     
SITEA02:admin> fabricshow
Switch ID   Worldwide Name         Enet IP Addr   FC IP Addr  Name
-------------------------------------------------------------------------
1: fffc01 10:00:00:05:1e:05:d0:39  44.55.104.20   0.0.0.0     "SITEB02"
2: fffc02 10:00:00:05:1e:05:ca:b1  44.55.104.10   0.0.0.0     >"SITEA02"
 
The Fabric has 2 switches
  • Disable the ISL port.

    SITEA02:admin> portdisable 8
  • Check split of the fabric.

    SITEA02:admin> fabricshow
    Switch ID   Worldwide Name      Enet IP Addr    FC IP Addr   Name
    -----------------------------------------------------------------------
    2: fffc02 10:00:00:05:1e:05:ca:b1 44.55.104.10    0.0.0.0    >"SITEA02"
  • Do the same thing on the second switch.

    SITEA03:admin> switchshow
    switchName:     SITEA03
    switchType:     34.0
    switchState:    Online
    switchMode:     Native
    switchRole:     Principal
    switchDomain:   4
    switchId:       fffc04
    switchWwn:      10:00:00:05:1e:05:d2:90
    zoning:         OFF
    switchBeacon:   OFF

    Area Port Media Speed State     Proto
    =====================================
      0   0   id    N4   Online     F-Port  21:01:00:1b:32:3f:ff:66
      1   1   id    N4   Online     F-Port  50:0a:09:83:00:01:d7:40
      2   2   id    N4   Online     F-Port  50:0a:09:81:00:01:d7:40
      3   3   id    N4   No_Light
      4   4   id    N4   No_Light
      5   5   id    N2   Online     L-Port  28 public
      6   6   id    N2   Online     L-Port  28 public
      7   7   id    N2   Online     L-Port  28 public
      8   8   id    N4   Online     LE E-Port  10:00:00:05:1e:05:d1:c3 "SITEB03" (downstream)
      9   9   id    N4   No_Light
     10  10   id    N4   No_Light
     11  11   id    N4   No_Light
     12  12   id    N4   No_Light
     13  13   id    N2   Online     L-Port  28 public
     14  14   id    N2   Online     L-Port  28 public
     15  15   id    N4   No_Light
SITEA03:admin> fabricshow
Switch ID   Worldwide Name          Enet IP Addr  FC IP Addr Name
-----------------------------------------------------------------------
  3: fffc03 10:00:00:05:1e:05:d1:c3 44.55.104.21  0.0.0.0    "SITEB03"
  4: fffc04 10:00:00:05:1e:05:d2:90 44.55.104.11  0.0.0.0    >"SITEA03"

The Fabric has 2 switches

SITEA03:admin> portdisable 8
SITEA03:admin> fabricshow
Switch ID   Worldwide Name          Enet IP Addr  FC IP Addr Name
-----------------------------------------------------------------------
  4: fffc04 10:00:00:05:1e:05:d2:90 44.55.104.11  0.0.0.0    >"SITEA03"
  • Check the NetApp controller console for missing-disk messages.

    Tue Feb  5 16:21:37 CET [NetAppSiteA: raid.config.spare.disk.missing:info]: Spare Disk SITEB03:6.23 Shelf 1 Bay 7 [NETAPP   X276_FAL9E288F10 NA02] S/N [DH07P7803V7L] is missing. 
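
Picking the ISL port out of the switchshow listing by eye is easy to get wrong on a larger switch. As a rough aid, the Python sketch below scans captured switchshow text for rows that are Online and report an E-Port. The function name isl_ports is hypothetical, and the row layout is assumed to match the listings in this post; other Fabric OS releases may format the table differently.

```python
import re

# One port row: Area, Port, Media, Speed, State, then the remaining text.
# Layout assumed from the switchshow listings in this post.
_PORT_ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+\S+\s+\S+\s+(\S+)\s*(.*)$")


def isl_ports(switchshow_text):
    """Return the port numbers of Online rows that mention an E-Port."""
    ports = []
    for line in switchshow_text.splitlines():
        m = _PORT_ROW.match(line)
        if not m:
            continue  # header, separator, or switch attribute line
        port, state, rest = int(m.group(2)), m.group(3), m.group(4)
        if state == "Online" and "E-Port" in rest:
            ports.append(port)
    return ports


if __name__ == "__main__":
    sample = (
        '  2   2    id    N4   Online    F-Port  50:0a:09:80:00:01:d7:40\n'
        '  3   3    id    N4   No_Light\n'
        '  8   8    id    N4   Online    LE E-Port 10:00:00:05:1e:05:d0:39 "SITEB02" (downstream)\n'
    )
    print(isl_ports(sample))  # prints [8]
```

The result is only a candidate list for portdisable. Confirm it against the switch before disabling anything.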
  2. Check that all aggregates are split.

    NetAppSiteA> aggr status -r
    Aggregate aggr0 (online, raid_dp, mirror degraded) (block checksums)
    Plex /aggr0/plex0 (online, normal, active, pool0)
    RAID group /aggr0/plex0/rg0 (normal)

    RAID Disk Device     HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)   Phys (MB/blks)
    ---------------------------------------------------------------------------------------
    dparity SITEA03:5.16 0b  1     0   FC:B  0  FCAL  10000 272000/557056000 280104/573653840
    parity  SITEA02:5.32 0c  2     0   FC:A  0  FCAL  10000 272000/557056000 280104/573653840
    data    SITEA03:6.16 0d  1     0   FC:B  0  FCAL  10000 272000/557056000 280104/573653840

    Plex /aggr0/plex1 (offline, failed, inactive, pool1)
    RAID group /aggr0/plex1/rg0 (partial)

    RAID Disk Device HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks) Phys (MB/blks)
    ---------------------------------------------------------------------------------
    dparity   FAILED                     N/A            272000/557056000
    parity    FAILED                     N/A            272000/557056000
    data      FAILED                     N/A            272000/557056000
    Raid group is missing 3 disks.
    NetAppSiteB> aggr status -r
    Aggregate aggr0 (online, raid_dp, mirror degraded) (block checksums)
    Plex /aggr0/plex0 (online, normal, active, pool0)
    RAID group /aggr0/plex0/rg0 (normal)

    RAID Disk Device        HA SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)   Phys (MB/blks)
    ------------------------------------------------------------------------------------------
    dparity   SITEB03:13.17 0d   1   1   FC:B  0  FCAL  10000 272000/557056000 280104/573653840
    parity    SITEB03:13.32 0b   2   0   FC:B  0  FCAL  10000 272000/557056000 280104/573653840
    data      SITEB02:14.16 0a   1   0   FC:A  0  FCAL  10000 272000/557056000 280104/573653840

    Plex /aggr0/plex1 (offline, failed, inactive, pool1)
    RAID group /aggr0/plex1/rg0 (partial)

    RAID Disk Device        HA SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
    --------------------------------------------------------------------------------------
    dparity   FAILED                          N/A            272000/557056000
    parity    FAILED                          N/A            272000/557056000
    data      FAILED                          N/A            272000/557056000
    Raid group is missing 3 disks.
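
With more than one aggregate, checking every "mirror degraded" flag and failed plex by hand gets tedious. The sketch below, written in Python as an illustration, reads captured aggr status -r text and reports which aggregates look split. The name split_report is hypothetical, and the parsing assumes the "Aggregate ... (...)" and "Plex ... (...)" header lines shown above.

```python
import re

# Header lines as they appear in the aggr status -r output in this post.
_AGGR = re.compile(r"^\s*Aggregate\s+(\S+)\s+\(([^)]*)\)")
_PLEX = re.compile(r"^\s*Plex\s+(\S+)\s+\(([^)]*)\)")


def _flags(text):
    """Split a comma-separated status list into clean flag strings."""
    return [part.strip() for part in text.split(",")]


def split_report(status_r_text):
    """Map aggregate name -> True if it is mirror degraded with a failed plex."""
    degraded, failed = {}, {}
    current = None
    for line in status_r_text.splitlines():
        a = _AGGR.match(line)
        if a:
            current = a.group(1)
            degraded[current] = "mirror degraded" in _flags(a.group(2))
            failed[current] = False
            continue
        p = _PLEX.match(line)
        if p and current is not None and "failed" in _flags(p.group(2)):
            failed[current] = True
    return {name: degraded[name] and failed[name] for name in degraded}


if __name__ == "__main__":
    sample = (
        "Aggregate aggr0 (online, raid_dp, mirror degraded) (block checksums)\n"
        "Plex /aggr0/plex0 (online, normal, active, pool0)\n"
        "Plex /aggr0/plex1 (offline, failed, inactive, pool1)\n"
    )
    print(split_report(sample))  # prints {'aggr0': True}
```

Any aggregate that maps to False still has a healthy mirror (or is not mirrored at all) and deserves a closer look before the takeover test continues.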
  3. Connect to the Remote LAN Module (RLM) console on site B, then halt and power off the NetApp controller.

    NetAppSiteB> halt
    Boot Loader version 1.2.3
    Copyright (C) 2000,2001,2002,2003 Broadcom Corporation.
    Portions Copyright (C) 2002-2006 NetApp Inc.

    CPU Type: Dual Core AMD Opteron(tm) Processor 265
    LOADER>
  • Power off the NetApp controller.

    LOADER>
    Ctrl-d
    RLM NetAppSiteB> system power off
    This will cause a dirty shutdown of your appliance.  Continue? [y/n]

    RLM NetAppSiteB> system power status
    Power supply 1 status:
       Present: yes
       Turned on by Agent: no
       Output power: no
       Input power: yes
       Fault: no
    Power supply 2 status:
       Present: yes
       Turned on by Agent: no
       Output power: no
       Input power: yes
       Fault: no
  4. Now you can test Disaster Recovery.
NetAppSiteA> cf forcetakeover -d
----
NetAppSiteA(takeover)>

NetAppSiteA(takeover)> aggr status -v
Aggr State     Status           Options
aggr0 online   raid_dp, aggr    root, diskroot, nosnap=off,
               mirror degraded  raidtype=raid_dp, raidsize=16,
                                ignore_inconsistent=off,
                                snapmirrored=off,
                                resyncsnaptime=60,
                                fs_size_fixed=off,
                                snapshot_autodelete=on,
                                lost_write_protect=on
                Volumes: vol0

                Plex /aggr0/plex0: online, normal, active
                    RAID group /aggr0/plex0/rg0: normal

                Plex /aggr0/plex1: offline, failed, inactive

NetAppSiteB/NetAppSiteA> aggr status -v
Aggr State      Status            Options
aggr0 online    raid_dp, aggr     root, diskroot, nosnap=off,
                                  raidtype=raid_dp, raidsize=16,
                                  ignore_inconsistent=off,
                                  snapmirrored=off,
                                  resyncsnaptime=60,
                                  fs_size_fixed=off,
                                  snapshot_autodelete=on,
                                  lost_write_protect=on
                Volumes: vol0

                Plex /aggr0/plex1: online, normal, active
                    RAID group /aggr0/plex1/rg0: normal 
Giveback procedure

  5. After testing Disaster Recovery, unblock all ISL ports.
SITEA03:admin> portenable 8 
  • Wait a while for the fabric to initialize.

    SITEA03:admin> fabricshow
    Switch ID Worldwide Name           Enet IP Addr  FC IP Addr    Name
    -----------------------------------------------------------------------
    3: fffc03 10:00:00:05:1e:05:d1:c3  44.55.104.21  0.0.0.0     "SITEB03"
    4: fffc04 10:00:00:05:1e:05:d2:90  44.55.104.11  0.0.0.0    >"SITEA03"
    The Fabric has 2 switches

    SITEA02:admin> portenable 8
  • Wait a while for the fabric to initialize.

    SITEA02:admin> fabricshow
    Switch ID   Worldwide Name           Enet IP Addr    FC IP Addr      Name
    -------------------------------------------------------------------------
    1: fffc01 10:00:00:05:1e:05:d0:39  44.55.104.20    0.0.0.0      "SITEB02"
    2: fffc02 10:00:00:05:1e:05:ca:b1  44.55.104.10    0.0.0.0     >"SITEA02"
    The Fabric has 2 switches
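
"Wait a while" can be made more concrete by checking that fabricshow lists both switches again. The Python sketch below parses captured fabricshow text and counts the member switches. The names fabric_members and fabric_merged are hypothetical, and the row format is assumed from the listings in this post.

```python
import re

# One fabricshow row: domain, switch ID, WWN, Ethernet IP, FC IP, name.
# A leading ">" before the name marks the principal switch.
_FABRIC_ROW = re.compile(
    r'^\s*(\d+):\s+([0-9a-f]{6})\s+((?:[0-9a-f]{2}:){7}[0-9a-f]{2})\s+'
    r'(\S+)\s+(\S+)\s+>?"?([^"\s]+)',
    re.IGNORECASE,
)


def fabric_members(fabricshow_text):
    """Return the switch names listed in fabricshow output, in order."""
    names = []
    for line in fabricshow_text.splitlines():
        m = _FABRIC_ROW.match(line)
        if m:
            names.append(m.group(6))
    return names


def fabric_merged(fabricshow_text, expected=2):
    """True once at least `expected` switches are visible in the fabric."""
    return len(fabric_members(fabricshow_text)) >= expected


if __name__ == "__main__":
    sample = (
        '1: fffc01 10:00:00:05:1e:05:d0:39  44.55.104.20    0.0.0.0      "SITEB02"\n'
        '2: fffc02 10:00:00:05:1e:05:ca:b1  44.55.104.10    0.0.0.0     >"SITEA02"\n'
    )
    print(fabric_members(sample))  # prints ['SITEB02', 'SITEA02']
```

The same check, with expected=1 negated, also confirms the split right after portdisable: a fenced fabric shows only the local switch.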
  6. Synchronize all aggregates.
NetAppSiteB/NetAppSiteA> aggr status -v
      Aggr State      Status            Options
  aggr0(1) failed     raid_dp, aggr     diskroot, raidtype=raid_dp,
                      out-of-date       raidsize=16, resyncsnaptime=60,
                                        lost_write_protect=off
           Volumes:
                            Plex /aggr0(1)/plex0: offline, normal, out-of-date
           RAID group /aggr0(1)/plex0/rg0: normal
               Plex /aggr0(1)/plex1: offline, failed, out-of-date

          aggr0 online    raid_dp, aggr     root, diskroot, nosnap=off,
                                            raidtype=raid_dp, raidsize=16,
                                            ignore_inconsistent=off,
                                            snapmirrored=off,
                                            resyncsnaptime=60,
                                            fs_size_fixed=off,
                                            snapshot_autodelete=on,
                                            lost_write_protect=on
                Volumes: vol0

                Plex /aggr0/plex1: online, normal, active
                    RAID group /aggr0/plex1/rg0: normal 
  • Run aggr mirror for each aggregate.

    NetAppSiteB/NetAppSiteA> aggr mirror aggr0 -v aggr0(1)
  • Wait a while for all aggregates to synchronize.

    [NetAppSiteB/NetAppSiteA: raid.mirror.resync.done:notice]: /aggr0: resynchronization completed in 0:03.36

    NetAppSiteB/NetAppSiteA> aggr status -v
        Aggr State     Status           Options
        aggr0 online   raid_dp, aggr    root, diskroot, nosnap=off,
                       mirrored         raidtype=raid_dp, raidsize=16,
                                        ignore_inconsistent=off,
                                        snapmirrored=off,
                                        resyncsnaptime=60,
                                        fs_size_fixed=off,
                                        snapshot_autodelete=on,
                                        lost_write_protect=on
             Volumes: vol0

             Plex /aggr0/plex1: online, normal, active
             RAID group /aggr0/plex1/rg0: normal

             Plex /aggr0/plex3: online, normal, active
             RAID group /aggr0/plex3/rg0: normal
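
Because giveback must wait until every aggregate has finished resynchronizing, it helps to track which aggregates have already logged a completion message. The Python sketch below scans captured console text for raid.mirror.resync.done lines and returns the aggregates still outstanding. The function name pending_resync is hypothetical, and the message format is assumed from the single log line shown above.

```python
import re

# Matches e.g. "raid.mirror.resync.done:notice]: /aggr0: resynchronization completed"
_RESYNC_DONE = re.compile(
    r"raid\.mirror\.resync\.done:\w+\]:\s+/(\S+?):\s+resynchronization completed"
)


def pending_resync(console_log, aggregates):
    """Return the aggregates that have not yet logged a resync completion."""
    done = set(_RESYNC_DONE.findall(console_log))
    return [name for name in aggregates if name not in done]


if __name__ == "__main__":
    log = (
        "[NetAppSiteB/NetAppSiteA: raid.mirror.resync.done:notice]: "
        "/aggr0: resynchronization completed in 0:03.36\n"
    )
    print(pending_resync(log, ["aggr0", "aggr1"]))  # prints ['aggr1']
```

An empty result is a hint, not proof. Confirm with aggr status -v that each aggregate shows "mirrored" before moving on.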
  7. After resynchronization is done, power on and boot the NetApp controller on site B.
RLM NetAppSiteB> system power on
RLM NetAppSiteB> system console
Type Ctrl-D to exit.

Boot Loader version 1.2.3
Copyright (C) 2000,2001,2002,2003 Broadcom Corporation.
Portions Copyright (C) 2002-2006 NetApp Inc.

NetApp Release 7.2.3: Sat Oct 20 17:27:02 PDT 2007
Copyright (c) 1992-2007 NetApp, Inc.
Starting boot on Tue Feb  5 15:37:40 GMT 2008
Tue Feb  5 15:38:31 GMT [ses.giveback.wait:info]: Enclosure Services will be unavailable while waiting for giveback.
Press Ctrl-C for Maintenance menu to release disks.
Waiting for giveback
  8. On site A, execute cf giveback.

    NetAppSiteA(takeover)> cf status
    NetAppSiteA has taken over NetAppSiteB.
    NetAppSiteB is ready for giveback.

    NetAppSiteA(takeover)> cf giveback
    please make sure you have rejoined your aggr before giveback.
    Do you wish to continue [y/n] ?? y

    NetAppSiteA> cf status
    Tue Feb  5 16:41:00 CET [NetAppSiteA: monitor.globalStatus.ok:info]: The system's global status is normal.
    Cluster enabled, NetAppSiteB is up.
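
As a last guard before giveback, the "ready for giveback" line from cf status can be checked explicitly. The Python sketch below is a minimal illustration: the function name ready_for_giveback is hypothetical, and it simply looks for the wording shown in the cf status output above.

```python
def ready_for_giveback(cf_status_text, partner):
    """True if cf status output reports the partner as ready for giveback."""
    wanted = f"{partner} is ready for giveback"
    return any(wanted in line for line in cf_status_text.splitlines())


if __name__ == "__main__":
    status = (
        "NetAppSiteA has taken over NetAppSiteB.\n"
        "NetAppSiteB is ready for giveback.\n"
    )
    print(ready_for_giveback(status, "NetAppSiteB"))  # prints True
```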
     

















 
