High-Availability SSO with Cisco 9800 Wireless Controllers – PART I

High-Availability (HA) is a key feature of a wireless LAN design: it aims to keep all the wireless services up for the users without any downtime. It is achieved by eliminating single points of failure, detecting system failures and having redundancy to provide failover.

In a centralized WLAN architecture, the wireless controller is a critical single point of failure: it handles the management and control planes (and optionally the data plane) for all the wireless Access Points. It is therefore important to add redundancy for the wireless controller and to monitor it.

There are three main forms of redundancy with Cisco wireless LAN controllers:

  • Port redundancy: bundling the distribution ports of the wireless controller into a Link Aggregation Group (LAG) protects against port failures. CAPWAP traffic can also be load-balanced across the bundled ports.

  • Creating an HA pair of wireless controllers using Stateful Switchover (SSO). One controller becomes Active while the other controller is in Hot Standby mode. If the Active controller fails or has a network connection failure, the Hot Standby controller detects the failure and takes the Active role. The Access Points see the HA pair as one single controller. All the CAPWAP connections with the Access Points and all the client states are seamlessly maintained (stateful) with minimal disruption. This article describes in detail SSO redundancy for Cisco 9800 wireless controllers.

  • Installing additional backup wireless controllers using an N+1 architecture. There are many aspects to take into consideration in order to make a controller failover as deterministic as possible. Access Points typically have a resilient mechanism to discover all potential controllers, so each Access Point should be configured with its primary/secondary/tertiary (preferred) controllers. However, using N+1 redundancy, some services can be disrupted for the end users during the time of the failover (compared to SSO). Maximum resiliency can be achieved using N+1 redundancy with several SSO pairs of controllers as primary/secondary/tertiary controllers.

Stateful Switchover (SSO)

The HA SSO feature allows the Active and Standby controllers to synchronise the CAPWAP sessions with each Access Point and to maintain the state of their associated wireless clients. In case of failure of the Active controller, the Standby controller performs a Switchover and takes over the Active role. The CAPWAP sessions with each Access Point remain established and the wireless clients remain associated with the new Active controller. The Access Points remain in the RUN state and do not go into the discovery process during a Switchover.

HA SSO provides other benefits. For example, the configuration of the Active and Standby controllers is constantly synchronised, which prevents any configuration discrepancy between the two controllers. Both controllers share the same management IP address. SSO also makes it possible to perform a zero-downtime In-Service Software Upgrade (ISSU).

In addition, there is no pre-emption with HA SSO: a chassis that rejoins the pair does not take back the Active role, whatever its priority. However, it is possible to trigger a manual Switchover.

Finally, HA SSO can be configured in one of the two following modes:

  • RP mode: both controllers, also called chassis, are synchronised as a stack using a dedicated RP port. One controller receives the Active role during the Active/Standby election. The Standby controller sends Keepalives over the RP link to monitor the Active controller. If the Active controller is not reachable, the Standby controller takes the Active role.

  • RMI+RP mode: this mode introduces an additional Redundancy Management Interface (RMI) that uses a secondary IP address on the management interface. Through this secondary link between the management interfaces, the Standby controller can also detect a failure of the RP link between both controllers. Like the RP interface, the RMI interface exchanges resource information about the chassis. In the case of an RP link failure, the RMI+RP mode avoids a Dual-Active chassis scenario (Dual-Active Detection). It also supports Management Gateway Failover, which detects the failure of the default gateway.

Requirements

The requirements to form an SSO pair are:

  • An HA pair can only be formed between two wireless controllers of the same form factor (C9800-L-C, C9800-40-K9, C9800-80-K9, C9800-CL) running the same software version.

    • Since release 17.5, the 9800-CL supports Auto-upgrade where the Active chassis upgrades the Standby chassis to the same version.

    • The scale of the C9800-CL must also match (CPU, memory and storage resources).

  • HA cannot be formed between 9800-L-C and 9800-L-F.

  • The RP port must be of the same type (Copper RP or Fibre RP).

  • Both 9800 controllers must use the same interface number to form the HA pair (for example, the gi3 interface).

  • The RP link must have the following capacity:

    • Maximum Latency: 80 ms RTT

    • Minimum bandwidth: 60 Mbps

    • Minimum MTU: 1500 Bytes
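The RP link requirements above can be turned into a small pre-deployment check. The following is a minimal Python sketch, not a Cisco tool: the RpLinkMeasurement class and its field names are hypothetical, the measured values are assumed to come from your own tests, and the 60 Mbps figure is treated here as a minimum.

```python
from dataclasses import dataclass

# Thresholds taken from the RP link requirements listed in this article.
MAX_RTT_MS = 80
MIN_BANDWIDTH_MBPS = 60   # assumption: the bandwidth figure is a minimum
MIN_MTU_BYTES = 1500


@dataclass
class RpLinkMeasurement:
    """Hypothetical container for values measured on the RP link."""
    rtt_ms: float
    bandwidth_mbps: float
    mtu_bytes: int


def rp_link_problems(link: RpLinkMeasurement) -> list[str]:
    """Return the list of unmet requirements; an empty list means the link qualifies."""
    problems = []
    if link.rtt_ms > MAX_RTT_MS:
        problems.append(f"RTT {link.rtt_ms} ms is above {MAX_RTT_MS} ms")
    if link.bandwidth_mbps < MIN_BANDWIDTH_MBPS:
        problems.append(f"bandwidth {link.bandwidth_mbps} Mbps is below {MIN_BANDWIDTH_MBPS} Mbps")
    if link.mtu_bytes < MIN_MTU_BYTES:
        problems.append(f"MTU {link.mtu_bytes} bytes is below {MIN_MTU_BYTES} bytes")
    return problems


if __name__ == "__main__":
    # Example values only: a back-to-back RP link in a lab.
    print(rp_link_problems(RpLinkMeasurement(rtt_ms=2, bandwidth_mbps=1000, mtu_bytes=1500)))
```

An empty list means that the three requirements are met; otherwise each entry names the requirement that failed.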

RP mode was already available on the legacy AireOS wireless controllers (AP SSO since version 7.3 and Client SSO since version 7.5). Both AP and Client SSO are available on 9800 from version 16.10.

Since release 17.1 it is recommended to configure HA SSO using RMI+RP mode.

The following section quickly describes how to configure SSO using RMI+RP mode on C9800-CL.

Configuring Stateful Switchover (SSO) RMI+RP

At boot time, an election determines the Active and Standby chassis. Once the roles have been assigned, the configuration is synchronised between the Active and the Standby controller through a dedicated Redundancy Port (RP). The entire configuration of the Active chassis is pushed to the Standby chassis.

A chassis priority {1|2} must be defined for each chassis (chassis ID {1|2}). The chassis with the highest priority value is preferred as the Active controller during an Active/Standby election, when the pair of chassis reboots and initialises the stack. Once a chassis is elected as Active, it remains Active until it fails or reboots.

The interface used as the RP port must also be selected. The RP interface is used for the SSO control plane (Keepalives, role selection, etc).

Finally, the secondary IP address (RMI IP address) of the management interface of each chassis must be allocated.

In the examples of this blog, the following settings and IP addresses were used:

Wireless controller chassis ID   1                 2
Chassis name                     9800-1            9800-2
HA SSO role                      Active            Standby
RP interface                     Gi3               Gi3
Management interface             Gi2               Gi2
Management IP address            192.168.203.31    192.168.203.32 (before SSO, in standalone)
                                                   192.168.203.31 (when configured with SSO)
Secondary IP address             192.168.203.33    192.168.203.34
Chassis priority                 2                 1

The HA SSO configuration must be entered on both controllers while they are in Standalone mode or in Day 0 Setup. The configuration is described below, first using the GUI and then using the CLI:

Using GUI:

On the preferred Active chassis:

In Standalone mode, go to Administration | Device | Redundancy and set the Redundancy Configuration to Enabled.

The following configuration is for the chassis ID 1. (Chassis renumber set to 1).

Enter the RMI IP addresses (secondary IP addresses) for chassis 1 (192.168.203.33) and chassis 2 (192.168.203.34).

Set the HA interface to the GigabitEthernet3.

Keep Management Gateway Failover as Enabled.

Set the Active Chassis Priority to 2. The chassis with the highest chassis priority is preferred as the Active chassis during the initial election at boot time.

Click on Apply and configure the other HA chassis before rebooting.

On the preferred Standby chassis:

If the other chassis was already deployed in Standalone mode, go to Administration | Device | Redundancy and set the Redundancy Configuration to Enabled.

The following configuration is for the chassis ID 2. (Chassis renumber set to 2).

Enter the RMI IP addresses (secondary IP addresses) for chassis 1 (192.168.203.33) and for chassis 2 (192.168.203.34).

Note that the RMI IP addresses are entered in the same order on both the Active and Standby chassis. This is one confusing difference with the original RP-only mode of SSO (not the RMI mode), where the configured IP addresses represent the HA link and are therefore inverted between the Active and Standby chassis configurations.

Set the HA interface to the GigabitEthernet3.

Keep Management Gateway Failover set to Enabled.

Set the Active Chassis Priority to 1. The chassis with the lowest chassis priority is preferred as the Standby chassis during the election.

Click on Apply and click on Administration | Reload to reboot the chassis.

Alternatively, if the HA chassis is booting for the first time with the Configuration Setup Wizard:

Set Deployment Mode to Standby.

Set Port Number to GigabitEthernet3 and configure the Wireless Management VLAN.

Set the RMI IP address for Chassis 1 (192.168.203.33) and for Chassis 2 (192.168.203.34).

Click on Summary then Finish.

You will see a notification to reload the chassis. Click on Yes.

After both chassis have been rebooted, the election will occur.

The Standby chassis loses its previous wireless management IP address (192.168.203.32), which is no longer used. The wireless management IP address of the Active chassis is now shared across the stack.

To verify that the Redundancy is set up, go to Monitoring | System | Redundancy. The General tab shows that the redundancy state is sso.

Using CLI:

First, we will configure chassis 1 as the preferred Active chassis.

Set the priority of chassis 1 to a high priority:

9800-CL#chassis 1 priority 2

9800-CL#show chassis

Chassis/Stack Mac Address : 0800.27cb.16f0 - Local Mac Address

Mac persistency wait time: Indefinite

                                             H/W   Current

Chassis#   Role    Mac Address     Priority Version  State                 IP

-------------------------------------------------------------------------------------

*1       Standby  0800.27cb.16f0     2      V02     Ready                0.0.0.0

 

Enter the following CLI commands to configure the gi3 interface as the RP port and to configure the RMI secondary IP addresses of the two chassis:

9800-CL#chassis redundancy ha-interface gigabitEthernet 3

WARNING: Changing the switch HA interface may result in a configuration change for that switch.  The configuration associated with the old switch HA interface will remain as a provisioned configuration. New HA interface will be effective after next reboot. Do you want to continue?[y/n]? [yes]: yes

9800-CL#configure terminal

9800-CL(config)#redun-management interface vlan 1 chassis 1 address 192.168.203.33 chassis 2 address 192.168.203.34

Then we will configure the other chassis as the preferred Standby chassis.

The chassis ID must be set to 2. By default, the chassis number of a 9800-CL in standalone mode is 1.

9800-CL#chassis 1 renumber 2

WARNING: Changing the switch number may result in a configuration change for that switch.  The interface configuration associated with the old switch number will remain as a provisioned configuration. New Switch Number will be effective after next reboot. Do you want to continue?[y/n]? [yes]: yes

Reload this unit to apply the new chassis ID.

9800-CL#reload

Once rebooted, set the priority of chassis 2 to a low priority (default is 1):

9800-CL#chassis 2 priority 1

9800-CL#show chassis

Chassis/Stack Mac Address : 0800.274d.a612 - Local Mac Address

Mac persistency wait time: Indefinite

                                             H/W   Current

Chassis#   Role    Mac Address     Priority Version  State                 IP

-------------------------------------------------------------------------------------

*2       Active   0800.274d.a612     1      V02     Ready                0.0.0.0

Enter the same CLI commands to configure the gi3 interface as the RP port and to configure the RMI secondary IP addresses of the two chassis:

9800-CL#chassis redundancy ha-interface gigabitEthernet 3

WARNING: Changing the switch HA interface may result in a configuration change for that switch.  The configuration associated with the old switch HA interface will remain as a provisioned configuration. New HA interface will be effective after next reboot. Do you want to continue?[y/n]? [yes]: yes

9800-CL#configure terminal

9800-CL(config)#redun-management interface vlan 1 chassis 1 address 192.168.203.33 chassis 2 address 192.168.203.34

Save and reload the two chassis around the same time or at least within 120 seconds (STACK_DISCOVERY_TIMER) of each other.

9800-CL#write memory

9800-CL#reload
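To summarise the CLI procedure, the per-chassis command sequence can be generated from a few parameters. This is a hedged Python sketch that only assembles the commands used in this article. The function name and its parameters are hypothetical, and the "configure terminal" and "end" steps are an assumption based on the (config)# prompt shown in the transcript.

```python
def sso_cli_commands(current_id, target_id, priority, rmi_chassis_1, rmi_chassis_2,
                     rp_port=3, mgmt_vlan=1):
    """Return, in order, the commands used in this article for one chassis."""
    commands = []
    if current_id != target_id:
        # A standalone 9800-CL is chassis 1 by default; the new ID needs a reload.
        commands += [f"chassis {current_id} renumber {target_id}", "reload"]
    commands += [
        f"chassis {target_id} priority {priority}",
        f"chassis redundancy ha-interface gigabitEthernet {rp_port}",
        # Assumption: the RMI command is entered in global configuration mode.
        "configure terminal",
        f"redun-management interface vlan {mgmt_vlan} "
        f"chassis 1 address {rmi_chassis_1} chassis 2 address {rmi_chassis_2}",
        "end",
        "write memory",
        "reload",
    ]
    return commands


if __name__ == "__main__":
    # Preferred Standby chassis of this article: renumbered from 1 to 2, priority 1.
    for command in sso_cli_commands(1, 2, 1, "192.168.203.33", "192.168.203.34"):
        print(command)
```

The RMI addresses are passed in the same order for both chassis, which matches the note made in the GUI section. The interactive confirmation prompts are not handled by this sketch.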

HA synchronisation: chassis priority, Active/Standby states

When a controller configured with SSO is booting, it performs a discovery phase during the boot time to advertise its capabilities and to find a peer.

During a new HA synchronisation where both chassis have no role yet (initial setup for example), the chassis with the highest priority will be elected as Active chassis. The default priority is 1.

If both chassis have the same priority, the chassis with the lowest bootup time is chosen. If both chassis have the same bootup time, the chassis with the lowest Ethernet Base MAC address will be the Active chassis.

If a chassis is re-joining an existing controller in Active mode, the current Active continues to be Active regardless of the priority and the joining chassis will be assigned to the Standby role.
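The election rules above can be sketched as a simple comparison. This Python example only illustrates the ordering described in this article (existing Active first, then priority, bootup time and MAC address); it is not the controller's actual implementation, and the Chassis class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chassis:
    """Hypothetical view of the attributes used during the election."""
    chassis_id: int
    priority: int
    bootup_seconds: float
    mac: str                 # for example "0800.27cb.16f0"
    is_active: bool = False  # True if this chassis already holds the Active role


def mac_as_int(mac: str) -> int:
    """Convert a dotted MAC address into an integer so that it can be compared."""
    return int(mac.replace(".", ""), 16)


def elect_active(first: Chassis, second: Chassis) -> Chassis:
    """Return the chassis that gets the Active role."""
    # No pre-emption: an existing Active chassis keeps its role.
    for chassis in (first, second):
        if chassis.is_active:
            return chassis
    # Otherwise: highest priority, then lowest bootup time, then lowest MAC address.
    return min(
        (first, second),
        key=lambda c: (-c.priority, c.bootup_seconds, mac_as_int(c.mac)),
    )


if __name__ == "__main__":
    chassis_1 = Chassis(1, priority=2, bootup_seconds=150.0, mac="0800.27cb.16f0")
    chassis_2 = Chassis(2, priority=1, bootup_seconds=150.0, mac="0800.274d.a612")
    print(elect_active(chassis_1, chassis_2).chassis_id)
```

With the priorities used in this article (2 for chassis 1 and 1 for chassis 2), chassis 1 wins an initial election. If chassis 2 is already Active when chassis 1 rejoins, chassis 2 stays Active.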

The Standby chassis then synchronises with the Active chassis. After the bulk synchronisation, the Standby chassis automatically reloads, and both chassis then operate in SSO mode, fully synchronised.

The two chassis share a MAC address, taken from the first Active unit’s MAC address, and a common management IP address. The stack is transparent to the APs and to the management users. CLI commands are only applied to the Active chassis, and each configuration change is incrementally synchronised to the Standby chassis.

After peering, the RP interface disappears from the configuration. However, the RP IP addresses are still visible in the “show” commands; they are derived from the last two octets of the RMI IP address (169.254.x.y).
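The derivation of the RP IP address can be illustrated with a few lines of Python. This is a minimal sketch of the rule stated above (169.254 followed by the last two octets of the RMI IP address); the function name is hypothetical.

```python
import ipaddress


def rp_ip_from_rmi(rmi_ip: str) -> str:
    """Build the 169.254.x.y RP address from the last two octets of the RMI address."""
    octets = ipaddress.IPv4Address(rmi_ip).packed  # four bytes, most significant first
    return str(ipaddress.IPv4Address(bytes([169, 254, octets[2], octets[3]])))


if __name__ == "__main__":
    for rmi in ("192.168.203.33", "192.168.203.34"):
        print(rmi, "->", rp_ip_from_rmi(rmi))
```

For the RMI addresses used in this article, this gives 169.254.203.33 and 169.254.203.34, which are the RP addresses reported by the controllers.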

The redundancy state and RP IP addresses can be seen using GUI (Monitoring | System | Redundancy | General tab) or CLI:

9800-CL#show chassis rmi

Chassis/Stack Mac Address : 0800.27cb.16f0 - Local Mac Address

Mac persistency wait time: Indefinite

                                             H/W   Current

Chassis#   Role    Mac Address     Priority Version  State                 IP                RMI-IP

--------------------------------------------------------------------------------------------------------

*1       Active   0800.27cb.16f0     2      V02     Ready                169.254.203.33     192.168.203.33

 2       Standby  0800.274d.a612     1      V02     Ready                169.254.203.34     192.168.203.34

This output displays the chassis number, the priority, RP IP, RMI IP, MAC address of each chassis, their roles and states. The * symbol indicates the chassis console from which you run the command.
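When this check has to be repeated on several pairs, the output can be parsed by a script. The following Python sketch assumes that the output of "show chassis rmi" has already been collected as text (how it is collected is out of scope) and that it follows the layout shown above; the function names are hypothetical.

```python
import re

# One data row of "show chassis rmi"; a leading "*" marks the local chassis.
ROW = re.compile(
    r"^\s*(?P<local>\*)?\s*(?P<chassis>\d+)\s+(?P<role>\S+)\s+(?P<mac>[0-9a-fA-F.]+)\s+"
    r"(?P<priority>\d+)\s+(?P<hw>\S+)\s+(?P<state>\S+)\s+(?P<rp_ip>[\d.]+)\s+(?P<rmi_ip>[\d.]+)\s*$"
)

SAMPLE = """
Chassis#   Role    Mac Address     Priority Version  State                 IP                RMI-IP
--------------------------------------------------------------------------------------------------------
*1       Active   0800.27cb.16f0     2      V02     Ready                169.254.203.33     192.168.203.33
 2       Standby  0800.274d.a612     1      V02     Ready                169.254.203.34     192.168.203.34
"""


def parse_show_chassis_rmi(text):
    """Return one dictionary per chassis row; header and separator lines are ignored."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            row = match.groupdict()
            row["local"] = row["local"] == "*"
            row["chassis"] = int(row["chassis"])
            row["priority"] = int(row["priority"])
            rows.append(row)
    return rows


def pair_is_ready(rows):
    """True when exactly one Active and one Standby chassis are both in the Ready state."""
    roles = sorted(row["role"] for row in rows)
    return roles == ["Active", "Standby"] and all(row["state"] == "Ready" for row in rows)


if __name__ == "__main__":
    chassis_rows = parse_show_chassis_rmi(SAMPLE)
    print(chassis_rows)
    print(pair_is_ready(chassis_rows))
```

The regular expression expects single-word values in the Role and State columns, as in the output above; other layouts would need an adjusted pattern.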

The primary address on the Active controller is the management IP address. The RMI IP address is automatically set as a secondary IPv4 address on the management VLAN.

9800-CL#show running-config

interface Vlan1

 ip dhcp client client-id ascii cisco-001e.bdf8.22ff-Vl1

 ip address 192.168.203.33 255.255.255.0 secondary

 ip address 192.168.203.31 255.255.255.0

On the Standby chassis, the RMI IP address is set as the primary IP address of the management VLAN:

9800-CL#show running-config

interface Vlan1

 ip address 192.168.203.34 255.255.255.0

The HA synchronisation is described in the logs:

9800-CL#show logging

HA synchronisation initialized

*Mar 13 10:24:47.335: WLC-HA-Notice: RF Progression event: RF_PROG_ACTIVE_FAST, Switchover triggered

*Mar 13 10:23:44.426: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 1 R0/0: stack_mgr: Stack port 1 on Chassis 1 is down

*Mar 13 10:23:44.426: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 1 R0/0: stack_mgr: Stack port 2 on Chassis 1 is down

*Mar 13 10:23:44.426: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 1 R0/0: stack_mgr: Stack port 1 on Chassis 1 is up

*Mar 13 10:23:44.426: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 1 R0/0: stack_mgr: Stack port 2 on Chassis 1 is up

*Mar 13 10:23:44.448: %STACKMGR-6-CHASSIS_ADDED: Chassis 1 R0/0: stack_mgr: Chassis 1 has been added to the stack.

 

Chassis 2 detected and HA Sync in progress

*Mar 13 10:23:44.747: %RIF_MGR_FSM-6-RP_LINK_UP: Chassis 1 R0/0: rif_mgr: The RP link is UP.

*Mar 13 10:23:45.729: %STACKMGR-6-CHASSIS_ADDED: Chassis 1 R0/0: stack_mgr: Chassis 2 has been added to the stack.

*Mar 13 10:23:46.843: %STACKMGR-6-CHASSIS_ADDED: Chassis 1 R0/0: stack_mgr: Chassis 1 has been added to the stack.

*Mar 13 10:23:48.882: %STACKMGR-6-CHASSIS_ADDED: Chassis 1 R0/0: stack_mgr: Chassis 1 has been added to the stack.

*Mar 13 10:23:48.882: %STACKMGR-6-ACTIVE_ELECTED: Chassis 1 R0/0: stack_mgr: Chassis 1 has been elected ACTIVE.

*Mar 13 10:23:50.494: %RIF_MGR_FSM-6-RP_LINK_UP: Chassis 1 R0/0: rif_mgr: The RP link is UP.

*Mar 13 10:23:50.494: %STACKMGR-1-DUAL_ACTIVE_CFG_MSG: Chassis 1 R0/0: stack_mgr: Dual Active Detection link is available now

*Mar 13 10:24:15.811: %EWLC_HA_LIB_MESSAGE-6-BULK_SYNC_STATE_INFO: Chassis 1 R0/0: wncmgrd: INFO: Bulk sync status : CONFIG_DONE

 

*Mar 13 20:24:59.854: %SYS-6-CLOCKUPDATE: System clock has been updated from 10:24:59 UTC Mon Mar 13 2023 to 20:24:59 Austral Mon Mar 13 2023, configured from console by vty0.

 

Mar 13 20:25:00.279: % Redundancy mode change to SSO

 

Mar 13 20:25:00.279: %VOICE_HA-7-STATUS: NONE->SSO; SSO mode will not take effect until after a platform reload.

Mar 13 20:25:00.508: RMI-HAINFRA-INFO: Learning Management IP: 192.168.203.31, mask: 255.255.255.0, if_number: 11l

 

Mar 13 20:25:05.679: %SYS-6-BOOTTIME: Time taken to reboot after reload =  153 seconds

 

Chassis 2 elected as Standby

Mar 13 20:25:21.946: %IOSXE_REDUNDANCY-6-PEER: Active detected chassis 2 as standby.

Mar 13 20:25:21.943: %STACKMGR-6-STANDBY_ELECTED: Chassis 1 R0/0: stack_mgr: Chassis 2 has been elected STANDBY.

Mar 13 20:25:27.004: %REDUNDANCY-5-PEER_MONITOR_EVENT: Active detected a standby insertion (raw-event=PEER_FOUND(4))

Mar 13 20:25:27.004: %REDUNDANCY-5-PEER_MONITOR_EVENT: Active detected a standby insertion (raw-event=PEER_REDUNDANCY_STATE_CHANGE(5))

Mar 13 20:25:28.850: % Redundancy mode change to SSO

Mar 13 20:25:28.850: %VOICE_HA-7-STATUS: NONE->SSO; SSO mode will not take effect until after a platform reload.

 

Chassis 2 rebooted because of Bulk synchronisation failure

Mar 13 20:26:48.159: Config Sync: Bulk-sync failure due to PRC mismatch. Please check the full list of PRC failures via:

  show redundancy config-sync failures prc

Mar 13 20:26:48.159: Config Sync: Starting lines from PRC file:

- cabundle nvram:ios_core.p7b

Mar 13 20:26:48.159: Config Sync: Bulk-sync failure, Reloading Standby

 

Mar 13 20:26:48.211: %VOICE_HA-7-STATUS: VOICE HA bulk sync done.

Mar 13 20:26:48.819: %RF-5-RF_RELOAD: Peer reload. Reason: Bulk Sync Failure

Mar 13 20:26:49.101: %IOSXE_REDUNDANCY-6-PEER_LOST: Active detected chassis 2 is no longer standby

Mar 13 20:26:49.165: %REDUNDANCY-3-STANDBY_LOST: Standby processor fault (PEER_NOT_PRESENT)

Mar 13 20:26:49.166: %REDUNDANCY-3-STANDBY_LOST: Standby processor fault (PEER_DOWN)

Mar 13 20:26:49.169: %REDUNDANCY-3-STANDBY_LOST: Standby processor fault (PEER_REDUNDANCY_STATE_CHANGE)

Mar 13 20:26:49.858: %RF-5-RF_RELOAD: Peer reload. Reason: EHSA standby down

Mar 13 20:26:49.085: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 1 R0/0: stack_mgr: Stack port 2 on Chassis 1 is down

Mar 13 20:26:49.086: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 1 R0/0: stack_mgr: Stack port 1 on Chassis 1 is down

Mar 13 20:26:49.086: %STACKMGR-6-CHASSIS_REMOVED: Chassis 1 R0/0: stack_mgr: Chassis 2 has been removed from the stack.

Mar 13 20:26:50.682: %RF-5-RF_RELOAD: Peer reload. Reason: Active and Standby configuration out of sync

Mar 13 20:27:00.848: %RIF_MGR_FSM-6-RMI_LINK_UP: Chassis 1 R0/0: rif_mgr: The RMI link is UP.

Mar 13 20:27:00.862: %RIF_MGR_FSM-6-GW_REACHABLE_STANDBY: Chassis 1 R0/0: rif_mgr: Gateway reachable from Standby

Mar 13 20:27:02.088: %RIF_MGR_FSM-6-RMI_LINK_DOWN: Chassis 1 R0/0: rif_mgr: The RMI link is DOWN.

Mar 13 20:27:30.837: %RIF_MGR_FSM-6-RP_LINK_DOWN: Chassis 1 R0/0: rif_mgr: Setting RP link status to DOWN

Mar 13 20:27:30.837: %STACKMGR-1-DUAL_ACTIVE_CFG_MSG: Chassis 1 R0/0: stack_mgr: Dual Active Detection links are not available anymore

 

Chassis 2 re-detected and HA Sync in progress

Mar 13 20:27:56.232: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 1 R0/0: stack_mgr: Stack port 2 on Chassis 1 is up

Mar 13 20:27:56.232: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 1 R0/0: stack_mgr: Stack port 1 on Chassis 1 is up

Mar 13 20:27:56.264: %STACKMGR-6-CHASSIS_ADDED: Chassis 1 R0/0: stack_mgr: Chassis 2 has been added to the stack.

Mar 13 20:27:58.066: %STACKMGR-6-CHASSIS_ADDED: Chassis 1 R0/0: stack_mgr: Chassis 2 has been added to the stack.

Mar 13 20:28:01.726: %RIF_MGR_FSM-6-RP_LINK_UP: Chassis 1 R0/0: rif_mgr: The RP link is UP.

Mar 13 20:28:01.726: %STACKMGR-1-DUAL_ACTIVE_CFG_MSG: Chassis 1 R0/0: stack_mgr: Dual Active Detection link is available now

Mar 13 20:28:11.300: %EWLC_HA_LIB_MESSAGE-6-BULK_SYNC_STATE_INFO: Chassis 1 R0/0: wncmgrd: INFO: Bulk sync status : COLD

Mar 13 20:28:21.730: %RIF_MGR_FSM-6-RP_LINK_UP: Chassis 1 R0/0: rif_mgr: The RP link is UP.

Mar 13 20:28:31.885: %IOSXE_REDUNDANCY-6-PEER: Active detected chassis 2 as standby.

Mar 13 20:28:31.878: %STACKMGR-6-STANDBY_ELECTED: Chassis 1 R0/0: stack_mgr: Chassis 2 has been elected STANDBY.

Mar 13 20:28:41.893: %REDUNDANCY-5-PEER_MONITOR_EVENT: Active detected a standby insertion (raw-event=PEER_FOUND(4))

 

Mar 13 20:28:41.893: %REDUNDANCY-5-PEER_MONITOR_EVENT: Active detected a standby insertion (raw-event=PEER_REDUNDANCY_STATE_CHANGE(5))

 

Mar 13 20:28:52.235: Syncing vlan database

Mar 13 20:28:52.296: Vlan Database sync done from bootflash:vlan.dat to stby-bootflash:vlan.dat (556 bytes)

 

Configuration synchronised and final SSO state

Mar 13 20:30:11.111: %HA_CONFIG_SYNC-6-BULK_CFGSYNC_SUCCEED: Bulk Sync succeeded

Mar 13 20:30:11.161: %VOICE_HA-7-STATUS: VOICE HA bulk sync done.

Mar 13 20:30:12.170: %RF-5-RF_TERMINAL_STATE: Terminal state reached for (SSO)

Mar 13 20:30:12.783: %RIF_MGR_FSM-6-RMI_LINK_UP: Chassis 1 R0/0: rif_mgr: The RMI link is UP.

Mar 13 20:30:24.182: %RIF_MGR_FSM-6-GW_REACHABLE_STANDBY: Chassis 1 R0/0: rif_mgr: Gateway reachable from Standby

Once the pair of chassis is in Active/Standby, the redundancy state is sso (instead of non-redundant) and the hardware mode is duplex (two CPU controller modules, instead of simplex).

The redundancy mode, the redundancy state and the chassis state are displayed by the following command:

9800-CL#show redundancy states

       my state = 13 -ACTIVE

     peer state = 8  -STANDBY HOT

           Mode = Duplex

           Unit = Primary

        Unit ID = 1

 

Redundancy Mode (Operational) = sso

Redundancy Mode (Configured)  = sso

Redundancy State              = sso

     Maintenance Mode = Disabled

    Manual Swact = enabled

 Communications = Up

 

   client count = 129

 client_notification_TMR = 30000 milliseconds

           RF debug mask = 0x0

Gateway Monitoring = Enabled

Gateway monitoring interval  = 8 secs
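The key fields of this output can also be checked by a script. The sketch below is a Python illustration that assumes the text of "show redundancy states" has already been collected; the function names and the choice of the three fields to test are assumptions made for this example.

```python
SAMPLE_STATES = """
       my state = 13 -ACTIVE
     peer state = 8  -STANDBY HOT
           Mode = Duplex
Redundancy Mode (Operational) = sso
Redundancy Mode (Configured)  = sso
Redundancy State              = sso
 Communications = Up
"""


def parse_redundancy_states(text):
    """Turn the 'key = value' lines of the output into a dictionary."""
    states = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            states[key.strip()] = value.strip()
    return states


def is_sso_ready(states):
    """True when the pair is in sso state with a hot Standby and a working peer link."""
    return (
        states.get("Redundancy State") == "sso"
        and states.get("peer state", "").endswith("STANDBY HOT")
        and states.get("Communications") == "Up"
    )


if __name__ == "__main__":
    print(is_sso_ready(parse_redundancy_states(SAMPLE_STATES)))
```

A peer that is still in one of the Cold states, or a Communications field that is not Up, makes the check return False.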

The HA status of the local, Active or Standby chassis can be displayed with:

9800-CL#show chassis ha-status {active | local | standby}

 

9800-CL#show chassis ha-status active

                      My state = ACTIVE

                    Peer state = STANDBY HOT

        Last switchover reason = none

          Last switchover time = none

                 Image Version = 17.6.4

 

Chassis-HA   Local-IP         Remote-IP        MASK             HA-Interface

-----------------------------------------------------------------------------

This Boot: 169.254.203.33   169.254.203.34   255.255.255.0    GigabitEthernet3

 

Next Boot: 169.254.203.33   169.254.203.34   255.255.255.0    GigabitEthernet3

 

 

Chassis-HA   Chassis#    Priority        IFMac Address         Peer-timeout(ms)*Max-retry

-----------------------------------------------------------------------------------------

This Boot:    1             2           08:00:27:CB:16:F0          100*5

 

Next Boot:    1             2           08:00:27:CB:16:F0          100*5

Note that the Local IP and Remote IP for the RP interface are auto-generated from the RMI IP addresses with a 169.254 prefix.

During the HA synchronisation, each peer goes through the following redundancy states:

[Figure: My state (chassis 1) and peer state (chassis 2) on the Active unit after both chassis rebooted]

Cold means that no state information is maintained between the Active and the Standby chassis, so no immediate Switchover can occur. Hot means that the system is redundant and capable of an immediate Switchover.

To display the redundancy history with the chassis redundancy states:

9800-CL#show redundancy history

00:00:06 *my state = INITIALIZATION(2) peer state = DISABLED(1)

00:00:06 *my state = NEGOTIATION(3) peer state = DISABLED(1)

00:00:06 *my state = ACTIVE-FAST(9) peer state = DISABLED(1)

00:00:10 *my state = ACTIVE-DRAIN(10) peer state = DISABLED(1)

00:00:10 *my state = ACTIVE_PRECONFIG(11) peer state = DISABLED(1)

00:00:10 *my state = ACTIVE_POSTCONFIG(12) peer state = DISABLED(1)

00:00:10 *my state = ACTIVE(13) peer state = DISABLED(1)

Mar 13 20:25:00.278 Changing to system clock timestamps at uptime 1905

Mar 13 20:25:05.463 System initialization complete

Mar 13 20:25:28.949  my state = ACTIVE(13) *peer state = UNKNOWN(0)

Mar 13 20:25:29.055  my state = ACTIVE(13) *peer state = STANDBY COLD(4)

Mar 13 20:25:45.959  my state = ACTIVE(13) *peer state = STANDBY_ISSU_NEGOTIATION_LATE(35)

Mar 13 20:25:48.013  my state = ACTIVE(13) *peer state = STANDBY COLD-CONFIG(5)

Mar 13 20:26:33.996  my state = ACTIVE(13) *peer state = STANDBY COLD-FILESYS(6)

Mar 13 20:26:35.336  my state = ACTIVE(13) *peer state = STANDBY COLD-BULK(7)

Mar 13 20:26:48.105  my state = ACTIVE(13) *peer state = STANDBY HOT(8)

Mar 13 20:28:48.074  my state = ACTIVE(13) *peer state = UNKNOWN(0)

Mar 13 20:28:48.185  my state = ACTIVE(13) *peer state = STANDBY COLD(4)

Mar 13 20:29:05.983  my state = ACTIVE(13) *peer state = STANDBY_ISSU_NEGOTIATION_LATE(35)

Mar 13 20:29:08.112  my state = ACTIVE(13) *peer state = STANDBY COLD-CONFIG(5)

Mar 13 20:29:55.516  my state = ACTIVE(13) *peer state = STANDBY COLD-FILESYS(6)

Mar 13 20:29:57.284  my state = ACTIVE(13) *peer state = STANDBY COLD-BULK(7)

Mar 13 20:30:11.065  my state = ACTIVE(13) *peer state = STANDBY HOT(8)

The interfaces are all up on the Active chassis:

9800-CL#show ip interface brief

Interface              IP-Address      OK? Method Status                Protocol

GigabitEthernet1       192.168.201.151 YES NVRAM  up                    up

GigabitEthernet2       unassigned      YES unset  up                    up

Vlan1                  192.168.203.31  YES NVRAM  up                    up

 

The physical interfaces are down on the Standby chassis (except for the RP port that is not displayed here):

9800-CL#show ip interface brief

Interface              IP-Address      OK? Method Status                Protocol

GigabitEthernet1       192.168.201.151 YES NVRAM  down                  down

GigabitEthernet2       unassigned      YES unset  down                  down

Vlan1                  192.168.203.34  YES unset  up                    up

Failover due to controller failure: RP Keepalives, Switchover process

The following scenario describes the failover process when the Active chassis 1 fails (for example, when it is powered down). In this test, the Standby chassis 2 immediately took over the Active role.

The CAPWAP state of all the joined Access Points was maintained in the Standby controller, so they did not go into Discovery mode to join other controllers.

The clients that were in the RUN state were also maintained in the Standby controller, so those clients did not need to re-authenticate to the Access Points.

To detect the failure, the Standby chassis monitors the Active chassis using Keepalive messages that are sent every 100 ms over the Redundancy Port (RP). After five consecutive missed Keepalives (500 ms), the chassis declares its peer unreachable and, if necessary, changes its role.
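The peer timeout is simply the Keepalive interval multiplied by the number of retries. The Python sketch below illustrates this arithmetic; the function name is hypothetical, and the accepted ranges (1 to 10 units of 100 ms, 5 to 10 retries) are the ones shown by the CLI help of the keep-alive commands later in this section.

```python
def peer_timeout_ms(timer_units: int = 1, retries: int = 5) -> int:
    """Return the peer timeout in ms for a keep-alive timer (x100 ms) and a retry count."""
    if not 1 <= timer_units <= 10:
        raise ValueError("keep-alive timer must be between 1 and 10 (multiples of 100 ms)")
    if not 5 <= retries <= 10:
        raise ValueError("keep-alive retries must be between 5 and 10")
    return timer_units * 100 * retries


if __name__ == "__main__":
    print(peer_timeout_ms())        # default values: 100 ms x 5 retries
    print(peer_timeout_ms(10, 10))  # slowest detection allowed by these ranges
```

With the default values, the result is the 500 ms peer timeout reported by the controller.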

The peer Keepalive interval (100 ms by default) and the maximum number of retries (5 by default), which together give the peer timeout (500 ms), are displayed by the following commands:

9800-CL#show chassis ha-status {active | standby | local}

 

                      My state = ACTIVE

                    Peer state = STANDBY HOT

        Last switchover reason = active unit removed

          Last switchover time = 20:26:00 Austral Mon Jan 23 2023

                 Image Version = 17.6.4

 

Chassis-HA   Local-IP         Remote-IP        MASK             HA-Interface

-----------------------------------------------------------------------------

This Boot: 169.254.203.34   169.254.203.33   255.255.255.0    GigabitEthernet3

 

Next Boot: 169.254.203.34   169.254.203.33   255.255.255.0    GigabitEthernet3

 

 

Chassis-HA   Chassis#    Priority        IFMac Address         Peer-timeout(ms)*Max-retry

-----------------------------------------------------------------------------------------

This Boot:    2             1           08:00:27:CB:16:F0          100*5

 

Next Boot:    2             1           08:00:27:CB:16:F0          100*5

 

9800-CL#show platform software stack-mgr chassis {active | standby} R0 peer-timeout

Peer Chassis    Peer-timeout (ms)   50% Mark            75% Mark

--------------------------------------------------------------------------

2               500                 0                   0

The Keepalive counter can be checked with the following command:

9800-CL#show platform software stack-mgr chassis {active | standby} R0 sdp-counters

Stack Discovery Protocol (SDP) Counters

 

---------------------------------------

 

Message                 Tx Success    Tx Fail       Rx Success    Rx Fail

------------------------------------------------------------------------------

Discovery               25            2             29            0

Neighbor                11            3             10            0

Keepalive               3552          670           3552          0

SEPPUKU                 0             0             0             0

Standby Elect Req       2             0             0             0

Standby Elect Ack       0             0             2             0

Standby IOS State       0             0             3             0

Reload Req              1             0             0             0

Reload Ack              0             0             1             0

SESA Mesg               0             0             0             0

RTU Msg                 0             0             0             0

Disc Timer Stop         1             0             2             0

The Keepalive timers can also be changed:

9800-CL#chassis redundancy keep-alive timer ?

  <1-10>  Chassis peer keep-alive time interval in multiple of 100 ms (enter 1 for default)

 

9800-CL#chassis redundancy keep-alive retries ?

  <5-10>  Chassis peer keep-alive retries before claiming peer is down (enter 5 for default)

The Keepalive messages are sent using UDP port 2300 over the RP link.

A packet capture on the RP link shows the moment when chassis 1 failed: from that point on, no Keepalive message was received from chassis 1 (no more UDP 2300 packets from 169.254.203.33).

The chassis 2 logs show a warning once two Keepalive messages were missed, and then the removal of chassis 1 from the stack 500 ms after the first missed Keepalive. The Switchover was very fast (about 1 second) in this case.

9800-CL#show logging
Mar 13 20:37:03.831: %STACKMGR-6-KA_MISSED: Chassis 2 R0/0: stack_mgr: Keepalive missed for 2 times for Chassis 1
Mar 13 20:37:04.133: %STACKMGR-6-CHASSIS_REMOVED: Chassis 2 R0/0: stack_mgr: Chassis 1 has been removed from the stack.
Mar 13 20:37:04.136: %STACKMGR-6-CHASSIS_REMOVED_KA: Chassis 2 R0/0: stack_mgr: Chassis 1 has been removed from the stack due to keepalive failure.
Mar 13 20:37:04.404: %PLATFORM-6-HASTATUS: RP switchover, received chassis event to become active
Mar 13 20:37:04.404: %REDUNDANCY-3-SWITCHOVER: RP switchover (PEER_NOT_PRESENT)
Mar 13 20:37:04.404: %REDUNDANCY-3-SWITCHOVER: RP switchover (PEER_DOWN)
Mar 13 20:37:04.404: %REDUNDANCY-3-SWITCHOVER: RP switchover (PEER_REDUNDANCY_STATE_CHANGE)
Mar 13 20:37:04.422: %PLATFORM-6-HASTATUS: RP switchover, sent message became active. IOS is ready to switch to primary after chassis confirmation
Mar 13 20:37:04.433: %PLATFORM-6-HASTATUS: RP switchover, received chassis event became active
Mar 13 20:37:04.460: RMI-HAINFRA-ERR: ARP Standby: Unexpected source IP on RMI interface, IP: 192.168.203.31
Mar 13 20:37:04.673: %PLATFORM-6-HASTATUS_DETAIL: RP switchover, received chassis event became active. Switch to primary (count 1)
Mar 13 20:37:04.673: %HA-6-SWITCHOVER: Route Processor switched from Standby to being active
Mar 13 20:37:04.717: WLC-HA-Notice: RF Progression event: RF_PROG_ACTIVE_FAST, Switchover triggered
Mar 13 20:37:05.040: RMI-HAINFRA-INFO: Configured primary IP 192.168.203.31/255.255.255.0 on active(mgmt)
Mar 13 20:37:05.041: RMI-HAINFRA-INFO: Configured secondary IP 192.168.203.34/255.255.255.0 on active(mgmt)
Mar 13 20:37:05.059: %VOICE_HA-2-SWITCHOVER_IND: SWITCHOVER, from STANDBY_HOT to ACTIVE state.
Mar 13 20:37:05.063: WLC-HA-Notice: Sending garp intf = GigabitEthernet1, addr=192.168.201.151
Mar 13 20:37:05.077: %PKI-6-CS_ENABLED: Certificate server now enabled.
Mar 13 20:37:05.082: WLC-HA-Notice: Sending garp intf = LIIN0, addr=192.168.1.6
Mar 13 20:37:05.093: WLC-HA-Notice: Sending garp intf = Vlan1, addr=192.168.203.31
Mar 13 20:37:05.728: %CALL_HOME-6-CALL_HOME_ENABLED: Call-home is enabled by Smart Agent for Licensing.
Mar 13 20:37:07.058: %LINK-3-UPDOWN: Interface Null0, changed state to up
Mar 13 20:37:07.058: %LINK-3-UPDOWN: Interface GigabitEthernet1, changed state to up
Mar 13 20:37:07.059: %LINK-3-UPDOWN: Interface GigabitEthernet2, changed state to up
Mar 13 20:37:07.060: %LINK-3-UPDOWN: Interface Vlan1, changed state to up
Mar 13 20:37:08.058: %LINEPROTO-5-UPDOWN: Line protocol on Interface Null0, changed state to up
Mar 13 20:37:08.058: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet1, changed state to up
Mar 13 20:37:08.059: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet2, changed state to up
Mar 13 20:37:08.063: %LINEPROTO-5-UPDOWN: Line protocol on Interface Vlan1, changed state to up
Mar 13 20:37:09.181: %RIF_MGR_FSM-6-GW_REACHABLE_ACTIVE: Chassis 2 R0/0: rif_mgr: Gateway reachable from Active
Mar 13 20:37:14.602: %RIF_MGR_FSM-6-RMI_LINK_DOWN: Chassis 2 R0/0: rif_mgr: The RMI link is DOWN.
Mar 13 20:37:26.779: %RIF_MGR_FSM-6-RP_LINK_DOWN: Chassis 2 R0/0: rif_mgr: Setting RP link status to DOWN
Mar 13 20:37:26.779: %STACKMGR-1-DUAL_ACTIVE_CFG_MSG: Chassis 2 R0/0: stack_mgr: Dual Active Detection links are not available anymore
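To put numbers on the log above, the delay between two messages can be computed from their timestamps. The following is a minimal Python sketch, not a Cisco tool: the function names are hypothetical, only three log lines are copied in, and the year 2023 is an assumption taken from the Switchover history output because these syslog timestamps do not carry a year.

```python
import re
from datetime import datetime

# Three lines copied from the chassis 2 log.
LOG = """\
Mar 13 20:37:03.831: %STACKMGR-6-KA_MISSED: Chassis 2 R0/0: stack_mgr: Keepalive missed for 2 times for Chassis 1
Mar 13 20:37:04.133: %STACKMGR-6-CHASSIS_REMOVED: Chassis 2 R0/0: stack_mgr: Chassis 1 has been removed from the stack.
Mar 13 20:37:04.673: %HA-6-SWITCHOVER: Route Processor switched from Standby to being active
"""

# Timestamp, then the %FACILITY-SEVERITY-MNEMONIC tag.
LINE_RE = re.compile(
    r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}\.\d{3}): %([A-Z0-9_]+-\d-[A-Z0-9_]+):"
)
ASSUMED_YEAR = "2023"  # assumption: not present in the log lines themselves


def first_seen(log: str) -> dict[str, datetime]:
    """Map each syslog tag to the timestamp of its first occurrence."""
    seen: dict[str, datetime] = {}
    for line in log.splitlines():
        match = LINE_RE.match(line)
        if match:
            stamp = datetime.strptime(
                f"{ASSUMED_YEAR} {match.group(1)}", "%Y %b %d %H:%M:%S.%f"
            )
            seen.setdefault(match.group(2), stamp)
    return seen


def elapsed_ms(log: str, start: str, end: str) -> int:
    """Milliseconds between the first `start` and the first `end` message."""
    seen = first_seen(log)
    return round((seen[end] - seen[start]).total_seconds() * 1000)


if __name__ == "__main__":
    print(elapsed_ms(LOG, "STACKMGR-6-KA_MISSED", "STACKMGR-6-CHASSIS_REMOVED"))
    print(elapsed_ms(LOG, "STACKMGR-6-KA_MISSED", "HA-6-SWITCHOVER"))
```

On this sample, chassis 1 is removed 302 ms after the warning and the Route Processor reports being active 842 ms after it, which is consistent with a Switchover of about 1 second.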

The following command on the new Active chassis also shows the loss of communication and the change of chassis states:

9800-CL#show redundancy history
Mar 13 20:30:10.632 *my state = STANDBY HOT(8) peer state = ACTIVE(13)
Mar 13 20:37:04.692 Reloading peer (communication down)
Mar 13 20:37:04.693 Reloading peer (peer presence lost)
Mar 13 20:37:04.693 *my state = ACTIVE-FAST(9) peer state = DISABLED(1)
Mar 13 20:37:05.034 *my state = ACTIVE-DRAIN(10) peer state = DISABLED(1)
Mar 13 20:37:05.050 *my state = ACTIVE_PRECONFIG(11) peer state = DISABLED(1)
Mar 13 20:37:05.056 *my state = ACTIVE_POSTCONFIG(12) peer state = DISABLED(1)
Mar 13 20:37:05.057 *my state = ACTIVE(13) peer state = DISABLED(1)

Chassis 2 declares that chassis 1 has failed (disabled/removed) and then takes over the Active role.
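The state progression in this history can be extracted with a few lines of Python. This is only a sketch that assumes the line layout shown in this output; the names STATE_RE and state_transitions are hypothetical.

```python
import re

HISTORY = """\
Mar 13 20:30:10.632 *my state = STANDBY HOT(8) peer state = ACTIVE(13)
Mar 13 20:37:04.692 Reloading peer (communication down)
Mar 13 20:37:04.693 Reloading peer (peer presence lost)
Mar 13 20:37:04.693 *my state = ACTIVE-FAST(9) peer state = DISABLED(1)
Mar 13 20:37:05.034 *my state = ACTIVE-DRAIN(10) peer state = DISABLED(1)
Mar 13 20:37:05.050 *my state = ACTIVE_PRECONFIG(11) peer state = DISABLED(1)
Mar 13 20:37:05.056 *my state = ACTIVE_POSTCONFIG(12) peer state = DISABLED(1)
Mar 13 20:37:05.057 *my state = ACTIVE(13) peer state = DISABLED(1)
"""

# Time of day, local state name and code, peer state name and code.
STATE_RE = re.compile(
    r"(\d{2}:\d{2}:\d{2}\.\d{3}) \*my state = (.+?)\((\d+)\)"
    r" peer state = (.+?)\((\d+)\)"
)


def state_transitions(history: str) -> list[tuple[str, str, int, str, int]]:
    """Return (time, my_state, my_code, peer_state, peer_code) per state line."""
    return [
        (m.group(1), m.group(2), int(m.group(3)), m.group(4), int(m.group(5)))
        for m in STATE_RE.finditer(history)
    ]


if __name__ == "__main__":
    for time, mine, my_code, peer, peer_code in state_transitions(HISTORY):
        print(f"{time}  {mine}({my_code}) / peer {peer}({peer_code})")
```

On this sample, the local state codes go from 8 (STANDBY HOT) to 13 (ACTIVE) one step at a time, and the "Reloading peer" lines are skipped because they do not describe a state.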

My state (chassis 2) and peer state (chassis 1) after the Active chassis 1 unit was powered down

The console session to the management interface was connected to chassis 2 (marked with *):

9800-CL#show chassis rmi
Chassis/Stack Mac Address : 0800.27cb.16f0 - Local Mac Address
Mac persistency wait time: Indefinite
                                             H/W   Current
Chassis#   Role    Mac Address     Priority Version  State                 IP                RMI-IP
--------------------------------------------------------------------------------------------------------
 1       Member   0000.0000.0000     0      V02     Removed              169.254.203.33     192.168.203.33
*2       Active   0800.274d.a612     1      V02     Ready                169.254.203.34     192.168.203.34

The redundancy mode becomes simplex (non-redundant).

9800-CL#show redundancy states
       my state = 13 -ACTIVE
     peer state = 1  -DISABLED
           Mode = Simplex
           Unit = Primary
        Unit ID = 2

Redundancy Mode (Operational) = Non-redundant
Redundancy Mode (Configured)  = sso
Redundancy State              = Non Redundant
     Maintenance Mode = Disabled
    Manual Swact = disabled (system is simplex (no peer unit))
 Communications = Down      Reason: Simplex mode

   client count = 127
 client_notification_TMR = 30000 milliseconds
           RF debug mask = 0x0
Gateway Monitoring = Enabled
Gateway monitoring interval  = 8 secs

Once chassis 2 had taken the Active role, it sent several broadcast Gratuitous ARP Requests advertising the management IP address to update the network ARP tables. This way, the traffic for the management IP address was redirected to the new Active chassis. Chassis 2 also sent some ARP Requests to the Access Points (192.168.203.53 in this example) and to the default gateway (192.168.203.2 in this example).
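To illustrate what such a Gratuitous ARP Request looks like on the wire, here is a minimal Python sketch that builds the raw Ethernet frame. It is not what the controller runs: the function name is hypothetical, the MAC address used in the example is an assumption (the capture may show a different source MAC), and the all-zero target MAC is one common convention among several.

```python
import socket
import struct


def build_garp_request(mac: str, ip: str) -> bytes:
    """Build a broadcast Gratuitous ARP Request as a raw Ethernet frame.

    `mac` accepts Cisco dotted (0800.274d.a612) or colon notation.
    """
    src_mac = bytes.fromhex(mac.replace(":", "").replace(".", ""))
    ip_bytes = socket.inet_aton(ip)

    # Ethernet header: broadcast destination, our MAC, EtherType ARP.
    eth_header = struct.pack("!6s6sH", b"\xff" * 6, src_mac, 0x0806)

    arp_payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1,            # hardware type: Ethernet
        0x0800,       # protocol type: IPv4
        6,            # hardware address length
        4,            # protocol address length
        1,            # opcode 1 = request
        src_mac,      # sender MAC: the new owner of the IP address
        ip_bytes,     # sender IP: the advertised address
        b"\x00" * 6,  # target MAC: left empty (assumed convention)
        ip_bytes,     # target IP = sender IP, which makes it "gratuitous"
    )
    return eth_header + arp_payload


if __name__ == "__main__":
    frame = build_garp_request("0800.274d.a612", "192.168.203.31")
    print(len(frame), frame.hex())
```

The key point is that the sender IP and the target IP are identical: every device that receives the broadcast updates its ARP entry for that IP address with the sender MAC.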

The association uptime of the Access Point did not reset to 0:

9800-CL#show ap uptime
Number of APs: 1

AP Name                          Ethernet MAC    Radio MAC       AP Up Time                                          Association Up Time
---------------------------------------------------------------------------------------------------------------------------------------------------
AP1850                           38ed.18c8.4a40  38ed.18c9.8f80  2 hours 3 minutes 7 seconds                         12 minutes 16 seconds
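This check can be automated by converting the uptime strings to seconds and comparing the association uptime with the time elapsed since the Switchover. The Python sketch below is only an illustration: the function names are hypothetical, the unit names are assumed to match the output above, and the 120 seconds used in the example is a made-up value, not a measurement.

```python
import re

UNIT_SECONDS = {"day": 86400, "hour": 3600, "minute": 60, "second": 1}


def uptime_to_seconds(text: str) -> int:
    """Convert e.g. '12 minutes 16 seconds' to a number of seconds."""
    return sum(
        int(value) * UNIT_SECONDS[unit]
        for value, unit in re.findall(r"(\d+)\s+(day|hour|minute|second)s?", text)
    )


def survived_switchover(association_uptime: str, seconds_since_switchover: int) -> bool:
    """True if the AP association is older than the Switchover."""
    return uptime_to_seconds(association_uptime) > seconds_since_switchover


if __name__ == "__main__":
    print(uptime_to_seconds("2 hours 3 minutes 7 seconds"))
    print(uptime_to_seconds("12 minutes 16 seconds"))
    # 120 s since the Switchover is a hypothetical value for the example.
    print(survived_switchover("12 minutes 16 seconds", 120))
```

An Access Point that had to rejoin after the Switchover would show an association uptime shorter than the time elapsed since the Switchover.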

The Switchover history displays the time of the Switchover and the reason for it:

9800-CL#show redundancy switchover history
Index  Previous  Current  Switchover             Switchover
       active    active   reason                 time
-----  --------  -------  ----------             ----------
   1       1        2     active unit removed    20:37:05 Austral Mon Mar 13 2023

When the peer chassis is back after failover: bulk synchronisation

When chassis 1 was powered back up, chassis 2 remained in the Active state and chassis 1 took the Hot Standby state. Unlike the primary/secondary/tertiary WLC method in N+1 redundancy, there is no automatic fallback method.

My state (chassis 2) and peer state (chassis 1) after the chassis 1 unit was powered back up

The redundancy mode went back to duplex once the peer chassis was in Standby mode.

9800-CL#show redundancy states
       my state = 13 -ACTIVE
     peer state = 8  -STANDBY HOT
           Mode = Duplex
           Unit = Primary
        Unit ID = 2

Redundancy Mode (Operational) = sso
Redundancy Mode (Configured)  = sso
Redundancy State              = sso
     Maintenance Mode = Disabled
    Manual Swact = enabled
 Communications = Up

   client count = 129
 client_notification_TMR = 30000 milliseconds
           RF debug mask = 0x0
Gateway Monitoring = Enabled
Gateway monitoring interval  = 8 secs
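Since this output is a list of "key = value" lines, it is easy to turn into a simple health check for monitoring. The following Python sketch assumes the field names shown above; the function names and the readiness criteria are my own choices, not an official definition, and only a few lines of each output are copied in.

```python
def parse_redundancy_states(output: str) -> dict[str, str]:
    """Turn the 'key = value' lines of 'show redundancy states' into a dict."""
    fields: dict[str, str] = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank lines and anything without '='
            fields[key.strip()] = value.strip()
    return fields


def sso_is_ready(fields: dict[str, str]) -> bool:
    """True when the pair is redundant and the peer is in Hot Standby."""
    return (
        fields.get("Mode") == "Duplex"
        and fields.get("Redundancy Mode (Operational)") == "sso"
        and fields.get("peer state", "").endswith("STANDBY HOT")
        and fields.get("Communications", "").startswith("Up")
    )


# Excerpt of the output while chassis 1 was powered down.
SIMPLEX = """\
     peer state = 1  -DISABLED
           Mode = Simplex
Redundancy Mode (Operational) = Non-redundant
 Communications = Down      Reason: Simplex mode
"""

# Excerpt of the output once chassis 1 was back in Hot Standby.
DUPLEX = """\
     peer state = 8  -STANDBY HOT
           Mode = Duplex
Redundancy Mode (Operational) = sso
 Communications = Up
"""

if __name__ == "__main__":
    print(sso_is_ready(parse_redundancy_states(SIMPLEX)))
    print(sso_is_ready(parse_redundancy_states(DUPLEX)))
```

The first excerpt is reported as not ready and the second one as ready, which matches the simplex and duplex situations described in this article.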

The console session to the management interface was still connected to chassis 2 (marked with *), which was in Active mode:

9800-CL#show chassis rmi
Chassis/Stack Mac Address : 0800.27cb.16f0 - Local Mac Address
Mac persistency wait time: Indefinite
                                             H/W   Current
Chassis#   Role    Mac Address     Priority Version  State                 IP                RMI-IP
--------------------------------------------------------------------------------------------------------
 1       Standby  0800.27cb.16f0     2      V02     Ready                169.254.203.33     192.168.203.33
*2       Active   0800.274d.a612     1      V02     Ready                169.254.203.34     192.168.203.34

Note that when chassis 1 came back online, it moved into the Hot Standby role regardless of its chassis priority. Chassis 2 remained in its Active role, as shown in its logs:

9800-CL#show logging
Mar 13 20:40:55.507: %STACKMGR-6-CHASSIS_ADDED: Chassis 2 R0/0: stack_mgr: Chassis 1 has been added to the stack.
Mar 13 20:40:57.744: %STACKMGR-6-CHASSIS_ADDED: Chassis 2 R0/0: stack_mgr: Chassis 1 has been added to the stack.
Mar 13 20:41:00.457: %RIF_MGR_FSM-6-RP_LINK_UP: Chassis 2 R0/0: rif_mgr: The RP link is UP.
Mar 13 20:41:00.457: %STACKMGR-1-DUAL_ACTIVE_CFG_MSG: Chassis 2 R0/0: stack_mgr: Dual Active Detection link is available now
Mar 13 20:41:18.404: %EWLC_HA_LIB_MESSAGE-6-BULK_SYNC_STATE_INFO: Chassis 2 R0/0: wncmgrd: INFO: Bulk sync status : COLD
Mar 13 20:41:20.450: %RIF_MGR_FSM-6-RP_LINK_UP: Chassis 2 R0/0: rif_mgr: The RP link is UP.
Mar 13 20:41:31.522: %IOSXE_REDUNDANCY-6-PEER: Active detected chassis 1 as standby.
Mar 13 20:41:31.516: %STACKMGR-6-STANDBY_ELECTED: Chassis 2 R0/0: stack_mgr: Chassis 1 has been elected STANDBY.
Mar 13 20:42:26.531: %REDUNDANCY-5-PEER_MONITOR_EVENT: Active detected a standby insertion (raw-event=PEER_FOUND(4))
Mar 13 20:42:26.531: %REDUNDANCY-5-PEER_MONITOR_EVENT: Active detected a standby insertion (raw-event=PEER_REDUNDANCY_STATE_CHANGE(5))
Mar 13 20:42:37.208: Syncing vlan database
Mar 13 20:42:37.264: Vlan Database sync done from bootflash:vlan.dat to stby-bootflash:vlan.dat (556 bytes)
Mar 13 20:44:04.655: %RIF_MGR_FSM-6-RMI_LINK_UP: Chassis 2 R0/0: rif_mgr: The RMI link is UP.
Mar 13 20:44:05.565: %HA_CONFIG_SYNC-6-BULK_CFGSYNC_SUCCEED: Bulk Sync succeeded
Mar 13 20:44:05.589: %VOICE_HA-7-STATUS: VOICE HA bulk sync done.
Mar 13 20:44:06.695: %RF-5-RF_TERMINAL_STATE: Terminal state reached for (SSO)
Mar 13 20:44:20.095: %RIF_MGR_FSM-6-GW_REACHABLE_STANDBY: Chassis 2 R0/0: rif_mgr: Gateway reachable from Standby

Next, I will describe additional commands and other failover scenarios in a second part.