High-Availability (HA) is a key feature of a wireless LAN design: it aims to keep all the wireless services up for the users without any downtime. It is achieved by eliminating single points of failure, detecting system failures and having redundancy to provide failover.
In a centralized WLAN architecture, the wireless controller is a critical single point of failure: it handles the management and control planes (and optionally the data plane) for all the wireless Access Points. It is therefore important to add redundancy for the wireless controller and to monitor it.
There are three main forms of redundancy with Cisco wireless LAN controllers:
- Port redundancy: bundling the distribution ports of the wireless controller into a Link Aggregation Group (LAG) protects against port failures. CAPWAP traffic can also be load-balanced.
- Creating an HA pair of wireless controllers using Stateful Switchover (SSO). One controller becomes Active while the other controller is in Hot Standby mode. If the Active controller fails or has a network connection failure, the Hot Standby controller detects the failure and takes the Active role. The Access Points see the HA pair as one single controller. All the CAPWAP connections with the Access Points and all the client states are maintained (stateful) with minimal disruption. This article describes SSO redundancy for Cisco 9800 wireless controllers in detail.
- Installing additional backup wireless controllers using an N+1 architecture. There are many aspects to take into consideration in order to make a controller failover as deterministic as possible. Access Points typically have a resilient mechanism to discover all potential controllers, so each Access Point should be configured with its primary/secondary/tertiary (preferred) controllers. However, unlike SSO, N+1 redundancy can disrupt some services for the end users during the failover. Maximum resiliency can be achieved using N+1 redundancy with several SSO pairs of controllers as primary/secondary/tertiary controllers.
Stateful Switchover (SSO)
The HA SSO feature allows the Active and Standby controllers to synchronise the CAPWAP sessions with each Access Point and to maintain the state of their associated wireless clients. In case of failure of the Active controller, the Standby controller performs a Switchover and takes over the Active role. The CAPWAP sessions with each Access Point remain established and the wireless clients remain associated with the new Active controller. The Access Points remain in the RUN state and do not go into the discovery process during a Switchover.
HA SSO provides other benefits. For example, the configuration of the Active and Standby controllers is constantly synchronised. This prevents any configuration discrepancies between the two controllers. Both controllers share the same management IP address. SSO also makes it possible to perform a zero-downtime In-Service Software Upgrade (ISSU).
In addition, there is no pre-emption with HA SSO: a chassis that re-joins the pair does not reclaim the Active role. However, it is possible to trigger a manual Switchover.
Finally, HA SSO can be configured in one of the two following modes:
- RP mode: both controllers, also called chassis, are synchronised as a stack using a dedicated Redundancy Port (RP). One controller receives the Active role during the Active/Standby election. The Standby controller sends Keepalives over the RP link to monitor the Active controller. If the Active controller is not reachable, the Standby controller takes over the Active role.
- RMI+RP mode: this mode introduces an additional Redundancy Management Interface (RMI) using a secondary IP address on the management interface. Through this secondary link between the management interfaces, the Standby controller can also detect a failure of the RP link between both controllers. Similar to the RP interface, the RMI interface also exchanges resource information about the chassis. In the case of an RP link failure, the RMI+RP mode avoids a Dual-Active chassis scenario (Dual-Active Detection). It also supports Management Gateway Failover, which detects the failure of the default gateway.
Requirements
The requirements to form an SSO pair are:
- An HA pair can only be formed between two wireless controllers of the same form factor (C9800-L-C, C9800-40-K9, C9800-80-K9, C9800-CL) and running the same software version.
  - Since release 17.5, the 9800-CL supports Auto-upgrade, where the Active chassis upgrades the Standby chassis to the same version.
- The scale of the C9800-CL must also match (CPU, memory and storage resources).
- HA cannot be formed between 9800-L-C and 9800-L-F.
- The RP port must be of the same type (Copper RP or Fibre RP).
- Both 9800 controllers must use the same interface number to form HA (for example, the gi3 interface).
- The RP link must have the following capacity:
  - Maximum latency: 80 ms RTT
  - Minimum bandwidth: 60 Mbps
  - Minimum MTU: 1500 bytes
- RP mode was already available on the legacy AireOS wireless controllers (AP SSO since version 7.3 and Client SSO since version 7.5). Both AP and Client SSO are available on 9800 from version 16.10.
Since release 17.1 it is recommended to configure HA SSO using RMI+RP mode.
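The requirements above lend themselves to a quick pre-check before attempting to pair two controllers. The following Python sketch is only an illustration: the `Chassis` record, the function names and the measured link values are hypothetical, the 60 Mbps figure is assumed to be a minimum, and the 9800-CL Auto-upgrade exception for software versions is deliberately ignored.

```python
from dataclasses import dataclass
from typing import List

# Thresholds copied from the requirements list above.
MAX_RTT_MS = 80
MIN_BANDWIDTH_MBPS = 60   # assumption: the 60 Mbps figure is a minimum
MIN_MTU_BYTES = 1500


@dataclass
class Chassis:
    """Hypothetical inventory record for one controller."""
    model: str          # for example "C9800-CL"
    version: str        # for example "17.6.4"
    rp_port_type: str   # "copper" or "fibre"
    ha_interface: str   # for example "Gi3"


def pair_issues(a: Chassis, b: Chassis) -> List[str]:
    """Return the requirement violations for a candidate HA pair."""
    issues = []
    if a.model != b.model:
        issues.append("form factor differs")
    if a.version != b.version:
        issues.append("software version differs")
    if a.rp_port_type != b.rp_port_type:
        issues.append("RP port type differs")
    if a.ha_interface != b.ha_interface:
        issues.append("HA interface number differs")
    return issues


def rp_link_issues(rtt_ms: float, bandwidth_mbps: float, mtu_bytes: int) -> List[str]:
    """Return the RP link capacity violations for measured values."""
    issues = []
    if rtt_ms > MAX_RTT_MS:
        issues.append(f"RTT {rtt_ms} ms exceeds {MAX_RTT_MS} ms")
    if bandwidth_mbps < MIN_BANDWIDTH_MBPS:
        issues.append(f"bandwidth {bandwidth_mbps} Mbps is below {MIN_BANDWIDTH_MBPS} Mbps")
    if mtu_bytes < MIN_MTU_BYTES:
        issues.append(f"MTU {mtu_bytes} bytes is below {MIN_MTU_BYTES} bytes")
    return issues


if __name__ == "__main__":
    wlc1 = Chassis("C9800-CL", "17.6.4", "copper", "Gi3")
    wlc2 = Chassis("C9800-CL", "17.6.4", "copper", "Gi3")
    print(pair_issues(wlc1, wlc2) + rp_link_issues(5, 1000, 1500))
```

An empty list means that none of the checked requirements is violated; it does not replace the validation performed by the controllers themselves when the pair forms.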
The following section quickly describes how to configure SSO using RMI+RP mode on C9800-CL.
Configuring Stateful Switchover (SSO) RMI+RP
At boot time, an election determines the Active and Standby chassis. Once the roles have been assigned, the configuration is synchronised between the Active and the Standby controller through a dedicated Redundancy Port (RP). All the configuration of the Active chassis is pushed to the Standby chassis.
A chassis priority {1|2} must be defined for each chassis (chassis ID {1|2}). The chassis with the highest priority value is preferred as the Active controller during an Active/Standby election when the pair of chassis reboots and initialises the stack. Once a chassis is elected as Active, it remains Active until it fails or reboots.
The interface used as the RP port must also be selected. The RP interface is used for the SSO control plane (Keepalives, role selection, etc.).
Finally, the secondary IP address (RMI IP address) of the management interface of each chassis must be allocated.
In the examples of this blog, the following parameters were used:
Wireless controller chassis ID | 1 | 2 |
---|---|---|
Chassis name | 9800-1 | 9800-2 |
HA SSO role | Active | Standby |
RP interface | Gi3 | Gi3 |
Management interface | Gi2 | Gi2 |
Management IP address | 192.168.203.31 | 192.168.203.32 (before SSO, in standalone); 192.168.203.31 (when configured with SSO) |
Secondary IP address | 192.168.203.33 | 192.168.203.34 |
Chassis priority | 2 | 1 |
The HA SSO configuration must be entered on both controllers when they are in Standalone mode or in Day 0 Setup. Both methods (GUI and CLI) are described below:
Using GUI:
On the preferred Active chassis:
In Standalone mode, go to Administration | Device | Redundancy and set the Redundancy Configuration to Enabled.
The following configuration is for the chassis ID 1. (Chassis renumber set to 1).
Enter the RMI IP addresses (secondary IP addresses) for chassis 1 (192.168.203.33) and chassis 2 (192.168.203.34).
Set the HA interface to GigabitEthernet3.
Keep Management Gateway Failover as Enabled.
Set the Active Chassis Priority to 2. The chassis with the highest chassis priority is preferred as the Active chassis during the initial election at boot time.
Click on Apply and configure the other HA chassis before rebooting.
On the preferred Standby chassis:
If the other chassis was already deployed in Standalone mode, go to Administration | Device | Redundancy and set the Redundancy Configuration to Enabled.
The following configuration is for the chassis ID 2. (Chassis renumber set to 2).
Enter the RMI IP addresses (secondary IP addresses) for chassis 1 (192.168.203.33) and for chassis 2 (192.168.203.34).
Note that the IP address values are not inverted in the fields on both the Active and Standby chassis configuration. This is one confusing difference with the original RP-only mode of SSO (not the RMI mode) where the configured IP addresses represent the HA link and are inverted on both the Active and Standby chassis configuration.
Set the HA interface to GigabitEthernet3.
Keep Management Gateway Failover as Enabled.
Set the Active Chassis Priority to 1. The chassis with the lowest chassis priority is preferred as the Standby chassis during the election.
Click on Apply and click on Administration | Reload to reboot the chassis.
Or if the HA chassis was booting for the first time with the Configuration Setup Wizard:
Set Deployment Mode to Standby.
Set Port Number to GigabitEthernet3 and configure the Wireless Management VLAN.
Set the RMI IP address for Chassis 1 (192.168.203.33) and for Chassis 2 (192.168.203.34).
Click on Summary then Finish.
You will see a notification to reload the chassis. Click on Yes.
After both chassis have been rebooted, the election will occur.
The Standby chassis loses its previous wireless management IP address (192.168.203.32), which is no longer used. The wireless management IP address of the Active chassis (192.168.203.31) is now shared across the stack.
To verify that the Redundancy is set up, go to Monitoring | System | Redundancy. The General tab shows that the redundancy state is sso.
Using CLI:
First, we will configure chassis 1 as the preferred Active chassis.
Set the priority of chassis 1 to a high value:
9800-CL#chassis 1 priority 2
9800-CL#show chassis
Chassis/Stack Mac Address : 0800.27cb.16f0 - Local Mac Address
Mac persistency wait time: Indefinite
H/W Current
Chassis# Role Mac Address Priority Version State IP
-------------------------------------------------------------------------------------
*1 Standby 0800.27cb.16f0 2 V02 Ready 0.0.0.0
Enter the following CLI command to configure the gi3 interface as the RP port and to configure the RMI secondary IP addresses of the two chassis:
9800-CL#chassis redundancy ha-interface gigabitEthernet 3
WARNING: Changing the switch HA interface may result in a configuration change for that switch. The configuration associated with the old switch HA interface will remain as a provisioned configuration. New HA interface will be effective after next reboot. Do you want to continue?[y/n]? [yes]: yes
9800-CL(config)#redun-management interface vlan 1 chassis 1 address 192.168.203.33 chassis 2 address 192.168.203.34
Then we will configure the other chassis as preferred Standby.
The chassis ID must be set to 2. By default, the chassis number of a 9800-CL in standalone mode is 1.
9800-CL#chassis 1 renumber 2
WARNING: Changing the switch number may result in a configuration change for that switch. The interface configuration associated with the old switch number will remain as a provisioned configuration. New Switch Number will be effective after next reboot. Do you want to continue?[y/n]? [yes]: yes
Reload this unit to apply the new chassis ID.
9800-CL#reload
Once rebooted, set the priority of chassis 2 to a low priority (default is 1):
9800-CL#chassis 2 priority 1
9800-CL#show chassis
Chassis/Stack Mac Address : 0800.274d.a612 - Local Mac Address
Mac persistency wait time: Indefinite
H/W Current
Chassis# Role Mac Address Priority Version State IP
-------------------------------------------------------------------------------------
*2 Active 0800.274d.a612 1 V02 Ready 0.0.0.0
Enter the same CLI commands to configure the gi3 interface as the RP port and to configure the RMI secondary IP addresses of the two chassis:
9800-CL#chassis redundancy ha-interface gigabitEthernet 3
WARNING: Changing the switch HA interface may result in a configuration change for that switch. The configuration associated with the old switch HA interface will remain as a provisioned configuration. New HA interface will be effective after next reboot. Do you want to continue?[y/n]? [yes]: yes
9800-CL(config)#redun-management interface vlan 1 chassis 1 address 192.168.203.33 chassis 2 address 192.168.203.34
Save and reload the two chassis around the same time or at least within 120 seconds (STACK_DISCOVERY_TIMER) of each other.
9800-CL#write memory
9800-CL#reload
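To avoid typing mistakes when preparing both chassis, the command sequence used above can be rendered from the parameter table. This is a minimal Python sketch, not a provisioning tool: the function name and its arguments are hypothetical, it only builds strings from the commands shown in this article, and it does not connect to any device.

```python
def sso_commands(chassis_id, priority, ha_interface, vlan,
                 rmi_ip_1, rmi_ip_2, current_id=1):
    """Build the ordered CLI lines for one chassis, as used in this article.

    current_id is the chassis number before the change (1 by default on a
    standalone 9800-CL, according to the article).
    """
    cmds = []
    if chassis_id != current_id:
        # The new chassis ID only takes effect after a reload.
        cmds.append(f"chassis {current_id} renumber {chassis_id}")
        cmds.append("reload")
    cmds.append(f"chassis {chassis_id} priority {priority}")
    cmds.append(f"chassis redundancy ha-interface {ha_interface}")
    # Entered in configuration mode, as the (config)# prompt above shows.
    cmds.append(
        f"redun-management interface vlan {vlan} "
        f"chassis 1 address {rmi_ip_1} chassis 2 address {rmi_ip_2}"
    )
    cmds.append("write memory")
    cmds.append("reload")
    return cmds


if __name__ == "__main__":
    common = dict(ha_interface="gigabitEthernet 3", vlan=1,
                  rmi_ip_1="192.168.203.33", rmi_ip_2="192.168.203.34")
    for chassis_id, priority in ((1, 2), (2, 1)):
        print(f"--- chassis {chassis_id} ---")
        print("\n".join(sso_commands(chassis_id, priority, **common)))
```

Note that the `redun-management` line is identical on both chassis, which matches the remark that the RMI addresses are not inverted between the Active and Standby configuration.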
HA synchronisation: chassis priority, Active/Standby states
When a controller configured with SSO boots, it goes through a discovery phase to advertise its capabilities and to find a peer.
During a new HA synchronisation where both chassis have no role yet (initial setup for example), the chassis with the highest priority will be elected as Active chassis. The default priority is 1.
If both chassis have the same priority, the chassis with the lowest bootup time is chosen. If both chassis have the same bootup time, the chassis with the lowest Ethernet Base MAC address will be the Active chassis.
If a chassis is re-joining an existing controller in Active mode, the current Active continues to be Active regardless of the priority and the joining chassis will be assigned to the Standby role.
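The election rules described above can be summarised as an ordered comparison. The Python sketch below is only a model of those rules as stated in this article, not the actual stack manager logic: the `Candidate` record and its field names are hypothetical, and the bootup time is treated as a plain number where the lowest value wins.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical view of one chassis taking part in the election."""
    chassis_id: int
    priority: int = 1             # the default priority is 1
    bootup_time_s: float = 0.0    # assumption: lower value wins a tie
    base_mac: str = "0000.0000.0000"
    is_active: bool = False       # already Active in a running pair


def mac_value(mac: str) -> int:
    """Convert a MAC address such as 0800.27cb.16f0 to an integer."""
    return int(mac.replace(".", "").replace(":", ""), 16)


def elect_active(a: Candidate, b: Candidate) -> Candidate:
    """Return the chassis that gets the Active role."""
    # No pre-emption: an existing Active keeps its role regardless of priority.
    for chassis in (a, b):
        if chassis.is_active:
            return chassis
    # Highest priority first, then lowest bootup time, then lowest base MAC.
    return min(
        (a, b),
        key=lambda c: (-c.priority, c.bootup_time_s, mac_value(c.base_mac)),
    )


if __name__ == "__main__":
    wlc1 = Candidate(1, priority=2, base_mac="0800.27cb.16f0")
    wlc2 = Candidate(2, priority=1, base_mac="0800.274d.a612")
    print("Active chassis:", elect_active(wlc1, wlc2).chassis_id)
```

With the values used in this article, chassis 1 wins because of its higher priority. If both priorities and bootup times were equal, chassis 2 would win in this model because its base MAC address is lower.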
Once the roles are assigned, the Standby chassis synchronises with the Active RP. After the bulk synchronisation, the Standby chassis goes automatically through a power cycle and both chassis will then operate in SSO mode fully synchronised.
The two chassis share a MAC address that comes from the first Active unit’s MAC address, and a common management IP address. The connection to the stack is transparent for the AP and for the management users. CLI commands are only applied to the Active chassis, and each configuration change is incrementally synchronised to the Standby chassis.
After peering, the RP interface disappears from the configuration. However, the RP IP addresses still appear in the “show” commands; they are derived from the last two octets of the RMI IP address (169.254.x.y).
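The derivation of the RP address can be written down in a few lines. This Python sketch only reproduces the rule stated above (169.254 followed by the last two octets of the RMI address); the function name is hypothetical.

```python
import ipaddress


def rp_ip_from_rmi(rmi_ip: str) -> str:
    """Derive the auto-generated RP address from an RMI IPv4 address."""
    # .packed gives the four octets as bytes; invalid input raises ValueError.
    octets = ipaddress.IPv4Address(rmi_ip).packed
    return f"169.254.{octets[2]}.{octets[3]}"


if __name__ == "__main__":
    for rmi in ("192.168.203.33", "192.168.203.34"):
        print(rmi, "->", rp_ip_from_rmi(rmi))
```

For the two RMI addresses of this article, the sketch returns 169.254.203.33 and 169.254.203.34.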
The redundancy state and RP IP addresses can be seen using GUI (Monitoring | System | Redundancy | General tab) or CLI:
9800-CL#show chassis rmi
Chassis/Stack Mac Address : 0800.27cb.16f0 - Local Mac Address
Mac persistency wait time: Indefinite
H/W Current
Chassis# Role Mac Address Priority Version State IP RMI-IP
--------------------------------------------------------------------------------------------------------
*1 Active 0800.27cb.16f0 2 V02 Ready 169.254.203.33 192.168.203.33
2 Standby 0800.274d.a612 1 V02 Ready 169.254.203.34 192.168.203.34
This output displays the chassis number, the priority, RP IP, RMI IP, MAC address of each chassis, their roles and states. The * symbol indicates the chassis console from which you run the command.
The primary address on the Active controller is the management IP address. The RMI IP address is automatically set as a secondary IPv4 address on the management VLAN.
9800-CL#show running-config
…
interface Vlan1
ip dhcp client client-id ascii cisco-001e.bdf8.22ff-Vl1
ip address 192.168.203.33 255.255.255.0 secondary
ip address 192.168.203.31 255.255.255.0
On the Standby chassis, the RMI IP address is set as the primary IP address of the management VLAN:
9800-CL#show running-config
…
interface Vlan1
ip address 192.168.203.34 255.255.255.0
The HA synchronisation is described in the logs:
9800-CL#show logging
HA synchronisation initialized
*Mar 13 10:24:47.335: WLC-HA-Notice: RF Progression event: RF_PROG_ACTIVE_FAST, Switchover triggered
*Mar 13 10:23:44.426: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 1 R0/0: stack_mgr: Stack port 1 on Chassis 1 is down
*Mar 13 10:23:44.426: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 1 R0/0: stack_mgr: Stack port 2 on Chassis 1 is down
*Mar 13 10:23:44.426: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 1 R0/0: stack_mgr: Stack port 1 on Chassis 1 is up
*Mar 13 10:23:44.426: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 1 R0/0: stack_mgr: Stack port 2 on Chassis 1 is up
*Mar 13 10:23:44.448: %STACKMGR-6-CHASSIS_ADDED: Chassis 1 R0/0: stack_mgr: Chassis 1 has been added to the stack.
Chassis 2 detected and HA Sync in progress
*Mar 13 10:23:44.747: %RIF_MGR_FSM-6-RP_LINK_UP: Chassis 1 R0/0: rif_mgr: The RP link is UP.
*Mar 13 10:23:45.729: %STACKMGR-6-CHASSIS_ADDED: Chassis 1 R0/0: stack_mgr: Chassis 2 has been added to the stack.
*Mar 13 10:23:46.843: %STACKMGR-6-CHASSIS_ADDED: Chassis 1 R0/0: stack_mgr: Chassis 1 has been added to the stack.
*Mar 13 10:23:48.882: %STACKMGR-6-CHASSIS_ADDED: Chassis 1 R0/0: stack_mgr: Chassis 1 has been added to the stack.
*Mar 13 10:23:48.882: %STACKMGR-6-ACTIVE_ELECTED: Chassis 1 R0/0: stack_mgr: Chassis 1 has been elected ACTIVE.
*Mar 13 10:23:50.494: %RIF_MGR_FSM-6-RP_LINK_UP: Chassis 1 R0/0: rif_mgr: The RP link is UP.
*Mar 13 10:23:50.494: %STACKMGR-1-DUAL_ACTIVE_CFG_MSG: Chassis 1 R0/0: stack_mgr: Dual Active Detection link is available now
*Mar 13 10:24:15.811: %EWLC_HA_LIB_MESSAGE-6-BULK_SYNC_STATE_INFO: Chassis 1 R0/0: wncmgrd: INFO: Bulk sync status : CONFIG_DONE
*Mar 13 20:24:59.854: %SYS-6-CLOCKUPDATE: System clock has been updated from 10:24:59 UTC Mon Mar 13 2023 to 20:24:59 Austral Mon Mar 13 2023, configured from console by vty0.
Mar 13 20:25:00.279: % Redundancy mode change to SSO
Mar 13 20:25:00.279: %VOICE_HA-7-STATUS: NONE->SSO; SSO mode will not take effect until after a platform reload.
Mar 13 20:25:00.508: RMI-HAINFRA-INFO: Learning Management IP: 192.168.203.31, mask: 255.255.255.0, if_number: 11l
Mar 13 20:25:05.679: %SYS-6-BOOTTIME: Time taken to reboot after reload = 153 seconds
Chassis 2 elected as Standby
Mar 13 20:25:21.946: %IOSXE_REDUNDANCY-6-PEER: Active detected chassis 2 as standby.
Mar 13 20:25:21.943: %STACKMGR-6-STANDBY_ELECTED: Chassis 1 R0/0: stack_mgr: Chassis 2 has been elected STANDBY.
Mar 13 20:25:27.004: %REDUNDANCY-5-PEER_MONITOR_EVENT: Active detected a standby insertion (raw-event=PEER_FOUND(4))
Mar 13 20:25:27.004: %REDUNDANCY-5-PEER_MONITOR_EVENT: Active detected a standby insertion (raw-event=PEER_REDUNDANCY_STATE_CHANGE(5))
Mar 13 20:25:28.850: % Redundancy mode change to SSO
Mar 13 20:25:28.850: %VOICE_HA-7-STATUS: NONE->SSO; SSO mode will not take effect until after a platform reload.
Chassis 2 rebooted because of Bulk synchronisation failure
Mar 13 20:26:48.159: Config Sync: Bulk-sync failure due to PRC mismatch. Please check the full list of PRC failures via:
show redundancy config-sync failures prc
Mar 13 20:26:48.159: Config Sync: Starting lines from PRC file:
- cabundle nvram:ios_core.p7b
Mar 13 20:26:48.159: Config Sync: Bulk-sync failure, Reloading Standby
Mar 13 20:26:48.211: %VOICE_HA-7-STATUS: VOICE HA bulk sync done.
Mar 13 20:26:48.819: %RF-5-RF_RELOAD: Peer reload. Reason: Bulk Sync Failure
Mar 13 20:26:49.101: %IOSXE_REDUNDANCY-6-PEER_LOST: Active detected chassis 2 is no longer standby
Mar 13 20:26:49.165: %REDUNDANCY-3-STANDBY_LOST: Standby processor fault (PEER_NOT_PRESENT)
Mar 13 20:26:49.166: %REDUNDANCY-3-STANDBY_LOST: Standby processor fault (PEER_DOWN)
Mar 13 20:26:49.169: %REDUNDANCY-3-STANDBY_LOST: Standby processor fault (PEER_REDUNDANCY_STATE_CHANGE)
Mar 13 20:26:49.858: %RF-5-RF_RELOAD: Peer reload. Reason: EHSA standby down
Mar 13 20:26:49.085: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 1 R0/0: stack_mgr: Stack port 2 on Chassis 1 is down
Mar 13 20:26:49.086: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 1 R0/0: stack_mgr: Stack port 1 on Chassis 1 is down
Mar 13 20:26:49.086: %STACKMGR-6-CHASSIS_REMOVED: Chassis 1 R0/0: stack_mgr: Chassis 2 has been removed from the stack.
Mar 13 20:26:50.682: %RF-5-RF_RELOAD: Peer reload. Reason: Active and Standby configuration out of sync
Mar 13 20:27:00.848: %RIF_MGR_FSM-6-RMI_LINK_UP: Chassis 1 R0/0: rif_mgr: The RMI link is UP.
Mar 13 20:27:00.862: %RIF_MGR_FSM-6-GW_REACHABLE_STANDBY: Chassis 1 R0/0: rif_mgr: Gateway reachable from Standby
Mar 13 20:27:02.088: %RIF_MGR_FSM-6-RMI_LINK_DOWN: Chassis 1 R0/0: rif_mgr: The RMI link is DOWN.
Mar 13 20:27:30.837: %RIF_MGR_FSM-6-RP_LINK_DOWN: Chassis 1 R0/0: rif_mgr: Setting RP link status to DOWN
Mar 13 20:27:30.837: %STACKMGR-1-DUAL_ACTIVE_CFG_MSG: Chassis 1 R0/0: stack_mgr: Dual Active Detection links are not available anymore
Chassis 2 re-detected and HA Sync in progress
Mar 13 20:27:56.232: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 1 R0/0: stack_mgr: Stack port 2 on Chassis 1 is up
Mar 13 20:27:56.232: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 1 R0/0: stack_mgr: Stack port 1 on Chassis 1 is up
Mar 13 20:27:56.264: %STACKMGR-6-CHASSIS_ADDED: Chassis 1 R0/0: stack_mgr: Chassis 2 has been added to the stack.
Mar 13 20:27:58.066: %STACKMGR-6-CHASSIS_ADDED: Chassis 1 R0/0: stack_mgr: Chassis 2 has been added to the stack.
Mar 13 20:28:01.726: %RIF_MGR_FSM-6-RP_LINK_UP: Chassis 1 R0/0: rif_mgr: The RP link is UP.
Mar 13 20:28:01.726: %STACKMGR-1-DUAL_ACTIVE_CFG_MSG: Chassis 1 R0/0: stack_mgr: Dual Active Detection link is available now
Mar 13 20:28:11.300: %EWLC_HA_LIB_MESSAGE-6-BULK_SYNC_STATE_INFO: Chassis 1 R0/0: wncmgrd: INFO: Bulk sync status : COLD
Mar 13 20:28:21.730: %RIF_MGR_FSM-6-RP_LINK_UP: Chassis 1 R0/0: rif_mgr: The RP link is UP.
Mar 13 20:28:31.885: %IOSXE_REDUNDANCY-6-PEER: Active detected chassis 2 as standby.
Mar 13 20:28:31.878: %STACKMGR-6-STANDBY_ELECTED: Chassis 1 R0/0: stack_mgr: Chassis 2 has been elected STANDBY.
Mar 13 20:28:41.893: %REDUNDANCY-5-PEER_MONITOR_EVENT: Active detected a standby insertion (raw-event=PEER_FOUND(4))
Mar 13 20:28:41.893: %REDUNDANCY-5-PEER_MONITOR_EVENT: Active detected a standby insertion (raw-event=PEER_REDUNDANCY_STATE_CHANGE(5))
Mar 13 20:28:52.235: Syncing vlan database
Mar 13 20:28:52.296: Vlan Database sync done from bootflash:vlan.dat to stby-bootflash:vlan.dat (556 bytes)
Configuration synchronised and final SSO state
Mar 13 20:30:11.111: %HA_CONFIG_SYNC-6-BULK_CFGSYNC_SUCCEED: Bulk Sync succeeded
Mar 13 20:30:11.161: %VOICE_HA-7-STATUS: VOICE HA bulk sync done.
Mar 13 20:30:12.170: %RF-5-RF_TERMINAL_STATE: Terminal state reached for (SSO)
Mar 13 20:30:12.783: %RIF_MGR_FSM-6-RMI_LINK_UP: Chassis 1 R0/0: rif_mgr: The RMI link is UP.
Mar 13 20:30:24.182: %RIF_MGR_FSM-6-GW_REACHABLE_STANDBY: Chassis 1 R0/0: rif_mgr: Gateway reachable from Standby
Once the pair of chassis is in Active/Standby, the redundancy state is sso (instead of non-redundant) and the hardware mode is duplex (two CPU controller modules, instead of simplex).
The redundancy mode, the redundancy state and the chassis state are displayed by the following command:
9800-CL#show redundancy states
my state = 13 -ACTIVE
peer state = 8 -STANDBY HOT
Mode = Duplex
Unit = Primary
Unit ID = 1
Redundancy Mode (Operational) = sso
Redundancy Mode (Configured) = sso
Redundancy State = sso
Maintenance Mode = Disabled
Manual Swact = enabled
Communications = Up
client count = 129
client_notification_TMR = 30000 milliseconds
RF debug mask = 0x0
Gateway Monitoring = Enabled
Gateway monitoring interval = 8 secs
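For monitoring purposes, this output is easy to parse because every line follows a "key = value" layout. The Python sketch below is an illustration under assumptions: the function names are hypothetical, the health criteria (state sso, peer in STANDBY HOT, communications Up) are one reasonable reading of the output above, and how the text is collected from the controller is left out.

```python
SAMPLE = """\
my state = 13 -ACTIVE
peer state = 8 -STANDBY HOT
Mode = Duplex
Redundancy Mode (Operational) = sso
Redundancy Mode (Configured) = sso
Redundancy State = sso
Manual Swact = enabled
Communications = Up
"""


def parse_redundancy_states(text: str) -> dict:
    """Turn 'key = value' lines into a dictionary with lower-case keys."""
    fields = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip prompts and blank lines
        key, _, value = line.partition("=")
        fields[key.strip().lower()] = value.strip()
    return fields


def is_sso_ready(fields: dict) -> bool:
    """True when the pair looks fully redundant according to the output."""
    return (
        fields.get("redundancy state") == "sso"
        and fields.get("peer state", "").endswith("STANDBY HOT")
        and fields.get("communications") == "Up"
    )


if __name__ == "__main__":
    print("SSO ready:", is_sso_ready(parse_redundancy_states(SAMPLE)))
```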
To display the HA status of the local, Active or Standby chassis:
9800-CL#show chassis ha-status {active | local | standby}
9800-CL#show chassis ha-status active
My state = ACTIVE
Peer state = STANDBY HOT
Last switchover reason = none
Last switchover time = none
Image Version = 17.6.4
Chassis-HA Local-IP Remote-IP MASK HA-Interface
-----------------------------------------------------------------------------
This Boot: 169.254.203.33 169.254.203.34 255.255.255.0 GigabitEthernet3
Next Boot: 169.254.203.33 169.254.203.34 255.255.255.0 GigabitEthernet3
Chassis-HA Chassis# Priority IFMac Address Peer-timeout(ms)*Max-retry
-----------------------------------------------------------------------------------------
This Boot: 1 2 08:00:27:CB:16:F0 100*5
Next Boot: 1 2 08:00:27:CB:16:F0 100*5
Note that the Local IP and Remote IP for the RP interface are auto-generated from the RMI IP addresses with a 169.254 prefix.
During the HA synchronisation, each peer goes through the following redundancy states:
Figure: My state (chassis 1) and peer state (chassis 2) on the Active unit after both chassis rebooted
The Cold adjective means that no state information is maintained yet between the Active and the Standby, so no immediate Switchover can occur. The Hot adjective means that the system is redundant and capable of immediate Switchover.
To display the redundancy history with the chassis redundancy states:
9800-CL#show redundancy history
00:00:06 *my state = INITIALIZATION(2) peer state = DISABLED(1)
00:00:06 *my state = NEGOTIATION(3) peer state = DISABLED(1)
00:00:06 *my state = ACTIVE-FAST(9) peer state = DISABLED(1)
00:00:10 *my state = ACTIVE-DRAIN(10) peer state = DISABLED(1)
00:00:10 *my state = ACTIVE_PRECONFIG(11) peer state = DISABLED(1)
00:00:10 *my state = ACTIVE_POSTCONFIG(12) peer state = DISABLED(1)
00:00:10 *my state = ACTIVE(13) peer state = DISABLED(1)
Mar 13 20:25:00.278 Changing to system clock timestamps at uptime 1905
Mar 13 20:25:05.463 System initialization complete
Mar 13 20:25:28.949 my state = ACTIVE(13) *peer state = UNKNOWN(0)
Mar 13 20:25:29.055 my state = ACTIVE(13) *peer state = STANDBY COLD(4)
Mar 13 20:25:45.959 my state = ACTIVE(13) *peer state = STANDBY_ISSU_NEGOTIATION_LATE(35)
Mar 13 20:25:48.013 my state = ACTIVE(13) *peer state = STANDBY COLD-CONFIG(5)
Mar 13 20:26:33.996 my state = ACTIVE(13) *peer state = STANDBY COLD-FILESYS(6)
Mar 13 20:26:35.336 my state = ACTIVE(13) *peer state = STANDBY COLD-BULK(7)
Mar 13 20:26:48.105 my state = ACTIVE(13) *peer state = STANDBY HOT(8)
Mar 13 20:28:48.074 my state = ACTIVE(13) *peer state = UNKNOWN(0)
Mar 13 20:28:48.185 my state = ACTIVE(13) *peer state = STANDBY COLD(4)
Mar 13 20:29:05.983 my state = ACTIVE(13) *peer state = STANDBY_ISSU_NEGOTIATION_LATE(35)
Mar 13 20:29:08.112 my state = ACTIVE(13) *peer state = STANDBY COLD-CONFIG(5)
Mar 13 20:29:55.516 my state = ACTIVE(13) *peer state = STANDBY COLD-FILESYS(6)
Mar 13 20:29:57.284 my state = ACTIVE(13) *peer state = STANDBY COLD-BULK(7)
Mar 13 20:30:11.065 my state = ACTIVE(13) *peer state = STANDBY HOT(8)
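The progression of the peer from Cold to Hot can be extracted from this history with a small parser. The following Python sketch is illustrative only: the function names are hypothetical and the regular expression is based solely on the line layout shown above.

```python
import re

# Matches e.g. "peer state = STANDBY COLD-CONFIG(5)".
PEER_STATE = re.compile(r"peer state = (.+?)\((\d+)\)")

SAMPLE = """\
Mar 13 20:28:48.074 my state = ACTIVE(13) *peer state = UNKNOWN(0)
Mar 13 20:28:48.185 my state = ACTIVE(13) *peer state = STANDBY COLD(4)
Mar 13 20:29:05.983 my state = ACTIVE(13) *peer state = STANDBY_ISSU_NEGOTIATION_LATE(35)
Mar 13 20:29:08.112 my state = ACTIVE(13) *peer state = STANDBY COLD-CONFIG(5)
Mar 13 20:29:55.516 my state = ACTIVE(13) *peer state = STANDBY COLD-FILESYS(6)
Mar 13 20:29:57.284 my state = ACTIVE(13) *peer state = STANDBY COLD-BULK(7)
Mar 13 20:30:11.065 my state = ACTIVE(13) *peer state = STANDBY HOT(8)
"""


def peer_states(history: str) -> list:
    """Return the successive (name, code) peer states found in the history."""
    states = []
    for line in history.splitlines():
        match = PEER_STATE.search(line)
        if match:
            states.append((match.group(1).strip(), int(match.group(2))))
    return states


def ready_for_switchover(states: list) -> bool:
    """Only a peer whose latest state is STANDBY HOT can take over at once."""
    return bool(states) and states[-1][0] == "STANDBY HOT"


if __name__ == "__main__":
    for name, code in peer_states(SAMPLE):
        print(f"{code:>2}  {name}")
    print("Ready:", ready_for_switchover(peer_states(SAMPLE)))
```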
The interfaces are all up on the Active chassis:
9800-CL#show ip interface brief
Interface IP-Address OK? Method Status Protocol
GigabitEthernet1 192.168.201.151 YES NVRAM up up
GigabitEthernet2 unassigned YES unset up up
Vlan1 192.168.203.31 YES NVRAM up up
The physical interfaces are down on the Standby chassis (except for the RP port that is not displayed here):
9800-CL#show ip interface brief
Interface IP-Address OK? Method Status Protocol
GigabitEthernet1 192.168.201.151 YES NVRAM down down
GigabitEthernet2 unassigned YES unset down down
Vlan1 192.168.203.34 YES unset up up
Failover due to controller failure: RP Keepalives, Switchover process
The following scenario describes the failover process when the Active chassis 1 fails (powered down for example). The Standby chassis 2 immediately took over the Active role.
The CAPWAP state of all the joined Access Points was maintained in the Standby controller, so they did not go into Discovery mode to join other controllers.
The clients that were in the RUN state were also maintained in the Standby controller. So those clients did not need to re-authenticate to the Access Points.
To detect the failure, the Standby chassis monitors the Active chassis using Keepalive messages that are sent every 100 ms over the Redundancy Port (RP). After 500 ms without a Keepalive, the chassis declares its peer unreachable and changes its role accordingly.
The Keepalive interval (100 ms by default) and the maximum number of retries (5 by default), which together make up the peer timeout (500 ms), are displayed by the following commands:
9800-CL#show chassis ha-status {active | standby | local}
My state = ACTIVE
Peer state = STANDBY HOT
Last switchover reason = active unit removed
Last switchover time = 20:26:00 Austral Mon Jan 23 2023
Image Version = 17.6.4
Chassis-HA Local-IP Remote-IP MASK HA-Interface
-----------------------------------------------------------------------------
This Boot: 169.254.203.34 169.254.203.33 255.255.255.0 GigabitEthernet3
Next Boot: 169.254.203.34 169.254.203.33 255.255.255.0 GigabitEthernet3
Chassis-HA Chassis# Priority IFMac Address Peer-timeout(ms)*Max-retry
-----------------------------------------------------------------------------------------
This Boot: 2 1 08:00:27:CB:16:F0 100*5
Next Boot: 2 1 08:00:27:CB:16:F0 100*5
9800-CL#show platform software stack-mgr chassis {active | standby} R0 peer-timeout
Peer Chassis Peer-timeout (ms) 50% Mark 75% Mark
--------------------------------------------------------------------------
2 500 0 0
The Keepalive counter can be checked with the following command:
9800-CL#show platform software stack-mgr chassis {active | standby} R0 sdp-counters
Stack Discovery Protocol (SDP) Counters
---------------------------------------
Message Tx Success Tx Fail Rx Success Rx Fail
------------------------------------------------------------------------------
Discovery 25 2 29 0
Neighbor 11 3 10 0
Keepalive 3552 670 3552 0
SEPPUKU 0 0 0 0
Standby Elect Req 2 0 0 0
Standby Elect Ack 0 0 2 0
Standby IOS State 0 0 3 0
Reload Req 1 0 0 0
Reload Ack 0 0 1 0
SESA Mesg 0 0 0 0
RTU Msg 0 0 0 0
Disc Timer Stop 1 0 2 0
The Keepalive timers can also be changed:
9800-CL#chassis redundancy keep-alive timer ?
<1-10> Chassis peer keep-alive time interval in multiple of 100 ms (enter 1 for default)
9800-CL#chassis redundancy keep-alive retries ?
<5-10> Chassis peer keep-alive retries before claiming peer is down (enter 5 for default)
The Keepalive messages are sent using UDP port 2300 over the RP link.
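The effect of changing the two Keepalive settings is simple arithmetic: the peer timeout is the interval multiplied by the number of retries. The Python sketch below uses the value ranges shown in the CLI help above; the function name is hypothetical.

```python
def peer_timeout_ms(timer: int = 1, retries: int = 5) -> int:
    """Peer timeout in ms for a keep-alive timer (x100 ms) and retry count."""
    if not 1 <= timer <= 10:
        raise ValueError("timer must be between 1 and 10 (multiples of 100 ms)")
    if not 5 <= retries <= 10:
        raise ValueError("retries must be between 5 and 10")
    return timer * 100 * retries


if __name__ == "__main__":
    print("default:", peer_timeout_ms(), "ms")
    print("slowest:", peer_timeout_ms(10, 10), "ms")
```

The defaults give the 500 ms peer timeout seen in the outputs above, while the largest accepted values would stretch failure detection to 10 seconds.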
The following packet capture shows when the chassis 1 failed and when no Keepalive message was received from the chassis 1 (no more UDP 2300 packet from 169.254.203.33).
The chassis 2 logs show a warning once two Keepalive messages had been missed, and then the removal of chassis 1 from the stack 500 ms after the first missed Keepalive. The Switchover was very fast (about 1 second) in this case.
9800-CL#show logging
Mar 13 20:37:03.831: %STACKMGR-6-KA_MISSED: Chassis 2 R0/0: stack_mgr: Keepalive missed for 2 times for Chassis 1
Mar 13 20:37:04.133: %STACKMGR-6-CHASSIS_REMOVED: Chassis 2 R0/0: stack_mgr: Chassis 1 has been removed from the stack.
Mar 13 20:37:04.136: %STACKMGR-6-CHASSIS_REMOVED_KA: Chassis 2 R0/0: stack_mgr: Chassis 1 has been removed from the stack due to keepalive failure.
Mar 13 20:37:04.404: %PLATFORM-6-HASTATUS: RP switchover, received chassis event to become active
Mar 13 20:37:04.404: %REDUNDANCY-3-SWITCHOVER: RP switchover (PEER_NOT_PRESENT)
Mar 13 20:37:04.404: %REDUNDANCY-3-SWITCHOVER: RP switchover (PEER_DOWN)
Mar 13 20:37:04.404: %REDUNDANCY-3-SWITCHOVER: RP switchover (PEER_REDUNDANCY_STATE_CHANGE)
Mar 13 20:37:04.422: %PLATFORM-6-HASTATUS: RP switchover, sent message became active. IOS is ready to switch to primary after chassis confirmation
Mar 13 20:37:04.433: %PLATFORM-6-HASTATUS: RP switchover, received chassis event became active
Mar 13 20:37:04.460: RMI-HAINFRA-ERR: ARP Standby: Unexpected source IP on RMI interface, IP: 192.168.203.31
Mar 13 20:37:04.673: %PLATFORM-6-HASTATUS_DETAIL: RP switchover, received chassis event became active. Switch to primary (count 1)
Mar 13 20:37:04.673: %HA-6-SWITCHOVER: Route Processor switched from Standby to being active
Mar 13 20:37:04.717: WLC-HA-Notice: RF Progression event: RF_PROG_ACTIVE_FAST, Switchover triggered
Mar 13 20:37:05.040: RMI-HAINFRA-INFO: Configured primary IP 192.168.203.31/255.255.255.0 on active(mgmt)
Mar 13 20:37:05.041: RMI-HAINFRA-INFO: Configured secondary IP 192.168.203.34/255.255.255.0 on active(mgmt)
Mar 13 20:37:05.059: %VOICE_HA-2-SWITCHOVER_IND: SWITCHOVER, from STANDBY_HOT to ACTIVE state.
Mar 13 20:37:05.063: WLC-HA-Notice: Sending garp intf = GigabitEthernet1, addr=192.168.201.151
Mar 13 20:37:05.077: %PKI-6-CS_ENABLED: Certificate server now enabled.
Mar 13 20:37:05.082: WLC-HA-Notice: Sending garp intf = LIIN0, addr=192.168.1.6
Mar 13 20:37:05.093: WLC-HA-Notice: Sending garp intf = Vlan1, addr=192.168.203.31
Mar 13 20:37:05.728: %CALL_HOME-6-CALL_HOME_ENABLED: Call-home is enabled by Smart Agent for Licensing.
Mar 13 20:37:07.058: %LINK-3-UPDOWN: Interface Null0, changed state to up
Mar 13 20:37:07.058: %LINK-3-UPDOWN: Interface GigabitEthernet1, changed state to up
Mar 13 20:37:07.059: %LINK-3-UPDOWN: Interface GigabitEthernet2, changed state to up
Mar 13 20:37:07.060: %LINK-3-UPDOWN: Interface Vlan1, changed state to up
Mar 13 20:37:08.058: %LINEPROTO-5-UPDOWN: Line protocol on Interface Null0, changed state to up
Mar 13 20:37:08.058: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet1, changed state to up
Mar 13 20:37:08.059: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet2, changed state to up
Mar 13 20:37:08.063: %LINEPROTO-5-UPDOWN: Line protocol on Interface Vlan1, changed state to up
Mar 13 20:37:09.181: %RIF_MGR_FSM-6-GW_REACHABLE_ACTIVE: Chassis 2 R0/0: rif_mgr: Gateway reachable from Active
Mar 13 20:37:14.602: %RIF_MGR_FSM-6-RMI_LINK_DOWN: Chassis 2 R0/0: rif_mgr: The RMI link is DOWN.
Mar 13 20:37:26.779: %RIF_MGR_FSM-6-RP_LINK_DOWN: Chassis 2 R0/0: rif_mgr: Setting RP link status to DOWN
Mar 13 20:37:26.779: %STACKMGR-1-DUAL_ACTIVE_CFG_MSG: Chassis 2 R0/0: stack_mgr: Dual Active Detection links are not available anymore
The following command on the new Active chassis also shows the loss of communication and the change of chassis states:
9800-CL#show redundancy history
Mar 13 20:30:10.632 *my state = STANDBY HOT(8) peer state = ACTIVE(13)
Mar 13 20:37:04.692 Reloading peer (communication down)
Mar 13 20:37:04.693 Reloading peer (peer presence lost)
Mar 13 20:37:04.693 *my state = ACTIVE-FAST(9) peer state = DISABLED(1)
Mar 13 20:37:05.034 *my state = ACTIVE-DRAIN(10) peer state = DISABLED(1)
Mar 13 20:37:05.050 *my state = ACTIVE_PRECONFIG(11) peer state = DISABLED(1)
Mar 13 20:37:05.056 *my state = ACTIVE_POSTCONFIG(12) peer state = DISABLED(1)
Mar 13 20:37:05.057 *my state = ACTIVE(13) peer state = DISABLED(1)
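Chassis 2 declares that chassis 1 has failed (peer state DISABLED, chassis removed) and then takes over the Active role, stepping through the intermediate ACTIVE-FAST, ACTIVE-DRAIN, ACTIVE_PRECONFIG and ACTIVE_POSTCONFIG states in well under a second.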
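To put a number on that state progression, the history lines can be parsed with a short script. This is only a sketch: the function names (`parse_transitions`, `progression_seconds`) are mine, and the year is passed in as an assumption because the history output does not print it.

```python
import re
from datetime import datetime

# State lines copied from the `show redundancy history` output above.
HISTORY = """\
Mar 13 20:30:10.632 *my state = STANDBY HOT(8) peer state = ACTIVE(13)
Mar 13 20:37:04.692 Reloading peer (communication down)
Mar 13 20:37:04.693 Reloading peer (peer presence lost)
Mar 13 20:37:04.693 *my state = ACTIVE-FAST(9) peer state = DISABLED(1)
Mar 13 20:37:05.034 *my state = ACTIVE-DRAIN(10) peer state = DISABLED(1)
Mar 13 20:37:05.050 *my state = ACTIVE_PRECONFIG(11) peer state = DISABLED(1)
Mar 13 20:37:05.056 *my state = ACTIVE_POSTCONFIG(12) peer state = DISABLED(1)
Mar 13 20:37:05.057 *my state = ACTIVE(13) peer state = DISABLED(1)
"""

# Matches only the '*my state = NAME(code) peer state = ...' lines.
STATE_LINE = re.compile(
    r"^(?P<stamp>\w{3} +\d+ [\d:.]+) \*my state = (?P<mine>.+?)\((?P<code>\d+)\) peer state = "
)


def parse_transitions(text, year=2023):
    """Return (timestamp, state name, state code) for each 'my state' line.

    The year is not part of the output, so it is supplied by the caller (assumption).
    """
    transitions = []
    for line in text.splitlines():
        match = STATE_LINE.match(line.strip())
        if match is None:
            continue  # skip 'Reloading peer' and any other non-state line
        stamp = datetime.strptime(f"{year} {match['stamp']}", "%Y %b %d %H:%M:%S.%f")
        transitions.append((stamp, match["mine"], int(match["code"])))
    return transitions


def progression_seconds(transitions, first_code=9, last_code=13):
    """Seconds between the first ACTIVE-FAST(9) and the first ACTIVE(13) local state."""
    start = next(t for t, _, code in transitions if code == first_code)
    end = next(t for t, _, code in transitions if code == last_code)
    return (end - start).total_seconds()


steps = parse_transitions(HISTORY)
print(f"ACTIVE-FAST to ACTIVE took {progression_seconds(steps):.3f} s")
```

With the lines above, the progression from ACTIVE-FAST to ACTIVE takes about 0.36 seconds. Note that this only measures the role change on the new Active chassis, not the time needed to detect the keepalive failure in the first place.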
My state (chassis 2) and peer state (chassis 1) after the Active chassis 1 unit was powered down
The console to the management interface was connected to chassis 2 (marked with *):
9800-CL#show chassis rmi
Chassis/Stack Mac Address : 0800.27cb.16f0 - Local Mac Address
Mac persistency wait time: Indefinite
                                            H/W      Current
Chassis#   Role    Mac Address     Priority Version  State            IP                RMI-IP
--------------------------------------------------------------------------------------------------------
1          Member  0000.0000.0000  0        V02      Removed          169.254.203.33    192.168.203.33
*2         Active  0800.274d.a612  1        V02      Ready            169.254.203.34    192.168.203.34
The redundancy mode becomes simplex (non-redundant).
9800-CL#show redundancy states
my state = 13 -ACTIVE
peer state = 1 -DISABLED
Mode = Simplex
Unit = Primary
Unit ID = 2
Redundancy Mode (Operational) = Non-redundant
Redundancy Mode (Configured) = sso
Redundancy State = Non Redundant
Maintenance Mode = Disabled
Manual Swact = disabled (system is simplex (no peer unit))
Communications = Down Reason: Simplex mode
client count = 127
client_notification_TMR = 30000 milliseconds
RF debug mask = 0x0
Gateway Monitoring = Enabled
Gateway monitoring interval = 8 secs
Once chassis 2 had taken the Active role, it sent several broadcast Gratuitous ARP Requests advertising the management IP address in order to update the ARP tables on the network. This redirected the traffic for the management IP address to the new Active chassis. Chassis 2 also sent ARP Requests to the Access Points (192.168.203.53 in this example) and to the default gateway (192.168.203.2 in this example).
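For readers who have never looked inside one, here is a minimal sketch of what a broadcast Gratuitous ARP Request looks like on the wire. It builds the frame bytes only and sends nothing. The sender and target IP fields both carry the advertised address, which is what makes the request "gratuitous". The article does not show which source MAC address the controller uses in these frames, so the chassis 2 MAC below is an assumption for illustration, as are the helper names.

```python
import socket
import struct


def mac_bytes(mac):
    """Convert Cisco dotted notation (e.g. 0800.274d.a612) to 6 raw bytes."""
    return bytes.fromhex(mac.replace(".", ""))


def gratuitous_arp_request(sender_mac, ip):
    """Build an Ethernet frame carrying a gratuitous ARP request for `ip`."""
    src = mac_bytes(sender_mac)
    addr = socket.inet_aton(ip)
    # Ethernet header: broadcast destination, our source MAC, EtherType 0x0806 (ARP).
    ethernet = b"\xff" * 6 + src + struct.pack("!H", 0x0806)
    # ARP header: Ethernet (1), IPv4 (0x0800), 6-byte MAC, 4-byte IP, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # Sender MAC/IP, then target MAC (all zeros here) and target IP = sender IP.
    arp += src + addr + b"\x00" * 6 + addr
    return ethernet + arp


# Management IP from the logs; the MAC address is an assumption (see above).
frame = gratuitous_arp_request("0800.274d.a612", "192.168.203.31")
print(len(frame), frame.hex())
```

Every host or switch that receives this broadcast can refresh its ARP or MAC table entry for 192.168.203.31 without having asked for it, which is how the management traffic follows the new Active chassis.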
The association uptime of the Access Point did not reset to 0:
9800-CL#show ap uptime
Number of APs: 1
AP Name Ethernet MAC Radio MAC AP Up Time Association Up Time
---------------------------------------------------------------------------------------------------------------------------------------------------
AP1850 38ed.18c8.4a40 38ed.18c9.8f80 2 hours 3 minutes 7 seconds 12 minutes 16 seconds
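The reasoning here is simple: if the Access Point had lost its CAPWAP connection during the switchover, its association uptime could not be longer than the time elapsed since the switchover. A small sketch of that check follows. The function names are mine, the handling of a "days" unit is an assumption about how longer uptimes are printed, and the 300 seconds used in the example is a hypothetical value because the article does not say exactly when the command was run.

```python
import re

UNIT_SECONDS = {"day": 86400, "hour": 3600, "minute": 60, "second": 1}


def uptime_seconds(text):
    """Convert a string such as '2 hours 3 minutes 7 seconds' to seconds."""
    total = 0
    for value, unit in re.findall(r"(\d+)\s+(day|hour|minute|second)s?", text):
        total += int(value) * UNIT_SECONDS[unit]
    return total


def stayed_joined(association_uptime, seconds_since_switchover):
    """True if the AP association is older than the switchover itself."""
    return uptime_seconds(association_uptime) > seconds_since_switchover


# Values from the `show ap uptime` output above.
print(uptime_seconds("2 hours 3 minutes 7 seconds"))
print(uptime_seconds("12 minutes 16 seconds"))
# Hypothetical: the command was run 300 seconds after the switchover.
print(stayed_joined("12 minutes 16 seconds", 300))
```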
The switchover history displays the time of the switchover and the reason for it:
9800-CL#show redundancy switchover history
Index  Previous  Current  Switchover             Switchover
       active    active   reason                 time
-----  --------  -------  ----------             ----------
1      1         2        active unit removed    20:37:05 Austral Mon Mar 13 2023
When the peer chassis is back after failover: bulk synchronisation
When chassis 1 was powered back up, chassis 2 remained in the Active state and chassis 1 took the Hot Standby state. Unlike the primary/secondary/tertiary controller fallback in N+1 redundancy, SSO has no automatic fallback to the original Active chassis.
My state (chassis 2) and peer state (chassis 1) after the chassis 1 unit was powered back up
The redundancy mode went back to duplex once the peer chassis was in Standby mode.
9800-CL#show redundancy states
my state = 13 -ACTIVE
peer state = 8 -STANDBY HOT
Mode = Duplex
Unit = Primary
Unit ID = 2
Redundancy Mode (Operational) = sso
Redundancy Mode (Configured) = sso
Redundancy State = sso
Maintenance Mode = Disabled
Manual Swact = enabled
Communications = Up
client count = 129
client_notification_TMR = 30000 milliseconds
RF debug mask = 0x0
Gateway Monitoring = Enabled
Gateway monitoring interval = 8 secs
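When monitoring an HA pair, the difference between the two `show redundancy states` outputs above (simplex after the failure, duplex after the recovery) is easy to check programmatically. The sketch below parses the `key = value` lines and applies a simple readiness test. The function names and the exact set of fields checked are my own choice, not an official health check.

```python
# Excerpts of the two `show redundancy states` outputs shown in this article.
SIMPLEX = """\
my state = 13 -ACTIVE
peer state = 1 -DISABLED
Mode = Simplex
Redundancy Mode (Operational) = Non-redundant
Redundancy Mode (Configured) = sso
Communications = Down Reason: Simplex mode
"""

DUPLEX = """\
my state = 13 -ACTIVE
peer state = 8 -STANDBY HOT
Mode = Duplex
Redundancy Mode (Operational) = sso
Redundancy Mode (Configured) = sso
Communications = Up
"""


def parse_states(text):
    """Turn 'key = value' lines into a dictionary, ignoring other lines."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def sso_ready(fields):
    """True when a Hot Standby peer is present and SSO is operational."""
    return (
        fields.get("Mode") == "Duplex"
        and fields.get("peer state", "").endswith("STANDBY HOT")
        and fields.get("Redundancy Mode (Operational)") == "sso"
        and fields.get("Communications", "").startswith("Up")
    )


print(sso_ready(parse_states(SIMPLEX)))
print(sso_ready(parse_states(DUPLEX)))
```

The first call reports that the pair is not protected (no peer), the second that the Hot Standby is back and a new switchover would be stateful.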
The console to the management interface was still connected to chassis 2 (marked with *), which was in Active mode:
9800-CL#show chassis rmi
Chassis/Stack Mac Address : 0800.27cb.16f0 - Local Mac Address
Mac persistency wait time: Indefinite
                                            H/W      Current
Chassis#   Role    Mac Address     Priority Version  State            IP                RMI-IP
--------------------------------------------------------------------------------------------------------
1          Standby 0800.27cb.16f0  2        V02      Ready            169.254.203.33    192.168.203.33
*2         Active  0800.274d.a612  1        V02      Ready            169.254.203.34    192.168.203.34
Note that when chassis 1 came back online, it moved into the Hot Standby role even though its chassis priority (2) is higher than the priority of chassis 2 (1). Chassis 2 remained in its Active role, as shown in its logs:
9800-CL#show logging
Mar 13 20:40:55.507: %STACKMGR-6-CHASSIS_ADDED: Chassis 2 R0/0: stack_mgr: Chassis 1 has been added to the stack.
Mar 13 20:40:57.744: %STACKMGR-6-CHASSIS_ADDED: Chassis 2 R0/0: stack_mgr: Chassis 1 has been added to the stack.
Mar 13 20:41:00.457: %RIF_MGR_FSM-6-RP_LINK_UP: Chassis 2 R0/0: rif_mgr: The RP link is UP.
Mar 13 20:41:00.457: %STACKMGR-1-DUAL_ACTIVE_CFG_MSG: Chassis 2 R0/0: stack_mgr: Dual Active Detection link is available now
Mar 13 20:41:18.404: %EWLC_HA_LIB_MESSAGE-6-BULK_SYNC_STATE_INFO: Chassis 2 R0/0: wncmgrd: INFO: Bulk sync status : COLD
Mar 13 20:41:20.450: %RIF_MGR_FSM-6-RP_LINK_UP: Chassis 2 R0/0: rif_mgr: The RP link is UP.
Mar 13 20:41:31.522: %IOSXE_REDUNDANCY-6-PEER: Active detected chassis 1 as standby.
Mar 13 20:41:31.516: %STACKMGR-6-STANDBY_ELECTED: Chassis 2 R0/0: stack_mgr: Chassis 1 has been elected STANDBY.
Mar 13 20:42:26.531: %REDUNDANCY-5-PEER_MONITOR_EVENT: Active detected a standby insertion (raw-event=PEER_FOUND(4))
Mar 13 20:42:26.531: %REDUNDANCY-5-PEER_MONITOR_EVENT: Active detected a standby insertion (raw-event=PEER_REDUNDANCY_STATE_CHANGE(5))
Mar 13 20:42:37.208: Syncing vlan database
Mar 13 20:42:37.264: Vlan Database sync done from bootflash:vlan.dat to stby-bootflash:vlan.dat (556 bytes)
Mar 13 20:44:04.655: %RIF_MGR_FSM-6-RMI_LINK_UP: Chassis 2 R0/0: rif_mgr: The RMI link is UP.
Mar 13 20:44:05.565: %HA_CONFIG_SYNC-6-BULK_CFGSYNC_SUCCEED: Bulk Sync succeeded
Mar 13 20:44:05.589: %VOICE_HA-7-STATUS: VOICE HA bulk sync done.
Mar 13 20:44:06.695: %RF-5-RF_TERMINAL_STATE: Terminal state reached for (SSO)
Mar 13 20:44:20.095: %RIF_MGR_FSM-6-GW_REACHABLE_STANDBY: Chassis 2 R0/0: rif_mgr: Gateway reachable from Standby
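These logs also tell how long the pair stays unprotected while the returning chassis resynchronises. The sketch below extracts a few milestones from the syslog lines and measures them from the first CHASSIS_ADDED message. The function names are mine, the year is an assumption supplied by the caller, and only a subset of the log lines is reproduced.

```python
import re
from datetime import datetime

# Subset of the `show logging` lines above.
LOG = """\
Mar 13 20:40:55.507: %STACKMGR-6-CHASSIS_ADDED: Chassis 2 R0/0: stack_mgr: Chassis 1 has been added to the stack.
Mar 13 20:41:31.516: %STACKMGR-6-STANDBY_ELECTED: Chassis 2 R0/0: stack_mgr: Chassis 1 has been elected STANDBY.
Mar 13 20:42:37.208: Syncing vlan database
Mar 13 20:44:05.565: %HA_CONFIG_SYNC-6-BULK_CFGSYNC_SUCCEED: Bulk Sync succeeded
Mar 13 20:44:06.695: %RF-5-RF_TERMINAL_STATE: Terminal state reached for (SSO)
"""

# timestamp: %FACILITY-SEVERITY-MNEMONIC: message
SYSLOG = re.compile(
    r"^(?P<stamp>\w{3} +\d+ [\d:.]+): %(?P<facility>[A-Z0-9_]+)-(?P<severity>\d)-(?P<mnemonic>[A-Z0-9_]+):"
)


def first_seen(text, year=2023):
    """Map each syslog mnemonic to the timestamp of its first occurrence."""
    seen = {}
    for line in text.splitlines():
        match = SYSLOG.match(line.strip())
        if match is None:
            continue  # lines without a %FACILITY-SEV-MNEMONIC tag are skipped
        stamp = datetime.strptime(f"{year} {match['stamp']}", "%Y %b %d %H:%M:%S.%f")
        seen.setdefault(match["mnemonic"], stamp)
    return seen


def elapsed_since(seen, start="CHASSIS_ADDED"):
    """Seconds from the `start` milestone to every milestone."""
    origin = seen[start]
    return {name: (stamp - origin).total_seconds() for name, stamp in seen.items()}


for name, seconds in elapsed_since(first_seen(LOG)).items():
    print(f"{name:<22} +{seconds:.3f} s")
```

In this lab, chassis 1 was elected Standby about 36 seconds after being detected, and the SSO terminal state was reached a little over three minutes after detection. During that window a second failure would not have been covered by a stateful switchover.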
Next, I will describe additional commands and other failover scenarios in a second part.