This article follows the first part on SSO redundancy on Cisco 9800 wireless controllers. It covers various failover scenarios triggered by power and network failures, along with some extra configuration settings.
Access to the Standby chassis
When the HA pair is established, configuration commands can only be entered on the Active chassis.
Console access to the Standby chassis is not available by default. The console shows the following error message:
9800-CL-stby>?
Standby console disabled
However, the Standby console can be enabled using the following command on the Active chassis:
9800-CL(config)#redundancy
9800-CL(config-red)#main-cpu
9800-CL(config-r-mc)#standby console enable
The Standby console then provides a restricted set of commands, mostly “show” commands:
9800-CL-stby>?
Exec commands:
access-profile Apply user-profile to interface
app-hosting Application hosting
…
Another benefit of the RMI mode over the RP mode is that each chassis can be reached via SSH, HTTPS or NETCONF through its RMI interface. Accessing the Standby chassis from the network can help when troubleshooting HA issues. With the RP mode, only the currently Active chassis was reachable via SSH.
The RMI interface also makes it possible to monitor the reachability of each chassis via ICMP.
C:\WINDOWS\system32>ping 192.168.203.33
Pinging 192.168.203.33 with 32 bytes of data:
Reply from 192.168.203.33: bytes=32 time<1ms TTL=255
Reply from 192.168.203.33: bytes=32 time<1ms TTL=255
C:\WINDOWS\system32>ping 192.168.203.34
Pinging 192.168.203.34 with 32 bytes of data:
Request timed out.
Reply from 192.168.203.34: bytes=32 time=1ms TTL=255
Reply from 192.168.203.34: bytes=32 time=1ms TTL=255
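For monitoring purposes, the ping output above can be reduced to a simple reachability summary. The following is a minimal Python sketch that assumes the Windows ping output format shown in this lab; the function name summarize_ping is hypothetical and not part of any Cisco tool.

```python
import re

# Matches Windows ping reply lines such as
# "Reply from 192.168.203.33: bytes=32 time<1ms TTL=255".
REPLY = re.compile(r"^Reply from ([\d.]+): bytes=\d+ time[<=]\d+ms TTL=\d+$")


def summarize_ping(output: str) -> dict:
    """Count replies and timeouts in the text printed by a Windows ping."""
    replies = 0
    timeouts = 0
    for line in output.splitlines():
        line = line.strip()
        if REPLY.match(line):
            replies += 1
        elif line == "Request timed out.":
            timeouts += 1
    return {"replies": replies, "timeouts": timeouts, "reachable": replies > 0}


# Output captured for the second RMI address in this lab.
PING_SAMPLE = """\
Pinging 192.168.203.34 with 32 bytes of data:
Request timed out.
Reply from 192.168.203.34: bytes=32 time=1ms TTL=255
Reply from 192.168.203.34: bytes=32 time=1ms TTL=255
"""

print(summarize_ping(PING_SAMPLE))
```

A single lost packet, as in the sample, still counts as reachable here; a stricter monitor could alert on any timeout instead.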
Configure a chassis name
Each chassis is identified by its configured chassis ID (1 or 2).
When connecting to each chassis, the console prompt displays the chassis stack hostname (“9800-CL” in this lab) with a suffix indicating the chassis state: 9800-CL# on the Active and 9800-CL-stby# on the Standby.
The local chassis ID can be identified using the show chassis local command, but it is not directly visible from the prompt.
However, a chassis name can be assigned and displayed on the console prompt using the following command:
9800-CL#redun-management hostname chassis 1 name 9800-1 chassis 2 name 9800-2
9800-1#
Manual Switchover
There is no automatic fallback with SSO. However, there is a manual Switchover command which can be performed by the administrator to transition the Active chassis to Hot Standby:
9800-1#redundancy force-switchover
Proceed with switchover to Standby RP? [confirm]
The command will trigger a reload of the Active chassis (9800-1) causing the Standby chassis (9800-2) to become Active.
9800-2#show redundancy switchover history
Index Previous Current Switchover Switchover
active active reason time
----- -------- ------- ---------- ----------
1 1 2 user forced 20:49:22 Austral Mon Mar 13 2023
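When the switchover history is collected by a script, each row can be turned into a structured record. The following Python sketch assumes the column layout shown above (index, previous Active, current Active, free-text reason, time); the names Switchover and parse_switchover_history are hypothetical.

```python
import re
from typing import List, NamedTuple


class Switchover(NamedTuple):
    index: int
    previous_active: int
    current_active: int
    reason: str
    time: str


# Three integers, a free-text reason, then a time starting with HH:MM:SS.
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(.+?)\s+(\d{2}:\d{2}:\d{2}\s.+)$")


def parse_switchover_history(text: str) -> List[Switchover]:
    """Extract the data rows, skipping the header and separator lines."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            index, previous, current, reason, when = match.groups()
            rows.append(
                Switchover(int(index), int(previous), int(current),
                           reason.strip(), when.strip())
            )
    return rows


HISTORY_SAMPLE = """\
Index Previous Current Switchover Switchover
active active reason time
----- -------- ------- ---------- ----------
1 1 2 user forced 20:49:22 Austral Mon Mar 13 2023
"""

for row in parse_switchover_history(HISTORY_SAMPLE):
    print(row.index, row.previous_active, "->", row.current_active, row.reason)
```

The reason is matched lazily up to the first HH:MM:SS token, so multi-word reasons are kept whole.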
When the peer chassis which is in Hot Standby is powered down: no redundancy (simplex).
In this test, the Hot Standby had a system failure in a stack running in SSO mode. The Active chassis detected the loss of Keepalive on the RP link but remained in Active mode. The redundancy mode on the Active chassis became simplex (non-redundant).
9800-1#show redundancy states
my state = 13 -ACTIVE
peer state = 1 -DISABLED
Mode = Simplex
Unit = Primary
Unit ID = 1
Redundancy Mode (Operational) = Non-redundant
Redundancy Mode (Configured) = sso
Redundancy State = Non Redundant
Maintenance Mode = Disabled
Manual Swact = disabled (system is simplex (no peer unit))
Communications = Down Reason: Simplex mode
client count = 127
client_notification_TMR = 30000 milliseconds
RF debug mask = 0x0
Gateway Monitoring = Enabled
Gateway monitoring interval = 8 secs
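The "key = value" layout of show redundancy states is easy to check from a script. The sketch below is a minimal Python example; the helper names and the rule used to decide that a peer is present (operational mode sso and Communications Up) are assumptions drawn from the outputs in this article, not an official health check.

```python
def parse_redundancy_states(text: str) -> dict:
    """Turn 'key = value' lines into a dictionary."""
    states = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an equals sign
            states[key.strip()] = value.strip()
    return states


def has_working_peer(states: dict) -> bool:
    """Assumed rule: SSO is operational and the peer communication is Up."""
    operational = states.get("Redundancy Mode (Operational)", "")
    communications = states.get("Communications", "")
    return operational.lower() == "sso" and communications.startswith("Up")


# Excerpt of the simplex output shown above.
SIMPLEX_SAMPLE = """\
my state = 13 -ACTIVE
peer state = 1 -DISABLED
Mode = Simplex
Redundancy Mode (Operational) = Non-redundant
Redundancy Mode (Configured) = sso
Communications = Down Reason: Simplex mode
"""

states = parse_redundancy_states(SIMPLEX_SAMPLE)
print(states["Mode"], has_working_peer(states))
```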
The failed Standby chassis was removed from the stack.
9800-1#show chassis rmi
Chassis/Stack Mac Address : 0800.27cb.16f0 - Local Mac Address
Mac persistency wait time: Indefinite
H/W Current
Chassis# Role Mac Address Priority Version State IP RMI-IP
--------------------------------------------------------------------------------------------------------
*1 Active 0800.27cb.16f0 2 V02 Ready 169.254.203.33 192.168.203.33
2 Member 0000.0000.0000 0 V02 Removed 169.254.203.34 192.168.203.34
The redundancy history and chassis logs show when the communication was lost with the Standby chassis.
9800-1#show redundancy history
Mar 13 18:44:39.885 my state = ACTIVE(13) *peer state = STANDBY HOT(8)
Mar 13 18:55:32.019 Reloading peer (communication down)
Mar 13 18:55:32.021 Reloading peer (peer presence lost)
9800-1#show logging
Mar 13 18:55:31.488: %IOSXE_REDUNDANCY-6-PEER_LOST: Active detected chassis 2 is no longer Standby
Mar 13 18:55:31.097: %STACKMGR-6-KA_MISSED: Chassis 1 R0/0: stack_mgr: Keepalive missed for 2 times for Chassis 2
Mar 13 18:55:31.489: %REDUNDANCY-3-STANDBY_LOST: Standby processor fault (PEER_NOT_PRESENT)
Mar 13 18:55:31.489: %REDUNDANCY-3-STANDBY_LOST: Standby processor fault (PEER_DOWN)
Mar 13 18:55:31.489: %REDUNDANCY-3-STANDBY_LOST: Standby processor fault (PEER_REDUNDANCY_STATE_CHANGE)
Mar 13 18:55:31.911: %RF-5-RF_RELOAD: Peer reload. Reason: EHSA Standby down
Mar 13 18:55:31.399: %STACKMGR-6-CHASSIS_REMOVED: Chassis 1 R0/0: stack_mgr: Chassis 2 has been removed from the stack.
Mar 13 18:55:31.403: %STACKMGR-6-CHASSIS_REMOVED_KA: Chassis 1 R0/0: stack_mgr: Chassis 2 has been removed from the stack due to keepalive failure.
Mar 13 18:55:48.887: %RIF_MGR_FSM-6-RMI_LINK_DOWN: Chassis 1 R0/0: rif_mgr: The RMI link is DOWN.
Mar 13 18:55:55.430: %RIF_MGR_FSM-6-RP_LINK_DOWN: Chassis 1 R0/0: rif_mgr: Setting RP link status to DOWN
Mar 13 18:55:55.430: %STACKMGR-1-DUAL_ACTIVE_CFG_MSG: Chassis 1 R0/0: stack_mgr: Dual Active Detection links are not available anymore
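Note that the lines returned by show logging are not strictly in timestamp order (PEER_LOST at 18:55:31.488 is printed before KA_MISSED at 18:55:31.097). To rebuild the sequence of events, the lines can be sorted by timestamp. The Python sketch below does this; the syslog lines carry no year, so the year parameter is an assumption (2023, as in the switchover history), and the names LogEntry and parse_log are hypothetical.

```python
import re
from datetime import datetime
from typing import List, NamedTuple, Optional


class LogEntry(NamedTuple):
    when: datetime
    tag: Optional[str]  # e.g. "STACKMGR-6-KA_MISSED", None for untagged lines
    message: str


LINE = re.compile(
    r"^\*?(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\.\d+):\s+"
    r"(?:%(?P<tag>[A-Z0-9_]+-\d-[A-Z0-9_]+):\s+)?"
    r"(?P<msg>.*)$"
)


def parse_log(text: str, year: int = 2023) -> List[LogEntry]:
    """Parse syslog lines and return them sorted by timestamp."""
    entries = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue
        stamp = " ".join(match["stamp"].split())  # normalise spacing
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S.%f")
        entries.append(LogEntry(when, match["tag"], match["msg"]))
    return sorted(entries, key=lambda entry: entry.when)


LOG_SAMPLE = """\
Mar 13 18:55:31.488: %IOSXE_REDUNDANCY-6-PEER_LOST: Active detected chassis 2 is no longer Standby
Mar 13 18:55:31.097: %STACKMGR-6-KA_MISSED: Chassis 1 R0/0: stack_mgr: Keepalive missed for 2 times for Chassis 2
Mar 13 18:55:31.399: %STACKMGR-6-CHASSIS_REMOVED: Chassis 1 R0/0: stack_mgr: Chassis 2 has been removed from the stack.
"""

for entry in parse_log(LOG_SAMPLE):
    print(entry.when.time(), entry.tag)
```

Once sorted, the sequence reads naturally: Keepalives are missed, the chassis is removed from the stack, then the peer is declared lost.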
When the chassis 2 came back online, it moved back to the Hot Standby role and the Active chassis remained in its role.
Failover due to network failure (default gateway not reachable from Active): Switchover, Management Gateway Failover, Active-Recovery mode and Standby-Recovery mode.
In this scenario, the default gateway became unreachable from the Active chassis 1. The uplink of the switch connected to chassis 1 went down but the GigabitEthernet 2 on the Active chassis did not go down.
A Switchover occurred when the loss of default gateway was detected on the Active chassis. It triggered a reboot of the Active chassis and the Standby chassis took over the Active role about 10 seconds after the loss of gateway.
This scenario uses another RMI feature called Management Gateway Failover, or Gateway Reachability Check.
When enabled, the chassis checks the reachability of the default gateway by sending an ICMP Request every second, using the RMI IP address as the source IP address. If 4 ICMP Requests and then 4 ARP Requests to the default gateway fail (8 seconds in total), the 9800 considers the default gateway unreachable.
The duration of the Gateway Failover Interval (8 seconds by default) can be modified using the following command:
management gateway-failover interval <6-12>
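The detection logic described above can be sketched as a small simulation. This is not Cisco's implementation: it assumes one probe per second, that any successful probe resets the failure count, and that the gateway is declared down once the number of consecutive failures reaches the configured interval (8 by default, 6 to 12 allowed). The function names are hypothetical.

```python
from typing import Iterable, List, Optional


def probe_plan(first: str = "ICMP", per_protocol: int = 4) -> List[str]:
    """Default sequence of probes: 4 with one protocol, then 4 with the other."""
    second = "ARP" if first == "ICMP" else "ICMP"
    return [first] * per_protocol + [second] * per_protocol


def declared_down_at(results: Iterable[bool], interval: int = 8) -> Optional[int]:
    """Return the second at which the gateway is declared unreachable.

    results holds one boolean per second (True = probe answered).
    Returns None if the failure threshold is never reached.
    """
    if not 6 <= interval <= 12:
        raise ValueError("interval must be between 6 and 12 seconds")
    consecutive_failures = 0
    for second, answered in enumerate(results, start=1):
        # Assumption: a single answered probe resets the counter.
        consecutive_failures = 0 if answered else consecutive_failures + 1
        if consecutive_failures >= interval:
            return second
    return None


print(probe_plan())
print(declared_down_at([True, True] + [False] * 8))
```

With the default interval, two good probes followed by eight failures give a detection at second 10, that is 8 seconds after the gateway was lost.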
Since release 17.2, the default gateway is taken from the static routes instead of the “ip default-gateway x.x.x.x” command.
Interestingly, the gateway check changed its monitoring protocol (ICMP or ARP) at each gateway-loss event. In this case, the Active chassis was monitoring the default gateway using ARP, then switched to ICMP after 4 ARP failures. In this example, the uplink of the switch was disconnected at 20:59:00. Later in this article, the Active chassis monitors the default gateway using ICMP, then switches to ARP after 4 ICMP failures.
The Active chassis detected the loss of the gateway on the management/RMI interface at 20:59:10 and went into Active-Recovery mode while preparing to reboot with the reason RIF: GW DOWN. Chassis 1 then rebooted at 20:59:29.
A Recovery mode occurs when one resource (RP link, RMI link, gateway) becomes unavailable on the chassis. In this case, the RMI link was down. In Recovery mode, all the ports are administratively down and no synchronisation or configuration change can occur.
9800-1(recovery-mode)#show logging
Apr 2 20:59:10.078: RMI-HAINFRA-INFO: Originating event to Shut all Interfaces
Apr 2 20:59:10.078: RMI-HAINFRA-INFO: Shutting down all interfaces in ActiveRecovery
Apr 2 20:59:10.080: RMI-HAINFRA-INFO: Not shutting down the interface-rmi: Vlan1
Apr 2 20:59:10.054: %RIF_MGR_FSM-6-GW_UNREACHABLE_ACTIVE: Chassis 1 R0/0: rif_mgr: Gateway not reachable from Active
Apr 2 20:59:10.059: %STACKMGR-1-RELOAD: Chassis 1 R0/0: stack_mgr: Reloading due to reason Reload Command - RIF: GW DOWN
Apr 2 20:59:10.070: %RIF_MGR_FSM-6-RMI_ACTIVE_RECOVERY_MODE: Chassis 1 R0/0: rif_mgr: Going to Active(Recovery) from Active state
Apr 2 20:59:11.055: %RIF_MGR_FSM-6-RMI_GW_DECISION_DEFERRED: Chassis 1 R0/0: rif_mgr: High CPU utilisation on active or Standby, deferring action on gateway-down event
Apr 2 20:59:12.079: %LINK-5-CHANGED: Interface GigabitEthernet1, changed state to administratively down
Apr 2 20:59:13.080: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet1, changed state to down
Apr 2 20:59:13.603: %RIF_MGR_FSM-6-RMI_LINK_DOWN: Chassis 1 R0/0: rif_mgr: The RMI link is DOWN.
Apr 2 20:59:28.824: %REDUNDANCY-3-STANDBY_LOST: Standby processor fault (KEEPALIVE_FAILURE)
Apr 2 20:59:29.735: %RF-5-RF_RELOAD: Peer reload. Reason: EHSA Standby down
The chassis 1 prompt also changed to Recovery mode:
9800-1(recovery-mode)#
Meanwhile, the RP link stayed up and the Keepalive messages kept being exchanged every 100 ms. When chassis 1 went into Active-Recovery at 20:59:10, the Keepalive messages stopped and chassis 1 answered the last Keepalive with an ICMP Port Unreachable on the RP link. The Hot Standby chassis then detected the failure of the RP link and triggered a Switchover.
The Standby chassis 2 declared chassis 1 lost at 20:59:10 and started the Switchover. The redundancy history on chassis 2 shows the Switchover event:
9800-2#show redundancy history
Apr 2 20:56:46.918 *my state = STANDBY HOT(8) peer state = ACTIVE(13)
Apr 2 20:59:10.523 Reloading peer (communication down)
Apr 2 20:59:10.525 Reloading peer (peer presence lost)
Apr 2 20:59:10.525 *my state = ACTIVE-FAST(9) peer state = DISABLED(1)
Apr 2 20:59:10.988 *my state = ACTIVE-DRAIN(10) peer state = DISABLED(1)
Apr 2 20:59:11.000 *my state = ACTIVE_PRECONFIG(11) peer state = DISABLED(1)
Apr 2 20:59:11.006 *my state = ACTIVE_POSTCONFIG(12) peer state = DISABLED(1)
Apr 2 20:59:11.008 *my state = ACTIVE(13) peer state = DISABLED(1)
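The state transitions above also show how long the takeover itself lasted. The following Python sketch parses the show redundancy history lines and measures the time between ACTIVE-FAST and ACTIVE. The history lines carry no year, so the year parameter is an assumption; the function names are hypothetical.

```python
import re
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

STATE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"\*?my state = (?P<mine>.+?)\(\d+\)\s+"
    r"\*?peer state = (?P<peer>.+?)\(\d+\)\s*$"
)


def parse_states(text: str, year: int = 2023) -> List[Tuple[datetime, str, str]]:
    """Return (timestamp, my state, peer state) for each state line."""
    transitions = []
    for line in text.splitlines():
        match = STATE.match(line.strip())
        if not match:
            continue  # e.g. "Reloading peer (...)" lines
        stamp = " ".join(match["stamp"].split())
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S.%f")
        transitions.append((when, match["mine"].strip(), match["peer"].strip()))
    return transitions


def takeover_duration(transitions) -> Optional[timedelta]:
    """Time between the first ACTIVE-FAST state and the following ACTIVE state."""
    start = None
    for when, mine, _peer in transitions:
        if mine == "ACTIVE-FAST" and start is None:
            start = when
        elif mine == "ACTIVE" and start is not None:
            return when - start
    return None


STATE_SAMPLE = """\
Apr 2 20:56:46.918 *my state = STANDBY HOT(8) peer state = ACTIVE(13)
Apr 2 20:59:10.523 Reloading peer (communication down)
Apr 2 20:59:10.525 *my state = ACTIVE-FAST(9) peer state = DISABLED(1)
Apr 2 20:59:10.988 *my state = ACTIVE-DRAIN(10) peer state = DISABLED(1)
Apr 2 20:59:11.000 *my state = ACTIVE_PRECONFIG(11) peer state = DISABLED(1)
Apr 2 20:59:11.006 *my state = ACTIVE_POSTCONFIG(12) peer state = DISABLED(1)
Apr 2 20:59:11.008 *my state = ACTIVE(13) peer state = DISABLED(1)
"""

print(takeover_duration(parse_states(STATE_SAMPLE)))
```

In this lab, the transition from ACTIVE-FAST to ACTIVE took 483 ms.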
The chassis 2 logs show that chassis 1 was removed from the stack and that chassis 2 immediately took the Active role.
9800-2#show logging
Apr 2 20:59:10.148: %PLATFORM-6-HASTATUS: RP switchover, received chassis event to become active
Apr 2 20:59:10.149: %REDUNDANCY-3-SWITCHOVER: RP switchover (PEER_REDUNDANCY_STATE_CHANGE)
Apr 2 20:59:10.223: %PLATFORM-6-HASTATUS: RP switchover, sent message became active. IOS is ready to switch to primary after chassis confirmation
Apr 2 20:59:10.238: %PLATFORM-6-HASTATUS: RP switchover, received chassis event became active
Apr 2 20:59:10.060: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 2 R0/0: stack_mgr: Stack port 2 on Chassis 2 is down
Apr 2 20:59:10.060: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 2 R0/0: stack_mgr: Stack port 1 on Chassis 2 is down
Apr 2 20:59:10.060: %STACKMGR-6-CHASSIS_REMOVED: Chassis 2 R0/0: stack_mgr: Chassis 1 has been removed from the stack.
Apr 2 20:59:10.473: %PLATFORM-6-HASTATUS_DETAIL: RP switchover, received chassis event became active. Switch to primary (count 1)
Apr 2 20:59:10.474: %HA-6-SWITCHOVER: Route Processor switched from Standby to being active
Apr 2 20:59:10.526: pm_port_em_recovery
Apr 2 20:59:10.574: WLC-HA-Notice: RF Progression event: RF_PROG_ACTIVE_FAST, Switchover triggered
Apr 2 20:59:10.994: RMI-HAINFRA-INFO: Configured primary IP 192.168.203.31/255.255.255.0 on active(mgmt)
Apr 2 20:59:10.994: RMI-HAINFRA-INFO: Configured secondary IP 192.168.203.34/255.255.255.0 on active(mgmt)
Apr 2 20:59:11.010: %VOICE_HA-2-SWITCHOVER_IND: SWITCHOVER, from STANDBY_HOT to ACTIVE state.
Apr 2 20:59:11.013: WLC-HA-Notice: Sending garp intf = GigabitEthernet1, addr=192.168.201.151
Apr 2 20:59:11.029: %PKI-6-CS_ENABLED: Certificate server now enabled.
Apr 2 20:59:11.029: WLC-HA-Notice: Sending garp intf = LIIN0, addr=192.168.1.6
Apr 2 20:59:11.043: WLC-HA-Notice: Sending garp intf = Vlan1, addr=192.168.203.31
Apr 2 20:59:11.588: %CALL_HOME-6-CALL_HOME_ENABLED: Call-home is enabled by Smart Agent for Licensing.
Apr 2 20:59:13.009: %LINK-3-UPDOWN: Interface Null0, changed state to up
Apr 2 20:59:13.010: %LINK-3-UPDOWN: Interface GigabitEthernet1, changed state to up
Apr 2 20:59:13.011: %LINK-3-UPDOWN: Interface GigabitEthernet2, changed state to up
Apr 2 20:59:13.013: %LINK-3-UPDOWN: Interface Vlan1, changed state to up
Apr 2 20:59:13.502: %RIF_MGR_FSM-6-RMI_LINK_DOWN: Chassis 2 R0/0: rif_mgr: The RMI link is DOWN.
Apr 2 20:59:14.009: %LINEPROTO-5-UPDOWN: Line protocol on Interface Null0, changed state to up
Apr 2 20:59:14.010: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet1, changed state to up
Apr 2 20:59:14.012: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet2, changed state to up
Apr 2 20:59:14.019: %LINEPROTO-5-UPDOWN: Line protocol on Interface Vlan1, changed state to up
Apr 2 20:59:16.005: %RIF_MGR_FSM-6-GW_REACHABLE_ACTIVE: Chassis 2 R0/0: rif_mgr: Gateway reachable from Active
Apr 2 21:00:00.968: %RIF_MGR_FSM-6-RP_LINK_DOWN: Chassis 2 R0/0: rif_mgr: Setting RP link status to DOWN
Apr 2 21:00:00.968: %STACKMGR-1-DUAL_ACTIVE_CFG_MSG: Chassis 2 R0/0: stack_mgr: Dual Active Detection links are not available anymore
Once chassis 1 had finished rebooting, it took the Standby role:
9800-1#show redundancy history
Apr 2 20:56:47.419 my state = ACTIVE(13) *peer state = STANDBY HOT(8)
00:00:05 *my state = INITIALIZATION(2) peer state = DISABLED(1)
00:00:05 *my state = NEGOTIATION(3) peer state = DISABLED(1)
00:00:05 *my state = STANDBY COLD(4) peer state = DISABLED(1)
00:00:05 my state = STANDBY COLD(4) *peer state = ACTIVE(13)
00:08:03 *my state = STANDBY_ISSU_NEGOTIATION_LATE(35) peer state = ACTIVE(13)
Apr 2 21:02:17.154 *my state = STANDBY COLD-CONFIG(5) peer state = ACTIVE(13)
Apr 2 21:03:00.880 *my state = STANDBY COLD-FILESYS(6) peer state = ACTIVE(13)
Apr 2 21:03:02.666 *my state = STANDBY COLD-BULK(7) peer state = ACTIVE(13)
Apr 2 21:03:18.023 *my state = STANDBY HOT(8) peer state = ACTIVE(13)
9800-2#show redundancy history
Apr 2 21:01:56.348 my state = ACTIVE(13) *peer state = UNKNOWN(0)
Apr 2 21:01:56.465 my state = ACTIVE(13) *peer state = STANDBY COLD(4)
Apr 2 21:02:15.148 my state = ACTIVE(13) *peer state = STANDBY_ISSU_NEGOTIATION_LATE(35)
Apr 2 21:02:17.154 my state = ACTIVE(13) *peer state = STANDBY COLD-CONFIG(5)
Apr 2 21:03:00.444 my state = ACTIVE(13) *peer state = STANDBY COLD-FILESYS(6)
Apr 2 21:03:02.647 my state = ACTIVE(13) *peer state = STANDBY COLD-BULK(7)
Apr 2 21:03:18.704 my state = ACTIVE(13) *peer state = STANDBY HOT(8)
However, the chassis 1 went into Standby-Recovery mode because the default gateway was not reachable yet.
9800-1(recovery-mode)-stby#show logging
Apr 2 21:02:59.905: %SYS-5-RESTART: System restarted --
Apr 2 21:03:19.086: %PLATFORM-6-RF_PROG_SUCCESS: RF state STANDBY HOT
Apr 2 21:03:33.827: RMI-GW-ERR: IPV4 GW/L2: Cannot get ARP entry for dest ip : 192.168.203.2
Apr 2 21:03:34.829: RMI-GW-ERR: IPV4 GW/L2: Cannot get ARP entry for dest ip : 192.168.203.2
Apr 2 21:03:36.510: RMI-GW-ERR: IPV4 GW/L2: Cannot get ARP entry for dest ip : 192.168.203.2
Apr 2 21:03:37.510: RMI-GW-ERR: IPV4 GW/L2: Cannot get ARP entry for dest ip : 192.168.203.2
Apr 2 21:03:38.512: %RIF_MGR_FSM-6-GW_UNREACHABLE_STANDBY: Chassis 1 R0/0: rif_mgr: Gateway not reachable from Standby
Apr 2 21:03:38.512: %RIF_MGR_FSM-6-RMI_STBY_TO_STDBY_REC: Chassis 1 R0/0: rif_mgr: Going from Standby to Standby(Recovery) state
Apr 2 21:03:42.510: RMI-GW-ERR: IPV4 GW/L2: Cannot get ARP entry for dest ip : 192.168.203.2
Apr 2 21:03:43.510: RMI-GW-ERR: IPV4 GW/L2: Cannot get ARP entry for dest ip : 192.168.203.2
Apr 2 21:03:44.510: RMI-GW-ERR: IPV4 GW/L2: Cannot get ARP entry for dest ip : 192.168.203.2
Apr 2 21:03:45.510: RMI-GW-ERR: IPV4 GW/L2: Cannot get ARP entry for dest ip : 192.168.203.2
Apr 2 21:03:50.510: RMI-GW-ERR: IPV4 GW/L2: Cannot get ARP entry for dest ip : 192.168.203.2
Apr 2 21:03:51.509: RMI-GW-ERR: IPV4 GW/L2: Cannot get ARP entry for dest ip : 192.168.203.2
…
The pair of chassis still synchronised and established SSO redundancy.
9800-2#show logging
Apr 2 21:00:49.786: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 2 R0/0: stack_mgr: Stack port 2 on Chassis 2 is up
Apr 2 21:00:49.787: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 2 R0/0: stack_mgr: Stack port 1 on Chassis 2 is up
Apr 2 21:00:49.826: %STACKMGR-6-CHASSIS_ADDED: Chassis 2 R0/0: stack_mgr: Chassis 1 has been added to the stack.
Apr 2 21:00:52.134: %STACKMGR-6-CHASSIS_ADDED: Chassis 2 R0/0: stack_mgr: Chassis 1 has been added to the stack.
Apr 2 21:00:55.297: %RIF_MGR_FSM-6-RP_LINK_UP: Chassis 2 R0/0: rif_mgr: The RP link is UP.
Apr 2 21:00:55.297: %STACKMGR-1-DUAL_ACTIVE_CFG_MSG: Chassis 2 R0/0: stack_mgr: Dual Active Detection link is available now
Apr 2 21:01:09.910: %EWLC_HA_LIB_MESSAGE-6-BULK_SYNC_STATE_INFO: Chassis 2 R0/0: wncmgrd: INFO: Bulk sync status : COLD
Apr 2 21:01:15.280: %RIF_MGR_FSM-6-RP_LINK_UP: Chassis 2 R0/0: rif_mgr: The RP link is UP.
Apr 2 21:01:26.200: %IOSXE_REDUNDANCY-6-PEER: Active detected chassis 1 as Standby.
Apr 2 21:01:26.191: %STACKMGR-6-STANDBY_ELECTED: Chassis 2 R0/0: stack_mgr: Chassis 1 has been elected STANDBY.
Apr 2 21:01:51.282: %REDUNDANCY-5-PEER_MONITOR_EVENT: Active detected a Standby insertion (raw-event=PEER_FOUND(4))
Apr 2 21:01:51.282: %REDUNDANCY-5-PEER_MONITOR_EVENT: Active detected a Standby insertion (raw-event=PEER_REDUNDANCY_STATE_CHANGE(5))
Apr 2 21:02:00.277: Syncing vlan database
Apr 2 21:02:00.328: Vlan Database sync done from bootflash:vlan.dat to stby-bootflash:vlan.dat (556 bytes)
Apr 2 21:03:18.730: %HA_CONFIG_SYNC-6-BULK_CFGSYNC_SUCCEED: Bulk Sync succeeded
Apr 2 21:03:18.762: %VOICE_HA-7-STATUS: VOICE HA bulk sync done.
Apr 2 21:03:19.769: %RF-5-RF_TERMINAL_STATE: Terminal state reached for (SSO)
Apr 2 21:03:38.513: %RIF_MGR_FSM-6-GW_UNREACHABLE_STANDBY: Chassis 2 R0/0: rif_mgr: Gateway not reachable from Standby
Apr 2 21:03:38.513: %RIF_MGR_FSM-6-RMI_STBY_TO_STDBY_REC: Chassis 2 R0/0: rif_mgr: Going from Standby to Standby(Recovery) state
The chassis 1 prompt changed to Standby-Recovery mode:
9800-1(recovery-mode)-stby#show redundancy states
my state = 8 -STANDBY HOT
peer state = 13 -ACTIVE
Mode = Duplex
Unit = Primary
Unit ID = 1
Redundancy Mode (Operational) = sso
Redundancy Mode (Configured) = sso
Redundancy State = sso
Maintenance Mode = Disabled
Manual Swact = cannot be initiated from this the Standby unit
Communications = Up
client count = 127
client_notification_TMR = 30000 milliseconds
RF debug mask = 0x0
Gateway Monitoring = Enabled
Gateway monitoring interval = 8 secs
The cause of the Switchover is described as “Active lost GW”:
9800-2#show redundancy switchover history
Index Previous Current Switchover Switchover
active active reason time
----- -------- ------- ---------- ----------
1 1 2 Active lost GW 20:59:11 Austral Sun Apr 2 2023
At 21:06:03, the uplink on the switch was reconnected. Once the default gateway became reachable, chassis 1 left Standby-Recovery mode and immediately returned to Hot Standby without a reboot, and the RMI link came up again.
9800-1-stby#show logging
Apr 2 21:05:59.510: RMI-GW-ERR: IPV4 GW/L2: Cannot get ARP entry for dest ip : 192.168.203.2
Apr 2 21:06:00.510: RMI-GW-ERR: IPV4 GW/L2: Cannot get ARP entry for dest ip : 192.168.203.2
Apr 2 21:06:01.510: RMI-GW-ERR: IPV4 GW/L2: Cannot get ARP entry for dest ip : 192.168.203.2
Apr 2 21:06:03.510: %RIF_MGR_FSM-6-GW_REACHABLE_STANDBY: Chassis 1 R0/0: rif_mgr: Gateway reachable from Standby
Apr 2 21:06:03.510: %RIF_MGR_FSM-6-RMI_STBY_REC_TO_STBY: Chassis 1 R0/0: rif_mgr: Going from Standby(Recovery) to Standby state
Apr 2 21:06:05.334: %RIF_MGR_FSM-6-RMI_LINK_UP: Chassis 1 R0/0: rif_mgr: The RMI link is UP.
9800-2#show logging
Apr 2 21:06:02.068: %RIF_MGR_FSM-6-RMI_LINK_UP: Chassis 2 R0/0: rif_mgr: The RMI link is UP.
Apr 2 21:06:03.515: %RIF_MGR_FSM-6-GW_REACHABLE_STANDBY: Chassis 2 R0/0: rif_mgr: Gateway reachable from Standby
Failover due to network failure (management port goes down): Switchover, RMI port down.
This section describes the Switchover when the management/RMI interface (GigabitEthernet2 port) was disconnected on the Active chassis 1. This scenario is similar to the previous one: the default gateway became unreachable from chassis 1 and chassis 2 moved to the Active role. The Switchover occurred about 9 seconds after the port was disconnected.
In this test, the management port was disconnected on chassis 1 at 21:14:19. Chassis 1 detected the loss of the gateway at 21:14:27, 8 seconds after the port went down. It then went into Active-Recovery mode, shut down its interfaces and rebooted at 21:14:47.
Meanwhile, chassis 2 detected the loss of Keepalives through the RP port. Chassis 1 answered the last Keepalive with an ICMP Port Unreachable at 21:14:27.
When chassis 2 detected the loss of the Keepalives over the RP port, it triggered a Switchover and became the Active chassis at 21:14:28.
Interestingly, there was still some communication between the chassis through the RP port after the loss of the UDP Keepalive messages.
The pair of chassis had also formed GRE tunnels through the RP port to exchange information about the stack. Within the tunnel, chassis 1 sent chassis 2 information including events such as the chassis reload and reason codes.
When the Active chassis failed, the two chassis determined the reason for the Switchover and checked the resources (RP link, RMI interface, default gateway). The gateway reachability information of chassis 1 was exchanged with chassis 2 over the RP link.
In addition, the Switchover history showed a different reason than in the previous scenario: Active RMI port down.
9800-2#show redundancy switchover history
Index Previous Current Switchover Switchover
active active reason time
----- -------- ------- ---------- ----------
…
5 1 2 Active RMI port down 21:14:28 Austral Mon Mar 13 2023
When the chassis 1 was reconnected, it joined the stack with the Standby role.
Failover due to network failure (RP interface fails): no Switchover, RMI link, RP Standby-Recovery
In this case, the RP port was disconnected on the Active chassis 1 and there was no Switchover. Chassis 1 remained in the Active role and the Standby went into Standby-Recovery mode. When the RP port was reconnected, the Standby chassis rebooted for bulk synchronisation and returned to the Standby role.
In detail, the RP port was disconnected at 21:29:00 on the Active chassis 1 and chassis 2 detected the loss of Keepalive messages within 500 ms. Chassis 2 immediately went into Standby-Recovery mode with the reason RP DOWN.
9800-2(rp-rec-mode)#show logging
Mar 13 21:29:00.310: %STACKMGR-6-KA_MISSED: Chassis 2 R0/0: stack_mgr: Keepalive missed for 2 times for Chassis 1
Mar 13 21:29:00.614: %STACKMGR-6-CHASSIS_REMOVED: Chassis 2 R0/0: stack_mgr: Chassis 1 has been removed from the stack.
Mar 13 21:29:00.616: %STACKMGR-6-CHASSIS_REMOVED_KA: Chassis 2 R0/0: stack_mgr: Chassis 1 has been removed from the stack due to keepalive failure.
Mar 13 21:29:00.623: %RIF_MGR_FSM-6-RMI_STBY_TO_STDBY_REC_REASON: Chassis 2 R0/0: rif_mgr: Going from Standby to Standby(Recovery) state, Reason: RP DOWN
Mar 13 21:29:28.152: %RIF_MGR_FSM-6-RP_LINK_DOWN: Chassis 2 R0/0: rif_mgr: Setting RP link status to DOWN
The RP link carries the Keepalive messages between the Active and the Standby chassis. However, the RP mode of SSO makes the RP link a single point of failure, since it cannot distinguish between a controller failure and a link failure.
The RMI mode of SSO uses a secondary IP address on the management interface to monitor the chassis via a second link between the Active and the Standby chassis. The RMI link uses TCP port 3200 to exchange resource health information, including gateway reachability, from each chassis. The Hot Standby chassis sent TCP Keepalive messages from its RMI interface (gi2) to the RMI interface of the Active chassis every 15 seconds, or when required.
In our example, when chassis 2 detected the loss of Keepalives on the RP port, it immediately sent a message on TCP port 3200 to chassis 1 over the RMI link at 21:29:00. Chassis 1 replied, confirming that the RMI link was still up.
The RMI link prevented the Switchover and allowed the Standby to move into Standby-Recovery.
The chassis 2 state changed to Standby-Recovery mode due to RP DOWN:
9800-2(rp-rec-mode)-stby#show redundancy states
my state = 101-STANDBY RECOVERY(RP DOWN)
peer state = 13 -ACTIVE
Mode = Simplex
Unit = Primary
Unit ID = 2
Redundancy Mode (Operational) = sso
Redundancy Mode (Configured) = sso
Redundancy State = Non Redundant
Maintenance Mode = Disabled
Manual Swact = cannot be initiated from this the Standby unit
Communications = Down Reason: Simplex mode
client count = 127
client_notification_TMR = 30000 milliseconds
RF debug mask = 0x0
Gateway Monitoring = Enabled
Gateway monitoring interval = 8 secs
The chassis 1 remained Active and removed the chassis 2 from the stack:
Mar 13 21:29:00.392: %STACKMGR-6-KA_MISSED: Chassis 1 R0/0: stack_mgr: Keepalive missed for 2 times for Chassis 2
Mar 13 21:29:01.000: %IOSXE_REDUNDANCY-6-PEER_LOST: Active detected chassis 2 is no longer Standby
Mar 13 21:29:00.619: %RIF_MGR_FSM-6-RMI_STBY_TO_STDBY_REC_REASON: Chassis 1 R0/0: rif_mgr: Going from Standby to Standby(Recovery) state, Reason: RP DOWN
Mar 13 21:29:00.710: %STACKMGR-6-CHASSIS_REMOVED: Chassis 1 R0/0: stack_mgr: Chassis 2 has been removed from the stack.
Mar 13 21:29:00.718: %STACKMGR-6-CHASSIS_REMOVED_KA: Chassis 1 R0/0: stack_mgr: Chassis 2 has been removed from the stack due to keepalive failure.
Mar 13 21:29:01.007: %REDUNDANCY-3-STANDBY_LOST: Standby processor fault (PEER_NOT_PRESENT)
Mar 13 21:29:01.007: %REDUNDANCY-3-STANDBY_LOST: Standby processor fault (PEER_DOWN)
Mar 13 21:29:01.007: %REDUNDANCY-3-STANDBY_LOST: Standby processor fault (PEER_REDUNDANCY_STATE_CHANGE)
Mar 13 21:29:02.176: %RF-5-RF_RELOAD: Peer reload. Reason: EHSA Standby down
Mar 13 21:29:24.301: %RIF_MGR_FSM-6-RP_LINK_DOWN: Chassis 1 R0/0: rif_mgr: Setting RP link status to DOWN
Chassis 1 went into Simplex mode (no redundancy):
9800-1#show redundancy states
my state = 13 -ACTIVE
peer state = 1 -DISABLED
Mode = Simplex
Unit = Primary
Unit ID = 1
Redundancy Mode (Operational) = Non-redundant
Redundancy Mode (Configured) = sso
Redundancy State = Non Redundant
Maintenance Mode = Disabled
Manual Swact = disabled (system is simplex (no peer unit))
Communications = Down Reason: Simplex mode
client count = 127
client_notification_TMR = 30000 milliseconds
RF debug mask = 0x0
Gateway Monitoring = Enabled
Gateway monitoring interval = 8 secs
When the RP port was reconnected at 21:31:00, chassis 2 detected that the RP link was up again with Keepalive messages. Chassis 2 then rebooted with the reason RIF: Bulk Sync to synchronise the configuration and became Hot Standby at 21:34:17.
9800-2#show logging
Mar 13 21:31:04.303: %RIF_MGR_FSM-6-RP_LINK_UP: Chassis 2 R0/0: rif_mgr: The RP link is UP.
Mar 13 21:31:04.305: %STACKMGR-1-RELOAD: Chassis 2 R0/0: stack_mgr: Reloading due to reason Reload Command - RIF: Bulk Sync
*Mar 13 11:32:19.714: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 2 R0/0: stack_mgr: Stack port 1 on Chassis 2 is down
*Mar 13 11:32:19.714: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 2 R0/0: stack_mgr: Stack port 2 on Chassis 2 is down
*Mar 13 11:32:19.714: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 2 R0/0: stack_mgr: Stack port 1 on Chassis 2 is up
*Mar 13 11:32:19.714: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 2 R0/0: stack_mgr: Stack port 2 on Chassis 2 is up
*Mar 13 11:32:19.735: %STACKMGR-6-CHASSIS_ADDED: Chassis 2 R0/0: stack_mgr: Chassis 2 has been added to the stack.
*Mar 13 11:32:20.206: %RIF_MGR_FSM-6-RP_LINK_UP: Chassis 2 R0/0: rif_mgr: The RP link is UP.
*Mar 13 11:32:21.572: %STACKMGR-6-CHASSIS_ADDED: Chassis 2 R0/0: stack_mgr: Chassis 2 has been added to the stack.
*Mar 13 11:32:25.249: %RIF_MGR_FSM-6-RP_LINK_UP: Chassis 2 R0/0: rif_mgr: The RP link is UP.
*Mar 13 11:32:25.249: %STACKMGR-1-DUAL_ACTIVE_CFG_MSG: Chassis 2 R0/0: stack_mgr: Dual Active Detection link is available now
*Mar 13 11:32:35.336: %EWLC_HA_LIB_MESSAGE-6-BULK_SYNC_STATE_INFO: Chassis 2 R0/0: wncmgrd: INFO: Bulk sync status : COLD
Mar 13 11:33:18.612: RMI-HAINFRA-INFO: Invoked cstate change on Standby config
Mar 13 11:33:18.613: RMI-HAINFRA-INFO: Invoking FIB to create RMI IPv4 192.168.203.34 entry
Mar 13 21:33:22.865: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet1, changed state to down
Mar 13 21:33:22.887: RMI-HAINFRA-INFO: Invoked cstate change on Standby config
Mar 13 21:33:22.887: RMI-HAINFRA-INFO: Adding Primary Mgmt IP 192.168.203.34/255.255.255.0
Mar 13 21:33:22.897: RMI-HAINFRA-INFO: Invoked Standby Config Message
Mar 13 21:33:22.897: RMI-HAINFRA-INFO: Adding Primary Mgmt IP 192.168.203.34/255.255.255.0
Mar 13 21:33:23.881: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet2, changed state to down
Mar 13 21:33:59.199: %IOSXE_OIR-6-INSCARD: Card (fp) inserted in slot F0
Mar 13 21:33:59.199: %IOSXE_OIR-6-ONLINECARD: Card (fp) online in slot F0
Mar 13 21:33:59.922: %EWLC_HA_LIB_MESSAGE-6-BULK_SYNC_STATE_INFO: Chassis 2 R0/0: wncmgrd: INFO: Bulk sync status : CONFIG_DONE
Mar 13 21:34:02.057: %SYS-5-RESTART: System restarted --
Mar 13 21:34:02.066: RMI-HAINFRA-INFO: Invoked cstate change on Standby config
Mar 13 21:34:02.066: RMI-HAINFRA-INFO: Adding Primary Mgmt IP 192.168.203.34/255.255.255.0
Mar 13 21:34:02.067: RMI-HAINFRA-INFO: Invoking FIB to create RMI IPv4 192.168.203.34 entry
Mar 13 21:34:17.829: %PLATFORM-6-RF_PROG_SUCCESS: RF state STANDBY HOT
Mar 13 21:34:31.250: %RIF_MGR_FSM-6-RMI_LINK_UP: Chassis 2 R0/0: rif_mgr: The RMI link is UP.
Note that if the RP link were down and the RMI interfaces could not reach each other, it would be a double fault, which could result in two Active controllers until connectivity is restored.
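The role of the two links in this scenario can be summarised as a small decision function, seen from the Standby chassis. This Python sketch only encodes the behaviour observed or described in this article; it is not Cisco's algorithm, it ignores gateway reachability, and the enum and function names are hypothetical. The combination "RP up, RMI down" is deliberately left as not covered.

```python
from enum import Enum


class StandbyAction(Enum):
    REMAIN_HOT_STANDBY = "remain Hot Standby"
    STANDBY_RECOVERY = "enter Standby-Recovery (RP DOWN), no Switchover"
    TAKE_ACTIVE_ROLE = "take the Active role (dual Active if the peer is still alive)"
    NOT_COVERED = "not covered by the tests in this article"


def standby_action(rp_link_up: bool, rmi_link_up: bool) -> StandbyAction:
    """What the Standby does depending on the state of the two HA links."""
    if rp_link_up and rmi_link_up:
        return StandbyAction.REMAIN_HOT_STANDBY
    if not rp_link_up and rmi_link_up:
        # The RMI link proves that the Active chassis is still alive.
        return StandbyAction.STANDBY_RECOVERY
    if not rp_link_up and not rmi_link_up:
        # Double fault or real failure of the Active: the Standby cannot tell.
        return StandbyAction.TAKE_ACTIVE_ROLE
    return StandbyAction.NOT_COVERED


for rp in (True, False):
    for rmi in (True, False):
        print(f"RP up={rp}, RMI up={rmi}: {standby_action(rp, rmi).value}")
```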
Failover due to network failure (default gateway not reachable from Active and from Standby): no Switchover
In this last scenario, the default gateway was not reachable from either the Active or the Standby chassis, while both the RP link and the RMI link remained UP because both 9800-CL were connected to the same switch. There was no Switchover in this case. The Standby chassis went into Standby-Recovery mode after detecting the loss of the default gateway.
When the default gateway was back, the Standby chassis returned to Hot Standby without a reboot.
First, the default gateway was disconnected at 21:40:01. After 8 seconds, both chassis detected the loss of the default gateway using 4 ARP Requests and 4 ICMP Requests.
Interestingly, the detection protocol used before the failure was ICMP in this case (compared to ARP in previous scenarios), showing that the detection protocol alternates in cycles.
The chassis 1 remained Active and the chassis 2 moved into Standby-Recovery mode.
9800-1#show logging
Mar 13 21:40:09.191: %RIF_MGR_FSM-6-GW_UNREACHABLE_STANDBY: Chassis 1 R0/0: rif_mgr: Gateway not reachable from Standby
Mar 13 21:40:09.191: %RIF_MGR_FSM-6-RMI_STBY_TO_STDBY_REC: Chassis 1 R0/0: rif_mgr: Going from Standby to Standby(Recovery) state
Mar 13 21:40:09.605: %RIF_MGR_FSM-6-GW_UNREACHABLE_ACTIVE: Chassis 1 R0/0: rif_mgr: Gateway not reachable from Active
9800-2(recovery-mode)#show logging
Mar 13 21:40:09.157: %RIF_MGR_FSM-6-GW_UNREACHABLE_STANDBY: Chassis 2 R0/0: rif_mgr: Gateway not reachable from Standby
Mar 13 21:40:09.157: %RIF_MGR_FSM-6-RMI_STBY_TO_STDBY_REC: Chassis 2 R0/0: rif_mgr: Going from Standby to Standby(Recovery) state
9800-2(recovery-mode)#show redundancy
Redundant System Information :
------------------------------
Available system uptime = 1 hour, 16 minutes
Switchovers system experienced = 6
Hardware Mode = Duplex
Configured Redundancy Mode = sso
Operating Redundancy Mode = sso
Maintenance Mode = Disabled
Communications = Up
Current Processor Information :
-------------------------------
Standby Location = slot 2
Current Software state = STANDBY HOT
Uptime in current state = 5 minutes
Image Version = Cisco IOS Software [Bengaluru], C9800-CL Software (C9800-CL-K9_IOSXE), Version 17.6.4, RELEASE SOFTWARE (fc1)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2022 by Cisco Systems, Inc.
Compiled Sun 14-Aug-22 08:54 by mcpre
BOOT =
CONFIG_FILE =
Configuration register = 0x102
Recovery mode = Standby recovery mode
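In the output above, the software state still reads STANDBY HOT while the last line reports the recovery mode, so a script that checks the health of the Standby may want to look at both fields. The following Python sketch parses the "key = value" lines of a saved "show redundancy" output. The helper names are hypothetical, and the assumption that a healthy Standby shows no "Recovery mode" value containing the word "recovery" is not confirmed here.

```python
def parse_key_values(output):
    """Collect 'key = value' lines into a dict (later duplicates overwrite earlier ones)."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def standby_is_ready(output):
    """True when the state is STANDBY HOT and no recovery mode is reported.

    Assumption: outside of recovery, the 'Recovery mode' line is either
    absent or does not contain the word 'recovery'.
    """
    fields = parse_key_values(output)
    in_recovery = "recovery" in fields.get("Recovery mode", "").lower()
    return fields.get("Current Software state") == "STANDBY HOT" and not in_recovery
```

With the output shown above, `standby_is_ready()` returns False because of the "Recovery mode" line, even though the software state is STANDBY HOT.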
When the default gateway was restored at 21:42:00, chassis 2 left Standby-Recovery mode and immediately returned to Standby without any bulk configuration sync or reboot.
9800-1#show logging
Mar 13 21:42:03.193: %RIF_MGR_FSM-6-GW_REACHABLE_STANDBY: Chassis 1 R0/0: rif_mgr: Gateway reachable from Standby
Mar 13 21:42:03.604: %RIF_MGR_FSM-6-GW_REACHABLE_ACTIVE: Chassis 1 R0/0: rif_mgr: Gateway reachable from Active
9800-2#show logging
Mar 13 21:42:03.187: %RIF_MGR_FSM-6-GW_REACHABLE_STANDBY: Chassis 2 R0/0: rif_mgr: Gateway reachable from Standby
Mar 13 21:42:03.187: %RIF_MGR_FSM-6-RMI_STBY_REC_TO_STBY: Chassis 2 R0/0: rif_mgr: Going from Standby(Recovery) to Standby state
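To build a timeline from saved "show logging" output, the syslog lines can be parsed offline. The Python sketch below extracts the timestamp, facility, severity and mnemonic from each line and measures the time between two events. The function names are hypothetical, and because the log lines carry no year, the `year` argument is an assumed value that only matters when comparing across a year boundary.

```python
import re
from datetime import datetime

# Matches lines such as:
# Mar 13 21:40:09.605: %RIF_MGR_FSM-6-GW_UNREACHABLE_ACTIVE: Chassis 1 R0/0: ...
LOG_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\.\d{3}): "
    r"%(?P<facility>[A-Z0-9_]+)-(?P<severity>\d)-(?P<mnemonic>[A-Z0-9_]+): "
    r"(?P<text>.*)$"
)


def parse_line(line, year=2023):
    """Return a dict for a %FACILITY-SEVERITY-MNEMONIC line, or None otherwise."""
    match = LOG_RE.match(line.strip())
    if match is None:
        return None  # CLI prompts and non-syslog lines are skipped
    stamp = datetime.strptime(f"{year} {match['ts']}", "%Y %b %d %H:%M:%S.%f")
    return {
        "time": stamp,
        "facility": match["facility"],
        "severity": int(match["severity"]),
        "mnemonic": match["mnemonic"],
        "text": match["text"],
    }


def seconds_between(lines, start_mnemonic, end_mnemonic):
    """Seconds from the first start event to the next end event, or None."""
    start = None
    for entry in filter(None, map(parse_line, lines)):
        if start is None and entry["mnemonic"] == start_mnemonic:
            start = entry["time"]
        elif start is not None and entry["mnemonic"] == end_mnemonic:
            return (entry["time"] - start).total_seconds()
    return None
```

Applied to the chassis 1 log lines of this scenario with `GW_UNREACHABLE_ACTIVE` and `GW_REACHABLE_ACTIVE`, the gateway outage as seen by rif_mgr lasts about 114 seconds.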
Remove the HA SSO configuration
Using GUI:
To clear an existing HA SSO configuration, go to Administration | Device | Redundancy and set the Redundancy Configuration to Disabled.
Using CLI:
9800-1#clear chassis redundancy
WARNING: Clearing the chassis HA configuration will result in both the chassis move into Stand Alone mode. This involves reloading the Standby chassis after clearing its HA configuration and coming up with day-0 configuration. Do you wish to continue? [y/n]? [yes]: yes
The Active controller goes into Standalone mode and keeps its management IP address. The Standby controller reboots into Day 0 setup to avoid any duplicate IP address error.
Additional troubleshooting commands
Since release 17.5.1, some commands are available to verify the status of the RP interface and to test the connectivity to the peer:
9800-1#show platform hardware slot R0 ha_port interface stats
HA Port
ha_port Link encap:Ethernet HWaddr 08:00:27:cb:16:f0
inet addr:169.254.203.33 Bcast:169.254.203.255 Mask:255.255.255.0
inet6 addr: fe80::a00:27ff:fecb:16f0/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:44660 errors:0 dropped:0 overruns:0 frame:0
TX packets:45814 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:16500609 (15.7 MiB) TX bytes:16217650 (15.4 MiB)
…
9800-1#test wireless redundancy rping
Redundancy Port ping
PING 169.254.203.34 (169.254.203.34) 56(84) bytes of data.
64 bytes from 169.254.203.34: icmp_seq=1 ttl=64 time=4.43 ms
64 bytes from 169.254.203.34: icmp_seq=2 ttl=64 time=2.92 ms
64 bytes from 169.254.203.34: icmp_seq=3 ttl=64 time=3.06 ms
--- 169.254.203.34 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 2.924/3.471/4.426/0.677 ms
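When the rping output is collected by a script, the statistics line can be checked automatically. The Python sketch below is a hypothetical helper that reads the Linux-style summary line shown above and reports whether the RP link answered without loss. The optional "+N errors" field is an assumption based on the usual Linux ping format and does not appear in the output above.

```python
import re

# Matches the summary line, for example:
# "3 packets transmitted, 3 received, 0% packet loss, time 2002ms"
STATS_RE = re.compile(
    r"(?P<sent>\d+) packets transmitted, (?P<received>\d+) received, "
    r"(?:\+\d+ errors, )?(?P<loss>\d+(?:\.\d+)?)% packet loss"
)


def rping_summary(output):
    """Extract sent, received and loss percentage from rping output."""
    match = STATS_RE.search(output)
    if match is None:
        raise ValueError("no ping statistics line found")
    return {
        "sent": int(match["sent"]),
        "received": int(match["received"]),
        "loss_percent": float(match["loss"]),
    }


def rp_link_healthy(output, max_loss_percent=0.0):
    """True when at least one reply came back and loss is within the limit."""
    stats = rping_summary(output)
    return stats["received"] > 0 and stats["loss_percent"] <= max_loss_percent
```

The `max_loss_percent` threshold is an arbitrary parameter of this sketch; a stricter or looser value can be chosen depending on how the check is used.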
Another command captures packets on the RP port and saves the capture file to flash.
test wireless redundancy packetdump [start [filter port <0-65535>] | stop]
To troubleshoot HA issues, several additional commands provide more details about the HA process, including:
9800-1#show tech-support ha
9800-1#show tech-support wireless redundancy
9800-1#show romvar
9800-1#show redundancy history
9800-1#show logging process stack_mgr internal
9800-1#show logging process rif_mgr internal
9800-1#show redundancy trace main
Additional Resources
High Availability SSO Deployment Guide for Cisco Catalyst 9800 Series Wireless Controllers, Release Cisco IOS XE Bengaluru 17.6:
Cisco 9800 RMI+RP High Availability Best Practice Configuration, How I WI-FI blog:
https://howiwifi.com/2021/01/17/cisco-9800-rmirp-high-availability-best-practice-configuration
Cisco Catalyst 9800-CL - Redundancy HA SSO (CLI and Deeper Dive), WiFi Ninjas blog 013:
Understanding and Troubleshooting Cisco Catalyst 9800 Series Wireless Controllers, S. Arena, F.S. Crippa, N. Darchis, S. Katgeri, Cisco Press:
https://www.ciscopress.com/store/understanding-and-troubleshooting-cisco-catalyst-9800-9780137492411
Configure High Availability SSO on Catalyst 9800 | Quick Start Guide
Configure Catalyst 9800 Wireless Controllers in High Availability (HA) Client Stateful Switch Over (SSO) in IOS-XE 16.12: