High-Availability SSO with Cisco 9800 Wireless Controllers – PART II

This article follows the first part on SSO redundancy with Cisco 9800 wireless controllers. It covers failover scenarios triggered by power and network failures, along with some extra configuration settings.

Access to the Standby chassis

When the HA pair is established, configuration commands can only be entered on the Active chassis.

Console access to the Standby chassis is not available by default. The console shows the following error message:

9800-CL-stby>?

Standby console disabled

 

However, the Standby console can be enabled using the following command on the Active chassis:

9800-CL(config)#redundancy

9800-CL(config-red)#main-cpu

9800-CL(config-r-mc)#standby console enable

The Standby console then provides a restricted set of commands, mostly “show” commands:

9800-CL-stby>?

Exec commands:

  access-profile    Apply user-profile to interface

  app-hosting       Application hosting

  …

 

Another benefit of RMI mode over RP mode is that each chassis can be reached individually via SSH, HTTPS or NETCONF through its RMI interface. Accessing the Standby chassis from the network helps when troubleshooting HA issues. With RP mode, only the currently Active chassis was reachable via SSH.

The RMI interface also makes it possible to monitor the reachability of each chassis via ICMP:

C:\WINDOWS\system32>ping 192.168.203.33

 

Pinging 192.168.203.33 with 32 bytes of data:

Reply from 192.168.203.33: bytes=32 time<1ms TTL=255

Reply from 192.168.203.33: bytes=32 time<1ms TTL=255

 

C:\WINDOWS\system32>ping 192.168.203.34

 

Pinging 192.168.203.34 with 32 bytes of data:

Request timed out.

Reply from 192.168.203.34: bytes=32 time=1ms TTL=255

Reply from 192.168.203.34: bytes=32 time=1ms TTL=255
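The same reachability check can be scripted. The following is a minimal Python sketch, not part of the original lab: it assumes the Windows ping output format shown above, and the function name summarise_ping is hypothetical.

```python
import re

# Matches reply lines such as
# "Reply from 192.168.203.34: bytes=32 time=1ms TTL=255" (also "time<1ms").
REPLY_RE = re.compile(
    r"^Reply from (\d+\.\d+\.\d+\.\d+): bytes=\d+ time[<=]\d+ms TTL=\d+"
)


def summarise_ping(output: str) -> dict:
    """Count replies and timeouts in Windows-style ping output."""
    replies = 0
    timeouts = 0
    for line in output.splitlines():
        line = line.strip()
        if REPLY_RE.match(line):
            replies += 1
        elif line == "Request timed out.":
            timeouts += 1
    return {"replies": replies, "timeouts": timeouts, "reachable": replies > 0}


# Output captured above for the RMI address of chassis 2.
sample = """\
Pinging 192.168.203.34 with 32 bytes of data:
Request timed out.
Reply from 192.168.203.34: bytes=32 time=1ms TTL=255
Reply from 192.168.203.34: bytes=32 time=1ms TTL=255
"""

print(summarise_ping(sample))
```

Running the check against both RMI addresses gives a quick view of which chassis is reachable, independently of which one is currently Active.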

 

Configure a chassis name

Each chassis is identified by its configured chassis ID (1 or 2).

When connecting to each chassis, the console prompt displays the chassis stack hostname (“9800-CL” in this lab) with a suffix indicating the chassis state: 9800-CL# on the Active and 9800-CL-stby# on the Standby.

The local chassis ID can be identified using the show chassis local command, but it is not directly visible from the prompt.

However, a chassis name can be assigned and displayed on the console prompt using the following command:

9800-CL#redun-management hostname chassis 1 name 9800-1 chassis 2 name 9800-2

9800-1#

 

Manual Switchover

There is no automatic failback with SSO. However, the administrator can run a manual Switchover command to transition the Active chassis to Hot Standby:

9800-1#redundancy force-switchover

Proceed with switchover to Standby RP? [confirm]

 

The command will trigger a reload of the Active chassis (9800-1) causing the Standby chassis (9800-2) to become Active.

9800-2#show redundancy switchover history

Index  Previous  Current  Switchover             Switchover

       active    active   reason                 time

-----  --------  -------  ----------             ----------

   1       1        2     user forced            20:49:22 Austral Mon Mar 13 2023
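When several Switchovers accumulate, the history table is easier to audit once parsed. The sketch below is an illustration only: it assumes the column layout shown above (reason and time separated by at least two spaces), and the names Switchover and parse_switchover_history are hypothetical.

```python
import re
from typing import List, NamedTuple


class Switchover(NamedTuple):
    index: int
    previous_active: int
    current_active: int
    reason: str
    time: str


# index, previous active, current active, reason, time (starts with hh:mm:ss)
ROW_RE = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(.+?)\s{2,}(\d{2}:\d{2}:\d{2}\s.+?)\s*$"
)


def parse_switchover_history(output: str) -> List[Switchover]:
    """Return one record per data row; header and ruler lines are skipped."""
    rows = []
    for line in output.splitlines():
        m = ROW_RE.match(line)
        if m:
            idx, prev, cur, reason, when = m.groups()
            rows.append(Switchover(int(idx), int(prev), int(cur), reason, when))
    return rows


sample = """\
Index  Previous  Current  Switchover             Switchover
       active    active   reason                 time
-----  --------  -------  ----------             ----------
   1       1        2     user forced            20:49:22 Austral Mon Mar 13 2023
"""

for row in parse_switchover_history(sample):
    print(f"chassis {row.previous_active} -> {row.current_active}: {row.reason}")
```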

 

When the peer chassis which is in Hot Standby is powered down: no redundancy (simplex).

In this test, the Hot Standby had a system failure in a stack running in SSO mode. The Active chassis detected the loss of Keepalive on the RP link but remained in Active mode. The redundancy mode on the Active chassis became simplex (non-redundant).

9800-1#show redundancy states

       my state = 13 -ACTIVE

     peer state = 1  -DISABLED

           Mode = Simplex

           Unit = Primary

        Unit ID = 1

 

Redundancy Mode (Operational) = Non-redundant

Redundancy Mode (Configured)  = sso

Redundancy State              = Non Redundant

     Maintenance Mode = Disabled

    Manual Swact = disabled (system is simplex (no peer unit))

 Communications = Down      Reason: Simplex mode

 

   client count = 127

 client_notification_TMR = 30000 milliseconds

           RF debug mask = 0x0

Gateway Monitoring = Enabled

Gateway monitoring interval  = 8 secs
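For monitoring purposes, this output can be reduced to a simple "is the pair redundant?" check. The sketch below assumes the "key = value" layout shown above. The helper names are hypothetical, and the rule used (operational mode is sso and communications are Up) is an assumption drawn from the outputs in this article, not an official criterion.

```python
def parse_redundancy_states(output: str) -> dict:
    """Turn 'key = value' lines into a dictionary."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an '=' sign
            fields[key.strip()] = value.strip()
    return fields


def is_redundant(fields: dict) -> bool:
    """Assumed rule: SSO is operational and the peer link is up."""
    return (
        fields.get("Redundancy Mode (Operational)", "").lower() == "sso"
        and fields.get("Communications", "").startswith("Up")
    )


simplex = """\
       my state = 13 -ACTIVE
     peer state = 1  -DISABLED
           Mode = Simplex
Redundancy Mode (Operational) = Non-redundant
Redundancy Mode (Configured)  = sso
 Communications = Down      Reason: Simplex mode
"""

fields = parse_redundancy_states(simplex)
print(fields["peer state"], "| redundant:", is_redundant(fields))
```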

 

The failed Standby chassis was removed from the stack.

9800-1#show chassis rmi

Chassis/Stack Mac Address : 0800.27cb.16f0 - Local Mac Address

Mac persistency wait time: Indefinite

                                             H/W   Current

Chassis#   Role    Mac Address     Priority Version  State                 IP                RMI-IP

--------------------------------------------------------------------------------------------------------

*1       Active   0800.27cb.16f0     2      V02     Ready                169.254.203.33     192.168.203.33

 2       Member   0000.0000.0000     0      V02     Removed              169.254.203.34     192.168.203.34

 

The redundancy history and chassis logs show when the communication was lost with the Standby chassis.

9800-1#show redundancy history

Mar 13 18:44:39.885  my state = ACTIVE(13) *peer state = STANDBY HOT(8)

Mar 13 18:55:32.019 Reloading peer (communication down)

Mar 13 18:55:32.021 Reloading peer (peer presence lost)

 

9800-1#show logging

Mar 13 18:55:31.488: %IOSXE_REDUNDANCY-6-PEER_LOST: Active detected chassis 2 is no longer Standby

Mar 13 18:55:31.097: %STACKMGR-6-KA_MISSED: Chassis 1 R0/0: stack_mgr: Keepalive missed for 2 times for Chassis 2

Mar 13 18:55:31.489: %REDUNDANCY-3-STANDBY_LOST: Standby processor fault (PEER_NOT_PRESENT)

Mar 13 18:55:31.489: %REDUNDANCY-3-STANDBY_LOST: Standby processor fault (PEER_DOWN)

Mar 13 18:55:31.489: %REDUNDANCY-3-STANDBY_LOST: Standby processor fault (PEER_REDUNDANCY_STATE_CHANGE)

Mar 13 18:55:31.911: %RF-5-RF_RELOAD: Peer reload. Reason: EHSA Standby down

Mar 13 18:55:31.399: %STACKMGR-6-CHASSIS_REMOVED: Chassis 1 R0/0: stack_mgr: Chassis 2 has been removed from the stack.

Mar 13 18:55:31.403: %STACKMGR-6-CHASSIS_REMOVED_KA: Chassis 1 R0/0: stack_mgr: Chassis 2 has been removed from the stack due to keepalive failure.

Mar 13 18:55:48.887: %RIF_MGR_FSM-6-RMI_LINK_DOWN: Chassis 1 R0/0: rif_mgr: The RMI link is DOWN.

Mar 13 18:55:55.430: %RIF_MGR_FSM-6-RP_LINK_DOWN: Chassis 1 R0/0: rif_mgr: Setting RP link status to DOWN

Mar 13 18:55:55.430: %STACKMGR-1-DUAL_ACTIVE_CFG_MSG: Chassis 1 R0/0: stack_mgr: Dual Active Detection links are not available anymore

 

When chassis 2 came back online, it returned to the Hot Standby role and chassis 1 remained Active.

 

Failover due to network failure (default gateway not reachable from Active): Switchover, Management Gateway Failover, Active-Recovery mode and Standby-Recovery mode.

In this scenario, the default gateway became unreachable from the Active chassis 1. The uplink of the switch connected to chassis 1 went down but the GigabitEthernet 2 on the Active chassis did not go down.

A Switchover occurred when the loss of default gateway was detected on the Active chassis. It triggered a reboot of the Active chassis and the Standby chassis took over the Active role about 10 seconds after the loss of gateway.

This scenario uses another RMI feature called Management Gateway Failover, or Gateway Reachability Check.

When enabled, the chassis checks the reachability of the default gateway by sending an ICMP Request every second, using the RMI IP address as the source. If 4 ICMP Requests and then 4 ARP Requests to the default gateway all fail (8 seconds in total), the 9800 considers the default gateway unreachable.

The duration of the Gateway Failover Interval (8 seconds by default) can be modified using the following command:

management gateway-failover interval <6-12>
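The detection logic described above can be modelled in a few lines. This is only a simulation sketch, not Cisco's implementation: the probe callback is a hypothetical stand-in for the real ICMP and ARP requests, the even split between the two protocols for intervals other than 8 seconds is an assumption, and so is the rule that a single answer keeps the gateway reachable.

```python
from typing import Callable, List, Tuple


def gateway_unreachable(
    probe: Callable[[str, int], bool],
    interval_s: int = 8,
    first_protocol: str = "ICMP",
) -> Tuple[bool, List[str]]:
    """Simulate one detection cycle with one probe per second.

    probe(protocol, second) returns True when the gateway answered.
    Returns (unreachable, list of protocols tried).
    """
    if not 6 <= interval_s <= 12:
        raise ValueError("interval must be between 6 and 12 seconds")
    second_protocol = "ARP" if first_protocol == "ICMP" else "ICMP"
    half = interval_s // 2  # assumption: first half uses the first protocol
    tried = []
    for second in range(interval_s):
        protocol = first_protocol if second < half else second_protocol
        tried.append(protocol)
        if probe(protocol, second):
            return False, tried  # any answer keeps the gateway reachable
    return True, tried


# Gateway never answers: declared unreachable after 4 ICMP + 4 ARP probes.
down, tried = gateway_unreachable(lambda protocol, second: False)
print(down, tried)
```

The first_protocol parameter reflects the observation in this lab that monitoring sometimes starts with ARP and sometimes with ICMP.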

Since release 17.2, the gateway IP address is taken from the static routes instead of the “ip default-gateway x.x.x.x” command.

Interestingly, the protocol used to monitor the gateway (ICMP or ARP) changed from one gateway-loss event to the next. In this case, the Active chassis was monitoring the default gateway using ARP, then switched to ICMP after 4 ARP failures. In this example, the uplink of the switch was disconnected at 20:59:00. In the last scenario of this article, the Active chassis was instead monitoring the default gateway using ICMP, then switched to ARP after 4 ICMP failures.

The Active chassis detected the loss of gateway on the management/RMI interface at 20:59:10 and went into Active-Recovery mode while preparing to reboot with the reason RIF: GW DOWN. Chassis 1 then rebooted at 20:59:29.

Recovery mode is entered when one of the monitored resources (RP link, RMI link, gateway) is unavailable on the chassis. In this case, the RMI link was down. In recovery mode, all the ports are administratively down and no synchronisation or configuration change can occur.

9800-1(recovery-mode)#show logging

Apr  2 20:59:10.078: RMI-HAINFRA-INFO: Originating event to Shut all Interfaces

Apr  2 20:59:10.078: RMI-HAINFRA-INFO: Shutting down all interfaces in ActiveRecovery

Apr  2 20:59:10.080: RMI-HAINFRA-INFO: Not shutting down the interface-rmi: Vlan1

Apr  2 20:59:10.054: %RIF_MGR_FSM-6-GW_UNREACHABLE_ACTIVE: Chassis 1 R0/0: rif_mgr: Gateway not reachable from Active

Apr  2 20:59:10.059: %STACKMGR-1-RELOAD: Chassis 1 R0/0: stack_mgr: Reloading due to reason Reload Command - RIF: GW DOWN

Apr  2 20:59:10.070: %RIF_MGR_FSM-6-RMI_ACTIVE_RECOVERY_MODE: Chassis 1 R0/0: rif_mgr: Going to Active(Recovery) from Active state

Apr  2 20:59:11.055: %RIF_MGR_FSM-6-RMI_GW_DECISION_DEFERRED: Chassis 1 R0/0: rif_mgr: High CPU utilisation on active or Standby, deferring action  on gateway-down event

Apr  2 20:59:12.079: %LINK-5-CHANGED: Interface GigabitEthernet1, changed state to administratively down

Apr  2 20:59:13.080: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet1, changed state to down

Apr  2 20:59:13.603: %RIF_MGR_FSM-6-RMI_LINK_DOWN: Chassis 1 R0/0: rif_mgr: The RMI link is DOWN.

Apr  2 20:59:28.824: %REDUNDANCY-3-STANDBY_LOST: Standby processor fault (KEEPALIVE_FAILURE)

Apr  2 20:59:29.735: %RF-5-RF_RELOAD: Peer reload. Reason: EHSA Standby down

 

The chassis 1 prompt also changed to Recovery mode:

9800-1(recovery-mode)#

 

Meanwhile, the RP link stayed up and Keepalive messages kept being exchanged every 100 ms. When chassis 1 went into Active-Recovery at 20:59:10, the Keepalive messages stopped and chassis 1 answered the last Keepalive with an ICMP Port Unreachable on the RP link. The Hot Standby chassis then detected the failure of the RP link and triggered a Switchover.

 

The Standby chassis 2 declared chassis 1 lost at 20:59:10 and started the Switchover. The redundancy history on chassis 2 shows the Switchover event:

9800-2#show redundancy history

Apr  2 20:56:46.918 *my state = STANDBY HOT(8) peer state = ACTIVE(13)

 

Apr  2 20:59:10.523 Reloading peer (communication down)

Apr  2 20:59:10.525 Reloading peer (peer presence lost)

Apr  2 20:59:10.525 *my state = ACTIVE-FAST(9) peer state = DISABLED(1)

Apr  2 20:59:10.988 *my state = ACTIVE-DRAIN(10) peer state = DISABLED(1)

Apr  2 20:59:11.000 *my state = ACTIVE_PRECONFIG(11) peer state = DISABLED(1)

Apr  2 20:59:11.006 *my state = ACTIVE_POSTCONFIG(12) peer state = DISABLED(1)

Apr  2 20:59:11.008 *my state = ACTIVE(13) peer state = DISABLED(1)
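The time the Standby needed to walk through the takeover states can be computed from this history. The sketch below is an illustration under a few assumptions: the line format is the one shown above, the year is passed in because the output does not include it, month names are parsed with an English locale, and the helper names are hypothetical.

```python
import re
from datetime import datetime

# Only lines where the local state changed ("*my state = ...").
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}\.\d{3})\s+\*my state = (.+?)\(\d+\)"
)


def my_state_transitions(history: str, year: int = 2023):
    """Return (timestamp, state) pairs for the local chassis."""
    transitions = []
    for line in history.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            stamp = " ".join(m.group(1).split())  # "Apr  2" -> "Apr 2"
            when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S.%f")
            transitions.append((when, m.group(2)))
    return transitions


def takeover_duration(history: str) -> float:
    """Seconds between ACTIVE-FAST and ACTIVE (raises if either is missing)."""
    transitions = my_state_transitions(history)
    start = next(t for t, s in transitions if s == "ACTIVE-FAST")
    end = next(t for t, s in transitions if s == "ACTIVE")
    return (end - start).total_seconds()


history = """\
Apr  2 20:56:46.918 *my state = STANDBY HOT(8) peer state = ACTIVE(13)
Apr  2 20:59:10.523 Reloading peer (communication down)
Apr  2 20:59:10.525 *my state = ACTIVE-FAST(9) peer state = DISABLED(1)
Apr  2 20:59:10.988 *my state = ACTIVE-DRAIN(10) peer state = DISABLED(1)
Apr  2 20:59:11.000 *my state = ACTIVE_PRECONFIG(11) peer state = DISABLED(1)
Apr  2 20:59:11.006 *my state = ACTIVE_POSTCONFIG(12) peer state = DISABLED(1)
Apr  2 20:59:11.008 *my state = ACTIVE(13) peer state = DISABLED(1)
"""

print(f"{takeover_duration(history):.3f} s")  # prints "0.483 s"
```

In this capture, chassis 2 went from ACTIVE-FAST to ACTIVE in under half a second once the Switchover had started.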

 

The chassis 2 logs show that chassis 1 was removed from the stack and that chassis 2 immediately took the Active role.

9800-2#show logging

Apr  2 20:59:10.148: %PLATFORM-6-HASTATUS: RP switchover, received chassis event to become active

Apr  2 20:59:10.149: %REDUNDANCY-3-SWITCHOVER: RP switchover (PEER_REDUNDANCY_STATE_CHANGE)

Apr  2 20:59:10.223: %PLATFORM-6-HASTATUS: RP switchover, sent message became active. IOS is ready to switch to primary after chassis confirmation

Apr  2 20:59:10.238: %PLATFORM-6-HASTATUS: RP switchover, received chassis event became active

Apr  2 20:59:10.060: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 2 R0/0: stack_mgr: Stack port 2 on Chassis 2 is down

Apr  2 20:59:10.060: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 2 R0/0: stack_mgr: Stack port 1 on Chassis 2 is down

Apr  2 20:59:10.060: %STACKMGR-6-CHASSIS_REMOVED: Chassis 2 R0/0: stack_mgr: Chassis 1 has been removed from the stack.

Apr  2 20:59:10.473: %PLATFORM-6-HASTATUS_DETAIL: RP switchover, received chassis event became active. Switch to primary (count 1)

Apr  2 20:59:10.474: %HA-6-SWITCHOVER: Route Processor switched from Standby to being active

Apr  2 20:59:10.526: pm_port_em_recovery

Apr  2 20:59:10.574: WLC-HA-Notice: RF Progression event: RF_PROG_ACTIVE_FAST, Switchover triggered

Apr  2 20:59:10.994: RMI-HAINFRA-INFO: Configured primary IP 192.168.203.31/255.255.255.0 on active(mgmt)

Apr  2 20:59:10.994: RMI-HAINFRA-INFO: Configured secondary IP 192.168.203.34/255.255.255.0 on active(mgmt)

Apr  2 20:59:11.010: %VOICE_HA-2-SWITCHOVER_IND: SWITCHOVER, from STANDBY_HOT to ACTIVE state.

Apr  2 20:59:11.013: WLC-HA-Notice: Sending garp intf = GigabitEthernet1, addr=192.168.201.151

Apr  2 20:59:11.029: %PKI-6-CS_ENABLED: Certificate server now enabled.

Apr  2 20:59:11.029: WLC-HA-Notice: Sending garp intf = LIIN0, addr=192.168.1.6

Apr  2 20:59:11.043: WLC-HA-Notice: Sending garp intf = Vlan1, addr=192.168.203.31

Apr  2 20:59:11.588: %CALL_HOME-6-CALL_HOME_ENABLED: Call-home is enabled by Smart Agent for Licensing.

Apr  2 20:59:13.009: %LINK-3-UPDOWN: Interface Null0, changed state to up

Apr  2 20:59:13.010: %LINK-3-UPDOWN: Interface GigabitEthernet1, changed state to up

Apr  2 20:59:13.011: %LINK-3-UPDOWN: Interface GigabitEthernet2, changed state to up

Apr  2 20:59:13.013: %LINK-3-UPDOWN: Interface Vlan1, changed state to up

Apr  2 20:59:13.502: %RIF_MGR_FSM-6-RMI_LINK_DOWN: Chassis 2 R0/0: rif_mgr: The RMI link is DOWN.

Apr  2 20:59:14.009: %LINEPROTO-5-UPDOWN: Line protocol on Interface Null0, changed state to up

Apr  2 20:59:14.010: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet1, changed state to up

Apr  2 20:59:14.012: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet2, changed state to up

Apr  2 20:59:14.019: %LINEPROTO-5-UPDOWN: Line protocol on Interface Vlan1, changed state to up

Apr  2 20:59:16.005: %RIF_MGR_FSM-6-GW_REACHABLE_ACTIVE: Chassis 2 R0/0: rif_mgr: Gateway reachable from Active

Apr  2 21:00:00.968: %RIF_MGR_FSM-6-RP_LINK_DOWN: Chassis 2 R0/0: rif_mgr: Setting RP link status to DOWN

Apr  2 21:00:00.968: %STACKMGR-1-DUAL_ACTIVE_CFG_MSG: Chassis 2 R0/0: stack_mgr: Dual Active Detection links are not available anymore
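Note that the show logging output above is not strictly chronological: the stack manager messages stamped 20:59:10.060 are listed after messages stamped 20:59:10.238. When reconstructing a timeline, it helps to parse and sort the lines first. The following sketch assumes the timestamp and %FACILITY-SEVERITY-MNEMONIC layout seen in this article, an English locale for month names, and a year supplied by the caller. The function name parse_log is hypothetical.

```python
import re
from datetime import datetime

# Optional leading "*", timestamp, optional %FACILITY-SEV-MNEMONIC tag, text.
LOG_RE = re.compile(
    r"^\*?(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}\.\d{3}): "
    r"(?:%([A-Z0-9_]+)-(\d)-([A-Z0-9_]+): )?(.*)$"
)


def parse_log(output: str, year: int = 2023):
    """Parse syslog lines and return them sorted by timestamp."""
    entries = []
    for line in output.splitlines():
        m = LOG_RE.match(line.strip())
        if not m:
            continue
        stamp = " ".join(m.group(1).split())  # "Apr  2" -> "Apr 2"
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S.%f")
        entries.append({
            "time": when,
            "facility": m.group(2),  # None for untagged lines
            "severity": int(m.group(3)) if m.group(3) else None,
            "mnemonic": m.group(4),
            "text": m.group(5),
        })
    return sorted(entries, key=lambda e: e["time"])  # stable sort


log = """\
Apr  2 20:59:10.148: %PLATFORM-6-HASTATUS: RP switchover, received chassis event to become active
Apr  2 20:59:10.149: %REDUNDANCY-3-SWITCHOVER: RP switchover (PEER_REDUNDANCY_STATE_CHANGE)
Apr  2 20:59:10.060: %STACKMGR-6-CHASSIS_REMOVED: Chassis 2 R0/0: stack_mgr: Chassis 1 has been removed from the stack.
Apr  2 20:59:10.474: %HA-6-SWITCHOVER: Route Processor switched from Standby to being active
Apr  2 20:59:10.526: pm_port_em_recovery
"""

for entry in parse_log(log):
    print(entry["time"].strftime("%H:%M:%S.%f")[:-3], entry["mnemonic"])
```

Once sorted, the removal of chassis 1 from the stack appears first, followed by the Switchover messages.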

 

Once chassis 1 had finished rebooting, it took the Standby role:

9800-1#show redundancy history

Apr  2 20:56:47.419  my state = ACTIVE(13) *peer state = STANDBY HOT(8)

 

00:00:05 *my state = INITIALIZATION(2) peer state = DISABLED(1)

00:00:05 *my state = NEGOTIATION(3) peer state = DISABLED(1)

00:00:05 *my state = STANDBY COLD(4) peer state = DISABLED(1)

00:00:05  my state = STANDBY COLD(4) *peer state = ACTIVE(13)

00:08:03 *my state = STANDBY_ISSU_NEGOTIATION_LATE(35) peer state = ACTIVE(13)

Apr  2 21:02:17.154 *my state = STANDBY COLD-CONFIG(5) peer state = ACTIVE(13)

Apr  2 21:03:00.880 *my state = STANDBY COLD-FILESYS(6) peer state = ACTIVE(13)

Apr  2 21:03:02.666 *my state = STANDBY COLD-BULK(7) peer state = ACTIVE(13)

Apr  2 21:03:18.023 *my state = STANDBY HOT(8) peer state = ACTIVE(13)

 

9800-2#show redundancy history

Apr  2 21:01:56.348  my state = ACTIVE(13) *peer state = UNKNOWN(0)

Apr  2 21:01:56.465  my state = ACTIVE(13) *peer state = STANDBY COLD(4)

Apr  2 21:02:15.148  my state = ACTIVE(13) *peer state = STANDBY_ISSU_NEGOTIATION_LATE(35)

Apr  2 21:02:17.154  my state = ACTIVE(13) *peer state = STANDBY COLD-CONFIG(5)

Apr  2 21:03:00.444  my state = ACTIVE(13) *peer state = STANDBY COLD-FILESYS(6)

Apr  2 21:03:02.647  my state = ACTIVE(13) *peer state = STANDBY COLD-BULK(7)

Apr  2 21:03:18.704  my state = ACTIVE(13) *peer state = STANDBY HOT(8)

 

However, the chassis 1 went into Standby-Recovery mode because the default gateway was not reachable yet.

9800-1(recovery-mode)-stby#show logging

Apr  2 21:02:59.905: %SYS-5-RESTART: System restarted --

Apr  2 21:03:19.086: %PLATFORM-6-RF_PROG_SUCCESS: RF state STANDBY HOT

Apr  2 21:03:33.827: RMI-GW-ERR: IPV4 GW/L2: Cannot get ARP entry for dest ip : 192.168.203.2

Apr  2 21:03:34.829: RMI-GW-ERR: IPV4 GW/L2: Cannot get ARP entry for dest ip : 192.168.203.2

Apr  2 21:03:36.510: RMI-GW-ERR: IPV4 GW/L2: Cannot get ARP entry for dest ip : 192.168.203.2

Apr  2 21:03:37.510: RMI-GW-ERR: IPV4 GW/L2: Cannot get ARP entry for dest ip : 192.168.203.2

Apr  2 21:03:38.512: %RIF_MGR_FSM-6-GW_UNREACHABLE_STANDBY: Chassis 1 R0/0: rif_mgr: Gateway not reachable from Standby

Apr  2 21:03:38.512: %RIF_MGR_FSM-6-RMI_STBY_TO_STDBY_REC: Chassis 1 R0/0: rif_mgr: Going from Standby to Standby(Recovery) state

Apr  2 21:03:42.510: RMI-GW-ERR: IPV4 GW/L2: Cannot get ARP entry for dest ip : 192.168.203.2

Apr  2 21:03:43.510: RMI-GW-ERR: IPV4 GW/L2: Cannot get ARP entry for dest ip : 192.168.203.2

Apr  2 21:03:44.510: RMI-GW-ERR: IPV4 GW/L2: Cannot get ARP entry for dest ip : 192.168.203.2

Apr  2 21:03:45.510: RMI-GW-ERR: IPV4 GW/L2: Cannot get ARP entry for dest ip : 192.168.203.2

Apr  2 21:03:50.510: RMI-GW-ERR: IPV4 GW/L2: Cannot get ARP entry for dest ip : 192.168.203.2

Apr  2 21:03:51.509: RMI-GW-ERR: IPV4 GW/L2: Cannot get ARP entry for dest ip : 192.168.203.2

 

The pair of chassis still synchronised and established SSO redundancy.

9800-2#show logging

Apr  2 21:00:49.786: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 2 R0/0: stack_mgr: Stack port 2 on Chassis 2 is up

Apr  2 21:00:49.787: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 2 R0/0: stack_mgr: Stack port 1 on Chassis 2 is up

Apr  2 21:00:49.826: %STACKMGR-6-CHASSIS_ADDED: Chassis 2 R0/0: stack_mgr: Chassis 1 has been added to the stack.

Apr  2 21:00:52.134: %STACKMGR-6-CHASSIS_ADDED: Chassis 2 R0/0: stack_mgr: Chassis 1 has been added to the stack.

Apr  2 21:00:55.297: %RIF_MGR_FSM-6-RP_LINK_UP: Chassis 2 R0/0: rif_mgr: The RP link is UP.

Apr  2 21:00:55.297: %STACKMGR-1-DUAL_ACTIVE_CFG_MSG: Chassis 2 R0/0: stack_mgr: Dual Active Detection link is available now

Apr  2 21:01:09.910: %EWLC_HA_LIB_MESSAGE-6-BULK_SYNC_STATE_INFO: Chassis 2 R0/0: wncmgrd: INFO: Bulk sync status : COLD

Apr  2 21:01:15.280: %RIF_MGR_FSM-6-RP_LINK_UP: Chassis 2 R0/0: rif_mgr: The RP link is UP.

Apr  2 21:01:26.200: %IOSXE_REDUNDANCY-6-PEER: Active detected chassis 1 as Standby.

Apr  2 21:01:26.191: %STACKMGR-6-STANDBY_ELECTED: Chassis 2 R0/0: stack_mgr: Chassis 1 has been elected STANDBY.

Apr  2 21:01:51.282: %REDUNDANCY-5-PEER_MONITOR_EVENT: Active detected a Standby insertion (raw-event=PEER_FOUND(4))

Apr  2 21:01:51.282: %REDUNDANCY-5-PEER_MONITOR_EVENT: Active detected a Standby insertion (raw-event=PEER_REDUNDANCY_STATE_CHANGE(5))

Apr  2 21:02:00.277: Syncing vlan database

Apr  2 21:02:00.328: Vlan Database sync done from bootflash:vlan.dat to stby-bootflash:vlan.dat (556 bytes)

Apr  2 21:03:18.730: %HA_CONFIG_SYNC-6-BULK_CFGSYNC_SUCCEED: Bulk Sync succeeded

Apr  2 21:03:18.762: %VOICE_HA-7-STATUS: VOICE HA bulk sync done.

Apr  2 21:03:19.769: %RF-5-RF_TERMINAL_STATE: Terminal state reached for (SSO)

Apr  2 21:03:38.513: %RIF_MGR_FSM-6-GW_UNREACHABLE_STANDBY: Chassis 2 R0/0: rif_mgr: Gateway not reachable from Standby

Apr  2 21:03:38.513: %RIF_MGR_FSM-6-RMI_STBY_TO_STDBY_REC: Chassis 2 R0/0: rif_mgr: Going from Standby to Standby(Recovery) state

 

The chassis 1 prompt changed to Standby-Recovery mode:

9800-1(recovery-mode)-stby#show redundancy states

       my state = 8  -STANDBY HOT

     peer state = 13 -ACTIVE

           Mode = Duplex

           Unit = Primary

        Unit ID = 1

 

Redundancy Mode (Operational) = sso

Redundancy Mode (Configured)  = sso

Redundancy State              = sso

     Maintenance Mode = Disabled

    Manual Swact = cannot be initiated from this the Standby unit

 Communications = Up

 

   client count = 127

 client_notification_TMR = 30000 milliseconds

           RF debug mask = 0x0

Gateway Monitoring = Enabled

Gateway monitoring interval  = 8 secs

 

The cause of the Switchover is described as “Active lost GW”:

9800-2#show redundancy switchover history

Index  Previous  Current  Switchover             Switchover

       active    active   reason                 time

-----  --------  -------  ----------             ----------

   1       1        2     Active lost GW         20:59:11 Austral Sun Apr 2 2023

 

At 21:06:03, the uplink on the switch was reconnected. Once the default gateway became reachable, chassis 1 went immediately from Standby-Recovery mode to Hot Standby without a reboot and the RMI link came up again.

9800-1-stby#show logging

Apr  2 21:05:59.510: RMI-GW-ERR: IPV4 GW/L2: Cannot get ARP entry for dest ip : 192.168.203.2

 

Apr  2 21:06:00.510: RMI-GW-ERR: IPV4 GW/L2: Cannot get ARP entry for dest ip : 192.168.203.2

 

Apr  2 21:06:01.510: RMI-GW-ERR: IPV4 GW/L2: Cannot get ARP entry for dest ip : 192.168.203.2

 

Apr  2 21:06:03.510: %RIF_MGR_FSM-6-GW_REACHABLE_STANDBY: Chassis 1 R0/0: rif_mgr: Gateway reachable from Standby

Apr  2 21:06:03.510: %RIF_MGR_FSM-6-RMI_STBY_REC_TO_STBY: Chassis 1 R0/0: rif_mgr: Going from Standby(Recovery) to Standby state

Apr  2 21:06:05.334: %RIF_MGR_FSM-6-RMI_LINK_UP: Chassis 1 R0/0: rif_mgr: The RMI link is UP.

 

9800-2#show logging

Apr  2 21:06:02.068: %RIF_MGR_FSM-6-RMI_LINK_UP: Chassis 2 R0/0: rif_mgr: The RMI link is UP.

Apr  2 21:06:03.515: %RIF_MGR_FSM-6-GW_REACHABLE_STANDBY: Chassis 2 R0/0: rif_mgr: Gateway reachable from Standby

 

Failover due to network failure (management port goes down): Switchover, RMI port down.

This section describes the Switchover that occurred when the management/RMI interface (GigabitEthernet2 port) was disconnected on the Active chassis 1. This scenario is similar to the previous one: the default gateway became unreachable from chassis 1 and chassis 2 moved to the Active role. The Switchover occurred about 9 seconds after the port was disconnected.

In this test, the management port was disconnected on chassis 1 at 21:14:19. Chassis 1 detected the loss of the gateway at 21:14:27, 8 seconds after the port went down. It then went into Active-Recovery mode, shut down its interfaces and rebooted at 21:14:47.

Meanwhile, chassis 2 detected the loss of Keepalives through the RP port. Chassis 1 answered the last Keepalive with an ICMP Port Unreachable at 21:14:27.

When the chassis 2 detected the loss of the Keepalives over the RP port, it triggered a Switchover and became the Active chassis at 21:14:28.
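The timing of this test is easier to read as offsets from the moment the port was disconnected. The short sketch below only redoes the arithmetic on the timestamps quoted in this section. The event labels and the offsets helper are my own.

```python
from datetime import datetime

# Timestamps quoted in this section (same day, so the date is ignored).
events = [
    ("management port disconnected on chassis 1", "21:14:19"),
    ("chassis 1 detects the loss of the gateway", "21:14:27"),
    ("chassis 2 becomes Active", "21:14:28"),
    ("chassis 1 reboots", "21:14:47"),
]


def offsets(events):
    """Return (description, seconds since the first event) pairs."""
    t0 = datetime.strptime(events[0][1], "%H:%M:%S")
    return [
        (name, int((datetime.strptime(ts, "%H:%M:%S") - t0).total_seconds()))
        for name, ts in events
    ]


for name, delta in offsets(events):
    print(f"+{delta:>2d}s  {name}")
# +0s, +8s, +9s and +28s respectively
```

The 8-second gap matches the default gateway failover interval, and chassis 2 took over one second later.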

Interestingly, there was still some communication between the chassis through the RP port after the loss of the UDP Keepalive messages.

The two chassis had also formed GRE tunnels through the RP port to exchange information about the stack. Within the tunnel, chassis 1 sent chassis 2 information including events such as the chassis reload and its reason codes.

When the Active chassis failed, the two chassis determined the reason for the Switchover and checked the resources (RP link, RMI interface, default gateway). The gateway reachability information of chassis 1 was exchanged with chassis 2 over the RP link.

In addition, the Switchover history showed a different reason for the Switchover than in the previous scenario: Active RMI port down.

9800-2#show redundancy switchover history

Index  Previous  Current  Switchover             Switchover

       active    active   reason                 time

-----  --------  -------  ----------             ----------

   5       1        2     Active RMI port down   21:14:28 Austral Mon Mar 13 2023

 

When the chassis 1 was reconnected, it joined the stack with the Standby role.

 

Failover due to network failure (RP interface fails): no Switchover, RMI link, RP Standby-Recovery

In this case, the RP port was disconnected on the Active chassis 1 and there was no Switchover. Chassis 1 remained in the Active role and the Standby went into Standby-Recovery mode. When the RP port was reconnected, the Standby chassis rebooted for bulk synchronisation and returned to the Standby role.

In detail, the RP port was disconnected at 21:29:00 on the Active chassis 1 and chassis 2 detected the loss of Keepalive messages within 500 ms. Chassis 2 immediately went into Standby-Recovery mode with the reason RP DOWN.

9800-2(rp-rec-mode)#show logging

Mar 13 21:29:00.310: %STACKMGR-6-KA_MISSED: Chassis 2 R0/0: stack_mgr: Keepalive missed for 2 times for Chassis 1

Mar 13 21:29:00.614: %STACKMGR-6-CHASSIS_REMOVED: Chassis 2 R0/0: stack_mgr: Chassis 1 has been removed from the stack.

Mar 13 21:29:00.616: %STACKMGR-6-CHASSIS_REMOVED_KA: Chassis 2 R0/0: stack_mgr: Chassis 1 has been removed from the stack due to keepalive failure.

Mar 13 21:29:00.623: %RIF_MGR_FSM-6-RMI_STBY_TO_STDBY_REC_REASON: Chassis 2 R0/0: rif_mgr: Going from Standby to Standby(Recovery) state, Reason: RP DOWN

Mar 13 21:29:28.152: %RIF_MGR_FSM-6-RP_LINK_DOWN: Chassis 2 R0/0: rif_mgr: Setting RP link status to DOWN

 

The RP link carries the Keepalive messages between the Active and the Standby chassis. However, the RP mode of SSO makes the RP link a single point of failure: the Standby cannot distinguish between a controller failure and a link failure.

The RMI mode of SSO uses a secondary IP address on the management interface to monitor the chassis via a second link between the Active and the Standby chassis. The RMI link uses TCP port 3200 to exchange resource health information, including gateway reachability, from each chassis. The Hot Standby chassis was sending TCP Keepalive messages from its RMI interface (gi2) to the RMI interface of the Active chassis every 15 seconds or when required.

In our example, when chassis 2 detected the loss of Keepalives on the RP port, it immediately sent a TCP port 3200 message to chassis 1 over the RMI link at 21:29:00. Chassis 1 replied, confirming that the RMI link was still up.

The RMI link prevented the Switchover and allowed the Standby to move into Standby-Recovery.

 

The chassis 2 state changed to Standby-Recovery mode due to RP DOWN:

9800-2(rp-rec-mode)-stby#show redundancy states

       my state = 101-STANDBY RECOVERY(RP DOWN)

     peer state = 13 -ACTIVE

           Mode = Simplex

           Unit = Primary

        Unit ID = 2

 

Redundancy Mode (Operational) = sso

Redundancy Mode (Configured)  = sso

Redundancy State              = Non Redundant

     Maintenance Mode = Disabled

    Manual Swact = cannot be initiated from this the Standby unit

 Communications = Down      Reason: Simplex mode

 

   client count = 127

 client_notification_TMR = 30000 milliseconds

           RF debug mask = 0x0

Gateway Monitoring = Enabled

Gateway monitoring interval  = 8 secs

 

The chassis 1 remained Active and removed the chassis 2 from the stack:

Mar 13 21:29:00.392: %STACKMGR-6-KA_MISSED: Chassis 1 R0/0: stack_mgr: Keepalive missed for 2 times for Chassis 2

Mar 13 21:29:01.000: %IOSXE_REDUNDANCY-6-PEER_LOST: Active detected chassis 2 is no longer Standby

Mar 13 21:29:00.619: %RIF_MGR_FSM-6-RMI_STBY_TO_STDBY_REC_REASON: Chassis 1 R0/0: rif_mgr: Going from Standby to Standby(Recovery) state, Reason: RP DOWN

Mar 13 21:29:00.710: %STACKMGR-6-CHASSIS_REMOVED: Chassis 1 R0/0: stack_mgr: Chassis 2 has been removed from the stack.

Mar 13 21:29:00.718: %STACKMGR-6-CHASSIS_REMOVED_KA: Chassis 1 R0/0: stack_mgr: Chassis 2 has been removed from the stack due to keepalive failure.

Mar 13 21:29:01.007: %REDUNDANCY-3-STANDBY_LOST: Standby processor fault (PEER_NOT_PRESENT)

Mar 13 21:29:01.007: %REDUNDANCY-3-STANDBY_LOST: Standby processor fault (PEER_DOWN)

Mar 13 21:29:01.007: %REDUNDANCY-3-STANDBY_LOST: Standby processor fault (PEER_REDUNDANCY_STATE_CHANGE)

Mar 13 21:29:02.176: %RF-5-RF_RELOAD: Peer reload. Reason: EHSA Standby down

Mar 13 21:29:24.301: %RIF_MGR_FSM-6-RP_LINK_DOWN: Chassis 1 R0/0: rif_mgr: Setting RP link status to DOWN

 

Chassis 1 went into Simplex mode (no redundancy):

9800-1#show redundancy states

       my state = 13 -ACTIVE

     peer state = 1  -DISABLED

           Mode = Simplex

           Unit = Primary

        Unit ID = 1

 

Redundancy Mode (Operational) = Non-redundant

Redundancy Mode (Configured)  = sso

Redundancy State              = Non Redundant

     Maintenance Mode = Disabled

    Manual Swact = disabled (system is simplex (no peer unit))

 Communications = Down      Reason: Simplex mode

 

   client count = 127

 client_notification_TMR = 30000 milliseconds

           RF debug mask = 0x0

Gateway Monitoring = Enabled

Gateway monitoring interval  = 8 secs

 

When the RP port was reconnected at 21:31:00, chassis 2 detected that the RP link was up and carrying Keepalive messages. Chassis 2 rebooted with the reason RIF: Bulk Sync to synchronise the configuration and became Hot Standby at 21:34:17.

9800-2#show logging

Mar 13 21:31:04.303: %RIF_MGR_FSM-6-RP_LINK_UP: Chassis 2 R0/0: rif_mgr: The RP link is UP.

Mar 13 21:31:04.305: %STACKMGR-1-RELOAD: Chassis 2 R0/0: stack_mgr: Reloading due to reason Reload Command - RIF: Bulk Sync

 

 

*Mar 13 11:32:19.714: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 2 R0/0: stack_mgr: Stack port 1 on Chassis 2 is down

*Mar 13 11:32:19.714: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 2 R0/0: stack_mgr: Stack port 2 on Chassis 2 is down

*Mar 13 11:32:19.714: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 2 R0/0: stack_mgr: Stack port 1 on Chassis 2 is up

*Mar 13 11:32:19.714: %STACKMGR-6-STACK_LINK_CHANGE: Chassis 2 R0/0: stack_mgr: Stack port 2 on Chassis 2 is up

*Mar 13 11:32:19.735: %STACKMGR-6-CHASSIS_ADDED: Chassis 2 R0/0: stack_mgr: Chassis 2 has been added to the stack.

*Mar 13 11:32:20.206: %RIF_MGR_FSM-6-RP_LINK_UP: Chassis 2 R0/0: rif_mgr: The RP link is UP.

*Mar 13 11:32:21.572: %STACKMGR-6-CHASSIS_ADDED: Chassis 2 R0/0: stack_mgr: Chassis 2 has been added to the stack.

*Mar 13 11:32:25.249: %RIF_MGR_FSM-6-RP_LINK_UP: Chassis 2 R0/0: rif_mgr: The RP link is UP.

*Mar 13 11:32:25.249: %STACKMGR-1-DUAL_ACTIVE_CFG_MSG: Chassis 2 R0/0: stack_mgr: Dual Active Detection link is available now

*Mar 13 11:32:35.336: %EWLC_HA_LIB_MESSAGE-6-BULK_SYNC_STATE_INFO: Chassis 2 R0/0: wncmgrd: INFO: Bulk sync status : COLD

Mar 13 11:33:18.612: RMI-HAINFRA-INFO: Invoked cstate change on Standby config

Mar 13 11:33:18.613: RMI-HAINFRA-INFO: Invoking FIB to create RMI IPv4 192.168.203.34 entry

Mar 13 21:33:22.865: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet1, changed state to down

Mar 13 21:33:22.887: RMI-HAINFRA-INFO: Invoked cstate change on Standby config

Mar 13 21:33:22.887: RMI-HAINFRA-INFO: Adding Primary Mgmt IP 192.168.203.34/255.255.255.0

Mar 13 21:33:22.897: RMI-HAINFRA-INFO: Invoked Standby Config Message

Mar 13 21:33:22.897: RMI-HAINFRA-INFO: Adding Primary Mgmt IP 192.168.203.34/255.255.255.0

Mar 13 21:33:23.881: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet2, changed state to down

Mar 13 21:33:59.199: %IOSXE_OIR-6-INSCARD: Card (fp) inserted in slot F0

Mar 13 21:33:59.199: %IOSXE_OIR-6-ONLINECARD: Card (fp) online in slot F0

Mar 13 21:33:59.922: %EWLC_HA_LIB_MESSAGE-6-BULK_SYNC_STATE_INFO: Chassis 2 R0/0: wncmgrd: INFO: Bulk sync status : CONFIG_DONE

Mar 13 21:34:02.057: %SYS-5-RESTART: System restarted --

Mar 13 21:34:02.066: RMI-HAINFRA-INFO: Invoked cstate change on Standby config

Mar 13 21:34:02.066: RMI-HAINFRA-INFO: Adding Primary Mgmt IP 192.168.203.34/255.255.255.0

Mar 13 21:34:02.067: RMI-HAINFRA-INFO: Invoking FIB to create RMI IPv4 192.168.203.34 entry

Mar 13 21:34:17.829: %PLATFORM-6-RF_PROG_SUCCESS: RF state STANDBY HOT

Mar 13 21:34:31.250: %RIF_MGR_FSM-6-RMI_LINK_UP: Chassis 2 R0/0: rif_mgr: The RMI link is UP.

 

Note that if the RP link were down and the RMI interfaces could not reach each other, this would be a double fault, which could result in two Active controllers until connectivity is restored.

 

Failover due to network failure (default gateway not reachable from Active and from Standby): no Switchover

In this last scenario, the default gateway was not reachable from either the Active or the Standby chassis, while both the RP link and the RMI link remained UP because both 9800-CL instances were connected to the same switch. There was no Switchover in this case. The Standby chassis went into Standby-Recovery mode after detecting the loss of the default gateway.

When the default gateway was back, the Standby chassis returned to Hot Standby without a reboot.

The default gateway was first disconnected at 21:40:01. After 8 seconds, both chassis had detected the loss of the default gateway using 4 ARP Requests and 4 ICMP Requests.

Interestingly, the detection protocol used before the failure was ICMP in this case (compared to ARP in previous scenarios), showing that the detection protocol alternates in cycles.
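
The detection timing observed above can be modelled with a short Python sketch. This is only an illustration of what was seen in this lab, not Cisco's implementation: the one-second probe interval and the rule "declare the gateway down after 8 consecutive missed probes" are assumptions inferred from the 4 ICMP + 4 ARP requests seen over about 8 seconds, and the names probe_schedule and detect_gateway_loss are hypothetical.

```python
from itertools import cycle

# Assumed values inferred from this lab capture, not from Cisco documentation.
PROBES_PER_PROTOCOL = 4   # 4 ICMP and 4 ARP requests were seen before the failure was declared
PROBE_INTERVAL_S = 1.0    # assumption: 8 probes in roughly 8 seconds


def probe_schedule(first="ICMP"):
    """Yield the protocol of each successive probe, alternating in blocks of four."""
    order = ["ICMP", "ARP"] if first == "ICMP" else ["ARP", "ICMP"]
    for proto in cycle(order):
        for _ in range(PROBES_PER_PROTOCOL):
            yield proto


def detect_gateway_loss(replies, first="ICMP"):
    """Return (elapsed_seconds, missed_protocols) once the gateway is declared down.

    `replies` holds one boolean per probe (True = reply received).
    Returns None if the run of consecutive misses never reaches the threshold.
    """
    threshold = 2 * PROBES_PER_PROTOCOL
    missed = []
    for n, (proto, ok) in enumerate(zip(probe_schedule(first), replies), start=1):
        if ok:
            missed.clear()          # any reply resets the run of misses
        else:
            missed.append(proto)
        if len(missed) >= threshold:
            return n * PROBE_INTERVAL_S, missed
    return None


elapsed, missed = detect_gateway_loss([False] * 8)
print(elapsed, missed)
```

With eight unanswered probes starting on an ICMP cycle, the model reports a loss after 8 seconds, which matches the gap between 21:40:01 and the 21:40:09 log entries.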

 

Chassis 1 remained Active and chassis 2 moved into Standby-Recovery mode.

9800-1#show logging

Mar 13 21:40:09.191: %RIF_MGR_FSM-6-GW_UNREACHABLE_STANDBY: Chassis 1 R0/0: rif_mgr: Gateway not reachable from Standby

Mar 13 21:40:09.191: %RIF_MGR_FSM-6-RMI_STBY_TO_STDBY_REC: Chassis 1 R0/0: rif_mgr: Going from Standby to Standby(Recovery) state

Mar 13 21:40:09.605: %RIF_MGR_FSM-6-GW_UNREACHABLE_ACTIVE: Chassis 1 R0/0: rif_mgr: Gateway not reachable from Active

 

9800-2(recovery-mode)#show logging

Mar 13 21:40:09.157: %RIF_MGR_FSM-6-GW_UNREACHABLE_STANDBY: Chassis 2 R0/0: rif_mgr: Gateway not reachable from Standby

Mar 13 21:40:09.157: %RIF_MGR_FSM-6-RMI_STBY_TO_STDBY_REC: Chassis 2 R0/0: rif_mgr: Going from Standby to Standby(Recovery) state

 

9800-2(recovery-mode)#show redundancy

Redundant System Information :

------------------------------

       Available system uptime = 1 hour, 16 minutes

Switchovers system experienced = 6

 

                 Hardware Mode = Duplex

    Configured Redundancy Mode = sso

     Operating Redundancy Mode = sso

              Maintenance Mode = Disabled

                Communications = Up

 

Current Processor Information :

-------------------------------

              Standby Location = slot 2

        Current Software state = STANDBY HOT

       Uptime in current state = 5 minutes

                 Image Version = Cisco IOS Software [Bengaluru], C9800-CL Software (C9800-CL-K9_IOSXE), Version 17.6.4, RELEASE SOFTWARE (fc1)

Technical Support: http://www.cisco.com/techsupport

Copyright (c) 1986-2022 by Cisco Systems, Inc.

Compiled Sun 14-Aug-22 08:54 by mcpre

                          BOOT =

                   CONFIG_FILE =

        Configuration register = 0x102

               Recovery mode   = Standby recovery mode
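
Note that in this capture the "Current Software state" still reads STANDBY HOT while the chassis is in recovery mode, so a monitoring script should look at the "Recovery mode" line as well. The following Python sketch shows one way to do that on text already collected from the chassis. The function names are hypothetical, and the parsing simply assumes the "key = value" layout shown above.

```python
def parse_show_redundancy(text):
    """Collect the 'key = value' lines of 'show redundancy' output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def in_standby_recovery(fields):
    """True when the output carries a 'Recovery mode' line mentioning recovery."""
    return "recovery" in fields.get("Recovery mode", "").lower()


# Excerpt of the output shown above.
sample = """\
     Operating Redundancy Mode = sso
                Communications = Up
        Current Software state = STANDBY HOT
               Recovery mode   = Standby recovery mode
"""

fields = parse_show_redundancy(sample)
print(fields["Current Software state"], "|", in_standby_recovery(fields))
# prints: STANDBY HOT | True
```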

 

When the default gateway was restored at 21:42:00, chassis 2 left Standby-Recovery mode and immediately became Standby again, without any bulk configuration sync or reboot.

9800-1#show logging

Mar 13 21:42:03.193: %RIF_MGR_FSM-6-GW_REACHABLE_STANDBY: Chassis 1 R0/0: rif_mgr: Gateway reachable from Standby

Mar 13 21:42:03.604: %RIF_MGR_FSM-6-GW_REACHABLE_ACTIVE: Chassis 1 R0/0: rif_mgr: Gateway reachable from Active

 

9800-2#show logging

Mar 13 21:42:03.187: %RIF_MGR_FSM-6-GW_REACHABLE_STANDBY: Chassis 2 R0/0: rif_mgr: Gateway reachable from Standby

Mar 13 21:42:03.187: %RIF_MGR_FSM-6-RMI_STBY_REC_TO_STBY: Chassis 2 R0/0: rif_mgr: Going from Standby(Recovery) to Standby state
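
To measure how long the Standby stayed without gateway reachability, the RIF_MGR_FSM messages can be extracted from a saved log. The Python sketch below is a minimal example that assumes the syslog line layout shown in this article. The function names are hypothetical, and because the timestamps carry no year, a placeholder year is added purely so that the time difference can be computed.

```python
import re
from datetime import datetime

LINE_RE = re.compile(
    r"^\*?(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}\.\d{3}): "
    r"%(?P<facility>[A-Z0-9_]+)-(?P<severity>\d)-(?P<mnemonic>[A-Z0-9_]+): "
    r"(?P<text>.*)$"
)


def parse_line(line):
    """Split one syslog line into (timestamp, facility, severity, mnemonic, text)."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    # Placeholder year (assumption): only differences between timestamps are used.
    ts = datetime.strptime("2000 " + m["ts"], "%Y %b %d %H:%M:%S.%f")
    return ts, m["facility"], int(m["severity"]), m["mnemonic"], m["text"]


def gateway_outage_seconds(lines):
    """Seconds between the first GW_UNREACHABLE_STANDBY and the next GW_REACHABLE_STANDBY."""
    down = up = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None or parsed[1] != "RIF_MGR_FSM":
            continue
        ts, mnemonic = parsed[0], parsed[3]
        if mnemonic == "GW_UNREACHABLE_STANDBY" and down is None:
            down = ts
        elif mnemonic == "GW_REACHABLE_STANDBY" and down is not None:
            up = ts
            break
    if down is None or up is None:
        return None
    return (up - down).total_seconds()


# Lines taken from the chassis 2 logs above.
log = [
    "Mar 13 21:40:09.157: %RIF_MGR_FSM-6-GW_UNREACHABLE_STANDBY: Chassis 2 R0/0: rif_mgr: Gateway not reachable from Standby",
    "Mar 13 21:40:09.157: %RIF_MGR_FSM-6-RMI_STBY_TO_STDBY_REC: Chassis 2 R0/0: rif_mgr: Going from Standby to Standby(Recovery) state",
    "Mar 13 21:42:03.187: %RIF_MGR_FSM-6-GW_REACHABLE_STANDBY: Chassis 2 R0/0: rif_mgr: Gateway reachable from Standby",
    "Mar 13 21:42:03.187: %RIF_MGR_FSM-6-RMI_STBY_REC_TO_STBY: Chassis 2 R0/0: rif_mgr: Going from Standby(Recovery) to Standby state",
]

print(gateway_outage_seconds(log))  # 114.03
```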

 

Remove the HA SSO configuration

To clear an existing HA SSO configuration, use either the GUI or the CLI.

Using GUI: go to Administration | Device | Redundancy and set the Redundancy Configuration to Disabled.

Using CLI:

9800-1#clear chassis redundancy

WARNING: Clearing the chassis HA configuration will result in both the chassis move into Stand Alone mode. This involves reloading the Standby chassis after clearing its HA configuration and coming up with day-0 configuration. Do you wish to continue? [y/n]? [yes]: yes

 

The Active controller will go into Standalone mode and keep its management IP address. The Standby controller will reboot and enter Day 0 Setup, which avoids any duplicate IP address error.

 

Additional troubleshooting commands

Since release 17.5.1, some commands are available to verify the status of the RP interface and to test the connectivity to the peer:

9800-1#show platform hardware slot R0 ha_port interface stats

HA Port

 

ha_port   Link encap:Ethernet  HWaddr 08:00:27:cb:16:f0

          inet addr:169.254.203.33  Bcast:169.254.203.255  Mask:255.255.255.0

          inet6 addr: fe80::a00:27ff:fecb:16f0/64 Scope:Link

          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

          RX packets:44660 errors:0 dropped:0 overruns:0 frame:0

          TX packets:45814 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:1000

          RX bytes:16500609 (15.7 MiB)  TX bytes:16217650 (15.4 MiB)

 

9800-1#test wireless redundancy rping

Redundancy Port ping

 

PING 169.254.203.34 (169.254.203.34) 56(84) bytes of data.

64 bytes from 169.254.203.34: icmp_seq=1 ttl=64 time=4.43 ms

64 bytes from 169.254.203.34: icmp_seq=2 ttl=64 time=2.92 ms

64 bytes from 169.254.203.34: icmp_seq=3 ttl=64 time=3.06 ms

 

--- 169.254.203.34 ping statistics ---

3 packets transmitted, 3 received, 0% packet loss, time 2002ms

rtt min/avg/max/mdev = 2.924/3.471/4.426/0.677 ms
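
If the rping result has to be checked from a script, the summary lines can be parsed with two regular expressions. The Python sketch below assumes the Linux-style ping summary format shown above. parse_rping is a hypothetical helper name, and the output text is expected to have been collected beforehand by whatever means is available.

```python
import re

STATS_RE = re.compile(
    r"(\d+) packets transmitted, (\d+) received, (\d+(?:\.\d+)?)% packet loss"
)
RTT_RE = re.compile(r"rtt min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms")


def parse_rping(output):
    """Extract packet counts, loss and average RTT from the rping summary."""
    stats = STATS_RE.search(output)
    if stats is None:
        return None
    result = {
        "sent": int(stats.group(1)),
        "received": int(stats.group(2)),
        "loss_pct": float(stats.group(3)),
    }
    rtt = RTT_RE.search(output)
    if rtt is not None:
        result["rtt_avg_ms"] = float(rtt.group(2))
    return result


# Summary lines copied from the output above.
sample = """\
--- 169.254.203.34 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 2.924/3.471/4.426/0.677 ms
"""

summary = parse_rping(sample)
print(summary)
```

A non-zero loss_pct on the RP link would be worth investigating before relying on the HA pair.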

 

Another command performs a packet capture on the RP port and saves the file to flash.

test wireless redundancy packetdump [start [filter port <0-65535>] | stop]

 

To troubleshoot HA issues, several additional commands provide more details about the HA process, including:

9800-1#show tech-support ha

9800-1#show tech-support wireless redundancy

9800-1#show romvar

9800-1#show redundancy history

9800-1#show logging process stack_mgr internal

9800-1#show logging process rif_mgr internal

9800-1#show redundancy trace main

 

Additional Resources

High Availability SSO Deployment Guide for Cisco Catalyst 9800 Series Wireless Controllers, Release Cisco IOS XE Bengaluru 17.6:

https://www.cisco.com/c/dam/en/us/td/docs/wireless/controller/9800/17-6/deployment-guide/c9800-ha-sso-deployment-guide-rel-17-6.pdf

Cisco 9800 RMI+RP High Availability Best Practice Configuration, How I WI-FI blog:

https://howiwifi.com/2021/01/17/cisco-9800-rmirp-high-availability-best-practice-configuration

Cisco Catalyst 9800-CL - Redundancy HA SSO (CLI and Deeper Dive), WiFi Ninjas blog 013:

https://wifininjas.net/2019/08/26/wn-blog-013-cisco-c9800-cl-wlc-redundancy-ha-sso-cli-and-deeper-dive

Understanding and Troubleshooting Cisco Catalyst 9800 Series Wireless Controllers, S. Arena, F.S. Crippa, N. Darchis, S. Katgeri, Cisco Press:

https://www.ciscopress.com/store/understanding-and-troubleshooting-cisco-catalyst-9800-9780137492411

Configure High Availability SSO on Catalyst 9800 | Quick Start Guide:

https://www.cisco.com/c/en/us/support/docs/wireless/catalyst-9800-series-wireless-controllers/220277-configure-high-availability-sso-on-catal.html

Configure Catalyst 9800 Wireless Controllers in High Availability (HA) Client Stateful Switch Over (SSO) in IOS-XE 16.12:

https://www.cisco.com/c/en/us/support/docs/wireless/catalyst-9800-series-wireless-controllers/213915-configure-catalyst-9800-wireless-control.html
