The other day, during a discussion with some users, it became clear to me that they did not understand the conditions under which a Stratus VSeries system will call home regarding a network adapter. This misunderstanding resulted in the system losing its network connection. If they didn’t understand, I am sure there are others who don’t either, so this blog will attempt to explain when a system will call home regarding a network adapter.
The simple answer is that the system will call home when it detects that the network adapter has failed. The gray area is what “failed” means. If the adapter breaks or is broken by the system, a set of diagnostics is run, and if the adapter passes, it is brought back into service. If it breaks too many times in too short a time interval, the adapter will exceed the MTBF (Mean Time Between Failures) threshold and remain out of service. If that happens, the system will call home regarding the adapter. If the diagnostics run and fail, the adapter remains out of service and the system will call home.
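The "too many breaks in too short a time" rule can be pictured with a small sliding-window counter. This is only an illustration of the idea, not the calculation VOS actually performs: the class name, the limit of three breaks, and the one-hour window are all made up for the example.

```python
from collections import deque


class BreakTracker:
    """Toy model of one adapter's break history (hypothetical, not VOS code)."""

    def __init__(self, max_breaks=3, window_seconds=3600.0):
        # Both limits are invented for illustration; the real MTBF
        # threshold is set by the system, not by these numbers.
        self.max_breaks = max_breaks
        self.window_seconds = window_seconds
        self.break_times = deque()

    def record_break(self, now, diagnostics_passed):
        """Return (in_service, call_home) after a break at time `now` (seconds)."""
        if not diagnostics_passed:
            # Failed diagnostics: adapter stays out of service, system calls home.
            return (False, True)

        self.break_times.append(now)
        # Forget breaks that are older than the window.
        while self.break_times and now - self.break_times[0] > self.window_seconds:
            self.break_times.popleft()

        if len(self.break_times) > self.max_breaks:
            # Too many breaks too close together: threshold exceeded.
            return (False, True)

        # Diagnostics passed and the break rate is acceptable.
        return (True, False)
```

With these invented limits, three breaks in an hour leave the adapter in service, a fourth takes it out of service and triggers a call home, and breaks spaced far apart never trip the threshold.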
If the sdlmux test messages, which are passed between the active and standby network adapter partners, are not received by the partner adapter, the system will break the standby card, test it, and bring it back into service. If whatever blocked the test messages is not resolved, this pattern will repeat until the card exceeds the MTBF threshold, at which point the system will call home. Note that in this case the problem may lie not with the adapter but with the network.
If a link drops, so that one adapter is no longer connected to the network, the system will fail over the network adapters as needed and will do nothing else. It will not call home. The reason is that there are too many possible causes for a dropped link, almost all of them external to the adapter. Twenty-two years ago, in VOS release 8.0, the original Stratus Ethernet implementation did call home when a link was lost. It resulted in many issues being opened, which just annoyed people when we called them back, since they knew they had rebooted a switch or removed a cable. It was decided at that time that a lost link would not be considered a failure. Note that the sdlmux test messages mentioned in the previous paragraph are not sent unless both adapters have a link to a network.
So what happened to the users mentioned in the opening paragraph? About a month ago they lost the link to one adapter. They were unaware of this because the system had no connectivity issues. Then a few days ago they lost the other link and at that point they were off the network. This scenario demonstrates why it is important to monitor the adapters on your system to confirm that they are connected to the network. Periodic monitoring would have identified the problem with the first adapter’s link in time to correct it before the problem with the second adapter’s link occurred.
The command macro monitor_sdlmux_adapter_status (figure 1) will periodically monitor all sdlmux partnered adapters. When it finds a problem, it will send a 25th line message to selected users and add an entry to the syserr_log. The system itself makes a syserr_log entry when the link is first lost, but the monitor_sdlmux_adapter_status macro will add one every time it checks, making the problem more likely to be seen (figure 2). The macro should be run as a started process with -privileged set to yes, since it calls analyze_system to obtain the list of sdlmux devices.
The macro also checks whether one of the adapters is “DOWN” and will report it the same way that it reports a dropped link. The system should already have called home in that case, but it was easy to add the check, and I figured that the extra notification can’t hurt.
The 25th line and syserr_log messages refer you to the check_adapters file in the home directory of whoever is running the monitor_sdlmux_adapter_status macro. That file consists of the output from the dlmux_admin sdlmux_status command run against every sdlmux partnership on the system (figure 3). You will need to review that file to identify the specific adapter(s) with a problem.
The macro looks only at adapters that have been partnered with sdlmux. Issues with adapters that have not been partnered are immediately apparent so no extra monitoring is needed. Note that Stratus recommends that all network adapters be partnered with sdlmux.
& monitor_sdlmux_adapter_status begins here
Figure 1 – the monitor_sdlmux_adapter_status command macro
d >system>syserr_log.10-05-24 -match check_adapters
%phx_vos#m16_mas>system>syserr_log.10-05-24 10-05-25 08:27:29 mst
. . . . . . . . .
Figure 2 – syserr_log messages
d check_adapters
%phx_vos#m16_mas>SysAdmin>Noah_Davids>check_adapters 10-05-24 19:57:43 mst
**************************************************
--------------- 10-05-24.19:56:47 ----------------
**************************************************
**************************************************
Group Name: #sdlmuxA.m16.10-5-0.11-5-0
Device Name: %phx_vos#enetA.m16.11-5-0
Adapter State: ACTIVE UP
Partner: %phx_vos#enetA.m16.10-5-0
Partner State: UP (network connection lost)
**************************************************
Group Name: #sdlmuxA.m16.10-5-1.11-5-1
Device Name: %phx_vos#enetA.m16.10-5-1
Adapter State: ACTIVE UP
Partner: %phx_vos#enetA.m16.11-5-1
Partner State: UP
**************************************************
Group Name: #sdlmux.m16.11-2
Device Name: %phx_vos#enet.m16.11.11-2
Adapter State: ACTIVE UP
Partner: %phx_vos#enet.m16.10.11-2
Partner State: UP
**************************************************
Group Name: #sdlmux.m16.11-3
Device Name: %phx_vos#enet.m16.10.11-3
Adapter State: ACTIVE UP
Partner: %phx_vos#enet.m16.11.11-3
Partner State: UP
ready 19:57:43
Figure 3 – output of the check_adapters file showing #enetA.m16.10-5-0 has lost link
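If you copy the check_adapters file to another machine, a short script can pick out the adapters that need attention instead of reading the file by eye. The sketch below is not part of the macro. It assumes the field labels are exactly those shown in figure 3, and it assumes that a problem shows up as either the text "network connection lost" or the word "DOWN" in a state field; the precise wording for a down adapter is a guess on my part. The function and variable names are invented for the example.

```python
import re

# Field layout assumed from the sample output in figure 3.
RECORD = re.compile(
    r"Group Name:\s*(?P<group>\S+)\s+"
    r"Device Name:\s*(?P<device>\S+)\s+"
    r"Adapter State:\s*(?P<state>.*?)\s+"
    r"Partner:\s*(?P<partner>\S+)\s+"
    r"Partner State:[ \t]*(?P<partner_state>[^*\n]*)"
)

# Substrings treated as trouble; "DOWN" is an assumed spelling.
PROBLEM_MARKERS = ("network connection lost", "DOWN")


def find_problem_adapters(text):
    """Return (group, adapter, state) for every adapter that looks unhealthy."""
    problems = []
    for match in RECORD.finditer(text):
        adapters = [
            (match["device"], match["state"].strip()),
            (match["partner"], match["partner_state"].strip()),
        ]
        for name, state in adapters:
            if any(marker in state for marker in PROBLEM_MARKERS):
                problems.append((match["group"], name, state))
    return problems
```

Run against the text of figure 3, this reports only %phx_vos#enetA.m16.10-5-0 in group #sdlmuxA.m16.10-5-0.11-5-0, with the state "UP (network connection lost)". The pattern tolerates the fields of a group appearing either on one line or on separate lines.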