When Sockets Go Bad

Sometimes netstat will show a socket that appears to be stuck. The remote application has been terminated, and sometimes even the OpenVOS application has been terminated, but netstat still shows the socket. This article explains why this happens and what you can do about it.

A brief introduction to TCP states

If you google “tcp state diagram” you will find a plethora of images, some barely legible and others quite readable. Wikimedia has a very nice color-coded one (http://commons.wikimedia.org/wiki/File:TCP_state_diagram.png). The TCP RFC (793) (http://www.rfc-editor.org/rfc/rfc793.txt) has an ASCII art diagram and, of course, explains the states in detail.

Sockets can get stuck when they are waiting for either the local application or the remote host to do something. There are three states where this occurs. In the FIN_WAIT_2 state the socket is waiting for the remote host to close the connection. In the CLOSE_WAIT state it is waiting for the local application to close the socket. In the ESTABLISHED state it is waiting for the remote host to open up its send window, or for the local application to send something. If the local TCP stack has sent data and is waiting for the remote host to acknowledge it, the socket cannot get stuck; it will eventually time out, signal an error to the local application, and close the socket.
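The three cases above can be summarized as a small lookup table. This is only an illustrative sketch: the names STUCK_STATES and waiting_on are hypothetical and are not part of netstat or OpenVOS.

```python
# Hypothetical helper: maps each state that can get stuck to what the socket
# is waiting for, using the descriptions given in this article.
STUCK_STATES = {
    "FIN_WAIT_2": "remote host to close the connection",
    "CLOSE_WAIT": "local application to close the socket",
    "ESTABLISHED": "remote host to open its window, or local application to send",
}


def waiting_on(state):
    """Return what a socket in `state` is waiting for, or None if the state
    is not one of the three that can stay stuck indefinitely."""
    return STUCK_STATES.get(state.upper())


if __name__ == "__main__":
    for name in ("FIN_WAIT_2", "CLOSE_WAIT", "ESTABLISHED", "TIME_WAIT"):
        print(name, "->", waiting_on(name))
```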

FIN_WAIT_2 state

This is perhaps the most common stuck socket case that is called into the CAC. The local application has closed the socket and may have terminated. The typical reason for sockets to be stuck in this state is that the remote application is hung and not reading its socket.

Setting the STCP parameter finwait2 to some N > 0 will close sockets after they have been in the FIN_WAIT_2 state for N seconds. Starting in releases 14.7.2bg, 14.7.tl1, 15.2.1aa, 15.2.tel.af, 15.3.0bd, 15.3.tel.ag, 16.2.1al, 17.0.0ai, and 17.1 the default value is 1200. Before that the default was 0, so if you want the sockets to time out on an earlier release you will have to set the parameter yourself. You can see the current value with the list_stcp_params analyze_system request and change it with the set_stcp_param analyze_system request. See the OpenVOS System Analysis (R073) manual, available at http://stratadoc.stratus.com, for documentation on these requests.

CLOSE_WAIT state

This is the next most common stuck socket state. Sockets in a CLOSE_WAIT state are waiting for the local application to close the socket. The typical reason for a socket to remain in this state is that the application is no longer reading the socket. The easiest way to close the socket is to terminate the local application.

If terminating the local application is not an option, the only thing you can do is craft a packet with the RST (reset) flag set. This requires that you know the sequence numbers used by the socket (available using the dump_onetcb analyze_system request) and have a utility on another host on the local subnet that can build and send custom IP packets. These utilities are available for Windows and Linux systems.
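To make "a packet with the RST flag set" concrete, here is a minimal sketch that builds only the 20-byte TCP header of such a segment, following the header layout in RFC 793. It is not one of the utilities mentioned above. The function names are hypothetical, the addresses in the usage line are documentation placeholders, and the sequence number is assumed to be one the receiving stack will accept (which is what the dump_onetcb output is for). The sketch does not send anything; wrapping the header in an IP packet and transmitting it needs a raw-socket tool and suitable privileges.

```python
import socket
import struct

TCP_FLAG_RST = 0x04  # RST bit in the TCP flags field (RFC 793)


def internet_checksum(data):
    """One's-complement sum of 16-bit words, as used by TCP and IP."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def build_rst_segment(src_ip, src_port, dst_ip, dst_port, seq):
    """Return a 20-byte TCP header (no options, no data) with only RST set."""
    offset_flags = (5 << 12) | TCP_FLAG_RST  # data offset of 5 words + flags
    # Fields: src port, dst port, seq, ack, offset/flags, window, checksum, urgent
    header = struct.pack("!HHIIHHHH", src_port, dst_port, seq, 0,
                         offset_flags, 0, 0, 0)
    # The TCP checksum also covers a pseudo-header built from the IP addresses.
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip) +
              struct.pack("!BBH", 0, socket.IPPROTO_TCP, len(header)))
    checksum = internet_checksum(pseudo + header)
    return header[:16] + struct.pack("!H", checksum) + header[18:]


if __name__ == "__main__":
    # Placeholder addresses, ports, and sequence number for illustration only.
    segment = build_rst_segment("192.0.2.10", 4000, "192.0.2.20", 23, 123456)
    print(segment.hex())
```

The segment is addressed as if it came from the remote end of the connection, so the source address and port are the remote host's and the destination is the stuck local socket.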

ESTABLISHED state

Since most of the time netstat shows sockets in the ESTABLISHED state, how can you tell when a socket is stuck? If you have a report that the remote host has crashed, or that the network between the local and remote hosts has been down for more than 10 minutes, and the local application is waiting for the remote to send it something, you can be pretty sure that the socket is stuck. In this case the send queue value (the number to the immediate left of the local IP address) reported by netstat will be 0. On the other hand, if the send queue value is larger than 0 and stays that way, you may have the case of the remote host advertising a zero window.
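The send queue test can be sketched as a small decision function. This is an illustration only: the function name and return strings are hypothetical, and it assumes you have already collected the send queue value from several netstat runs taken some time apart. It does not parse netstat output itself.

```python
def classify_established(send_queue_samples):
    """Classify an ESTABLISHED socket from send queue values observed in
    repeated netstat runs (oldest first)."""
    samples = list(send_queue_samples)
    if not samples:
        raise ValueError("need at least one send queue sample")
    if all(value == 0 for value in samples):
        # Nothing queued: if the peer is known to be gone, this is the
        # first case (socket waiting on a remote host that will never send).
        return "send queue empty: possibly waiting on a dead peer"
    if all(value > 0 for value in samples):
        # Data queued and never fully drained: possibly a zero window.
        return "send queue stuck above 0: possible zero window"
    return "send queue changing: data appears to be flowing"


if __name__ == "__main__":
    print(classify_established([0, 0, 0]))
    print(classify_established([4096, 4096, 4096]))
    print(classify_established([4096, 0, 512]))
```

A result from this check is only a hint. It should be confirmed with the report of a crashed host or failed network in the first case, or with the window value in the second.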

In the first case, unless keep-alive is turned on, the socket will remain in the ESTABLISHED state until the local application is terminated. If keep-alive is turned on, then after the keep-alive timer expires STCP will send a keep-alive probe. If the probe goes unanswered it will be retransmitted, and after a few minutes and several retransmissions the connection will be terminated. By default keep-alive is set on the interface but not on individual sockets. To set keep-alive on a socket the application must use the setsockopt function call. See the OpenVOS STREAMS TCP/IP Programming (R420) manual, also available at http://stratadoc.stratus.com, for details. The default keep-alive time is 2 hours; that means that the first keep-alive probe is transmitted 2 hours after the last TCP segment is received from the remote host, so you must be patient. If you are not so patient, you can adjust the keep-alive time, the time between probes, and the number of probes with the set_stcp_param request within analyze_system. I do not recommend changing these parameters without a detailed analysis.
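As a rough illustration of the setsockopt call, the sketch below turns on SO_KEEPALIVE using Python's standard socket module, which wraps the same BSD-style call. An OpenVOS application would normally make the equivalent call from C as described in the R420 manual. The helper name enable_keepalive is hypothetical, and this has not been verified on OpenVOS.

```python
import socket


def enable_keepalive(sock):
    """Turn on SO_KEEPALIVE for `sock` and report whether the option took."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Read the option back; a nonzero value means keep-alive is enabled.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        print("keep-alive enabled:", enable_keepalive(s))
    finally:
        s.close()
```

The option only enables probing on that socket. The timing of the probes is still governed by the system-wide keep-alive settings discussed above.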

If the socket does not have keep-alive set, the only simple option is to terminate the application that owns the socket. If that is not an option, it is possible to close the socket by sending it a segment with the RST flag set.

To confirm the second case, check the value of sndws displayed by the dump_onetcb analyze_system request. A value of 0 indicates a closed window. You can also run packet_monitor to trace the connection and check the window value in the TCP header of segments from the remote host. A value of “n.a.” indicates 0.

I want to stress that this may be a recoverable condition. Applications sometimes get delayed and the TCP stack closes the window; when the application catches up, the stack opens the window again. However, in most cases I think it is safe to assume that if the application hasn’t recovered after a few minutes it is not going to. The exception might be something like a printer that is out of paper. A socket in this state can remain in this state even if the VOS application that created it terminates.

These sockets can be cleaned up by setting tcp_zerowin_abort_interval$ to some N > 0. Sockets will be cleaned up N seconds after the next TCP segment with a zero window is received. Window probes are sent every 100 seconds to confirm that the remote host’s receive window is still closed, so at worst a reply will be received within 100 seconds. The default value of tcp_zerowin_abort_interval$ is 0, and I suggest that it remain at 0 unless you need to clean up a socket. At that point I suggest setting it to a small value, say 10 seconds, and resetting it to 0 once the socket is cleaned up. I think this reduces the risk of clearing sockets that are recoverable.

To set this value you must use the set_longword request in analyze_system, so remember that N must be given in hex. For example, to set it to 10 the request would be:
set_longword tcp_zerowin_abort_interval$ a
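
Because it is easy to forget the hex conversion, here is a small sketch that formats the request text for a given number of seconds. The helper name zerowin_abort_request is hypothetical, and the request format is copied from the example above. It only builds the string; it does not talk to analyze_system.

```python
def zerowin_abort_request(seconds):
    """Return the set_longword request text for an interval given in decimal
    seconds, with the value converted to hex as set_longword expects."""
    if seconds < 0:
        raise ValueError("interval must be 0 or greater")
    return "set_longword tcp_zerowin_abort_interval$ %x" % seconds


if __name__ == "__main__":
    print(zerowin_abort_request(10))  # the 10-second example from the text
    print(zerowin_abort_request(0))   # reset to the default once cleaned up
```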
