When working an issue, I often ask people to send me a network trace. A network trace frequently points to the root cause of the problem. At a minimum, it reduces the problem space to something manageable. For example, when a network printer started printing garbage, we spent several days looking at TTPs; when we finally got a network trace, we discovered that we were sending perfectly good text, and the problem was at the printer. In a different case, an OSL performance problem between modules several thousand miles apart appeared to be a network issue, but a trace clearly showed that the problem was on the server module. I would still be working on both problems if it weren’t for the network trace.

How do you get a network trace? There are three possibilities.

packet_monitor

The VOS packet_monitor command lets you monitor, with some limitations, everything that the module sends to and receives from the network. See http://community.stratus.com/blog/openvos/getting-most-out-packetmonitor and http://stratadoc.stratus.com/vos/17.0.1/r419-09/wwhelp/wwhimpl/js/html/wwhelp.htm?context=r419-09&file=ch9r419-09i.html for more information.

The packet_monitor command has several limitations. You cannot be 100% certain that a frame reported as sent was actually sent; for example, errors on the adapter may prevent the frame from leaving. In addition, frames received with errors, such as a CRC error, are not passed upstream, so packet_monitor never sees them. Because packet_monitor does not place the adapter in promiscuous mode, only frames addressed to the adapter or to the broadcast address are passed upstream to it. A link can be 90% busy while packet_monitor reports just 1 frame a second, because that is the only frame addressed to the adapter or the broadcast address.
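To make the promiscuous-mode limitation concrete, here is a minimal Python sketch of the filtering rule described above. It is only an illustration, not how packet_monitor or any adapter is actually implemented; the MAC addresses are made up, and the function name is hypothetical. It also ignores multicast addresses for simplicity.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"

def visible_to_host_monitor(dest_mac: str, adapter_mac: str) -> bool:
    """Return True if a non-promiscuous adapter would pass the frame upstream.

    Simplified rule taken from the text: only frames addressed to the
    adapter itself or to the broadcast address are passed up.
    """
    dest = dest_mac.lower()
    return dest == adapter_mac.lower() or dest == BROADCAST

# Hypothetical adapter address and destination addresses seen on the wire.
adapter = "02:00:00:00:00:01"
frames_on_wire = [
    "02:00:00:00:00:02",  # someone else's traffic: invisible to the host
    "02:00:00:00:00:01",  # addressed to our adapter: visible
    "ff:ff:ff:ff:ff:ff",  # broadcast: visible
    "02:00:00:00:00:03",  # someone else's traffic: invisible to the host
]

seen = [d for d in frames_on_wire if visible_to_host_monitor(d, adapter)]
print(f"{len(seen)} of {len(frames_on_wire)} frames would reach the monitor")
```

A monitor in promiscuous mode, or an external tap, would see all four frames.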

In general, host-based monitors, such as packet_monitor, are only useful as an addition to other forms of tracing, or when no other form of network tracing is available.

Port Mirroring aka SPAN (Switched Port Analyzer) port

A mirror or span port is a port on a switch that replicates all traffic seen on one or more ports (or even on a VLAN) of the switch. The mirror port must be connected to a network monitoring appliance. This appliance can be a special-purpose device or a PC running Linux or Microsoft Windows with a host-based monitor such as Wireshark (http://www.wireshark.org/) or tcpdump. Mirror ports are easy to set up; you need to type only a few commands at the switch.

Unfortunately, there are several major disadvantages to using a mirror port. First, the port must be correctly configured. Incorrectly configured ports may lead to missed or duplicated frames in the network trace. Second, switches will not replicate a frame with any type of error, so these frames will not be traced. Third, a busy switch may drop frames instead of replicating them to the mirror port. Fourth, a mirror port that is accepting frames from an entire VLAN, or from multiple switch ports, or even from just one full duplex port, may become overloaded and drop some frames. Fifth, errors introduced between the switch and the host cannot be seen by a network monitoring appliance connected to a completely different switch port. By the same token, errors introduced between the mirror port and the network monitoring appliance will give the appliance a distorted view of what the host actually receives from the switch.

Network Taps

Taps are passive devices that connect between the switch and the host; they literally tap into the network connection. Like a mirror port, they must be connected to a network monitoring appliance, but they have fewer disadvantages than a mirror port.

First, there is typically no configuration; you just plug it in and it works. Second, the more advanced taps rely on power only to replicate frames to the monitoring port, and they have dual power supplies to ensure that the replication is reliable. If their power fails, these taps continue to forward frames between the network ports; only replication stops. Third, a tap has just one function, which is to forward frames between the network ports and replicate them to the network monitoring appliance, so it is much less likely than a switch to be overwhelmed by a high volume of traffic. In addition, aggregating taps have buffer space so that they can forward high volumes of traffic on a full duplex link without dropping frames. Of course, a sustained high rate may still overwhelm the buffer. Aggregating taps also let you combine multiple inputs. For example, a two-port device will let you monitor both the active and standby adapters of a duplex pair of adapters. This ensures continuous monitoring even if there is an adapter failover. Finally, by connecting the tap at the host adapter you have the best possible assurance that the network monitoring appliance will see all frames leaving the host adapter and all frames arriving from the switch at the host adapter.
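The point about a sustained high rate overwhelming the buffer is simple arithmetic: a full duplex link can carry traffic in both directions at once, but an aggregating tap has to squeeze both directions onto one monitor port. The following Python sketch estimates how long a buffer lasts under that condition. The 256 MB buffer size and the function name are assumptions chosen for illustration, not the specification of any particular tap.

```python
def seconds_until_buffer_full(tx_bps, rx_bps, monitor_bps, buffer_bytes):
    """Estimate how long an aggregation buffer lasts under sustained load.

    tx_bps and rx_bps are the sustained rates in each direction of the
    monitored full duplex link, monitor_bps is the speed of the single
    monitor port. Returns None if the monitor port can keep up.
    """
    excess_bps = tx_bps + rx_bps - monitor_bps
    if excess_bps <= 0:
        return None  # the buffer never fills
    return buffer_bytes * 8 / excess_bps

GBPS = 1_000_000_000
BUFFER_BYTES = 256_000_000  # hypothetical 256 MB buffer

# Both directions at 50%: the combined 1 Gbps just fits on the monitor port.
print(seconds_until_buffer_full(500_000_000, 500_000_000, GBPS, BUFFER_BYTES))

# Both directions at 75%: 0.5 Gbps of excess fills the buffer in about 4 seconds.
print(seconds_until_buffer_full(750_000_000, 750_000_000, GBPS, BUFFER_BYTES))
```

The same arithmetic explains why a mirror port that replicates even one busy full duplex port can drop frames.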

One disadvantage that many taps share with a mirror port is that they will drop damaged frames. Since many network monitoring appliances, especially those that are just PCs with off-the-shelf Ethernet hardware, will also drop damaged frames, manufacturers of taps do not consider this a critical fault.

For more comments on taps versus switch ports, take a look at http://www.lovemytool.com/blog/2007/08/span-ports-or-t.html or http://taosecurity.blogspot.com/2009/01/why-network-taps.html or type “network taps and span ports” into your favorite search engine.

Monitoring Challenges

The first challenge is when to monitor and for how long. Ideally, critical network links in a production system should be monitored continuously. Capturing a problem at the first occurrence, and having a network trace in hand, is much faster than encountering a problem, setting monitoring up, and either trying to duplicate it or waiting for it to happen again. Trace files can be large; a 50% load on a gigabit link produces approximately 62.5 megabytes per second or 3.75 gigabytes per minute. Trace files do not have to be kept for longer than your worst-case response time. If you can respond to a reported problem in an hour, then you only need to keep an hour of trace data. The more trace data that you save, the more leeway you have to respond, or to recognize that there was a problem that needs to be investigated.  Large disks are fairly inexpensive, at least compared to the cost of a disruption, so consider purchasing one or more terabyte-sized disk drives to hold the trace data.
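The sizing arithmetic above is easy to redo for your own link speed, load, and disk size. The Python sketch below is one way to do it; the function names are mine, and it assumes that every byte on the wire ends up on disk, ignoring the small per-packet overhead that capture file formats add.

```python
def trace_bytes_per_second(link_bps, utilization):
    """Bytes of trace data produced per second at the given link load."""
    return link_bps * utilization / 8

def retention_minutes(disk_bytes, link_bps, utilization):
    """Minutes of trace data that fit on a disk at the given link load."""
    return disk_bytes / (trace_bytes_per_second(link_bps, utilization) * 60)

GIGABIT = 1_000_000_000          # link speed in bits per second
TERABYTE = 1_000_000_000_000     # disk size in bytes

rate = trace_bytes_per_second(GIGABIT, 0.5)
print(f"{rate / 1e6:.1f} MB per second")        # 62.5 MB per second
print(f"{rate * 60 / 1e9:.2f} GB per minute")   # 3.75 GB per minute
print(f"{retention_minutes(TERABYTE, GIGABIT, 0.5):.0f} minutes per terabyte")
```

At a 50% load on a gigabit link, a terabyte of disk holds a little under four and a half hours of trace data, so size the disk to match your worst-case response time.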

Maintaining this level of monitoring may be difficult when using a span port. In a complex network the administrators are pulled in many directions, and a span port dedicated to continuous monitoring, just in case something goes wrong, is hard to keep in place.

On the other hand, a network tap installed next to the host is dedicated to that host. You can purchase a sophisticated network monitoring appliance, or you can get started with a basic PC that has a 1 terabyte hard drive, runs Linux or Windows, and uses the tshark program (the command-line version of Wireshark). This setup will give you about 266 minutes of trace data (assuming a data rate of 500 Mbps) and is perfectly adequate for most purposes. You can purchase a 1 terabyte drive for less than $100 if you shop around.
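For continuous capture you want tshark to overwrite its oldest data rather than fill the disk and stop. Its ring buffer options (-b filesize and -b files) do that. The Python sketch below only builds and prints such a command line; the interface name, output path, and file counts are assumptions for illustration, and you should check the tshark manual page for your version before relying on the exact units of filesize.

```python
import shlex

def tshark_ring_buffer_command(interface, out_file, file_size_kb, file_count):
    """Build a tshark command that captures into a fixed-size ring of files.

    -b filesize:N  switch to the next file after roughly N kB
    -b files:N     keep at most N files, then overwrite the oldest
    """
    return [
        "tshark",
        "-i", interface,
        "-b", f"filesize:{file_size_kb}",
        "-b", f"files:{file_count}",
        "-w", out_file,
    ]

# Hypothetical values: 900 files of about 1 GB each, which leaves some
# headroom on a 1 terabyte drive.
cmd = tshark_ring_buffer_command("eth0", "/data/trace.pcapng", 1_000_000, 900)
print(shlex.join(cmd))
```

Keeping individual files to about a gigabyte also makes it practical to open just the file that covers the time of a reported problem.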

The health of the underlying network is vital to a continuously available application. Make the effort to capture an accurate trace of network activity on a routine basis. When problems do crop up, you will be able to resolve them quickly without waiting for them to recur. As a bonus, you can also analyze the trace data and learn things about your network that were hidden before.