Traceroute can be an invaluable tool when trying to diagnose connection problems to hosts on other networks. However, to use it effectively you have to understand how it works and what the output means.
Traceroute works by manipulating the Time-To-Live (TTL) value in the IP header (figure 1). Since I had to make my examples using the Stratus internal network, I have X’ed out the hexadecimal representation of the IP address in the packet traces and translated all the octets in any dotted decimal notation to a unique letter code, A through W. The number of letters represents the number of digits in the octet; that is, AAA represents a 3-digit number while J represents a 1-digit number.
Figure 1 – packet_monitor frame with TTL highlighted
Every router that processes the packet decrements the TTL value by 1. If the resulting value is greater than 0, the router passes the packet to the next router or the final destination. If the value is 0, the router discards the packet and *may* send an ICMP time exceeded message back to the sender. The key word is *may*: some routers will not send a time exceeded message, or will do so only if they are not busy. In addition, some firewalls block the ICMP message, so even if the router sends the time exceeded message, the host running traceroute may never see it.
Traceroute starts by sending a message to the target destination with a TTL value of 1. It times how long it takes from the time that it sends the message until it gets a response.
Figure 2 – packet_monitor traceroute frame and response
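To make the TTL manipulation concrete, here is a minimal sketch of how a program on a typical sockets-based host could ask the network stack to send UDP packets with a specific TTL. This is an illustration only, not the STCP traceroute implementation; the function name make_probe_socket is hypothetical, and nothing is actually sent.

```python
import socket

def make_probe_socket(ttl: int) -> socket.socket:
    """Return a UDP socket whose outgoing packets carry the given TTL.

    Hypothetical helper for illustration; a real traceroute would also
    send the probe and listen for the ICMP reply.
    """
    # The TTL field in the IP header is 8 bits wide, so 255 is the maximum.
    if not 1 <= ttl <= 255:
        raise ValueError("TTL must be between 1 and 255")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # IP_TTL sets the Time-To-Live value placed in each outgoing IP header.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
    return sock

if __name__ == "__main__":
    probe = make_probe_socket(1)  # first traceroute probe uses a TTL of 1
    print(probe.getsockopt(socket.IPPROTO_IP, socket.IP_TTL))
    probe.close()
```

A packet sent on this socket would be discarded by the first router it reaches, which is exactly what traceroute relies on for its first hop.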
By default the traceroute command sends three messages with a TTL of 1 and reports all three times along with the name and/or IP address of the router that sent the reply. I always use the -numeric argument so I don’t have to wait for name resolution (figure 3). Traceroute then increments the TTL by 1 and repeats the process. It continues incrementing the TTL until it gets a response from the target destination or reaches a hop limit (30 by default). You can find documentation on all the traceroute arguments in the OpenVOS STREAMS TCP/IP Administrator’s Guide (R419).
Figure 3 – traceroute output
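The TTL loop described above can be sketched as a small simulation that needs no network access. It assumes an idealized path in which every router always answers, and it sends one probe per TTL instead of three to stay short. The names forward and trace and the router labels are made up for the example.

```python
from typing import List, Tuple

MAX_HOPS = 30  # default hop limit mentioned in the text

def forward(path: List[str], ttl: int) -> Tuple[str, str]:
    """Send one probe along `path` (routers first, target last).

    Returns (responder, kind), where kind is "time_exceeded" when a
    router discards the probe and "reached_target" otherwise.
    """
    for router in path[:-1]:
        ttl -= 1                  # each router decrements the TTL by 1
        if ttl == 0:              # TTL exhausted: discard and report back
            return router, "time_exceeded"
    return path[-1], "reached_target"

def trace(path: List[str], max_hops: int = MAX_HOPS) -> List[Tuple[int, str]]:
    """Raise the TTL one step at a time until the target answers."""
    hops = []
    for ttl in range(1, max_hops + 1):
        responder, kind = forward(path, ttl)
        hops.append((ttl, responder))
        if kind == "reached_target":
            break
    return hops

if __name__ == "__main__":
    for ttl, responder in trace(["router-1", "router-2", "target"]):
        print(ttl, responder)
```

Each pass through the loop in trace corresponds to one numbered line of traceroute output: the responder is whichever device the probe had reached when its TTL ran out.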
You would expect the times to increase as the TTL increases, and this is generally the case, but not always. Compare the third reported times of hop 6 (79 ms) and hop 5 (80 ms) in figure 3. There are several reasons for this. First, the network is not deterministic; sometimes it just takes longer. Second, the router at hop N-1 may be busier than the router at hop N and take longer to get around to sending the ICMP message. Finally, the return path from router N may be faster than the return path from router N-1. For example, there is no requirement that the 4th router send its response back to the source via routers 3, 2 and 1. It could send its response directly to router 1, which may be significantly faster. This means that while traceroute is very good at reporting the route that a packet takes from the sending host to the target host, you cannot rely on a packet sent from the target host to the sending host taking the same path in reverse. This is known as asymmetrical routing.
Traceroute may append a flag to a time to indicate that it received something other than the expected time exceeded message. The flags are:
!H – host unreachable
!N – network unreachable
!P – protocol unreachable
For example, figure 4 shows that the first router does not know how to reach the target network and so it returns a network unreachable message and traceroute terminates at that point.
Figure 4 – network unreachable messages
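If you collect traceroute output in a script, the flags listed above can be decoded with a simple lookup. The sketch below is illustrative: the sample line format and the name find_flags are assumptions, and the table contains only the flags named in this article.

```python
import re
from typing import List, Tuple

# Only the flags described in the article; other flags are reported as unknown.
FLAG_MEANINGS = {
    "!H": "host unreachable",
    "!N": "network unreachable",
    "!P": "protocol unreachable",
}

def find_flags(line: str) -> List[Tuple[str, str]]:
    """Return (flag, meaning) for every '!X' style flag found in a line."""
    return [(flag, FLAG_MEANINGS.get(flag, "unknown flag"))
            for flag in re.findall(r"![A-Z]", line)]

if __name__ == "__main__":
    # Hypothetical output line; 192.0.2.1 is a documentation-only address.
    sample = "1 192.0.2.1 2 ms !N 2 ms !N 3 ms !N"
    for flag, meaning in find_flags(sample):
        print(flag, meaning)
```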
An asterisk (*) in place of a time indicates that traceroute did not receive an answer. As I stated above, the router may not have sent an ICMP time exceeded message, a firewall may have blocked the time exceeded message on its way back, or a firewall may have blocked the outgoing message. The network may also simply have dropped either the outgoing or the returning message.
A single timeout, or even two, on a hop (figure 5) probably indicates either that the router did not send the message because it was busy or that the network was dropping packets (or both). Notice the 8th hop in figure 5: the asterisk before the IP address means that the first message in the set timed out.
Figure 5 – busy router or dropped packets
If all three messages time out but subsequent hops report times (figure 6), it is probable that the router for that hop just doesn’t send ICMP time exceeded messages.
Figure 6 – unresponsive router(s)
If all the hops past a certain point time out (figure 7), it is probable that there is a firewall blocking either the outgoing or returning messages.
Figure 7 – firewall blocking
There is one other reason why everything past a certain point may time out: the target does not respond. Not all targets will respond, and when that happens you get something that looks like figure 8. The difference between figures 7 and 8 is that in figure 8 the last router to send a response is the router that is local to the target. How can you tell that this is the case? Sometimes the addressing makes it obvious, for example when the target is 192.168.1.12 and 192.168.1.1 is the last router to respond. It is not quite as obvious in figure 8. EEE.FFF.W.XXX may be the last router before EEE.FFF.GGG.HHH, since they have the first 16 bits in common, but without knowing the subnetting scheme used by the target’s network there is no way to be sure without asking.
Figure 8 – target not responding
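The rules of thumb above for reading asterisks can be summarized in a short sketch. It assumes each hop has already been parsed into a list of probe times, with None standing in for an asterisk. The function name diagnose and the wording of the labels are my own. The labels are only the probable causes discussed above, not a definitive diagnosis.

```python
from typing import List, Optional

def diagnose(hops: List[List[Optional[float]]]) -> List[str]:
    """Label each hop; a probe time of None represents an asterisk."""
    # Index of the last hop that produced at least one reply (-1 if none did).
    last_reply = max(
        (i for i, probes in enumerate(hops)
         if any(p is not None for p in probes)),
        default=-1,
    )
    notes = []
    for i, probes in enumerate(hops):
        lost = sum(p is None for p in probes)
        if lost == 0:
            notes.append("ok")
        elif lost < len(probes):
            # Some probes answered: busy router, dropped packets, or both.
            notes.append("busy router or dropped packets")
        elif i < last_reply:
            # Silent hop, but later hops answered.
            notes.append("router probably does not send time exceeded")
        else:
            # Nothing answers from here on.
            notes.append("firewall or unresponsive target")
    return notes

if __name__ == "__main__":
    sample = [
        [1.0, 1.2, 1.1],
        [None, 5.0, 5.1],
        [None, None, None],
        [9.0, 9.1, 9.2],
        [None, None, None],
        [None, None, None],
    ]
    for hop, note in enumerate(diagnose(sample), start=1):
        print(hop, note)
```

Telling a firewall apart from an unresponsive target still requires knowing whether the last responding router is local to the target, which the script cannot determine from times alone.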
It is also possible to get responses from two or more routers for the same hop (figure 9). This can happen if the routers are load balancing, if the network is unstable and routes are changing, or, as in the case of figure 9, if the router AAA.BBB.CC.KKK was not the optimum router and, after forwarding the packet to AAA.BBB.II.J, sent a redirect message back to the source to change its routing table.
Figure 9 – two responses for the same hop
There are two types of traceroute commands, distinguished by the type of message that they send. Some traceroutes, like the one found on Microsoft Windows systems, send ICMP echo request (ping) messages; others, like the one that runs under STCP, send UDP messages. Knowing which type of traceroute you are using is important if you need to configure firewalls to let packets through or write protocol analyzer filters. The response of the target will also be different. If the target receives a ping request it will send back a ping reply; if it receives a UDP message, the response depends on the port number. If the port is in use, the listening application will more than likely discard the packet because it does not meet the application’s message structure requirements. If the port is not in use, the host *may* send an ICMP destination port unreachable message back. For that reason traceroute selects ports that are typically not used.
The port number seen by the target host depends on how many hops are required to reach it. Traceroute starts the destination port at 33435 (by default) and increments the port number for each message (figure 10). The target therefore sees 3 different ports, and chances are that not all three will be in use. The source port is based on the process ID of the sending process. A given process will always use the same source port.
Figure 10 – (edited) packet traces showing port number changes
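The port arithmetic can be worked out with a short sketch, which is handy when writing firewall rules or analyzer filters. It assumes the defaults described above (first port 33435, three messages per hop) and that every message, without exception, advances the port by one. The function names are hypothetical.

```python
from typing import List

BASE_PORT = 33435    # default first destination port, per the text
PROBES_PER_HOP = 3   # default number of messages per TTL

def probe_port(hop: int, probe: int,
               base: int = BASE_PORT, per_hop: int = PROBES_PER_HOP) -> int:
    """Destination port for a probe; hop is 1-based, probe is 0-based."""
    if hop < 1 or not 0 <= probe < per_hop:
        raise ValueError("hop must be >= 1 and probe in range(per_hop)")
    # Every earlier hop consumed per_hop consecutive port numbers.
    return base + (hop - 1) * per_hop + probe

def target_ports(hops_to_target: int) -> List[int]:
    """Ports the target sees if it is reached at the given hop."""
    return [probe_port(hops_to_target, p) for p in range(PROBES_PER_HOP)]

if __name__ == "__main__":
    print(target_ports(1))  # target directly behind the first router
    print(target_ports(8))  # target eight hops away
```

Under these assumptions, a target eight hops away would see destination ports 33456 through 33458, while the default limit of 30 hops puts the highest possible port at 33524.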
Finally, unless the problem is at the first hop, there is probably not a lot you are going to be able to do to fix it. Once STCP (or any host’s network stack) sends a packet to the local router, it is out of its (and your) hands. However, if you know the IP address of the last hop that responds, or the hop where timeouts suddenly start to occur, you have an idea of where the problem lies and can contact the correct group of network administrators to resolve it.