The impact of communication layer latency is typically underestimated when trying to fix application performance problems, but a correct understanding is critical if you are to direct your efforts toward practical solutions.
First, by communication layer latency I mean the time it takes a packet to get from the local system to the remote system and back again. The biggest factor in communication layer latency is distance – at least for geographically separated hosts. The other major component is the number of devices between the local and remote hosts that process the packets; this includes things like routers and firewalls. The bandwidth of the links plays a role, but not as large a one as most people think. Bandwidth affects the insertion delay – how long it takes to put a packet on the wire – but the signal on the wire travels at a speed based on the medium, not the bandwidth. For the rest of this discussion I’ll just use latency when I mean communication layer latency.
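To put rough numbers on the bandwidth point, here is a minimal sketch comparing insertion delay with propagation delay. The 1,500 byte packet, the 1,000 km link and the signal speed of roughly 200,000 km per second (about two thirds of the speed of light, a common rule of thumb for fiber and copper) are my assumptions for illustration, not measurements.

```python
# Assumed rule-of-thumb signal speed in fiber or copper (about 2/3 of c).
SIGNAL_SPEED_KM_PER_S = 200_000.0

def insertion_delay_ms(packet_bytes, bandwidth_bits_per_s):
    """Time to clock every bit of the packet onto the wire."""
    return packet_bytes * 8 / bandwidth_bits_per_s * 1000.0

def propagation_delay_ms(distance_km):
    """One-way time for the signal to travel the length of the link."""
    return distance_km / SIGNAL_SPEED_KM_PER_S * 1000.0

# Hypothetical 1,500 byte packet over a hypothetical 1,000 km link.
for name, bps in [("10 Mbps", 10e6), ("100 Mbps", 100e6), ("1 Gbps", 1e9)]:
    print(f"{name:>8}: insertion {insertion_delay_ms(1500, bps):.3f} ms, "
          f"propagation {propagation_delay_ms(1000):.1f} ms one way")
```

Under these assumptions, raising the bandwidth shrinks the insertion delay but leaves the propagation delay untouched, which is why more bandwidth alone rarely fixes a latency problem.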
To demonstrate how latency affects your application, let’s say you enter some data into a client application and hit return. The client application sends a message to a server application, waits for a response, then sends another message, waits for a response, and so on, for some number N of “turns”. At the end of N turns the client application presents the results back to you.
Assume that 10 turns are required and it takes the server application 1ms to process the client message and send back an answer. With a latency of 1ms you get a response time of 10 * (1ms + 1ms) or 20ms. Now get on a plane and travel to Chicago while the server stays in Boston; you now have a latency of, say, 50ms and a response time of 10 * (50ms + 1ms) or 510ms. This is enough to be noticeable but not painful. Increase the number of turns to 100 and you now have a painful 5.1 second response time. You may think 100 turns is excessive, but some complex database queries or applications that fill complex forms can do just that. Do you know how many turns your applications require?
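The arithmetic is simple enough to put in a few lines of Python so you can plug in your own numbers. This is only a sketch of the model described above (each turn costs one round trip plus the server processing time); the function name is mine.

```python
def response_time_ms(turns, latency_ms, server_ms=1.0):
    """Response time when every turn waits for one round trip plus server processing."""
    return turns * (latency_ms + server_ms)

# The three cases from the text: (turns, latency in ms).
for turns, latency in [(10, 1), (10, 50), (100, 50)]:
    total = response_time_ms(turns, latency)
    print(f"{turns:>3} turns at {latency:>2} ms latency: {total:.0f} ms")
```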
Doing a copy_file via OSL also exhibits this behavior. OSL will send a file in 4K (4,096 byte) transactions. Each transaction requires a response, so a 1,000,000 byte file will require 1,000,000 / 4,096 = 244.14, rounded up to 245 transactions, or turns to use the nomenclature from the previous paragraph. Again, assuming 1ms to process the transaction and a 1ms latency, the copy_file will take 490ms. If we increase the latency to 50ms it will take 12.495 seconds. If we increase the file to 1,000,000,000 bytes it will require 244,141 transactions, with corresponding times of 488.282 seconds for 1ms latency and 12,451.191 seconds, or almost 3.5 hours, for 50ms latency.
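The same calculation can be sketched in Python for any file size. The 4,096 byte transaction size and the 1ms processing time come from the text; the function name and defaults are just for illustration.

```python
def copy_time_seconds(file_bytes, latency_ms, chunk_bytes=4096, server_ms=1.0):
    """Return (transactions, seconds) for a copy that waits for a reply per chunk."""
    # Integer ceiling division: a partial final chunk still costs a full turn.
    transactions = (file_bytes + chunk_bytes - 1) // chunk_bytes
    seconds = transactions * (latency_ms + server_ms) / 1000.0
    return transactions, seconds

for size in (1_000_000, 1_000_000_000):
    for latency in (1, 50):
        count, seconds = copy_time_seconds(size, latency)
        print(f"{size:>13,} bytes, {latency:>2} ms latency: "
              f"{count:,} transactions, {seconds:,.3f} seconds")
```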
The simplest way to measure latency is with ping.
Unfortunately, it is becoming increasingly common for networks to block ping packets or for hosts to ignore them; you can even tell STCP to ignore them (starting in 15.3, 16.2 and 17x). If you cannot use ping, you can use packet_monitor to time how long it takes to get a response to a connection request. For example, start packet_monitor with the command
start_process 'packet_monitor -numeric -time_stamp -filter -host A.B.C.D -port NNN' -privileged
Then type the command
telnet A.B.C.D NNN
You should make several connections and take an average. Notice that I am connecting to an unused port on the remote host. This reduces the number of packets in the trace, but if a firewall is blocking the port you may need to use an active port.
Latency times:
58.887 - 58.805 = 0.082 seconds = 82ms
58.086 - 58.003 = 0.083 seconds = 83ms
56.724 - 56.643 = 0.081 seconds = 81ms
53.067 - 52.984 = 0.083 seconds = 83ms
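If you collect more than a handful of samples, a few lines of Python can do the subtraction and averaging. This sketch uses the four timestamp pairs above and assumes each request and its reply fall within the same minute; it does not handle a reply that rolls over into the next minute.

```python
# (request time, reply time) in seconds within the minute, taken from the trace.
samples = [(58.805, 58.887), (58.003, 58.086), (56.643, 56.724), (52.984, 53.067)]

def latencies_ms(pairs):
    """Convert each (request, reply) pair into a round-trip time in milliseconds."""
    return [round((reply - request) * 1000.0, 1) for request, reply in pairs]

def average(values):
    return sum(values) / len(values)

times = latencies_ms(samples)
print("samples (ms):", times)
print(f"average: {average(times):.2f} ms")
```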
You can also use a program I wrote that times the connections without needing to use packet_monitor. See http://members.cox.net/ndav1/self_published/stcp_tping.doc. The stcp_tping command does require that you connect to an active port on the remote host. In this case I am using port 23 (telnet) but any active port will work. The number 1 at the end of the command indicates that a request will be sent once a second.
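The idea of timing a connection is easy to try on any host that has Python. To be clear, the sketch below is not stcp_tping and does not reproduce its options or output; it is only my illustration of timing a TCP connection request, and the function names are made up. A refused connection (the reset from an unused port) still takes a full round trip, so it is counted as a valid sample. Pass a numeric address so that name lookup time is not included in the measurement.

```python
import socket
import sys
import time

def connect_time_ms(host, port, timeout=5.0):
    """Time one TCP connection request, in milliseconds."""
    start = time.perf_counter()
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        # The reset from an unused port still made the round trip.
        return (time.perf_counter() - start) * 1000.0
    elapsed = (time.perf_counter() - start) * 1000.0
    sock.close()
    return elapsed

def average_connect_time_ms(host, port, count=5, interval_s=1.0):
    """Average several connection times, pausing between attempts."""
    samples = []
    for attempt in range(count):
        samples.append(connect_time_ms(host, port))
        if attempt < count - 1:
            time.sleep(interval_s)
    return sum(samples) / len(samples)

# Runs only when an address and port are given on the command line.
if __name__ == "__main__" and len(sys.argv) == 3:
    address, port_number = sys.argv[1], int(sys.argv[2])
    print(f"average: {average_connect_time_ms(address, port_number):.1f} ms")
```

A timeout or an unreachable host raises an exception rather than producing a sample, which is what you want: a dropped packet is not a latency measurement.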
How can you estimate latency if you don’t already have the systems in place to measure it? The simplest way is to find a system in the same geographic area and measure latency to it. This will give you a very rough estimate. I like to use colleges and universities since I know where they are located and the chances are that they host their own systems on campus and, at least the larger universities, probably have high bandwidth Internet links. The web site http://www.utexas.edu/world/univ/state/ lists the web sites of many universities by state. The web site http://www.bulter.nl/universities/ lists the web sites of universities from all over the world by country. Keep in mind that there could be a significant difference between communicating over a corporate VPN and over the Internet.
How can you fix the problem? You probably can’t. You may have some control over the bandwidth of some of the links, and maybe over some of the network devices, but you certainly have no control over the distance. What you can do is change the application so that it is less sensitive to latency by reducing the number of turns required.
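To see how much reducing turns can help, here is a rough sketch that extends the earlier model. It assumes a hypothetical application that can pack several requests into one message, so the latency is paid once per turn while the server still spends 1ms on every request. Whether your application can be restructured this way is a separate question.

```python
def batched_response_time_ms(requests, per_turn, latency_ms, server_ms=1.0):
    """Response time when up to per_turn requests share a single round trip."""
    # Integer ceiling division: a partly filled final message is still a turn.
    turns = (requests + per_turn - 1) // per_turn
    return turns * latency_ms + requests * server_ms

# 100 requests at 50 ms latency with different amounts of batching.
for per_turn in (1, 10, 100):
    total = batched_response_time_ms(100, per_turn, 50)
    print(f"{per_turn:>3} requests per turn: {total:.0f} ms")
```

With one request per turn this gives the same answer as the simple model, and the gain from batching grows as the latency grows.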
If you are using OSL to move large files long distances, all I can say is don’t. Depending on the file type you may be able to use FTP, SFTP or SCP. If not, the Stratus ftp site has an application called tcp_save (ftp://ftp.stratus.com/vos/network/tcp_save.save.evf.gz) which allows you to effectively copy a file via TCP without using OSL. It requires some setup but can reduce the copy time of large files significantly.