OpenVOS Blog

Showing archives for category Programming

Sage advice from a graybeard!

3.12.2015 | Advice, Programming

I came across an article on networkworld.com, by Peter Wayner, titled “7 Timeless Lessons of Programming ‘graybeards’”. Since I’m something of a graybeard myself, I read it. Peter discusses several insights that young software engineers might not know, but would do well to learn. Even we old-timers need a refresher now and then. I think that Peter’s comments are worth sharing. I hope you enjoy his article.

http://www.networkworld.com/article/2894245/careers/7-timeless-lessons-of-programming-graybeards.html

Clone Wars

1.26.2013 | Programming

In this post I wish to discuss a common coding error that can result in a module running out of, or at least seriously depleting, the available number of stcp device clones for use in creating TCP sockets.


The Importance of -1

10.29.2012 | Availability, Programming

Don’t worry, despite the title I haven’t reverted to math nerd mode. The topic is not number theory but socket programming and a coding mistake that I have seen way too often.

The code fragment in figure 1 demonstrates the mistake. There is a main while loop that loops forever calling select and waiting for characters to be available to be received. Once select indicates that characters are available the code goes through another loop until it has received 10 characters. After the recv function is called the code correctly checks for a 0 return which indicates that the remote peer has closed the socket. It then checks if errno is 0 and if it is it concatenates the characters just received into an application buffer and adds the number of characters just received to the total received for the current message. Finally, it checks errno for the value EWOULDBLOCK and if it is something else it exits the program with an error. At this point the inner loop completes and if the number of characters in the message is less than 10 it calls recv again.

 while (1)
   {
   FD_ZERO (&fdsetREAD);
   FD_ZERO (&fdsetNULL);
   FD_SET (sockAccepted, &fdsetREAD);
   iNumFDS = sockAccepted + 1;

/* wait for the start of a message to arrive */
   iSelected = select (iNumFDS,
                      &fdsetREAD, &fdsetNULL, &fdsetNULL, &timevalTimeout);
   if (iSelected < 0) /* Error from select, report and abort */
      {
      perror ("minus1: error from select");
      exit (errno);
      }
/* select indicates something to be read. Since there is only 1 socket
   there is no need to figure out which socket is ready. Note that if
   select returns 0 it just means that it timed out, we will just go around
   the loop again.*/
   else if (iSelected > 0)
        {
        szAppBuffer [0] = 0x00;       /* "zero out" the application buffer */
        iTotalCharsRecv = 0;          /* zero out the total characters count */
        while (iTotalCharsRecv < 10)  /* loop until all 10 characters read */
           {                          /* now read from socket */
           iNumCharsRecv = recv (sockAccepted, szRecvBuffer,
                                   10 - iTotalCharsRecv, 0);
           if (iDebugFlag)            /* debug output show */
              {                       /* value returned from recv and errno */
              printf ("%d  %d     ", iNumCharsRecv, errno);
              if (iNumCharsRecv > 0)  /* also received characters if any */
                 {
                 szRecvBuffer [iNumCharsRecv] = 0x00;
                 printf ("[%s]\n", szRecvBuffer);
                 }
              else printf ("\n");
              }
           if (iNumCharsRecv == 0)   /* If 0 characters received exit app */
              {
              printf ("minus1: socket closed\n");
              exit (0);
              }
           else if (errno == 0)      /* if "no error" accumulate received */
              {                      /* chars into an application buffer */
              szRecvBuffer [iNumCharsRecv] = 0x00;
              strcat (szAppBuffer, szRecvBuffer);
              iTotalCharsRecv = iTotalCharsRecv + iNumCharsRecv;
              szRecvBuffer [0] = 0x00;
              }
           else if (errno != EWOULDBLOCK) /* Ignore an EWOULDBLOCK error */
              {                           /* anything else report and abort */
              perror ("minus1: Error from recv");
              exit (errno);
              }
           if (iDebugFlag) sleep (1); /* this prevents the output from */
           }                          /* scrolling off the window */

        sprintf (szOut, "Message [%s] processed\n", szAppBuffer);
        if (iDebugFlag) printf ("%s\n", szOut);
        if (send (sockAccepted, szOut , strlen (szOut), 0) < 0)
           perror ("minus1: error from send");
        }
   }
Figure 1 – fragment of incorrect code

Figure 2 shows an example session. The lines containing “Message [” and “] processed” are the processed message returned by the server; all other lines are characters sent to the server. The characters sent include a terminating new line character and are sent in 1 TCP segment. Everything works when exactly 10 characters are sent. But when only 6 characters are sent in a TCP segment the server stops responding.

123456789
Message [123456789
] processed
abcdefghi
Message [abcdefghi
] processed
12345abcd
Message [12345abcd
] processed
12345
789
abcdefghi
123456789
Figure 2 – client session

Figure 3 shows the server session with debug turned on. You can see that after the “12345<new line>” characters are received the next recv returns -1 and sets the errno to 5011, which is EWOULDBLOCK. The code then loops and the next recv returns the characters “789<new line>” but the errno value is still set to 5011. In fact every recv after that regardless of whether there are characters received or not has errno set to 5011.

Connection accepted
10  0     [123456789
]
Message [123456789
] processed

10  0     [abcdefghi
]
Message [abcdefghi
] processed

10  0     [12345abcd
]
Message [12345abcd
] processed

6  0     [12345
]
-1  5011
4  5011     [789
]
-1  5011 
4  5011     [abcd]
4  5011     [efgh]
4  5011     [i
12]
4  5011     [3456]
4  5011     [789
]
-1  5011
-1  5011
-1  5011
Figure 3 – server debug output

Because the errno value is not 0 the received characters are not concatenated into the application buffer so the code loops forever.

This is not a bug in the socket code. The socket API explicitly states that the value of errno is undefined unless the function returns a value of -1. Here, undefined means that a successful call does not reset errno, so errno retains whatever value it previously had.

Now you might be thinking that no one would break a 10 character message up into 2 pieces and you might be correct; but imagine that instead of 10 characters the message length is 100 or 1000 characters. Also remember that TCP is a stream of bytes, not messages; a TCP stack may split an application message up into multiple TCP segments whenever it wants. Certain conditions make this more likely: longer application messages, sending another application message before a previous one has been transmitted, and lost TCP segments are the ones that come readily to mind. Under the right conditions it is possible, even likely, that this server code would pass all its acceptance tests and run fine in a production environment, at least for a while.

The good news is that there is a very simple fix: instead of testing for errno == 0, just test for a return value greater than 0; see the “iNumCharsRecv > 0” test in figure 4. Note also that the comment for the “errno != EWOULDBLOCK” test now points out that the only way to reach that if statement is if recv returned a negative value. The only negative value it returns is -1.

 while (1)
   {
   FD_ZERO (&fdsetREAD);
   FD_ZERO (&fdsetNULL);
   FD_SET (sockAccepted, &fdsetREAD);
   iNumFDS = sockAccepted + 1;

/* wait for the start of a message to arrive */
   iSelected = select (iNumFDS,
                      &fdsetREAD, &fdsetNULL, &fdsetNULL, &timevalTimeout);
   if (iSelected < 0) /* Error from select, report and abort */
      {
      perror ("minus1: error from select");
      exit (errno);
      }
/* select indicates something to be read. Since there is only 1 socket
   there is no need to figure out which socket is ready. Note that if
   select returns 0 it just means that it timed out, we will just go around
   the loop again.*/
   else if (iSelected > 0)
        {
        szAppBuffer [0] = 0x00;       /* "zero out" the application buffer */
        iTotalCharsRecv = 0;          /* zero out the total characters count */
        while (iTotalCharsRecv < 10)  /* loop until all 10 characters read */
           {                          /* now read from socket */
           iNumCharsRecv = recv (sockAccepted, szRecvBuffer,
                                   10 - iTotalCharsRecv, 0);
           if (iDebugFlag)            /* debug output show */
              {                       /* value returned from recv and errno */
              printf ("%d  %d     ", iNumCharsRecv, errno);
              if (iNumCharsRecv > 0)  /* also received characters if any */
                 {
                 szRecvBuffer [iNumCharsRecv] = 0x00;
                 printf ("[%s]\n", szRecvBuffer);
                 }
              else printf ("\n");
              }
           if (iNumCharsRecv == 0)   /* If 0 characters received exit app */
              {
              printf ("minus1: socket closed\n");
              exit (0);
              }
           else if (iNumCharsRecv > 0) /* if no error accumulate received */
              {                        /* chars into an application buffer */
              szRecvBuffer [iNumCharsRecv] = 0x00;
              strcat (szAppBuffer, szRecvBuffer);
              iTotalCharsRecv = iTotalCharsRecv + iNumCharsRecv;
              szRecvBuffer [0] = 0x00;
              }
           else if (errno != EWOULDBLOCK) /* if we get here iNumCharsRecv */
              {                           /* must be -1 so errno is defined. */
                                          /* Ignore an EWOULDBLOCK error, */
              perror ("minus1: Error from recv"); /* report anything else */
              exit (errno);               /* and abort */
              }
           if (iDebugFlag) sleep (1); /* this prevents the output from */
           }                          /* scrolling off the window */

        sprintf (szOut, "Message [%s] processed\n", szAppBuffer);
        if (iDebugFlag) printf ("%s\n", szOut);
        if (send (sockAccepted, szOut , strlen (szOut), 0) < 0)
           perror ("minus1: error from send");
        }
   }
Figure 4 – corrected code fragment
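
The rule that errno is only meaningful after a return value of -1 can also be packaged as a small helper. The following is a minimal sketch, not part of the original program: the name recv_exact is hypothetical, it assumes a blocking stream socket (so EWOULDBLOCK is not expected), and it retries on EINTR as an extra assumption.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper, not part of the original program: read exactly
   `want` bytes from a blocking stream socket. Returns the number of bytes
   read (less than `want` only if the peer closed the connection), or -1
   on error with errno set by recv. */
ssize_t recv_exact (int sock, char *buf, size_t want)
{
    size_t total = 0;

    while (total < want)
    {
        ssize_t n = recv (sock, buf + total, want - total, 0);

        if (n > 0)                /* data arrived: errno is not consulted */
            total += (size_t) n;
        else if (n == 0)          /* peer closed the connection */
            break;
        else if (errno == EINTR)  /* n is -1, so errno is meaningful here */
            continue;             /* interrupted by a signal: retry */
        else
            return -1;            /* any other error is left to the caller */
    }
    return (ssize_t) total;
}
```

Because the decision is driven by the return value, a stale errno left over from an earlier call cannot stop the received characters from being accumulated, no matter how the message is split across segments.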

 

Securing the VOS Telnet Daemon

5.25.2012 | Availability, Customer Focus, Programming, Uptime

This talk will explain under what conditions it is safe to stop running the telnetd server and describe several approaches to prevent unwanted access to the module, via telnet, when it is not safe to just stop the telnetd server. It will also discuss a very common disabling approach that does not work as expected and has the side effect of breaking the ftScalable disk array diagnostic tool.

Download the Securing VOS Telnet PDF Files

Do you support left-handed widgets? Searching for solutions to technical problems

1.19.2012 | Customer Focus, Programming, Service

I build products for a living. Often when I visit a Stratus customer, I’m asked a question of the form “Do you support left-handed widgets?” When I was a young, inexperienced engineer, I’d think for a moment, and then tell the customer yes or no (generally, “no”, since most customers get pretty good at figuring out what features are available in a product). Sometimes the customer would engage me in a discussion of why we didn’t have it, or when we’d make it available, but often, the conversation would end rather unsatisfactorily for both of us. I’d feel bad that I could not help our customer, and they were disappointed that we lacked some important piece of technology. Sometimes, my inability to say “yes” would cost us future business.


A (very) simple log server for VOS

11.17.2011 | Availability, Programming, security

The OpenVOS architecture makes use of several independent systems: the NIO for X.25 communication, the fiber channel disk array controllers, UPS for power, the RSN Internet Console Server for RSN over IP, and the maintenance network Ethernet switches that allow all these systems to communicate over a private Ethernet/IP network. These systems are monitored by various OpenVOS processes to make sure that they are running correctly. But some events, like user logins, are not monitored. The RSN console and the network switches have the option to send a message to a logging server whenever someone logs in or tries to log in. Since the network switches are only connected to the Stratus module, it makes sense to have a logging server run on the Stratus module. To that end I have created a very simple logging server. It simply writes each message that it receives, along with a date-time stamp and the IP address of the host that sent the message, to standard output. By running the server as a started process the messages can be saved to the process’s out file.

Examples from the network switches
It is possible that someone who was authorized was just having problems typing the root password when logging into the network switch as root, then again maybe they are just really good at guessing passwords.


2011-10-02 11:55:39 : 10.10.1.75 : >%AAA-W-REJECT: New telnet connection for use
+r root, source 10.10.1.1 destination 10.10.1.75  REJECTED

2011-10-02 11:56:03 : 10.10.1.75 : >%AAA-W-REJECT: New telnet connection for use
+r root, source 10.10.1.1 destination 10.10.1.75  REJECTED

2011-10-02 11:56:08 : 10.10.1.75 : >%AAA-I-CONNECT: User CLI session for user ro
+ot over telnet , source 10.10.1.1 destination  10.10.1.75 ACCEPTED

 
 
Here is someone guessing system administrator user IDs


2011-10-02 12:03:13 : 10.10.1.75 : >%AAA-W-REJECT: New telnet connection for use
+r admin, source 10.10.1.1 destination 10.10.1.75  REJECTED

2011-10-02 12:03:30 : 10.10.1.75 : >%AAA-W-REJECT: New telnet connection for use
+r sysadmin, source 10.10.1.1 destination 10.10.1.75  REJECTED

2011-10-02 12:04:39 : 10.10.1.75 : >%AAA-W-REJECT: New telnet connection for use
+r Administrator, source 10.10.1.1 destination 10.10.1.75  REJECTED

 
 
Besides user logins the network switches will report when the configuration has been changed


2011-10-02 15:16:29 : 10.10.1.75 : >%COPY-I-FILECPY: Files Copy - source URL run
+ning-config destination URL flash://startup-config

2011-10-02 15:16:43 : 10.10.1.75 : >%COPY-N-TRAP: The copy operation was complet
+ed successfully

 
 
And it will also report link up and down messages which can be very useful when troubleshooting communication problems.


2011-10-02 15:49:17 : 10.10.1.75 : >%LINK-W-Down:  2/e24

2011-10-02 15:49:20 : 10.10.1.75 : >%LINK-I-Up:  2/e24

 
 

  Examples from the RSN Internet Console Server
Someone from the module connected to the RSN console and logged in as root, after typing the password incorrectly twice.


2011-10-02 12:11:03 : 10.10.1.200 : in.telnetd[2942]: connect from 10.10.1.1 (10
+.10.1.1)
2011-10-02 12:11:03 : 10.10.1.200 : telnetd[2942]: doit: getaddrinfo: Temporary
+failure in name resolution
2011-10-02 12:11:07 : 10.10.1.200 : login[2943]: invalid password for `root' on
+`ttyp0' from `10.10.1.1'
2011-10-02 12:11:15 : 10.10.1.200 : login[2943]: invalid password for `root' on
+`ttyp0' from `10.10.1.1'
2011-10-02 12:11:35 : 10.10.1.200 : login[2945]: root login  on `ttyp0' from `10
+.10.1.1'

 
 
Note that the RSN console will report a user ID of UNKNOWN if an attempt is made with an invalid user ID.


2011-10-02 12:12:31 : 10.10.1.200 : in.telnetd[2946]: connect from 10.10.1.1 (10
+.10.1.1)
2011-10-02 12:12:32 : 10.10.1.200 : telnetd[2946]: doit: getaddrinfo: Temporary
+failure in name resolution
2011-10-02 12:12:37 : 10.10.1.200 : login[2947]: invalid password for `UNKNOWN'
+on `ttyp0' from `10.10.1.1'
2011-10-02 12:12:45 : 10.10.1.200 : login[2947]: invalid password for `UNKNOWN'
+on `ttyp0' from `10.10.1.1'
2011-10-02 12:12:54 : 10.10.1.200 : login[2947]: invalid password for `UNKNOWN'
+on `ttyp0' from `10.10.1.1'

 
 
The RSN console will not report when the valid user ID rsn_admin is used to log in. However, you will still see the connection. The rsn_admin user ID does not have access to change any of the system configuration files. If the rsn_admin user tries to change to root with the su command it will be logged.


2011-10-02 12:15:37 : 10.10.1.200 : in.telnetd[2957]: connect from 10.10.1.1 (10
+.10.1.1)
2011-10-02 12:15:38 : 10.10.1.200 : telnetd[2957]: doit: getaddrinfo: Temporary
+failure in name resolution
2011-10-02 12:15:54 : 10.10.1.200 : su[2959]: + ttyp0 rsn_admin-root
2011-10-02 12:15:54 : 10.10.1.200 : PAM_unix[2959]: (su) session opened for user
+ root by rsn_admin(uid=500)

 
 
As will attempts that fail.


2011-10-02 12:19:50 : 10.10.1.200 : PAM_unix[2972]: authentication failure; rsn_
+admin(uid=500) -> root for su service
2011-10-02 12:19:52 : 10.10.1.200 : su[2972]: pam_authenticate: Authentication f
+ailure
2011-10-02 12:19:52 : 10.10.1.200 : su[2972]: - ttyp0 rsn_admin-root

 

To configure the network switch to send log messages to the OpenVOS module you need to log into the switch as root, execute the logging command and then save the new configuration:

telnet 10.10.1.75                                       
Trying...
Connected to 10.10.1.75.
Escape character is '^]'.

User Name:root
Password:******

console# config
console(config)# logging 10.10.1.1
console(config)#

console# copy running-config startup-config

 

To configure the RSN console to send log messages to the OpenVOS module you need to log into the console as root and start the syslogd process with the command “syslogd -R 10.10.1.1:514”. To make sure that the syslogd process is started after a reboot the /etc/rc.d/rc.local file must be changed.

telnet 10.10.1.200
Trying...
Connected to 10.10.1.200.
Escape character is '^]'.

Moxa Embedded Linux, Professional Edition
Linux/armv5teb 2.4.18_mvl30-ixdp425

azvos login: root
Password:
Welcome to

    ___  _____  __        _______    _____                   __
   / _ / __/ |/ / ____  /  _/ _   / ___/__  ___  ___ ___  / /__
  / , _/ /    / /___/ _/ // ___/ / /__/ _ / _ (_-</ _ / / -_)
 /_/|_/___/_/|_/       /___/_/     ___/___/_//_/___/___/_/__/ 

 Authorized Users Only!

root@azvos:~# syslogd -R 10.10.1.1:514
root@azvos:~# 
root@azvos:~# 
root@azvos:~# 
root@azvos:~# cd /etc/rc.d
root@azvos:/etc/rc.d# cp rc.local rc.local.bak
root@azvos:/etc/rc.d# echo syslogd -R 10.10.1.1:514 >> rc.local
root@azvos:/etc/rc.d# tail rc.local
fi
/etc/init.d/ssh start
/etc/init.d/apache stop
/etc/init.d/portmap stop
rm -f /rsn/call.log
/rsn/callhome &
lcmmessage -c -m "   Welcome to   " -l
lcmmessage -m " RSN-IP Console " -l
cat /etc/motd
syslogd -R 10.10.1.1:514
root@azvos:/etc/rc.d#

 

Assuming you get the same output as shown above you can delete the rc.local.bak file with “rm rc.local.bak”.

Once logging has been set up on the devices you need to run the logd program on the Stratus module. I suggest starting the program with the following command macro. The log file will be named logd.(date).(time).out. If for some reason a file with that name already exists, it is renamed to logd.(date).(time).old.out, and any existing file with the .old.out suffix is deleted. Given that the time stamp is to the second, such a collision is unlikely. The out file has implicit locking set so the file may be read at any time. Note that the out file will grow forever, so some maintenance on your part will be needed, or you can modify the program to make it smarter about handling the output.



& start_logd.cm begins here
&
& Version 1.00 11-11-02
& noah.davids@stratus.com
&
& This script creates a log file, sets implicit locking and starts the logd
& process. The process will not normally terminate and the log file has the
& potential to grow very large.
&
&
& This software is provided on an "AS IS" basis, WITHOUT ANY WARRANTY OR
& ANY SUPPORT OF ANY KIND. The AUTHOR SPECIFICALLY DISCLAIMS ANY IMPLIED
& WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE.
& This disclaimer applies, despite any verbal representations of any
& kind provided by the author or anyone else.
&
&set DT (date).(time)
&if (exists logd.&DT&.out)
&then !rename logd.&DT&.out logd.&DT&.old.out -delete
!create_file logd.&DT&.out
!set_implicit_locking logd.&DT&.out
start_process logd -output_path logd.&DT&.out -privileged -process_name logd
&
& start_logd.cm ends here

 

And finally, here is the program.



/* logd.c starts here

   Version 1.00 11-11-02
   noah.davids@stratus.com

   This software is provided on an "AS IS" basis, WITHOUT ANY WARRANTY OR
   ANY SUPPORT OF ANY KIND. The AUTHOR SPECIFICALLY DISCLAIMS ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE.
   This disclaimer applies, despite any verbal representations of any
   kind provided by the author or anyone else.
*/

#define _POSIX_SOURCE

#include <sys/select.h>
#include <prototypes/inet_proto.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <c_utilities.h>
#include <errno.h>
#include <time.h>

#define BUFFERLEN 10000
#define bzero(s, len)             memset((char *)(s), 0, len)

int errno;

getCurrentDateTime (char * szDT)
{
time_t tTime;
struct tm *tmLT;

tTime = time ((time_t *) 0);
tmLT = localtime (&tTime);

sprintf (szDT, "%4d-%02d-%02d %02d:%02d:%02d",
          tmLT -> tm_year+1900,
          tmLT -> tm_mon+1,     /* tm_mon counts months from 0 */
          tmLT -> tm_mday,
          tmLT -> tm_hour,
          tmLT -> tm_min,
          tmLT -> tm_sec);
}

main (argc, argv)
int    argc;
char   *argv [];

{
 struct sockaddr_in serv_addr;
 struct sockaddr_in cli_addr;
 int clilen = sizeof (cli_addr);

 int zeroCount = 0;
 int socks0;
 int recvBytes;

 char szSender [16];
 char  szDateTime [32];
 char szMessage [BUFFERLEN];
 short portNumber;

 if (argc == 1)  /* no arguments - use the default of 514 */
    {
    portNumber = 514;
    }
 else
 if (argc == 2) /* one argument, must be the expected port number */
    {
    portNumber = atoi (argv [1]);
    if (portNumber == 0)
       {
       printf ("\n\n%s argument was not the expected port number", argv [1]);
       printf ("\nUsage: logd [port number, default = 514]\n");
       exit (-1);
       }
    }
 else /* more than one argument gets a usage message */
    {
    printf ("\n\nToo many arguments");
    printf ("\nUsage: logd [port number, default = 514]\n");
    exit (-1);
    }

/* Letting you know what argument values will actually be used */

 printf ("logd %d\n\n", portNumber);

 if ((socks0 = socket (AF_INET, SOCK_DGRAM, 0)) < 0)
    {
    perror ("logd: can't create dgram socket");
    exit (errno);
    }

/* build a sockaddr structure holding the address we will bind to. The IP
   address is INADDR_ANY meaning we will listen on all active IP addresses */

 bzero ( (char *) &serv_addr, sizeof (serv_addr));
 serv_addr.sin_family        = AF_INET;
 serv_addr.sin_addr.s_addr   = htonl (INADDR_ANY);
 serv_addr.sin_port          = htons (portNumber);

/* now bind to the address and port */

 if (bind (socks0, (struct sockaddr *) &serv_addr, sizeof (serv_addr)) < 0)
    {
    perror ("logd: can't bind local address");
    exit (errno);
    }

/* main loop just does a recv, blocking until something is available to read.
   Assuming we receive at least 1 byte we get the current date-time,
   convert the senders IP address to a printable string and print the
   date-time, address and message starting at the 5th character position.
   The first four characters of a syslog message are <NN> where NN is a
   severity and facility code. These can be used for message filtering. Since
   this program doesn't do any filtering I just skip them. */
 while (1)
   {
    recvBytes=recvfrom(socks0,szMessage, BUFFERLEN - 1, 0, /* leave room for */
          (struct sockaddr *) &cli_addr, &clilen);          /* the final 0 */
    if (recvBytes > 0)
       {
       getCurrentDateTime ((char *) &szDateTime);
       strcpy (szSender, inet_ntoa ((struct in_addr) cli_addr.sin_addr));
       szMessage [recvBytes] = 0;
       printf ("%s : %s : %s\n", szDateTime, szSender, &szMessage[4]);
       zeroCount = 0;
       }
    else
    if (recvBytes < 0) /* in the event of an error report it and exit */
       {
       getCurrentDateTime ((char *) &szDateTime);
       printf ("%s : Error %d returned - exiting\n", szDateTime, errno);
       exit (errno);
       }
    else  /* I can't think of any reason we would be getting null messages */
       {  /* but if we get a stream of them we would silently loop. This */
       zeroCount++;          /* forces out a message if we get 100 null */
       if (zeroCount > 99)   /* messages in a row */
          {
          getCurrentDateTime ((char *) &szDateTime);
          strcpy (szSender, inet_ntoa ((struct in_addr) cli_addr.sin_addr));
          printf ("%s : %s %s\n", szDateTime,
               "We have received 100 null messages, the last one from",
               szSender);
          zeroCount = 0;
          }
       }
   }
}

/* logd.c ends here */
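
The program skips a fixed four characters at the start of each message, but the syslog priority field can hold one to three digits. The stray “>” at the start of the network switch messages in the examples above is consistent with a three-digit priority. The following is a minimal sketch of how the field could be parsed instead; the function name parse_syslog_pri is hypothetical and is not part of logd.c. It relies only on the standard syslog encoding, where the priority value is the facility times 8 plus the severity.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: if msg starts with a syslog priority field "<N>",
   "<NN>" or "<NNN>", store the facility and severity and return a pointer
   to the text that follows. Otherwise store -1 in both and return msg. */
const char *parse_syslog_pri (const char *msg, int *facility, int *severity)
{
    int pri = 0;
    size_t i = 1;

    *facility = -1;
    *severity = -1;

    if (msg [0] != '<')
        return msg;

    while (i <= 3 && isdigit ((unsigned char) msg [i]))
    {
        pri = pri * 10 + (msg [i] - '0');
        i++;
    }
    if (i == 1 || msg [i] != '>')   /* no digits, or no closing '>' */
        return msg;

    *facility = pri / 8;            /* priority = facility * 8 + severity */
    *severity = pri % 8;
    return msg + i + 1;             /* first character after the '>' */
}
```

In logd.c the returned pointer could be printed in place of &szMessage[4], and the severity could be used for the message filtering that the program currently leaves out.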

 

Technical Webinar on Porting Open-Source code to OpenVOS

9.29.2011 | Availability, Programming, Uptime

Did you miss the webinar? Well, we have the video right here for you!

Watch an overview of the steps that you can follow to port open-source code to the Stratus OpenVOS operating system. I review the OpenVOS POSIX environment, explain how to find and download open-source packages, discuss the process for building and installing open-source code, and review some of the common issues that often arise. Attendees should be familiar with VOS or OpenVOS and have a basic knowledge of the UNIX/Linux/POSIX programming environment.

Do you have questions? Ask anything right here in the comments and I will get back to you.

Porting Open Source Software to OpenVOS Technical Webinar

9.19.2011 | Programming, Service, Uptime | By: Paul Green

Details: Paul Green will give an overview of the steps that you can follow to port open-source code to the Stratus OpenVOS operating system. He will review the OpenVOS POSIX environment, explain how to find and download open-source packages, discuss the process for building and installing open-source code, and review some of the common issues that often arise. Attendees will be encouraged to ask questions. Attendees should be familiar with VOS or OpenVOS and have a basic knowledge of the UNIX/Linux/POSIX programming environment.

Date: Wednesday, September 21, 2011

Time: 10:00AM ET

Speaker: Paul Green is a Senior Technical Consultant in the VOS Group at Stratus Technologies. Paul has over 30 years of experience at Stratus and is one of the original software engineers that created VOS. Paul earned a BSEE from MIT in Electrical Engineering with a focus on Software Engineering.

 


 

Message Queue Anomalies

8.19.2011 | Availability, Programming, Service, Uptime

The CAC is frequently asked to look into problems with VOS message queues. Here are a couple of interesting ones, along with some solutions and recommendations that I’d like to share with you.

Problem 1: Recently, a customer came to us with a problem. A requester was unable to add a message to a message queue, receiving the error code e$max_file_exceeded. Strangely, the queue was empty, as shown by the list_messages command.

Upon examination of the queue, it was seen that the count of disk blocks used for this queue was approaching the maximum file size for a non-extent file.

name:   %s1#d01>AppData>queue.MQ

file organization:   message queue file
last used at:        11-08-16 14:45:13 edt
last modified at:    11-08-16 14:45:13 edt
last saved at:       10-06-14 21:34:18 edt
time created:        10-06-09 11:03:15 edt
transaction file:    yes
log protected:       no
safety switch:       no
audit:               no
dynamic extents:     no
extent size:         1
last message:        51689380
blocks used:         520201

Why was this queue both full and empty?

At some point in the past, the servers responsible for draining the messages out of the queue were off-line. This resulted in a very large backlog of messages. These messages were eventually handled by the servers and deleted from the queue. When messages are deleted from a queue, a key is added to the _record_index index of the queue, and the key value indicates the number of bytes of the deleted message(s).  When a new message is added to a message queue, the file system will attempt to find a previously deleted message of the exact size of the new message. If one is not available, the new message is written to the virgin space at end of the queue.

In this case, there was not enough virgin space in the queue to contain the new message, and there was no pre-existing deleted message of the correct size.

The moral of this story is that it is a good idea to limit the number of unique message lengths in any given queue. Rather than have each message use the exact number of bytes it needs, round the value up to some standard size. By using this technique, you increase the chance that a new message can reuse the space from a previously deleted message.
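
As a minimal sketch of that rounding, assuming the application pads each message to the rounded length before adding it to the queue: the function name round_up_length and any particular bucket size (for example 256 bytes) are illustrative choices, not values taken from VOS.

```c
#include <stddef.h>

/* Hypothetical helper: round a message length up to the next multiple of
   `bucket` so that queue messages come in only a few distinct sizes.
   A bucket of 0 leaves the length unchanged. */
size_t round_up_length (size_t length, size_t bucket)
{
    if (bucket == 0)
        return length;
    return ((length + bucket - 1) / bucket) * bucket;
}
```

With a 256-byte bucket, messages of 1 to 256 bytes all occupy 256 bytes and messages of 257 to 512 bytes all occupy 512, so a new message is far more likely to match the size of a previously deleted one. The trade-off is some wasted space in each message.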

Problem 2: Recently, another situation arose regarding the performance of message queues. A customer stated that emptying a message queue of 400,000+ messages was taking an inordinate amount of time.

They had recently had a problem with their server processes being unable to process messages in a message queue in a timely manner. Fortunately, the requesters had been kept running so that no data was lost. When the server problem was resolved, it was many hours before they had caught up with the backlogged requests and could then start processing recent transactions. The customer was asking why this occurred, and how can it be either prevented or sped up in future situations.

When a message is deleted in a message queue, a key is added to the system-maintained _record_index, where the value of the key is the message length. If the message being deleted is the same length as a previously deleted message, the unused data position is saved as a duplicate entry on the key containing that message size at the end of the list of duplicate values. Thus, if there are hundreds of thousands of deleted messages, all the same size (or the set of lengths of deleted messages is small), the list of duplicate keys is very long and the time to delete a single message goes up linearly.

Conversely, when a message is added to the queue and a _record_index key for the message length exists, the space occupied by the newest deleted record is reused to contain the data for the new message. This value must then be deleted from the key value containing the message length. Thus, the time to add a message goes up linearly; the more deleted messages, the longer it takes to add a new message.
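
The linear growth can be illustrated with a toy model. This is only a sketch of the behavior described above, not the real _record_index layout: it assumes a single list of freed positions for one message length, appends each newly freed position at the end, and reuses the newest one. All names are hypothetical.

```c
#include <stdlib.h>

/* Toy model only: one list of freed positions for a single message
   length. Nothing here reflects the real _record_index layout. */
struct free_slot
{
    long position;
    struct free_slot *next;
};

/* Append a freed position at the end of the list. Returns the number of
   entries walked to find the end, or -1 if memory could not be
   allocated. */
long append_free_slot (struct free_slot **head, long position)
{
    long walked = 0;
    struct free_slot *slot = malloc (sizeof *slot);

    if (slot == NULL)
        return -1;
    slot->position = position;
    slot->next = NULL;

    while (*head != NULL)          /* walk to the end of the list */
    {
        head = &(*head)->next;
        walked++;
    }
    *head = slot;
    return walked;
}

/* Reuse the newest freed position (the last entry). Stores it in
   *position and returns the number of entries walked, or -1 if the list
   is empty. */
long take_newest_slot (struct free_slot **head, long *position)
{
    long walked = 0;

    if (*head == NULL)
        return -1;
    while ((*head)->next != NULL)  /* walk to the last entry */
    {
        head = &(*head)->next;
        walked++;
    }
    *position = (*head)->position;
    free (*head);
    *head = NULL;                  /* unlink the entry just reused */
    return walked;
}
```

In this model both operations walk the whole list, so with hundreds of thousands of deleted messages of the same length each delete or add pays for every entry already present.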

The moral of this story is that system maintained data in message queues have memory; the queues remember the locations and sizes of all previous messages. This information persists even after the queue is emptied. Try to avoid allowing your message queues to grow to a huge size (tens or hundreds of thousands of disk blocks). Otherwise, you will find that the cost of adding and deleting messages to a queue can grow over time.

The solutions to both of these situations are the same.

Solution A: A message queue can be truncated while it is open. The routine s$truncate_queue can be used to accomplish this. However, there are 4 conditions that must be satisfied:

1: there must be no requesters holding the message queue open

2: the message queue must be drained of all messages

3: this routine must be called by a server

4: the queue cannot be a transaction file

If the first 3 conditions are not met, s$truncate_queue will return e$no_truncate_queue. If the last condition is not met, s$truncate_queue will return e$invalid_io_operation.
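The four conditions and the two error codes can be written down as a small checklist. This is a sketch only: it does not call s$truncate_queue, the queue_state structure and its field names are hypothetical, and the order in which the checks are made is my assumption since nothing above says which error wins when several conditions fail.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the queue's state; these fields are
   illustrative and are not part of any VOS interface. */
struct queue_state {
    bool requesters_have_it_open;   /* condition 1 */
    long messages_pending;          /* condition 2 */
    bool caller_is_server;          /* condition 3 */
    bool is_transaction_file;       /* condition 4 */
};

enum truncate_result {
    TRUNCATE_OK,
    TRUNCATE_NO_TRUNCATE_QUEUE,     /* stands in for e$no_truncate_queue */
    TRUNCATE_INVALID_IO_OPERATION   /* stands in for e$invalid_io_operation */
};

/* Predict what a truncate attempt would return for the given state. */
enum truncate_result can_truncate(const struct queue_state *q)
{
    /* Assumption: the transaction-file check is reported first. */
    if (q->is_transaction_file)
        return TRUNCATE_INVALID_IO_OPERATION;
    if (q->requesters_have_it_open ||
        q->messages_pending > 0 ||
        !q->caller_is_server)
        return TRUNCATE_NO_TRUNCATE_QUEUE;
    return TRUNCATE_OK;
}
```

A server could use a check like this to decide whether it is worth attempting the truncate at all, for example during a quiet period after the requesters have been stopped.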

Solution B: If the application design allows having multiple servers, you can periodically rename the existing message queue, create a new message queue with the correct name, start a new set of servers, and bounce the requesters. When the new servers start up, they will start processing on a new, but empty, message queue. When the requesters start up, they will add their requests to the new and empty message queue. The original set of servers can remain running, processing the backlog of requests until the old queue is empty. Then the old set of servers can be stopped, and the old message queue can be deleted.

In addition, a solution to problem 1 may be to use an extent-based message queue. That would allow additional messages to be placed in the queue, as the maximum file size would be larger by a factor of the extent size. However, with extent-based message queues, performance will be even worse than normal if the message queue ever contains a large number of messages at any one point in time.

As mentioned earlier, limiting the number of unique message lengths in any given queue will improve the probability that a new message can reuse the space released by a previously-deleted message. This will help solve both problems mentioned in this post.

Sharing the Load – Multiple Processes Listening on the Same Port Number

8.1.2011Availability, ProgrammingBy: Under STCP, when you have multiple processes listening on the same port number, only the first process that bound to the port number is notified when a connection is requested. One common way to slip around this restriction is to close the listening socket when a connection is completed and then create a new socket and bind and listen again. This places the listening socket at the end of the chain of listening sockets.
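A minimal sketch of that recycle procedure, using only the standard socket calls (socket, setsockopt, bind, listen, close), might look like the following. The function names open_listener and recycle_listener are mine, and the header names assume a POSIX-style environment.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a TCP socket bound to `port` on all interfaces and start
   listening. Returns the descriptor, or -1 with errno set. */
int open_listener(unsigned short port, int backlog)
{
    struct sockaddr_in addr;
    int on = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on) < 0 ||
        bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(fd, backlog) < 0) {
        int saved = errno;      /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/* Recycle step: drop the old listening socket and listen again on the
   same port, which puts this process at the end of the chain. */
int recycle_listener(int old_fd, unsigned short port, int backlog)
{
    close(old_fd);
    return open_listener(port, backlog);
}
```

The idea is that each process calls recycle_listener right after accept returns a connection, so the next connection request goes to a different process.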

Starting in release 17.1, this procedure will no longer work without changing the default STCP parameter values.

Prior to 17.1, STCP misinterpreted the intent of the socket option SO_REUSEADDR. The intent of this option was to allow a process that listened on port X to restart immediately if it was terminated and the system still had sockets bound to port X in a TIME_WAIT state. However, the prior releases of STCP would allow a process to listen on port X regardless of the state of existing sockets bound to port X, assuming of course that the SO_REUSEADDR socket option was set. This behavior allows two different applications to listen on the same port number. If the first application uses the listening port recycle procedure, or its process is terminated and restarted, the second application will effectively hijack the port.

Starting in release 17.1, the bind function will return the error EADDRINUSE unless all sockets bound to the requested port are in a TIME_WAIT state, or the second process has the same session ID as the first process. All processes started from the same parent process will have the same session ID, so a process can fork multiple child processes and all those children can listen on port X.
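An application that used to rely on the old behavior should now be prepared for bind to fail. The sketch below shows one way to separate the "port is taken" case from other failures; the enum and the function name bind_with_reuseaddr are my own, and the calls are the standard socket ones.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

enum bind_outcome { BIND_OK, BIND_PORT_IN_USE, BIND_OTHER_ERROR };

/* Set SO_REUSEADDR on `fd`, try to bind it to `port` on all interfaces,
   and classify the result. errno is left as set by the failing call. */
enum bind_outcome bind_with_reuseaddr(int fd, unsigned short port)
{
    struct sockaddr_in addr;
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on) < 0)
        return BIND_OTHER_ERROR;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) == 0)
        return BIND_OK;
    /* EADDRINUSE: another socket already owns the port */
    return errno == EADDRINUSE ? BIND_PORT_IN_USE : BIND_OTHER_ERROR;
}
```

On BIND_PORT_IN_USE the process can log the conflict and exit or retry later, rather than assuming that setting SO_REUSEADDR guarantees the bind will succeed.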

Two STCP parameters were added in release 17.1 that can be used to change STCP’s behavior back to the pre-17.1 behavior: tcp_reuseaddr_action and udp_reuseaddr_action. Each parameter has the values “safe” (default) and “unsafe”. The term unsafe is used because the setting will allow the port hijacking described above.

You can use the analyze_system request list_stcp_params to show the current setting of these two parameters (figure 1) and the request set_stcp_param to change the setting (figure 2). As with all STCP parameters, changes are only effective until the system is rebooted, so to make the change permanent a command setting the parameter to unsafe must be placed in the module_start.cm file or in the start_stcp.cm file.

analyze_system -request_line 'list_stcp_params tcp_reuseaddr_action' -quit
OpenVOS Release 17.1.0ab, analyze_system Release 17.1.0ab
Current process is 481, ptep 89B6B740, Noah_Davids.CAC

TCP SO_REUSEADDR action [safe/unsafe]       (tcp_reuseaddr_action)  safe

ready  17:10:29

analyze_system -request_line 'list_stcp_params udp_reuseaddr_action' -quit
OpenVOS Release 17.1.0ab, analyze_system Release 17.1.0ab
Current process is 481, ptep 89B6B740, Noah_Davids.CAC

UDP SO_REUSEADDR action [safe/unsafe]       (udp_reuseaddr_action)  safe

ready  17:11:16
Figure 1 – display of the two reuseaddr action parameters

analyze_system -request_line 'set_stcp_param tcp_reuseaddr_action unsafe' -quit
OpenVOS Release 17.1.0ab, analyze_system Release 17.1.0ab
Current process is 481, ptep 89B6B740, Noah_Davids.CAC

Changing TCP SO_REUSEADDR action (tcp_reuseaddr_action)
from safe to unsafe
ready  07:30:38
Figure 2 – command to change tcp_reuseaddr_action to unsafe
