Older blog entries for jum (starting at number 41)

The setitimer problem from yesterday can apparently be worked around by rounding the minimum interval up to at least 20 ms, that is, two clock ticks. We are still testing this.
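A minimal sketch of such rounding in C, assuming the 10 ms tick implied by "20 ms, two clock ticks". The names clamp_interval, set_real_timer and MIN_INTERVAL_USEC are made up for illustration and are not taken from our code:

```c
#include <stddef.h>
#include <sys/time.h>

/* Assumed minimum: 20 ms, i.e. two 10 ms clock ticks. */
#define MIN_INTERVAL_USEC 20000L

/* Hypothetical helper: round a requested interval up to the minimum.
 * A zero interval is left alone because setitimer treats it as "disarm". */
static struct timeval clamp_interval(struct timeval tv)
{
    if (tv.tv_sec == 0 && tv.tv_usec == 0)
        return tv;
    if (tv.tv_sec == 0 && tv.tv_usec < MIN_INTERVAL_USEC)
        tv.tv_usec = MIN_INTERVAL_USEC;
    return tv;
}

/* Hypothetical wrapper: arm ITIMER_REAL with clamped values. */
static int set_real_timer(struct timeval value, struct timeval interval)
{
    struct itimerval itv;

    itv.it_value = clamp_interval(value);
    itv.it_interval = clamp_interval(interval);
    return setitimer(ITIMER_REAL, &itv, NULL);
}
```

Whether 20 ms is the right floor on every machine is exactly what the testing has to show.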

Meanwhile, another strange problem that we had been looking at for a few days appears to be resolved. Our PCShare product, which includes SMB file sharing, has a complete NetBIOS name service. This name service did not start up properly if the machine was the sole computer in a workgroup. First findings pointed to packet length problems, as choosing an odd-length NetBIOS machine name got things going (not completely, but it did get further along). Thinking a bit more about that, I tried turning off hardware checksumming, and voila, it appears to work fine. Hardware checksumming is a recent addition to MacOS X, and apparently it has some weird problems that depend on whether a broadcast packet is an even or odd number of bytes in size.

As the last preparations for the final MacOS X release take place, one or another show-stopper bug pops up. One particularly nasty one actually first reared its ugly head in July. The problem is that if an app uses setitimer heavily, SIGALRM signals sometimes get lost. If the app is waiting on a timed event it will stall, and if SIGALRM is sent with kill from the command line the app will continue. The very same problem appeared years ago on SunOS 4.1.3 and was fixed by Sun with a patch. Unfortunately SunOS is closed source, so I do not know exactly what the fix was, but looking through the FreeBSD CVS I believe I may have found the culprit. There appears to be a rare case in which the setitimer mechanism can race with the hardware tick timer; the fix was in revision 1.9 of kern_time.c.

The AIX 5 fix download mechanism works now, and I downloaded the latest versions of the system components. The result has not been encouraging so far: if I leave the start of httpd in /etc/inittab, AIX 5 in 64 bit mode now always crashes on boot. I turned httpd off to be able to continue working in 64 bit mode.

I have fixed a strange bug that caused multicast address add/delete to fail when used via the AppleTalk socket ioctl interface. I introduced this special ioctl interface because not all systems sport a SIOCADDMULTI/SIOCDELMULTI style interface. To be exact, on AIX this is translated to the proper ndd_ctl NDD_ENABLE_ADDRESS/NDD_DISABLE_ADDRESS calls. And here the problem came in. I used a statement like this:

err = aa_ndd->ndd_ctl(aa_ndd, cmd == ATIOCADDMULTI?
NDD_ENABLE_ADDRESS : NDD_DISABLE_ADDRESS, buf, len);

The problem here is the comparison cmd == ATIOCADDMULTI. cmd is a parameter that explicitly contains the unsigned int from the upper levels, and ATIOCADDMULTI is a define using the standard sys/ioctl.h macros, namely _IOW('A', 5, struct ifreq). _IOW() contains some pretty obscure shifting, and on 64 bit AIX it actually produces a sign-extended 64 bit value that is negative (the top 32 bits are all ones). This will never compare equal to the plain 32 bit cmd value; casting to unsigned makes it work (cmd == (unsigned)ATIOCADDMULTI). Go figure.
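The effect can be reproduced without AIX. The sketch below assumes a 32 bit unsigned int and uses a made-up constant, FAKE_ATIOCADDMULTI, as a stand-in for what _IOW() is described as yielding; the bit pattern is illustrative only and is not the real ioctl number:

```c
#include <stdint.h>

/* Hypothetical stand-in for the sign-extended macro result: a negative
 * 64 bit value whose low 32 bits hold the command and whose upper
 * 32 bits are all ones. */
#define FAKE_ATIOCADDMULTI ((int64_t)0x80284105 - ((int64_t)1 << 32))

/* cmd is converted to int64_t and stays positive, so this never matches
 * the negative constant. */
static int matches_naive(unsigned int cmd)
{
    return cmd == FAKE_ATIOCADDMULTI;
}

/* The cast truncates the constant to its low 32 bits first, so both
 * sides are compared as unsigned int. */
static int matches_with_cast(unsigned int cmd)
{
    return cmd == (unsigned int)FAKE_ATIOCADDMULTI;
}
```

Called with 0x80284105u, matches_naive returns 0 and matches_with_cast returns 1. A compiler with sign-compare warnings enabled may flag the first comparison, which would have been a useful hint.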

I have been playing with a brand new IBM pSeries 610 that arrived early this week. It came preinstalled with a minimal AIX 5.1; they even left out the IDE CD-ROM driver the system needs to install any further software. As I wanted to know whether the system would work with AIX from CD-ROM anyway, I just booted from CD, and this time AIX did have an IDE CD-ROM driver.

This time I installed most of the software I believed we would need. I did not see the option in the install menu to make the 64 bit kernel the default, so I ended up with a 32 bit kernel. I had to test our AppleTalk kernel modules in both 32 and 64 bit mode, so I started with 32. After rebooting in 64 bit mode I at first had no NFS mounts; the mount command was barfing that one of the NFS kernel modules was using an old, obsolete format. Apparently I had installed one package too many: after removing the des package, NFS works fine.

The AppleTalk kernel module was an easy port, just Makefile adaptations to compile a 32 bit as well as a 64 bit module from the same sources and put both into one ar archive. The AIX kernel is smart enough to select the proper version from the archive depending upon the mode it is running in, pretty nifty.

While testing some stuff in 64 bit mode I noticed that Apache (as delivered by IBM as Websphere server) did core dump upon starting. Dbx does tell me the core file is invalid, strange. I started httpd with dbx and the -X option and dbx did hang. I kill -9'ed httpd and could exit dbx. I then attempted an apachectl start and whoops, I was talking with the service processor instead of AIX (I was sitting at the console). I rebooted and looked at the generated vmcore file and it did point at the kernel based linker that AIX uses for its shared libraries, it appeared to have stumbled across a NULL pointer while loading an httpd module.

A few of the other subsystems also produce strange failure messages in 64 bit mode, all in all I am not convinced about AIX 5.1 64 bit. AIX has been rock solid for me since the early beginnings, this is really disappointing. I looked at the AIX fixes page and tried the new order system for AIX 5 fixes as I found out that I did not yet have the latest components. One does click on the packages needed and they did tell me they would process my order and send me a notification with a download URL. After a few hours waiting no URL yet, not encouraging.

In the last diary entry I said MacOS X is not Unix. Today I have to say that Solaris threading and setitimer is really broken. The setitimer man page says that SIGALRM signals cannot be blocked in threaded code and this is a bug that is not going to be fixed.

In converting an existing event based system to cooperate with threads I really had to provide an efficient API for multiple millisecond resolution timers that only ever run while the main app is waiting for file descriptors via poll/select. All signals are blocked while not waiting for fds, so one can even call malloc inside signal handlers. This makes for some really easy event driven programming and we have used this framework for a really long time now.
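The general shape of "signals only arrive while waiting for fds" can be sketched with POSIX pselect, which swaps the signal mask atomically for the duration of the wait. This is an assumption for illustration: the framework described here predates widespread pselect support and need not work this way, and the names timers_init, wait_for_fds, on_alarm and timer_fired are invented:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>
#include <time.h>

static volatile sig_atomic_t timer_fired;

/* Timer handler; it can only run while the process sits in pselect. */
static void on_alarm(int sig)
{
    (void)sig;
    timer_fired = 1;
}

/* Install the handler and block SIGALRM for normal execution.
 * *wait_mask receives the mask to use while waiting (SIGALRM unblocked). */
static int timers_init(sigset_t *wait_mask)
{
    struct sigaction sa;
    sigset_t block;

    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGALRM, &sa, NULL) < 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGALRM);
    if (sigprocmask(SIG_BLOCK, &block, wait_mask) < 0)
        return -1;
    sigdelset(wait_mask, SIGALRM);
    return 0;
}

/* Wait for readable fds; SIGALRM is deliverable only inside this call.
 * Returns -1 with errno == EINTR when a timer signal interrupted the wait. */
static int wait_for_fds(int nfds, fd_set *rfds,
                        const struct timespec *timeout,
                        const sigset_t *wait_mask)
{
    return pselect(nfds, rfds, NULL, NULL, timeout, wait_mask);
}
```

With this arrangement the rest of the program never sees EINTR from ordinary I/O, because the timer signal stays pending until the next wait.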

For getting around the Solaris setitimer problem I did do a workaround by actually only having a really primitive SIGALRM signal handler that signals a real time signal that can be blocked as a replacement. This works really fine (with some overhead of the extra signal delivery) and I thought problem solved. Well, after some months using this on the development machines I have found out that this free running SIGALRM really wreaks havoc with one assumption everywhere in the code: no signal will happen unless in poll/select and thus EINTR is impossible.
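A rough sketch of that relay, assuming a single-threaded process and using kill(getpid(), ...) to raise the replacement signal. The names alarm_relay, install_alarm_relay, timer_tick and forward_signo are hypothetical, and SIGUSR1 is only a fallback assumption for systems without SIGRTMIN:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <unistd.h>

/* Replacement signal that, unlike SIGALRM here, can be blocked. */
static volatile sig_atomic_t forward_signo;
static volatile sig_atomic_t ticks;

/* Example timer handler attached to the replacement signal. */
static void timer_tick(int sig)
{
    (void)sig;
    ticks++;
}

/* Primitive SIGALRM handler: does nothing but re-signal. */
static void alarm_relay(int sig)
{
    int saved = errno;          /* kill may change errno */

    (void)sig;
    kill(getpid(), forward_signo);
    errno = saved;
}

/* Route SIGALRM through the blockable replacement signal. */
static int install_alarm_relay(void (*timer_handler)(int))
{
    struct sigaction sa;

#ifdef SIGRTMIN
    forward_signo = SIGRTMIN;
#else
    forward_signo = SIGUSR1;    /* assumption: no real time signals */
#endif
    sa.sa_handler = timer_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(forward_signo, &sa, NULL) < 0)
        return -1;

    sa.sa_handler = alarm_relay;
    return sigaction(SIGALRM, &sa, NULL);
}
```

The relay itself still interrupts slow system calls whenever SIGALRM fires, which is the EINTR problem described next.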

Now, with SIGALRM running freely without being blocked, any slow I/O on pipes, sockets and terminals can fail with EINTR, and strange failures creep into code that has been running for years. Because they depend on timing, these bugs are really difficult to find. We will have to wrap read, write, readv, writev and so on in safe versions that retry on EINTR, and change all of the places that need the wrappers. This really sucks.
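Such wrappers are short. A minimal sketch, with safe_read and safe_write as placeholder names rather than the ones we will end up using:

```c
#include <errno.h>
#include <unistd.h>

/* read() that retries when interrupted by a signal before any data
 * was transferred. */
static ssize_t safe_read(int fd, void *buf, size_t len)
{
    ssize_t n;

    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);
    return n;
}

/* write() with the same retry rule. Short writes are still returned
 * to the caller, exactly as with plain write(). */
static ssize_t safe_write(int fd, const void *buf, size_t len)
{
    ssize_t n;

    do {
        n = write(fd, buf, len);
    } while (n < 0 && errno == EINTR);
    return n;
}
```

The readv and writev variants would follow the same pattern. The tedious part is not the wrappers but finding every call site.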

Did some debugging yesterday that showed again that MacOS X is not Unix, it is something else. If you start a background daemon from a shell window while logged on with the Aqua GUI and then log out, newly forked processes from the background daemon will not be able to do any get*ent lookups any more. The C library on MacOS X attempts to re-establish a Mach IPC send right for the lookupd cache management server on fork, and this fails because the Mach bootstrap server has destroyed the current context on logout. This basically means you are not able to start background daemons from shell windows, this really sucks.

The dladdr idea is hopelessly machine dependent and I have decided not to pursue it further. We thus compile in the name of the shared library and search for it in the standard places.

In the mean time the rework of the admin protocol for all the PC style stuff and the new printer interface types progresses well, the server part is done. Heinrich works on the client side, this takes longer as it is much more work.

I have meanwhile started to put the AFP 3.0 extensions into our afpsrv, although I do not necessarily expect to be finished in time for the initial MacOS X release. The important infrastructure changes are already done, namely the 64 bit file I/O stuff and the new shared arena. AFP 3.0 also allows long UTF8 file names, which we can now do easily as we extended our desktop database format. I will first implement the 64 bit I/O calls and then the Unix style permissions, leaving the more complicated UTF8 file name stuff for later.

After returning from Yellowstone I was busy the last few days to abstract a few operations we have been doing all the years although it is possible to optimize them. In particular we do append resources (the idea is loosely based on the Mac idea) to the end of our executables for small information items that should always be in sync with the compiled code. Under Unix there is no standard way to open the current process executable, so the original code did search for argv[0] along the path.
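The search along the path looks roughly like the sketch below. It is an illustration under assumptions, not the original code: find_in_path is an invented name, the caller passes in the value of PATH, and access(X_OK) is used as the test, which a directory would also pass:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: locate `name` (typically argv[0]) the way a shell
 * would. On success the result is stored in `out` and 0 is returned. */
static int find_in_path(const char *name, const char *path,
                        char *out, size_t outsize)
{
    /* A name containing a slash is used as-is, without searching. */
    if (strchr(name, '/') != NULL) {
        if (strlen(name) >= outsize)
            return -1;
        strcpy(out, name);
        return access(out, X_OK) == 0 ? 0 : -1;
    }

    while (path != NULL && *path != '\0') {
        const char *end = strchr(path, ':');
        size_t dirlen = end ? (size_t)(end - path) : strlen(path);
        int n;

        if (dirlen == 0)        /* empty element means current directory */
            n = snprintf(out, outsize, "./%s", name);
        else
            n = snprintf(out, outsize, "%.*s/%s", (int)dirlen, path, name);

        if (n > 0 && (size_t)n < outsize && access(out, X_OK) == 0)
            return 0;
        path = end ? end + 1 : NULL;
    }
    return -1;
}
```

This is fragile by nature: argv[0] is whatever the parent process chose to pass, and PATH may have changed since the exec.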

Under more modern Unix variants there is the /proc that allows one to open the running executable more easily, for example /proc/self/exe under Linux or /proc/self/object/a.out under Solaris. A few platforms like Irix or Tru64 make that more difficult as one has to open /proc/<pid> first and then use ioctl(..., PIOCOPENM, 0) to retrieve an open file descriptor for the zero mapping (the main executable).
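For the two simple cases this boils down to trying a list of well-known names. A sketch, with open_self_exe as an invented name; the Irix/Tru64 ioctl route is deliberately left out because it needs platform-specific headers:

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: open the running executable read-only via /proc.
 * Returns a file descriptor, or -1 if no known /proc name works, in which
 * case the caller falls back to searching along the path. */
static int open_self_exe(void)
{
    static const char *const candidates[] = {
        "/proc/self/exe",           /* Linux */
        "/proc/self/object/a.out",  /* Solaris */
    };
    size_t i;

    for (i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        int fd = open(candidates[i], O_RDONLY);

        if (fd >= 0)
            return fd;
    }
    return -1;
}
```

Once the descriptor is open, the appended resources can be found by seeking relative to the end of the file.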

Still some Unix variants like AIX 4 or MacOS X do not provide any of this so we still have to search along the path, a bit fragile and ugly.

Apparently it gets even uglier if you want to open a shared library. The current solution compiles in the name of the shared library (ugh) and searches according to the OS search rules for shared libraries. As far as I have thought about this, one could either call dladdr to find the name of the shared library a function is in, or use the /proc file system mapping enumeration to do it. I will see which version works best.

Today was the last day before leaving for the CIFS conference in Bellevue, WA followed by a week of vacation in Yellowstone National Park. As it is with these last days, the MacOS X beta was supposed to be ready today as well. Alas, as it turned out there was just that show stopper bug that turned up late in the afternoon after we did already put a version on our web server (not visible if you do not know the path). The TNT folks delivered a set of new CD's with the latest MacOS X 10.1 build, and to our horror a few programs did just core dump upon starting up.

Examining the core dump showed svc_getreqset as the leaf function on the stack, which immediately rang a bell with me. I had that problem before, with the change from AIX 4.2 to 4.3. I looked into sys/types.h of the new MacOS X version and indeed, they increased FD_SETSIZE from 256 to 1024. This is no bad idea as 256 is rather small, but the design of the SUN RPC library is really bad in this regard, as it passes the address of an fd_set but not how large it is. From the application side it is also difficult to prevent this, as there is no way to find out which value of FD_SETSIZE was used to compile the C library. As it stands, we simply defined FD_SETSIZE to be 1024 even though we are still compiling on the older system; this way the structure is large enough.

As we had to clean out all object code, the build is still running and we will have to put it up on the web server on Monday. One day we will have to get rid of the elaborate makefile system and use some saner perl scripts to do the build; this way it would be easier to distribute the build across multiple systems.

OK, I have got the packaging to work under MacOS X. The problem was that I set the destination to /usr/local/helios and put all relative path names into the pax.gz and .bom files. This was in an attempt to leave open the option to make fully relocatable packages, but this does not work out easily as you can not easily find out the installation directory of a base package if you have multiple add-on packages. I have now put root-relative path names (including the usr/local/helios prefix) into the .bom and .pax.gz files and set the destination to /. Now the packages install fine even if I re-install.
