Older blog entries for tmattox (starting at number 8)

Lots of news this time:

  • Ahhhh, the replacement switches for KLAT2 are installed. Hopefully they actually fixed the design flaw.
  • Ars Technica has posted our article about how we got 64 GFLOPS out of KLAT2.
  • The new AFN 000601 PCBs are finished, and should be in my hands by Wednesday. I envision lots of soldering in my near future... :-)
  • Our submission for a Gordon Bell price/performance award was accepted, so we are finalists for this prestigious award.
  • Our abstract/paper about KLAT2/FNNs was accepted at the Third Extreme Linux Workshop which will be held at the 4th Annual Linux Showcase in Atlanta.
  • We put together a KLAT2 In The News page that links in all the coverage that we have found so far.
  • Finally, our cluster work is linked in on the "official" Beowulf web site.
The more I get done, the longer my TO-DO list gets! That's two papers that need mucho revisions, 21 AFN boards to assemble and solder, an update to AFAPI to handle more than 32 processors, and some major code/network tweaking to improve that price/performance ratio!

On a personal note, last weekend I might have gotten a few more people addicted to Settlers of Catan, a really cool board game. Yeah, no fancy 3D video card needed... much less a computer :-)

It's been a busy week and a half since my last entry.

  • On Friday, Hank and I finished writing an article about how we got over 64 GFLOPS on KLAT2. Hopefully it will be appearing on Ars Technica in the next week.
  • I sent off the PCB design files for a new AFN based on a revised PAPERS 960801 module. So, hopefully, in a few weeks, we can have an AFN up and working on KLAT2. The new design files will be posted once I've verified that none of my changes/tweaks messed up the functionality/reliability of the PAPERS 960801 design. I didn't strictly need to make revisions, but since we needed a new run of PCBs anyway, it was a good time to correct some annoyances with the old design.
  • And now for the ugly events of the past two weeks or so:

The 32-port Fast Ethernet switches that we purchased for KLAT2 had a design flaw, causing a 60% failure rate after a few weeks of use. The manufacturer says it is a latent thermal problem. I suspect that the failure rate will approach 100% within another month or two. The company is sending us replacements for the entire set, with a design revision that supposedly fixes the problem. Yet they won't get here for another two weeks! I've set the thermostat in the lab down to 65 F, making it rather unpleasant for me to work in the room. Hopefully, the remaining 4 switches will keep working until the replacements arrive.

The other recent unpleasant event is our discovery that the "marketroids" have again redefined a technical term/phrase into oblivion. The phrase "wire-speed switching on all ports" used to have a technical meaning that the backplane bandwidth of a switch was large enough to handle all ports going at full speed in full duplex mode continuously, as long as the communication pattern was a permutation. The key here is that "wire-speed" should mean that as long as I am the only processor/NIC sending to another particular processor/NIC, I should have full wire-speed bandwidth available for my use, regardless of what other traffic is in the switch. The marketroids seem to have modified this definition to mean that for some permutations, you can achieve wire-speed, but not for all permutations. ACK!

So, if we can get more specific details on the internal structure of common switches, we will try to modify our GA to accommodate the restrictions when designing an FNN. Most switches seem to be built with 8+1 switch-on-a-chip modules, where the +1 ports are tied together in a unidirectional ring of various bandwidths. The key is that, depending on how high the ring's bandwidth is, the overall switch cannot achieve wire-speed for permutations that must go almost all the way around the ring. This will also affect the observed latency of your connection patterns, possibly dramatically.
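To make the ring problem concrete, here is a rough Python sketch. It is not taken from any vendor documentation: it simply assumes a 32-port switch built from four 8-port modules, a unidirectional ring joining them, and a ring capacity measured in multiples of one port's speed. The function names and the capacity parameter are made up for illustration.

```python
def ring_link_loads(perm, ports_per_module=8):
    """Count flows crossing each ring link for a permutation.

    perm[src] = dst means port src sends at full speed to port dst.
    loads[i] is the number of flows on the one-way link that leaves
    module i toward module (i + 1) % n_modules.
    """
    n_modules = len(perm) // ports_per_module
    loads = [0] * n_modules
    for src, dst in enumerate(perm):
        module = src // ports_per_module
        dst_module = dst // ports_per_module
        # Traffic can only travel one way around the ring.
        while module != dst_module:
            loads[module] += 1
            module = (module + 1) % n_modules
    return loads


def wire_speed_ok(perm, ring_capacity, ports_per_module=8):
    """True if no ring link carries more flows than it has capacity for.

    ring_capacity is an assumed figure, in units of one port's speed.
    """
    return max(ring_link_loads(perm, ports_per_module)) <= ring_capacity


N = 32
near = [(p + 8) % N for p in range(N)]  # every port sends to the next module
far = [(p - 8) % N for p in range(N)]   # every port sends to the previous module

print(max(ring_link_loads(near)))  # 8
print(max(ring_link_loads(far)))   # 24
```

Under these assumptions both patterns are legal permutations, yet the "far" one puts three times the load on every ring link, because each flow has to travel three hops around the ring instead of one. A ring sized for 8 ports' worth of traffic would pass the first pattern at wire-speed and choke on the second.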

P.S. - We did NOT want to know this. But too late now... What happened to crossbars, fat-trees, and star topologies for internal switch fabrics? (I know: economics...) Addendum: I just read through a document from Allayer, a switch-on-a-chip maker, that reasonably explains the choice of a ring.

Cool! KLAT2 has been hitting the press around the world! It's been reported in a Chinese newspaper's business section... I still need to find out "which" newspaper. A fairly well done article appeared in the EE Times under Technology News. The CNET story got reported at Tom's Hardware which has translated mirrors around the globe (German, Japanese, and Korean to name a few).

I'm putting the finishing touches on a new PCB for making an AFN. It's a tweaked version of the PAPERS 960801 board. Hopefully we can assemble and test an AFN on KLAT2 by the end of June. We could just use the old design, but we need 21 boards to build the AFN for KLAT2... so we needed to get more PCBs fabbed, so it was a good time to make a few fixes (and to update the URL on the PCB :-) Once the new board has been checked out, I'll post the design files and board masks.

P.S. - We now have access to a wave solder machine... cha-ching!

Patience Luke.. Patience...

What a difference a few days makes. CNET picked up the story today and even included a picture of KLAT2. And we discovered that a strange/funny rewording of our press release was on LinuxMall.com... I never thought moonshine would be associated with supercomputers. :-)

We've already had several people "chomping at the bit" to get a copy of our software to design and use Flat Neighborhood Networks. It'll take some time to clean up the code so that it can be used/understood by people other than the authors. But as soon as it's not embarrassing for others to see, it'll be posted and released into the Public Domain.

Hmmm, I guess a super-keen-neato-fast Linux Athlon cluster for real cheap isn't as newsworthy as we had thought. Is it just common knowledge that Athlons rock, or is it "Beowulf press release overload" recently? Anyway, we made it onto TechNN under "press releases" for a few hours... almost a day, Linux Today with 1500 "reads" or so, and KLAT2 was prominently mentioned on 3DNow.net.

A little 64 node Athlon cluster for under $42K just doesn't compete with a $15 million NOAA cluster for news coverage. Or am I just being impatient with the press...

Oh well, it's time to get back to the grind, and get the next software release out the door.

Our press release just went out for our new KLAT2 cluster breaking the $1K/GFLOPS milestone. Interestingly, it appears that we were scooped by AMDZone.com a few hours before we actually started officially sending out the PR...

P.S. - It's been a wild ride designing, building, and debugging KLAT2 over the past two months or so. Almost as fun as the Beast :-)

P.P.S. - KLAT2 stands for "Kentucky Linux Athlon Testbed 2"

Just got back from Kings Island... I had a great time, except that I missed the Vortex, and the Son of Beast wasn't running. :-(
The new Face/Off ride was really cool. Also, the Beast and Outer Limits are always a thrill.

It seems that a few people do read these diary entries. Hi Jonathan and Michael, glad to run across some familiar faces. Jonathan, thanks for the M4 info, I'll look into using it for the next batch of Aggregate web pages.

Speaking of which... On Saturday I spent about an hour rummaging around the web pages of the University of Sao Paulo in Brazil trying to find a research project page from the past...and, no, I don't know Portuguese. Eventually I found it: they had made a "new" web site for their project, and only left behind those stupid 404 errors at their old web page. grrr... So I guess I'm wondering: is there some easy way to make an entire old web page hierarchy point to the new site instead of turning into 404s? If this were easier to do, there might be fewer broken/dead-end links out there.
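For what it's worth, the usual answer is an HTTP "301 Moved Permanently" redirect, and web servers such as Apache have a redirect directive that covers a whole directory tree. The idea fits in a few lines of Python, sketched here with the standard library only. The new site address is a made-up placeholder, and `redirect_target` and `serve` are names invented for this example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical address of the relocated site; substitute the real one.
NEW_BASE = "http://new.example.org/project"


def redirect_target(path, new_base=NEW_BASE):
    """Map a path on the old site to the same path under the new base URL."""
    return new_base.rstrip("/") + "/" + path.lstrip("/")


class RedirectHandler(BaseHTTPRequestHandler):
    """Answer every request with a permanent redirect instead of a 404."""

    def do_GET(self):
        self.send_response(301)
        self.send_header("Location", redirect_target(self.path))
        self.end_headers()

    do_HEAD = do_GET


def serve(port=8080):
    """Run the redirector on the old host (blocks until interrupted)."""
    HTTPServer(("", port), RedirectHandler).serve_forever()
```

This only works if the new site keeps the old directory layout below its new base; pages that were renamed would need an explicit old-to-new table instead.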

The Sao Paulo page I was looking for was their reference to our work with the PAPERS project. I was looking for that, since I think the next update to The Aggregate site should be a pile of links to users of our technology. I know it's out there in many places, but we've not been keeping track of how many people actually use our public domain technology.

Anyway, I guess my point/question is "Is there a non-invasive, ethical, easy, and regularly used method for tracking how many, and/or who, is using a particular free/open source software package?" I don't like the idea of making people fill out some form before they can download our software, especially since that doesn't correlate to who actually uses it for anything. I have filled out those "registration forms" for a variety of open software packages, and yet I'm not sure I still use, or ever actually used, any of them beyond the first run or two. Including an annoying "Please don't forget to register your XYZ software" message each time it is run is out of the question. I've had programs crash from broken registration-reminder features! Also, the idea of having a software package check in with its "authors" over the internet each time it's run is repulsive to me. I guess a feature to "automatically/periodically (with permission) check for the latest version of itself" might be a sensible way to measure the number of real users of said package. I guess you could even make it check in anonymously by default. Hmmm, I wonder if the recent spate of commercial software packages that have "auto-update" features are really serving the "demographics department" more than the user...

Probably the best thing I can do would be to put up a voluntary registration form for people who wish to be linked in as "part of the user community" for XYZ... hey, isn't that sort of what Advogato is doing!

I'm rambling... time for sleep.

Well, The Aggregate website update is taking a long time to do. Does anyone reading this (if anyone is :-) have suggestions for web site maintenance tools under BSD or Linux? I'm really in need of something that automatically annotates IMG tags with the WIDTH and HEIGHT of the images. It makes a huge difference when viewing pages over a modem... Anyway, the site is now mostly there, and I can soon return to getting my LAM/MPI mods submitted, and our latest AFAPI/VWLib fixes tested & posted. Oh, and that Ph.D. thesis thing needs some work too.
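A tool like that does not need much. Here is a minimal Python sketch of the idea, using only the standard library. It assumes the images are PNG or GIF, reads the dimensions straight out of the file header, and leaves alone any tag that already has a WIDTH or HEIGHT. The names `image_size` and `annotate`, and the `load` callback, are invented for this example.

```python
import re
import struct


def image_size(data):
    """Return (width, height) from PNG or GIF header bytes, else None."""
    # PNG: 8-byte signature, then the IHDR chunk holding big-endian sizes.
    if data[:8] == b"\x89PNG\r\n\x1a\n" and data[12:16] == b"IHDR":
        return struct.unpack(">II", data[16:24])
    # GIF: 6-byte signature, then little-endian logical screen size.
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return struct.unpack("<HH", data[6:10])
    return None


IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
SRC_ATTR = re.compile(r'\bsrc\s*=\s*"([^"]*)"', re.IGNORECASE)
HAS_SIZE = re.compile(r"\b(width|height)\s*=", re.IGNORECASE)


def annotate(html, load):
    """Add WIDTH/HEIGHT to IMG tags that lack them.

    load(src) must return the image bytes for a SRC value, or None.
    """
    def fix(match):
        tag = match.group(0)
        if HAS_SIZE.search(tag):
            return tag  # already annotated, leave it untouched
        src = SRC_ATTR.search(tag)
        data = load(src.group(1)) if src else None
        size = image_size(data) if data else None
        if size is None:
            return tag  # unknown format or missing file
        end = "/>" if tag.endswith("/>") else ">"
        body = tag[:-len(end)].rstrip()
        return '%s WIDTH="%d" HEIGHT="%d"%s' % (body, size[0], size[1], end)

    return IMG_TAG.sub(fix, html)
```

In practice `load` would open the file named by SRC relative to the page's directory and read its first few dozen bytes. JPEG is left out here because finding its dimensions means walking the segment markers, which takes more than a fixed-offset read.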

Tomorrow I'm meeting some "Purdue" friends at Kings Island to forget about all this for a day...

I've begun in earnest to update the http://aggregate.org/ web site now that KLAT2 is up and running. So what is KLAT2? Our research group has been on very friendly terms with AMD, and recently they donated 66 Athlon processors to our project. We turned those into the KLAT2 cluster (Kentucky Linux Athlon Testbed 2), which will hopefully be making the news soon.

We are trying out a bunch of new performance enhancing technologies with KLAT2. The coolest one I think is the genetically engineered Flat Neighborhood Network (FNN) topology Hank & I came up with. It allows a very inexpensive network to very closely approximate a fully connected network. It has a very high cross-section bandwidth, and single hop latency between any pair of nodes. I've done a preliminary patch to LAM/MPI to make it work with the FNN. However, it is going to take quite a bit more software hacking to get an MPI to utilize all the data link bandwidth that the FNN supplies. I'll post a pointer when we have a reasonable document up that explains the FNN.
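Until that document is up, the "single hop between any pair of nodes" property can at least be stated as a small check. The Python sketch below reads it as: every pair of nodes has at least one switch that both are plugged into, and no switch uses more ports than it has. That reading, the function names, and the toy wiring are assumptions made for illustration. This is not the genetic search that produces a wiring, only a test a candidate wiring has to pass.

```python
from itertools import combinations


def uncovered_pairs(node_switches):
    """List node pairs that share no switch and so would need extra hops.

    node_switches maps each node to the set of switches its NICs plug into.
    """
    return [(a, b)
            for a, b in combinations(sorted(node_switches), 2)
            if not (node_switches[a] & node_switches[b])]


def is_flat_neighborhood(node_switches, ports_per_switch):
    """True if no switch is over-subscribed and every pair shares a switch."""
    usage = {}
    for switches in node_switches.values():
        for switch in switches:
            usage[switch] = usage.get(switch, 0) + 1
    if any(count > ports_per_switch for count in usage.values()):
        return False
    return not uncovered_pairs(node_switches)


# Toy wiring: 6 nodes with 2 NICs each, three 4-port switches A, B and C.
wiring = {
    0: {"A", "B"}, 1: {"A", "B"},
    2: {"A", "C"}, 3: {"A", "C"},
    4: {"B", "C"}, 5: {"B", "C"},
}
print(is_flat_neighborhood(wiring, ports_per_switch=4))  # True
```

The toy case shows the appeal: six nodes all reach each other through a single switch even though no switch has more than four ports.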
