madscientist is currently certified at Master level.

Name: Paul Smith
Member since: 2002-06-29 02:07:48
Last Login: 2008-05-19 21:51:26


Notes:

I started using computers in the 70's, typing in programs from listings in books of BASIC games: Hunt the Wumpus, Tic Tac Toe, Checkers, Star Trek, you name it. In the beginning I had absolutely no idea what the characters I was typing meant, but gradually I started to learn how to program. I was using teletypes with acoustic modems, both thermal-paper and loud, clattering typewriter-style. I can still vividly remember the smell of the teletype. By the end of the evening I'd be waist-deep in thermal or tractor paper, as page after page documented my battles with Klingons... or with Colossal Cave.

In the 80's I received a Bachelor's degree in Applied Mathematics/Computer Science from Carnegie Mellon University. I learned an incredible amount at CMU, and met so many remarkable people. Truly an awesome experience; one of the great times of my life. But I would say CMU is definitely not for everyone... By the time I graduated I was addicted to C, UNIX, Emacs, and Coke. Classic, of course. We hoarded it during the Dark Times. Oh, and my wife. I'm still addicted to all of them. Except I've switched to Diet Coke.

Since then I've been working mostly in networking, first in network management, then in network devices themselves. I have a huge interest in coding both as an art and a process: best practices, SCM tools, build tools, standards, etc. I've been a strong advocate of free software since I first started using it, well before the Linux kernel was even a blip on the horizon.

I started using GNU/Linux in 1993, after downloading it on my SunOS box at work and using dd to raw-write it onto about 10 floppies. I started with Slackware, then moved to Red Hat, then to Debian and I've never looked back. Debian Rules!

In 1996 Roland McGrath decided he didn't have time to continue maintenance of GNU make and asked me to take over. I guess I pestered him enough about it that he felt like getting even. Or maybe it's that I was a footnote in an O'Reilly book. I've also been involved with development on FVWM, a number of Emacs Lisp packages, and other random projects.

I'm interested in writing code, SCM tools and systems, standards, Lisp, C, Perl, make, UNIX, shell, Wikis, OO, XML, networks, virtual synchrony, Extreme Programming, ...

Projects

Recent blog entries by madscientist

GNU make
Released GNU make 3.80 last week. Feels good to get it out the door, and so far so good: no "brown paper bag" bugs reported yet. I'm always slightly nervous about this in a tool as essential and ubiquitous as GNU make; if there's a stupid bug, it's pretty visible! My plan is to fix any bugs that come in, integrate the ports to OS/2 and MinGW that have been sitting on the back burner for a few months, and maybe do some other cleanups (I've got a workspace with ansi2knr support almost completely integrated...!!), then release 3.81, maybe even by the end of the year depending on various factors. Then I'll choose a "next big thing" to go into 3.82. My favorite project is stateful make, but that's a serious amount of effort. Another possibility is integrating Guile as an extension language.

Work
Survived another reduction last month. Who'd have thought the telecom industry would implode this badly in only two years? If the trades are to be believed, it ain't over yet, either. *Sigh*. With all that, some of our plans are on hold, some are reprioritized, and so it goes...

Home
Headed to West Virginia for another Buckwheat Festival with the family a few weeks ago. As always, great fun was had by all, although the weather was variable. Just as one of the parades was ending, the sky let loose with an absolutely torrential downpour. It was quite the sight to see the formerly crowded sidewalks empty in mere seconds and the street turn into a river; discarded cups and containers caromed around and through the waterlogged feet of the final, soaked high school marching band as they plodded towards the end of the route. As always, too, I got sick while we were there. Just mild this year, though.

21 Aug 2002 (updated 21 Aug 2002 at 13:18 UTC)

I spent an hour or so playing with an implementation of Paul Graham's anti-spam algorithm, described recently in A Plan for Spam.

I implemented two different tools, both in Perl. The first, spamcalc, takes two sets of filenames separated by "--": the first set lists files containing "good" email, the second files containing spam (you can have lots of email messages in a single file, or one per file--but it only groks standard UNIX mailbox format, with '^From ' delimiters). The script reads in and tokenizes all the "good" messages (using the same algorithm Paul describes), counting how many times each token appears, then does the same for the bad messages. Finally it does the weight calculation and constructs a DB file containing all the valid tokens (those that appeared enough times to count) and their weights.
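For the curious, the heart of the weight calculation looks roughly like this in Perl (a minimal sketch of the formula from the paper, not the exact code in spamcalc; the variable names are mine):

    use strict;
    use warnings;

    # Per-token spam weight, following the formula in "A Plan for Spam".
    # %$good and %$bad map token => occurrence count; $ngood and $nbad are
    # the number of good/bad *messages* in the corpus.
    sub token_weight {
        my ($tok, $good, $bad, $ngood, $nbad) = @_;
        my $g = 2 * ($good->{$tok} || 0);   # good counts are doubled
        my $b = $bad->{$tok} || 0;
        return undef if $g + $b < 5;        # too rare to be meaningful
        my $bw = $b / $nbad;   $bw = 1 if $bw > 1;
        my $gw = $g / $ngood;  $gw = 1 if $gw > 1;
        my $p = $bw / ($gw + $bw);
        $p = 0.01 if $p < 0.01;             # clamp: no token is ever absolute
        $p = 0.99 if $p > 0.99;
        return $p;
    }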

The second script, spamcheck, takes a single message, tokenizes it, and finds the 15 most "interesting" tokens (those whose weights are farthest from a neutral 0.5). It then applies Bayes' Rule and shows you the resulting probability that the mail is spam. The implementation (barring any stupid coding errors on my part) is identical to that described in the paper, including ignoring case, etc.
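The combining step is simple enough to show here (again a sketch in the same spirit, not spamcheck verbatim):

    use strict;
    use warnings;

    # Combine per-token weights into one spam probability, per the paper:
    # take the 15 tokens farthest from neutral (0.5) and apply Bayes' Rule.
    sub spam_probability {
        my @probs = sort { abs($b - 0.5) <=> abs($a - 0.5) } @_;
        @probs = @probs[0 .. 14] if @probs > 15;
        my ($prod, $inv) = (1, 1);
        for my $p (@probs) {
            $prod *= $p;
            $inv  *= (1 - $p);
        }
        return $prod / ($prod + $inv);   # probability the message is spam
    }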

I then played around with it for a bit. The main problem is that, like most people I suspect, I don't keep my old spam. I had to dig hard to come up with some spam to test with, and only managed to find 10 messages that I had received. But I'm just testing--who cares, right?

For the "good" corpus, I have a huge archive of every email I've ever sent (well, since 1995 or so--the older stuff is on backup somewhere), but that's not really what I want, since I'm trying to classify email others have sent me: it seems likely that email I sent would have a different, skewed statistical "look" from email I receive, and would harm the filter. However, I also have a pretty large set of folders containing mail others have sent me, so I used all of that for the "good" mail. I then ran some test email, both spam and not-spam, through the filter.

Well, the results were disappointing: everything was categorized as spam! Looking at the results shows why: there are about 5 instances of the year ("2002") as a token in the test messages (in the headers, etc.), and each one of those was labeled individually as very interesting, and they all had a strong correlation to spam (.88 or so). Why is this? Easy, once you think about it: my spam was all of very recent vintage: today, actually. However, my good email was from folders where a very large number of messages were from previous years. So, the "2002" token appeared in all the spam messages, but a much smaller percentage of the good messages, hence the year was treated as a high-probability indicator of spam! Not good. Maybe if I had more spam (even if it was all from 2002) there would be more interesting words than the year and this wouldn't matter. Of course, older spam would also solve this problem.
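To make the skew concrete, here's a back-of-the-envelope version with invented counts (these are not my actual corpus numbers, just an illustration of the effect):

    use strict;
    use warnings;

    # Suppose "2002" appears in all 10 spam messages but in only 250 of
    # 2000 good ones (the older folders drag the good frequency down).
    my ($b, $nbad)  = (10, 10);
    my ($g, $ngood) = (2 * 250, 2000);   # good counts are doubled
    my $bw = $b / $nbad;                 # 1.0
    my $gw = $g / $ngood;                # 0.25
    printf "p(spam | \"2002\") = %.2f\n", $bw / ($bw + $gw);   # 0.80

A weight like that, repeated across five header tokens, easily swamps everything else in the message.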

Then I decided to try to get more spam to test with, so I went looking at archives of mailing lists, like the GNU lists, which I know get lots of spam. I found 30-40 messages and saved them, and re-ran spamcalc. Now when I tested messages, they were all categorized as not spam! Again, checking the details shows why, and again it's related to mail headers: all the email sent to me contained headers showing my hostname, etc. All the spam I pulled from the archives did not. So any tokens containing my host, etc. indicated a low probability of spam... again, not good!

So, I changed my "good" list to be just my inbox, which does contain some older messages but most of which are more recent, to solve the first problem, and I included only the spam I'd actually received to solve the second. This works better than the other two, but still I don't have enough spam mail to get a really good filter yet. But, I've started saving spam so maybe it won't be too much longer :).

In summary, if you want to use this algorithm, be aware that for good results both your good and spam sets of messages should be of similar vintage and provenance (not just the year, but other things in the headers, like local hostnames), and that you should use spam you actually received rather than public archives of spam.

One way around this would be to enhance the algorithm to ignore some kinds of tokens outright: maybe things that look like dates, and maybe the first one or two (or some well-known set) of Received headers (ones that will be in every message you receive anyway); a sketch of what I mean follows. Obviously we're now moving slightly away from a pure statistical analysis and trying to inject some AI into the algorithm, which kind of goes against the whole idea.
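A first cut at such a token filter might look like this (the patterns are entirely my own guess at what "vintage" tokens look like; under Graham's tokenizer most punctuation is a separator, so dates and times tend to arrive as bare numbers anyway):

    use strict;
    use warnings;

    # Reject tokens that encode vintage or routing noise rather than content.
    sub boring_token {
        my ($tok) = @_;
        return 1 if $tok =~ /^\d+$/;     # bare numbers: years, times, sizes
        return 1 if $tok =~ /^-\d{4}$/;  # timezone offsets like -0400
        return 0;
    }

Filtering all bare numbers is probably too aggressive (dollar amounts are classic spam markers, though those keep their $ under Graham's tokenizer); it would take some experimenting.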

Anyway, I thought it was an interesting experiment.

Back from another vacation. Kind of crazy: two weeks away, then two weeks back, then another week away. But I'm darn relaxed (and a good thing too, because due to other vacations I've already taken I only have a few days left this year...). This one was with my family, which is actually always pretty fun, and some extended relatives I hadn't seen in a number of years showed up. Biking, beaching, and reading.

GNU make
I added another new feature or two to make, so I need another pretest release. Also, someone reported a problem with jobserver support on PTX (whatever that is!) which we haven't closed on yet. Still, it looks good.

CrossOver Office 1.2.0
Wow! Way cool software. I'm really impressed with how much progress WINE has been making; I didn't realize they were doing so well. And CodeWeavers has done a nice job of productizing it. I'm thrilled because I was able to get Quicken running on my Linux box, which was the very last application that made me reboot into Windows (I'd love to switch to GnuCash or similar, but it just doesn't have enough features for me to get there yet). This was causing my personal finances to suffer, since I avoided rebooting as much as possible :). Although there are some small glitches, I'm able to download my online credit card statement, E*Trade account info, and my online bank info as well, and import it all relatively easily. Certainly it's less labor-intensive than rebooting into Windows...

Health
Ya know, sore backs suck. My dad has had problems with his back, and mine has been acting up on occasion as well--it's been kind of sore ever since the marathon plane ride to Hawai'i. It seems to be finally getting better now, though. You don't really realize how much you depend on your back muscles until they start complaining... My wife has some yoga tapes and I might start trying those to see if they help.

29 Jul 2002 (updated 29 Jul 2002 at 04:46 UTC)

Back from vacation: almost two weeks in Hawai'i, half in Kona and half outside of Hilo. It was good to be back after so many years (last time I was in Honolulu). Most excellent. Most relaxing. Didn't even bring a laptop. My fingers aren't working so well yet--have to recondition them. Good diving.

The volcano is active, so we got to walk right up to the flow (the week we were there the lava chewed up more of the road). Did the helicopter thing: my first time in a helicopter; it was very cool. Got to see the vent and also lava flowing into the ocean. Unbelievable. Then we went to visit Kalapana, where the most beautiful black sand beach on the island (along with most of the little village next to it) was completely covered in lava back around 1991 (IIRC). It was incredible to hike out hundreds of yards across black, broken lava and think that ten years ago we would have been standing right in the ocean!

Drove down into the Waipi'o Valley and had a good day there; spent a day on 69 beach (and got burned :( ); visited the Pu'uhonua o Honaunau national park: I'm telling you, it was kind of eerie--I've never really had such a strong feeling of history right there... and I live less than a mile from the Battle Road in Massachusetts. Spelunked lava tubes, saw the Akaka and Rainbow falls, did a lu'au (of course), saw lots of sea turtles (very cool), and generally had a blast. A very long plane ride, but definitely worth it.

GNU make
Got a pretest release out before I left. Seems mostly fine, although there were a few small problems. I need to follow up with some issues that were raised, though. Still, I don't think it will be too much longer before a new release.

Web Hosting
Well, my subdomain is finally working. My hosting service offers very cool features at a very reasonable price (and all their servers run Linux! :)), but when there are glitches it can take a little perseverance to get them ironed out. Oh well, whaddaya gonna do.

Internet Banking
I just love internet banking. I've been so happy ever since I said "screw you" to Fleet and went Internet. This vacation I did hit a minor glitch, though: it turns out that my bank won't transfer money automatically from my reserve credit line to cover ATM transactions (although they do for checks, of course). Annoying, but not a huge deal. Now that I know about it :).

Free Software
Geez, you're gone for two weeks and it takes a full day just to catch up. Debian 3.0 released: need to apt-get dist-upgrade on a number of my systems. New versions of gettext and automake: need to update my packages for that (esp. gettext as I was using a prerelease version before). And, of course, who knows what's going on when I get to work tomorrow ...

People who crack free software sites suck. Yeah, yeah, we're so impressed that you can take advantage of friendly people who are trying to help you; that really shows your chops. Gee, you're so l33t! Assholes.

I've gotten quite a bit done on GNU make this week. I want to try to get a few more things done but if I don't, no big deal. I'm off on vacation starting Sunday morning and I will have a pretest release out before I go. Some neat new things in there.

Why is it always that the week before you go on vacation you have to work four times as hard? Annoying.

chipx86, I don't understand your issues with gettext. Why do you want backward-compatibility with 0.10.x? This seems not very useful, as that version was really broken. Some of the new features in gettext are really excellent: personally I've removed all gettext code from my source tree and now I use the external mode. This is really nice. The next version of gettext comes with a new tool, autopoint, which does what you want: it does not modify ChangeLog, etc. Unfortunately it's been stuck in beta limbo for a couple of months now; I've been using the pretest which has a small bug (easy to work around though).


madscientist certified others as follows:

  • madscientist certified tromey as Master
  • madscientist certified abraham as Master
  • madscientist certified wichert as Master
  • madscientist certified jimb as Master
  • madscientist certified chip as Master

Others have certified madscientist as follows:

  • fxn certified madscientist as Journeyer
  • lerdsuwa certified madscientist as Master
  • dneighbors certified madscientist as Master
  • mulix certified madscientist as Master
  • sdodji certified madscientist as Master
  • rillian certified madscientist as Journeyer
  • abraham certified madscientist as Master
  • madhatter certified madscientist as Master
  • bataforexport certified madscientist as Master
  • sqlguru certified madscientist as Master
  • jnewbigin certified madscientist as Master
