Older blog entries for ingvar (starting at number 309)

I was reading titus's recent post about data structures for k-mer filtering.

ONe thing struct me. Titus, you looked at binary Bloom filters, but seemingly ignored Counting Bloom filters (where instead of indexing a bit with your hashes, you index a counter of a suitable size, you do run into problems if you need both increments and decrements, since once you it max on the counter, you can no longer decrement).

Of course, it's not necessarily trivial going from a Bloom filter (counting or not) to the key, whereas the key is typically available when using a normal hash table.

I suspect, but have not verified, that a counting Bloom combined with the parallelization opportunity you found (and that one is the big win, I think) would be an equally interesting approach.

This morning, I finished off and packaged up a library (CL, linux; other OSes with CL implementations and OS-provided random (or random-backed pseudorandom) devices can probably be grafted in, but would require volunteers to test) to extract random bits from /dev/urandom and use those essentially unadorned to generate random numbers.

There's a single exported function from the package, RANDOMNESS:RND and takes an integer argument (I guess you could try feeding it a float, but it wouldn't do you that much good). That function then proceeds to extract log2 N bits and uses rejection smapling to find an integer in the 0..(N-1) span. As the bits sucked out of the internal pool run out (each bit is used only the once), they're replenished from the OS-provided randomness source.

All very nice, with some interesting corner cases for the unwary and probably still bug-ridden. But, such is life.

Download from here.

Neat(ish) hack...

Sometimes, I find myself writing anonymous functions, to fill out keyword arguments for functions or adapting the argument order. So, I thought, how hard would it be to write a macro to write the code for me? Turns out, not very complex, at all. The formatting is not QUITE what I started out with, as the Advogato edit box is a bit on the short end, but, hey...


(defmacro _ (form)
 (flet ((is-arg (sym)
	  (ignore-errors
	   (and (char= (char (symbol-name sym) 0) #\_)
		(cons
		 (parse-integer (symbol-name sym)
				:start 1)
		 sym)))))
   (let ((syms (loop for arg in form
		  for temp = (is-arg arg)
		  if temp collect temp)))
    `(lambda
	 ,(mapcar #'cdr (sort syms #'< :key #'car))
       ,form))))

With this in hand, you can, for example, easily make an adapter to parse C-style hex constants:


  (_ (parse-integer _1 :radix 16 :start 2))

Not that the lambda-wrapping of this would've been much more complex and I am not entirely sure this wins as far as readability is concerned, but that is as it may be. It's if nothing else a neat macro that would be more than a little tricky to pull off with a less capable macro facility.

So, one of the users of my Image library asked if there wasn't any way of using a font other than the rather ugly, built-in one. I said, roughly, "you can define a new one, in an analogous fashion to how the built-in oine is done?"

Apparently, that was not the wished-for answer and I started thinking. Bit-mapped fonts are (relatively) easy to deal with. There's a helluva lot of bitmapped fonts for X11. So, I set forth to code up a reader for PCF fonts, converting the bitmaps to internal format and some multi- font support in Image. All done, now.

Common Lisp is still very pleasant when dealing with binary file I/O, even if I wished that there was an easy way of changing the binaryness of a file on the fly.

Currently tooling away at asked-for functionality for my IMAGE library (specifically, a request was made to see if I can scare up another font or three, so I am currently noodling on a PCF font-file reader, as you do).

Common Lisp continues to be amazingly convenient for "binary I/O", although it does require a certain mind-set to consider it convenient, I guess.

I've just packaged up a new release of my IMAGE library, with some new functionality in place. It now supports copying (parts of) an image into another image (with either the same alpha across the whole copied section or with a provided alpha map, so one can do Clever Stuff that way).

I suppose I should write documentation for the image library at some point, but...

Idleness is the source of many bugs. I suppose. It's certainly the cause for a lot of code I write.

Lately, I have been pondering hashing of octet vectors. Well, actually, I've been pondering the hashing of strings, but these days strings need mangling to octet vectors before one can reasonably reason about them as numbers (in my case, turning a vector of characters into a possibly- longer (well, possibly-more-elemented) vector of (UNSIGNED- BYTE 8), containing a UTF-8 encoding of the character string).

But, what's a hash without an idea how well it performs? So, with a pluggable (or at least semi-pluggable) hash algo, what I now do is run it across a dictionary (in my case it is /usr/dict/words, containing a British word list) and see how many collisions one gets.

The next step is, of course, to try to figurte out what is and isn't bad, as far as collissions go. From memory, the last attempt say something like 600 collisions out of just under 70k words. I'd have to run a simulation to say if it's well above, well below or around what I'd statistically expect.

Next after that will have to be concatenations of words, though that MAY take a bit longer to test-run.

Interesting (but rather pointless) self-replicating shell- script:


#! /bin/cat
Self-replicating!

Make sure it's executable, then you can replicate the whole script by running it... But, maybe, there's some usefulness to it? Doubt it, though.

IN slightly less whingeing news, since the latest (well, second) installment of the annual Snooper Report has been turned down, I'll let it loose on an unsuspecting public (and, of course, the URL is too wide for the Advogato editing window, so now I'll get bots come and look for URLs with %0A in them, grr).

In almost as exciting news, I cobbled up a proof-of-concept of a blackhole maintenance library (CLI client, web client, libarry code, DB backend and a reaper suitable to stick in cron, to remove blackholes once they expire).

One of the little web toys I knocked together and occasionally use for pontificating is my Essays site. Since its initial creation, I've added a comment functionality, but due to the prevalence of spam, everything requires manual verification.

Not, in and of itself, a major problem, but I've noticed something quite disturbing in the spam comments (most, alas). The latest is that they're pointing to the SourceForge forum (with URLs that start http:// cmr.sourceforge.net/forum/) and I am in two minds about that.

One mind says I shouldn't bother complaining, because they should already have noticed. The other mind says I should, indeed, point it out to them (with a slight risk that the forum is zapped; this happened last time I pointed the issue out).

Instead, I decided that thee Right and Proper thing to do is to have a semi-public whinge. Don't know that it makes me feel any betetr, but at least I am not feeling any worse.

300 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!