Advogato: Blog for ssp

4 Jul 2011 (updated 4 Sep 2011 at 02:06 UTC) »

Fast Multiplication of Normalized 16 bit Numbers with SSE2

If you are compositing pixels with 16 bits per component, you often need this computation:

uint16_t a, b, r;

r = (a * b + 0x7fff) / 65535;

There is a well-known way to do this quickly without a division:

uint32_t t;

t = a * b + 0x8000;

r = (t + (t >> 16)) >> 16;

Since we are compositing pixels we want to do this with SSE2 instructions, but because the code above uses 32 bit arithmetic, we can only do four operations at a time, even though SSE registers have room for eight 16 bit values. Here is a direct translation into SSE2:

a = punpcklwd (a, 0);
b = punpcklwd (b, 0);
a = pmulld (a, b);
a = paddd (a, 0x8000);
b = psrld (a, 16);
a = paddd (a, b);
a = psrld (a, 16);
a = packusdw (a, 0);

But there is another way that better matches SSE2:

uint16_t lo, hi, t, r;

hi = (a * b) >> 16;
lo = (a * b) & 0xffff;

t = lo >> 15;
hi += t;
t = hi ^ 0x7fff;

if ((int16_t)lo > (int16_t)t)
    lo = 0xffff;
else
    lo = 0x0000;

r = hi - lo;

This version is better because it avoids the unpacking to 32 bits. Here is the translation into SSE2:

t = pmulhuw (a, b);
a = pmullw (a, b);
b = psrlw (a, 15);
t = paddw (t, b);
b = pxor (t, 0x7fff);
a = pcmpgtw (a, b);
a = psubw (t, a);

This is not only shorter, it also makes use of the full width of the SSE registers, computing eight results at a time.

Unfortunately SSE2 doesn’t have 8 bit variants of pmulhuw, pmullw, and psrlw, so we can’t use this trick for the more common case where pixels have 8 bits per component.

Exercise: Why does the second version work?

Syndicated 2011-07-03 12:11:20 (Updated 2011-09-04 01:42:00) from Søren Sandmann Pedersen

30 Jun 2011 (updated 30 Jun 2011 at 11:39 UTC) »

6 Aug 2008 (updated 4 Jul 2011 at 18:28 UTC) »

17 Jun 2008 (updated 4 Jul 2011 at 18:29 UTC) »

10 Oct 2007 (updated 4 Jul 2011 at 18:29 UTC) »

9 Sep 2007 (updated 4 Jul 2011 at 18:29 UTC) »

10 Dec 2006 (updated 4 Jul 2011 at 18:29 UTC) »

27 Jul 2006 (updated 4 Jul 2011 at 18:31 UTC) »

25 Feb 2006 (updated 4 Jul 2011 at 18:30 UTC) »

7 Dec 2005 (updated 4 Jul 2011 at 18:31 UTC) »

Older blog entries for ssp (starting at number 13)