Poisoned cookies
Posted 6 Sep 2000 at 20:27 UTC by mskala 
Marketers invade our privacy to compile databases of our interests and buying habits, and waste our bandwidth with advertising.
The technologies they use get more baroque daily - from cookies to Web bugs to :CueCat and the Amazon price-changing scheme
that made Slashdot today. The usual solution is passive: disable all the "phone home" features on the software we use, filter out the
spam, and complain a lot. But what about a more active approach? With cable modems and DSL we've got lots of receive
bandwidth to play with. Why not write a fake client to simulate the behaviour of a real human surfer, while leaking false marketing data to
anyone who snoops on it? Could such a program poison the databases and make privacy invasion an unprofitable exercise?
I'm not the first to suggest this, of course, but I haven't seen anyone actually build it yet. Perhaps the time has come. For a long
time, a lot of us have been entering fake data in marketing surveys and so forth - but when it's a small-scale individual effort, the resulting
data points will just be filtered as noise. To really make a difference, I think the fake data has to be a big enough fraction of total traffic to
be caught by data mining (perhaps five or six hits on a site per day) and consistent enough to look like a trend. Can we create one or
more fake "market segments" and convince the advertisers to target them?
The system I envision would allow one to write a "persona template" describing a fake person. You install the client on your system
with one or more persona templates, and it generates some imaginary people based on those templates. Each imaginary person (or
"persona") would have his or her own complete set of personal data and Web surfing habits. Then the software would assume the roles
of the personae and go surfing - downloading Web pages, accepting cookies, making search engine queries, etc. The marketers would
assemble their profiles of the personae and never know which profiles were human beings and which were robots.
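To make the template idea concrete, here is a minimal sketch of how a "persona template" could be expanded into individual personae. Everything in it is an assumption made for illustration: the field names, the `PersonaTemplate` and `Persona` classes, and the `generate_personae` function are hypothetical, and the sketch only generates the imaginary people; it does no surfing at all.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonaTemplate:
    """Hypothetical template: the ranges and pools a persona is drawn from."""
    name: str
    age_range: tuple      # (youngest, oldest), inclusive
    interests: tuple      # pool of interests to pick from
    hits_per_day: tuple   # (min, max) visits to a site per day, inclusive


@dataclass(frozen=True)
class Persona:
    """One imaginary person generated from a template."""
    template: str
    age: int
    interests: tuple
    hits_per_day: int


def generate_personae(template, count, seed):
    """Draw `count` personae from `template`; the same seed gives the same people."""
    rng = random.Random(seed)
    personae = []
    for _ in range(count):
        youngest, oldest = template.age_range
        how_many = rng.randint(1, len(template.interests))
        chosen = rng.sample(template.interests, how_many)
        personae.append(Persona(
            template=template.name,
            age=rng.randint(youngest, oldest),
            interests=tuple(sorted(chosen)),
            hits_per_day=rng.randint(*template.hits_per_day),
        ))
    return personae


if __name__ == "__main__":
    # Made-up template; "five or six hits on a site per day" follows the article.
    yak_fancier = PersonaTemplate(
        name="yak-fancier",
        age_range=(25, 40),
        interests=("yak grooming", "fondue", "polka"),
        hits_per_day=(5, 6),
    )
    for persona in generate_personae(yak_fancier, count=3, seed=2000):
        print(persona)
```

Seeding the generator keeps each persona consistent from one run to the next, which matters if the fake data is supposed to look like a trend rather than noise.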
We could make it fun to design imaginative templates, and
there'd be an incentive to distribute your template because the more people using it, the more likely the marketers would target it. The
system would become a sort of a sport: you win if you can convince the marketers to do something stupid. Such an unequal contest
might seem cruel to the marketers (the same kinds of objection raised to activities like "sport fishing"), but I don't have much sympathy
with marketers and don't object to breaking their "business models" for my own amusement. Apart from the use of making trouble for
marketers, this kind of system could be useful in testing intranet servers. If you want a realistic load on your server, it could be a big
step up from the usual "download a page a thousand times" script you might hack up in ten minutes.
Legal consequences would be worth thinking about. There should not be anything illegal about downloading Web pages - it's not a
denial of service attack, the activities involved are by definition intended to be indistinguishable from normal use of the Web -
but once the marketers figured out what was going on, they might well try to sue somebody. I suggest that any such system ought to be
explicitly GPLed and the copyrights assigned to the FSF. Public domain anonymous publication might be less hair-raising, but could be
seen as an admission by the authors that there was something questionable about the system. A lot of weird issues could conceivably
be raised - for instance, if the program claims falsely to be IE or Netscape when it downloads the pages, then the makers of those
packages could allege some sort of trademark infringement.
The ethical side also merits discussion - is it okay to pollute marketing databases? My view is an emphatic yes, but it would be
interesting to hear the other side. A user could easily subvert their own copy to abuse a "you're paid for hits on this page" scheme.
They'd be hard-pressed to get others to help them with that (if I'm gonna abuse such a system I'll do it for my own profit, not yours) but it
could still be a problem. I see no way to prevent it and am not sure it needs to be prevented.
So... if we had enough participants, could this work? If it could work given enough participants, could we get that many
participants? How much trouble could we get into? Who would volunteer to help write it?
You may want to read a message I wrote to the local Linux club's mailing list about this concept, and of course Neal Stephenson's
Hack the Spew is mandatory.
Income, posted 6 Sep 2000 at 21:06 UTC by deekayen »
(Master)
I think you're leaving one important thing out of the picture. Advertisers are willing to pay as much as they do because ad companies are
getting better at assuring them that their ads will make it to the correct audience. If burstnet or imigs can't assure people that their
banners will hit the correct audience, then sites that have ads for funding are hurt. Sure a lot of us put up sites and we could probably run
the bandwidth off a 56K modem (although we don't), but what if Altavista shut down because advertisers wouldn't pay them enough to
keep their equipment going?
Don't get me wrong, I'm not working for an ad company or something, but I think ads in most cases are a necessary evil. I don't think
any webmaster puts them on their site because he thinks they're fun to look at. If you hate information collection so badly, I think it might just
be easier to put up a cookie crusher program that singles out ad cookies... and I'm sure there are plenty out there somewhere.
It sounds like a lovely idea. One area of marketing that it cannot
subvert is the tracking of actual purchases -- I don't imagine we'd let
these fake personas actually buy anything. Now that data, who is
willing to spend how much for what, is perhaps the most valuable of all.
ultimately, posted 7 Sep 2000 at 05:06 UTC by darkewolf »
(Journeyer)
Ultimately it's not that they are doing these demographics that is the problem. The problem is that they are more often than not
doing demographics and data profiling but not telling the customer. I am going to lay odds that many of these ad companies keep a
fairly good track of who looks at what over multiple sites, using a mixture of 'webbugs' and IP address tracking (yes, proxies stuff
this up a fair bit, but there is the Originating-IP header).
If someone is tracked, that's okay, iff they are informed about this and given the option not to be tracked. As mentioned, this also
occurs in software (even on GNU/Linux); for instance MTV (commercial MPEG player) uses /usr/bin/mail to email usage to the central
office, whether you register the software or warez it.
Poisoning the demographics might actually be an advantage: it means that rather than the product decisions being made on the norm or
the mean, the marketers might actually ask a cross section of people and get a range of options.
Go fer it, it's your privacy, your life, and ultimately the individual and the community have more power than the corporate.
mskala: the idea of data poisoning is not new.
If I got that right, it all got big in the late eighties when people would
trade their grocery savings cards amongst each other to confuse
the living crap out of the - back then - barely networked databases
Safeway and co. had up to track customer behavior.
Nowadays the web is data mining source number one (but just
barely outnumbering the Safeway Savings Card and mail-in
rebates) and it's not all cookies we're dealing with here. Anyone
with a bit of a brain and a powerful database (Oracle will do for
now) is able to serve banners over his own server and analyze
IP addresses and clickthroughs based on a simple CLF logfile.
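As a rough illustration of how little is needed for this, the sketch below counts banner impressions and clickthroughs per client address from Common Log Format lines. The URL prefixes `/banners/` and `/click/` and the function name `count_hits` are assumptions invented for the example, not anything a particular ad server uses, and a real analysis would presumably load the results into a database rather than two in-memory counters.

```python
import re
from collections import Counter

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)$'
)


def count_hits(lines, banner_prefix="/banners/", click_prefix="/click/"):
    """Return (impressions, clicks): Counters keyed by client address.

    The two prefixes are hypothetical; adjust them to however the
    banner server lays out its URLs.
    """
    impressions = Counter()
    clicks = Counter()
    for line in lines:
        match = CLF_PATTERN.match(line.strip())
        if match is None:
            continue  # skip lines that are not valid CLF
        host, path = match["host"], match["path"]
        if path.startswith(banner_prefix):
            impressions[host] += 1
        elif path.startswith(click_prefix):
            clicks[host] += 1
    return impressions, clicks


if __name__ == "__main__":
    sample = [
        '10.0.0.1 - - [06/Sep/2000:20:27:00 +0000] "GET /banners/a.gif HTTP/1.0" 200 1234',
        '10.0.0.1 - - [06/Sep/2000:20:27:05 +0000] "GET /click/a HTTP/1.0" 302 -',
        '10.0.0.2 - - [06/Sep/2000:20:28:00 +0000] "GET /banners/a.gif HTTP/1.0" 200 1234',
    ]
    impressions, clicks = count_hits(sample)
    for host in impressions:
        print(host, impressions[host], clicks[host])
```

Nothing here depends on cookies, which is the point being made: address and request path alone already yield a per-visitor impression and click record.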
While cookies play a bigger and bigger role in this game, it's still
the good old methods that are used to cross-verify cookie
generated impressions. By poisoning cookies you'll introduce
worthless data into the databases, inevitably generating glitches
towards a false impression, but today's mining tools are smart
enough not to let these impressions screw up the whole analysis.
Let's dive one step deeper into this scenario and take a look at,
say, Freshmeat. At first glance, Andover's "take over" seems
like the friendly gesture from your neighbor offering you money and
a safe place to live because he likes what you do. Look at it from a
different angle and all of a sudden there are Linux vendors who'd kill
to get Freshmeat's data in a processed form: which applications
are most wanted, which ones are uploaded when, etc. - suddenly
one realizes that the ability to datamine has made sites the size of
Freshmeat or Slashdot possible, not just pure friendliness (there
is no friendliness in Pre-IPO-Land).
Back to the topic. To poison data you need to do more than just
fake surfing behavior, like shopping irregularly and completely against
pattern, or simply giving wrong answers on surveys. Even 200 bots
cannot introduce any significant poisoning into the databases
considering the number of "dumb" people out there that will give
them the right data.
I, personally, handle this issue like Safeway's Savings Card - I do
not contribute. A nice little Squid does its share of the work in
keeping me out of the records and basically I don't mind others
contributing or even making money off this mined data.
Cookies are not all bad, I certainly don't want to expose the
legitimate users of them (think advogato, Sourceforge, THATware
or even foundation) to some bot running amok.
I'm not sure that database poisoning is either useful or necessary. I take steps to guard my own privacy on-line, but I don't
necessarily see data-mining activities of this sort as being inimical to my interests.
There are two main costs to the consumer of this kind of activity:
- Invasion of privacy. This is hard to quantify, but there is no doubt that people dislike the idea of having large chunks of their "private"
lives available to the inspection of unknown and unaccountable individuals.
- Spam handling. Receiving an advert has a cost in bandwidth (AtGuard currently tells me it has blocked some 15Mb of banner adverts
since my last reboot) and in mindwidth. I need to decide what to do with it, even if only to delete it or otherwise ignore it. Also many
banner adverts are animated in order to grab attention. This is seriously horrible (and AtGuard nicely rewrites them to run once, BTW).
(Aside: I gather AtGuard is now integrated in Norton Internet Tools 2000 or some such name).
Advertisers are in a position to impose these costs on us without much consideration as to whether we want to pay them. What we need
is a way to return these costs to sender in order to punish the baddies and reward the goodies. Blanket punishment of everyone doesn't
improve behaviour because it just becomes a part of the cost of doing business.
Bear in mind that the point of such activities is to offer me goods that I want to buy. An advert which merely annoys me has done the
advertiser no good at all. The problem at the moment is that this process is so hit and miss that the cost of receiving and comprehending
an advert outweighs the probable benefit of finding something I want. If the accuracy of advertising can rise to the point that I find the
adverts useful and informative then it becomes a net benefit to me instead of a net cost. If, for example, Amazon deduce from my
purchases that I am a Terry Pratchett fan then they can email me with news of the next TP book, and I would be interested to see that.
OTOH if they sell the list to someone selling TP merchandise then I won't be interested because I don't generally buy fan merchandise.
Paul.
And concerning surveys, to subvert them is possible with some
concerted effort. Keep in mind, when someone puts a form on
their page asking a series of questions about background and
salary and whatnot, I'm sure they are hoping to find that
the more affluent folks are using their site. They'd much
rather be able to sell ad space to big money car makers than to
phone psychics.
Randomly filling in the forms is not enough, such "noise" will just
disappear in the shuffle. We should all fill in the forms as if we
were underemployed, young, ethnic, welfare moms. Now, if
enough people did that, we'd actually put the aggregate in a
downward direction, at least with regard to perceived wealth.
This would act, I think, to subvert the goals of the survey more
than anything else.
Creating "fake" clickthrough and adview data by sending web-user bots
stumbling all over your least favorite online businesses does not seem
beneficial to the privacy of the individual who's dispatched the bots.
Say a site's marketing data has been horribly munged by a legion of
well-meaning anti-demographics generators. Now everyone visiting the
site gets totally random, useless, annoying ads, the advertisers drop
their business with the site because they are getting no clickthroughs,
and traffic to the site dries up as people are driven away by pointless
ads and the site's increasingly bland services, decaying from lack of ad
revenue. Even if you block ads already, you still get pissed when the
site runs down, and stop using it (why do you care about a site's
marketing data if you don't use the site?).
Legitimate sites, used by many people (even people who willingly submit
to a site's privacy policy and have read it in full) use banner ads and
cookies to provide real, convenient services and (occasionally)
well-targeted, useful
advertising. It's dangerous to assume that what you reasonably consider
to be an invasion of privacy must obviously be equally onerous to the
other users of a site. Corrupting the marketing data of a company, no
matter how much you dislike their policies or site, hurts the collective
user population.
There is plenty of software out there to block ads and squish cookies,
and ISPs have massively improved their ability to keep spam from hitting
your inbox. Anyone who cares enough about privacy to want to implement
punitive measures against marketroids.com (sorry if this is an actual
site :o) has probably already switched to a cleaner ISP and filters
everything. So what's the point of making the spam and ads you barely
receive now any more randomly-targeted and useless? There are countless
more effective ways of improving your privacy than engaging in a
quixotic battle with terabytes of data-warehoused marketing information.
Notably, it surprises me that grocery stores feel the need to use a card to get an "ID token" about the gentle purchaser.
After all, many of the purchasers use credit cards or ATM cards, which contain, on their little
magnetic strips, the individual's name.
Given the name, and/or other ID info on the credit card, they should be able to do all the correlations that they
may feel the need to do without collecting any information beyond what they truly need to have in order to
complete the transaction. For those that buy using cheques, there's even more personal information
that the grocery store must collect to give out a "cheque cashing card."
The point here is that there are some pretty potent ways that significant demographic and psychometric data forcibly
gets through when it comes to real transactions, and since those transactions led to people deciding
to SPEND MONEY, they are the activities involving psychometric data that you can really bank on.
You can create a web bot that goes off and throws garbage at the "demographers," but all this really does is
to demonstrate that transactions that don't lead to spending money are vastly less "honest" than those that do.
That being said, I think that it's a neat idea to create such a web bot, and possibly of some small utility. If the "bot"
were sufficiently simple to install, configure, and use, some of the "demographic sector" that were willing to contribute
CPU to things like the distributed.net contest would be willing to contribute
a bit of network bandwidth to the NukeDemographics.net Project. And while this wouldn't do much to
influence the psychometric analysis at the point of sale, it could indeed go somewhere in
discouraging companies from working real hard to collect the largely worthless psychometric
statistics that come from "just browsing the web."
Some responses, posted 7 Sep 2000 at 18:49 UTC by mskala »
(Journeyer)
deekayen says: if we put the advertisers out of business, then people who depend on advertising revenue will be harmed. True, but
isn't that the point? Yes, there are Web sites that I like that currently depend upon advertising revenue. But I hate
marketing. I'm willing to risk endangering the Web sites I like that do have ads, if it could mean eliminating ads. I believe the sites that
deserve to survive would be able to survive on other income sources if advertising weren't available. I also think you're overestimating the
possible success rate of my plan - which is flattering, but not my own view. I don't think this could really, by itself, shut down the entire
concept of Internet marketing. I'd be very glad if it could, but the expected result is just to make it a little trickier.
stimuli points out that I'm unlikely to be willing to let my robots buy things I don't want, just to mess with the sellers' minds. True, but
there are plenty of things I can do that don't cost. The commercial interests have very carefully structured the Web to make it easy and
cheap for me to view their advertising. Thus, I can view a lot of advertising without incurring significant costs to myself.
jLoki appears to be taking my article title too literally. I don't mean that HTTP cookies should be the
only or even the primary target, I'm suggesting that the robots should do everything they can to leak their fake personal information.
Poisoned cookies is a nice catchy name, but I agree that Web page downloading, form filling, and similar activities are probably more
useful as specific tactics. As far as the scale necessary to have an effect, I don't know enough about data mining to give specific
quantitative answers on that. I'm not sure even the commercial data miners know - I believe that just like in all other fields of comp sci,
commercial practice is at least 20 years behind what the academics are doing. It calls for research. But it's not necessary to truly
render the entire database useless just to have a good effect. To carry the poison analogy further, maybe we can't really poison the
database but we can make it taste bad. 200 bots will have an effect on the marketers' behaviour if they really look like a new market
segment.
PaulJohnson and slothrop question whether marketing data collection is really so
bad after all. I submit that yes, it is really that bad. Saying "The sites I like collect my data to serve me better, and I don't care what
data the sites I don't visit might collect" is a dangerously short-sighted view because it assumes that the data is only ever going to be used by
the site that collects it, and only ever used in the way it was meant to be used at the time it was collected. Data is a superfluid: it flows
with zero viscosity and gets everywhere. It also lies dormant on backup tapes and then surfaces unexpectedly at any time in the
future. The world changes. The site that collected your data for an innocent reason today may go bankrupt on account of idiots like me
spoiling the game for everyone, and then it may sell its database to someone else. The site you visit might somehow become associated
with illegal activities and its list fall into the hands of the police. I expand on "reasons you don't want your surfing behaviour recorded" at
considerable length in a recent post to the VLUG mailing list, which you might want to read.
On the topic of "what is the payoff for consumers to interfere with marketing databases?": I can only speak for myself, but when I first
started using the Net however many years ago, it was really cool. Now it's a lot less cool and a lot more scary. I believe the biggest
single cause for the change in the Net is the ease with which commercial interests can use the Net to make money. If I can make it just
a little bit more difficult for commercial interests to make money on the Net, that's a positive result for me. I think some other users of
the Net agree with my feelings. If you aren't one of them, feel free to go write robot-detection software for the data mining companies. I'm
sure they'll pay you a lot of money for it. There is also the privacy motivation, of course, but as jLoki and
others point out, individuals who just care about their own privacy can defend it adequately with passive measures.
Poisoning databases sounds like a very good idea to me, but I don't really think that it would work merely for the scale of things. There
are millions of web surfers all over the world, surfing every day. If we really wanted to make the data in those databases useless,
everybody that reads slashdot on a daily basis would probably have to run 100 or more of those "poison" clients. Companies expect
when collecting huge amounts of data to have a certain percentage wrong, misleading, incomplete, or otherwise no good. You can make
it unprofitable for them if that percentage gets high enough, but they have a VERY high tolerance. I have worked for companies that do
direct mailings, and it is considered a VERY good run if you get a 3% response rate. Similarly with telemarketing, the percentage of
people who weren't interested in your product would have to be well over 90% to make telemarketing unprofitable.
So basically, in order to poison those databases to the extent necessary to make it unprofitable, you would need to use so many poison
clients and so many connections and so much bandwidth that you'd probably just bring the internet to a grinding halt. (If you're targeting
one specific corporation though, this probably isn't true. If you're just looking to screw doubleclick, that might be achievable if you limit
your goal to something very specific).
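The scale argument can be put in numbers with a deliberately crude model. Assume a marketer's real audience responds at some rate, fake profiles never respond, and the marketer stops making money once the observed rate drops below a break-even rate. The 3% figure is the "VERY good run" mentioned above; the 1% break-even rate is purely a made-up assumption, and the function names are hypothetical.

```python
def diluted_rate(real_rate, fake_fraction):
    """Observed response rate when `fake_fraction` of all profiles never respond."""
    if not 0.0 <= fake_fraction < 1.0:
        raise ValueError("fake_fraction must be in [0, 1)")
    return real_rate * (1.0 - fake_fraction)


def fake_fraction_needed(real_rate, break_even_rate):
    """Share of the database that must be fake to push the rate down to break-even."""
    if break_even_rate >= real_rate:
        return 0.0  # already unprofitable without any poisoning
    return 1.0 - break_even_rate / real_rate


def fakes_per_real(fake_fraction):
    """How many fake profiles are needed for every real one."""
    return fake_fraction / (1.0 - fake_fraction)


if __name__ == "__main__":
    real_rate = 0.03    # "VERY good run" response rate from the comment above
    break_even = 0.01   # assumed break-even rate, not a figure from the thread
    fraction = fake_fraction_needed(real_rate, break_even)
    print(f"fake share of database needed: {fraction:.1%}")
    print(f"fake profiles per real profile: {fakes_per_real(fraction):.1f}")
```

Under those assumed numbers, fakes would have to outnumber real profiles roughly two to one, and the lower the marketer's break-even rate, the worse that ratio gets, which is the "VERY high tolerance" point in arithmetic form.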
Also, how long do you think it would be before the marketers built in code to look for patterns common among poisoning software and
toss out that data? Which leads to changes in the poisoning software, which leads to changes in their database apps...it's basically a
battle of numbers of people working on a project, and I'm not convinced that 1,000 people who hate marketers working 1 hour a week to
foil them could even come close to 10 guys in the marketer's IT dept working full time to keep the data clean.
The laws of the land and the technologies that we're using just have problems. What I do is vote with my dollars. When it really comes
down to it, that's the only thing capitalists understand. If you're serious about this, you can boycott companies associated with marketers
who violate privacy.
ddccss, the Distributed Doubleclick Cookie Snarfing System, has collected 1,211,222 unique
doubleclick.net cookies. They're available for download.
Mskala writes, "I believe the sites that deserve to survive would be
able to survive on other income sources if advertising weren't
available."
$200+ a month for a colo box to provide anything like adequate bandwidth
for a reasonably popular site.
For a site that's not actually selling anything directly, what income
sources were you thinking of?