Sunday, September 13, 2009

The Database of Intentions

On the surface, our search histories do not appear to be particularly sensitive information, and they seem unlikely to reveal our identity. However, recent history demonstrates the flaw in this logic and shows that our search histories can easily reveal who we are. In August 2006, AOL released 20 million "anonymized" search queries from approximately 650,000 users to the research community.

AOL anonymized these search histories by obfuscating or removing 'personally identifiable information' such as usernames and IP addresses. AOL replaced usernames with randomized unique identifiers.
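
The post does not describe AOL's exact procedure, so the following is only a minimal sketch of the general idea: each username is swapped for a random unique number, and the queries themselves are left untouched. The function name `pseudonymize`, the usernames, and the ID range are all assumptions made for illustration.

```python
import secrets

def pseudonymize(log):
    """Replace each username with a random numeric ID, reused for that user."""
    ids = {}      # username -> assigned random ID
    used = set()  # IDs already handed out, to keep them unique
    released = []
    for username, query in log:
        if username not in ids:
            new_id = secrets.randbelow(10_000_000)
            while new_id in used:
                new_id = secrets.randbelow(10_000_000)
            used.add(new_id)
            ids[username] = new_id
        released.append((ids[username], query))
    return released

# Hypothetical log entries; the usernames are made up.
log = [
    ("user_a", "landscapers in Lilburn, Ga"),
    ("user_b", "weather tomorrow"),
    ("user_a", "60 single men"),
]
released = pseudonymize(log)
```

Note what survives this step: every query from the same person still carries the same ID, and the text of each query is unchanged. That is what makes the released data useful to researchers, and also what makes it possible to build a profile of one person.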

Reporters from the New York Times analyzed these search histories and were able to identify user #4417749. The reporters noted that user #4417749 searched for:
  • landscapers in Lilburn, Ga
  • 60 single men
  • homes sold in shadow lake subdivision gwinnett county georgia
  • several people with the last name Arnold
A quick reference of other outside sources led the Times reporters to Thelma Arnold, a now 65-year-old widow living in Lilburn, Georgia.

This example reveals the false promise of anonymization. In particular, it is difficult for a database administrator to anonymize one data source when the administrator does not know what other data sources an investigator is able to access. In this case, the Times reporters could have used the phone book or perhaps property records from Lilburn, Georgia. Combining such sources with the search history allowed the reporters to identify Thelma Arnold.
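
To make the point concrete, here is a toy sketch of how an outside source can be combined with a pseudonymous search history. It is not the method the reporters used, which the post does not detail. The directory records are entirely made up, the third query is a hypothetical stand-in for the surname searches, and the helper name `link` is an assumption.

```python
def link(queries, directory):
    """Return directory entries whose town and surname both appear in the queries."""
    text = " ".join(queries).lower()
    return [
        person
        for person in directory
        if person["town"].lower() in text and person["surname"].lower() in text
    ]

# Queries attached to one pseudonymous ID.
queries = [
    "landscapers in Lilburn, Ga",
    "homes sold in shadow lake subdivision gwinnett county georgia",
    "arnold family",  # hypothetical wording for the surname searches
]

# Made-up records standing in for a phone book or property roll.
directory = [
    {"name": "Resident A", "surname": "Arnold", "town": "Lilburn"},
    {"name": "Resident B", "surname": "Arnold", "town": "Macon"},
    {"name": "Resident C", "surname": "Baker", "town": "Lilburn"},
]

matches = link(queries, directory)
```

Neither the town nor the surname identifies anyone on its own, but together they narrow the made-up directory to a single record. The plain substring matching is deliberately crude; the point is only that the administrator releasing the queries has no control over what the second data source contains.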


Vincent said...

I realize that this post is simply "food for thought", but I think it might be stretching the possibilities of what hackers can actually do with search history information.

Based on the 4 searches that Thelma Arnold had in her history, it's difficult to see how anyone could narrow it down to her. I think the major concern should lie with "A quick reference of other outside sources"; that is what people should be worried about. It was that information that made it possible to link Thelma to the search history.

Regardless, in the rare case that someone found out who Thelma was through her search history, what can a hacker do knowing that Thelma lives in GA, may be looking into real estate, and needs a new hubbie? ... I don't think much.

Rachel said...

Vincent, I think your point is valid; however, I don't think it touches on the heart of the situation. The possibility for our searches to be aggregated and connected to our identities changes the whole notion of privacy in the technology era. As mentioned in class, if such an activity were to become popular, people's behavior may change (e.g., they may stop searching for more private matters online).

Jed said...

Ultimately, people are still going to use Google and other search engines, despite their poor privacy policies, because people are ridiculously lazy. Google didn't become popular because it was better than what was out there; it just happened to have a nice white layout that anyone could use.

Not many people understand the ramifications of their search histories because, as we mentioned, the ways they'll be put to use may not have even been invented yet.

And even so, we, the students of Intro to Information Privacy, DO understand the extent to which our data is stored, and it's not like we're going to stop using search engines (or posting comments on Blogger).

At least losing our privacy produces one useful service: humor. I can't wait for the day the presidential candidates of our generation will have to explain their adolescent search histories.

"I did not Google 'having sexual relations with that Klingon woman...'"

Anonymous said...

Jed, you have a very good point. I know that I use Google for just about everything, from accurate spelling to directions from my college to God knows where. Without Google, I would be lost. It is the sad truth.
I wonder if it would be a good tactic to begin searching for things to throw people off. Like searching for directions from some random house in Mississippi more often than from the house where you actually live...? It sounds like too much work, but I honestly can't think of a better solution to maintain some form of privacy.

Or said presidential candidate could always say: "I did not Google 'having sexual relations with that Klingon woman...' It was my evil twin using my computer" ;)