From the company that brought you the C programming language comes Hancock, a C variant developed by AT&T researchers to mine gigabytes of the company's telephone and internet records for surveillance purposes.
An AT&T research paper published in 2001 and unearthed today by Andrew Appel at Freedom to Tinker shows how the phone company uses Hancock-coded software to crunch through tens of millions of long distance phone records a night to draw up what AT&T calls "communities of interest" -- i.e., calling circles that show who is talking to whom.
The system was built in the late 1990s to develop marketing leads, and as a security tool to see if new customers called the same numbers as previously cut-off fraudsters -- something the paper refers to as "guilt by association."
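For the curious, the "guilt by association" check boils down to a set comparison. The Python sketch below is not AT&T's code, and the function names, phone numbers and scoring rule are all made up for illustration. It simply asks which numbers a new customer and a cut-off fraudster have both called.

```python
def shared_contacts(new_customer_calls, fraudster_calls):
    """Return the numbers dialed by both the new customer and the fraudster."""
    return set(new_customer_calls) & set(fraudster_calls)


def overlap_score(new_customer_calls, fraudster_calls):
    """Fraction of the new customer's distinct contacts the fraudster also called."""
    new_set = set(new_customer_calls)
    if not new_set:
        return 0.0  # no calls yet, so nothing to compare
    return len(new_set & set(fraudster_calls)) / len(new_set)


# Hypothetical numbers, for illustration only.
fraudster = ["555-0101", "555-0102", "555-0103"]
newcomer = ["555-0102", "555-0103", "555-0199", "555-0102"]

print(sorted(shared_contacts(newcomer, fraudster)))
print(overlap_score(newcomer, fraudster))
```

A real system would presumably weigh how often and how recently each number was called; this sketch treats every shared number equally.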
But it's of interest to THREAT LEVEL because of recent revelations that the FBI has been requesting "communities of interest" records from phone companies under the USA PATRIOT Act without a warrant. Where the bureau got the idea that phone companies collect such data has, until now, been a mystery.
According to a letter from Verizon to a congressional committee earlier this month, the FBI has been asking Verizon for "community of interest" records on some of its customers out to two generations -- i.e., not just the people who communicated with an FBI target, but also those who talked to people who talked to an FBI target. Verizon, though, doesn't create those records and couldn't comply. Now it appears that AT&T invented the concept and the technology. It even owns a patent on some of its data mining methods, issued to two of Hancock's creators in 2002.
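What "out to two generations" means in practice can be sketched as a two-hop walk over a graph of who called whom. The Python below is an illustration under assumed inputs (a list of caller/callee pairs with placeholder labels), not a reconstruction of anything AT&T or the FBI actually runs.

```python
from collections import defaultdict


def build_call_graph(call_records):
    """Map each number to the set of numbers it exchanged calls with."""
    graph = defaultdict(set)
    for caller, callee in call_records:
        graph[caller].add(callee)
        graph[callee].add(caller)  # treat contact as two-way
    return graph


def community_of_interest(graph, target, generations=2):
    """Return {number: hop distance} for numbers within `generations` hops."""
    distance = {target: 0}
    frontier = {target}
    for hop in range(1, generations + 1):
        next_frontier = set()
        for number in frontier:
            for contact in graph.get(number, ()):
                if contact not in distance:
                    distance[contact] = hop
                    next_frontier.add(contact)
        frontier = next_frontier
    del distance[target]  # the target is not part of its own circle
    return distance


# Hypothetical call records: (caller, callee).
records = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "A")]
circle = community_of_interest(build_call_graph(records), "A")
print(circle)
```

Here B and E are first-generation contacts of the target A, C is second-generation, and D, three hops away, falls outside the request.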
Programs written in Hancock work by analyzing data as it flows into a data warehouse. That differentiates the language from traditional data-mining applications, which tend to look for patterns in static databases. A 2004 paper published in ACM Transactions on Programming Languages and Systems shows how Hancock code can sift calling card records, long distance calls, IP addresses and internet traffic dumps, and even track the physical movements of mobile phone customers as their signal moves from cell site to cell site.
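The difference between the two approaches is easier to see in code. The sketch below is written in Python rather than Hancock and assumes a made-up record layout of (caller, callee, seconds). It folds each record into a small running profile as it arrives, instead of storing everything and querying it later.

```python
from collections import defaultdict


def new_profiles():
    """Create an empty store of per-number running summaries."""
    return defaultdict(lambda: {"calls": 0, "seconds": 0, "contacts": set()})


def update_profile(profiles, record):
    """Fold one call record into the caller's running summary."""
    caller, callee, seconds = record
    profile = profiles[caller]
    profile["calls"] += 1
    profile["seconds"] += seconds
    profile["contacts"].add(callee)


def process_stream(records, profiles=None):
    """Consume records one at a time; only the summaries are kept."""
    if profiles is None:
        profiles = new_profiles()
    for record in records:
        update_profile(profiles, record)
    return profiles


def nightly_feed():
    """Hypothetical stand-in for a switch's record feed."""
    yield ("555-0100", "555-0111", 60)
    yield ("555-0100", "555-0122", 30)
    yield ("555-0100", "555-0111", 90)


profiles = process_stream(nightly_feed())
print(profiles["555-0100"])
```

Because the feed is a generator, the raw records are never held in memory all at once, which is the general idea behind processing data in flight. Which summaries Hancock programs actually keep is not something this sketch claims to know.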
With Hancock, "analysts could store sufficiently precise information to enable new applications previously thought to be infeasible," the program authors wrote. AT&T uses Hancock code to sift 9 GB of telephone traffic data a night, according to the paper.
The good news for budding data miners is that Hancock's source code and binaries (now up to version 2.0) are available free to noncommercial users from an AT&T Research website.
The instruction manual (.pdf) is also free, and old-timers will appreciate its spare Kernighan & Ritchie style. The manual even includes a few sample programs in the style of K&R's Hello World, but coded specifically to handle data collected by AT&T's phone and internet switches. One of them reads in a dump of internet headers, works out which IP addresses were visited, records them and prints them out, all in fewer than 40 lines of code.
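The manual's sample is written in Hancock and is not reproduced here. As a rough stand-in, this Python sketch does the same kind of job on an assumed text format in which each line holds a source address followed by a destination address. The format and the function name are inventions for illustration.

```python
import ipaddress


def visited_addresses(header_lines):
    """Collect the distinct destination addresses, in first-seen order."""
    seen = []
    seen_set = set()
    for line in header_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        try:
            dst = ipaddress.ip_address(fields[1])
        except ValueError:
            continue  # second field is not a valid IP address
        if dst not in seen_set:
            seen_set.add(dst)
            seen.append(dst)
    return seen


# Hypothetical header dump using documentation-reserved addresses.
dump = [
    "10.0.0.1 192.0.2.7",
    "10.0.0.1 198.51.100.4",
    "10.0.0.2 192.0.2.7",
    "garbage",
    "10.0.0.1 not-an-ip",
]

for address in visited_addresses(dump):
    print(address)
```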
Another sample program included in the manual shows how a Hancock program could create historical maps of a person's travels by recording nightly what cell phone towers a person's phone had used or pinged throughout a day.
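The idea is simple enough to sketch in a few lines of Python. The ping format used here (phone number, ISO timestamp, tower ID) is an assumption for illustration, not the layout the manual's Hancock sample uses.

```python
from collections import defaultdict
from datetime import datetime


def daily_tower_history(pings):
    """Group tower IDs by (phone, calendar day), in time order without repeats."""
    history = defaultdict(list)
    ordered = sorted(pings, key=lambda p: datetime.fromisoformat(p[1]))
    for phone, timestamp, tower in ordered:
        day = datetime.fromisoformat(timestamp).date()
        towers = history[(phone, day)]
        if not towers or towers[-1] != tower:
            towers.append(tower)  # record only a change of tower
    return dict(history)


# Hypothetical pings: (phone, timestamp, tower ID).
pings = [
    ("555-0100", "2007-10-29T09:00:00", "T2"),
    ("555-0100", "2007-10-29T08:00:00", "T1"),
    ("555-0100", "2007-10-29T09:30:00", "T2"),
    ("555-0100", "2007-10-30T07:00:00", "T1"),
]

for (phone, day), towers in daily_tower_history(pings).items():
    print(phone, day, towers)
```

Run nightly and appended to a per-customer file, output like this would add up to the kind of historical travel map the manual describes.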
AT&T is currently defending itself in federal court against allegations that it installed, on behalf of the NSA, secret internet spying rooms in its domestic internet switching facilities. AT&T and Verizon are also accused of giving the NSA access to billions of phone records belonging to Americans, in order to data-mine them to spot suspected terrorists, and presumably to identify targets for warrantless wiretapping.