Data mining: key to surveillance – and modern science…
As master leaker Edward Snowden searches for asylum, revelations of widespread collection of telephone and email records by the National Security Agency have set teeth on edge.
The Obama Administration says the data bring security, yet voices on both the left and the right have condemned the snooping as an invasion of privacy.
Data is not information, but rather the raw material for understanding. Still, one thing is certain: as the NSA spends billions perfecting new means of “mining” information from its mountains of data, it benefits from the plummeting price of computer storage and processing.
Data mining is a broad term for methods, frequently called algorithms and usually implemented in software, that aim to extract information from huge sets of data.
Increases in the amount of data, and in the ability to extract information from it, are also affecting the sciences, says David Krakauer, director of the Wisconsin Institute for Discovery. “A lot of science is now tracking Moore’s law, in the sense of the exponential increase in computer power, memory storage, and the exponential reduction in cost.”
A hard drive capable of holding a terabyte of data might have cost $1,000 around 2005, “but now you can put that on a thumb drive for less than $100,” says Krakauer, who studies the evolution of intelligence. The current talk about big data and data mining “is happening because we are in the middle of an earthquake; we feel it in a way we did not before,” Krakauer says.
As our lives leave more tracks through phones, credit cards, e-commerce, Internet and email, the growing commercial impact of big data shows when:
- you search for a flight to Tuscaloosa and then see websites plastered with promos for Tuscaloosa hotels
- you watch a movie that used computer graphics built on data measured in hundreds of thousands of gigabytes
- you shop at stores arranged to maximize profit based on data-mining of customer activity
- airlines change their prices unpredictably, based on algorithms that predict future demand for seats
- a smart-phone app identifies your location, so you receive offers from nearby restaurants
Is Big Data watching you?
Beyond security and commerce, big data and data mining are also surging in science. As more instruments with finer sensors return ever-more overwhelming data streams, more analytical horsepower is needed. In fields like meteorology, petroleum exploration and astronomy, gushers of data support — even demand — a new level of analysis and insight.
One milestone in the emergence of big data in medicine was 2003, when the first human genome was completed. Since then, the breakthrough genome has been augmented by thousands of others for individuals, primates, mice and bacteria. With billions of “letters” per genome, the threat of computational confusion helped spawn the new field of bioinformatics, which harnesses software, hardware and sophisticated algorithms to support new types of science.
Another example of bioinformatics comes from the National Cancer Institute, where Susan Holbeck tested 5,000 pairs of FDA-approved cancer drugs against 60 cell lines. After 300,000 experiments, Holbeck says, “We know the level of RNA expression in every gene in each of the cell lines. We have sequence data, protein data, and data on micro RNA expression. We can take all of that, do data mining and see why one cell line would respond well to combinations while another cell line would not. We can take a pair of observations and turn it into a rational, targeted drug that we can test in the clinic.”
Truthy or consequences
As medical scientists try to cope with cancer, bacteria and viruses, political chatter has “gone viral” on the Internet. The Twittersphere has surpassed half a billion tweets per day, and its political clout is surging, confronting clean-government groups with a phenomenal data-mining challenge.
The goal of the Truthy project at Indiana University is to unearth insights from this daily deluge, says post-doctoral researcher Emilio Ferrara. “Truthy is a tool to allow researchers to study information diffusion in Twitter. By identifying keywords and tracking the activity of users online, we study the discussion that is ongoing.”
Truthy was developed by Indiana researchers Fil Menczer and Alessandro Flammini. Each day, the project’s computers screen upwards of 50 million tweets for patterns.
One key interest is “astroturf,” Ferrara says: orchestrated persuasion campaigns that supposedly come from the grass-roots but are actually issued by “individuals and organizations that have an interest in spreading information that is not correct.”
During the 2012 election, a series of tweets claimed that Republican presidential candidate Mitt Romney had gained a suspiciously large number of Facebook followers. “People investigating found that it was not caused by Republicans or Democrats,” Ferrara says. “Someone else was behind it. It was an orchestrated campaign to defame Romney, to make people believe he was buying followers.”
Astroturf campaigns often carry hallmarks, Ferrara says. “If you want to run a massive defaming campaign, you need a lot of Twitter accounts,” including robot-run fake accounts that tweet and retweet the chosen messages. “We are able to identify these automatic activities by analyzing the features of the tweets.”
As the number of tweets doubles year by year, can anything ensure transparency in e-politics? “The goal of our project is to allow technology to grasp a little of this information,” Ferrara says. “It is not possible to find everything, but even if we are able to find a little bit, that is better than nothing.”
Big data in the mind’s eye
The human brain is the ultimate calculating machine, and the ultimate big-data predicament, with an uncountable number of possible connections between individual neurons. The Human Connectome Project is an ambitious effort to map interactions among the different brain regions.
genome: an organism’s entire genetic information, encoded in DNA or, for some viruses, RNA
transcriptome: the complete set of RNA “readings” produced from an organism’s DNA
proteome: all proteins that can be expressed by an organism’s genes
metabolome: all small molecules, including intermediates and final products, of metabolism in an organism
The goal of the connectome “is to collect advanced neuroimaging data, along with cognitive, behavioral and demographic data on 1,200 individuals” who are neurologically healthy, says Daniel Marcus, head of informatics at the Connectome’s facility at Washington University in St. Louis.
The project is using three types of magnetic resonance imaging to view the structure, function and connections in the brain. When data collection finishes two years from now, Marcus expects connectome researchers to be slogging through about one million gigabytes of data.
One key task is “parcellation,” generating maps of brain regions, which were originally identified two or three centuries ago, based on staining a small number of brains. “We will have data on 1,200 individuals,” Marcus says, “so we can look at how this varies among individuals, and look at how they are connected.”
To identify links between brain regions, Marcus says, “We look at how spontaneous activity in the brain correlates between regions” in scans taken while subjects are resting. For example, if regions A and B are both spontaneously creating brain waves at 18 cycles per second, “this implies those are networked,” Marcus says. “We will use those correlations across the whole brain to create a matrix that shows how every point in the brain is correlated with every other point.” (These points are considerably larger than cells, which MRIs cannot “see.”)
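The correlation step Marcus describes can be sketched in a few lines of Python. This is only a toy illustration, not the Connectome project’s actual pipeline: the three region names, the made-up signals and the helper functions `pearson` and `correlation_matrix` are all assumptions invented for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(signals):
    """Correlate every region's time series with every other region's."""
    names = list(signals)
    return {a: {b: pearson(signals[a], signals[b]) for b in names} for a in names}

# Hypothetical resting-state signals: 2 seconds sampled 100 times per second.
t = [i / 100 for i in range(200)]
signals = {
    "A": [math.sin(2 * math.pi * 18 * s) for s in t],        # 18 cycles per second
    "B": [0.5 * math.sin(2 * math.pi * 18 * s) for s in t],  # same rhythm, weaker
    "C": [math.sin(2 * math.pi * 7 * s) for s in t],         # a different rhythm
}

matrix = correlation_matrix(signals)
for a in matrix:
    print(a, [round(matrix[a][b], 2) for b in matrix[a]])
```

In this toy case regions A and B, which share a rhythm, correlate strongly, while C correlates with neither. A real matrix would cover every measured point in the brain rather than three regions.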
Galaxy zoo: Crowd-sourcing to the heavens!
The Galaxy Zoo project breaks the rule for big data: Instead of putting data through a massive, computerized data-mining operation, it feeds images to motivated volunteers who do basic classifications of galaxies. The Zoo, launched in 2007, traces to Oxford, England, where astronomer Kevin Schawinski had just finished staring at 50,000 images from the Sloan Digital Sky Survey.
According to William Keel, a professor of astronomy at the University of Alabama and a member of the Zoo science team, Schawinski’s advisor suggested he complete the set of 950,000 images. “His eyes were falling out of his head, and so he headed to a pub where he encountered Chris Lintott, and in classic fashion, they sketched the web structure of Galaxy Zoo on the back of a napkin.”
Galaxies are a classic big-data problem: a state-of-the-art telescope scanning the entire sky would likely see 200 billion of these star worlds. However, “There is a constellation of issues related to cosmology and galaxy demographics that could be addressed by having a lot of people do a fairly simple sort of classification,” says Keel. “Classifications that are trivial after a five-minute tutorial to this day are not really amenable to algorithms.”
Galaxy Zoo’s startup was so successful that user traffic physically damaged a server, Keel says.
Now that all 950,000 images in the Sloan survey have been seen an average of 60 times apiece, the Zookeepers have moved on to larger surveys. Science is being served, Keel says. “I have gotten a lot of mileage out of oddball things that people have found,” including backlit galaxies.
Galaxy Zoo relies on statistics, multiple viewers and logic to process and check data. If the proportion of viewers who think that a certain galaxy is elliptical remains fixed as more people see it, the galaxy is retired from viewing.
However, for rarer objects, Keel says, “You may need 40 or 50 viewers.”
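A retirement rule of this kind might look something like the sketch below. The function name `should_retire` and the thresholds `min_votes`, `window` and `tolerance` are assumptions chosen for illustration; they are not Galaxy Zoo’s actual code or settings.

```python
def should_retire(votes, min_votes=20, window=10, tolerance=0.05):
    """Return True when the "elliptical" vote fraction has stopped moving.

    votes: booleans in arrival order, True meaning a viewer said "elliptical".
    The overall fraction is compared with the fraction that held before the
    most recent `window` votes came in.
    """
    if len(votes) < min_votes or len(votes) <= window:
        return False  # not enough viewers yet to judge stability
    earlier = votes[:-window]
    fraction_before = sum(earlier) / len(earlier)
    fraction_now = sum(votes) / len(votes)
    return abs(fraction_now - fraction_before) <= tolerance

# Hypothetical vote histories.
settled = [True] * 30                    # everyone agrees: retire
too_few = [True] * 5                     # too few viewers so far
shifting = [True] * 10 + [False] * 10    # opinion still moving: keep showing

print(should_retire(settled), should_retire(too_few), should_retire(shifting))
```

For rarer objects, the same check could simply be run with a higher `min_votes`, in line with Keel’s figure of 40 or 50 viewers.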
Citizen science is developing its own principles, Keel adds. The volunteer’s work “has to contribute to a real, important research problem, in a way that can’t be done by any existing software. Clicks should not be wasted.”
The Zoo approach is being copied and refined by zooniverse.org, a parent organization that runs about 20 projects on, for example, tropical cyclones, the surface of Mars and climate data stored in ship logbooks.
Eventually, software may nudge out the volunteers, Keel says, but the line between computer and human is fluid. The Supernova Zoo, for example, was shut down after software learned the task.
We were surprised to learn that the huge data sets being amassed by volunteers are ideal for teaching classification to computers. “Some Galaxy Zoo users really hate that,” Keel says. “They loudly resent their clicks being used to train software. But we say, don’t waste the click. If someone walks in with a new algorithm that works as well, people won’t have to do that.”
Yearning for learning
More training has also benefited the long efforts to improve pattern recognition in images and speech, says Krakauer of UW-Madison. “It doesn’t just get better, it just starts to work. Five or 10 years ago, the idea of Siri on the iPhone was unthinkable; speech recognition was terrible. Now we have this vast number of data sets that trained the algorithm, and all of a sudden they work.”
The utility of a giant dataset may go through a “phase transition,” Krakauer adds, after a relatively small change in processing capacity leads to a breakthrough in results.
“Big data” is a relative rather than absolute term, Krakauer points out. “Big data can be seen as a ratio, the amount you can compute to the amount of data you have to compute on. There has always been big data. If you think about Tycho Brahe [Danish astronomer: 1546 to 1601] who collected data on the position of the planets, we did not have Kepler’s theory [explaining the motions of the planets], so the ratio was skewed. That was the big data of that age.”
Big data becomes an issue “when we have the technology that allows us to collect and store data that has outpaced our ability to reason about the system under scrutiny,” Krakauer says.
We wondered whether, as software continues to formulate decisions in science, commerce and security based on complex calculations with unimaginably large databases, we are turning too much power over to the machines. Behind our backs, automatic decisions are being made without any human understanding of the relationship between input and output, between data and decision. “This is what I work on,” Krakauer responded. “My research is on the evolution of intelligence in the universe, from the Big Bang to the brain. I have no doubt that what you said is true.”
– David J. Tenenbaum
Terry Devitt, editor; S.V. Medaris, designer/illustrator; Yilang Peng, project assistant; David J. Tenenbaum, feature writer; Amy Toburen, content development executive