Data dance: Big data and data mining

Data mining: key to surveillance – and modern science…

Video: NASA
Big ocean = big data: about a dozen observing systems supplied the data needed to make this never-before-seen visualization of ocean currents.

As master leaker Edward Snowden searches for asylum, revelations of widespread collection of telephone and email records by the National Security Agency have set teeth on edge.

The Obama Administration says the data bring security, yet voices on both the left and the right have condemned the snooping as an invasion of privacy.

Data is not information; it is the raw material for understanding. But one thing’s for sure: as the NSA spends billions perfecting new means of “mining” information from its mountains of data, it benefits from the plummeting price of computer storage and processing.

Data mining is a broad term for mechanisms, frequently called algorithms and usually implemented in software, that aim to extract information from huge sets of data.
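
To make that definition concrete, here is a toy sketch in Python (invented shopping records, not any real system’s code): one of the simplest data-mining tasks is counting which items appear together most often, the seed of the “market basket” analysis that retailers run at vastly larger scale.

```python
# Toy market-basket mining: extract frequent item pairs from raw records.
# Purely illustrative; real systems process millions of transactions.
from collections import Counter
from itertools import combinations

transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs", "coffee"},
    {"milk", "bread", "coffee"},
]

pair_counts = Counter()
for basket in transactions:
    # Count every unordered pair of items bought together.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequent pairs are the "information" distilled from the raw data.
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```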

Increases in the amount of data, and in the ability to extract information from it, are also affecting the sciences, says David Krakauer, director of the Wisconsin Institute for Discovery. “A lot of science is now tracking Moore’s law, in the sense of the exponential increase in computer power, memory storage, and the exponential reduction in cost.”

A hard drive capable of holding a terabyte of data might have cost $1,000 around 2005, “but now you can put that on a thumb drive for less than $100,” says Krakauer, who studies the evolution of intelligence. The current talk about big data and data mining “is happening because we are in the middle of an earthquake; we feel it in a way we did not before,” Krakauer says.

Image shows many small circular currents in the Atlantic Ocean and a big red stream flowing eastward from the Gulf of Mexico
This visualization of ocean surface currents between June 2005 and December 2007 is based on an integration of satellite data with a numerical model. Eddies and narrow currents transport heat and carbon in the oceans. The Estimating the Circulation and Climate of the Ocean project provides ocean flows at all depths, but only surface flows are used here. These visualizations are used to measure the ocean’s role in the global carbon cycle and monitor heat, water, and chemical exchanges within and between different components of the Earth system.
Data sources: sea surface height from NASA’s Topex/Poseidon, Jason-1, and Ocean Surface Topography Mission/Jason-2 satellite altimeters; gravity from the NASA/German Aerospace Center Gravity Recovery and Climate Experiment mission; surface wind stress from NASA’s QuikScat mission; sea surface temperature from the NASA/Japan Aerospace Exploration Agency Advanced Microwave Scanning Radiometer-EOS; sea ice concentration and velocity from passive microwave radiometers; temperature and salinity profiles from shipborne casts, moorings and the international Argo ocean observation system.

As our lives leave more tracks through phones, credit cards, e-commerce, the Internet and email, the growing commercial impact of big data shows when:

  • you search for a flight to Tuscaloosa and then see websites plastered with promos for Tuscaloosa hotels
  • you watch a movie that used computer graphics built on data measured in hundreds of thousands of gigabytes
  • you shop at stores arranged to maximize profit based on data-mining of customer activity
  • airlines change their prices unpredictably, based on algorithms that predict future demand for seats
  • a smart-phone app identifies your location, so you receive offers from nearby restaurants

Is Big Data watching you?

Beyond security and commerce, big data and data mining are also surging in science. As more instruments with finer sensors return ever-more overwhelming data streams, more analytical horsepower is needed. In fields like meteorology, petroleum exploration and astronomy, gushers of data support — even demand — a new level of analysis and insight.

Two researchers stand, one holding an iPad and the other a long scroll of electrocardiogram paper
MIT researchers John Guttag and Collin Stultz built a computer model to analyze formerly discarded electrocardiogram data from heart attack patients. Using data mining and machine learning to sift the massive data set, they associated three electrical abnormalities with a doubled or tripled risk of dying from a second heart attack within one year. The new approach could catch more high-risk patients, who usually go undetected by existing risk screening.

One milestone in the emergence of big data in medicine was 2003, when the first human genome was completed. Since then, the breakthrough genome has been augmented by thousands of others for individuals, primates, mice and bacteria. With billions of “letters” per genome, the threat of computational confusion helped spawn the new field of bioinformatics, which harnesses software, hardware and sophisticated algorithms to support new types of science.

Another example of bioinformatics comes from the National Cancer Institute, where Susan Holbeck tested 5,000 pairs of FDA-approved cancer drugs against 60 cell lines. After 300,000 experiments, Holbeck says, “We know the level of RNA expression in every gene in each of the cell lines. We have sequence data, protein data, and data on microRNA expression. We can take all of that, do data mining and see why one cell line would respond well to combinations while another cell line would not. We can take a pair of observations and turn it into a rational, targeted drug that we can test in the clinic.”

Truthy or consequences

As medical scientists try to cope with cancer, bacteria and viruses, political chatter has “gone viral” on the Internet. The Twittersphere has surpassed half a billion tweets per day, and its political clout is surging, confronting clean-government groups with a phenomenal data-mining challenge.

The goal of the Truthy project at Indiana University is to unearth insights from this daily deluge, says post-doctoral researcher Emilio Ferrara. “Truthy is a tool to allow researchers to study information diffusion in Twitter. By identifying keywords and tracking the activity of users online, we study the discussion that is ongoing.”

Truthy was developed by Indiana researchers Fil Menczer and Alessandro Flammini. Each day, the project’s computers screen upwards of 50 million tweets for patterns.

One key interest is “astroturf,” Ferrara says: orchestrated persuasion campaigns that supposedly come from the grass-roots but are actually issued by “individuals and organizations that have an interest in spreading information that is not correct.”

Twitter users’ avatars connected via green lines form several groups
February 27, 2012, Marc Smith
Big data watches “#bigdata.” These connections appeared among the Twitter users who tweeted “bigdata,” scaled by numbers of followers. Blue lines show connections created when users reply or mention; green lines show one person following another.

During the 2012 election, a series of tweets claimed that Republican presidential candidate Mitt Romney had gained a suspiciously large number of Twitter followers. “People investigating found that it was not caused by Republicans or Democrats,” Ferrara says. “Someone else was behind it. It was an orchestrated campaign to defame Romney, to make people believe he was buying followers.”

Astroturf campaigns often carry hallmarks, Ferrara says. “If you want to run a massive defaming campaign, you need a lot of Twitter accounts,” including robot-run fake accounts that tweet and retweet the chosen messages. “We are able to identify these automatic activities by analyzing the features of the tweets.”
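
As a rough illustration of what feature-based screening can look like (a hypothetical sketch, not the Truthy team’s actual classifier), a handful of behavioral measurements per account is often enough to flag the most blatant automation for human review:

```python
# Hypothetical bot-screening heuristic; thresholds and features are invented.
from dataclasses import dataclass

@dataclass
class Account:
    name: str
    tweets_per_day: float
    retweet_fraction: float   # share of the account's posts that are retweets
    followers: int
    following: int

def suspicion_score(a: Account) -> float:
    """Crude score; higher means more bot-like (illustrative only)."""
    score = 0.0
    if a.tweets_per_day > 100:                   # inhuman posting volume
        score += 1.0
    if a.retweet_fraction > 0.9:                 # almost no original content
        score += 1.0
    if a.following > 10 * max(a.followers, 1):   # mass-follows to gain reach
        score += 1.0
    return score

accounts = [
    Account("chatty_human", 12, 0.3, 800, 400),
    Account("retweet_cannon", 450, 0.98, 50, 4000),
]
for a in accounts:
    print(a.name, suspicion_score(a))
```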

As the number of tweets doubles year by year, can anything ensure transparency in e-politics? “The goal of our project is to allow technology to grasp a little of this information,” Ferrara says. “It is not possible to find everything, but even if we are able to find a little bit, that is better than nothing.”

Big data in the mind’s eye

The human brain is the ultimate calculating machine, and the ultimate big-data predicament, with an uncountable number of possible connections between individual neurons. The Human Connectome Project is an ambitious effort to map interactions among the different brain regions.

The Connectome is one of many data-drenched “omes”:

genome: an organism’s entire genetic information, encoded in DNA or, for some viruses, RNA

transcriptome: the complete set of RNA “readings” produced from an organism’s DNA

proteome: all proteins that can be expressed by an organism’s genes

metabolome: all small molecules, including intermediates and final products, of metabolism in an organism

The goal of the connectome “is to collect advanced neuroimaging data, along with cognitive, behavioral and demographic data on 1,200 individuals” who are neurologically healthy, says Daniel Marcus, head of informatics at the Connectome’s facility at Washington University in St. Louis.

Imagery of two brain hemispheres with regions in yellow, red, blue, green or purple
Image courtesy M. F. Glasser and S. M. Smith for the WU-Minn HCP consortium.
Colors show correlations in metabolic activity in the human cerebral cortex while 20 healthy subjects were at rest in the MRI scanner. Yellow and red regions are functionally connected to a “seed” location in the parietal lobe of the right hemisphere (yellow spot at top right). Regions in green and blue are weakly connected or not connected at all.

The project is using three types of magnetic resonance imaging to view the structure, function and connections in the brain. When data collection finishes two years from now, Marcus expects connectome researchers to be slogging through about one million gigabytes of data.

One key task is “parcellation,” generating maps of brain regions, which were originally identified two or three centuries ago, based on staining a small number of brains. “We will have data on 1,200 individuals,” Marcus says, “so we can look at how this varies among individuals, and look at how they are connected.”

To identify links between brain regions, Marcus says, “We look at how spontaneous activity in the brain correlates between regions” in scans taken while subjects are resting. For example, if regions A and B are spontaneously creating brain waves at 18 cycles per second, “this implies those are networked,” Marcus says. “We will use those correlations across the whole brain to create a matrix that shows how every point in the brain is correlated with every other point.” (These points are considerably larger than cells, which MRIs cannot “see.”)
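
As a minimal sketch of that computation (synthetic data, not the Connectome pipeline), correlating each region’s resting time series with every other region’s yields the kind of connectivity matrix Marcus describes:

```python
# Build a region-by-region correlation ("connectivity") matrix from
# simulated resting-state signals. Data are synthetic and illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_regions, n_timepoints = 6, 300

# Fake signals: regions 0 and 1 share a common oscillation, standing in
# for the spontaneous activity measured during rest.
t = np.arange(n_timepoints)
shared = np.sin(2 * np.pi * t / 17.0)
signals = rng.normal(size=(n_regions, n_timepoints))
signals[0] += 2 * shared
signals[1] += 2 * shared

# One call gives the matrix of how every region correlates with every other.
connectivity = np.corrcoef(signals)
print(np.round(connectivity[:3, :3], 2))  # regions 0 and 1 correlate strongly
```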

Galaxy Zoo: Crowd-sourcing to the heavens!

The Galaxy Zoo project breaks the rule for big data: instead of putting its data through massive, computerized data mining, it feeds images to motivated volunteers who do basic classifications of galaxies. The Zoo, launched in 2007, traces to Oxford, England, where astronomer Kevin Schawinski had just finished staring at 50,000 images from the Sloan Digital Sky Survey.

According to William Keel, a professor of astronomy at the University of Alabama and a member of the Zoo science team, Schawinski’s advisor suggested he complete the set of 950,000 images. “His eyes were falling out of his head, and so he headed to a pub where he encountered Chris Lintott, and in classic fashion, they sketched the web structure of Galaxy Zoo on the back of a napkin.”

A big spiral galaxy with a small galaxy behind it
WIYN telescope, Anna Manning, Chris Lintott, William Keel
This backlit galaxy, one of almost 2,000 found by Galaxy Zoo volunteers, is lit by the galaxy behind it. Backlight highlights dust in the foreground galaxy. Interstellar dust is a key player in star formation, but is also produced by stars, so tracing its amount and location is critical to understanding the history of galaxies.

Galaxies are a classic big-data problem: a state-of-the-art telescope scanning the entire sky would likely see 200 billion of these star worlds. However, “There is a constellation of issues related to cosmology and galaxy demographics that could be addressed by having a lot of people do a fairly simple sort of classification,” says Keel. “Classifications that are trivial after a five-minute tutorial are, to this day, not really amenable to algorithms.”

Galaxy Zoo’s startup was so successful that user traffic physically damaged a server, Keel says.

After all 950,000 images in the Sloan survey were seen an average of 60 times apiece, the Zookeepers have moved on to larger surveys. Science is being served, Keel says. “I have gotten a lot of mileage out of oddball things that people have found,” including backlit galaxies.

Galaxy Zoo relies on statistics, multiple viewers and logic to process and check data. If the proportion of viewers who think that a certain galaxy is elliptical remains fixed as more people see it, the galaxy is retired from viewing.

However, for rarer objects, Keel says, “You may need 40 or 50 viewers.”
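
Written out in Python, a hypothetical version of that retirement rule (illustrative only, not Galaxy Zoo’s production logic) might look like this:

```python
# Retire an image once the "elliptical" vote fraction has stabilized
# across a minimum number of independent viewers. Thresholds are invented.
def should_retire(votes: list[str], min_views: int = 20, tolerance: float = 0.05) -> bool:
    """votes is the ordered list of classifications ('elliptical', 'spiral', ...)."""
    if len(votes) < min_views:
        return False

    def elliptical_fraction(v):
        return v.count("elliptical") / len(v)

    # If adding the second half of the viewers barely moves the fraction,
    # the crowd's answer has converged and the galaxy can be retired.
    return abs(elliptical_fraction(votes[: len(votes) // 2]) - elliptical_fraction(votes)) < tolerance

votes = ["elliptical"] * 15 + ["spiral"] * 5 + ["elliptical"] * 14 + ["spiral"] * 6
print(should_retire(votes))  # True: the elliptical fraction has stabilized
```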

Citizen science is developing its own principles, Keel adds. The volunteer’s work “has to contribute to a real, important research problem, in a way that can’t be done by any existing software. Clicks should not be wasted.”

Holding a plug-in sensor, a man smiles in a backyard with electrical devices placed on tables and a lighted house behind him.
New doors open as the price of data and communication continues to fall. If you’re wondering how much water and energy each device in your house gobbles, MacArthur Fellow Shwetak Patel has a solution: wireless sensors that recognize the unique digital signature of each device. Patel’s smart algorithms, combined with a plug-in sensor, inexpensively identify the biggest wastrels. This Hayward, Calif. family was surprised to learn that video recorders were snarfing 11 percent of their household power.

The Zoo approach is being copied and refined by zooniverse.org, a parent organization that runs about 20 projects on, for example, tropical cyclones, the surface of Mars and climate data stored in ship logbooks.

Eventually, software may nudge out the volunteers, Keel says, but the line between computer and human is fluid. The Supernova Zoo, for example, was shut down after software learned the task.

Purple circles of different sizes, labeled with mental disorders, are connected by red and blue lines.
Mental disorders are typically considered case by case, but a study of 1.5 million patient records showed that a significant number of patients have more than one illness. At the University of Chicago, the Silvio O. Conte Center uses data mining to understand the causes of, and relationships among, neuropsychiatric disorders. “There are multiple [research] communities looking at the same problem,” said center director Andrey Rzhetsky. “We are trying to combine them all to model and analyze those data types jointly… looking for possible environmental factors.”

We were surprised to learn that the huge data sets being amassed by volunteers are ideal for teaching classification to computers. “Some Galaxy Zoo users really hate that,” Keel says. “They loudly resent their clicks being used to train software. But we say, don’t waste the click. If someone walks in with a new algorithm that works as well, people won’t have to do that.”
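
As a hedged sketch of that idea, volunteer votes can serve as training labels for an off-the-shelf classifier; the features and data below are invented purely for illustration.

```python
# Train a simple classifier on crowd-sourced labels (synthetic example).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 200

# Hypothetical image features: light concentration and a color index.
concentration = rng.normal(0, 1, n)
color = rng.normal(0, 1, n)
X = np.column_stack([concentration, color])

# Majority-vote label from volunteers: 1 = elliptical, 0 = spiral.
# Simulated so that the label depends mostly on concentration, plus noise.
y = ((concentration + 0.3 * rng.normal(0, 1, n)) > 0).astype(int)

model = LogisticRegression().fit(X, y)
print(f"training accuracy: {model.score(X, y):.2f}")
```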

Yearning for learning

More training data has also benefited the long effort to improve pattern recognition in images and speech, says Krakauer of UW-Madison. “It doesn’t just get better, it just starts to work. Five or 10 years ago, the idea of Siri on the iPhone was unthinkable; speech recognition was terrible. Now we have this vast number of data sets that trained the algorithms, and all of a sudden they work.”

The utility of a giant dataset may go through a “phase transition,” Krakauer adds, after a relatively small change in processing capacity leads to a breakthrough in results.

“Big data” is a relative rather than absolute term, Krakauer points out. “Big data can be seen as a ratio, the amount you can compute to the amount of data you have to compute on. There has always been big data. If you think about Tycho Brahe [Danish astronomer: 1546 to 1601], who collected data on the position of the planets, we did not have Kepler’s theory [explaining the motions of the planets], so the ratio was skewed. That was the big data of that age.”

Big data becomes an issue “when we have the technology that allows us to collect and store data that has outpaced our ability to reason about the system under scrutiny,” Krakauer says.

We wondered whether, as software continues to formulate decisions in science, commerce and security based on complex calculations with unimaginably large databases, we are turning too much power over to the machines. Behind our backs, automatic decisions are being made without any human understanding of the relationship between input and output, between data and decision. “This is what I work on,” Krakauer responded. “My research is on the evolution of intelligence in the universe, from the Big Bang to the brain. I have no doubt that what you said is true.”

– David J. Tenenbaum

Terry Devitt, editor; S.V. Medaris, designer/illustrator; Yilang Peng, project assistant; David J. Tenenbaum, feature writer; Amy Toburen, content development executive