The sheer volume of scientific data and published material continues to grow dramatically. At the same time, barriers persist that make it difficult for scientists and researchers to collaborate and share data. Access is one challenge, but simply finding additional data, or even becoming aware that new data exists, is getting more difficult as well.
A recent Economist article highlights a Chinese researcher who did just this by hand, stitching together dozens of data sets and reviewing more than 1,000 studies on addiction and genetics. Her team was able to identify 18 genes that led to addiction to at least one type of drug (alcohol, cocaine, nicotine, or opiates), and found that five were common to all four drug types. While the five are well known and scientists had suspected them as culprits, the team's work provided, for the first time, statistical evidence to back up those assumptions. Her questions were not new, but her team's methodology was.
A new class of intelligent software agents that specialize in scientific data and pattern matching aims to provide scientists with this kind of data mining across disciplines and at scale. Some approaches rely on advanced probabilistic machine learning techniques combined with large-scale text processing and crawling.
The results of automated data mining to date have been decidedly mixed. While the specific learning techniques have improved, they are still widely considered immature. However, recent advances based on work under the PAL program at DARPA have yielded some interesting new technologies that may make text data mining for science both more practical and more accurate.
Better, more rapid data-finding and "stitching" could have a tremendous impact on the pace of innovation in science. Discovery is not a binary state - it's a process, one that until the nineteenth century usually involved a lone scientist. Allied trends in open publishing models and open science are converging to produce new areas of inquiry and new experimental data.
As scientists attack non-traditional domains that require consideration of multiple literatures and disciplines, they will need help (the new CDC NCZVED center is a prime example). If they have tools to harness the growing opportunities in data and collaboration, we could see more rapid discovery, further opening of data sources, and better collaboration across disciplines and domains, all of which ultimately serve to create better science.
http://www.economist.com/science/displaystory.cfm?story_id=10493159
http://www.darpa.mil/IPTO/programs/pal/pal.asp
http://calosystem.org/
http://ebiquity.umbc.edu/event/html/id/215/Empowering-Scientific-Discovery-by-Distributed-Data-Mining-on-the-Grid-Infrastructure
http://cs.nyu.edu/courses/fall07/G22.2965-001/inriamachinelearning.pdf
http://www.nersc.gov/nusers/analytics/exploration/
http://www.kdd2007.com/panelsCFP.html
http://www.cdc.gov/nczved/