Discovery, rediscovery, and open access: Part 1
From the August 2010 issue of Peter Suber's SPARC Open Access Newsletter.
In 1979, William Garvey made a remarkable claim: " ... in some disciplines, it is easier to repeat an experiment than it is to determine that the experiment has already been done." (See W.D. Garvey, Communication: The essence of science, Pergamon Press, Oxford 1979, p. 8.)
Garvey was talking about research in the era of print, and we'd like to think that digital technologies have changed the picture. But Garvey's thesis is not false today. It's just true less often than it was in 1979.
Of course digitizing research makes it easier to find. But when finding it is still hard (because search tools are weak or access barriers block crawlers) or when retrieval is hard (because the work is toll-access or TA) or when the original experiment is particularly easy to repeat, then repeating the experiment can still be the path of least resistance.
The same is true when we move beyond mere digitization to online distribution. Search will be easier, but may still fail. For many people retrieval will still be costly or impossible.
The same is true when we move beyond online distribution to OA distribution. When the original research is OA, then if we can find it, we can retrieve it. But a hard search can still be harder than an easy experiment --though not very often.
Of course even when the original research is findable and retrievable, the researchers might have been vague or coy when writing it up. ("Decant the phlogiston..." or "The results, according to our [homegrown, closed-source] software...."). But the Garvey problem is about access failures, not author failures. Hence, we needn't consider the case --which can arise in an OA world-- in which we can find and retrieve the original research but must still repeat the work in order to make up for deficiencies in what we retrieved.
Our best hope for solving the Garvey problem is the combination of ubiquitous OA and powerful search, and fortunately we're making good progress on both fronts. The scope of the Garvey problem is shrinking. The optimistic take is that it won't be long before the Garvey problem is limited to known, simple truths that are easier to discover in the world than to rediscover by search. (For the pessimistic take, see Part 2 next month.)
This class of simple truths may be small and shrinking, but it's non-empty. For example, I have a thermometer outside my kitchen window, but I wear bifocals and can't read it from across the room. On some days, when I'm drinking my morning coffee at my laptop, it's easier to Google the local temperature than walk across the room. On other days, it's easier to walk across the room. A key variable is whether Bonny, my excitable 100-pound Labrador Retriever, is hovering between me and the window waiting for me to stand up.
Of course when the Garvey problem arises only for this class of truths, it's no longer a serious obstacle. As time goes on, then, we might be tempted to blur the distinction between the cases where there is no Garvey problem and the cases where the Garvey problem is no obstacle --that is, between the cases when searching is easier than repeating the original experiment and the cases when searching is harder but still easy. But we should remember that we reach this plateau of low-barrier access to knowledge not just in the rare cases when discovery and rediscovery are both easy. We also reach it in the more plentiful cases when just one of them is easy. Hence, it remains important to distinguish the speed bump of search from the speed bump of repeating the original experiment or observations. The two bumps are equally low only for a special subset of simple truths. But we only need one to be low in order to accelerate research.
Where we can't make empirical discovery easier, at least we can make look-up or rediscovery easier. Step One in eliminating the Garvey problem is to discover and record a piece of knowledge, or a claimed piece of knowledge. If we get this far, then in principle others can find it without having to repeat the original work. Step Two is to make finding easier than repeating. Hence, it takes a village to shrink the scope of the Garvey problem: the original discoverers and recorders, the larger community of refiners and confirmers, and the army of cooperators developing the recent critical improvements to our access system: digitization, online distribution, strengthening search, and spreading OA.
(For more on why OA isn't enough, and why search must complement OA, see SOAN for July 2005. For more on why search isn't enough, and why OA must complement search, see SOAN for December 2005.)
Solving the Garvey problem doesn't mean that we'll make use of what we know, but it gives us a fighting chance. It means that we're moving beyond mere preservation to a useful degree of findability. Findability may exist in a spectrum of degrees from 0 to 1. But we pass an important threshold when the ease of findability through an effective access system surpasses the ease of discoverability through empirical experiments. Insofar as knowledge is worth accessing, knowing, or using, then it's urgent to pass this threshold and urgent to remain above it. It's just as urgent to keep working to lower any access barriers that may remain.
Where we haven't solved the Garvey problem, we're allowing access barriers to render some known truths essentially useless. Where we've solved the Garvey problem but haven't kept working to lower the barriers to findability and retrievability as much as we can lower them, we're allowing access barriers to render some known truths needlessly expensive, invisible, and out of reach.
* For a given research question, I might face a Garvey problem when you would not. For example, you might have good access to the internet or a print library when I don't, or I might have good access to lab equipment when you don't. (Further, one of us might face a lab problem and the other a "Lab" problem.)
Similarly, some research questions raise Garvey problems when others do not. Some searches are so difficult --reading a handwritten manuscript locked in the Vatican-- that we'd prefer to redo the original work, if only we could. This is the kind of problem that digitization, OA, and search can solve, even if they haven't yet solved every case.
Conversely, some experiments are too expensive or dangerous to repeat. Does lunar soil contain more silicon than iron? What happens when two 3.5 trillion electron-volt proton beams collide? Is thalidomide safe for pregnant women? If only we could answer our question through a difficult search, perhaps by traveling across the globe to read a unique handwritten manuscript, we'd prefer to do that rather than redo the experiment.
OA helps in both cases, either by making difficult searches easier or by making the repetition of difficult experiments unnecessary.
For more on the second family of cases, see Richard Poynder's case study of CERN's Large Hadron Collider (August 2008), showing that big science intrinsically carries big incentives for OA:
"Ten or 20 years ago we might have been able to repeat an experiment," says [CERN's Rolf-Dieter Heuer]. "They were simpler, cheaper and on a smaller scale. Today that is not the case. So if we need to re-evaluate the data we collect to test a new theory, or adjust it to a new development, we are going to have to be able reuse it. That means we are going to need to save it as open data." ...Openness is not an issue for data alone, however. The research papers produced from the LHC experiments will also have to be open....Because science is a cumulative process, the greater the number of people who can access research, critique it, check it against the underlying data and then build on it, the sooner new solutions and theories will emerge. And as "Big Science" projects like the LHC become the norm, the need for openness will be even greater because the larger the project, the more complex the task, and the greater the need for collaboration....Certainly, if the public is asked to fund further multi-billion-pound projects like the [Large Hadron Collider], there will be growing pressure on scientists to maximise the value of the data they generate - and that will require greater openness.
Banks too big to fail can threaten the economy. But experiments too big to repeat are no threat to research, at least if we take the lesson from their virtual unrepeatability and open up the process, data, software, and results so that repetition is as unnecessary as we can possibly make it.
Experiments too big to repeat don't face Garvey problems. No matter how difficult it is to find the original results, finding them will be easier than redoing the original work. But insofar as we tolerate access barriers to those results, we undermine our own enormous investment in the original research. It's penny wise and pound foolish to fund an expensive experiment and make the results expensive for subsequent researchers to find and retrieve.
If this is true of single experiments that are individually expensive, it's also true of portfolios of experiments that are collectively expensive. One of the compelling rationales for a funder OA mandate is that the large investment in the funder's research budget should not be undermined by needless access barriers to the results. Many funders turn to OA to solve "big portfolio" and "big science" problems at the same time. The NIH research budget is more than the GDP of 140 nations. When taxpayers devote that kind of money to research, they can maximize the return on their investment by ensuring that the results are available to all who can build on them. In addition, the cost of an NIH-funded research project can be hundreds or even thousands of times greater than the cost of publication. To allow its results to be held hostage by publishers is the same mistake on a different scale as spending billions on a Large Hadron Collider and locking up the results in toll-access publications.
* Not all the literature we want to find, retrieve, and read should be called "knowledge". We want access to serious proposals for knowledge even if we're still evaluating them and debating their merits. We want access even to knowledge claims that turn out to be false or one-sided. We want access to datasets even when they are small, uninterpreted, or methodologically flawed. We want access to observations that were true when they were made but time-bound and variable (like Garvey's own observation in 1979). We want access to all the data, evidence, arguments, and algorithms that help us decide what to call "knowledge", not just to the results that we agree to call "knowledge". If access depended on the *outcome* of debate and inquiry, then access could not *contribute* to debate and inquiry.
We don't have a good name for this category larger than knowledge, but here I'll just call it "research". Research includes knowledge, knowledge claims and proposals, hypotheses and conjectures, data, algorithms, debate, evaluation, criticism, dissent, interpretation, summary, and review.
Recapitulating an experiment or observation can be trivial compared to recapitulating a generations-long and culture-wide (or cultures-wide) debate. Large-scale debates, like large-scale experiments, are too big to repeat. Like large-scale experiments, they are generally immune to Garvey problems. Even if finding and retrieving all the relevant bits is extraordinarily difficult, it's usually easier than recapitulating the original process. It doesn't follow that access barriers are low, merely that they are lower than repetition barriers. Access can be formidably expensive and difficult, not only because of price and permission barriers, but because of the nearly opposite problems of information overload and inadequate filtering and search.
In these ways, research in the wider sense, and nearly all research in the humanities, is like big science. It's too big to repeat or recapitulate and carries intrinsic incentives for OA. Leaving access barriers any higher than necessary means slowing the process of inquiry and wasting more effort and resources than necessary.
* Postscript. This essay stands on its own, but it's also Part 1 of a longer piece. I'm hoping to publish Part 2 next month. In Part 2 I'll look at (1) when we *want* to repeat the original work, in order to test its reproducibility, and (2) when the rediscovery of older knowledge is made difficult not so much by access barriers as by ignorance, indifference, taboos, and Dark Ages 2.0.