Dead Science: Most Resources Linked in Biomedical Articles Disappear in Eight Years

Modeling resource decay in scientific publications.
Science of Science
Logistic Regression
Authors
Affiliations

Tong Zeng

School of Information Management, Nanjing University, Nanjing 210023, China

School of Information Studies, Syracuse University, Syracuse, NY 13244, USA

Alain Shema

School of Information Studies, Syracuse University, Syracuse, NY 13244, USA

Daniel E Acuna

School of Information Studies, Syracuse University, Syracuse, NY 13244, USA

Published

March 13, 2019

Abstract

Scientific progress critically depends on disseminating analytic pipelines and datasets that make results reproducible and replicable. Increasingly, researchers make resources available for wider reuse and embed links to them in their published manuscripts. Previous research has shown that these resources become unavailable over time, but the extent and causes of this problem in open access publications have not been explored well. Using 1.9 million articles from PubMed Open Access, we estimate that half of all resources become unavailable after 8 years. We find that the number of times a resource has been used, the international (int) and organization (org) domain suffixes, and the number of affiliations are positively related to resources being available. In contrast, we find that the length of the URL, the Indian (in), European Union (eu), and Chinese (cn) domain suffixes, and abstract length are negatively related to resources being available. Our results contribute to our understanding of resource sharing in science and provide some guidance to solve resource decay.

Keywords

Link decay, Resources decay, Reproducibility

1 Introduction

Reproducibility and replicability are key components of science. Increasingly, they depend on the ability of scientists to use the resources shared in scientific articles. Many studies have found that resources embedded in scientific publications decay over time [1–6], directly affecting the incremental nature of science. In particular, the biomedical sciences reuse resources regularly (e.g., software [7], protocols [8], and datasets [9]). However, a systematic study of the decay of such resources in biomedical publications is lacking.

The mechanisms governing sharing of data and resources are important for science. As early as 2003, the National Institutes of Health published a policy requiring applications for grants greater than $500,000 to include data sharing plans [10]. The National Science Foundation also has policies encouraging data sharing [11]. Other institutions recognize the importance of this practice (e.g., [12,13]) and its actual impact on the acceleration of science (e.g., [9]). Sharing of resources is important, yet how they decay is still poorly understood.

One way of understanding how long and why resources are available is to analyze how resources decay over time. Several studies have tried to understand this phenomenon across disciplines. This previous work has focused mostly on closed access publications and, to the best of our knowledge, the biggest dataset has around 1 million URLs [5,6]. In our work, we examine resource decay at a significantly larger volume in open access biomedical articles.

2 Materials and Methods

We obtained a copy of the PubMed Open Access Subset in June 2018, which consists of 1,904,971 articles. Not all URLs in these files are interesting or represent a resource being shared, so we apply the following filters to discard URLs. First, we remove links to local file systems and URLs without any path, and we canonicalize the remaining URLs. The URL availability checker follows standard detection methods [17]. This checker, however, does not consider resources that are available but have moved from the original URL. The final dataset contains 2,642,694 URLs, of which 1,883,622 are unique.
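The filtering step described above can be illustrated with a minimal sketch. The paper does not spell out its exact rules, so everything here is an assumption: the function names (`is_remote`, `has_path`, `canonicalize`, `filter_urls`) are hypothetical, "local file system" is taken to mean anything that is not an http, https, or ftp URL with a host, "without any path" is taken to mean a bare site root, and canonicalization is reduced to lowercasing the scheme and host and dropping the fragment.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: only these schemes count as remotely shared resources.
REMOTE_SCHEMES = ("http", "https", "ftp")


def is_remote(url: str) -> bool:
    """True if the URL has a remote scheme and a host (not a local file link)."""
    parts = urlsplit(url)
    return parts.scheme.lower() in REMOTE_SCHEMES and bool(parts.netloc)


def has_path(url: str) -> bool:
    """True if the URL points below the site root."""
    return urlsplit(url).path not in ("", "/")


def canonicalize(url: str) -> str:
    """Lowercase scheme and host, and drop the fragment (an assumed rule set)."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )


def filter_urls(urls):
    """Keep canonical forms of remote URLs that have a path, in input order."""
    kept = []
    for raw in urls:
        url = raw.strip()
        if is_remote(url) and has_path(url):
            kept.append(canonicalize(url))
    return kept


if __name__ == "__main__":
    sample = [
        "file:///C:/data.csv",               # local file system: dropped
        "http://example.org",                # no path: dropped
        "HTTP://Example.ORG/data/set1#top",  # kept, canonicalized
        "http://example.org/data/set1",      # kept, duplicate after canonicalization
    ]
    kept = filter_urls(sample)
    print(len(kept), "URLs kept,", len(set(kept)), "unique")
```

Counting `len(kept)` against `len(set(kept))` mirrors the distinction the text draws between all URLs and unique URLs; the availability check itself is not sketched here because the source only refers to standard detection methods [17].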

3 Results

3.2 Most Resources Shared in Publications Disappear After Eight Years.

The point at which half of the resources become obsolete is important. Here we simply examined the average availability of a resource as a function of age (Figure 2) and found that this point occurs after eight years. Surprisingly, new resources (age = 0) have a 20% chance of being unavailable, similar to previous findings [18]. Resources tend to follow a steady decline from ages 1 to 10 years. After that, resources 10 years and older seem to stabilize around 42% availability.
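The computation described above, average availability per age and the age at which it reaches one half, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names `availability_by_age` and `half_life` are hypothetical, the input format (pairs of age in years and an availability flag) is assumed, and the toy records are invented solely to exercise the functions.

```python
from collections import defaultdict


def availability_by_age(records):
    """Map each age (in years) to the fraction of resources still available.

    `records` is an iterable of (age, is_available) pairs (assumed format).
    """
    totals = defaultdict(int)
    alive = defaultdict(int)
    for age, is_available in records:
        totals[age] += 1
        alive[age] += int(is_available)
    return {age: alive[age] / totals[age] for age in sorted(totals)}


def half_life(curve, threshold=0.5):
    """Return the first age whose availability is at or below `threshold`.

    Returns None if availability never drops that low.
    """
    for age in sorted(curve):
        if curve[age] <= threshold:
            return age
    return None


if __name__ == "__main__":
    # Toy records, invented for illustration only (not the paper's data).
    toy = (
        [(0, True)] * 4 + [(0, False)]             # age 0: 4 of 5 available
        + [(8, True), (8, False)]                  # age 8: 1 of 2 available
        + [(9, True), (9, False), (9, False)]      # age 9: 1 of 3 available
    )
    curve = availability_by_age(toy)
    print(curve)
    print("half-life (years):", half_life(curve))
```

Using the first age at or below the threshold is one simple convention; interpolating between adjacent ages or fitting a survival model would be reasonable alternatives.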

Figure 2: Probability of resources being available as a function of age in years.

4 Discussion and conclusion

The practice of embedding links in scientific papers has been growing exponentially, and our findings are in line with previous research [15,3]. While we found a half-life of 8 years, there is great variability in this number across studies: 2.2 years [2], 5 years [4], 5.3 years [2], 9.3 years [15], and 10 years [14]. However, all this previous work analyzed a smaller volume of links and shorter time spans than our analysis, which may explain this variability.

While our findings are only correlations, they offer some intuitive suggestions. We propose that authors use shorter, easier-to-remember URLs hosted in non-profit domains. For links that are still available, we should consider archiving those that are old, have complex URLs, are published in low h-index journals, or are hosted in country-based domains. If none of this is possible, we can explicitly archive links with services such as Perma [16], the Internet Archive [22], and WebCite [23,24].

Citation

BibTeX citation:
@inproceedings{zeng2019,
  author = {Zeng, Tong and Shema, Alain and Acuna, Daniel E.},
  editor = {Greene Taylor, Natalie and Christian-Lamb, Caitlin and Martin,
    Michelle H. and Nardi, Bonnie},
  publisher = {Springer, Cham.},
  title = {Dead {Science:} {Most} {Resources} {Linked} in {Biomedical}
    {Articles} {Disappear} in {Eight} {Years}},
  booktitle = {Information in Contemporary Society. iConference 2019.},
  series = {Lecture Notes in Computer Science},
  volume = {11420},
  pages = {170--176},
  date = {2019-03-13},
  url = {https://link.springer.com/chapter/10.1007/978-3-030-15742-5_16},
  doi = {10.1007/978-3-030-15742-5_16},
  langid = {en},
  abstract = {Scientific progress critically depends on disseminating
    analytic pipelines and datasets that make results reproducible and
    replicable. Increasingly, researchers make resources available for
    wider reuse and embed links to them in their published manuscripts.
    Previous research has shown that these resources become unavailable
    over time but the extent and causes of this problem in open access
    publications has not been explored well. By using 1.9 million
    articles from PubMed Open Access, we estimate that half of all
    resources become unavailable after 8 years. We find that the number
    of times a resource has been used, the international (int) and
    organization (org) domain suffixes, and the number of affiliations are
    positively related to resources being available. In contrast, we
    found that the length of the URL, Indian (in), European Union (eu),
    and Chinese (cn) domain suffixes, and abstract length are negatively
    related to resources being available. Our results contribute to our
    understanding of resource sharing in science and provide some
    guidance to solve resource decay.}
}
For attribution, please cite this work as:
Zeng, Tong, Alain Shema, and Daniel E Acuna. 2019. “Dead Science: Most Resources Linked in Biomedical Articles Disappear in Eight Years.” In Information in Contemporary Society. iConference 2019., edited by Natalie Greene Taylor, Caitlin Christian-Lamb, Michelle H. Martin, and Bonnie Nardi, 11420:170–76. Lecture Notes in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-030-15742-5_16.