Predicting the longevity of resources shared in scientific publications

Predicting the longevity of resources shared in publications using a Tobit model and a random forest.
Science of Science
Machine Learning
Authors
Affiliations

Daniel E Acunaz

School of Information Studies, Syracuse University, Syracuse, NY 13244, USA

Jian Jian

School of Information Studies, Syracuse University, Syracuse, NY 13244, USA

Tong Zeng

School of Information Management, Nanjing University, Nanjing 210023, China

Lizhen Liang

School of Information Studies, Syracuse University, Syracuse, NY 13244, USA

Han Zhuang

School of Information Studies, Syracuse University, Syracuse, NY 13244, USA

Published

March 24, 2022

Abstract

Research has shown that most resources shared in articles (e.g., URLs to code or data) are not kept up to date and mostly disappear from the web after some years (Zeng et al., 2019). Little is known about the factors that differentiate and predict the longevity of these resources. This article explores a range of explanatory features related to the publication venue, authors, references, and where the resource is shared. We analyze an extensive repository of publications and, through web archival services, reconstruct how they looked at different time points. We discover that the most important factors are related to where and how the resource is shared, and surprisingly little is explained by the author’s reputation or prestige of the journal. By examining the places where long-lasting resources are shared, we suggest that it is critical to disseminate and create standards with modern technologies. Finally, we discuss implications for reproducibility and recognizing scientific datasets as first-class citizens.

Keywords

Dataset decay, Science of science, Predictive analytics, Reproducibility

Citation

BibTeX citation:
@article{e_acunaz2022,
  author = {E Acunaz, Daniel and Jian, Jian and Zeng, Tong and Liang,
    Lizhen and Zhuang, Han},
  publisher = {arXiv.org},
  title = {Predicting the Longevity of Resources Shared in Scientific
    Publications},
  journal = {arXiv preprint arXiv:2203.12800},
  date = {2022-03-24},
  url = {https://arxiv.org/abs/2203.12800},
  doi = {10.48550/arxiv.2203.12800},
  langid = {en},
  abstract = {Research has shown that most resources shared in articles
    (e.g., URLs to code or data) are not kept up to date and mostly
    disappear from the web after some years (Zeng et al., 2019). Little
    is known about the factors that differentiate and predict the
    longevity of these resources. This article explores a range of
    explanatory features related to the publication venue, authors,
    references, and where the resource is shared. We analyze an
    extensive repository of publications and, through web archival
    services, reconstruct how they looked at different time points. We
    discover that the most important factors are related to where and
    how the resource is shared, and surprisingly little is explained by
    the author’s reputation or prestige of the journal. By examining the
    places where long-lasting resources are shared, we suggest that it
    is critical to disseminate and create standards with modern
    technologies. Finally, we discuss implications for reproducibility
    and recognizing scientific datasets as first-class citizens.}
}
For attribution, please cite this work as:
E Acunaz, Daniel, Jian Jian, Tong Zeng, Lizhen Liang, and Han Zhuang. 2022. “Predicting the Longevity of Resources Shared in Scientific Publications.” arXiv preprint arXiv:2203.12800. https://doi.org/10.48550/arxiv.2203.12800.