Predicting the longevity of resources shared in scientific publications
Predicting the longevity of resources shared in publications using a Tobit model and a random forest.
Science of Science
Machine Learning
Authors
Affiliations
Daniel E. Acuna
School of Information Studies, Syracuse University, Syracuse, NY 13244, USA
Jian Jian
School of Information Studies, Syracuse University, Syracuse, NY 13244, USA
Tong Zeng
School of Information Management, Nanjing University, Nanjing 210023, China
Lizhen Liang
School of Information Studies, Syracuse University, Syracuse, NY 13244, USA
Han Zhuang
School of Information Studies, Syracuse University, Syracuse, NY 13244, USA
Published
March 24, 2022
Abstract
Research has shown that most resources shared in articles (e.g., URLs to code or data) are not kept up to date and mostly disappear from the web after some years (Zeng et al., 2019). Little is known about the factors that differentiate and predict the longevity of these resources. This article explores a range of explanatory features related to the publication venue, authors, references, and where the resource is shared. We analyze an extensive repository of publications and, through web archival services, reconstruct how they looked at different time points. We discover that the most important factors are related to where and how the resource is shared, and surprisingly little is explained by the author’s reputation or prestige of the journal. By examining the places where long-lasting resources are shared, we suggest that it is critical to disseminate and create standards with modern technologies. Finally, we discuss implications for reproducibility and recognizing scientific datasets as first-class citizens.
Keywords
Dataset decay, Science of science, Predictive analytics, Reproducibility
@article{e_acunaz2022,
author = {Acuna, Daniel E. and Jian, Jian and Zeng, Tong and Liang,
Lizhen and Zhuang, Han},
publisher = {arXiv.org},
title = {Predicting the Longevity of Resources Shared in Scientific
Publications},
journal = {arXiv preprint arXiv:2203.12800},
date = {2022-03-24},
url = {https://arxiv.org/abs/2203.12800},
doi = {10.48550/arXiv.2203.12800},
langid = {en},
abstract = {Research has shown that most resources shared in articles
(e.g., URLs to code or data) are not kept up to date and mostly
disappear from the web after some years (Zeng et al., 2019). Little
is known about the factors that differentiate and predict the
longevity of these resources. This article explores a range of
explanatory features related to the publication venue, authors,
references, and where the resource is shared. We analyze an
extensive repository of publications and, through web archival
services, reconstruct how they looked at different time points. We
discover that the most important factors are related to where and
how the resource is shared, and surprisingly little is explained by
the author’s reputation or prestige of the journal. By examining the
places where long-lasting resources are shared, we suggest that it
is critical to disseminate and create standards with modern
technologies. Finally, we discuss implications for reproducibility
and recognizing scientific datasets as first-class citizens.}
}
For attribution, please cite this work as:
Acuna, Daniel E., Jian Jian, Tong Zeng, Lizhen Liang, and Han Zhuang.
2022. “Predicting the Longevity of Resources Shared in Scientific
Publications.” arXiv Preprint arXiv:2203.12800.
https://doi.org/10.48550/arXiv.2203.12800.