Finding datasets in publications: the Syracuse University approach

Extracting dataset mentions from publications using a bi-LSTM-CRF network
Science of Science
Deep learning
Authors
Affiliation

Tong Zeng

School of Information Studies, Syracuse University, Syracuse, NY 13244, USA

Daniel E Acuna

School of Information Studies, Syracuse University, Syracuse, NY 13244, USA

Published

December 31, 2019

Abstract

Science is fundamentally an incremental discipline that depends on previous scientists’ work. Datasets form an integral part of this process and therefore should be shared and cited like any other scientific output. This ideal is far from reality: the credit that datasets currently receive does not correspond to their actual usage (Zeng et al., 2020). One of the issues is that there is no standard for citing datasets, and even if they are cited, they are not properly tracked by major scientific indices. Interestingly, while datasets are still used and mentioned in articles, we lack methods to extract such mentions and properly reconstruct dataset citations. The Rich Context Competition challenge aims to close this gap by inviting scientists to produce automated dataset mention and linkage detection algorithms. In this chapter, we detail our proposal to solve the dataset mention step. Our approach attempts to provide a first approximation to better give credit and keep track of datasets and their usage.

Keywords

Dataset identification, Dataset discovery, Research Dataset

1 Introduction

The problem of dataset extraction has been explored before. Ghavimi et al. (2016, 2017) use a relatively simple TF-IDF representation with cosine similarity for matching dataset mentions in social science articles. Their method consists of three major steps: preparing a curated dictionary of typical mention phrases, detecting dataset references, and ranking matching datasets by the cosine similarity of their TF-IDF representations. This approach achieved relatively high performance, with F1 = 0.84 for mention detection and F1 = 0.83 for matching. Singhal and Srivastava (2013) proposed a method that uses normalized Google distance to decide whether a term refers to a dataset. However, this method relies on external services and is not computationally efficient. They achieved a good F1 = 0.85 using Google search and F1 = 0.75 using Bing. A somewhat similar project was proposed by Lu et al. (2012). They built a dataset search engine by solving two challenges: identifying the dataset and associating it with a URL. They built a corpus of 1,000 documents with their URLs, containing 8,922 words or abbreviations representing datasets, and also built a web-based interface. This shows the importance of dataset mention extraction and how several groups have tried to tackle the problem.
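To make the matching step concrete, the following is a minimal sketch of TF-IDF cosine-similarity ranking in the spirit of Ghavimi et al.; the tokenization, weighting scheme, and candidate dataset names below are simplified illustrations, not their exact implementation:

```python
import math
from collections import Counter

def rank_candidates(query_tokens, candidate_tokens_list):
    """Rank candidate dataset names against a mention phrase by
    cosine similarity of TF-IDF vectors. Returns candidate indices,
    best match first."""
    n_docs = len(candidate_tokens_list)
    # document frequency of each term across the candidates
    df = Counter()
    for tokens in candidate_tokens_list:
        df.update(set(tokens))

    def tfidf(tokens):
        tf = Counter(tokens)
        return {t: tf[t] * math.log((1 + n_docs) / (1 + df[t]))
                for t in tf}

    def cosine(u, v):
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    q = tfidf(query_tokens)
    scores = [cosine(q, tfidf(tokens)) for tokens in candidate_tokens_list]
    return sorted(range(n_docs), key=lambda i: -scores[i])

# Illustrative candidates only; any real system would use a curated dictionary.
candidates = [
    ["panel", "study", "of", "income", "dynamics"],
    ["national", "longitudinal", "survey", "of", "youth"],
]
ranking = rank_candidates(["panel", "study", "income"], candidates)
```

Here the mention phrase shares three terms with the first candidate and none with the second, so the first candidate is ranked on top.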

In this chapter, we describe a method for extracting dataset mentions based on a deep recurrent neural network. In particular, we used a bidirectional long short-term memory (bi-LSTM) sequence to sequence model paired with a conditional random field (CRF) inference mechanism. The architecture is similar to that of Chapter 6, but we only focus on the detection of dataset mentions. We tested our model on a novel dataset produced for the Rich Context Competition challenge. We achieved a relatively good performance of F1 = 0.885. We discuss the limitations of our model.

2 The Proposed Method

2.1 Overall View of the Architecture

In this section we propose a model for detecting mentions based on a bi-LSTM CRF architecture. At a high level, the model uses a sequence-to-sequence recurrent neural network that produces the probability of whether a token belongs to a dataset mention. The CRF layer takes those probabilities and estimates the most likely sequence based on constraints between label transitions (e.g., mention–to–no-mention–to–mention has low probability). While this is a standard architecture for modelling sequence labelling, the application to our particular dataset and problem is new.

We now describe in more detail the choices of word representation, hyperparameters and training parameters. A schematic view of the model is given in Figure 1, and its components are as follows.

Figure 1: The architecture of bi-LSTM CRF network
  1. Character encoder layer: treats each token as a sequence of characters and encodes them with a bi-LSTM to obtain a vector representation.
  2. Word embedding layer: maps each token to a fixed-size vector representation using pre-trained word vectors.
  3. Bi-LSTM layer: uses a bi-LSTM network to capture a high-level representation of the whole input token sequence.
  4. Dense layer: projects the output of the previous layer to a low-dimensional vector representing the distribution over labels.
  5. CRF layer: finds the most likely sequence of labels.
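As an illustration of step 5, CRF decoding can be sketched as Viterbi search over per-token label scores like those produced by the dense layer in step 4. The label set, the example sentence, and all transition and emission scores below are invented for illustration; they are not the trained model's parameters:

```python
LABELS = ["O", "B-DATA", "I-DATA"]  # BIO tags for dataset mentions

def viterbi(emissions, transitions):
    """emissions: T x L per-token label scores; transitions[i][j]:
    score of moving from label i to label j. Returns the
    highest-scoring label sequence as a list of label indices."""
    n_labels = len(emissions[0])
    score = list(emissions[0])
    backpointers = []
    for emis in emissions[1:]:
        step_back, step_score = [], []
        for j in range(n_labels):
            # best previous label for landing on label j at this token
            best_i = max(range(n_labels),
                         key=lambda i: score[i] + transitions[i][j])
            step_back.append(best_i)
            step_score.append(score[best_i] + transitions[best_i][j] + emis[j])
        backpointers.append(step_back)
        score = step_score
    path = [max(range(n_labels), key=lambda j: score[j])]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    return path[::-1]

# Transition constraints: an I-DATA tag directly after O is implausible.
transitions = [
    [0.0, 0.0, -4.0],  # from O
    [0.0, 0.0,  1.0],  # from B-DATA
    [0.0, 0.0,  1.0],  # from I-DATA
]
# Emission scores for the tokens "we use the ADNI cohort data"
emissions = [
    [2.0, -1.0, -1.0],   # we
    [2.0, -1.0, -1.0],   # use
    [2.0, -1.0, -1.0],   # the
    [-1.0, 2.0,  0.5],   # ADNI
    [-0.5, 0.0,  1.5],   # cohort
    [1.0,  0.0,  0.9],   # data
]
tags = [LABELS[i] for i in viterbi(emissions, transitions)]
```

Even though the last token's best individual label is O, the transition scores favor continuing the mention, so the decoder tags "ADNI cohort data" as one span; this is exactly the kind of label-transition constraint the CRF layer contributes on top of the bi-LSTM's per-token scores.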

Please read the paper for more details.

3 Results

In this work we wanted to propose a model for the Rich Context Competition challenge. We propose a relatively standard architecture based on the bi-LSTM CRF network. We now describe the evaluation metrics, hyperparameter setting, and the results of this network on the dataset provided by the competition.

For all of our results, we use F1 as the measure of performance. F1 is the harmonic mean of precision and recall, and it is the standard measure used in sequence labelling tasks. It ranges from 0 to 1, with higher values being better. Our method achieved a relatively high F1 of 0.885 for detecting mentions.
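The measure can be computed directly from precision and recall. The precision and recall values below are illustrative only: they are not our system's actual figures, merely numbers that happen to yield an F1 near 0.885:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative values: precision 0.90 and recall 0.87 give F1 of about 0.885.
f1 = round(f1_score(0.90, 0.87), 3)
```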

Please read the paper for more details.

4 Conclusion

In this work, we report a high-accuracy model for the problem of detecting dataset mentions. Because our method is based on a standard bi-LSTM CRF architecture, we expect that updating our model with recent developments in neural networks would only improve our results. We also provide some evidence of how difficult the linkage step of the challenge could be if the noise in the dataset is not reduced.

One of the shortcomings of our approach is that the architecture is lacking some modern features of RNN networks. In particular, recent work has shown that attention mechanisms are important especially when the task requires spatially distant information, as in this case. These benefits could also translate to better linkage. We are exploring new architectures using self-attention and multiple-head attention. We hope to share these approaches in the near future.

There are a number of improvements that we could make in the future. A first improvement would be to use non-recurrent neural architectures such as the Transformer, which has been shown to be a faster and more effective learner than RNNs. Another improvement would be to bootstrap information from other sources, such as the open-access full-text articles in the PubMed Open Access Subset. That collection contains citations to datasets (Zeng et al., 2020), in contrast to the more common citations to publications, and the location of such citations within the full text could be exploited to perform entity recognition. While this would be a somewhat different problem from the one solved in this chapter, it would still be useful for the goal of tracking dataset usage. In sum, by improving both the learning techniques and the dataset size and quality, we could significantly increase the success of finding datasets in publications.

Our proposal, however, is surprisingly effective. Because we have barely modified a general RNN architecture, we expect that our results will generalize relatively well either to the second phase of the challenge or even to other disciplines. We would emphasize, however, that the quality of the dataset has a great deal of room for improvement. Given how important this task is for the whole of science, we should strive to improve the quality of these datasets so that techniques like this one can be more broadly applied.


Citation

BibTeX citation:
@incollection{zeng2019,
  author = {Zeng, Tong and Acuna, Daniel E.},
  editor = {Lane, Julia I. and Mulvany, Ian and Nathan, Paco},
  publisher = {SAGE},
  title = {Finding Datasets in Publications: The {Syracuse} {University}
    Approach},
  booktitle = {Rich Search and Discovery for Research Datasets},
  pages = {157 - 165},
  date = {2019-12-31},
  url = {https://tong-zeng.github.io/publications/finding-datasets-in-publications/},
  langid = {en}
}
For attribution, please cite this work as:
Zeng, Tong, and Daniel E. Acuna. 2019. “Finding Datasets in Publications: The Syracuse University Approach.” In Rich Search and Discovery for Research Datasets, edited by Julia I. Lane, Ian Mulvany, and Paco Nathan, 157–65. SAGE. https://tong-zeng.github.io/publications/finding-datasets-in-publications/.