CASPER: Concept-integrated Sparse Representation for Scientific Retrieval
Abstract
CASPER, a sparse retrieval model using tokens and keyphrases, outperforms dense and sparse baselines in scientific search and keyphrase generation.
The exponential growth of scientific literature has made it increasingly difficult for researchers to keep up with the literature. In an attempt to alleviate this problem, we propose CASPER, a sparse retrieval model for scientific search that utilizes tokens and keyphrases as representation units (i.e. dimensions in the sparse embedding space), enabling it to represent queries and documents with research concepts and match them at both granular and conceptual levels. To overcome the lack of suitable training data, we propose mining training data by leveraging scholarly references (i.e. signals that capture how research concepts of papers are expressed in different settings), including titles, citation contexts, author-assigned keyphrases, and co-citations. CASPER outperforms strong dense and sparse retrieval baselines on eight scientific retrieval benchmarks. Moreover, we demonstrate that through simple post-processing, CASPER can be effectively used for the keyphrase generation tasks, achieving competitive performance with the established CopyRNN while producing more diverse keyphrases and being nearly four times faster.
Models citing this paper 1
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper