Papers
arxiv:2508.13394

CASPER: Concept-integrated Sparse Representation for Scientific Retrieval

Published on Aug 18
Authors:
,
,
,

Abstract

CASPER, a sparse retrieval model using tokens and keyphrases, outperforms dense and sparse baselines in scientific search and keyphrase generation.

AI-generated summary

The exponential growth of scientific literature has made it increasingly difficult for researchers to keep up with the literature. In an attempt to alleviate this problem, we propose CASPER, a sparse retrieval model for scientific search that utilizes tokens and keyphrases as representation units (i.e. dimensions in the sparse embedding space), enabling it to represent queries and documents with research concepts and match them at both granular and conceptual levels. To overcome the lack of suitable training data, we propose mining training data by leveraging scholarly references (i.e. signals that capture how research concepts of papers are expressed in different settings), including titles, citation contexts, author-assigned keyphrases, and co-citations. CASPER outperforms strong dense and sparse retrieval baselines on eight scientific retrieval benchmarks. Moreover, we demonstrate that through simple post-processing, CASPER can be effectively used for the keyphrase generation tasks, achieving competitive performance with the established CopyRNN while producing more diverse keyphrases and being nearly four times faster.

Community

Sign up or log in to comment

Models citing this paper 1

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2508.13394 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2508.13394 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.