Papers
arxiv:2509.13397

The threat of analytic flexibility in using large language models to simulate human data: A call to attention

Published on Sep 16
Authors:

Abstract

The use of large language models to create synthetic datasets for human subjects research involves numerous analytic choices that significantly impact sample quality, with no single configuration optimizing all dimensions.

AI-generated summary

Social scientists are now using large language models to create "silicon samples" - synthetic datasets intended to stand in for human respondents, aimed at revolutionising human subjects research. However, there are many analytic choices which must be made to produce these samples. Though many of these choices are defensible, their impact on sample quality is poorly understood. I map out these analytic choices and demonstrate how a very small number of decisions can dramatically change the correspondence between silicon samples and human data. Configurations (N = 252) varied substantially in their capacity to estimate (i) rank ordering of participants, (ii) response distributions, and (iii) between-scale correlations. Most critically, configurations were not consistent in quality: those that performed well on one dimension often performed poorly on another, implying that there is no "one-size-fits-all" configuration that optimises the accuracy of these samples. I call for greater attention to the threat of analytic flexibility in using silicon samples.

Community

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2509.13397 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2509.13397 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2509.13397 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.