Concerns about the use of generative AI in social science research
How to contribute!
We would welcome any suggestions for additions to this list!
- (Our preference!) Make a pull request for this repository!
- Contact us via email: matti.nelimarkka@helsinki.fi and adeline.clarke@helsinki.fi
Background
Recently, there has been significant interest in the use of generative AI (genAI) in social science for data generation and analysis. Many articles, such as this one, advocate for its use and articulate opportunities that LLMs could provide for social science research. At the same time, some social scientists have felt less optimistic about these opportunities, with some arguing that their use requires explicit justification.
To summarise this discussion and highlight ongoing work, Chris Bail published the article ‘Can Generative AI improve social science?’ in May 2024, reviewing the current debate on both the benefits of and concerns about using generative AI for social science research and arguing that genAI has great potential in these fields.
In the hope of developing better practices in the social sciences, this document is intended to become a repository that outlines concerns and issues to address and connects them to more critical research. Before providing a list of research highlighting concerns with using genAI for social science research, we first summarise these concern areas, as well as proposed applications of genAI in the social sciences. (We assume our readers are familiar with terms like ‘generative AI’, ‘large language models’, and ‘foundation models’. If not, Bail’s article provides a useful summary of these.)
Overview of concerns
A number of concerns have been raised about the use of genAI in social science research, including:
Reproducibility
Replicability of research using generative AI is a concern due to the probabilistic nature of the models. The ‘temperature’ parameter can be lowered to reduce this variability, but there is no consensus on what it should be set to, and low temperatures introduce repetition that may be problematic in some research settings. Another issue is that as the models change and develop, their outputs will change as well, making it difficult to reproduce results at a later date. Finally, results may depend on which generative AI was used in the research.
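To make the reproducibility problem concrete, here is a minimal sketch of the settings researchers typically pin down when trying to make LLM-based analysis repeatable. It assumes the OpenAI Python client (v1.x); the model snapshot name and the labelling task are illustrative placeholders, not a recommendation.

```python
# Minimal sketch of "pinning down" an LLM call for repeatability.
# Assumes the OpenAI Python client (v1.x); model name is a placeholder.
from openai import OpenAI

client = OpenAI()

def classify(text: str, model: str = "gpt-4o-2024-08-06") -> str:
    """Label one document with settings chosen to reduce run-to-run variation."""
    response = client.chat.completions.create(
        model=model,        # pin a dated snapshot, not a moving alias like "gpt-4o"
        temperature=0,      # reduces sampling variance, but does not eliminate it
        seed=42,            # best-effort determinism only; not guaranteed across model updates
        messages=[
            {"role": "system", "content": "Label the text as POSITIVE, NEGATIVE or NEUTRAL."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

# Even with these settings, re-running months later (or against a silently
# updated model) can yield different labels, which is exactly the
# reproducibility concern described above.
```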
Bias
Generative AI tools exhibit the human biases found in their training materials. Bail argues that prompt engineering shows some promise in addressing this, and that these biases may be easier to remove from generative AI than from human populations, but this requires that researchers can identify the bias in the first place, which is hindered by their having little access to the material used to train the model. This bias can be seen as a ‘bug’ or a ‘feature’ and may be useful in some research applications. We also do not know what developers have done to limit or prevent certain biases (such as racism) through fine-tuning intended to safeguard against their generative AI producing unappealing or toxic content.
Ethics
The ethical concerns with the use of generative AI in research include: whether consent should be sought before including generative AI in experiments involving human subjects; generative AI producing harmful content or misinformation when interacting with human subjects (especially if interactions are not closely ‘supervised’, although arguably generative AI could give researchers more control over content than two humans interacting); the storage of identifiable, private, or confidential data when private corporations control the generative AI tools used in research; and environmental impact (see below). The ethical advantages include using generative AI to simulate dangerous scenarios and to diagnose ethical issues.
Hallucination / Junk Science
Generative AI can often produce inaccurate information with high confidence. This is partly because the models are trained on datasets that can contain misinformation or flawed content. In some cases, hallucinations have also been defamatory: in 2025, a Norwegian man filed a complaint after ChatGPT falsely claimed he had been convicted of murdering two of his children. Because such incorrect personal data is difficult or impossible for these services to correct, they are, in effect, in breach of the GDPR.
Environmental Impact
The environmental costs of training and using generative AI tools are significant, and as these tools become bigger (and better), so will their footprint!
Application areas
Text Analysis and Annotation
- Claim: Generative AI shows promise for text analysis and could be used for content analysis (such as classification, credibility assessment, and topic identification) with accuracy that surpasses that of Amazon Mechanical Turk workers. It is not yet superior to expert coding, but has comparative advantages including unlimited attention span, consistency, speed, and objectivity. (A sketch of how such claims are typically validated against expert coding follows below.)
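As a concrete illustration of how claims like this are usually evaluated, the sketch below compares hypothetical LLM-produced labels against expert coding. The labels are made-up placeholders; in a real study `llm_labels` would come from prompting a model over each document.

```python
# Sketch of validating LLM annotations against expert coding before trusting
# them at scale. Labels here are illustrative placeholders only.
from sklearn.metrics import cohen_kappa_score, classification_report

expert_labels = ["protest", "protest", "other", "policy", "other", "policy"]
llm_labels    = ["protest", "other",   "other", "policy", "other", "protest"]

# Chance-corrected agreement, the usual benchmark in content-analysis work.
kappa = cohen_kappa_score(expert_labels, llm_labels)
print(f"Cohen's kappa vs. expert coders: {kappa:.2f}")

# Per-class precision/recall makes asymmetric errors visible (see the Ollion
# et al. review in the table below on recall typically exceeding precision).
print(classification_report(expert_labels, llm_labels, zero_division=0))
```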
Synthetic Surveys and Silicon Samples
- Claim: Generative AI could be used to create ‘silicon samples’ after being given a number of background variables and traits of respondents, thus allowing a representative sample to be used where ‘convenience samples’ are otherwise used for practical reasons. These synthetic surveys could also be longer and include more invasive questions.
- A good example of these synthetic surveys is this article, which used GPT-3 ‘conditioned’ on real survey data (such as the ANES) to avoid unwanted algorithmic bias and instead produce responses that correspond to specific demographics. (A schematic prompt-conditioning sketch follows this list.)
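The following sketch shows, in schematic form, what the ‘conditioning’ step behind silicon samples looks like: a persona built from survey background variables is turned into a prompt, and the model is asked to answer as that respondent. The fields and wording are illustrative and are not the template used in the cited article.

```python
# Schematic persona-conditioning prompt for a "silicon sample" respondent.
# Field names, wording and options are illustrative placeholders.
from string import Template

PERSONA_TEMPLATE = Template(
    "You are a $age-year-old $gender from $region who describes themselves as "
    "$ideology and whose highest education is $education.\n"
    "Answer the survey question below as this person would, "
    "choosing exactly one option.\n\n"
    "Question: $question\nOptions: $options\nAnswer:"
)

prompt = PERSONA_TEMPLATE.substitute(
    age=46,
    gender="woman",
    region="the rural Midwest",
    ideology="moderately conservative",
    education="a high-school diploma",
    question="How much do you trust the national government?",
    options="A great deal / Quite a lot / Not very much / None at all",
)
print(prompt)
# The prompt would then be sent to an LLM; the concerns catalogued in the
# table below (machine bias, flattening, reduced variance) apply to whatever
# responses come back.
```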
Synthetic Experiments
- Claim: Generative AI has successfully reproduced some experiments (such as the Milgram experiment) and reflected known phenomena (such as the Prisoner’s Dilemma), so there is an opportunity for generative AI to reproduce experiments. Bail also highlights a study which found a correlation of 0.86 between the results of 482 studies and synthetic experiments using GPT-4, indicating this opportunity could be widespread.
- Claim: Dillion et al. suggest that LLMs could replace human participants in psychological studies, by demonstrating that, when asked to make moral judgements, GPT-3.5 responded very similarly to humans.
Generative Agent-Based Models
- Claim: Generative AI can be used to simulate populations, which could lead to better, richer agent-based models (ABMs), thus enabling deeper research. (A schematic generative-agent loop is sketched below.)
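To show what ‘generative’ ABMs mean in practice, here is a minimal, schematic agent loop: each agent carries a persona and a memory, and an LLM call decides its next action each round. The `ask_llm` helper is a hypothetical stub so the sketch runs offline; in a real model it would wrap an API call.

```python
# Minimal sketch of a generative agent-based model loop.
# `ask_llm` is a hypothetical placeholder for a real LLM call.
from dataclasses import dataclass, field

def ask_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. via an API client)."""
    return "share the article"  # stubbed response so the sketch runs offline

@dataclass
class Agent:
    name: str
    persona: str
    memory: list[str] = field(default_factory=list)

    def act(self, observation: str) -> str:
        prompt = (
            f"You are {self.name}, {self.persona}.\n"
            f"Recent events: {'; '.join(self.memory[-3:]) or 'none'}.\n"
            f"You observe: {observation}\nWhat do you do next? Reply in one sentence."
        )
        action = ask_llm(prompt)
        self.memory.append(f"saw '{observation}', did '{action}'")
        return action

agents = [Agent("Ana", "a sceptical local journalist"),
          Agent("Ben", "an enthusiastic early adopter")]
for step in range(2):  # tiny simulation loop
    for agent in agents:
        print(step, agent.name, agent.act("a rumour spreading on social media"))
```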
Blending Simulation and Human Experiment
- Claim: A number of articles demonstrate that LLMs are capable of convincingly impersonating humans. Bail highlights this as an opportunity for content creation (e.g. production of text designed to elicit a specific response in human subjects, or creation of two similar images depicting different races). Bail also believes generative AI provides an opportunity to create chatbots that are convincingly human, and to include AI participants with particular traits in research on groups and influence.
| Link | Description | Concerns | Application |
|---|---|---|---|
| Generative Artificial Intelligence in Qualitative Data Analysis: Analyzing—Or Just Chatting? | The authors ask whether genAI is appropriate for qualitative data analysis, concluding that it could potentially ‘introduce unacceptable epistemic risk’ and noting that interpretation of meaning, central to qualitative data analysis, is an ‘inherently human capability’. They state ‘in our evaluation, LLM chatbots consistently failed to code the data efficiently, accurately, reliably and comprehensively, or identify meaningful themes’. | Validity, reliability | Qualitative data analysis |
| Synthetic Replacements for Human Survey Data? The Perils of Large Language Models | Bisbee et al. used ChatGPT 3.5 Turbo to create silicon samples based on real respondents from the ANES, using questions similar to those in the ANES. They considered the mean and variance of responses, the correlation between persona characteristics and responses, and sensitivity to changes in the prompts. They found that their results were not replicable (they showed significant differences due to underlying algorithm changes between April and July 2023). While every synthetic mean was within 1 standard deviation of the ANES results, the synthetic data show less variance, and differences in distributions would have led to different inferences being drawn from the synthetic data. | Reproducibility | Synthetic surveys |
| Whose Opinions Do Language Models Reflect? | Santurkar et al. quantify results gained from using LLMs in comparison to opinion polls. Even when prompted to represent a certain US demographic group, the overall results tend to reflect more liberal, younger, and more educated respondents rather than the general population. The authors show that LLMs are particularly bad at representing some demographic subgroups such as over-65s, Mormons, and the widowed. This study looks at US demographic groups; it is likely these issues extend outside of a US context. | Bias | Synthetic surveys |
| Which Humans? | Atari et al. show that LLMs best reflect WEIRD (Western Educated Industrialised Rich Democratic) societies. | Bias | |
| AI Psychometrics: Assessing the Psychological Profiles of Large Language Models Through Psychometric Inventories | Pellert et al. consider the psychometric profile of LLMs, concluding that LLMs portray traits of extraversion and agreeableness, and don’t show neuroticism. | Bias | |
| The emergence of economic rationality of GPT | Chen et al. show that GPT produces results that are more rational and homogeneous than those of humans. | Bias | |
| MarxistLLM | Nelimarkka fine-tuned an LLM to have a Marxist viewpoint. In doing so, he demonstrated that the base model itself is not neutral and has a ‘viewpoint’. | Bias | |
| Moral Foundations of Large Language Models | Abdulhai et al. use a psychological assessment tool (moral foundations theory) to assess typical responses by LLMs and find that LLMs have a bias towards reflecting politically conservative people. In their paper they also highlight a number of risks arising from their use. | Bias | |
| Using Large Language Models for Qualitative Analysis can Introduce Serious Bias | The authors compare LLMs with expert coding for the analysis of open-ended interviews with a large number of participants (in their case, interviews with Rohingya refugees and their hosts in Bangladesh) and find that LLMs are biased and that the resultant prediction errors are not random. They suggest it is 'preferable to train a bespoke model on a subset of transcripts coded by trained sociologists rather than use an LLM'. | Bias | Text analysis and annotation |
| Performance and biases of Large Language Models in public opinion simulation | Qu and Wang also find that ChatGPT performs better for WEIRD countries (and best for the USA) and demonstrate other biases around gender, ethnicity, age, education, and social class. They use data from the World Values Survey to evaluate the LLM's performance at producing silicon samples and highlight three challenges to address before LLMs can be used in the social sciences: 'global applicability and reliability', 'demographic biases', and 'complexity and choice variability in LLM simulations'. | Bias | Synthetic surveys |
| AI Snake Oil | Narayanan and Kapoor demonstrate that training generative AI tools comes at a human cost, highlighting that the annotators used for many tasks, such as labelling possibly toxic output, are often overworked and underpaid. | Ethics | |
| Consent and Compensation: Resolving Generative AI’s Copyright Crisis | A concern is whether creators have consented to, or been compensated for, their work being used to train genAI tools. | Ethics | |
| Taxonomy of risks posed by language models | Covers 21 ethical and social risks across the areas of ‘Discrimination, Hate speech and Exclusion’, ‘Information Hazards’, ‘Misinformation Harms’, ‘Malicious Uses’, ‘Human-Computer Interaction Harms’, and ‘Environmental and Socioeconomic Harms’. | Ethics, Environmental harms | |
| Out of Context! | Mervaala and Kousa document issues arising from the limitations of ChatGPT’s context window: for example, when the amount of text an LLM can analyse is exceeded, the model may concoct the rest of the analysis based on the beginning of the document. | Hallucination & junk science | |
| ChatGPT for Text Annotation? Mind the Hype! | Ollion et al. conducted a literature review and found mixed results from few- and zero-shot text annotation by ChatGPT “and kin models”. Overall, ChatGPT was generally outperformed by models fine-tuned on human annotations but may perform better than crowdsourced annotation. While different studies demonstrated varied performance by the LLMs, recall was generally better than precision (more false positives than false negatives). One issue raised in the review was that different studies used different metrics to evaluate ChatGPT. Other concerns raised by the authors were the reproducibility of research using ChatGPT for annotation, privacy and copyright considerations, and the further dominance of the English language in research. | Reproducibility | Text analysis and annotation |
| Machine Bias | Boelaert et al. caution against the use of silicon samples, particularly in opinion polling. In their experiment, these samples display random bias for each question, which they term “Machine Bias”. Their experiment, which compares Llama, Mixtral and GPT-4 and attempts to use LLMs to replicate results of the World Values Survey from Australia, Mexico, Germany, Russia and the US, shows that LLMs are bad at predicting attitudes, show no systematic social bias (i.e. the bias does not correlate with a specific social group), and adapt poorly to the sociodemographic properties they are prompted to display. | Bias | Synthetic surveys |
| Systematic testing of three Language Models reveals low language accuracy, absence of response stability, and a yes-response bias | Dentella et al. investigated the claim that LLMs have a ‘human-like language understanding’ by testing LLMs’ ability to learn linguistic phenomena and comparing human and LLM performance in determining whether sentences were grammatical or not. The authors found a ‘yes’ bias from LLMs and instability in their responses, contradicting the claim that LLMs possess ‘human-like’ language ability. The study found that LLMs were unable to produce ‘stable and accurate answers, when asked whether a string of words complies with or deviates from their next-word predictions’, an area humans are ‘universally good’ at. Thus, they argue that LLMs cannot currently be relied on to understand natural language. | Reproducibility, bias | Text analysis |
| Vox Populi, Vox AI? | This study considered whether LLMs could be used to estimate vote choice in Germany by generating a silicon sample based on the 2017 German Longitudinal Election Study. They found that vote choice was not predicted accurately and results indicated a bias towards left-wing parties. Additionally, their results indicated that the LLMs were unable to demonstrate nuances in voting behaviours beyond ‘typical’ voter types. | Bias | Synthetic surveys |
| Knowledge of cultural moral norms in large language models | Ramezani and Xu study whether English-language LLMs ‘understand’ moral norms in various countries using the World Values Survey and PEW global surveys. They find that these are more accurately inferred for Western contexts and show that this can be improved through fine-tuning, at the expense of the models’ ability to estimate moral norms in English-speaking contexts. | Bias | Synthetic surveys |
| Towards Measuring the Representation of Subjective Global Opinions in Language Models | Durmus et al. consider responses generated by LLMs and find that these most closely resemble respondents from the USA, Canada, Australia, and some European and South American countries. When prompted to represent another country, the responses often contained ‘harmful cultural stereotypes’. Additionally, their study shows that the distribution of responses to questions often differs greatly from that of human respondents, as the LLM shows high confidence in a single response where humans hold more diverse viewpoints. | Bias | Synthetic surveys |
| Balancing Large Language Model Alignment and Algorithmic Fidelity in Social Science Research | Lyman et al. caution about the importance of model choice, acknowledging that the set of available models is changing rapidly. Without "social science-specific guidance", it can be difficult to make an informed choice, and they offer a "clear process" to ensure a good choice is made. | | Synthetic surveys |
| Large language models that replace human participants can harmfully misportray and flatten identity groups | Wang et al. show there are two 'inherent limitations' to using LLMs to produce synthetic survey data: misportrayal (LLMs give responses that resemble what out-group members think of a group, rather than what the group thinks of itself) and flattening (LLMs portray groups homogeneously rather than reflecting the diversity of perspectives within a group). A third limitation they suggest is that LLMs may 'essentialise' identities. The authors argue all of these limitations are 'harmful for marginalised demographic groups'. | Bias | Synthetic surveys |
| Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies | The positive findings about the replication of the Milgram experiment listed above come from a study whose authors were also able to replicate the results of the Ultimatum Game and garden-path sentences. However, genAI could not replicate the ‘Wisdom of Crowds’, which shows that not all known human behavioural phenomena will be reflected in genAI output and is a cautionary case against using genAI for their discovery. | | Synthetic experiments |
| Large Language Models Do Not Simulate Human Psychology | Schröder et al. demonstrate that minor changes to question text that alter its semantic meaning significantly change human responses, but not those of LLMs. This demonstrates that LLMs cannot be relied upon to simulate human psychology. | | Synthetic experiments |
| Marked Personas | Cheng et al. look at the stereotypes found in LLM output and present a method called 'Marked Personas' which measures these. The method first prompts the LLM to generate personas and then identifies words which distinguish groups. They find higher rates of racial stereotypes in LLM output (compared with human-written personas from the same prompts) and evidence of 'othering and exoticising' non-white, non-male groups. | Bias | |
| AI generates covertly racist decisions about people based on their dialect | The authors show 'covert racism' within LLMs, such as dialect prejudice (for example, applying negative raciolinguistic stereotypes). The results of this prejudice could include racist job-application and sentencing outcomes if LLMs are used in these settings. | Bias | |
| Open Letter | In October 2025, 416 qualitative researchers co-signed an open letter, ‘We reject the use of generative artificial intelligence for reflexive qualitative research’. There has been active debate around the letter. | | Reflexive qualitative analysis |
| Randomness Not Representation | The authors tested whether LLMs have stable, coherent, and steerable preferences, finding that they instead display instability under minor changes in prompting, are incoherent (alignment with one culture does not reliably predict alignment on other issues), and behave erratically when prompted to adopt specific cultural perspectives. | Bias, reproducibility, reliability | Synthetic surveys |