- AI tools like OpenAI Whisper and GPT-3 can be used to improve social science research workflows by helping to collect and analyse speech and text data.
- In this article, I describe two worked examples where I applied AI tools to (1) transcribe and (2) conduct basic thematic analysis of a research interview, and provide enough detail for readers to replicate and adapt my approach.
- OpenAI Whisper (example) created a high quality English transcription of a 30 minute research interview at a ~70x cost saving compared to a human transcriber.
- GPT-3 (text-davinci-003; example) answered a research question and identified relevant themes from a transcribed research interview, after providing a structured prompt and one example.
- These tools, when chained together with human oversight, can be considered an early, weak example of PASTA (Process for Automating Scientific and Technological Advancement).
Social science research workflows involve a lot of speech and text data that is laborious to collect and analyse
The daily practice of social science research involves a lot of talking, reading and writing. In my applied behaviour science research consulting role at Monash University and through Ready Research, I generate or participate in the generation of a huge amount of speech and text data. This includes highly structured research activities such as interviews, surveys, observation and experiments; but also less structured research activities like workshops and meetings.
Some fictionalised examples of work I’ve done in the past year:
- Research interviews with 20 regular city commuters to understand what influences their commuting behaviour post-COVID, to assist a public transit authority in planning and operating its services efficiently
- Practitioner interviews with staff from city, regional and rural local governments to assess organisational readiness for flood preparation and response
- Workshop of 5-10 people involved in hospital sepsis care, each representing a different interest (e.g., patients, clinicians, researchers, funders) to identify priority areas to direct $5M research funding
- Survey of 5,000 Australians to understand the impacts and experiences of living under lockdown in Melbourne, Australia during COVID-19
- Evaluation interviews with 4 participants in the AGI Safety Fundamentals course to understand the most significant change in their knowledge, skills, or behaviours as a result of their participation
To make this data useful it needs to be collected, processed, organised, structured and analysed. The typical workflow for these kinds of activities involves taking written notes during the research activity, or recording the audio / video research activity and reviewing the recording later. Interviews are sometimes transcribed by a paid service for later analysis. Other times they are transcribed by the researcher.
The amount of speech and text data generated during research activity is large - each research activity yields thousands of words. The sheer volume of data can be overwhelming and daunting, making it difficult to carry out analysis in any meaningful way. In addition, sometimes data just isn’t collected (e.g., during an interview or workshop) because the researcher is busy facilitating / listening / processing / connecting with research participants.
Even for data that is collected, managing and analysing it is a challenge. Specialised programs such as NVivo are used in social science to manage and analyse text data, but less structured research activities would almost never be managed or analysed through this kind of program, because of the time and skills required. Open text data in surveys may be hand coded based on content or theme, if there is time.
Faster approaches to collecting or analysing data could significantly improve the output of research workflows. For example:
- Researchers could focus on doing the research, rather than documenting it. If researchers could focus on building a rapport with interviewees and asking more well-considered follow-up questions (instead of taking notes) because they could trust in an accurate transcription of the interview, the interviewee may provide answers that better answer the research question; or the researcher time saved could be used to interview more and more diverse participants.
- Unstructured research activities like meetings and workshops could generate useful data. If workshops could be recorded, made searchable, and summarised, this could help researchers and participants recall what happened during the workshop and apply the knowledge or decisions generated through the workshop in subsequent research activities.
- Qualitative research methods could be used more widely. Text-based data has a reputation for being complicated and time-consuming to analyse, leading to a bias towards quantitative evaluations that can miss out on a ‘thick’ understanding of the impact of an intervention.
This is my motivation for experimenting with AI tools such as OpenAI Whisper (speech to text) and GPT-3 (large language model) to improve social science research workflows. The rest of this article describes two worked examples where I used these AI tools to transcribe and analyse evaluation interviews for the AGI Safety Fundamentals course. I received permission to share the transcript from one interviewee.
Worked example 1: speech to text with OpenAI Whisper
I experimented with OpenAI Whisper running on a Hugging Face space to transcribe research interviews of about 30 minutes. I found that Whisper is extremely capable in speech to text transcription, and could effectively replace human transcribing services (at 70x cost saving) for most research interviews where transcription of exact utterances (e.g., hesitations, pauses) is not required.
Use case: recording and transcribing research interviews
The use case for this example was several 30 minute interviews I conducted to evaluate participant experience & outcomes with the Artificial General Intelligence Safety Fundamentals course (AGISF) in late 2022. I designed new AI governance course materials and facilitated a small cohort of Australian and New Zealand participants with support from Chris Leong (AI Safety ANZ) and the Good Ancestors Project. I had three evaluation questions.
First, I wanted to understand whether participating in AGISF had an impact on participants’ behaviours and behavioural influences around AI safety. The evaluation method I used for this was a version of Most Significant Change, which is a participatory evaluation method that asks people to identify and share personal accounts of change.
Second, I wanted to improve my facilitation practice by understanding what participants found helpful in supporting their learning and experience during the course. I’m also designing a new version of facilitation training for Blue Dot Impact (who run AGISF), and wanted to understand which elements were most relevant. I asked participants to imagine themselves in the role of a future facilitator of the course in order to elicit more reflective feedback about what they would keep or change in their experience of my facilitation.
Third, I wanted to hear any other comments participants had about their existing course feedback, which was collected via Google Forms. Several weeks after they completed a short survey, I shared each participant’s individual responses with them and asked if they had any comments, reflections or elaborations.
List of evaluation questions asked in the interview:
- Looking back over the AGISF course, what do you think was the most significant change in your knowledge, beliefs, skills, or actions when it comes to AGI alignment / governance?
- If you were to facilitate a group for AGISF in the future, what would you keep / change, based on your experience in this course?
- Would you like to comment on any of your end of survey responses?
Once I had the transcripts, I conducted a basic thematic analysis (see worked example 2).
How I used Whisper
I used a web application called whisper-webui on Hugging Face to run OpenAI Whisper on an audio file, which generated a transcript. But what is Whisper and what is Hugging Face?
In September 2022, OpenAI released a speech-to-text transformer model called Whisper (detail / demo). It can transcribe speech to text in many languages, as well as translate non-English speech to English text. This model was trained on 680,000 hours of English and non-English audio.
Hugging Face is a service that is primarily designed to support developers to do ML training and inference by hosting models, datasets, and applications (“spaces”), and providing access to compute. However, anyone can duplicate a public space and modify its code (similar to forking a GitHub repository). I accessed Whisper through Hugging Face by duplicating Kristian Stangeland’s aadnk/whisper-webui space. I removed the 10 minute limit on the length of audio input, and paid for better compute at a cost of USD $0.90 per hour.
I conducted the research interviews over Zoom and recorded them with interviewees’ consent. I uploaded the recordings to my private Hugging Face space and ran Whisper through the web interface. For a 25 minute interview and the largest Whisper model (large-v2), this took about 7.5 minutes and cost about USD $0.23. Once it was finished, I downloaded the raw transcripts. Finally, I read through the transcripts and added line breaks and text (e.g., “INTERVIEWER:”) to distinguish between speakers for ease of reading, which took about 5 minutes. Price and timing details are in a footnote.
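For readers who would rather script this step than use a web interface, here is a minimal sketch using the open-source `openai-whisper` Python package. This is an alternative route, not the whisper-webui space described above. It assumes the package and ffmpeg are installed locally. The file name `interview.mp3` is a placeholder, and the timestamp formatting helpers are illustrative additions.

```python
def format_timestamp(seconds: float) -> str:
    """Convert a time in seconds to M:SS so lines are easy to find in the recording."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


def segments_to_lines(segments: list[dict]) -> list[str]:
    """Turn Whisper-style segments (dicts with 'start' and 'text') into readable lines."""
    return [f"[{format_timestamp(s['start'])}] {s['text'].strip()}" for s in segments]


def transcribe(audio_path: str, model_name: str = "large-v2") -> list[str]:
    """Transcribe an English audio file and return timestamped lines."""
    import whisper  # pip install openai-whisper; imported lazily because it is a large dependency

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, language="en")
    return segments_to_lines(result["segments"])


# Usage (placeholder file name):
#   for line in transcribe("interview.mp3"):
#       print(line)
```

The output still has no speaker labels, so the manual pass to add “INTERVIEWER:” and “PARTICIPANT:” markers would still be needed.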
In this section I present an excerpt of the transcript for evaluation question 1 (“what was the most significant change?”). You can read the full transcript as a Google doc with the permission of the interviewee.
INTERVIEWER: Cool. So yeah, I mean the first question is really just looking back over the AGI safety fundamentals course, what do you think was the most significant change in your knowledge police. skills or actions when it comes to AI alignment and governance?
PARTICIPANT: I think the main thing is one we talked about during the course, which is that I feel more AI-pilled was the phrase that I kept using and would still use. But yeah, basically before that point I would have said AI safety is something that we should care about, it's something that we should probably have some people doing some research on and trying to figure things out. But I wouldn't have gone much further than that. If someone asked like why do you think it's important, I would probably say, you know, it seems plausible that we could create an intelligence that will be better than human average intelligence. And that's something that we should at least worry somewhat about and have some people like looking into. But now I think yeah, I feel more persuaded that this is like a pressing issue and something that I could potentially contribute to and should be, not should be, but could be talking to people more about who are interested in this particular cause area. I still have quite a lot of uncertainty, I would say, and I still am not fully on the side of like everyone should drop everything now and work on AI safety, which I don't think you are either. And I don't think that came across in the course, but I have met some EAs who are like that. I'm not there yet, but I am like this is a priority and something that we should be trying to get more people to study and research and care about.
INTERVIEWER: Okay, that's really helpful to hear. Thank you. So, if I understand what you're saying, you... I think that AI safety is now a more pressing concern than you did previously, and that you... When you said, you said, should and then could, were you talking about it in terms of feeling more equipped and confident to have those discussions or feeling more like motivated or like it was needed to have those discussions with people who maybe don't have those thoughts on AI safety?
PARTICIPANT: Both, I would say. I was thinking... I used the word should initially because I was thinking these are the types of conversations that someone should be having with other people, and as a community builder currently, like that person is probably me in a lot of cases, but also I feel more capable to have those conversations, so I've changed the word to could.
Room for improvement / roadmap
- Speaker identification. This would reduce the extra time it takes to read through the transcript and format it by speaker. One example of a service that does speaker identification well is Otter.ai, although the quality of its transcription is worse than Whisper’s. I have been experimenting with apps that combine Whisper with other models to do speaker identification (referred to as “speaker diarization”), e.g., dwarkesh/whisper-speaker-recognition, Majdoddin/nlp. These show some promise but don’t reliably detect speakers in a 20 minute recording.
- Dealing with overlapping speakers. This is common in a workshop or meeting setting, especially one that isn’t formally facilitated (where there is direction or enforcement of a single speaker talking at a time).
- User experience and integration. Setting transcription options by hand each time is inconvenient. Email or app integration to automate the process would be excellent.
- Reliability. Overall the quality of transcription was excellent - as good as I could have produced by hand. However, in some spot checks I noticed times when Whisper would fail to transcribe a sentence altogether, or add additional words to a sentence that the speaker didn’t say. This happened on one occasion across a 30 minute interview, but not in every interview or every app implementation of Whisper I tried.
- Structure, summarisation and analysis. Automatic recognition of topic or evaluation question change, aggregation across interviews, and other research administration tasks would make this kind of tool much easier to integrate into a social science research workflow.
Worked example 2: text analysis with GPT-3
I experimented with GPT-3 running on OpenAI Playground to conduct summarisation and thematic analysis of a transcribed research interview. I found that providing structured prompts and one example led to useful text analysis output for a single interview.
Use case: analysing research interviews
The use case for this example is the same as in Worked example 1. I had conducted several interviews with participants in an AGI Safety Fundamentals course (AGISF). I transcribed them with OpenAI Whisper using a Hugging Face interface. I had already read over the transcripts but was looking for a low-fidelity, low-effort way to summarise and extract information and identify insights for future decision-making. I modelled my approach on content analysis and thematic analysis, which I use frequently in dealing with text data.
How I used GPT-3
GPT-3 is a large language model initially released in 2020 by OpenAI and updated several times, most recently in November 2022 (as text-davinci-003) alongside the release of ChatGPT (see Yu 2022 for a detailed history). Alongside language generation, GPT-3 can do in-context learning, which makes it effective at completing tasks from very few or zero examples. I used the OpenAI GPT-3 Playground to create a template for analysis of an interview transcription: Basic text analysis from transcript.
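The template itself is linked rather than reproduced here, so the following is only a hypothetical sketch of how a structured prompt with one worked example could be assembled. The instruction wording, the section labels, and the function name `build_prompt` are all assumptions for illustration, not the contents of the actual template.

```python
def build_prompt(question: str, example_transcript: str,
                 example_analysis: str, new_transcript: str) -> str:
    """Assemble a one-shot prompt: instructions, one worked example, then the new transcript."""
    instructions = (
        "You are analysing a research interview transcript.\n"
        "Write a summary of the response to the research question, "
        "then a table of themes relevant to the research question."
    )  # hypothetical wording
    return "\n\n".join([
        instructions,
        # The single worked example shows the model the expected output format.
        f"Research question: {question}",
        f"Transcript:\n{example_transcript}",
        f"Analysis:\n{example_analysis}",
        # The new section to analyse, ending where the model should continue.
        f"Research question: {question}",
        f"Transcript:\n{new_transcript}",
        "Analysis:",
    ])
```

Ending the prompt on an open “Analysis:” label is one common way to get a completion model to continue in the same format as the example.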
I copied a section of the Whisper-generated transcript corresponding to a single research topic / question into the template, then ran the model in Playground. Each section of the transcript was about 1,000 tokens. The template (including 1 example of transcript + analysis) was about 1,500 tokens. The completion window was set at a limit of 1,000 tokens. This meant that analysis for each section of the transcript was about 3,500 tokens, took about 15 seconds, and cost about USD $0.07.
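The token and cost figures above can be reproduced with a few lines of arithmetic. This sketch assumes a price of USD $0.02 per 1,000 tokens, which is the rate implied by the figures in the text (3,500 tokens for about $0.07); check current pricing before relying on it.

```python
PRICE_PER_1K_TOKENS_USD = 0.02  # assumption: rate implied by the figures in the text


def estimate_cost(template_tokens: int, transcript_tokens: int,
                  completion_limit: int,
                  price_per_1k: float = PRICE_PER_1K_TOKENS_USD) -> tuple[int, float]:
    """Return (total tokens, cost in USD) for one transcript section.

    This is an upper bound: the completion may stop before reaching its limit.
    """
    total_tokens = template_tokens + transcript_tokens + completion_limit
    return total_tokens, total_tokens * price_per_1k / 1000


tokens, cost = estimate_cost(template_tokens=1500, transcript_tokens=1000,
                             completion_limit=1000)
print(f"{tokens} tokens, about USD ${cost:.2f}")
```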
Research (evaluation) question
Looking back over the AGISF course, what do you think was the most significant change in your knowledge, beliefs, skills, or actions when it comes to AGI alignment / governance?
Summary of response to research question
This summary was generated by GPT-3, based on the Whisper-generated transcript for the research question.
The most significant change in the course participant's knowledge, beliefs, skills and actions when it comes to AI alignment and governance was that they felt more assured in their understanding of the importance of the topic, became more confident in their ability to talk and engage in conversations about it, and began actively looking for roles related to AI safety. The participant had also had a conversation with the EA City community about the topic as a result of taking the course, inspiring others to look into the same material.
Themes relevant to research question
This table was generated by GPT-3, based on the Whisper-generated transcript for the research question.
| Theme | Description |
|---|---|
| Increased assurance of importance of AI safety | Participant had a greater appreciation for the importance of AI safety and felt more motivated to contribute to it. |
| Increased confidence in ability to talk about AI safety | Participant felt more confident in their ability to have conversations about the topic and engage in debates about it. |
| Actively looking for roles related to AI safety | Participant began looking more seriously into roles related to AI safety, and applied for one role in particular. |
| Inspiring conversation with EA City Community | Participant had a conversation with the EA City Community about AI safety, which inspired others to look into the material. |
Room for improvement / roadmap
- Aggregation and synthesis of responses. The token limit and manual use of GPT-3 Playground means that aggregation of multiple responses is done manually. Because my use case involved only a few participants, I just manually analysed the summaries and themes across the set of participants. But if responses could be aggregated and synthesised (with a human in the loop to review intermediary steps), this could scale to many more interviews.
- User experience and integration. Embedding the saved prompt and settings into an application using the GPT-3 API would be much faster and easier to use.
- Richer understanding of research context and purpose. I have always struggled to understand just how much context is ‘useful’ to provide for these kinds of analyses. The abstract to the paper I’m trying to write? A set of hypotheses to test? I haven’t seen good examples of this kind of work, although I think that Elicit might have it on its roadmap.
- Research administration tasks. Automatically recognising when the topic / research question changes, combining responses across interviews, and other research administration tasks would make this kind of tool much easier to integrate into a social science research workflow.
Social science research generates a lot of formal and informal speech and text data. Due to the volume, this data isn’t always collected. When it is collected, it is often not analysed, even in a basic way. AI tools such as Whisper and GPT-3 can be used to improve research workflows by transcribing speech and analysing text data, at least for tasks where speed / efficiency is the priority over rigour / sophistication.
I experimented with these tools and found that Whisper is highly capable in speech to text transcription and can effectively replace human transcribing services, and GPT-3 (text-davinci-003) can be used for summarisation and basic thematic analysis of a transcribed research interview, if provided with structured prompts and one example. These tools - or their next iterations - can be considered an early, weak example of PASTA (Process for Automating Scientific and Technological Advancement).
Thank you to Michael Noetel, Emily Grundy, Peter Slattery, and Dewi Erwan for their helpful feedback on this article.
In the four weeks since drafting this post, several other Hugging Face spaces have been created that seem to work more efficiently and also detect speakers, such as vumichen/whisper-speaker-diarization.
Note: Dated Jan 2023. Personal experience and low-quality speculation. Corrections welcome!
I purchased compute from Hugging Face to speed up transcription. I selected the T4 medium option, which is USD $0.90 per hour to run and is a “Nvidia T4 with 8 vCPU and 30GiB RAM”.
As far as I can tell, a Hugging Face space uses CPU / GPU time as long as it is active, even if it is idle. The user can specify a sleep time, where the space is paused after [sleep time] minutes idle. Early on, I encountered an issue where if the sleep time was shorter than the time taken to complete transcription, the job would fail. I set my sleep time to 15 minutes (or about 1 hour+ of audio) to try and avoid this issue. This meant that each occasion the space was activated, it cost no less than 15 minutes of compute time. Therefore, the price for any individual transcription is about USD $0.23 ($0.90 x 0.25 hours). This would be cheaper if multiple transcriptions were batched together.
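The estimate above, and the effect of batching, can be sketched as follows. The billing model is a simplifying assumption that mirrors the reasoning in this footnote: each activation is billed for whichever is longer, the sleep time or the total processing time. It is not a statement of how Hugging Face actually meters usage.

```python
def transcription_cost(job_minutes: list[float], sleep_minutes: float = 15.0,
                       rate_per_hour: float = 0.90) -> tuple[float, float]:
    """Return (total cost, cost per job) in USD for jobs run in one activation.

    Assumption: billed time is the longer of the sleep time and the
    combined processing time of all jobs in the batch.
    """
    billed_minutes = max(sleep_minutes, sum(job_minutes))
    total = rate_per_hour * billed_minutes / 60
    return total, total / len(job_minutes)


single_total, _ = transcription_cost([7.5])          # one 25 minute interview
_, batched_each = transcription_cost([7.5, 7.5])     # two interviews in one activation
print(f"single: ${single_total:.3f}, batched: ${batched_each:.4f} each")
```

Under this assumption, two 7.5 minute jobs fit inside the same 15 minute minimum, which halves the cost per transcript.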
The actual time to run a transcription job depended on audio length, model size and parameters. I used either the medium or large-v2 Whisper model. The only parameter I varied was voice activity detection (VAD) aggregation, either 30s or 60s. A cursory inspection found that the 60s aggregation seemed to have fewer problems in formatting sentences (I found that Whisper was often cutting off sentences early [e.g., “That’s where I found myself” → “That’s where I. Found myself”]).
I tried running each combination of model and VAD aggregation to understand the processing time and compare it to my subjective assessment of transcription accuracy. I spot checked the generated transcripts against each other (unblinded).
Time to transcribe a 25 minute English audio file, by model and Voice Activity Detection (VAD) aggregation

| Model | VAD aggregation 30 seconds | VAD aggregation 60 seconds |
|---|---|---|
| medium | 262s (4:22) | 252s (4:12) |
| large-v2 | 340s (5:40) | 444s (7:24) |
Overall, because the cost of a single job had a lower bound of 15 minutes of compute, the slower large-v2 model cost no more than the medium model for a single interview, so I would recommend using large-v2 with VAD aggregation set at 60s.
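To put the timings in context, this sketch converts them into a speed relative to real time and estimates how much audio fits inside a 15 minute compute window. It assumes processing time scales linearly with audio length, which was not tested here, so treat the results as rough guides only.

```python
AUDIO_SECONDS = 25 * 60  # length of the test recording

# Processing time in seconds, keyed by (model, VAD aggregation in seconds)
TIMINGS = {
    ("medium", 30): 262,
    ("medium", 60): 252,
    ("large-v2", 30): 340,
    ("large-v2", 60): 444,
}


def realtime_factor(processing_seconds: float,
                    audio_seconds: float = AUDIO_SECONDS) -> float:
    """How many seconds of audio are transcribed per second of compute."""
    return audio_seconds / processing_seconds


def audio_minutes_per_window(processing_seconds: float,
                             window_minutes: float = 15.0) -> float:
    """Minutes of audio that fit in one compute window, assuming linear scaling."""
    return window_minutes * realtime_factor(processing_seconds)


for (model, vad), seconds in TIMINGS.items():
    print(f"{model}, VAD {vad}s: {realtime_factor(seconds):.2f}x real time, "
          f"about {audio_minutes_per_window(seconds):.0f} min of audio per 15 min window")
```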
Consider using Conjecture’s new Verbalize (https://lemm.ai/home) STT tool for transcriptions! They’ll be adding some LLM features on top of it, and I expect it to have some cool features coming out this year.
I’ve also been pushing (for a while) for more people within EA to start thinking of ways to apply LLMs to our work. After ChatGPT, some people started saying similar stuff so I’m glad people are starting to see the opportunity.
Thanks Jacques, I'll need to check this out. Appreciate the pointer and keen to hear more about an LLM layer on this (e.g., identifying action items or summarising key decision points in a meeting, etc).
A couple of services that make it extremely easy to run machine learning models like Whisper include:
Thanks for writing this up!
I find the claim that all of this is an early and preliminary example of (a) PASTA (Process for Automating Scientific and Technological Advancement) to be pretty interesting.
I hadn't made that connection. Whisper and GPT-3 will almost certainly help to accelerate science (especially if used alongside other tools and improved) and there is already related discussion of how they are going to affect science.
Now I wonder what Holden thinks the threshold for a 'PASTA' is and whether he'd agree that this is an example?
I had wondered if it was too hyperbolic to claim that this was an example of proto- or early-PASTA. My earlier draft hedged and said that the next version of these tools would be something like an early PASTA. I would characterise Holden Karnofsky's post introducing PASTA as describing an agentic system that could improve by making copies of itself and improving itself.
However, when he first introduces the idea of 'explosive' scientific and technological advancement, it's through the thought experiment of creating digital people, which means that many more minds can be allocated to different research problems.
I would argue that using Whisper or GPT-3 in the way I've described in this article is applying a kind of information processing system that in a very limited sense, is similar to allocating another mind to the research problem of capturing and analysing speech & text data - because it essentially replaced me or another researcher doing the task. This is especially the case when chaining tools together with (for now) human supervision. This allows Whisper (language processing module) and GPT-3 with prompting (summarisation and analysis module) to combine for more useful 'mind-replacement' than either alone.