Sentiment analysis with the OpenAI API - Part 1

In this Python post, I share my experience of accessing an IMDb review dataset from Hugging Face, and describe my setup for accessing the OpenAI API for sentiment analysis.

Note (February 11, 2024)

A couple of changes from OpenAI mean that the API-calling functions described and used in this blog, and Part 2, no longer work. Additionally, the model text-davinci-003, which I experimented with below, and which displayed some pretty quirky behaviour, has now been deprecated. I’ve added some comments on this, as well as an updated function which will work with gpt-3.5-turbo, in Part 3 – I’ve also added some notes there on the secure management of API keys in Python.

Introduction

This is the first of three blogs on my experience of, and thoughts about, using the OpenAI API for sentiment analysis. Initially I just wanted to experiment with API calls, but I discovered some unexpected behaviour with one of the models and this prompted me to explore things further. Although I don’t imagine that what I’ve discovered is original, it does have implications for anyone considering using the OpenAI API seriously for classification tasks in general, not just for sentiment analysis. Given the eagerness of many businesses to “harness” generative AI, I thought this was worth exploring and documenting.

The blogs are organised as follows:

- In this blog, I explain how I accessed the IMDb dataset I used, and also the API. There are lots of guides online on how to use this API, so I won’t go into this in too much detail (although I do look closely at the functions which enable interaction with the OpenAI API). The functions I use, together with brief commentary, are in the Appendix below for anyone who is interested. I primarily use these in the second blog, and they should hopefully make it easier to replicate what I present there.
- In Part 2, I document some experiments with the endpoints for the text-davinci-003 and GPT-3.5-turbo large language models (LLMs). The text-davinci-003 model, which now has “legacy” status and won’t be updated by OpenAI, gave the strangest results (with a tiny change in a prompt reversing the sentiment classification for one review). The comparison with GPT-3.5-turbo, which most people would probably use for sentiment analysis now, was also revealing and I think highlights additional concerns about this newer model.

One point, for the sake of transparency, is that I’m now using GPT4 as a “co-coding partner”. This contrasts with my R blogs, which were written in what’s now the “old-school” way (i.e. on my own, admittedly with occasional reference to Stack Exchange, DataCamp notes etc). In my view, life is too short not to use amazing technology like this. That said, I’ve found that GPT4 sometimes makes code more complicated than it needs to be, or goes down unhelpful cul-de-sacs. I think it’s important to know what you want, how to achieve it in an economical way, and what’s going on with any code, to get the best out of tech like this (which is one of the reasons I was keen to provide detailed, human-generated, explanations of the code). Oh, and I still look at Stack Exchange etc!

Here, then, are notes on how I accessed the dataset and the OpenAI API.

Getting the data

The dataset I used was downloaded from Hugging Face. For the legally-minded, IMDb permits subsets of their data to be used for personal, non-commercial, purposes, and this use falls into that category. The dataset can be downloaded, or loaded from a local cache if already downloaded, as follows:

from datasets import load_dataset
imdb = load_dataset("scikit-learn/imdb")

More information on the load_dataset() function can be found here on Hugging Face. Running this code creates a DatasetDict object, similar to a Python dictionary, with this structure:

DatasetDict({
    train: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 50000
    })
})

I’m only interested in the reviews here, not the sentiment labels which were produced using an older approach to sentiment classification.¹ These can be extracted with the command imdb['train']['review'][:slice_size], where slice_size is some numeric value. I discovered some interesting things with only 20 reviews, and I’ve set 20 as the default slice_size within the classify_sentiment function. Inside the function, this command extracts the reviews into a list of strings which are then passed, within a loop, to the API.
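As a minimal sketch of that slicing step, the snippet below uses a small stand-in dictionary in place of the real DatasetDict, so it runs without downloading anything. The three short reviews are invented placeholders, not rows from the dataset; the indexing expression is the same one used in the Appendix functions.

```python
# Stand-in for the object returned by load_dataset(); the reviews here
# are invented placeholders, not rows from the real IMDb data.
imdb = {
    "train": {
        "review": [
            "A wonderful film.",
            "Dull and far too long.",
            "Not bad, not great.",
        ],
        "sentiment": ["positive", "negative", "positive"],
    }
}

slice_size = 2  # number of reviews to keep

# Select the 'review' column of the 'train' split, then keep the first slice_size items
reviews = imdb["train"]["review"][:slice_size]

print(len(reviews), reviews)
```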

Interacting with the API

In Part 2 I’ll examine the output of two models: the older text-davinci-003 model, which provides faster responses and is still available (although, as mentioned above, has “legacy” status and is no longer being updated), and the newer GPT-3.5-turbo model, which provides slightly slower responses, but is a tenth of the cost of text-davinci-003 (and, as of writing, is still being updated by OpenAI).

My functions for these API calls are called classify_sentimentDavinci() and classify_sentimentTurbo() respectively (see Appendix for full details).

Some brief comments about these functions. They are almost identical, apart from the code relating to API-interaction and response-extraction. For classify_sentimentDavinci(), this part is:

response = openai.Completion.create(
            model="text-davinci-003",
            prompt=full_prompt,
            max_tokens=2,
            temperature = 0
        )

This code, which is part of a for-loop, specifies the model, passes full_prompt to it, specifies that up to 2 tokens can be used in the model output, and sets temperature to zero.

The variable full_prompt is formed by concatenating, just prior to this code chunk, the main prompt text (e.g. “Classify the sentiment of the following review as negative or positive”) and the text of a particular review.
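To make the concatenation concrete, here is a small sketch using the same f-string pattern as the Appendix functions. The review text is a hypothetical example value; in the real functions it comes from the dataset inside the loop.

```python
# Hypothetical example values; the real review text comes from the dataset
prompt = "Classify the sentiment of the following review as negative or positive"
review = "A wonderful little production."

# Instruction, colon, review text, then a "Sentiment:" cue on a new line
# for the model to complete
full_prompt = f"{prompt}: {review}\nSentiment:"

print(full_prompt)
```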

I’ve set max_tokens to 2 to cater for the words “negative”, “positive” and “neutral”, which are exactly one token long, and also for my later use of “borderline” which is two tokens long. This can all be confirmed here.

It’s important to explicitly set temperature to 0 for sentiment analysis - I’ll say more about this in Part 2.

The equivalent part of the second function, classify_sentimentTurbo(), is:

response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": full_prompt}],
            max_tokens=2,
            temperature = 0
        )

This is similar to the last example, with some key differences:

- Most obviously, the openai.ChatCompletion.create() endpoint needs to be used.
- Besides specifying gpt-3.5-turbo as the model, prompts are passed in a different way, i.e. as a list of dictionaries. This time, I’m passing prompts in my “role” as a “user” (an option not available with the openai.Completion.create() endpoint).
  - It’s also possible to provide a “system” message e.g. {'role': 'system', 'content': "You are an expert at sentiment classification"}. This, though, is optional and I’ve not used it here. I’ll return to how providing a “system” message might affect behaviour in Part 3.
- The other difference is with response-extraction. With the text-davinci-003 endpoint, response information is nested within a dictionary within a list within a dictionary, so we need response['choices'][0]['text'] to extract it. With the openai.ChatCompletion.create() endpoint, the response information is even more deeply nested, within another dictionary, so we end up needing response['choices'][0]['message']['content'] to extract the sentiment predictions. Convoluted, but once it’s operationalised within a function it can happily be forgotten about!
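The two extraction paths can be tried out offline. The dictionaries below are hand-written stand-ins shaped like the responses described above (real responses carry extra fields such as ids and token usage, omitted here), so this is only an illustration of the nesting, not actual API output.

```python
# Hand-written stand-ins for the two response shapes (not real API output)
completion_response = {"choices": [{"text": " Positive"}]}
chat_response = {
    "choices": [{"message": {"role": "assistant", "content": "negative"}}]
}

# Completion-style: dictionary -> list -> dictionary -> text
davinci_label = completion_response["choices"][0]["text"].strip().lower()

# Chat-style: one extra dictionary level ('message') before the text
turbo_label = chat_response["choices"][0]["message"]["content"].strip().lower()

print(davinci_label, turbo_label)
```

The .strip().lower() calls mirror the clean-up applied in the Appendix functions, so stray white space or capitalisation in the reply doesn’t produce spurious extra labels.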

The output from the classify_sentimentDavinci() and classify_sentimentTurbo() functions, after some further manipulation, is a DataFrame with two columns, e.g.:

df1.head()

##                                               Review Sentiment
## 0  One of the other reviewers has mentioned that ...  positive
## 1  A wonderful little production. <br /><br />The...  positive
## 2  I thought this was a wonderful way to spend ti...  positive
## 3  Basically there's a family where a little boy ...  negative
## 4  Petter Mattei's "Love in the Time of Money" is...  positive

Appendix - functions

The two functions are almost identical (I suppose they could have been combined…)

Note that for the API calls you’ll need an API key, which will require signing up for an account with OpenAI. There are different ways to set an API key (for a detailed reference, see here). In my experiments I used this approach:

import openai

openai.api_key = "<YOUR API KEY>"
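Hard-coding a key in a script makes it easy to leak by accident. One common alternative, sketched here as an assumption rather than the setup used for these experiments, is to read the key from an environment variable. The helper name get_api_key is hypothetical; OPENAI_API_KEY is the variable name the openai package looks for by default.

```python
import os

def get_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable.

    Raises a RuntimeError with a readable message if the variable
    is unset or empty.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return key

# Usage (assumes the variable was exported in the shell beforehand):
# openai.api_key = get_api_key()
```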

Here are the two functions I used for API calls (which both rely on pandas, so I’ve included this import at the start):

For `text-davinci-003`

This is the function I used to access text-davinci-003, classify_sentimentDavinci().

import pandas as pd

def classify_sentimentDavinci(prompt, slice_size=20):
    # Initialize list to store predicted sentiments
    predicted_sentiments = []

    # Extract reviews based on the slice_size
    reviews = imdb['train']['review'][:slice_size]

    # Iterate over the sliced items
    for review in reviews:
        # Construct the full prompt
        full_prompt = f"{prompt}: {review}\nSentiment:"
        
        # Pass the full prompt to OpenAI API for sentiment analysis
        response = openai.Completion.create(
            model="text-davinci-003",
            prompt=full_prompt,
            max_tokens=2,
            temperature = 0
        )
        # Extract the sentiment label from the output (removing any white space, and ensuring lower case)
        predicted_sentiment = response['choices'][0]['text'].strip().lower()
        
        # Add the predicted sentiment to the list
        predicted_sentiments.append(predicted_sentiment)

    # Create a DataFrame from the reviews and predicted sentiment labels
    df = pd.DataFrame({
        'Review': reviews,
        'Sentiment': predicted_sentiments
    })

    return df

Explanation

The function takes 2 arguments:

- prompt (a string), which is the main prompt e.g. “Classify the sentiment of the following review as negative or positive”
- slice_size (a numeric value), which is the number of reviews to process (defaults to 20)

Within the function, predicted_sentiments, an empty list, is initially created - this will be used to store API responses (see below).

Then, within a for loop:

- full_prompt is created by concatenating the main prompt with each review
- this version of the prompt is passed to the API, as discussed earlier in the blog
- the response from the API is extracted (as discussed earlier) and stored in the variable predicted_sentiment
- each predicted_sentiment is appended to the list created earlier, predicted_sentiments, one for each iteration of the loop

Finally, after the loop has completed, a pandas DataFrame is created, and returned, with 2 columns:

- ‘Review’, containing the original reviews which were fed into the loop earlier
- ‘Sentiment’, containing the sentiment labels initially stored in predicted_sentiments

Best practice would normally be to include this information within a docstring at the start of the function. As this is a super-bespoke function which only works with a DatasetDict object called imdb, where ‘train’ must be used as a key to access information stored in ‘review’, and more work would be needed to make it more generalisable, I decided this was probably overkill. Obviously, if you find this function useful, please feel free to edit it for your own purposes.
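For anyone who does want to generalise it, here is one possible sketch (not the function used in these blogs). It takes the reviews as a plain list and the API call as a function argument, and carries the explanation above in a docstring. The names classify_reviews, classify_one and fake_classifier are hypothetical, and it returns a list of dictionaries rather than a DataFrame to keep the example dependency-free.

```python
def classify_reviews(reviews, prompt, classify_one):
    """Classify the sentiment of each review.

    Args:
        reviews: list of review strings.
        prompt: main instruction, e.g. "Classify the sentiment of the
            following review as negative or positive".
        classify_one: function taking the full prompt string and returning
            the model's raw text reply (this is where an API call would go).

    Returns:
        A list of {'Review': ..., 'Sentiment': ...} dictionaries,
        one per review, in the original order.
    """
    results = []
    for review in reviews:
        full_prompt = f"{prompt}: {review}\nSentiment:"
        # Normalise the reply the same way as the original functions
        sentiment = classify_one(full_prompt).strip().lower()
        results.append({"Review": review, "Sentiment": sentiment})
    return results


# Stand-in for an API call, used only to show the function running offline
def fake_classifier(full_prompt):
    return " Positive\n" if "wonderful" in full_prompt else " Negative\n"


rows = classify_reviews(
    ["A wonderful film.", "Dull and far too long."],
    "Classify the sentiment of the following review as negative or positive",
    fake_classifier,
)
print(rows)
```

Passing rows to pd.DataFrame() gives the same two-column layout as the original functions.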

For `GPT-3.5-turbo`

This is the function I used to access gpt-3.5-turbo, classify_sentimentTurbo().

This is exactly the same as the last function, classify_sentimentDavinci(), except for the use of openai.ChatCompletion.create() and a slightly different way of extracting the text of the response from the output (as discussed earlier). The explanation for that function therefore applies here too.

import pandas as pd

def classify_sentimentTurbo(prompt, slice_size=20):
    # Initialize list to store predicted sentiments
    predicted_sentiments = []

    # Extract reviews based on the slice_size
    reviews = imdb['train']['review'][:slice_size]

    # Iterate over the sliced items
    for review in reviews:
        
        # Construct the full prompt
        full_prompt = f"{prompt}: {review}\nSentiment:"
        
        # Pass the full prompt to OpenAI API for sentiment analysis
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{'role': 'user', 'content': full_prompt}],
            max_tokens=2,
            temperature = 0
        )
        # Extract the sentiment label from the output (removing any white space, and ensuring lower case)
        predicted_sentiment = response['choices'][0]['message']['content'].strip().lower()
        
        # Add the predicted sentiment to the list
        predicted_sentiments.append(predicted_sentiment)

    # Create a DataFrame from the reviews and predicted sentiment labels
    df = pd.DataFrame({
        'Review': reviews,
        'Sentiment': predicted_sentiments
    })

    return df
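As the note at the top says, these API-calling functions no longer work with current versions of the openai package. As a rough sketch of what an equivalent single call looks like with the client interface introduced in version 1.0 of the package (one possible port, not the updated function from Part 3), the helper below takes an already-constructed client as an argument. The name classify_one_review is hypothetical.

```python
def classify_one_review(client, prompt, review, model="gpt-3.5-turbo"):
    """Return a lower-cased sentiment label for a single review.

    `client` is expected to be an openai.OpenAI() instance (openai >= 1.0).
    """
    full_prompt = f"{prompt}: {review}\nSentiment:"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": full_prompt}],
        max_tokens=2,
        temperature=0,
    )
    # Responses are objects in the newer package, so attribute access
    # replaces the dictionary keys used in the functions above
    return response.choices[0].message.content.strip().lower()

# Usage (requires the openai package and an API key in OPENAI_API_KEY):
# from openai import OpenAI
# label = classify_one_review(
#     OpenAI(),
#     "Classify the sentiment of the following review as negative or positive",
#     "A wonderful film.",
# )
```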

¹ The dataset was kindly provided by the researchers, Andrew L. Maas and colleagues, who published details of their method in this 2011 paper: https://ai.stanford.edu/~ang/papers/acl11-WordVectorsSentimentAnalysis.pdf
