LLMs can be divided into two categories: generative and predictive.

The generative capabilities of LLMs have been the subject of much attention and discussion, and rightly so: they are incredibly impressive and often require only zero-shot or few-shot learning.

The increasing popularity of Prompt Engineering has further highlighted the importance of generative tasks.

The image below shows the most common generative tasks from a Conversational AI Development Framework perspective, along with the predictive tasks.

Correctly predicting an intent with a Large Language Model (LLM) is critical, because the actions a chatbot takes are based on this result.

To achieve this, both generative and predictive LLMs can be fine-tuned to create a custom model. OpenAI's GPT-3 Ada is an example of an LLM that can be fine-tuned to classify text into one of two classes, as seen in the image below.

As fine-tuning becomes more commonplace, it is likely to become the norm wherever LLMs are adopted in more formal and enterprise settings.

We are ready to begin!

The code below accesses the training data through scikit-learn (sklearn). The final command displays the categories of data that have been archived from the original 20 newsgroups website.

from pprint import pprint
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train')
# List the 20 category names in the dataset
pprint(list(newsgroups_train.target_names))

Of the 20 categories available, we will make use of rec.autos and rec.motorcycles.


The code below fetches the two categories we are interested in and assigns the data to vehicles_dataset.

from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import openai

categories = ['rec.autos', 'rec.motorcycles']
vehicles_dataset = fetch_20newsgroups(subset='train', shuffle=True, random_state=42, categories=categories)

Below, a single record from the dataset is printed:


The result shows that the data is disorganised, and that each entry is likely to contain some ambiguity or inaccuracy.

From: stlucas@gdwest.gd.com (Joseph St. Lucas)
Subject: Re: Dumbest automotive concepts of all time
Organization: General Dynamics Corp.
Distribution: usa
Lines: 10

Don't have a list of what's been said before, so hopefully not repeating.

How about horizontally mounted oil filters (like on my Ford) that, no
matter how hard you try, will spill out their half quart on the bottom
of the car when you change them?

Joe St.Lucas    stlucas@gdwest.gd.com        Standard Disclaimers Apply
General Dynamics Space Systems, San Diego
Work is something to keep me busy between Ultimate Frisbee games.
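
The record above follows the layout of an email message: header lines, a blank line, then the body. As a minimal sketch (not part of the original walkthrough), Python's standard `email` module can separate the two. The `raw_post` string is an abbreviated copy of the record above.

```python
from email import message_from_string

# Abbreviated copy of the record shown above
raw_post = (
    "From: stlucas@gdwest.gd.com (Joseph St. Lucas)\n"
    "Subject: Re: Dumbest automotive concepts of all time\n"
    "Organization: General Dynamics Corp.\n"
    "Distribution: usa\n"
    "Lines: 10\n"
    "\n"
    "Don't have a list of what's been said before, so hopefully not repeating.\n"
)

message = message_from_string(raw_post)
subject = message["Subject"]   # a header field
body = message.get_payload()   # everything after the first blank line

print(subject)
print(body)
```

This is only for inspecting a record; the walkthrough itself passes the full raw text to the model as the prompt. scikit-learn's `fetch_20newsgroups` also accepts a `remove` argument for stripping headers, footers and quotes, which is not used here.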

We can now determine how many records and examples we have for autos and motorcycles.

len_all = len(vehicles_dataset.data)
len_autos = int((vehicles_dataset.target == 0).sum())        # label 0 is rec.autos
len_motorcycles = int((vehicles_dataset.target == 1).sum())  # label 1 is rec.motorcycles
print(f"Total examples: {len_all}, Autos examples: {len_autos}, Motorcycles examples: {len_motorcycles}")

The printed result:

Total examples: 1192, Autos examples: 594, Motorcycles examples: 598
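
With two classes this evenly matched, a classifier that always predicts the larger class would already be right about half the time. A small sketch of that baseline arithmetic, using the counts printed above (the variable names for the shares and the baseline are illustrative):

```python
# Counts taken from the printed result above
len_all = 1192
len_autos = 594
len_motorcycles = 598

# Share of each class in the dataset
share_autos = len_autos / len_all
share_motorcycles = len_motorcycles / len_all

# Accuracy of always guessing the larger class; a fine-tuned model should beat this
majority_baseline = max(len_autos, len_motorcycles) / len_all

print(f"autos: {share_autos:.1%}, motorcycles: {share_motorcycles:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```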

The next step is converting the data into the JSONL format defined by OpenAI. Below is an example of the format.

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
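
JSONL simply means one complete JSON object per line. As an illustration only, the sketch below writes and re-reads two such records with the standard `json` module. The prompt and completion strings are invented placeholders, not rows from the dataset.

```python
import json

# Invented placeholder records, in the prompt/completion shape shown above
records = [
    {"prompt": "My car will not start in the cold.", "completion": "autos"},
    {"prompt": "Which helmet fits under the seat?", "completion": "motorcycles"},
]

# Serialise: one JSON object per line, with no enclosing list
jsonl_text = "\n".join(json.dumps(record) for record in records)
print(jsonl_text)

# Parse: each non-empty line is decoded independently
parsed = [json.loads(line) for line in jsonl_text.splitlines() if line]
```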

The code to convert that data…

import pandas as pd

labels = [vehicles_dataset.target_names[x].split('.')[-1] for x in vehicles_dataset['target']]
texts = [text.strip() for text in vehicles_dataset['data']]
df = pd.DataFrame(zip(texts, labels), columns = ['prompt','completion']) #[:300]
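
To make the label step concrete: `target_names` holds the newsgroup names and `target` holds, for each document, an integer index into that list, so the completion becomes the last dot-separated part of the name. A sketch with hard-coded stand-ins for those two attributes (the `target` values here are made up):

```python
# Stand-ins for vehicles_dataset.target_names and vehicles_dataset.target
target_names = ["rec.autos", "rec.motorcycles"]
target = [1, 0, 0, 1]  # made-up indices for four documents

# Same expression as in the conversion code above
labels = [target_names[x].split(".")[-1] for x in target]
print(labels)
```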

Lastly, the data frame is converted to a JSONL file named vehicles.jsonl:

df.to_json("vehicles.jsonl", orient='records', lines=True)

Now the OpenAI utility can be used to analyse the JSONL file.

!openai tools fine_tunes.prepare_data -f vehicles.jsonl -q

With the result of the analysis displayed below…


- Your file contains 1192 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- There are 5 examples that are very long. These are rows: [38, 203, 910, 1057, 1130]
For conditional generation, and for classification the examples shouldn't be longer than 2048 tokens.
- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples. If you intend to do open-ended generation, then you should leave the prompts empty
- The completion should start with a whitespace character (` `). This tends to produce better results due to the tokenization we use. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details

Based on the analysis we will perform the following actions:
- [Recommended] Remove 5 long examples [Y/n]: Y
- [Recommended] Add a suffix separator `\n\n###\n\n` to all prompts [Y/n]: Y
- [Recommended] Add a whitespace character to the beginning of the completion [Y/n]: Y
- [Recommended] Would you like to split into training and validation set? [Y/n]: Y

Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified files to `vehicles_prepared_train.jsonl` and `vehicles_prepared_valid.jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "vehicles_prepared_train.jsonl" -v "vehicles_prepared_valid.jsonl" --compute_classification_metrics --classification_positive_class " motorcycles"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string `\n\n###\n\n` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=["s"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 30.82 minutes to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.
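
The two formatting changes accepted above (the prompt suffix and the leading space on the completion) can be approximated in a few lines. This is a rough sketch of the effect on each record, not the OpenAI tool's actual implementation. `prepare_record` is a hypothetical helper and the example record is invented.

```python
SEPARATOR = "\n\n###\n\n"  # suffix the tool proposed for all prompts


def prepare_record(record: dict) -> dict:
    """Return a copy with the prompt suffix and the leading completion space applied."""
    prompt = record["prompt"]
    completion = record["completion"]
    if not prompt.endswith(SEPARATOR):
        prompt = prompt + SEPARATOR
    if not completion.startswith(" "):
        completion = " " + completion
    return {"prompt": prompt, "completion": completion}


# Invented example record
example = {"prompt": "How often should I change the oil?", "completion": "autos"}
prepared = prepare_record(example)
print(repr(prepared["prompt"]))
print(repr(prepared["completion"]))
```

The same separator has to be appended to every prompt at query time, which is why it reappears in the calls to the fine-tuned model later on.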

Now we can start the training process. From this point on, an OpenAI API key is required.

The command to start the fine-tuning is a single line, with the foundation GPT-3 model defined at the end; in this case it is ada. I wanted to make use of davinci, but its cost is far higher than that of ada, which is one of the original base GPT-3 models.

!openai --api-key 'xxxxxxxxxxxxxxxxx' api fine_tunes.create -t "vehicles_prepared_train.jsonl" -v "vehicles_prepared_valid.jsonl" --compute_classification_metrics --classification_positive_class " autos" -m ada

The output from the training process:

Upload progress: 100% 1.35M/1.35M [00:00<00:00, 1.59Git/s]
Uploaded file from vehicles_prepared_train.jsonl: file-qN7D2kAh9h5Ui1XnZuDNrPgm
Upload progress: 100% 320k/320k [00:00<00:00, 623Mit/s]
Uploaded file from vehicles_prepared_valid.jsonl: file-ijOCUihdypRPrcTzodxN9Pa6
Created fine-tune: ft-xPIJ4BIM4giuXY4JOQ9rno2v
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-01-31 05:44:52] Created fine-tune: ft-xPIJ4BIM4giuXY4JOQ9rno2v
[2023-01-31 05:46:21] Fine-tune costs $0.65
[2023-01-31 05:46:22] Fine-tune enqueued. Queue number: 0
[2023-01-31 05:46:24] Fine-tune started
[2023-01-31 05:49:03] Completed epoch 1/4
[2023-01-31 05:51:36] Completed epoch 2/4
[2023-01-31 05:54:07] Completed epoch 3/4
[2023-01-31 05:56:38] Completed epoch 4/4
[2023-01-31 05:57:11] Uploaded model: ada:ft-personal-2023-01-31-05-57-11
[2023-01-31 05:57:12] Uploaded result file: file-EzPswfO3vXl3RXqbIGr4qebS
[2023-01-31 05:57:12] Fine-tune succeeded

Job complete! Status: succeeded 🎉
Try out your fine-tuned model:

openai api completions.create -m ada:ft-personal-2023-01-31-05-57-11 -p <YOUR_PROMPT>
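
For a sense of scale, the timestamps in the log above give the wall-clock time of the job. A small sketch of that arithmetic with the standard `datetime` module, using the "Fine-tune started" and "Fine-tune succeeded" timestamps:

```python
from datetime import datetime

LOG_FORMAT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the log above
started = datetime.strptime("2023-01-31 05:46:24", LOG_FORMAT)    # Fine-tune started
succeeded = datetime.strptime("2023-01-31 05:57:12", LOG_FORMAT)  # Fine-tune succeeded

elapsed = succeeded - started
minutes, seconds = divmod(int(elapsed.total_seconds()), 60)
print(f"Training took {minutes} min {seconds} s")
```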

And lastly, the model is queried with an arbitrary sentence: So how do I steer when my hands aren't on the bars?

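openai.api_key = "xxxxxxxxxxxxxxxxx"
ft_model = 'ada:ft-personal-2023-01-31-05-57-11'
sample_utterance = """So how do I steer when my hands aren't on the bars?"""
# The prompt must end with the same separator that was added during data preparation
res = openai.Completion.create(model=ft_model, prompt=sample_utterance + '\n\n###\n\n', max_tokens=1, temperature=0, logprobs=2)
# Print the generated completion (limited to one token by max_tokens=1)
print(res['choices'][0]['text'])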

The correct answer, motorcycles, is returned.
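
Because the call requests `logprobs=2`, the response also carries log-probabilities for the two most likely first tokens, which can be turned into a rough confidence score. The sketch below assumes the response layout of the legacy `openai.Completion` API used above and runs on a hand-written stand-in response. The numeric values are invented for illustration and are not output from the fine-tuned model.

```python
import math

# Hand-written stand-in for `res`; the log-probabilities are invented
res = {
    "choices": [
        {
            "text": " motorcycles",
            "logprobs": {
                "top_logprobs": [{" motorcycles": -0.02, " autos": -3.9}]
            },
        }
    ]
}

# Log-probabilities of the candidate tokens at the first (and only) position
first_token_logprobs = res["choices"][0]["logprobs"]["top_logprobs"][0]

# exp() converts a log-probability back into a probability
probabilities = {token.strip(): math.exp(lp) for token, lp in first_token_logprobs.items()}
predicted = max(probabilities, key=probabilities.get)
print(predicted, probabilities)
```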

Another example with the sentence: Is countersteering like benchracing only with a taller seat, so your feet aren't on the floor?

ft_model = 'ada:ft-personal-2023-01-31-05-57-11'
sample_utterance = """Is countersteering like benchracing only with a taller seat, so your feet aren't on the floor?"""
res = openai.Completion.create(model=ft_model, prompt=sample_utterance + '\n\n###\n\n', max_tokens=1, temperature=0, logprobs=2)
# Print the generated completion (limited to one token by max_tokens=1)
print(res['choices'][0]['text'])

And again the correct result, motorcycles, is returned.
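
Both queries repeat the same steps, so they can be wrapped in a small helper. This is a sketch only: `classify` is a hypothetical function, and its `complete` argument is assumed to accept the same keyword arguments as the `openai.Completion.create` call above. A fake completion function stands in for the API so that the example runs offline.

```python
SEPARATOR = "\n\n###\n\n"  # must match the suffix used in the training prompts


def classify(utterance, complete, model):
    """Send one utterance to a fine-tuned classifier and return the label text."""
    res = complete(
        model=model,
        prompt=utterance + SEPARATOR,
        max_tokens=1,
        temperature=0,
    )
    return res["choices"][0]["text"].strip()


def fake_complete(**kwargs):
    """Offline stand-in for the API: checks the prompt format, returns a canned answer."""
    assert kwargs["prompt"].endswith(SEPARATOR)
    return {"choices": [{"text": " motorcycles"}]}


label = classify(
    "So how do I steer when my hands aren't on the bars?",
    fake_complete,
    "ada:ft-personal-2023-01-31-05-57-11",
)
print(label)
```

With the legacy `openai` library used in this article, `openai.Completion.create` could be passed in place of `fake_complete`.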

As production implementations of LLMs become more widespread, more emphasis will be placed on fine-tuning them to maximise performance.

Nevertheless, the importance of fine-tuning LLMs is currently not being fully recognised.

I’m currently the Chief Evangelist @ HumanFirst. I explore and write about all things at the intersection of AI and language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces and more.