Dataset
Before we start, let us fetch the dataset on which we will finetune the LLM. We use the mteb/tweet_sentiment_extraction dataset for tweet sentiment classification.
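import pandas as pd
from datasets import load_dataset
# download the dataset and keep the train and test splits separately
dataset = load_dataset("mteb/tweet_sentiment_extraction")
train_dataset = dataset['train']
eval_dataset = dataset['test']
# view the first rows of the training split as a DataFrame
df = pd.DataFrame(dataset['train'])
df.head()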
The first few rows of the dataset look like this.
| | id | text | label | label_text |
|---|---|---|---|---|
| 0 | cb774db0d1 | I`d have responded, if I were going | 1 | neutral |
| 1 | 549e992a42 | Sooo SAD I will miss you here in San Diego!!! | 0 | negative |
| 2 | 088c60f138 | my boss is bullying me… | 0 | negative |
| 3 | 9642c003ef | what interview! leave me alone | 0 | negative |
| 4 | 358bd9e861 | Sons of ****, why couldn`t they put them on t… | 0 | negative |
| 5 | 28b57f3990 | http://www.dothebouncy.com/smf - some shameles… | 1 | neutral |
Each tweet in the dataset is labeled with one of three sentiment categories: positive, neutral, or negative.
Tokenizer
We will use GPT2Tokenizer. GPT-2 has no dedicated padding token, so we reuse the EOS token as the pad token.
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# set pad token
tokenizer.pad_token = tokenizer.eos_token
The tokenizer takes a list of texts as input and returns the token ids along with their corresponding attention masks. Passing padding="max_length" pads each sequence with pad_token up to the context size of the model (1024 tokens for GPT-2).
Example tokenization
text = ["This is a tutorial for finetuning LLMs", "In this tutorial, we will train GPT2 sequence classification model"]
tokenized_text = tokenizer(text, padding="max_length")
tokenized_text
{'input_ids': [[1212, 318, 257, 11808, 329, 957, 316, 46493, 27140, 10128, 50256,…], [818, 428, 11808, 11, 356, 481, 4512, 402, 11571, 17, 8379, 17923, 2746, 50256,…]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,…], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,…]]}
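Note that the attention mask is zero at the pad positions, so the model ignores the pad tokens. The tokenized output is then converted to a PyTorch tensor, tokens, which holds the input ids and their corresponding attention masks (index 0 along the second axis selects the input ids and index 1 the attention masks, as used in the example inference later).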
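To make the padding behaviour concrete, here is a minimal plain-Python sketch of what padding="max_length" produces. The helper pad_batch is hypothetical and is not part of transformers, and max_length=16 is chosen only to keep the output short; the real tokenizer pads GPT-2 inputs to 1024.

```python
# Illustrative only: pad_batch is a hypothetical helper, not a transformers API.
PAD_ID = 50256  # GPT-2 EOS token id, reused here as the pad token


def pad_batch(batch, max_length, pad_id=PAD_ID):
    """Right-pad each list of token ids to max_length and build attention masks."""
    input_ids, attention_mask = [], []
    for ids in batch:
        if len(ids) > max_length:
            raise ValueError("sequence longer than max_length; truncate it first")
        n_pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a pad position
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token ids of the two example sentences, copied from the tokenizer output above.
batch = [
    [1212, 318, 257, 11808, 329, 957, 316, 46493, 27140, 10128],
    [818, 428, 11808, 11, 356, 481, 4512, 402, 11571, 17, 8379, 17923, 2746],
]
padded = pad_batch(batch, max_length=16)  # 16 is an assumption for readability
print(padded["input_ids"][0])
print(padded["attention_mask"][0])
```

Both rows come out with the same length, which is what allows them to be stacked into a single tensor.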
The dataset consists of train and test splits, which we have stored separately. The dataset contains the following features.
Dataset({ features: ['id', 'text', 'label', 'label_text'], num_rows: 26732 })
We tokenize the dataset by mapping the tokenizer over it.
tokenized_train_dataset = train_dataset.map(
    lambda x: tokenizer(x['text'], padding='max_length', truncation=True))
tokenized_eval_dataset = eval_dataset.map(
    lambda x: tokenizer(x['text'], padding='max_length', truncation=True))
Output
Dataset({ features: ['id', 'text', 'label', 'label_text', 'input_ids', 'attention_mask'], num_rows: 26732 })
For training, we need only input_ids, attention_mask, and label, so we select these columns.
tokenized_train_dataset = tokenized_train_dataset.select_columns(['input_ids', 'attention_mask', 'label'])
tokenized_eval_dataset = tokenized_eval_dataset.select_columns(['input_ids', 'attention_mask', 'label'])
Model
We initialize the model GPT2ForSequenceClassification. Each input text is assigned one of three labels, so we set the number of labels to three.
from transformers import GPT2ForSequenceClassification
model = GPT2ForSequenceClassification.from_pretrained('./gpt2-small', num_labels=3)
model.config.pad_token_id = 50256
Output
GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=3, bias=False)
)
Since we are padding, i.e., each sequence ends with a run of pad tokens, we set the pad token id in the model config (50256 is the id of GPT-2's EOS token, which we use for padding).
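The reason the model needs the pad token id is that GPT2ForSequenceClassification classifies from the hidden state of the last non-padding token in each row. The sketch below illustrates that lookup in plain Python; last_token_index is a hypothetical helper written for this tutorial and is not the library's implementation.

```python
# Illustrative only: last_token_index is a hypothetical helper.
PAD_ID = 50256  # pad token id set in model.config.pad_token_id


def last_token_index(input_ids, pad_id=PAD_ID):
    """Return the index of the last non-pad token in a right-padded row."""
    for i in range(len(input_ids) - 1, -1, -1):
        if input_ids[i] != pad_id:
            return i
    raise ValueError("row contains only pad tokens")


# First example sentence (10 real tokens) followed by 6 pad tokens
row = [1212, 318, 257, 11808, 329, 957, 316, 46493, 27140, 10128] + [PAD_ID] * 6
print(last_token_index(row))  # 9
```

Without a pad token id the model cannot tell real tokens from padding, so it would read the classification from a pad position instead.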
Example inference
output = model(input_ids=tokens[:2,0,:], attention_mask=tokens[:2,1,:])
output
SequenceClassifierOutputWithPast(loss=None, logits=tensor([[ 3.6692, -3.7950, 0.9136], [ 5.8988, -8.1318, 2.2913]], grad_fn=<IndexBackward0>), past_key_values=DynamicCache(layers=[DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer]), hidden_states=None, attentions=None)
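The logits in this output are unnormalized scores, one per label. As a rough sketch, a softmax turns them into probabilities and the argmax picks the predicted label. The label order below follows the dataset (0 negative, 1 neutral, 2 positive), and the logits are copied from the output above. If the checkpoint is a base GPT-2 without a trained classification head, these predictions are not meaningful until the model is finetuned.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ['negative', 'neutral', 'positive']  # label ids 0, 1, 2
logits = [[3.6692, -3.7950, 0.9136], [5.8988, -8.1318, 2.2913]]

predicted = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    predicted.append(labels[best])
    print(labels[best], [round(p, 3) for p in probs])
```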
Training
We will use the transformers Trainer class to finetune the model. To report performance during evaluation, we define a metric function that computes the accuracy from the output logits of the model.
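import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # the predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=-1)
    # Trainer expects a dict mapping metric names to values
    return {'accuracy': accuracy_score(labels, predictions)}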
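To make the metric concrete, here is a dependency-free sketch of what the argmax followed by the accuracy score computes. The function names and the logits are made up for illustration.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up logits for four tweets, three classes each
logits = [[2.0, 0.1, -1.0], [0.2, 1.5, 0.3], [-0.5, 0.0, 3.0], [1.0, 0.9, 0.8]]
labels = [0, 1, 2, 1]
print(accuracy_from_logits(logits, labels))  # 0.75
```

The last tweet is predicted as class 0 while its label is 1, so three of the four predictions are correct.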
The arguments passed to the Trainer class are defined in TrainingArguments.
from transformers import Trainer, TrainingArguments

trainer_args = TrainingArguments(output_dir='test_trainer', # directory to store the trained model
                                 per_device_train_batch_size=3,
                                 per_device_eval_batch_size=3,
                                 gradient_accumulation_steps=4)
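With gradient accumulation, the optimizer steps once every few batches, so the effective batch size is larger than the per-device batch size. The arithmetic below is a rough sketch that assumes a single device and uses the 26732 training rows reported earlier; the exact number of updates the Trainer performs can differ slightly depending on how the last partial accumulation window is rounded.

```python
import math

per_device_train_batch_size = 3
gradient_accumulation_steps = 4
num_devices = 1          # assumption: a single GPU or CPU
num_train_rows = 26732   # size of the training split shown earlier

# Samples that contribute to each optimizer update
effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_devices)

# Approximate number of optimizer updates per pass over the training data
updates_per_epoch = math.ceil(num_train_rows / effective_batch_size)

print(effective_batch_size)  # 12
print(updates_per_epoch)     # 2228
```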
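The Trainer can log training runs to Weights & Biases (wandb). When wandb logging is enabled, an API key is required, which can be generated from your wandb account. Log in to wandb in your terminal and then proceed to the training. To train without wandb, pass report_to="none" to TrainingArguments.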
Setting up wandb
- Create an account on https://wandb.ai/
- Generate an API key.
- Install wandb.
- Login to wandb using wandb login.
- When prompted, enter the API key.
Next, we create the Trainer using the arguments defined above.
trainer = Trainer(model=model,
args=trainer_args,
train_dataset=tokenized_train_dataset,
eval_dataset=tokenized_eval_dataset,
compute_metrics=compute_metrics)
The model is trained simply by running trainer.train().
Inference
Let us start with a simple inference example to see the model’s prediction for a given text. We tokenize the text and pass it to the model to obtain its classification.
import torch as th

semantics = ['negative', 'neutral', 'positive']  # label ids 0, 1, 2
text = "is back from the park and is very sunburnt cant wait 4 2night,and is gonna get smashed..bein sober jst wnt b as fun!"
tokenized_text = tokenizer(text, padding="max_length")
# convert the Python lists to tensors of shape (1, sequence_length)
input_ids = th.tensor(tokenized_text['input_ids'], device=model.device).reshape(1, -1)
attention_mask = th.tensor(tokenized_text['attention_mask'], device=model.device).reshape(1, -1)
pred = model(input_ids=input_ids, attention_mask=attention_mask)
print(f'Text: {text}')
print(f'Semantic: {semantics[pred.logits.argmax(dim=-1).item()]}')
Output
Text: is back from the park and is very sunburnt cant wait 4 2night,and is gonna get smashed..bein sober jst wnt b as fun!
Semantic: positive