Wed Mar 18 2026

Dataset

Before we start, let us fetch the dataset we will use to fine-tune the LLM. We use a tweet sentiment classification dataset, loaded as shown below.

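import pandas as pd
from datasets import load_dataset

# Download the tweet sentiment dataset and keep the two splits separately
dataset = load_dataset("mteb/tweet_sentiment_extraction")
train_dataset = dataset['train']
eval_dataset = dataset['test']

# Convert the training split to a DataFrame for a quick preview
df = pd.DataFrame(dataset['train'])

df.head()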

The first few rows of the dataset look like this.

   id          text                                             label  label_text
0  cb774db0d1  I`d have responded, if I were going              1      neutral
1  549e992a42  Sooo SAD I will miss you here in San Diego!!!    0      negative
2  088c60f138  my boss is bullying me…                          0      negative
3  9642c003ef  what interview! leave me alone                   0      negative
4  358bd9e861  Sons of ****, why couldn`t they put them on t…   0      negative
5  28b57f3990  http://www.dothebouncy.com/smf - some shameles…  1      neutral

Each tweet in the dataset is labeled with one of three categories: positive, neutral, or negative.


Tokenizer

We will use GPT2Tokenizer. GPT-2 does not define a padding token by default, so we reuse the EOS token as the padding token.

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# set pad token
tokenizer.pad_token = tokenizer.eos_token

The tokenizer takes a list of texts as input and returns the token ids along with their corresponding attention masks. Passing padding="max_length" pads each sequence with pad_token up to the context size of the model.

Example tokenization
text = ["This is a tutorial for finetuning LLMs", "In this tutorial, we will train GPT2 sequence classification model"]

tokenized_text = tokenizer(text, padding="max_length")
tokenized_text

{'input_ids': [[1212, 318, 257, 11808, 329, 957, 316, 46493, 27140, 10128, 50256,…], [818, 428, 11808, 11, 356, 481, 4512, 402, 11571, 17, 8379, 17923, 2746, 50256,…]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,…], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,…]]}

Note that the attention mask is zero for pad tokens, so the model ignores them. For the example inference later on, the tokenized output is converted to a PyTorch tensor named tokens, which holds the token ids and their corresponding attention masks.
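To make the padding and masking concrete, here is a minimal hand-rolled sketch in plain Python. It is not the tokenizer's implementation: the helper name pad_batch and the short max_length of 16 are assumptions chosen for illustration, while the token ids are copied from the example output above.

```python
PAD_ID = 50256  # GPT-2's EOS token id, reused here as the pad token


def pad_batch(sequences, max_length, pad_id=PAD_ID):
    """Right-pad each token-id list to max_length and build attention masks."""
    input_ids, attention_mask = [], []
    for seq in sequences:
        if len(seq) > max_length:
            raise ValueError("sequence longer than max_length; truncate first")
        n_pad = max_length - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch(
    [
        [1212, 318, 257, 11808, 329, 957, 316, 46493, 27140, 10128],
        [818, 428, 11808, 11, 356, 481, 4512, 402, 11571, 17, 8379, 17923, 2746],
    ],
    max_length=16,  # hypothetical short length; the real tokenizer pads much further
)
print(batch["attention_mask"][0])
```

Every row ends up with the same length, and the number of ones in each mask equals the number of real tokens in that row.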


The dataset consists of train and test splits, which we have stored separately. The dataset contains the following features.

Dataset({ features: ['id', 'text', 'label', 'label_text'], num_rows: 26732 })

We tokenize the dataset by mapping the tokenizer over every example. Setting truncation=True cuts off texts that exceed the maximum length.

tokenized_train_dataset = train_dataset.map(lambda x: tokenizer(x['text'],
                                            padding='max_length',
                                            truncation=True))

tokenized_eval_dataset = eval_dataset.map(lambda x: tokenizer(x['text'],
                                            padding='max_length',
                                            truncation=True))
Output

Dataset({ features: ['id', 'text', 'label', 'label_text', 'input_ids', 'attention_mask'], num_rows: 26732 })

For training, we need only input_ids, attention_mask, and label, so we select these features.

tokenized_train_dataset = tokenized_train_dataset.select_columns(['input_ids', 'attention_mask', 'label'])
tokenized_eval_dataset = tokenized_eval_dataset.select_columns(['input_ids', 'attention_mask', 'label'])

Model

We initialize the model GPT2ForSequenceClassification. Each input text is assigned one of three labels, so we set the number of labels to three.

from transformers import GPT2ForSequenceClassification

model = GPT2ForSequenceClassification.from_pretrained('./gpt2-small', num_labels=3)
model.config.pad_token_id = 50256
Output

GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=3, bias=False)
)

Since we pad every sequence, each one ends with a trailing run of pad tokens. We therefore set the pad token id in the model config, so that the model can locate the last real token of each sequence, which it uses for classification.
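GPT2ForSequenceClassification classifies from the hidden state of the last non-padding token in each row, which is why it needs to know pad_token_id. The plain-Python sketch below illustrates that lookup under the assumption of right padding. The helper last_token_index is hypothetical and is not the library's code, which performs an equivalent operation on tensors.

```python
def last_token_index(input_ids, pad_token_id):
    """Return the position of the last non-pad token in a right-padded row."""
    for position in range(len(input_ids) - 1, -1, -1):
        if input_ids[position] != pad_token_id:
            return position
    raise ValueError("row contains only pad tokens")


# Illustrative row: four real tokens followed by three pad tokens
row = [1212, 318, 257, 11808, 50256, 50256, 50256]
print(last_token_index(row, pad_token_id=50256))  # 3
```

One consequence of reusing EOS as the pad token is that a genuine EOS at the end of a text cannot be told apart from padding, which is harmless here because the tweets are tokenized without one.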

Example inference
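# tokens is indexed as (example, field, position): field 0 holds the input ids, field 1 the attention masks
model(input_ids=tokens[:2,0,:], attention_mask=tokens[:2,1,:])
Output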

SequenceClassifierOutputWithPast(loss=None, logits=tensor([[ 3.6692, -3.7950, 0.9136], [ 5.8988, -8.1318, 2.2913]], grad_fn=<IndexBackward0>), past_key_values=DynamicCache(layers=[DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer, DynamicLayer]), hidden_states=None, attentions=None)
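
The logits are unnormalized scores, one per class. As a small worked example, the sketch below applies a softmax to the first row of logits above and maps the largest score to a label name. The label order (0 negative, 1 neutral, 2 positive) is inferred from the dataset preview. Assuming ./gpt2-small is a base GPT-2 checkpoint, the classification head is newly initialized, so predictions made before fine-tuning are not meaningful.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["negative", "neutral", "positive"]  # assumed order of class ids 0, 1, 2
logits = [3.6692, -3.7950, 0.9136]  # first row of the example output

probs = softmax(logits)
predicted = labels[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```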


Training

We will use the transformers Trainer class to fine-tune the model. To report performance during evaluation, we define a metric that computes the accuracy from the output logits of the model.

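import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=-1)
    # Trainer expects a dictionary mapping metric names to values
    return {'accuracy': accuracy_score(labels, predictions)}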
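
As a worked example of the metric, the sketch below repeats the argmax-then-compare logic in plain Python. The logits and labels are made up for illustration and are not taken from the dataset, and accuracy_from_logits is a hypothetical helper name.

```python
def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare with the labels."""
    predictions = [row.index(max(row)) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Illustrative values only: four examples, three classes
logits = [
    [2.0, 0.1, -1.0],  # predicts class 0
    [0.2, 1.5, 0.3],   # predicts class 1
    [-0.5, 0.0, 2.2],  # predicts class 2
    [1.1, 0.9, 0.4],   # predicts class 0
]
labels = [0, 1, 2, 1]

print(accuracy_from_logits(logits, labels))  # 0.75
```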

The arguments passed to the Trainer class are defined in TrainingArguments.

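from transformers import Trainer, TrainingArguments

trainer_args = TrainingArguments(output_dir='test_trainer', # directory to store the trained model
                                  per_device_train_batch_size=3,
                                  per_device_eval_batch_size=3,
                                  gradient_accumulation_steps=4) # accumulate gradients over 4 batches per update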
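
With gradient accumulation, the optimizer updates the weights once every gradient_accumulation_steps mini-batches, so the effective batch size is the product of the two settings. A quick arithmetic check, assuming a single device and the 26,732 training rows reported earlier:

```python
import math

per_device_train_batch_size = 3
gradient_accumulation_steps = 4
num_devices = 1         # assumption: a single GPU or CPU
num_train_rows = 26732  # size of the training split reported earlier

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
# Approximate count; the exact value depends on how the last partial batch is handled
updates_per_epoch = math.ceil(num_train_rows / effective_batch_size)

print(effective_batch_size)  # 12
print(updates_per_epoch)     # 2228
```

A small per-device batch keeps memory use low, while accumulation recovers some of the stability of a larger batch.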

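Depending on the transformers version and the packages installed, Trainer may report training logs to Weights & Biases (wandb), in which case an API key is required. The API key can be generated from your wandb account. Log in to wandb in your terminal before starting the training, or pass report_to="none" in TrainingArguments to skip this logging.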

Setting up wandb
  1. Create an account on https://wandb.ai/
  2. Generate an API key.
  3. Install wandb.
  4. Log in to wandb by running wandb login.
  5. When prompted, enter the API key.

Next, we create the Trainer from the model, the datasets, the metric, and the arguments defined above.

trainer = Trainer(model=model,
                  args=trainer_args,
                  train_dataset=tokenized_train_dataset,
                  eval_dataset=tokenized_eval_dataset,
                  compute_metrics=compute_metrics)

The model is trained simply by running trainer.train().


Inference

Let us start with a simple inference example to see the model's prediction for a given text. We tokenize the text and pass it to the model to obtain its classification.

import torch as th

# Label names indexed by class id (0: negative, 1: neutral, 2: positive)
semantics = ['negative', 'neutral', 'positive']

text = "is back from the park and is very sunburnt  cant wait 4  2night,and is gonna get smashed..bein sober jst wnt b as fun!"

tokenized_text = tokenizer(text, padding="max_length")

# The tokenizer returns Python lists, so build the tensors first and then add a batch dimension
input_ids = th.tensor(tokenized_text['input_ids'], device=model.device).reshape(1,-1)
attention_mask = th.tensor(tokenized_text['attention_mask'], device=model.device).reshape(1,-1)

pred = model(input_ids=input_ids, attention_mask=attention_mask)

print(f'Text: {text}')
print(f'Semantic: {semantics[pred.logits.argmax(dim=-1).item()]}')
Output

Text: is back from the park and is very sunburnt cant wait 4 2night,and is gonna get smashed..bein sober jst wnt b as fun!
Semantic: positive