Fine-tune Pre-trained Models using Hugging Face and Python

In this tutorial, we will explore how to fine-tune pre-trained models from the Hugging Face library using Python. Fine-tuning a pre-trained model on task-specific data is one of the most effective ways to improve performance in machine learning projects, especially for Natural Language Processing (NLP) tasks.

Table of Contents

  1. Introduction to Hugging Face
  2. Pre-requisites
  3. Loading a Pre-trained Model
  4. Preparing the Dataset
  5. Fine-tuning the Model
  6. Evaluating the Model
  7. Using the Fine-tuned Model
  8. Conclusion

Introduction to Hugging Face

Hugging Face maintains open-source libraries and a hub of pre-trained models and datasets for NLP. It offers state-of-the-art models like BERT, GPT-2, RoBERTa, and many others, which can be fine-tuned for specific tasks such as sentiment analysis, text classification, and more.

Pre-requisites

Before we start, make sure you have Python 3.x installed on your machine. Then install the following packages:

pip install transformers
pip install datasets

transformers is the Hugging Face library that provides the pre-trained models and training utilities, and datasets gives convenient access to ready-made datasets such as IMDB.
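
To confirm that the installation worked, you can print the installed versions (a quick sanity check; the exact version numbers will vary on your machine):

import transformers
import datasets

# Print the installed library versions to verify the setup.
print(transformers.__version__)
print(datasets.__version__)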

Loading a Pre-trained Model

First, let's load a pre-trained model and its tokenizer. In this example, we will use the distilbert-base-uncased model.

from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
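
from_pretrained downloads the DistilBERT encoder weights and attaches a newly initialized sequence-classification head, which is why the library warns that some weights are not pre-trained; that head is exactly what fine-tuning will learn. Optionally, you can make the label mapping explicit when loading; num_labels, id2label, and label2id are standard from_pretrained arguments, and the "NEGATIVE"/"POSITIVE" names below are just illustrative choices for IMDB:

# Optional: spell out the two sentiment classes so downstream code can read
# human-friendly names from model.config instead of LABEL_0 / LABEL_1.
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1},
)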

Preparing the Dataset

Next, let's load a dataset to fine-tune the model. We will use the imdb dataset, which is available through the Hugging Face datasets library and contains 25,000 training and 25,000 test movie reviews labeled as positive or negative.

from datasets import load_dataset

raw_datasets = load_dataset("imdb")
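
Before tokenizing, it can help to take a quick look at what load_dataset returned; the snippet below just prints the available splits and one raw example:

# The DatasetDict contains "train", "test", and "unsupervised" splits.
print(raw_datasets)

# One raw example: the review text and its integer label (0 = negative, 1 = positive).
print(raw_datasets["train"][0])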

Now, we need to tokenize the dataset using the tokenizer we loaded earlier.

def tokenize_function(examples):
    # Pad/truncate every review to the model's maximum sequence length.
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
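
Fine-tuning on all 25,000 training reviews with a batch size of 4 can take a while. If you just want to work through the tutorial quickly, you can optionally train on shuffled subsets instead; the sizes below are arbitrary choices:

# Optional: smaller, shuffled subsets for a faster end-to-end run.
small_train = tokenized_datasets["train"].shuffle(seed=42).select(range(2000))
small_eval = tokenized_datasets["test"].shuffle(seed=42).select(range(500))

If you use these subsets, pass small_train and small_eval as the train_dataset and eval_dataset in the Trainer below.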

Fine-tuning the Model

To fine-tune the model, we create a TrainingArguments configuration and a Trainer object.

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    "test_trainer",               # output directory for checkpoints and logs
    evaluation_strategy="epoch",  # run evaluation at the end of each epoch
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)

Now, let's start the fine-tuning process.

trainer.train()
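
Training on the full 25,000-review split for three epochs can take a long time on CPU, so a GPU is strongly recommended. Once training finishes, it is worth saving the fine-tuned weights and tokenizer so they can be reloaded later; the directory name below is just an example:

# Save the fine-tuned model and tokenizer to a local directory (name is arbitrary).
trainer.save_model("distilbert-imdb-finetuned")
tokenizer.save_pretrained("distilbert-imdb-finetuned")

The saved directory can later be reloaded with DistilBertForSequenceClassification.from_pretrained("distilbert-imdb-finetuned").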

Evaluating the Model

After fine-tuning, let's evaluate the model's performance on the test dataset.

trainer.evaluate()
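
By default, evaluate() reports the evaluation loss along with runtime statistics. If you also want a metric such as accuracy, you can pass a compute_metrics function to the Trainer; a minimal sketch, assuming plain NumPy is enough for the computation, looks like this:

import numpy as np

def compute_metrics(eval_pred):
    # eval_pred bundles the model's raw logits and the true labels for the eval set.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

Pass compute_metrics=compute_metrics when constructing the Trainer above; trainer.evaluate() will then report eval_accuracy alongside the loss, and the metric is also logged at the end of every epoch during training.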

Using the Fine-tuned Model

Now that we have fine-tuned our model, let's use it to make predictions.

import torch

text = "This movie was amazing!"

# Put the model in evaluation mode and disable gradient tracking for inference.
model.eval()
inputs = tokenizer(text, return_tensors="pt")
# Move the inputs to the same device as the model (CPU or GPU).
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    logits = model(**inputs).logits

predicted_label = torch.argmax(logits, dim=-1).item()
print(predicted_label)

This prints the predicted class index; for the IMDB label scheme, 0 corresponds to a negative review and 1 to a positive one.
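
If you set id2label when loading the model (as in the optional loading snippet earlier), you can map the index back to a readable label name:

# model.config.id2label maps class indices to their label names.
print(model.config.id2label[predicted_label])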

Conclusion

In this tutorial, we learned how to fine-tune pre-trained models using the Hugging Face library and Python. Fine-tuning is an essential step toward better performance on NLP tasks, and Hugging Face makes the process simple and efficient. Happy fine-tuning!
