Fine-tuning a sequence classification model with LoRA and the peft library¶
I was looking for a notebook showing how to use LoRA with the peft library. I also wanted to use the Trainer class from the transformers library, as I hadn't really used it before.
None of the sequence classification examples use the Trainer class. The example in the documentation does use the Trainer class, but doesn't use LoRA. All of the examples also use the MRPC dataset (from GLUE).
This notebook fine-tunes a RoBERTa-large model on the rotten_tomatoes dataset using the peft library. Fine-tuning is done with LoRA and the Trainer class, and the hyperparameters were chosen to match this example as closely as possible.
In [1]:
import transformers
import peft
import datasets
import evaluate
import numpy as np
import torch
from tqdm.auto import tqdm
# transformers.logging.set_verbosity_info()
In [2]:
batch_size = 32
model_name_or_path = "roberta-large"
num_epochs = 5
lr = 3e-4
device = torch.device("cuda")
Setting up the dataset¶
In [3]:
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name_or_path)
In [4]:
dataset = datasets.load_dataset("rotten_tomatoes")
In [5]:
def tokenize_function(examples):
    outputs = tokenizer(examples["text"], truncation=True)
    return outputs


dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"],
)
dataset = dataset.rename_column(
    "label", "labels"
)  # label field needs to be called "labels"
In [6]:
dataset
Out[6]:
DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 1066
    })
})
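As a quick sanity check (not part of the original run), the first tokenized example can be decoded back to text to confirm the mapping looks sensible:

example = dataset["train"][0]
print(tokenizer.decode(example["input_ids"]))  # the original review text, wrapped in <s> ... </s>
print(example["labels"])  # integer label (0 = negative, 1 = positive, per the dataset card)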
Setting up the model¶
In [7]:
peft_config = peft.LoraConfig(
    task_type="SEQ_CLS",
    lora_alpha=16,
    lora_dropout=0.1,
)
In [8]:
peft_config
Out[8]:
LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, task_type='SEQ_CLS', inference_mode=False, r=8, target_modules=None, lora_alpha=16, lora_dropout=0.1, fan_in_fan_out=False, bias='none', modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None)
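With these settings, LoRA adds a low-rank update W + (lora_alpha / r) · B·A to each targeted Linear layer; for RoBERTa, peft's default target modules are the attention query and value projections, which is visible in the model printout further down. A rough, hedged back-of-the-envelope for where the trainable parameters come from:

# Hedged accounting of the LoRA parameters for roberta-large (hidden size 1024, 24 layers)
hidden, r, n_layers = 1024, 8, 24
lora_params = n_layers * 2 * (hidden * r + r * hidden)  # A and B matrices for query + value
print(lora_params)  # 786,432; the rest of the ~2.9M trainable parameters belongs to the classification head, kept trainable via modules_to_save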
In [9]:
model = transformers.AutoModelForSequenceClassification.from_pretrained(
    model_name_or_path, return_dict=True
)
model = peft.get_peft_model(model, peft_config)
model.print_trainable_parameters()
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
trainable params: 2,889,732 || all params: 357,199,876 || trainable%: 0.8089958015550934
In [10]:
model
Out[10]:
PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): RobertaForSequenceClassification(
      (roberta): RobertaModel(
        (embeddings): RobertaEmbeddings(
          (word_embeddings): Embedding(50265, 1024, padding_idx=1)
          (position_embeddings): Embedding(514, 1024, padding_idx=1)
          (token_type_embeddings): Embedding(1, 1024)
          (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (encoder): RobertaEncoder(
          (layer): ModuleList(
            (0-23): 24 x RobertaLayer(
              (attention): RobertaAttention(
                (self): RobertaSelfAttention(
                  (query): Linear(
                    in_features=1024, out_features=1024, bias=True
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.1, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=1024, out_features=8, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=8, out_features=1024, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
                    (lora_embedding_B): ParameterDict()
                  )
                  (key): Linear(in_features=1024, out_features=1024, bias=True)
                  (value): Linear(
                    in_features=1024, out_features=1024, bias=True
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.1, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=1024, out_features=8, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=8, out_features=1024, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
                    (lora_embedding_B): ParameterDict()
                  )
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (output): RobertaSelfOutput(
                  (dense): Linear(in_features=1024, out_features=1024, bias=True)
                  (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
              (intermediate): RobertaIntermediate(
                (dense): Linear(in_features=1024, out_features=4096, bias=True)
                (intermediate_act_fn): GELUActivation()
              )
              (output): RobertaOutput(
                (dense): Linear(in_features=4096, out_features=1024, bias=True)
                (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
          )
        )
      )
      (classifier): ModulesToSaveWrapper(
        (original_module): RobertaClassificationHead(
          (dense): Linear(in_features=1024, out_features=1024, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (out_proj): Linear(in_features=1024, out_features=2, bias=True)
        )
        (modules_to_save): ModuleDict(
          (default): RobertaClassificationHead(
            (dense): Linear(in_features=1024, out_features=1024, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (out_proj): Linear(in_features=1024, out_features=2, bias=True)
          )
        )
      )
    )
  )
)
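The printout shows the LoRA matrices injected into every query and value projection, while the classification head is wrapped in ModulesToSaveWrapper so it stays fully trainable. A quick (hedged) way to confirm which tensors will actually receive gradients:

# Hedged check: the trainable tensors should be the lora_A/lora_B matrices plus the classification head
trainable = [name for name, param in model.named_parameters() if param.requires_grad]
print(len(trainable))
print(trainable[:4])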
Setting up the metrics, collator, and training arguments¶
In [11]:
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
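At evaluation time the Trainer passes a (predictions, labels) pair into compute_metrics, where the predictions are raw logits. A small hedged sanity check with dummy logits:

# Hedged sanity check of compute_metrics with dummy logits (not part of the original notebook)
dummy_logits = np.array([[0.1, 0.9], [0.8, 0.2]])
dummy_labels = np.array([1, 0])
print(compute_metrics((dummy_logits, dummy_labels)))  # expect {'accuracy': 1.0}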
In [12]:
data_collator = transformers.DataCollatorWithPadding(tokenizer=tokenizer)
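DataCollatorWithPadding pads each batch on the fly to the longest sequence in that batch, which is why truncation (but no padding) was applied during tokenization. A hedged illustration:

# Hedged illustration of dynamic padding: a batch is padded to its longest member
features = [dataset["train"][i] for i in range(4)]
batch = data_collator(features)
print(batch["input_ids"].shape)  # (4, length of the longest of these four reviews)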
In [13]:
training_args = transformers.TrainingArguments(
    output_dir="output",
    learning_rate=lr,
    num_train_epochs=num_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    load_best_model_at_end=True,
    warmup_ratio=0.06,
)
training_args
Out[13]:
TrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=False, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=None, disable_tqdm=False, do_eval=True, do_predict=False, do_train=False, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=epoch, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=1, gradient_checkpointing=False, greater_is_better=False, group_by_length=False, half_precision_backend=auto, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=<HUB_TOKEN>, ignore_data_skip=False, include_inputs_for_metrics=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=0.0003, length_column_name=length, load_best_model_at_end=True, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=output/runs/Aug03_20-59-33_benpc, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=500, logging_strategy=epoch, lr_scheduler_type=linear, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=loss, mp_parameters=, no_cuda=False, num_train_epochs=5, optim=adamw_hf, optim_args=None, output_dir=output, overwrite_output_dir=False, past_index=-1, per_device_eval_batch_size=32, per_device_train_batch_size=32, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=<PUSH_TO_HUB_TOKEN>, ray_scope=last, remove_unused_columns=True, report_to=[], resume_from_checkpoint=None, run_name=output, save_on_each_node=False, save_safetensors=False, save_steps=500, save_strategy=epoch, save_total_limit=None, seed=42, sharded_ddp=[], skip_memory_metrics=True, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.06, warmup_steps=0, weight_decay=0.0, xpu_backend=None, )
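As a hedged aside, the optimization schedule implied by these arguments can be worked out by hand; the total matches the 1335 steps reported by the training run below:

import math

steps_per_epoch = math.ceil(8530 / batch_size)  # 267 batches over the 8,530 training examples
total_steps = steps_per_epoch * num_epochs      # 1,335 optimization steps in total
warmup_steps = total_steps * 0.06               # ~80 steps of linear warmup from warmup_ratio=0.06
print(steps_per_epoch, total_steps, warmup_steps)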
Training the model¶
In [14]:
trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
/home/ben/miniconda3/envs/main/lib/python3.9/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[1335/1335 04:22, Epoch 5/5]
| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.463200 | 0.212241 | 0.923077 |
| 2 | 0.249600 | 0.220944 | 0.917448 |
| 3 | 0.212700 | 0.227929 | 0.917448 |
| 4 | 0.190600 | 0.214612 | 0.923077 |
| 5 | 0.174200 | 0.220378 | 0.925891 |
Out[14]:
TrainOutput(global_step=1335, training_loss=0.25806437574522323, metrics={'train_runtime': 263.0868, 'train_samples_per_second': 162.114, 'train_steps_per_second': 5.074, 'total_flos': 4087623784953072.0, 'train_loss': 0.25806437574522323, 'epoch': 5.0})
Performing inference (both "by hand" and using pipelines)¶
In [15]:
model_input = tokenizer.pad(
    tokenizer(
        ["This film is great.", "This movie sucked!"],
    ),
    return_tensors="pt",
).to(device)
model_input
Out[15]:
{'input_ids': tensor([[ 0, 713, 822, 16, 372, 4, 2], [ 0, 713, 1569, 28635, 328, 2, 1]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0]], device='cuda:0')}
In [16]:
outputs = model(**model_input)
predictions = outputs.logits.argmax(dim=-1)
predictions
Out[16]:
tensor([1, 0], device='cuda:0')
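Softmaxing the logits gives per-class probabilities, which should roughly line up with the scores the pipeline reports below (a hedged check, not part of the original run):

# Hedged check: convert logits to per-class probabilities
probs = outputs.logits.softmax(dim=-1)
print(probs)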
In [17]:
pipe = transformers.pipeline(
    "text-classification", model=model, tokenizer=tokenizer, device=device
)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using `tokenizers` before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
The model 'PeftModelForSequenceClassification' is not supported for text-classification. Supported models are ['AlbertForSequenceClassification', 'BartForSequenceClassification', 'BertForSequenceClassification', 'BigBirdForSequenceClassification', 'BigBirdPegasusForSequenceClassification', 'BioGptForSequenceClassification', 'BloomForSequenceClassification', 'CamembertForSequenceClassification', 'CanineForSequenceClassification', 'ConvBertForSequenceClassification', 'CTRLForSequenceClassification', 'Data2VecTextForSequenceClassification', 'DebertaForSequenceClassification', 'DebertaV2ForSequenceClassification', 'DistilBertForSequenceClassification', 'ElectraForSequenceClassification', 'ErnieForSequenceClassification', 'ErnieMForSequenceClassification', 'EsmForSequenceClassification', 'FalconForSequenceClassification', 'FlaubertForSequenceClassification', 'FNetForSequenceClassification', 'FunnelForSequenceClassification', 'GPT2ForSequenceClassification', 'GPT2ForSequenceClassification', 'GPTBigCodeForSequenceClassification', 'GPTNeoForSequenceClassification', 'GPTNeoXForSequenceClassification', 'GPTJForSequenceClassification', 'IBertForSequenceClassification', 'LayoutLMForSequenceClassification', 'LayoutLMv2ForSequenceClassification', 'LayoutLMv3ForSequenceClassification', 'LEDForSequenceClassification', 'LiltForSequenceClassification', 'LlamaForSequenceClassification', 'LongformerForSequenceClassification', 'LukeForSequenceClassification', 'MarkupLMForSequenceClassification', 'MBartForSequenceClassification', 'MegaForSequenceClassification', 'MegatronBertForSequenceClassification', 'MobileBertForSequenceClassification', 'MPNetForSequenceClassification', 'MraForSequenceClassification', 'MvpForSequenceClassification', 'NezhaForSequenceClassification', 'NystromformerForSequenceClassification', 'OpenLlamaForSequenceClassification', 'OpenAIGPTForSequenceClassification', 'OPTForSequenceClassification', 'PerceiverForSequenceClassification', 'PLBartForSequenceClassification', 'QDQBertForSequenceClassification', 'ReformerForSequenceClassification', 'RemBertForSequenceClassification', 'RobertaForSequenceClassification', 'RobertaPreLayerNormForSequenceClassification', 'RoCBertForSequenceClassification', 'RoFormerForSequenceClassification', 'SqueezeBertForSequenceClassification', 'TapasForSequenceClassification', 'TransfoXLForSequenceClassification', 'XLMForSequenceClassification', 'XLMRobertaForSequenceClassification', 'XLMRobertaXLForSequenceClassification', 'XLNetForSequenceClassification', 'XmodForSequenceClassification', 'YosoForSequenceClassification'].
In [18]:
pipe(["This film is great.", "This movie sucked!"])
Out[18]:
[{'label': 'LABEL_1', 'score': 0.9857044219970703}, {'label': 'LABEL_0', 'score': 0.9899106621742249}]
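The pipeline falls back to the generic LABEL_0/LABEL_1 names because the model config was never given human-readable labels. If nicer output is wanted, the mapping can be set on the config (assumption: in rotten_tomatoes, 0 is negative and 1 is positive, per the dataset card):

# Optional, hedged: attach readable label names so the pipeline reports "negative"/"positive"
model.config.id2label = {0: "negative", 1: "positive"}
model.config.label2id = {"negative": 0, "positive": 1}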
Getting test set metrics¶
In [19]:
eval_dataloader = torch.utils.data.DataLoader(
    dataset["test"], collate_fn=data_collator, batch_size=batch_size
)
In [20]:
model.eval()
for batch in tqdm(eval_dataloader):
    batch = batch.to(device)
    with torch.no_grad():
        outputs = model(**batch)
    predictions = outputs.logits.argmax(dim=-1)
    references = batch["labels"]
    metric.add_batch(
        predictions=predictions,
        references=references,
    )
In [21]:
metric.compute()
Out[21]:
{'accuracy': 0.9052532833020638}
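For comparison, the same test-set accuracy could also be obtained through the Trainer rather than a manual loop (a hedged alternative, not run here):

# Hedged alternative: let the Trainer handle batching, device placement, and metrics
test_metrics = trainer.evaluate(eval_dataset=dataset["test"])
print(test_metrics["eval_accuracy"])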