Fine-tunes the LLM model with the provided list of texts.
This method tokenizes the input texts, prepares them into a dataset, and trains the model using
the Hugging Face Trainer API. The fine-tuned model is saved to the specified output directory.
Parameters:

- texts (List[str]) – A list of strings to be used for training the model.
- output_dir (str, default: './fine_tuned_model') – The directory where the fine-tuned model will be saved.

Returns:

- LLM – The LLM instance updated to use the fine-tuned model.

Raises:

- ValueError – If the model category of the LLM is not "huggingface".
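Example:

A minimal usage sketch. The FineTuner constructor shown here, the import path for the LLM class, and the LLM fields are assumptions inferred from this excerpt rather than a verified API.

```python
# Hypothetical usage sketch; class names and import paths are assumptions.
from llamarch.common.fine_tuner import FineTuner
from llamarch.common.llm import LLM  # assumed location of the LLM class

# Assumes LLM exposes model_category and model_name as used inside fine_tune().
llm = LLM(model_category="huggingface", model_name="gpt2")
tuner = FineTuner(llm)  # assumed constructor signature

texts = [
    "First training example.",
    "Second training example.",
]

# Trains with the defaults hard-coded in fine_tune() and writes the model
# and tokenizer to ./my_fine_tuned_model, then returns the updated LLM.
updated_llm = tuner.fine_tune(texts, output_dir="./my_fine_tuned_model")
print(updated_llm.model_name)  # "./my_fine_tuned_model"
```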
Source code in llamarch/common/fine_tuner.py
def fine_tune(self, texts: List[str], output_dir: str = "./fine_tuned_model"):
"""
Fine-tunes the LLM model with the provided list of texts.
This method tokenizes the input texts, prepares them into a dataset, and trains the model using
the Hugging Face Trainer API. The fine-tuned model is saved to the specified output directory.
Parameters
----------
texts : List[str]
A list of strings to be used for training the model.
output_dir : str, optional
The directory where the fine-tuned model will be saved (default is "./fine_tuned_model").
Returns
-------
LLM
The LLM instance updated to use the fine-tuned model.
Raises
------
ValueError
If the model category of the LLM is not "huggingface".
"""
if self.llm.model_category != "huggingface":
raise ValueError(
"Fine-tuning is only supported for Hugging Face models in this implementation.")
# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(self.llm.model_name)
tokenizer = AutoTokenizer.from_pretrained(self.llm.model_name)
# Set pad_token to eos_token
tokenizer.pad_token = tokenizer.eos_token
# Tokenize data
encodings = tokenizer(
texts,
truncation=True,
padding="max_length",
max_length=512, # Adjust max_length as needed
return_tensors="pt" # Return PyTorch tensors
)
# Create dataset from encodings
train_dataset = CustomDataset(encodings)
# Set training arguments
training_args = TrainingArguments(
output_dir=output_dir,
eval_strategy="no",
learning_rate=2e-5,
weight_decay=0.01,
num_train_epochs=3,
per_device_train_batch_size=8,
save_total_limit=2, # Save only the last 2 models
logging_dir='./logs', # Directory for storing logs
logging_steps=10, # Log every 10 steps
)
# Initialize Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset
)
# Fine-tune the model
trainer.train()
# Save the fine-tuned model
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
# Update the LLM instance to use the fine-tuned model
self.llm.model_name = output_dir
self.llm.model = self.llm._initialize_llm()
return self.llm
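Note that CustomDataset is referenced above but defined elsewhere in llamarch/common/fine_tuner.py. As a rough sketch of what such a wrapper around tokenizer encodings typically looks like (an assumption, not the library's actual implementation), a minimal PyTorch Dataset could be:

```python
# Illustrative sketch only; the real CustomDataset may differ.
from torch.utils.data import Dataset


class EncodingsDataset(Dataset):  # hypothetical stand-in for CustomDataset
    """Wraps batched tokenizer encodings so the Trainer can index single examples."""

    def __init__(self, encodings):
        self.encodings = encodings  # dict of tensors, e.g. input_ids, attention_mask

    def __len__(self):
        return self.encodings["input_ids"].shape[0]

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        # For causal-LM fine-tuning, labels are usually a copy of input_ids;
        # without labels (or a collator that adds them), the Trainer has no loss to optimize.
        item["labels"] = item["input_ids"].clone()
        return item
```

If the dataset does not supply labels itself, an alternative is to pass a collator such as transformers' DataCollatorForLanguageModeling(tokenizer, mlm=False) to the Trainer so that labels are generated on the fly.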