Fine-Tuning LLMs for Custom Agent Behaviors - Part 3: Fine-Tuning with Hugging Face

While OpenAI provides convenience, Hugging Face offers complete control over your AI agent's model training process. With Hugging Face, you can fine-tune any open-source model, customize training parameters, and deploy models independently without relying on external APIs.
This approach is particularly valuable when you need data privacy, cost control, or specialized model architectures that aren't available through commercial APIs.
Why Choose Hugging Face for AI Agent Fine-Tuning
Complete Model Control: Unlike OpenAI's black-box approach, Hugging Face lets you see and modify every aspect of your model training. You control the architecture, the training process, and deployment.
Cost Effectiveness: After the initial training cost, you own the model completely. No per-token charges, no API rate limits, no ongoing subscription fees.
Data Privacy: Your training data never leaves your infrastructure. This is crucial for sensitive applications like healthcare, finance, or proprietary business data.
Open Source Ecosystem: Access to thousands of pre-trained models, from small efficient models to large state-of-the-art architectures. You can start with models specifically designed for your use case.
Customization Freedom: Modify model architectures, implement custom training loops, and experiment with cutting-edge techniques not available in commercial APIs.
What You'll Learn in This Tutorial
By the end of this tutorial, you'll have:
- ✅ Complete Hugging Face training pipeline from data to deployed model
- ✅ Advanced training techniques including LoRA, QLoRA, and gradient checkpointing
- ✅ Model optimization strategies for inference speed and memory efficiency
- ✅ Custom evaluation metrics for AI agent-specific performance
- ✅ Deployment strategies for self-hosted model serving
- ✅ Cost optimization techniques for training and inference
Estimated Time: 45-50 minutes
Step 1: Understanding Hugging Face vs OpenAI Fine-Tuning
Before diving into implementation, it's crucial to understand the fundamental differences between these approaches.
OpenAI vs Hugging Face Comparison
| Aspect | OpenAI Fine-Tuning | Hugging Face Fine-Tuning |
|---|---|---|
| Control | Limited parameters | Full control over everything |
| Cost Model | Pay per token used | Pay for compute time only |
| Data Privacy | Data sent to OpenAI | Data stays on your infrastructure |
| Model Access | API only | Full model ownership |
| Customization | Predefined options | Unlimited customization |
| Infrastructure | Managed by OpenAI | You manage compute resources |
When to Choose Hugging Face
Choose Hugging Face When:
- You need complete data privacy
- You want to minimize long-term costs
- You require custom model architectures
- You need offline/air-gapped deployment
- You want to experiment with the latest research techniques
Choose OpenAI When:
- You want minimal setup complexity
- You need immediate results
- You don't have GPU infrastructure
- You prefer managed services
Understanding the Training Process
Hugging Face Fine-Tuning Steps:
1. Model Selection: Choose a base model from the Hugging Face Hub
2. Data Preparation: Format data for the transformers library
3. Training Configuration: Set hyperparameters and training arguments
4. Training Execution: Run training with monitoring
5. Model Evaluation: Test performance and quality
6. Model Deployment: Serve the model for inference
Step 2: Setting Up Hugging Face Training Environment
Let's create a comprehensive training environment that handles the complexities of transformer fine-tuning.
Environment Setup and Dependencies
Understanding GPU Requirements: Fine-tuning large language models requires significant computational resources. Here's what you need to know:
Minimum Requirements:
- GPU Memory: 8GB VRAM for small models (up to 1B parameters)
- System RAM: 16GB for data loading and processing
- Storage: 50GB+ for models, datasets, and checkpoints
Recommended Setup:
- GPU Memory: 24GB+ VRAM for larger models (7B+ parameters)
- System RAM: 32GB+ for efficient data processing
- Storage: 200GB+ SSD for fast data access
Cloud Alternatives: If you don't have local GPU resources:
- Google Colab Pro: $10/month, good for experimentation
- AWS EC2 GPU instances: Pay-per-hour, scalable
- Hugging Face Spaces: Integrated training environment
Installing Required Dependencies
# requirements-huggingface.txt - Comprehensive dependencies for HF fine-tuning
# Core Hugging Face libraries
transformers==4.36.0
datasets==2.14.0
tokenizers==0.15.0
accelerate==0.24.0
# Training and optimization
torch==2.1.0
torchvision==0.16.0
torchaudio==2.1.0
# Efficient training techniques
peft==0.6.0 # Parameter-Efficient Fine-Tuning (LoRA, etc.)
bitsandbytes==0.41.0 # Quantization for memory efficiency
# Monitoring and evaluation
wandb==0.16.0 # Experiment tracking
tensorboard==2.15.0 # Training visualization
evaluate==0.4.0 # Model evaluation metrics
# Data processing
pandas==2.1.0
numpy==1.24.0
scikit-learn==1.3.0
# Utilities
tqdm==4.66.0
rich==13.7.0
python-dotenv==1.0.0
Why These Dependencies Matter:
Transformers: The core library that handles model loading, training, and inference.
Datasets: Efficient data loading and processing, especially important for large datasets.
PEFT (Parameter-Efficient Fine-Tuning): Enables techniques like LoRA that dramatically reduce memory requirements and training time.
Accelerate: Handles distributed training, mixed precision, and other optimizations automatically.
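Before kicking off a long training run, it is worth confirming that the environment is wired up correctly. The following is a minimal sanity-check sketch (the script name and VRAM threshold are our own choices, not part of any library); it prints the installed library versions and reports whether a GPU with enough memory is visible to PyTorch.

# check_environment.py - quick sanity check before training (illustrative sketch)
import torch
import transformers
import datasets
import peft

print(f"transformers: {transformers.__version__}")
print(f"datasets:     {datasets.__version__}")
print(f"peft:         {peft.__version__}")

if torch.cuda.is_available():
    device = torch.cuda.get_device_properties(0)
    vram_gb = device.total_memory / 1024**3
    print(f"GPU: {device.name} with {vram_gb:.1f} GB VRAM")
    if vram_gb < 8:
        print("Warning: less than 8 GB VRAM; stick to small models with LoRA/QLoRA.")
else:
    print("No CUDA GPU detected; training will fall back to CPU and be very slow.")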
Training Environment Configuration
# training/training_config.py
import os
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class HuggingFaceTrainingConfig:
    """Configuration for Hugging Face fine-tuning"""

    # Model configuration
    model_name: str = "microsoft/DialoGPT-medium"  # Good starting model for agents
    tokenizer_name: Optional[str] = None  # Uses model_name if None
    cache_dir: str = "./models"

    # Training data
    train_file: str = "data/train.json"
    validation_file: Optional[str] = "data/validation.json"
    max_seq_length: int = 512

    # Training parameters
    num_train_epochs: int = 3
    per_device_train_batch_size: int = 4
    per_device_eval_batch_size: int = 4
    gradient_accumulation_steps: int = 4
    learning_rate: float = 5e-5
    weight_decay: float = 0.01
    warmup_steps: int = 100

    # Memory optimization
    use_lora: bool = True  # Use LoRA for efficient training
    lora_r: int = 16  # LoRA rank
    lora_alpha: int = 32  # LoRA alpha
    use_gradient_checkpointing: bool = True
    fp16: bool = True  # Mixed precision training

    # Monitoring and saving
    output_dir: str = "./fine_tuned_model"
    logging_steps: int = 10
    eval_steps: int = 100
    save_steps: int = 500
    save_total_limit: int = 3

    # Evaluation
    evaluation_strategy: str = "steps"
    load_best_model_at_end: bool = True
    metric_for_best_model: str = "eval_loss"

    def __post_init__(self):
        """Validate configuration after initialization"""
        self.validate_config()

    def validate_config(self):
        """Validate training configuration"""
        # Check GPU availability
        if not torch.cuda.is_available():
            print("⚠️ CUDA not available. Training will be slow on CPU.")
            self.fp16 = False  # Disable mixed precision on CPU

        # Validate file paths
        if not os.path.exists(self.train_file):
            raise ValueError(f"Training file not found: {self.train_file}")

        # Check memory requirements
        if self.use_lora:
            print("✅ Using LoRA for memory-efficient training")
        else:
            print("⚠️ Full fine-tuning requires significant GPU memory")

        # Validate effective batch size
        if self.per_device_train_batch_size * self.gradient_accumulation_steps < 8:
            print("⚠️ Effective batch size is very small, may affect training stability")

        print("✅ Training configuration validated")
        print(f"   Model: {self.model_name}")
        print(f"   Epochs: {self.num_train_epochs}")
        print(f"   Batch size: {self.per_device_train_batch_size}")
        print(f"   LoRA enabled: {self.use_lora}")
Configuration Explanation:
LoRA (Low-Rank Adaptation): Instead of updating all model parameters, LoRA freezes the base model and trains only small additional low-rank matrices. This typically shrinks the trainable parameter count to a few percent (or less) of the original and cuts GPU memory use dramatically, while largely maintaining performance.
Gradient Checkpointing: Trades computation for memory by recomputing activations during backpropagation instead of storing them.
Mixed Precision (FP16): Uses 16-bit floats instead of 32-bit, reducing memory usage and increasing training speed on modern GPUs.
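As a quick usage sketch (assuming the default training file already exists on disk), instantiating the config immediately runs the validation checks from __post_init__:

# example: instantiating and overriding the training configuration
from training.training_config import HuggingFaceTrainingConfig

# __post_init__ calls validate_config(), so configuration problems surface immediately
config = HuggingFaceTrainingConfig(
    model_name="microsoft/DialoGPT-medium",
    train_file="data/train.json",        # must exist, or a ValueError is raised
    num_train_epochs=3,
    per_device_train_batch_size=4,
    use_lora=True,
)
print(config.output_dir)  # ./fine_tuned_model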
Step 3: Data Preparation for Hugging Face
Hugging Face requires specific data formats that differ from OpenAI's approach.
Understanding Hugging Face Data Formats
Key Differences from OpenAI:
- Flexible Formats: JSON, CSV, Parquet, or custom datasets
- Column-Based: Data organized in columns rather than message arrays
- Preprocessing Control: You handle tokenization and formatting
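To make the difference concrete, here is a hypothetical record in both shapes: the OpenAI-style message array we start from, and the single flattened text column that the conversion code below produces.

# One OpenAI-style record (one JSON object per line in a .jsonl file)
openai_record = {
    "messages": [
        {"role": "system", "content": "You are a helpful support agent."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Go to Settings > Security and click 'Reset password'."}
    ]
}

# The equivalent Hugging Face record: a single "text" column per example.
# <|endoftext|> is the end-of-sequence token for GPT-2-style models such as DialoGPT.
hf_record = {
    "text": "System: You are a helpful support agent.\n"
            "Human: How do I reset my password?\n"
            "Assistant: Go to Settings > Security and click 'Reset password'.<|endoftext|>"
}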
Data Preparation Pipeline
# data_preparation/hf_data_processor.py
import json
from typing import Dict, Any, Optional

from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer

from training.training_config import HuggingFaceTrainingConfig


class HuggingFaceDataProcessor:
    """Process training data for Hugging Face fine-tuning"""

    def __init__(self, config: HuggingFaceTrainingConfig):
        self.config = config

        # Initialize tokenizer
        tokenizer_name = config.tokenizer_name or config.model_name
        self.tokenizer = AutoTokenizer.from_pretrained(
            tokenizer_name,
            cache_dir=config.cache_dir
        )

        # Add special tokens if needed
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

        print(f"✅ Initialized tokenizer: {tokenizer_name}")

    def prepare_conversational_data(self, input_file: str, output_file: str) -> Dict[str, Any]:
        """
        Convert OpenAI-style conversational data to Hugging Face format

        Args:
            input_file: Path to OpenAI JSONL format file
            output_file: Path to save Hugging Face format

        Returns:
            Processing statistics
        """
        print(f"Converting {input_file} to Hugging Face format...")

        processed_examples = []
        stats = {
            "total_examples": 0,
            "processed_examples": 0,
            "skipped_examples": 0,
            "average_length": 0,
            "max_length": 0
        }

        try:
            # Read OpenAI format data
            with open(input_file, 'r', encoding='utf-8') as f:
                for line_num, line in enumerate(f, 1):
                    try:
                        example = json.loads(line.strip())
                        stats["total_examples"] += 1

                        # Convert to Hugging Face format
                        hf_example = self.convert_openai_to_hf_format(example)

                        if hf_example:
                            processed_examples.append(hf_example)
                            stats["processed_examples"] += 1

                            # Track length statistics
                            text_length = len(hf_example["text"])
                            stats["max_length"] = max(stats["max_length"], text_length)
                        else:
                            stats["skipped_examples"] += 1

                    except json.JSONDecodeError as e:
                        print(f"⚠️ Skipping invalid JSON on line {line_num}: {e}")
                        stats["skipped_examples"] += 1

            # Calculate average length
            if processed_examples:
                total_length = sum(len(ex["text"]) for ex in processed_examples)
                stats["average_length"] = total_length / len(processed_examples)

            # Save processed data
            with open(output_file, 'w', encoding='utf-8') as f:
                json.dump(processed_examples, f, indent=2, ensure_ascii=False)

            print(f"✅ Processed {stats['processed_examples']}/{stats['total_examples']} examples")
            print(f"   Average length: {stats['average_length']:.0f} characters")
            print(f"   Max length: {stats['max_length']} characters")
            print(f"   Saved to: {output_file}")

            return stats

        except Exception as e:
            print(f"❌ Error processing data: {e}")
            raise

    def convert_openai_to_hf_format(self, openai_example: Dict[str, Any]) -> Optional[Dict[str, Any]]:
        """
        Convert a single OpenAI example to Hugging Face format

        OpenAI format: {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
        HF format: {"text": "Human: ... Assistant: ..."}
        """
        try:
            messages = openai_example.get("messages", [])
            if len(messages) < 2:
                return None

            # Build conversation text
            conversation_parts = []
            for message in messages:
                role = message.get("role", "")
                content = message.get("content", "").strip()

                if not content:
                    continue

                # Map roles to conversation format
                if role == "system":
                    conversation_parts.append(f"System: {content}")
                elif role == "user":
                    conversation_parts.append(f"Human: {content}")
                elif role == "assistant":
                    conversation_parts.append(f"Assistant: {content}")

            if len(conversation_parts) < 2:
                return None

            # Join turns and append the end-of-sequence token
            conversation_text = "\n".join(conversation_parts)
            conversation_text += self.tokenizer.eos_token

            return {
                "text": conversation_text,
                "length": len(conversation_text)
            }

        except Exception as e:
            print(f"⚠️ Error converting example: {e}")
            return None

    def create_dataset(self, data_file: str, validation_split: float = 0.1) -> DatasetDict:
        """
        Create a Hugging Face Dataset from processed data

        Args:
            data_file: Path to processed JSON data
            validation_split: Fraction of data to use for validation

        Returns:
            DatasetDict with train and validation splits
        """
        print(f"Creating dataset from {data_file}...")

        try:
            # Load processed data
            with open(data_file, 'r', encoding='utf-8') as f:
                data = json.load(f)

            # Create dataset
            dataset = Dataset.from_list(data)

            # Split into train/validation
            if validation_split > 0:
                split_dataset = dataset.train_test_split(
                    test_size=validation_split,
                    shuffle=True,
                    seed=42
                )
                dataset_dict = DatasetDict({
                    'train': split_dataset['train'],
                    'validation': split_dataset['test']
                })
            else:
                dataset_dict = DatasetDict({
                    'train': dataset
                })

            print("✅ Dataset created:")
            print(f"   Training examples: {len(dataset_dict['train'])}")
            if 'validation' in dataset_dict:
                print(f"   Validation examples: {len(dataset_dict['validation'])}")

            return dataset_dict

        except Exception as e:
            print(f"❌ Error creating dataset: {e}")
            raise

    def tokenize_dataset(self, dataset: DatasetDict) -> DatasetDict:
        """
        Tokenize the dataset for training

        This is where we convert text to tokens that the model can understand.
        """
        print("Tokenizing dataset...")

        def tokenize_function(examples):
            """Tokenize a batch of examples"""
            # Tokenize the text
            tokenized = self.tokenizer(
                examples["text"],
                truncation=True,
                padding=False,  # We'll pad dynamically during training
                max_length=self.config.max_seq_length,
                return_tensors=None
            )

            # For causal language modeling, labels are the same as input_ids
            tokenized["labels"] = tokenized["input_ids"].copy()
            return tokenized

        try:
            # Apply tokenization to all splits
            tokenized_dataset = dataset.map(
                tokenize_function,
                batched=True,
                remove_columns=dataset["train"].column_names,  # Remove original text columns
                desc="Tokenizing dataset"
            )

            print("✅ Dataset tokenization complete")

            # Print tokenization statistics
            train_lengths = [len(ids) for ids in tokenized_dataset["train"]["input_ids"]]
            print(f"   Average tokens per example: {sum(train_lengths) / len(train_lengths):.0f}")
            print(f"   Max tokens: {max(train_lengths)}")
            print(f"   Min tokens: {min(train_lengths)}")

            return tokenized_dataset

        except Exception as e:
            print(f"❌ Error tokenizing dataset: {e}")
            raise
Data Processing Explanation:
Tokenization: Converting human-readable text into numerical tokens that the model can process. Each word or subword gets mapped to a number.
Truncation: Long conversations are cut to fit within the model's maximum sequence length. This prevents memory issues during training.
Labels Creation: For causal language modeling we predict the next token, so the labels are simply a copy of the input IDs; the model shifts them by one position internally when computing the loss.
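Putting the processor to work typically looks like the following sketch (file paths are illustrative, and note that the config's validation expects its train_file to already exist):

# example: running the data preparation pipeline end to end
from training.training_config import HuggingFaceTrainingConfig
from data_preparation.hf_data_processor import HuggingFaceDataProcessor

# Assumes data/train.json already exists (validate_config checks for it);
# on a first run, point train_file at any existing placeholder file.
config = HuggingFaceTrainingConfig()
processor = HuggingFaceDataProcessor(config)

# 1. Convert OpenAI-style JSONL into the flattened "text" format
stats = processor.prepare_conversational_data(
    input_file="data/openai_training.jsonl",   # illustrative path
    output_file="data/train.json"
)

# 2. Build train/validation splits and tokenize them
dataset = processor.create_dataset("data/train.json", validation_split=0.1)
tokenized_dataset = processor.tokenize_dataset(dataset)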
Step 4: Advanced Training Techniques
Let's implement modern fine-tuning techniques that make training efficient and effective.
LoRA (Low-Rank Adaptation) Implementation
Why LoRA Matters: Instead of updating all of the billions of parameters in a large model, LoRA freezes the base weights and trains small "adapter" matrices that capture the task-specific knowledge. Compared with full fine-tuning, this reduces:
- Memory usage: often by 90% or more
- Training time: significantly faster per epoch
- Storage requirements: only the small adapter weights need to be stored (see the quick calculation after this list)
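A back-of-the-envelope calculation shows where those savings come from. For a single d × d projection matrix, full fine-tuning updates d² parameters, while LoRA with rank r trains only two small matrices totalling 2·d·r parameters. A minimal sketch with hypothetical numbers:

# Rough illustration of LoRA's parameter savings for one attention projection.
# The hidden size below is an assumption; real savings depend on the model and target modules.
d = 4096   # hidden size of a hypothetical 7B-class model
r = 16     # LoRA rank from our config

full_params = d * d        # parameters updated by full fine-tuning
lora_params = 2 * d * r    # parameters in the LoRA A (d x r) and B (r x d) matrices

print(f"Full fine-tuning: {full_params:,} params per projection")   # 16,777,216
print(f"LoRA (r={r}):     {lora_params:,} params per projection")   # 131,072
print(f"Trainable fraction: {lora_params / full_params:.2%}")        # 0.78%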
# training/lora_trainer.py
import torch
from datasets import DatasetDict
from peft import LoraConfig, get_peft_model, TaskType
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)

from training.training_config import HuggingFaceTrainingConfig


class LoRATrainer:
    """Trainer using LoRA for efficient fine-tuning"""

    def __init__(self, config: HuggingFaceTrainingConfig):
        self.config = config

        # Initialize model and tokenizer
        self.setup_model_and_tokenizer()

        # Configure LoRA
        self.setup_lora()

        print("✅ LoRA trainer initialized")

    def setup_model_and_tokenizer(self):
        """Initialize the base model and tokenizer"""
        print(f"📥 Loading model: {self.config.model_name}")

        # Load tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.config.model_name,
            cache_dir=self.config.cache_dir
        )

        # Ensure tokenizer has a pad token
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

        # Load model with appropriate settings
        model_kwargs = {
            "cache_dir": self.config.cache_dir,
            "torch_dtype": torch.float16 if self.config.fp16 else torch.float32,
        }

        # Use device_map for multi-GPU setups
        if torch.cuda.device_count() > 1:
            model_kwargs["device_map"] = "auto"

        self.model = AutoModelForCausalLM.from_pretrained(
            self.config.model_name,
            **model_kwargs
        )

        print(f"✅ Model loaded: {self.model.num_parameters():,} parameters")

    def setup_lora(self):
        """Configure LoRA for the model"""
        if not self.config.use_lora:
            print("ℹ️ LoRA disabled, using full fine-tuning")
            return

        # LoRA configuration.
        # Note: target_modules must match the attention layer names of your base model
        # (e.g. "c_attn" for GPT-2-style models such as DialoGPT; "q_proj", "v_proj",
        # "k_proj", "o_proj" for LLaMA-style models).
        lora_config = LoraConfig(
            task_type=TaskType.CAUSAL_LM,       # Causal language modeling
            r=self.config.lora_r,               # Rank of adaptation
            lora_alpha=self.config.lora_alpha,  # LoRA scaling parameter
            lora_dropout=0.1,                   # Dropout for LoRA layers
            target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Which layers to adapt
        )

        # When gradient checkpointing is enabled, the inputs must require grads,
        # otherwise backprop through the frozen base model fails with LoRA.
        if self.config.use_gradient_checkpointing:
            self.model.enable_input_require_grads()

        # Apply LoRA to the model
        self.model = get_peft_model(self.model, lora_config)

        # Print trainable parameters
        trainable_params = sum(p.numel() for p in self.model.parameters() if p.requires_grad)
        total_params = sum(p.numel() for p in self.model.parameters())

        print("✅ LoRA applied:")
        print(f"   Trainable parameters: {trainable_params:,} ({trainable_params/total_params*100:.2f}%)")
        print(f"   Total parameters: {total_params:,}")

    def create_training_arguments(self) -> TrainingArguments:
        """Create training arguments for the Trainer"""
        return TrainingArguments(
            output_dir=self.config.output_dir,

            # Training parameters
            num_train_epochs=self.config.num_train_epochs,
            per_device_train_batch_size=self.config.per_device_train_batch_size,
            per_device_eval_batch_size=self.config.per_device_eval_batch_size,
            gradient_accumulation_steps=self.config.gradient_accumulation_steps,

            # Optimization
            learning_rate=self.config.learning_rate,
            weight_decay=self.config.weight_decay,
            warmup_steps=self.config.warmup_steps,

            # Memory optimization
            fp16=self.config.fp16,
            gradient_checkpointing=self.config.use_gradient_checkpointing,
            dataloader_pin_memory=False,  # Can cause issues with some setups

            # Logging and evaluation
            logging_steps=self.config.logging_steps,
            eval_steps=self.config.eval_steps,
            evaluation_strategy=self.config.evaluation_strategy,

            # Saving
            save_steps=self.config.save_steps,
            save_total_limit=self.config.save_total_limit,
            load_best_model_at_end=self.config.load_best_model_at_end,
            metric_for_best_model=self.config.metric_for_best_model,

            # Reporting
            report_to=["tensorboard"],  # Log to TensorBoard
            run_name=f"agent-finetune-{self.config.model_name.split('/')[-1]}",

            # Reproducibility
            seed=42,
            data_seed=42,
        )

    def train_model(self, tokenized_dataset: DatasetDict) -> str:
        """
        Train the model using the Hugging Face Trainer

        Args:
            tokenized_dataset: Tokenized training data

        Returns:
            Path to the trained model
        """
        print("Starting model training...")

        try:
            # Create data collator for dynamic padding
            data_collator = DataCollatorForLanguageModeling(
                tokenizer=self.tokenizer,
                mlm=False,  # We're doing causal LM, not masked LM
                pad_to_multiple_of=8 if self.config.fp16 else None
            )

            # Create training arguments
            training_args = self.create_training_arguments()

            # Initialize trainer
            trainer = Trainer(
                model=self.model,
                args=training_args,
                train_dataset=tokenized_dataset["train"],
                eval_dataset=tokenized_dataset.get("validation"),
                tokenizer=self.tokenizer,
                data_collator=data_collator,
            )

            # Start training
            print("⏳ Training in progress...")
            train_result = trainer.train()

            # Save the final model
            trainer.save_model()
            self.tokenizer.save_pretrained(self.config.output_dir)

            print("✅ Training completed successfully!")
            print(f"   Final loss: {train_result.training_loss:.4f}")
            print(f"   Training time: {train_result.metrics['train_runtime']:.2f} seconds")
            print(f"   Model saved to: {self.config.output_dir}")

            return self.config.output_dir

        except Exception as e:
            print(f"❌ Training failed: {e}")
            raise
Training Process Explanation:
Data Collator: Handles batching and padding of sequences. Since conversations have different lengths, we need dynamic padding to create uniform batches.
Trainer Class: Hugging Face's high-level training interface that handles the training loop, evaluation, checkpointing, and logging automatically.
Mixed Precision Training: Uses both 16-bit and 32-bit floating point numbers to speed up training while maintaining numerical stability.
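To tie the pieces together, here is a minimal training driver. It is a sketch that assumes the module layout used throughout this tutorial (the training/ and data_preparation/ packages) and the default file paths from the config; in a real project you would add experiment tracking and more robust error handling around it.

# train_agent.py - minimal end-to-end driver (illustrative sketch)
from training.training_config import HuggingFaceTrainingConfig
from training.lora_trainer import LoRATrainer
from data_preparation.hf_data_processor import HuggingFaceDataProcessor


def main():
    # 1. Configuration (validated on creation; assumes data/train.json was
    #    produced by the data preparation step)
    config = HuggingFaceTrainingConfig()

    # 2. Prepare and tokenize the data
    processor = HuggingFaceDataProcessor(config)
    dataset = processor.create_dataset(config.train_file, validation_split=0.1)
    tokenized_dataset = processor.tokenize_dataset(dataset)

    # 3. Train with LoRA and save the adapter + tokenizer
    trainer = LoRATrainer(config)
    model_path = trainer.train_model(tokenized_dataset)
    print(f"Fine-tuned model available at: {model_path}")


if __name__ == "__main__":
    main()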
Step 5: Model Evaluation and Testing
After training, it's crucial to evaluate your model's performance for AI agent tasks.
AI Agent-Specific Evaluation
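The evaluator below expects each test case to be a dictionary with context, user_message, and expected_response fields. A small, entirely hypothetical test set might look like this:

# Example test cases for the evaluator (contents are hypothetical)
test_cases = [
    {
        "context": "System: You are a customer support agent for an e-commerce store.",
        "user_message": "Where is my order #12345?",
        "expected_response": "Let me check the status of order #12345 for you."
    },
    {
        "context": "System: You are a scheduling assistant.",
        "user_message": "Book a meeting with Sam for Tuesday at 10am.",
        "expected_response": "I've scheduled a meeting with Sam on Tuesday at 10:00 AM."
    },
]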
# evaluation/agent_evaluator.py
import torch
from typing import List, Dict, Any, Optional
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline


class AgentModelEvaluator:
    """Evaluate fine-tuned models for AI agent performance"""

    def __init__(self, model_path: str, tokenizer_path: Optional[str] = None):
        """
        Initialize evaluator with a trained model

        Args:
            model_path: Path to fine-tuned model
            tokenizer_path: Path to tokenizer (uses model_path if None)
        """
        self.model_path = model_path
        tokenizer_path = tokenizer_path or model_path

        print(f"📥 Loading model for evaluation: {model_path}")

        # Load model and tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
            device_map="auto" if torch.cuda.is_available() else None
        )

        # Create generation pipeline.
        # Note: when the model is loaded with device_map="auto", accelerate has already
        # placed it on the GPU, so we don't pass an explicit device here.
        self.generator = pipeline(
            "text-generation",
            model=self.model,
            tokenizer=self.tokenizer
        )

        print("✅ Model loaded for evaluation")

    def evaluate_conversational_ability(self, test_cases: List[Dict[str, Any]]) -> Dict[str, Any]:
        """
        Evaluate the model's conversational abilities

        Args:
            test_cases: List of test conversations

        Returns:
            Evaluation results with metrics
        """
        print(f"🧪 Evaluating conversational ability on {len(test_cases)} test cases...")

        results = {
            "total_cases": len(test_cases),
            "successful_responses": 0,
            "average_response_length": 0,
            "response_quality_scores": [],
            "detailed_results": []
        }

        total_response_length = 0

        for i, test_case in enumerate(test_cases):
            try:
                # Extract conversation context and expected response
                context = test_case.get("context", "")
                user_message = test_case.get("user_message", "")
                expected_response = test_case.get("expected_response", "")

                # Build prompt
                prompt = f"{context}\nHuman: {user_message}\nAssistant:"

                # Generate response
                generated = self.generator(
                    prompt,
                    max_new_tokens=150,
                    temperature=0.7,
                    do_sample=True,
                    pad_token_id=self.tokenizer.eos_token_id,
                    eos_token_id=self.tokenizer.eos_token_id
                )

                # Extract generated text (remove prompt)
                full_response = generated[0]["generated_text"]
                response = full_response[len(prompt):].strip()

                # Evaluate response quality
                quality_score = self.evaluate_response_quality(
                    user_message, response, expected_response
                )

                results["response_quality_scores"].append(quality_score)
                total_response_length += len(response)

                if quality_score > 0.6:  # Threshold for a "successful" response
                    results["successful_responses"] += 1

                # Store detailed result
                results["detailed_results"].append({
                    "test_case": i + 1,
                    "user_message": user_message,
                    "generated_response": response,