Fine-Tuning LLMs for Custom Agent Behaviors - Part 3: Fine-Tuning with Hugging Face

While OpenAI provides convenience, Hugging Face offers complete control over your AI agent's model training process. With Hugging Face, you can fine-tune any open-source model, customize training parameters, and deploy models independently without relying on external APIs.
This approach is particularly valuable when you need data privacy, cost control, or specialized model architectures that aren't available through commercial APIs.
Why Choose Hugging Face for AI Agent Fine-Tuning
Complete Model Control: Unlike OpenAI's black-box approach, Hugging Face lets you see and modify every aspect of your model training. You control the architecture, the training process, and deployment.
Cost Effectiveness: After the initial training cost, you own the model completely. No per-token charges, no API rate limits, no ongoing subscription fees.
Data Privacy: Your training data never leaves your infrastructure. This is crucial for sensitive applications like healthcare, finance, or proprietary business data.
Open Source Ecosystem: Access to thousands of pre-trained models, from small efficient models to large state-of-the-art architectures. You can start with models specifically designed for your use case.
Customization Freedom: Modify model architectures, implement custom training loops, and experiment with cutting-edge techniques not available in commercial APIs.
What You'll Learn in This Tutorial
By the end of this tutorial, you'll have:
- ✅ Complete Hugging Face training pipeline from data to deployed model
- ✅ Advanced training techniques including LoRA, QLoRA, and gradient checkpointing
- ✅ Model optimization strategies for inference speed and memory efficiency
- ✅ Custom evaluation metrics for AI agent-specific performance
- ✅ Deployment strategies for self-hosted model serving
- ✅ Cost optimization techniques for training and inference
Estimated Time: 45-50 minutes
Step 1: Understanding Hugging Face vs OpenAI Fine-Tuning
Before diving into implementation, it's crucial to understand the fundamental differences between these approaches.
OpenAI vs Hugging Face Comparison
| Aspect | OpenAI Fine-Tuning | Hugging Face Fine-Tuning |
|---|---|---|
| Control | Limited parameters | Full control over everything |
| Cost Model | Pay per token used | Pay for compute time only |
| Data Privacy | Data sent to OpenAI | Data stays on your infrastructure |
| Model Access | API only | Full model ownership |
| Customization | Predefined options | Unlimited customization |
| Infrastructure | Managed by OpenAI | You manage compute resources |
When to Choose Hugging Face
Choose Hugging Face When:
- You need complete data privacy
- You want to minimize long-term costs
- You require custom model architectures
- You need offline/air-gapped deployment
- You want to experiment with the latest research techniques
Choose OpenAI When:
- You want minimal setup complexity
- You need immediate results
- You don't have GPU infrastructure
- You prefer managed services
Understanding the Training Process
Hugging Face Fine-Tuning Steps:
1. Model Selection: Choose a base model from the Hugging Face Hub
2. Data Preparation: Format data for the transformers library
3. Training Configuration: Set hyperparameters and training arguments
4. Training Execution: Run training with monitoring
5. Model Evaluation: Test performance and quality
6. Model Deployment: Serve the model for inference
Step 2: Setting Up Hugging Face Training Environment
Let's create a comprehensive training environment that handles the complexities of transformer fine-tuning.
Environment Setup and Dependencies
Understanding GPU Requirements: Fine-tuning large language models requires significant computational resources. Here's what you need to know:
Minimum Requirements:
- GPU Memory: 8GB VRAM for small models (up to 1B parameters)
- System RAM: 16GB for data loading and processing
- Storage: 50GB+ for models, datasets, and checkpoints
Recommended Setup:
- GPU Memory: 24GB+ VRAM for larger models (7B+ parameters)
- System RAM: 32GB+ for efficient data processing
- Storage: 200GB+ SSD for fast data access
Cloud Alternatives: If you don't have local GPU resources:
- Google Colab Pro: $10/month, good for experimentation
- AWS EC2 GPU instances: Pay-per-hour, scalable
- Hugging Face Spaces: Integrated training environment
Installing Required Dependencies
# requirements-huggingface.txt - Comprehensive dependencies for HF fine-tuning
# Core Hugging Face libraries
transformers==4.36.0
datasets==2.14.0
tokenizers==0.15.0
accelerate==0.24.0
# Training and optimization
torch==2.1.0
torchvision==0.16.0
torchaudio==2.1.0
# Efficient training techniques
peft==0.6.0 # Parameter-Efficient Fine-Tuning (LoRA, etc.)
bitsandbytes==0.41.0 # Quantization for memory efficiency
# Monitoring and evaluation
wandb==0.16.0 # Experiment tracking
tensorboard==2.15.0 # Training visualization
evaluate==0.4.0 # Model evaluation metrics
# Data processing
pandas==2.1.0
numpy==1.24.0
scikit-learn==1.3.0
# Utilities
tqdm==4.66.0
rich==13.7.0
python-dotenv==1.0.0
Why These Dependencies Matter:
Transformers: The core library that handles model loading, training, and inference.
Datasets: Efficient data loading and processing, especially important for large datasets.
PEFT (Parameter-Efficient Fine-Tuning): Enables techniques like LoRA that dramatically reduce memory requirements and training time.
Accelerate: Handles distributed training, mixed precision, and other optimizations automatically.
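Before kicking off a long training run, it is worth confirming that the environment is wired up correctly. The following is a minimal sanity-check sketch (the script name and VRAM threshold are our own choices, not part of any library); it prints the installed library versions and reports whether a GPU with enough memory is visible to PyTorch.

# check_environment.py - quick sanity check before training (illustrative sketch)
import torch
import transformers
import datasets
import peft

print(f"transformers: {transformers.__version__}")
print(f"datasets:     {datasets.__version__}")
print(f"peft:         {peft.__version__}")

if torch.cuda.is_available():
    device = torch.cuda.get_device_properties(0)
    vram_gb = device.total_memory / 1024**3
    print(f"GPU: {device.name} with {vram_gb:.1f} GB VRAM")
    if vram_gb < 8:
        print("Warning: less than 8 GB VRAM; stick to small models with LoRA/QLoRA.")
else:
    print("No CUDA GPU detected; training will fall back to CPU and be very slow.")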
Training Environment Configuration
# training/training_config.py
import os
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class HuggingFaceTrainingConfig:
    """Configuration for Hugging Face fine-tuning"""

    # Model configuration
    model_name: str = "microsoft/DialoGPT-medium"  # Good starting model for agents
    tokenizer_name: Optional[str] = None  # Uses model_name if None
    cache_dir: str = "./models"

    # Training data
    train_file: str = "data/train.json"
    validation_file: Optional[str] = "data/validation.json"
    max_seq_length: int = 512

    # Training parameters
    num_train_epochs: int = 3
    per_device_train_batch_size: int = 4
    per_device_eval_batch_size: int = 4
    gradient_accumulation_steps: int = 4
    learning_rate: float = 5e-5
    weight_decay: float = 0.01
    warmup_steps: int = 100

    # Memory optimization
    use_lora: bool = True  # Use LoRA for efficient training
    lora_r: int = 16  # LoRA rank
    lora_alpha: int = 32  # LoRA alpha
    use_gradient_checkpointing: bool = True
    fp16: bool = True  # Mixed precision training

    # Monitoring and saving
    output_dir: str = "./fine_tuned_model"
    logging_steps: int = 10
    eval_steps: int = 100
    save_steps: int = 500
    save_total_limit: int = 3

    # Evaluation
    evaluation_strategy: str = "steps"
    load_best_model_at_end: bool = True
    metric_for_best_model: str = "eval_loss"

    def __post_init__(self):
        """Validate configuration after initialization"""
        self.validate_config()

    def validate_config(self):
        """Validate training configuration"""
        # Check GPU availability
        if not torch.cuda.is_available():
            print("⚠️ CUDA not available. Training will be slow on CPU.")
            self.fp16 = False  # Disable mixed precision on CPU

        # Validate file paths
        if not os.path.exists(self.train_file):
            raise ValueError(f"Training file not found: {self.train_file}")

        # Check memory requirements
        if self.use_lora:
            print("✅ Using LoRA for memory-efficient training")
        else:
            print("⚠️ Full fine-tuning requires significant GPU memory")

        # Validate effective batch size
        if self.per_device_train_batch_size * self.gradient_accumulation_steps < 8:
            print("⚠️ Effective batch size is very small, may affect training stability")

        print("✅ Training configuration validated")
        print(f"   Model: {self.model_name}")
        print(f"   Epochs: {self.num_train_epochs}")
        print(f"   Batch size: {self.per_device_train_batch_size}")
        print(f"   LoRA enabled: {self.use_lora}")
Configuration Explanation:
LoRA (Low-Rank Adaptation): Instead of updating all model parameters, LoRA freezes the base model and trains only small additional low-rank matrices. This typically shrinks the trainable parameter count to a few percent (or less) of the original and cuts GPU memory use dramatically, while largely maintaining performance.
Gradient Checkpointing: Trades computation for memory by recomputing activations during backpropagation instead of storing them.
Mixed Precision (FP16): Uses 16-bit floats instead of 32-bit, reducing memory usage and increasing training speed on modern GPUs.
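As a quick usage sketch (assuming the default training file already exists on disk), instantiating the config immediately runs the validation checks from __post_init__:

# example: instantiating and overriding the training configuration
from training.training_config import HuggingFaceTrainingConfig

# __post_init__ calls validate_config(), so configuration problems surface immediately
config = HuggingFaceTrainingConfig(
    model_name="microsoft/DialoGPT-medium",
    train_file="data/train.json",        # must exist, or a ValueError is raised
    num_train_epochs=3,
    per_device_train_batch_size=4,
    use_lora=True,
)
print(config.output_dir)  # ./fine_tuned_model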
Step 3: Data Preparation for Hugging Face
Hugging Face requires specific data formats that differ from OpenAI's approach.
Understanding Hugging Face Data Formats
Key Differences from OpenAI:
- Flexible Formats: JSON, CSV, Parquet, or custom datasets
- Column-Based: Data organized in columns rather than message arrays
- Preprocessing Control: You handle tokenization and formatting
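To make the difference concrete, here is a hypothetical record in both shapes: the OpenAI-style message array we start from, and the single flattened text column that the conversion code below produces.

# One OpenAI-style record (one JSON object per line in a .jsonl file)
openai_record = {
    "messages": [
        {"role": "system", "content": "You are a helpful support agent."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Go to Settings > Security and click 'Reset password'."}
    ]
}

# The equivalent Hugging Face record: a single "text" column per example.
# <|endoftext|> is the end-of-sequence token for GPT-2-style models such as DialoGPT.
hf_record = {
    "text": "System: You are a helpful support agent.\n"
            "Human: How do I reset my password?\n"
            "Assistant: Go to Settings > Security and click 'Reset password'.<|endoftext|>"
}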
Data Preparation Pipeline
# data_preparation/hf_data_processor.py
import json
from typing import Dict, Any, Optional

from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer

from training.training_config import HuggingFaceTrainingConfig


class HuggingFaceDataProcessor:
    """Process training data for Hugging Face fine-tuning"""

    def __init__(self, config: HuggingFaceTrainingConfig):
        self.config = config

        # Initialize tokenizer
        tokenizer_name = config.tokenizer_name or config.model_name
        self.tokenizer = AutoTokenizer.from_pretrained(
            tokenizer_name,
            cache_dir=config.cache_dir
        )

        # Add special tokens if needed
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

        print(f"✅ Initialized tokenizer: {tokenizer_name}")

    def prepare_conversational_data(self, input_file: str, output_file: str) -> Dict[str, Any]:
        """
        Convert OpenAI-style conversational data to Hugging Face format

        Args:
            input_file: Path to OpenAI JSONL format file
            output_file: Path to save Hugging Face format

        Returns:
            Processing statistics
        """
        print(f"Converting {input_file} to Hugging Face format...")

        processed_examples = []
        stats = {
            "total_examples": 0,
            "processed_examples": 0,
            "skipped_examples": 0,
            "average_length": 0,
            "max_length": 0
        }

        try:
            # Read OpenAI format data
            with open(input_file, 'r', encoding='utf-8') as f:
                for line_num, line in enumerate(f, 1):
                    try:
                        example = json.loads(line.strip())
                        stats["total_examples"] += 1

                        # Convert to Hugging Face format
                        hf_example = self.convert_openai_to_hf_format(example)

                        if hf_example:
                            processed_examples.append(hf_example)
                            stats["processed_examples"] += 1

                            # Track length statistics
                            text_length = len(hf_example["text"])
                            stats["max_length"] = max(stats["max_length"], text_length)
                        else:
                            stats["skipped_examples"] += 1

                    except json.JSONDecodeError as e:
                        print(f"⚠️ Skipping invalid JSON on line {line_num}: {e}")
                        stats["skipped_examples"] += 1

            # Calculate average length
            if processed_examples:
                total_length = sum(len(ex["text"]) for ex in processed_examples)
                stats["average_length"] = total_length / len(processed_examples)

            # Save processed data
            with open(output_file, 'w', encoding='utf-8') as f:
                json.dump(processed_examples, f, indent=2, ensure_ascii=False)

            print(f"✅ Processed {stats['processed_examples']}/{stats['total_examples']} examples")
            print(f"   Average length: {stats['average_length']:.0f} characters")
            print(f"   Max length: {stats['max_length']} characters")
            print(f"   Saved to: {output_file}")

            return stats

        except Exception as e:
            print(f"❌ Error processing data: {e}")
            raise

    def convert_openai_to_hf_format(self, openai_example: Dict[str, Any]) -> Optional[Dict[str, Any]]:
        """
        Convert a single OpenAI example to Hugging Face format

        OpenAI format: {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
        HF format: {"text": "Human: ... Assistant: ..."}
        """
        try:
            messages = openai_example.get("messages", [])
            if len(messages) < 2:
                return None

            # Build conversation text
            conversation_parts = []
            for message in messages:
                role = message.get("role", "")
                content = message.get("content", "").strip()

                if not content:
                    continue

                # Map roles to conversation format
                if role == "system":
                    conversation_parts.append(f"System: {content}")
                elif role == "user":
                    conversation_parts.append(f"Human: {content}")
                elif role == "assistant":
                    conversation_parts.append(f"Assistant: {content}")

            if len(conversation_parts) < 2:
                return None

            # Join turns and append the end-of-sequence token
            conversation_text = "\n".join(conversation_parts)
            conversation_text += self.tokenizer.eos_token

            return {
                "text": conversation_text,
                "length": len(conversation_text)
            }

        except Exception as e:
            print(f"⚠️ Error converting example: {e}")
            return None

    def create_dataset(self, data_file: str, validation_split: float = 0.1) -> DatasetDict:
        """
        Create a Hugging Face Dataset from processed data

        Args:
            data_file: Path to processed JSON data
            validation_split: Fraction of data to use for validation

        Returns:
            DatasetDict with train and validation splits
        """
        print(f"Creating dataset from {data_file}...")

        try:
            # Load processed data
            with open(data_file, 'r', encoding='utf-8') as f:
                data = json.load(f)

            # Create dataset
            dataset = Dataset.from_list(data)

            # Split into train/validation
            if validation_split > 0:
                split_dataset = dataset.train_test_split(
                    test_size=validation_split,
                    shuffle=True,
                    seed=42
                )
                dataset_dict = DatasetDict({
                    'train': split_dataset['train'],
                    'validation': split_dataset['test']
                })
            else:
                dataset_dict = DatasetDict({
                    'train': dataset
                })

            print("✅ Dataset created:")
            print(f"   Training examples: {len(dataset_dict['train'])}")
            if 'validation' in dataset_dict:
                print(f"   Validation examples: {len(dataset_dict['validation'])}")

            return dataset_dict

        except Exception as e:
            print(f"❌ Error creating dataset: {e}")
            raise

    def tokenize_dataset(self, dataset: DatasetDict) -> DatasetDict:
        """
        Tokenize the dataset for training

        This is where we convert text to tokens that the model can understand.
        """
        print("Tokenizing dataset...")

        def tokenize_function(examples):
            """Tokenize a batch of examples"""
            # Tokenize the text
            tokenized = self.tokenizer(
                examples["text"],
                truncation=True,
                padding=False,  # We'll pad dynamically during training
                max_length=self.config.max_seq_length,
                return_tensors=None
            )

            # For causal language modeling, labels are the same as input_ids
            tokenized["labels"] = tokenized["input_ids"].copy()
            return tokenized

        try:
            # Apply tokenization to all splits
            tokenized_dataset = dataset.map(
                tokenize_function,
                batched=True,
                remove_columns=dataset["train"].column_names,  # Remove original text columns
                desc="Tokenizing dataset"
            )

            print("✅ Dataset tokenization complete")

            # Print tokenization statistics
            train_lengths = [len(ids) for ids in tokenized_dataset["train"]["input_ids"]]
            print(f"   Average tokens per example: {sum(train_lengths) / len(train_lengths):.0f}")
            print(f"   Max tokens: {max(train_lengths)}")
            print(f"   Min tokens: {min(train_lengths)}")

            return tokenized_dataset

        except Exception as e:
            print(f"❌ Error tokenizing dataset: {e}")
            raise
Data Processing Explanation:
Tokenization: Converting human-readable text into numerical tokens that the model can process. Each word or subword gets mapped to a number.
Truncation: Long conversations are cut to fit within the model's maximum sequence length. This prevents memory issues during training.
Labels Creation: For causal language modeling we predict the next token, so the labels are simply a copy of the input IDs; the model shifts them by one position internally when computing the loss.
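Putting the processor to work typically looks like the following sketch (file paths are illustrative, and note that the config's validation expects its train_file to already exist):

# example: running the data preparation pipeline end to end
from training.training_config import HuggingFaceTrainingConfig
from data_preparation.hf_data_processor import HuggingFaceDataProcessor

# Assumes data/train.json already exists (validate_config checks for it);
# on a first run, point train_file at any existing placeholder file.
config = HuggingFaceTrainingConfig()
processor = HuggingFaceDataProcessor(config)

# 1. Convert OpenAI-style JSONL into the flattened "text" format
stats = processor.prepare_conversational_data(
    input_file="data/openai_training.jsonl",   # illustrative path
    output_file="data/train.json"
)

# 2. Build train/validation splits and tokenize them
dataset = processor.create_dataset("data/train.json", validation_split=0.1)
tokenized_dataset = processor.tokenize_dataset(dataset)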
Step 4: Advanced Training Techniques
Let's implement modern fine-tuning techniques that make training efficient and effective.
LoRA (Low-Rank Adaptation) Implementation
Why LoRA Matters: Instead of updating all of the billions of parameters in a large model, LoRA freezes the base weights and trains small "adapter" matrices that capture the task-specific knowledge. Compared with full fine-tuning, this reduces:
- Memory usage: often by 90% or more
- Training time: significantly faster per epoch
- Storage requirements: only the small adapter weights need to be stored (see the quick calculation after this list)
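A back-of-the-envelope calculation shows where those savings come from. For a single d × d projection matrix, full fine-tuning updates d² parameters, while LoRA with rank r trains only two small matrices totalling 2·d·r parameters. A minimal sketch with hypothetical numbers:

# Rough illustration of LoRA's parameter savings for one attention projection.
# The hidden size below is an assumption; real savings depend on the model and target modules.
d = 4096   # hidden size of a hypothetical 7B-class model
r = 16     # LoRA rank from our config

full_params = d * d        # parameters updated by full fine-tuning
lora_params = 2 * d * r    # parameters in the LoRA A (d x r) and B (r x d) matrices

print(f"Full fine-tuning: {full_params:,} params per projection")   # 16,777,216
print(f"LoRA (r={r}):     {lora_params:,} params per projection")   # 131,072
print(f"Trainable fraction: {lora_params / full_params:.2%}")        # 0.78%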
# training/lora_trainer.py
import torch
from datasets import DatasetDict
from peft import LoraConfig, get_peft_model, TaskType
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)

from training.training_config import HuggingFaceTrainingConfig


class LoRATrainer:
    """Trainer using LoRA for efficient fine-tuning"""

    def __init__(self, config: HuggingFaceTrainingConfig):
        self.config = config

        # Initialize model and tokenizer
        self.setup_model_and_tokenizer()

        # Configure LoRA
        self.setup_lora()

        print("✅ LoRA trainer initialized")

    def setup_model_and_tokenizer(self):
        """Initialize the base model and tokenizer"""
        print(f"📥 Loading model: {self.config.model_name}")

        # Load tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.config.model_name,
            cache_dir=self.config.cache_dir
        )

        # Ensure tokenizer has a pad token
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

        # Load model with appropriate settings
        model_kwargs = {
            "cache_dir": self.config.cache_dir,
            "torch_dtype": torch.float16 if self.config.fp16 else torch.float32,
        }

        # Use device_map for multi-GPU setups
        if torch.cuda.device_count() > 1:
            model_kwargs["device_map"] = "auto"

        self.model = AutoModelForCausalLM.from_pretrained(
            self.config.model_name,
            **model_kwargs
        )

        print(f"✅ Model loaded: {self.model.num_parameters():,} parameters")

    def setup_lora(self):
        """Configure LoRA for the model"""
        if not self.config.use_lora:
            print("ℹ️ LoRA disabled, using full fine-tuning")
            return

        # LoRA configuration.
        # Note: target_modules must match the attention layer names of your base model
        # (e.g. "c_attn" for GPT-2-style models such as DialoGPT; "q_proj", "v_proj",
        # "k_proj", "o_proj" for LLaMA-style models).
        lora_config = LoraConfig(
            task_type=TaskType.CAUSAL_LM,       # Causal language modeling
            r=self.config.lora_r,               # Rank of adaptation
            lora_alpha=self.config.lora_alpha,  # LoRA scaling parameter
            lora_dropout=0.1,                   # Dropout for LoRA layers
            target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Which layers to adapt
        )

        # When gradient checkpointing is enabled, the inputs must require grads,
        # otherwise backprop through the frozen base model fails with LoRA.
        if self.config.use_gradient_checkpointing:
            self.model.enable_input_require_grads()

        # Apply LoRA to the model
        self.model = get_peft_model(self.model, lora_config)

        # Print trainable parameters
        trainable_params = sum(p.numel() for p in self.model.parameters() if p.requires_grad)
        total_params = sum(p.numel() for p in self.model.parameters())

        print("✅ LoRA applied:")
        print(f"   Trainable parameters: {trainable_params:,} ({trainable_params/total_params*100:.2f}%)")
        print(f"   Total parameters: {total_params:,}")

    def create_training_arguments(self) -> TrainingArguments:
        """Create training arguments for the Trainer"""
        return TrainingArguments(
            output_dir=self.config.output_dir,

            # Training parameters
            num_train_epochs=self.config.num_train_epochs,
            per_device_train_batch_size=self.config.per_device_train_batch_size,
            per_device_eval_batch_size=self.config.per_device_eval_batch_size,
            gradient_accumulation_steps=self.config.gradient_accumulation_steps,

            # Optimization
            learning_rate=self.config.learning_rate,
            weight_decay=self.config.weight_decay,
            warmup_steps=self.config.warmup_steps,

            # Memory optimization
            fp16=self.config.fp16,
            gradient_checkpointing=self.config.use_gradient_checkpointing,
            dataloader_pin_memory=False,  # Can cause issues with some setups

            # Logging and evaluation
            logging_steps=self.config.logging_steps,
            eval_steps=self.config.eval_steps,
            evaluation_strategy=self.config.evaluation_strategy,

            # Saving
            save_steps=self.config.save_steps,
            save_total_limit=self.config.save_total_limit,
            load_best_model_at_end=self.config.load_best_model_at_end,
            metric_for_best_model=self.config.metric_for_best_model,

            # Reporting
            report_to=["tensorboard"],  # Log to TensorBoard
            run_name=f"agent-finetune-{self.config.model_name.split('/')[-1]}",

            # Reproducibility
            seed=42,
            data_seed=42,
        )

    def train_model(self, tokenized_dataset: DatasetDict) -> str:
        """
        Train the model using the Hugging Face Trainer

        Args:
            tokenized_dataset: Tokenized training data

        Returns:
            Path to the trained model
        """
        print("Starting model training...")

        try:
            # Create data collator for dynamic padding
            data_collator = DataCollatorForLanguageModeling(
                tokenizer=self.tokenizer,
                mlm=False,  # We're doing causal LM, not masked LM
                pad_to_multiple_of=8 if self.config.fp16 else None
            )

            # Create training arguments
            training_args = self.create_training_arguments()

            # Initialize trainer
            trainer = Trainer(
                model=self.model,
                args=training_args,
                train_dataset=tokenized_dataset["train"],
                eval_dataset=tokenized_dataset.get("validation"),
                tokenizer=self.tokenizer,
                data_collator=data_collator,
            )

            # Start training
            print("⏳ Training in progress...")
            train_result = trainer.train()

            # Save the final model
            trainer.save_model()
            self.tokenizer.save_pretrained(self.config.output_dir)

            print("✅ Training completed successfully!")
            print(f"   Final loss: {train_result.training_loss:.4f}")
            print(f"   Training time: {train_result.metrics['train_runtime']:.2f} seconds")
            print(f"   Model saved to: {self.config.output_dir}")

            return self.config.output_dir

        except Exception as e:
            print(f"❌ Training failed: {e}")
            raise
Training Process Explanation:
Data Collator: Handles batching and padding of sequences. Since conversations have different lengths, we need dynamic padding to create uniform batches.
Trainer Class: Hugging Face's high-level training interface that handles the training loop, evaluation, checkpointing, and logging automatically.
Mixed Precision Training: Uses both 16-bit and 32-bit floating point numbers to speed up training while maintaining numerical stability.
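To tie the pieces together, here is a minimal training driver. It is a sketch that assumes the module layout used throughout this tutorial (the training/ and data_preparation/ packages) and the default file paths from the config; in a real project you would add experiment tracking and more robust error handling around it.

# train_agent.py - minimal end-to-end driver (illustrative sketch)
from training.training_config import HuggingFaceTrainingConfig
from training.lora_trainer import LoRATrainer
from data_preparation.hf_data_processor import HuggingFaceDataProcessor


def main():
    # 1. Configuration (validated on creation; assumes data/train.json was
    #    produced by the data preparation step)
    config = HuggingFaceTrainingConfig()

    # 2. Prepare and tokenize the data
    processor = HuggingFaceDataProcessor(config)
    dataset = processor.create_dataset(config.train_file, validation_split=0.1)
    tokenized_dataset = processor.tokenize_dataset(dataset)

    # 3. Train with LoRA and save the adapter + tokenizer
    trainer = LoRATrainer(config)
    model_path = trainer.train_model(tokenized_dataset)
    print(f"Fine-tuned model available at: {model_path}")


if __name__ == "__main__":
    main()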
Step 5: Model Evaluation and Testing
After training, it's crucial to evaluate your model's performance for AI agent tasks.
AI Agent-Specific Evaluation
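The evaluator below expects each test case to be a dictionary with context, user_message, and expected_response fields. A small, entirely hypothetical test set might look like this:

# Example test cases for the evaluator (contents are hypothetical)
test_cases = [
    {
        "context": "System: You are a customer support agent for an e-commerce store.",
        "user_message": "Where is my order #12345?",
        "expected_response": "Let me check the status of order #12345 for you."
    },
    {
        "context": "System: You are a scheduling assistant.",
        "user_message": "Book a meeting with Sam for Tuesday at 10am.",
        "expected_response": "I've scheduled a meeting with Sam on Tuesday at 10:00 AM."
    },
]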
# evaluation/agent_evaluator.py
import torch
from typing import List, Dict, Any, Optional
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline


class AgentModelEvaluator:
    """Evaluate fine-tuned models for AI agent performance"""

    def __init__(self, model_path: str, tokenizer_path: Optional[str] = None):
        """
        Initialize evaluator with a trained model

        Args:
            model_path: Path to fine-tuned model
            tokenizer_path: Path to tokenizer (uses model_path if None)
        """
        self.model_path = model_path
        tokenizer_path = tokenizer_path or model_path

        print(f"📥 Loading model for evaluation: {model_path}")

        # Load model and tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
            device_map="auto" if torch.cuda.is_available() else None
        )

        # Create generation pipeline.
        # Note: when the model is loaded with device_map="auto", accelerate has already
        # placed it on the GPU, so we don't pass an explicit device here.
        self.generator = pipeline(
            "text-generation",
            model=self.model,
            tokenizer=self.tokenizer
        )

        print("✅ Model loaded for evaluation")

    def evaluate_conversational_ability(self, test_cases: List[Dict[str, Any]]) -> Dict[str, Any]:
        """
        Evaluate the model's conversational abilities

        Args:
            test_cases: List of test conversations

        Returns:
            Evaluation results with metrics
        """
        print(f"🧪 Evaluating conversational ability on {len(test_cases)} test cases...")

        results = {
            "total_cases": len(test_cases),
            "successful_responses": 0,
            "average_response_length": 0,
            "response_quality_scores": [],
            "detailed_results": []
        }

        total_response_length = 0

        for i, test_case in enumerate(test_cases):
            try:
                # Extract conversation context and expected response
                context = test_case.get("context", "")
                user_message = test_case.get("user_message", "")
                expected_response = test_case.get("expected_response", "")

                # Build prompt
                prompt = f"{context}\nHuman: {user_message}\nAssistant:"

                # Generate response
                generated = self.generator(
                    prompt,
                    max_new_tokens=150,
                    temperature=0.7,
                    do_sample=True,
                    pad_token_id=self.tokenizer.eos_token_id,
                    eos_token_id=self.tokenizer.eos_token_id
                )

                # Extract generated text (remove prompt)
                full_response = generated[0]["generated_text"]
                response = full_response[len(prompt):].strip()

                # Evaluate response quality
                quality_score = self.evaluate_response_quality(
                    user_message, response, expected_response
                )

                results["response_quality_scores"].append(quality_score)
                total_response_length += len(response)

                if quality_score > 0.6:  # Threshold for a "successful" response
                    results["successful_responses"] += 1

                # Store detailed result
                results["detailed_results"].append({
                    "test_case": i + 1,
                    "user_message": user_message,
                    "generated_response": response,