Tags: ai-agent, tutorial, llm, deployment, model-hosting, production

Fine-Tuning LLMs for Custom Agent Behaviors - Part 4: Deploying Custom Models

By AgentForge Hub · 8/14/2025 · 16 min read · Advanced

You've successfully fine-tuned your model in Parts 2 and 3, but a model sitting on your local machine provides no value to users. Model deployment is where your fine-tuning work transforms into a production-ready AI agent that can serve thousands of users with your custom behaviors and expertise.

However, model deployment is complex - you need to consider hosting options, performance optimization, cost management, and integration patterns that ensure your custom model works seamlessly with your AI agent infrastructure.

Why Model Deployment is Critical

From Training to Production: Fine-tuning creates a model file, but deployment makes it accessible to your AI agent. This involves:

Model Hosting Infrastructure: Your model needs computational resources (CPU/GPU) to generate responses. Different hosting options offer different trade-offs between cost, performance, and control.

API Integration: Your AI agent needs to communicate with your deployed model through well-designed APIs that handle authentication, rate limiting, and error recovery.

Performance Optimization: Raw model inference can be slow. Production deployment requires optimization techniques like caching, batching, and model quantization.

Scalability Planning: As your AI agent gains users, your model hosting must scale to handle increased load without degrading performance or exploding costs.

What You'll Learn in This Tutorial

By the end of this tutorial, you'll have:

  • ✅ Multiple deployment strategies for different use cases and budgets
  • ✅ Production-ready model serving with proper API design
  • ✅ Performance optimization techniques for fast inference
  • ✅ Cost management strategies for sustainable operations
  • ✅ Integration patterns for seamless AI agent connectivity
  • ✅ Monitoring and maintenance frameworks for deployed models

Estimated Time: 45-50 minutes


Step 1: Understanding Model Deployment Options

Before choosing a deployment strategy, it's crucial to understand the available options and their trade-offs.

Deployment Strategy Comparison

| Option | Cost | Control | Complexity | Performance | Best For |
|---|---|---|---|---|---|
| OpenAI Hosted | High (per token) | Low | Very Low | Excellent | Quick deployment, testing |
| Hugging Face Inference | Medium | Medium | Low | Good | Prototyping, small scale |
| Self-Hosted Cloud | Variable | High | High | Excellent | Production, custom needs |
| Local Hosting | Low | Complete | Medium | Variable | Development, privacy |

When to Choose Each Option

OpenAI Hosted Fine-Tuned Models

  • Pros: Zero infrastructure management, automatic scaling, excellent performance
  • Cons: Ongoing per-token costs, limited customization, vendor lock-in
  • Best For: Rapid deployment, testing, applications with unpredictable usage

Hugging Face Inference Endpoints

  • Pros: Managed hosting, reasonable costs, easy setup
  • Cons: Limited customization, potential cold starts, dependency on HF infrastructure
  • Best For: Prototyping, small to medium scale applications

Self-Hosted Solutions

  • Pros: Complete control, cost optimization potential, no vendor lock-in
  • Cons: Infrastructure management complexity, scaling challenges
  • Best For: Large scale applications, specific performance requirements, data privacy needs

Step 2: OpenAI Model Deployment (Easiest Path)

If you fine-tuned with OpenAI in Part 2, deployment is straightforward but requires understanding cost implications.

Understanding OpenAI Model Hosting

How OpenAI Hosting Works: When you fine-tune with OpenAI, they automatically host your model on their infrastructure. You access it through the same API as base models, but with your custom model ID.

Cost Structure:

  • Training Cost: One-time fee based on tokens in training data
  • Hosting Cost: No additional hosting fees
  • Usage Cost: Per-token pricing (typically 8x base model cost)

Why This Might Be Expensive: If your AI agent processes 1 million tokens per month:

  • Base GPT-3.5-turbo: ~$2/month
  • Fine-tuned GPT-3.5-turbo: ~$16/month
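
These costs scale linearly with traffic, so it is worth estimating them for your own volume before committing. A minimal sketch, assuming the illustrative per-1K-token rates above (check current OpenAI pricing before relying on it):

```python
# cost_estimate.py - rough monthly cost comparison (rates are illustrative)

def monthly_cost(tokens_per_month: int, price_per_1k_tokens: float) -> float:
    """Estimate monthly inference cost from token volume and per-1K-token price."""
    return (tokens_per_month / 1000) * price_per_1k_tokens

TOKENS_PER_MONTH = 1_000_000

base = monthly_cost(TOKENS_PER_MONTH, 0.002)        # ~base GPT-3.5-turbo rate
fine_tuned = monthly_cost(TOKENS_PER_MONTH, 0.016)  # ~8x base rate for the fine-tuned model

print(f"Base model:       ${base:.2f}/month")
print(f"Fine-tuned model: ${fine_tuned:.2f}/month")
```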

Integrating OpenAI Fine-Tuned Models

// ai-agent/openai-integration.js

const { OpenAI } = require('openai');

class OpenAIFineTunedModelClient {
    constructor(config) {
        this.config = {
            apiKey: config.apiKey || process.env.OPENAI_API_KEY,
            fineTunedModelId: config.fineTunedModelId || process.env.OPENAI_FINE_TUNED_MODEL_ID,
            
            // Fallback configuration
            fallbackModel: config.fallbackModel || 'gpt-3.5-turbo',
            maxRetries: config.maxRetries || 3,
            timeout: config.timeout || 30000
        };
        
        this.client = new OpenAI({
            apiKey: this.config.apiKey,
            timeout: this.config.timeout
        });
        
        // Track usage for cost monitoring
        this.usageMetrics = {
            totalTokens: 0,
            totalRequests: 0,
            totalCost: 0,
            errorCount: 0
        };
        
        console.log(`✅ OpenAI fine-tuned model client initialized: ${this.config.fineTunedModelId}`);
    }
    
    async generateResponse(messages, options = {}) {
        /**
         * Generate response using fine-tuned model with fallback
         */
        
        const requestConfig = {
            model: this.config.fineTunedModelId,
            messages: messages,
            max_tokens: options.maxTokens || 500,
            temperature: options.temperature || 0.7,
            top_p: options.topP || 1.0,
            frequency_penalty: options.frequencyPenalty || 0,
            presence_penalty: options.presencePenalty || 0
        };
        
        let attempt = 0;
        
        while (attempt < this.config.maxRetries) {
            try {
                console.log(`🤖 Generating response with fine-tuned model (attempt ${attempt + 1})`);
                
                const response = await this.client.chat.completions.create(requestConfig);
                
                // Track usage metrics
                this.updateUsageMetrics(response.usage);
                
                return {
                    content: response.choices[0].message.content,
                    model: response.model,
                    usage: response.usage,
                    finishReason: response.choices[0].finish_reason,
                    isFineTuned: true
                };
                
            } catch (error) {
                attempt++;
                console.error(`❌ Fine-tuned model error (attempt ${attempt}):`, error.message);
                
                // If it's a model-related error and this was the final attempt, fall back to the base model
                if (this.shouldUseFallback(error) && attempt === this.config.maxRetries) {
                    console.warn(`⚠️ Using fallback model due to fine-tuned model issues`);
                    return await this.generateWithFallback(messages, options);
                }
                
                // If not the last attempt, wait before retrying
                if (attempt < this.config.maxRetries) {
                    await new Promise(resolve => setTimeout(resolve, 1000 * attempt));
                }
            }
        }
        
        // All retries failed
        this.usageMetrics.errorCount++;
        throw new Error('Fine-tuned model failed after all retry attempts');
    }
    
    shouldUseFallback(error) {
        /**
         * Determine if we should use fallback model based on error type
         */
        
        const fallbackErrors = [
            'model_not_found',
            'model_overloaded',
            'insufficient_quota',
            'rate_limit_exceeded'
        ];
        
        return fallbackErrors.some(errorType => 
            error.message.toLowerCase().includes(errorType)
        );
    }
    
    async generateWithFallback(messages, options) {
        /**
         * Generate response using fallback base model
         */
        
        try {
            const response = await this.client.chat.completions.create({
                model: this.config.fallbackModel,
                messages: messages,
                max_tokens: options.maxTokens || 500,
                temperature: options.temperature || 0.7
            });
            
            return {
                content: response.choices[0].message.content,
                model: response.model,
                usage: response.usage,
                finishReason: response.choices[0].finish_reason,
                isFineTuned: false,
                usedFallback: true
            };
            
        } catch (error) {
            console.error('❌ Fallback model also failed:', error);
            throw error;
        }
    }
    
    updateUsageMetrics(usage) {
        /**
         * Update usage metrics for cost tracking
         */
        
        this.usageMetrics.totalTokens += usage.total_tokens;
        this.usageMetrics.totalRequests++;
        
        // Estimate cost (fine-tuned models typically cost 8x base model)
        const estimatedCost = (usage.total_tokens / 1000) * 0.016; // $0.016 per 1K tokens
        this.usageMetrics.totalCost += estimatedCost;
        
        console.log(`💰 Usage: ${usage.total_tokens} tokens, estimated cost: $${estimatedCost.toFixed(4)}`);
    }
    
    getUsageStats() {
        /**
         * Get comprehensive usage statistics
         */
        
        return {
            ...this.usageMetrics,
            averageTokensPerRequest: this.usageMetrics.totalRequests > 0 
                ? this.usageMetrics.totalTokens / this.usageMetrics.totalRequests 
                : 0,
            errorRate: this.usageMetrics.totalRequests > 0
                ? this.usageMetrics.errorCount / this.usageMetrics.totalRequests
                : 0
        };
    }
}

module.exports = OpenAIFineTunedModelClient;

OpenAI Integration Explanation:

Fallback Strategy: If your fine-tuned model fails, the system automatically falls back to the base model, ensuring your AI agent remains functional.

Usage Tracking: Monitoring token usage and costs helps you understand the financial impact of your fine-tuned model.

Retry Logic: Temporary failures are handled with exponential backoff, improving reliability.


Step 3: Self-Hosted Model Deployment

For more control and potentially lower costs, you can host your fine-tuned model on your own infrastructure.

Understanding Self-Hosted Deployment

Why Self-Host Your Model:

  • Cost Control: After initial infrastructure costs, inference is essentially free
  • Data Privacy: Your data never leaves your infrastructure
  • Customization: Complete control over model serving, caching, and optimization
  • Performance: Optimize specifically for your use case and traffic patterns

Infrastructure Requirements:

  • GPU Memory: 4-8GB for small models, 16-24GB for larger models
  • System RAM: 16-32GB for efficient model loading and caching
  • Storage: Fast SSD for model files and caching
  • Network: High bandwidth for serving multiple concurrent requests
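
Before provisioning hardware, it is worth confirming that a candidate machine actually meets these numbers. A minimal sketch using PyTorch and psutil (both assumed to be installed):

```python
# check_resources.py - sanity-check GPU memory and RAM before deploying (torch and psutil assumed installed)

import torch
import psutil

def check_resources(min_gpu_gb: float = 8.0, min_ram_gb: float = 16.0) -> None:
    """Print whether the host meets rough GPU/RAM recommendations."""
    ram_gb = psutil.virtual_memory().total / 1024**3
    print(f"System RAM: {ram_gb:.1f} GB ({'OK' if ram_gb >= min_ram_gb else 'below recommended'})")

    if torch.cuda.is_available():
        gpu_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
        print(f"GPU memory: {gpu_gb:.1f} GB ({'OK' if gpu_gb >= min_gpu_gb else 'below recommended'})")
    else:
        print("No CUDA GPU detected - expect slow, CPU-only inference")

if __name__ == "__main__":
    check_resources()
```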

Docker-Based Model Serving

# Dockerfile for model serving

FROM python:3.9-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Copy requirements first (for better caching)
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy model files and application code
COPY models/ ./models/
COPY src/ ./src/
COPY app.py .

# Expose port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

# Run the application
CMD ["python", "app.py"]

Model Serving Application

# app.py - FastAPI model serving application

from fastapi import FastAPI, HTTPException, BackgroundTasks
from pydantic import BaseModel
from typing import List, Dict, Any, Optional
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import uvicorn
import logging
import time
from datetime import datetime

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI(
    title="AI Agent Fine-Tuned Model API",
    description="Serving fine-tuned models for AI agents",
    version="1.0.0"
)

# Request/Response models
class ChatRequest(BaseModel):
    messages: List[Dict[str, str]]
    max_tokens: Optional[int] = 500
    temperature: Optional[float] = 0.7
    top_p: Optional[float] = 1.0

class ChatResponse(BaseModel):
    content: str
    model: str
    usage: Dict[str, int]
    processing_time: float
    timestamp: str

# Global model and tokenizer (loaded once at startup)
model = None
tokenizer = None
generator = None

# Performance metrics
metrics = {
    "requests_served": 0,
    "total_tokens_generated": 0,
    "average_response_time": 0,
    "error_count": 0,
    "model_load_time": 0
}

@app.on_event("startup")
async def load_model():
    """Load model and tokenizer at startup"""
    
    global model, tokenizer, generator
    
    logger.info("πŸš€ Loading fine-tuned model...")
    start_time = time.time()
    
    try:
        model_path = "./models/fine_tuned_model"  # Path to your fine-tuned model
        
        # Load tokenizer
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        
        # Load model with appropriate device mapping
        device = "cuda" if torch.cuda.is_available() else "cpu"
        
        model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.float16 if device == "cuda" else torch.float32,
            device_map="auto" if device == "cuda" else None,
            low_cpu_mem_usage=True
        )
        
        # Create generation pipeline
        # Note: with device_map="auto" the model is already placed by accelerate,
        # so we don't pass an explicit device here (doing so raises an error)
        generator = pipeline(
            "text-generation",
            model=model,
            tokenizer=tokenizer,
            torch_dtype=torch.float16 if device == "cuda" else torch.float32
        )
        
        load_time = time.time() - start_time
        metrics["model_load_time"] = load_time
        
        logger.info(f"βœ… Model loaded successfully in {load_time:.2f} seconds")
        logger.info(f"   Device: {device}")
        logger.info(f"   Model parameters: {model.num_parameters():,}")
        
    except Exception as e:
        logger.error(f"❌ Failed to load model: {e}")
        raise

@app.post("/chat", response_model=ChatResponse)
async def chat_completion(request: ChatRequest, background_tasks: BackgroundTasks):
    """Generate chat completion using fine-tuned model"""
    
    if generator is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    
    start_time = time.time()
    
    try:
        # Build prompt from messages
        prompt = build_prompt_from_messages(request.messages)
        
        # Generate response
        logger.info(f"πŸ€– Generating response for prompt length: {len(prompt)}")
        
        generated = generator(
            prompt,
            max_new_tokens=request.max_tokens,
            temperature=request.temperature,
            top_p=request.top_p,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            return_full_text=False  # Only return generated text
        )
        
        # Extract generated content
        generated_text = generated[0]["generated_text"].strip()
        
        # Calculate metrics
        processing_time = time.time() - start_time
        input_tokens = len(tokenizer.encode(prompt))
        output_tokens = len(tokenizer.encode(generated_text))
        total_tokens = input_tokens + output_tokens
        
        # Update metrics in background
        background_tasks.add_task(
            update_metrics, 
            processing_time, 
            total_tokens, 
            True
        )
        
        return ChatResponse(
            content=generated_text,
            model="fine-tuned-model",
            usage={
                "prompt_tokens": input_tokens,
                "completion_tokens": output_tokens,
                "total_tokens": total_tokens
            },
            processing_time=processing_time,
            timestamp=datetime.now().isoformat()
        )
        
    except Exception as e:
        logger.error(f"❌ Error generating response: {e}")
        
        # Update error metrics
        background_tasks.add_task(update_metrics, time.time() - start_time, 0, False)
        
        raise HTTPException(
            status_code=500,
            detail=f"Model inference failed: {str(e)}"
        )

def build_prompt_from_messages(messages: List[Dict[str, str]]) -> str:
    """Convert messages to prompt format expected by model"""
    
    prompt_parts = []
    
    for message in messages:
        role = message.get("role", "")
        content = message.get("content", "")
        
        if role == "system":
            prompt_parts.append(f"System: {content}")
        elif role == "user":
            prompt_parts.append(f"Human: {content}")
        elif role == "assistant":
            prompt_parts.append(f"Assistant: {content}")
    
    # Add assistant prompt for generation
    prompt_parts.append("Assistant:")
    
    return "\n".join(prompt_parts)

async def update_metrics(processing_time: float, tokens: int, success: bool):
    """Update performance metrics"""
    
    metrics["requests_served"] += 1
    
    if success:
        metrics["total_tokens_generated"] += tokens
        
        # Update average response time
        current_avg = metrics["average_response_time"]
        request_count = metrics["requests_served"]
        metrics["average_response_time"] = (
            (current_avg * (request_count - 1)) + processing_time
        ) / request_count
    else:
        metrics["error_count"] += 1

@app.get("/health")
async def health_check():
    """Health check endpoint"""
    
    return {
        "status": "healthy" if generator is not None else "loading",
        "model_loaded": generator is not None,
        "device": "cuda" if torch.cuda.is_available() else "cpu",
        "metrics": metrics,
        "timestamp": datetime.now().isoformat()
    }

@app.get("/metrics")
async def get_metrics():
    """Get detailed performance metrics"""
    
    return {
        "performance": metrics,
        "system": {
            "gpu_available": torch.cuda.is_available(),
            "gpu_memory": torch.cuda.get_device_properties(0).total_memory if torch.cuda.is_available() else None,
            "model_parameters": model.num_parameters() if model else None
        },
        "timestamp": datetime.now().isoformat()
    }

if __name__ == "__main__":
    uvicorn.run(
        "app:app",
        host="0.0.0.0",
        port=8000,
        workers=1,  # Single worker for GPU models
        log_level="info"
    )

Self-Hosted Model Explanation:

FastAPI Framework: Provides automatic API documentation, request validation, and excellent performance for model serving.

Model Loading Strategy: Models are loaded once at startup and kept in memory for fast inference.

Error Handling: Comprehensive error handling ensures your model API remains stable even when individual requests fail.

Metrics Collection: Detailed metrics help you understand performance and optimize your deployment.
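
Once the container is running, you can exercise the /chat endpoint directly to verify the deployment end to end. A minimal sketch using the requests library (the localhost URL and example messages are illustrative):

```python
# test_chat_endpoint.py - quick smoke test against the self-hosted /chat endpoint

import requests

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful support agent."},
        {"role": "user", "content": "How do I reset my password?"}
    ],
    "max_tokens": 200,
    "temperature": 0.7
}

response = requests.post("http://localhost:8000/chat", json=payload, timeout=60)
response.raise_for_status()

data = response.json()
print(data["content"])
print(f"Tokens: {data['usage']['total_tokens']}, latency: {data['processing_time']:.2f}s")
```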


Step 4: Performance Optimization Techniques

Raw model inference can be slow. Let's implement optimization techniques that dramatically improve performance.

Model Optimization Strategies

1. Model Quantization: Reduces model size and memory usage while preserving most of the model's quality:

# optimization/model_quantization.py

import logging

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class ModelOptimizer:
    """Optimize models for production deployment"""
    
    def __init__(self, model_path: str):
        self.model_path = model_path
        self.logger = logging.getLogger(__name__)
    
    def quantize_model(self, output_path: str, quantization_type: str = "int8"):
        """
        Quantize model to reduce memory usage and improve inference speed
        
        Args:
            output_path: Where to save quantized model
            quantization_type: Type of quantization (int8, int4)
        """
        
        self.logger.info(f"πŸ”§ Quantizing model: {self.model_path}")
        
        # Load original model
        model = AutoModelForCausalLM.from_pretrained(
            self.model_path,
            torch_dtype=torch.float16
        )
        
        original_size = sum(p.numel() * p.element_size() for p in model.parameters())
        
        if quantization_type == "int8":
            # Apply 8-bit quantization
            quantized_model = torch.quantization.quantize_dynamic(
                model,
                {torch.nn.Linear},
                dtype=torch.qint8
            )
        else:
            raise ValueError(f"Unsupported quantization type: {quantization_type}")
        
        # Calculate size reduction
        quantized_size = sum(p.numel() * p.element_size() for p in quantized_model.parameters())
        size_reduction = (1 - quantized_size / original_size) * 100
        
        # Save quantized model
        quantized_model.save_pretrained(output_path)
        
        self.logger.info(f"βœ… Model quantized successfully")
        self.logger.info(f"   Size reduction: {size_reduction:.1f}%")
        self.logger.info(f"   Saved to: {output_path}")
        
        return {
            "original_size_mb": original_size / 1024 / 1024,
            "quantized_size_mb": quantized_size / 1024 / 1024,
            "size_reduction_percent": size_reduction,
            "output_path": output_path
        }
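
A typical way to use this optimizer is as a one-off step before packaging the model for serving. A minimal usage sketch, assuming the module lives at optimization/model_quantization.py as shown above and the paths follow the serving layout from Step 3:

```python
# quantize.py - one-off quantization step before deployment (paths are illustrative)

from optimization.model_quantization import ModelOptimizer

optimizer = ModelOptimizer("./models/fine_tuned_model")
stats = optimizer.quantize_model("./models/fine_tuned_model_int8", quantization_type="int8")

print(f"Original:  {stats['original_size_mb']:.1f} MB")
print(f"Quantized: {stats['quantized_size_mb']:.1f} MB ({stats['size_reduction_percent']:.1f}% smaller)")
```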

2. Response Caching: Cache responses for repeated queries to reduce inference costs:

# optimization/response_cache.py

import hashlib
import json
import logging
import time
from typing import Any, Dict, List, Optional

logger = logging.getLogger(__name__)

class ResponseCache:
    """Cache model responses to improve performance and reduce costs"""
    
    def __init__(self, max_size: int = 1000, ttl_seconds: int = 3600):
        self.cache = {}
        self.access_times = {}
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        
        self.stats = {
            "hits": 0,
            "misses": 0,
            "evictions": 0
        }
    
    def _generate_cache_key(self, messages: List[Dict[str, str]], options: Dict[str, Any]) -> str:
        """Generate cache key from messages and options"""
        
        # Create deterministic hash from messages and relevant options
        cache_data = {
            "messages": messages,
            "temperature": options.get("temperature", 0.7),
            "max_tokens": options.get("max_tokens", 500)
        }
        
        cache_string = json.dumps(cache_data, sort_keys=True)
        return hashlib.sha256(cache_string.encode()).hexdigest()
    
    def get(self, messages: List[Dict[str, str]], options: Dict[str, Any]) -> Optional[Dict[str, Any]]:
        """Get cached response if available and not expired"""
        
        cache_key = self._generate_cache_key(messages, options)
        
        if cache_key in self.cache:
            cached_item = self.cache[cache_key]
            
            # Check if expired
            if time.time() - cached_item["timestamp"] > self.ttl_seconds:
                del self.cache[cache_key]
                del self.access_times[cache_key]
                return None
            
            # Update access time
            self.access_times[cache_key] = time.time()
            self.stats["hits"] += 1
            
            logger.info(f"🎯 Cache hit for key: {cache_key[:16]}...")
            return cached_item["response"]
        
        self.stats["misses"] += 1
        return None
    
    def set(self, messages: List[Dict[str, str]], options: Dict[str, Any], response: Dict[str, Any]):
        """Cache response"""
        
        cache_key = self._generate_cache_key(messages, options)
        
        # Evict old items if cache is full
        if len(self.cache) >= self.max_size:
            self._evict_oldest()
        
        # Store response
        self.cache[cache_key] = {
            "response": response,
            "timestamp": time.time()
        }
        self.access_times[cache_key] = time.time()
        
        logger.info(f"πŸ’Ύ Cached response for key: {cache_key[:16]}...")
    
    def _evict_oldest(self):
        """Evict least recently used item"""
        
        if not self.access_times:
            return
        
        # Find oldest accessed item
        oldest_key = min(self.access_times.keys(), key=lambda k: self.access_times[k])
        
        # Remove from cache
        del self.cache[oldest_key]
        del self.access_times[oldest_key]
        
        self.stats["evictions"] += 1
        logger.info(f"πŸ—‘οΈ Evicted cache item: {oldest_key[:16]}...")
    
    def get_stats(self) -> Dict[str, Any]:
        """Get cache performance statistics"""
        
        total_requests = self.stats["hits"] + self.stats["misses"]
        hit_rate = self.stats["hits"] / total_requests if total_requests > 0 else 0
        
        return {
            **self.stats,
            "hit_rate": hit_rate,
            "cache_size": len(self.cache),
            "max_size": self.max_size
        }

Optimization Explanation:

Quantization: Reduces model precision from 32-bit to 8-bit numbers, significantly reducing memory usage with minimal quality loss.

Response Caching: Identical requests return cached responses instantly, reducing inference costs and improving response time.

LRU Eviction: When cache is full, least recently used items are removed to make space for new responses.
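
To put the cache in front of the model server, check it before running inference and store the result afterwards. A minimal sketch of how this could be wired into the FastAPI app from Step 3 (the /chat/cached route name and the import path are assumptions):

```python
# Sketch: adding a cached variant of the chat endpoint to app.py (assumes response_cache.py is importable)

from optimization.response_cache import ResponseCache

response_cache = ResponseCache(max_size=1000, ttl_seconds=3600)

@app.post("/chat/cached", response_model=ChatResponse)
async def cached_chat_completion(request: ChatRequest, background_tasks: BackgroundTasks):
    options = {"temperature": request.temperature, "max_tokens": request.max_tokens}

    # Serve identical recent requests straight from the cache
    cached = response_cache.get(request.messages, options)
    if cached is not None:
        return cached

    # Cache miss: run normal inference, then remember the result
    response = await chat_completion(request, background_tasks)
    response_cache.set(request.messages, options, response)
    return response
```

Caching pays off mainly for repeated, low-temperature queries (FAQ-style traffic); for highly varied prompts the hit rate will stay low.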


Step 5: Integration with Your AI Agent

Now let's integrate your deployed model with your AI agent architecture.

Model Client Abstraction

// ai-agent/model-client.js

class ModelClient {
    /**
     * Abstract interface for different model hosting options
     */
    
    constructor(config) {
        this.config = config;
        this.type = config.type; // 'openai', 'huggingface', 'self-hosted'
        
        // Initialize appropriate client based on type
        switch (this.type) {
            case 'openai':
                this.client = new OpenAIFineTunedModelClient(config);
                break;
            case 'self-hosted':
                this.client = new SelfHostedModelClient(config);
                break;
            case 'huggingface':
                this.client = new HuggingFaceModelClient(config);
                break;
            default:
                throw new Error(`Unsupported model client type: ${this.type}`);
        }
        
        console.log(`✅ Model client initialized: ${this.type}`);
    }
    
    async generateResponse(messages, options = {}) {
        /**
         * Generate response using configured model client
         */
        
        try {
            const response = await this.client.generateResponse(messages, options);
            
            // Add client type to response metadata
            response.clientType = this.type;
            response.modelConfig = this.config;
            
            return response;
            
        } catch (error) {
            console.error(`❌ Model client error (${this.type}):`, error);
            throw error;
        }
    }
    
    getUsageStats() {
        /**
         * Get usage statistics from underlying client
         */
        
        if (this.client.getUsageStats) {
            return {
                clientType: this.type,
                ...this.client.getUsageStats()
            };
        }
        
        return { clientType: this.type };
    }
}

class SelfHostedModelClient {
    /**
     * Client for self-hosted model endpoints
     */
    
    constructor(config) {
        this.config = {
            endpoint: config.endpoint || 'http://localhost:8000',
            apiKey: config.apiKey,
            timeout: config.timeout || 60000,
            maxRetries: config.maxRetries || 3
        };
        
        this.usageMetrics = {
            totalRequests: 0,
            totalTokens: 0,
            errorCount: 0,
            averageResponseTime: 0
        };
    }
    
    async generateResponse(messages, options = {}) {
        /**
         * Generate response from self-hosted model
         */
        
        const axios = require('axios');
        const startTime = Date.now();
        
        const requestData = {
            messages: messages,
            max_tokens: options.maxTokens || 500,
            temperature: options.temperature || 0.7,
            top_p: options.topP || 1.0
        };
        
        let attempt = 0;
        
        while (attempt < this.config.maxRetries) {
            try {
                const response = await axios.post(
                    `${this.config.endpoint}/chat`,
                    requestData,
                    {
                        headers: {
                            'Content-Type': 'application/json',
                            ...(this.config.apiKey && { 'Authorization': `Bearer ${this.config.apiKey}` })
                        },
                        timeout: this.config.timeout
                    }
                );
                
                // Update metrics
                const processingTime = Date.now() - startTime;
                this.updateMetrics(processingTime, response.data.usage.total_tokens, true);
                
                return {
                    content: response.data.content,
                    model: response.data.model,
                    usage: response.data.usage,
                    processingTime: response.data.processing_time,
                    isFineTuned: true
                };
                
            } catch (error) {
                attempt++;
                console.error(`❌ Self-hosted model error (attempt ${attempt}):`, error.message);
                
                // Wait with linear backoff before retrying
                if (attempt < this.config.maxRetries) {
                    await new Promise(resolve => setTimeout(resolve, 1000 * attempt));
                }
            }
        }
        
        // All retries failed
        this.updateMetrics(Date.now() - startTime, 0, false);
        throw new Error('Self-hosted model failed after all retry attempts');
    }
    
    updateMetrics(processingTime, tokens, success) {
        /**
         * Track usage metrics for the self-hosted endpoint
         */
        
        this.usageMetrics.totalRequests++;
        
        if (success) {
            this.usageMetrics.totalTokens += tokens;
            
            const count = this.usageMetrics.totalRequests;
            this.usageMetrics.averageResponseTime =
                ((this.usageMetrics.averageResponseTime * (count - 1)) + processingTime) / count;
        } else {
            this.usageMetrics.errorCount++;
        }
    }
    
    getUsageStats() {
        /**
         * Get usage statistics for this client
         */
        
        return { ...this.usageMetrics };
    }
}

module.exports = { ModelClient, SelfHostedModelClient };


---

## 📦 Practical Deployment Example: FastAPI

Let's make this concrete with a small service wrapping our fine-tuned model using **FastAPI**:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from openai import OpenAI

app = FastAPI()
client = OpenAI()  # reads OPENAI_API_KEY from the environment

class Query(BaseModel):
    prompt: str

@app.post("/generate")
async def generate_text(query: Query):
    response = client.chat.completions.create(
        model="ft:gpt-3.5-turbo:your-org:custom-agent:latest",
        messages=[{"role": "user", "content": query.prompt}],
    )
    return {"response": response.choices[0].message.content}
```

Run it locally with:

```bash
uvicorn app:app --reload --port 8000
```

Now you can POST prompts and get responses.


## 📊 Monitoring and Logging

In production, it's critical to observe behavior:

  • Log inputs/outputs (with privacy filtering; see the redaction sketch below)
  • Track latency and throughput
  • Measure error rates (timeouts, 5xx failures)
  • Capture feedback signals for continual improvement

Example with prometheus_client:

```python
from prometheus_client import Counter, Histogram

REQUESTS = Counter("requests_total", "Total requests")
LATENCY = Histogram("request_latency_seconds", "Request latency")

@app.post("/generate")
async def generate_text(query: Query):
    REQUESTS.inc()
    with LATENCY.time():
        # call the model as before
        ...
```
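
The counters above cover latency and throughput; for input/output logging with privacy filtering, a lightweight redaction pass before anything reaches the logs is often enough. A minimal sketch (the regex patterns are illustrative and only cover emails and simple phone numbers):

```python
import re
import logging

io_logger = logging.getLogger("model_io")

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Mask common PII patterns before the text reaches the logs."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

def log_interaction(prompt: str, completion: str) -> None:
    io_logger.info("prompt=%s | completion=%s", redact(prompt), redact(completion))
```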

πŸ“ Case Study: Support Bot Deployment

Suppose you trained a support-bot LLM for customer queries.

  • Deploy as a FastAPI service.
  • Put behind a load balancer (NGINX, API Gateway).
  • Add autoscaling rules (Kubernetes HPA or ECS Fargate).
  • Log unresolved queries to continuously fine-tune (see the sketch below).

This end-to-end path shows how your fine-tuned model becomes a production-ready service.
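
For the last point, unresolved conversations can be appended to a JSONL file in the same chat-message format used for fine-tuning in Part 2, ready for the next training round. A minimal sketch (the file path and the resolved flag are assumptions):

```python
# feedback_log.py - capture unresolved conversations as future fine-tuning examples (path is illustrative)

import json
from typing import Dict, List

def log_unresolved(messages: List[Dict[str, str]], resolved: bool,
                   path: str = "data/unresolved_queries.jsonl") -> None:
    """Append unresolved conversations in the {"messages": [...]} JSONL format."""
    if resolved:
        return
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"messages": messages}) + "\n")
```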


## 🔗 Transition to Part 5

With deployment in place, the next step is integration: connecting the fine-tuned model into multi-agent workflows, handling errors gracefully, and ensuring smooth interoperability with your other systems.


