LLM Engineer's Handbook

Book description

Step into the world of LLMs with this practical guide that takes you from the fundamentals to deploying advanced applications using LLMOps best practices

Key Features

  • Build and refine LLMs step by step, covering data preparation, RAG, and fine-tuning
  • Learn essential skills for deploying and monitoring LLMs, ensuring optimal performance in production
  • Utilize preference alignment, evaluation, and inference optimization to enhance the performance and adaptability of your LLM applications

Artificial intelligence has undergone rapid advancements, and Large Language Models (LLMs) are at the forefront of this revolution. This LLM book offers insights into designing, training, and deploying LLMs in real-world scenarios by leveraging MLOps best practices. The guide walks you through building an LLM-powered twin that’s cost-effective, scalable, and modular. It moves beyond isolated Jupyter notebooks, focusing on how to build production-grade end-to-end LLM systems.

Throughout this book, you will learn data engineering, supervised fine-tuning, and deployment. The hands-on approach to building the LLM Twin use case will help you implement MLOps components in your own projects. You will also explore cutting-edge advancements in the field, including inference optimization, preference alignment, and real-time data processing, making this a vital resource for those looking to apply LLMs in their projects.

By the end of this book, you will be proficient in deploying LLMs that solve practical problems while maintaining low-latency and high-availability inference capabilities. Whether you are new to artificial intelligence or an experienced practitioner, this book delivers guidance and practical techniques that will deepen your understanding of LLMs and sharpen your ability to implement them effectively.

What you will learn

  • Implement robust data pipelines and manage LLM training cycles
  • Create your own LLM and refine it with the help of hands-on examples
  • Get started with LLMOps by diving into core MLOps principles such as orchestrators and prompt monitoring
  • Perform supervised fine-tuning and LLM evaluation
  • Deploy end-to-end LLM solutions using AWS and other tools
  • Design scalable and modular LLM systems
  • Learn about RAG applications by building a feature and inference pipeline

Who this book is for

This book is for AI engineers, NLP professionals, and LLM engineers looking to deepen their understanding of LLMs. Basic knowledge of LLMs, the Gen AI landscape, Python, and AWS is recommended. Whether you are new to AI or looking to enhance your skills, this book provides comprehensive guidance on implementing LLMs in real-world scenarios.

Table of contents

  1. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Get in touch
  2. Understanding the LLM Twin Concept and Architecture
    1. Understanding the LLM Twin concept
      1. What is an LLM Twin?
      2. Why building an LLM Twin matters
      3. Why not use ChatGPT (or another similar chatbot)?
    2. Planning the MVP of the LLM Twin product
      1. What is an MVP?
      2. Defining the LLM Twin MVP
    3. Building ML systems with feature/training/inference pipelines
      1. The problem with building ML systems
      2. The issue with previous solutions
      3. The solution – ML pipelines for ML systems
        1. The feature pipeline
        2. The training pipeline
        3. The inference pipeline
      4. Benefits of the FTI architecture
    4. Designing the system architecture of the LLM Twin
      1. Listing the technical details of the LLM Twin architecture
      2. How to design the LLM Twin architecture using the FTI pipeline design
        1. Data collection pipeline
        2. Feature pipeline
        3. Training pipeline
        4. Inference pipeline
      3. Final thoughts on the FTI design and the LLM Twin architecture
    5. Summary
    6. References
  3. Tooling and Installation
    1. Python ecosystem and project installation
      1. Poetry: dependency and virtual environment management
      2. Poe the Poet: task execution tool
    2. MLOps and LLMOps tooling
      1. Hugging Face: model registry
      2. ZenML: orchestrator, artifacts, and metadata
        1. Orchestrator
        2. Artifacts and metadata
        3. How to run and configure a ZenML pipeline
      3. Comet ML: experiment tracker
      4. Opik: prompt monitoring
    3. Databases for storing unstructured and vector data
      1. MongoDB: NoSQL database
      2. Qdrant: vector database
    4. Preparing for AWS
      1. Setting up an AWS account, an access key, and the CLI
      2. SageMaker: training and inference compute
        1. Why AWS SageMaker?
    5. Summary
    6. References
  4. Data Engineering
    1. Designing the LLM Twin’s data collection pipeline
      1. Implementing the LLM Twin’s data collection pipeline
      2. ZenML pipeline and steps
      3. The dispatcher: How do you instantiate the right crawler?
      4. The crawlers
        1. Base classes
        2. GitHubCrawler class
        3. CustomArticleCrawler class
        4. MediumCrawler class
      5. The NoSQL data warehouse documents
        1. The ORM and ODM software patterns
        2. Implementing the ODM class
        3. Data categories and user document classes
    2. Gathering raw data into the data warehouse
      1. Troubleshooting
        1. Selenium issues
        2. Import our backed-up data
    3. Summary
    4. References
  5. RAG Feature Pipeline
    1. Understanding RAG
      1. Why use RAG?
        1. Hallucinations
        2. Old information
      2. The vanilla RAG framework
        1. Ingestion pipeline
        2. Retrieval pipeline
        3. Generation pipeline
      3. What are embeddings?
        1. Why embeddings are so powerful
        2. How are embeddings created?
        3. Applications of embeddings
      4. More on vector DBs
        1. How does a vector DB work?
        2. Algorithms for creating the vector index
        3. DB operations
    2. An overview of advanced RAG
      1. Pre-retrieval
      2. Retrieval
      3. Post-retrieval
    3. Exploring the LLM Twin’s RAG feature pipeline architecture
      1. The problem we are solving
      2. The feature store
      3. Where does the raw data come from?
      4. Designing the architecture of the RAG feature pipeline
        1. Batch pipelines
        2. Batch versus streaming pipelines
        3. Core steps
        4. Change data capture: syncing the data warehouse and feature store
        5. Why is the data stored in two snapshots?
        6. Orchestration
    4. Implementing the LLM Twin’s RAG feature pipeline
      1. Settings
      2. ZenML pipeline and steps
        1. Querying the data warehouse
        2. Cleaning the documents
        3. Chunk and embed the cleaned documents
        4. Loading the documents to the vector DB
      3. Pydantic domain entities
        1. OVM
      4. The dispatcher layer
      5. The handlers
        1. The cleaning handlers
        2. The chunking handlers
        3. The embedding handlers
    5. Summary
    6. References
  6. Supervised Fine-Tuning
    1. Creating an instruction dataset
      1. General framework
        1. Data quantity
      2. Data curation
      3. Rule-based filtering
      4. Data deduplication
      5. Data decontamination
      6. Data quality evaluation
      7. Data exploration
      8. Data generation
      9. Data augmentation
    2. Creating our own instruction dataset
    3. Exploring SFT and its techniques
      1. When to fine-tune
      2. Instruction dataset formats
      3. Chat templates
      4. Parameter-efficient fine-tuning techniques
        1. Full fine-tuning
        2. LoRA
        3. QLoRA
      5. Training parameters
        1. Learning rate and scheduler
        2. Batch size
        3. Maximum length and packing
        4. Number of epochs
        5. Optimizers
        6. Weight decay
        7. Gradient checkpointing
    4. Fine-tuning in practice
    5. Summary
    6. References
  7. Fine-Tuning with Preference Alignment
    1. Understanding preference datasets
      1. Preference data
        1. Data quantity
      2. Data generation and evaluation
        1. Generating preferences
        2. Tips for data generation
        3. Evaluating preferences
    2. Creating our own preference dataset
    3. Preference alignment
      1. Reinforcement Learning from Human Feedback
      2. Direct Preference Optimization
    4. Implementing DPO
    5. Summary
    6. References
  8. Evaluating LLMs
    1. Model evaluation
      1. Comparing ML and LLM evaluation
      2. General-purpose LLM evaluations
      3. Domain-specific LLM evaluations
      4. Task-specific LLM evaluations
    2. RAG evaluation
      1. Ragas
      2. ARES
    3. Evaluating TwinLlama-3.1-8B
      1. Generating answers
      2. Evaluating answers
      3. Analyzing results
    4. Summary
    5. References
  9. Inference Optimization
    1. Model optimization strategies
      1. KV cache
      2. Continuous batching
      3. Speculative decoding
      4. Optimized attention mechanisms
    2. Model parallelism
      1. Data parallelism
      2. Pipeline parallelism
      3. Tensor parallelism
      4. Combining approaches
    3. Model quantization
      1. Introduction to quantization
      2. Quantization with GGUF and llama.cpp
      3. Quantization with GPTQ and EXL2
      4. Other quantization techniques
    4. Summary
    5. References
  10. RAG Inference Pipeline
    1. Understanding the LLM Twin’s RAG inference pipeline
    2. Exploring the LLM Twin’s advanced RAG techniques
      1. Advanced RAG pre-retrieval optimizations: query expansion and self-querying
        1. Query expansion
        2. Self-querying
      2. Advanced RAG retrieval optimization: filtered vector search
      3. Advanced RAG post-retrieval optimization: reranking
    3. Implementing the LLM Twin’s RAG inference pipeline
      1. Implementing the retrieval module
      2. Bringing everything together into the RAG inference pipeline
    4. Summary
    5. References
  11. Inference Pipeline Deployment
    1. Criteria for choosing deployment types
      1. Throughput and latency
      2. Data
    2. Understanding inference deployment types
      1. Online real-time inference
      2. Asynchronous inference
      3. Offline batch transform
    3. Monolithic versus microservices architecture in model serving
      1. Monolithic architecture
      2. Microservices architecture
      3. Choosing between monolithic and microservices architectures
    4. Exploring the LLM Twin’s inference pipeline deployment strategy
      1. The training versus the inference pipeline
    5. Deploying the LLM Twin service
      1. Implementing the LLM microservice using AWS SageMaker
        1. What are Hugging Face’s DLCs?
        2. Configuring SageMaker roles
        3. Deploying the LLM Twin model to AWS SageMaker
        4. Calling the AWS SageMaker Inference endpoint
      2. Building the business microservice using FastAPI
    6. Autoscaling capabilities to handle spikes in usage
      1. Registering a scalable target
      2. Creating a scalable policy
      3. Minimum and maximum scaling limits
        1. Cooldown period
    7. Summary
    8. References
  12. MLOps and LLMOps
    1. The path to LLMOps: Understanding its roots in DevOps and MLOps
      1. DevOps
        1. The DevOps lifecycle
        2. The core DevOps concepts
      2. MLOps
        1. MLOps core components
        2. MLOps principles
        3. ML vs. MLOps engineering
      3. LLMOps
        1. Human feedback
        2. Guardrails
        3. Prompt monitoring
    2. Deploying the LLM Twin’s pipelines to the cloud
      1. Understanding the infrastructure
      2. Setting up MongoDB
      3. Setting up Qdrant
      4. Setting up the ZenML cloud
        1. Containerize the code using Docker
        2. Run the pipelines on AWS
        3. Troubleshooting the ResourceLimitExceeded error after running a ZenML pipeline on SageMaker
    3. Adding LLMOps to the LLM Twin
      1. LLM Twin’s CI/CD pipeline flow
        1. More on formatting errors
        2. More on linting errors
      2. Quick overview of GitHub Actions
      3. The CI pipeline
        1. GitHub Actions CI YAML file
      4. The CD pipeline
      5. Test out the CI/CD pipeline
      6. The CT pipeline
        1. Initial triggers
        2. Trigger downstream pipelines
      7. Prompt monitoring
      8. Alerting
    4. Summary
    5. References
  13. Appendix: MLOps Principles
    1. Automation or operationalization
    2. Versioning
    3. Experiment tracking
    4. Testing
      1. Test types
      2. What do we test?
      3. Test examples
    5. Monitoring
      1. Logs
      2. Metrics
        1. System metrics
        2. Model metrics
        3. Drifts
        4. Monitoring vs. observability
        5. Alerts
    6. Reproducibility
  14. Other Books You May Enjoy
  15. Index

Product information

  • Title: LLM Engineer's Handbook
  • Author(s): Paul Iusztin, Maxime Labonne
  • Release date: October 2024
  • Publisher(s): Packt Publishing
  • ISBN: 9781836200079