Chapter 1. Introduction

AI is no longer the realm of science fiction novels and dystopian Hollywood movies. It is fast becoming an integral part of people’s lives. Most of us interact with AI on a daily basis, often without even realizing it.

Current progress in AI has to a large extent been driven by advances in language modeling. LLMs (Large Language Models) are one of the most significant technological advances in recent times, marking a new epoch in the world of tech. Previous inflection points include the advent of the computer that ushered in the digital revolution, the birth of the Internet and the World Wide Web that connected the world together, and the emergence of the smartphone that reshaped human communication. Given current trends, Generative AI, with LLMs playing a key part, is poised to be the next such transformative epoch.

Generative AI models learn patterns from data and are able to generate new data in response to user queries, called prompts. Generative AI encompasses models that generate images, videos, speech, music, and of course text. While there is an increasing focus on bringing all these modalities together into a single model, in this book we will stick to language and LLMs.

In this chapter, we will introduce language models and define what makes a language model large. We will go through a brief history of LLMs, contextualizing their place within the field of NLP (Natural Language Processing) and their evolution. We will highlight the impact LLMs are already having in the world and showcase key use cases. We will also introduce LLM prompting and show how to interact with an LLM effectively. We will then explore how to access LLMs through an API. Finally, we will end this chapter with a quick tutorial on building a Chat with my PDF chatbot prototype. We will then discuss the limitations of the prototype and the factors limiting its suitability for production use cases, thus setting the stage for the rest of the book.

Defining LLMs

A model is an approximation of a real-world concept or phenomenon. A faithful model will be able to make predictions about the concept it is approximating. A language model approximates human language, and is built by training over a large body of text, thus imbuing it with various properties of language, including aspects of grammar (syntax) and meaning (semantics).

During the training process, the language model predicts the next token (this is equivalent to a word or a subword, but we will ignore this distinction until the next chapter) in a known text sequence and its parameters are updated if it gets it wrong.

For example, consider the following text sequence in our training data:

The umbrella thief sauntered his way to the hostel

We ask the language model to predict the next word that comes after The umbrella thief sauntered his way to the _

There are a large number of valid continuations to this text sequence. It could be restaurant or bank, but it could also be embankment or orifice. However, it is definitely not the or is, because that would break the rules of the English language. After training on a sufficiently large body of text, the model learns that neither the nor is is a valid continuation. Thus, you can see how a simple task like learning to predict the next word in a sequence can lead the model to learn the grammar of the language in its parameters, as well as even more complex skills.
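To make the training objective concrete, here is a minimal sketch of how a single sentence can be turned into next-word prediction examples. Real training pipelines operate on tokens rather than whole words, and on vastly larger corpora; this is purely illustrative.

# Illustrative only: turn one sentence into (context, next word) training pairs.
# Real LLM training operates on tokens, not words, and on enormous corpora.
sentence = "The umbrella thief sauntered his way to the hostel"
words = sentence.split()

training_pairs = [
    (" ".join(words[:i]), words[i])  # (context so far, word to predict)
    for i in range(1, len(words))
]

for context, target in training_pairs:
    print(f"{context!r} -> {target!r}")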

Is there a limit to what a model can learn from next token prediction alone? This is a very important question that determines how powerful LLMs can eventually be. There is plenty of disagreement in the research community, with some researchers arguing next token prediction is enough to achieve general intelligence in models, and others pointing out the shortfalls of this paradigm. We will come back to this question throughout the book, and especially in Chapter 10, where we will discuss skills like reasoning.

Modern-day language models are based on neural networks. Several neural network architectures are used to train LLMs, the most prominent being the Transformer. We will learn more about neural networks, Transformers, and other architectures in detail in Chapter 2.

Language models can be trained to model not just human languages, but also programming languages like Python or Java. In fact, the Transformer architecture and the next token prediction objective can be applied to sequences that are not languages at all, like representations of chess moves, DNA sequences, or airline schedules.

For example, Adam Karvonen built Chess-GPT, a model trained only on chess games represented as PGN (Portable Game Notation) strings. PGN strings for chess look like ‘1. e4 d5 2. exd5 Qxd5…’ and so on. Even without being given the rules of the game explicitly, and just by being trained to predict the next character in the PGN sequence, the model was able to learn the rules of the game, including castling, check, and checkmate, and even win chess games against experts. This shows the power of the next token prediction objective, and of the Transformer architecture that forms the basis of the model. In Chapter 6, we will learn how to train our own Chess-GPT from scratch.

Another such example is Geneformer, a model trained on millions of single-cell transcriptomes (representations of the RNA molecules in a single cell), which can be used to make predictions in network biology, including about disease progression and gene-dosage sensitivity, and to identify therapeutic candidates.

Therefore, I encourage you to think beyond the realm of human language when determining how to use language models. If you have a concept or phenomenon that can be encoded in a sequence using a finite vocabulary (we will more formally define vocabulary in Chapter 4), then we can potentially train a useful model on it.

Around 2019, researchers realized that increasing the size of the language model predictably improved performance, with no saturation point in sight. This led to Kaplan et al.’s work on LLM scaling laws, which derives a mathematical relationship between a model’s performance and the amount of compute used for training, the size of the training dataset, and the size of the model. Ever since then, companies and organizations have been training increasingly larger models.
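As a rough illustration, these scaling laws take the form of power laws. The sketch below uses the approximate form and constants reported by Kaplan et al. for loss as a function of model size and dataset size; treat the exact numbers as illustrative rather than definitive.

# Illustrative only: approximate power-law form of the Kaplan et al. scaling laws.
# L(N): loss as a function of non-embedding parameter count N.
# L(D): loss as a function of dataset size D, measured in tokens.
# The constants are approximate values taken from the paper.

def loss_from_model_size(n_params, n_c=8.8e13, alpha_n=0.076):
    return (n_c / n_params) ** alpha_n

def loss_from_dataset_size(n_tokens, d_c=5.4e13, alpha_d=0.095):
    return (d_c / n_tokens) ** alpha_d

# Loss keeps improving, slowly but predictably, as the model grows.
for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} parameters -> predicted loss {loss_from_model_size(n):.2f}")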

For our purposes, we will call any model that contains over a billion parameters a Large Language Model. Note that there is no accepted convention on when a language model is deemed to be large.

Another way in which ‘large’ language models differ from smaller ones is in the emergent capabilities they possess. First hypothesized by Wei et al., emergent capabilities are capabilities that are exhibited by larger models but not smaller ones.

According to this theory, for tasks that require these capabilities, the performance of smaller models is close to random. However, when the model size reaches a threshold, performance suddenly starts to increase with size. Examples include multi-digit arithmetic operations and arithmetic and logical reasoning. This also suggests that certain capabilities that are completely absent in current models could be exhibited by future models.

These thresholds are not absolute, and as we see more advances in language modeling, we can expect the thresholds to come down. Moreover, there has been recent work showing that most capabilities thought of as emergent, including reasoning capabilities, are perhaps not emergent at all.

The question of what abilities are emergent is still being explored in the research community. In Chapter 5, we will discuss its implications for selecting the right model for our desired use case.

Note

LLMs are expensive and slow, thus prompting a growing body of research to investigate smaller language models that can still exhibit the same properties as larger ones. These are called SLMs (Small Language Models), although I have seen models as large as three billion parameters being called SLMs!

To understand how current LLMs came to be, it is instructive to walk through a brief history of LLMs. As a detailed history is out of scope for this book, we will provide links to external resources for further reading throughout the section.

A brief history of LLMs

In order to present the history of LLMs, we need to start with the history of NLP (Natural Language Processing), the field from which LLMs originated. For a more detailed history of NLP, refer to Daniel Jurafsky’s seminal book on NLP.

Early years

The field traces its origins to the 1950s, driven by demand for machine translation, the task of automatically translating from one language to another. The early days were dominated by symbolic approaches; these were rule-based algorithms based on linguistic theories influenced by the works of linguists like Noam Chomsky.

In the mid-1960s, Joseph Weizenbaum released ELIZA, a chatbot program that applied pattern matching using regular expressions on the user’s input and selected response templates to generate an output. ELIZA consisted of several scripts, the most famous being DOCTOR, which simulated a psychotherapist. This variant would respond by rephrasing the user’s input in the form of a question, similar to how a therapist would.

As an example:

User: I am not feeling well
ELIZA: Do you believe it is normal to be not feeling well?

You can try chatting with ELIZA online. Even in the era of ChatGPT, ELIZA can hold a somewhat convincing conversation, despite the fact that it is just rules-based.
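The following toy sketch conveys the general idea behind ELIZA-style pattern matching: a regular expression captures part of the user’s input, and a response template reflects it back. The patterns and templates here are made up for illustration and are far simpler than ELIZA’s actual DOCTOR script.

import re

# Toy illustration of ELIZA-style rules; not the actual DOCTOR script.
rules = [
    (re.compile(r"i am (.*)", re.IGNORECASE),
     "Do you believe it is normal to be {0}?"),
    (re.compile(r"i feel (.*)", re.IGNORECASE),
     "Why do you feel {0}?"),
]

def eliza_reply(user_input):
    for pattern, template in rules:
        match = pattern.search(user_input)
        if match:
            return template.format(match.group(1))
    return "Please tell me more."

print(eliza_reply("I am not feeling well"))
# -> Do you believe it is normal to be not feeling well?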

As the decades rolled by, the limitations of symbolic approaches became more and more evident, and statistical approaches became more commonplace. NLP researcher Frederick Jelinek famously quipped:

Every time I fire a linguist, the performance of the speech recognizer goes up.

Frederick Jelinek

Machine learning based approaches became more widely used in the 1990s and 2000s. Feature engineering and selection, the process of identifying features that are predictive for a given task, was a crucial and time-consuming step. These features could be statistical, like the average word length, or linguistic, like parts of speech. To learn more about traditional statistical NLP, I recommend reading Christopher Manning’s book.

The relevance of linguistics to modern-day NLP application development is a point of debate. Many university courses on NLP have completely dropped content related to linguistics. Even though I don’t directly use linguistics in my work, I find that I rely on it more often than I expect in order to develop intuitions about model behavior and task decomposition. As such, I recommend Emily Bender’s books on syntax and semantics to understand the basics of this field.

The 2010s saw the advent of deep learning and its widespread impact on NLP. In deep learning, the neural network model learned informative features by itself given raw input, thus removing the need for cumbersome feature engineering. To learn the principles of deep learning and neural networks, I recommend reading Goodfellow et al.’s book. For more hands-on deep learning training, I recommend Zhang et al.’s Dive into Deep Learning.

During the early years of deep learning, it was customary to construct a task-specific architecture to solve each task. Some of the types of neural network architectures used include multi-layer perceptrons, convolutional neural networks, recurrent neural networks, and recursive neural networks. To learn more about this era of NLP, I recommend reading Yoav Goldberg’s book on Neural NLP.

The modern LLM era

In 2017, the Transformer architecture was invented, quickly followed by the rise of transfer learning and Transformer-based language models like BERT. These advances removed the need for constructing complex task-specific architectures. Instead, one could use the same Transformer model to solve a variety of tasks. This new paradigm divided the training step into two stages: pre-training and fine-tuning. An initial large-scale pre-training step initialized the Transformer model with general language capabilities. Subsequently, the pre-trained model could be trained on individual tasks separately, creating a copy of the model each time, using a process called fine-tuning. We will cover all these topics in detail throughout the book.

The evolution of the modern era can be traced through the advances brought in by each of the GPT (Generative Pre-trained Transformer) models trained by OpenAI.

  • GPT-1 - Showcased unsupervised pre-training on large scale data, followed by task-specific supervised fine-tuning.

  • GPT-2 - This version could solve several types of tasks in a zero-shot setting, without any task-specific fine-tuning. This marked the rise of prompting as a means to interact with a language model. We will discuss zero-shot and prompting later in the chapter.

  • GPT-3 - Inspired by the scaling laws, this model is a hundred times larger than GPT-2 and popularized in-context/few-shot learning. We will learn more about this later in the chapter.

  • GPT-4 - As of the book’s writing, this model is the current state-of-the-art. A key aspect of this release is the alignment training used to make the model more controllable and adhere to the principles and values of the model trainer. We will learn about alignment training in Chapter 7.

You might have noticed a trend here; through the years, the field has been experiencing a consolidation effect, with more and more parts of the NLP task pipeline being performed end-to-end, i.e. by a single model. Throughout this book, we will point out the consolidation effect where it is apparent, and discuss its implications for the future of LLMs.

A history of LLMs wouldn’t be complete without mentioning the impact of open source contributions to this field. Open source models, datasets, model architectures, and various developer libraries and tools have all had significant impacts on the development of this field. This book places a special importance on open-source, providing a thorough survey of the open-source LLM landscape and showcasing many open-source models and datasets.

Next, let’s explore how LLMs are being adopted and their impact on society.

The impact of LLMs

The tech world has long been susceptible to hype cycles, with exhilarating booms and depressing busts. More recently, we have witnessed the crypto/blockchain and Web3 booms, both of which are yet to live up to their promise. Is AI heading towards a similar fate? We have hard evidence that it is not.

At my company Hudson Labs, we analyzed discussions in the quarterly earnings calls of the 4000 largest publicly listed companies in the United States to track adoption of crypto, Web3, and AI in the enterprise.

We observed that 85 companies discussed Web3 in their earnings calls, with even fewer tangibly working on it. Crypto fared better, with 313 companies discussing it. Meanwhile, LLMs were discussed and adopted by 2,195 companies, meaning that at least 50% of America’s largest public companies are not only using LLMs to drive value, but also consider them strategically important enough to merit discussion in their quarterly earnings calls. Effective or not, LLM adoption in the enterprise is already a reality.

Figure 1-1 shows the number of companies discussing Web3 in their earnings calls over time. As you can see, the Web3 hype seems to be tapering off.

Figure 1-1. Companies that discussed Web3 in their earnings calls across time.

Similarly, Figure 1-2 shows the number of companies discussing crypto/blockchain in their earnings calls over time. As you can see, this is also a declining trend.

Figure 1-2. Companies that discussed crypto in their earnings calls across time.

Finally, let’s look at AI. As mentioned before, AI has reached a level of adoption in the enterprise that no other recent technology trend has managed. The trend is only accelerating, as shown in Figure 1-3, which plots the number of companies at which AI was discussed during earnings calls in just the first two months of each year. The sharp spike in 2024 shows that adoption continues to grow.

Figure 1-3. Companies that were asked questions about AI in their earnings calls during the first two months of the year.

Note that these statistics only include Generative AI/LLM adoption and not data science/data analytics, whose adoption is even more ubiquitous in the enterprise. AI adoption is also not just limited to tech companies, with companies ranging from real estate companies to insurance firms joining in on the fun.

LLM usage in the enterprise

From the same analysis, we observed the key ways in which LLMs are used in the enterprise:

  • Employee Productivity: The primary means by which employee productivity has improved through LLM usage is with coding assistants like GitHub Copilot. LLMs are also widely used to help draft marketing and promotional text and automate marketing campaigns. In fact, the first major commercial success stories of LLMs were marketing startups like Jasper.ai and Copy.ai. Another key LLM-driven productivity enhancement is question answering assistants that operate over a company’s extensive knowledge base, drawn from heterogeneous data sources.

  • Report Generation: These include summarization tasks, completing mundane paperwork, and even drafting contracts. Examples of summarization tasks include summarizing financial reports, research papers, or even meeting minutes from audio or call transcripts.

  • Chatbots: LLM-driven chatbots are being increasingly deployed as customer service agents. They are also being used as an interface to a company’s documentation or product page.

  • Information Extraction and Sequence Tagging: Over the years, a large number of enterprises have developed complex NLP (Natural Language Processing) pipelines for language processing tasks. Many of these pipelines are being fully or partially replaced by an LLM. These pipelines are used to solve common NLP tasks like sentiment analysis, information extraction tasks like entity extraction and relation extraction, and sequence tagging tasks like named entity recognition. For a detailed list of NLP tasks and their descriptions, see Fabio Chiusano’s blog.

  • Translation: Translation tasks not only include translating text from one language to another, but also tasks where text is converted to a different form in the same language, for example, converting informal text to formal text, abusive text to polite text, and so on. Real-time translation apps like Erudite’s Instant Voice Translate promise to make embarrassing, language-barrier-driven moments for tourists a thing of the past.

Prompting

Now that we have our fundamentals in place, let’s begin learning how to effectively use LLMs.

The process by which you interact with an LLM is called prompting. Despite attempts by companies to anthropomorphize LLMs by giving them a name or a persona, when you are chatting with an LLM you are not conversing with it as you would with a human; you are prompting it. Remember that LLMs are next word predictors. This means that the text they generate is heavily dependent on the text they are fed, called the context.

By feeding the LLM the right text in the context, you are priming it to generate the type of output you need. The ideal prompt, then, answers the question: what prefix of N tokens, when fed to an LLM, will lead it to generate the correct answer with the highest probability?

When I started writing this book, I solicited opinions from the target readership on the topics they would like covered. The largest number of requests I received were on the topic of prompting, with practitioners wanting to understand how to effectively create prompts for their specific use cases.

Indeed, prompting is an important aspect of modern-day LLMs. In fact, you will probably end up spending a significant amount of your time on any LLM-based project iterating on prompts, a process very inaccurately referred to as prompt engineering.

Tip

There have been attempts to automatically optimize prompts, like APO (Automatic Prompt Optimization) and AutoPrompt. We will discuss this further in Chapter 13.

It is important to manage one’s expectations about the effectiveness of prompt engineering. Prompts aren’t magical incantations that unlock hidden LLM capabilities. It is very unlikely that any company has a significant advantage over others just by using a superior prompting technique unknown to everyone else. On the flip side, not following basic prompting principles can severely hamper the performance of your LLM.

There are umpteen prompting tutorials available online. I recommend learningprompting.org’s prompting guide in particular. You do not need to know all the prompting techniques to become well-versed in prompting. Most of what you need to know about prompting can be learned in a couple of hours. What matters more is interacting frequently with the LLMs you use, observing their outputs, and developing an intuition for their behavior.

If you have programming experience, I suggest viewing prompting through the lens of programming. In programming, instructions need to be explicit, with no room for ambiguity. The challenge with prompting is that it is done in natural language, which is inherently ambiguous. Still, the best prompts state instructions as explicitly and in as much detail as possible. We will learn more prompting nuances in Chapters 5 and 13.

Note

Language models are surprisingly insensitive to word order. This property has been observed even in earlier models like BERT. For example, ask ChatGPT or your favorite LLM the question How do I tie my shoe laces? in jumbled form, say shoe tie my I how do laces? ChatGPT responds with Certainly! Here are step-by-step instructions on how to tie your shoe laces: … as if you had asked a straightforward question. In Chapter 2, we will see why this happens.

Next, let’s go through a few types of prompting modes:

Zero-shot prompting

This is the standard approach to prompting, where you provide the LLM with an instruction and optionally, some input text.

Consider an example where your task is to retrieve the sentiment of a restaurant review. To achieve this through zero-shot prompting, you can issue the following prompt:

“Classify the given sentence according to its sentiment. The output can be one of Positive, Negative, Neutral. The mashed potatoes took me back to my childhood school meals. I was so looking forward to having them again. NOT! Sentiment:”

A good zero-shot prompt will:

  • Provide the instruction in a precise and explicit manner.

  • Describe the output space or the range of acceptable outputs and output format. In this example, we mention the output should be one of three values.

  • Prime it to generate the correct text. By ending the prompt with Sentiment:, we are increasing the probability of the LLM generating the sentiment value as the next token.

In real-world settings, your output format needs to be highly controllable in order for it to fit into automated systems. We will discuss more techniques for ensuring controllability of outputs in Chapter 5.
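As a simple illustration, one common tactic is to post-process the model’s raw output into your fixed label set rather than trusting it verbatim. The helper below is a hypothetical sketch of such a normalization step, not part of any particular library.

# Hypothetical sketch: normalize raw LLM output into a fixed label set.
ALLOWED_LABELS = ("positive", "negative", "neutral")

def normalize_sentiment(raw_output, default="Neutral"):
    # The model may echo extra text, punctuation, or use different casing.
    cleaned = raw_output.strip().strip(".").lower()
    for label in ALLOWED_LABELS:
        if label in cleaned:
            return label.capitalize()
    return default  # fall back (or flag for review) if no label is found

print(normalize_sentiment("Sentiment: Negative."))  # -> Negative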

Warning

Prompts are sensitive to model changes. You might painstakingly construct a prompt that seems to work well, only to notice that the same prompt does not work for a different model. In fact, the same prompt might see degraded performance on the same API endpoint if the underlying model is updated in the meantime. We call this prompt drift. It is a good idea to version control prompts, and we will discuss tools to facilitate this in Chapter 5.

Few-shot prompting

In our example for zero-shot prompting, the LLM was able to solve the task without being shown how to solve it. This is because the task is simple and clearly defined. In many cases, the task might not be so easy to describe in natural language. We can then add some examples to our prompt, consisting of either outputs or input-output pairs. While this is colloquially called few-shot learning, the language model is not updated in any way through this prompting technique.

Here is an example for few-shot prompting:

‘A palindrome is a word that has the same letters when spelled left to right or right to left.

Examples of words that are palindromes: kayak, civic, madam, radar

Examples of words that are not palindromes: kayla, civil, merge, moment

Answer the question with either Yes or No

Is the word Toyota a palindrome?

Answer:'
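If you build many such prompts, it helps to assemble them programmatically from lists of examples. The snippet below is a hypothetical sketch of how the palindrome prompt above could be constructed in code.

# Hypothetical sketch: assemble a few-shot prompt from example lists.
palindromes = ["kayak", "civic", "madam", "radar"]
non_palindromes = ["kayla", "civil", "merge", "moment"]

def build_palindrome_prompt(word):
    return (
        "A palindrome is a word that has the same letters when spelled "
        "left to right or right to left.\n"
        f"Examples of words that are palindromes: {', '.join(palindromes)}\n"
        f"Examples of words that are not palindromes: {', '.join(non_palindromes)}\n"
        "Answer the question with either Yes or No\n"
        f"Is the word {word} a palindrome?\n"
        "Answer:"
    )

print(build_palindrome_prompt("Toyota"))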

Chain-of-Thought prompting

This is probably the most important prompting technique to know about. As we have seen, the context of the LLM determines the next token it predicts. Therefore, we need to optimize the content in the prompt to maximize the probability of the LLM generating the right tokens. One way to do this is to ask the LLM to think before generating. This prompts the LLM to generate the process for arriving at the answer instead of directly generating the answer. This might involve breaking the input task into subtasks and solving them one after the other.

When the LLM is eventually at the cusp of generating the answer, it can rely on a more relevant context that increases its probability of generating the right answer. We will call this initial text generated by the LLM scratchpad text.

As an example, consider the following prompt:

'Solve the equation. 34 + 44 + 3  * 23 / 3 * 2. Think step by step.'

After receiving the instruction Think step by step, the LLM breaks down the problem and solves each part individually.

Given equation:

34 + 44 + 3 * 23 / 3 * 2

First, perform multiplication and division from left to right:

= 34 + 44 + (3 * 23 / 3 * 2)
= 34 + 44 + (69 / 3 * 2)
= 34 + 44 + (23 * 2)
= 34 + 44 + 46

Now, perform addition:

= 78 + 46

Finally, complete the addition:

= 124

So, the solution to the equation is 124.

Tip

Many LLMs these days solve tasks step-by-step without being explicitly prompted to. This is because they have been instruction-tuned to do so. We will learn more about instruction-tuning in Chapter 5. LLMs that have been instruction-tuned are easier to prompt.

In the case of LLMs accessible through a user interface, there might be a hidden prompt (called a system prompt) added by the LLM provider that applies chain-of-thought prompting to relevant user prompts.

Adversarial Prompting

You might notice that for some queries, the LLM declines to execute your request. This is because it has been specifically trained to refuse certain kinds of requests (We will learn how to achieve this in Chapter 7). This kind of training, which we will call alignment training, is imparted to the model to align it with the values and preferences of the entity developing the model.

For example, asking any decent LLM directly for instructions to build a bomb will result in a refusal. However, as of today, alignment training only imparts a weak layer of security, as it can be bypassed by cleverly prompting the LLM, called adversarial prompting. Adversarial prompts can be generated either manually or using algorithms. These cleverly phrased prompts ‘trick’ the LLM into generating a response even if it was trained not to.

These clever prompting schemes are not just useful for illicit purposes. In many cases, the LLM simply does not respond the way you want it to, and clever prompting can help. These schemes range from asking the LLM to adopt a specific persona to outright emotional blackmail (‘If you don’t respond correctly to this query, many children will suffer!'). While there has been some work showing that adding emotion to a prompt may lead to better performance, there is no hard and sustained evidence that this is universally effective for a given model. Thus, I would not necessarily recommend using these techniques in production applications.

Accessing LLMs through an API

You most likely have already interacted with an LLM through a chat interface like ChatGPT, Gemini, or Claude. Let’s now explore how to access LLMs using an API. We will use the OpenAI API as an example to access its GPT family of models. Most other proprietary models expose similar parameters through their APIs.

GPT-3.5 and GPT-4 can be accessed through the Chat Completions API. Here is an example:

import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]  # or set your API key string here

output = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are an expert storywriter."},
        {"role": "user", "content": "Write me a short children's story "
                                    "about a dog and an elephant stopping "
                                    "being friends with each other."},
    ],
)

print(output.choices[0].message)

Roles can be either system, user, or assistant, with assistant referring to the model’s responses. If you are having a chat session with the LLM, you will need to include the entire conversation history in the messages array as a sequence of user and assistant messages.

Note

What is the difference between the system and user roles? Which instructions should go into the system prompt and which ones into the user prompt? System prompts are used for dictating the high-level, overarching behavior of an LLM, like You are a financial expert well versed in writing formal reports. If you are allowing your users to directly interact with the LLM, then the system prompt can be used to provide your own instructions to the LLM along with the user request. In my experiments, I have noticed that it doesn’t matter much whether you place your instructions in the system prompt or the user prompt. What does matter is the length and number of instructions. LLMs typically can handle only a few instructions at a time. Instructions at the end of the prompt are more likely to be adhered to.

Here are some of the parameters made available by Open AI:

n - This refers to the number of generations the model produces for each input. As an example, if we used n=5 in the given example, it would generate five different children’s stories.

Tip

For most tasks, I would advise generating multiple outputs, i.e., n>1, and then using a postprocessing function (which could involve an LLM call) to choose the best one. This is because of the probabilistic nature of LLMs, where an answer might be wrong or bad just because of an unlucky token sampling. You might have to balance this process against your budget limitations.

stop and max_tokens - These are used to limit the length of the generated output. stop allows you to specify stop sequences which, if generated, end the generation process. An example stop sequence is the newline token. If you ask the model to adhere to a particular output format, like a numbered list of sentences, then in order to stop generating after a particular number of sentences have been output, you can just provide the final number as a stop parameter.

presence_penalty and frequency_penalty - These are used to limit the repetitiveness of the generated output. By penalizing the probability for tokens that have already appeared in the output so far, we can ensure that the model isn’t being too repetitive. These parameters can be used while performing more creative tasks.

logit_bias - We have seen that we can reduce the probability of certain tokens being generated, or prevent them altogether. Can we do the opposite and make it more probable that some tokens will be generated? The logit_bias parameter can be used to do that. In fact, it can also reduce the probability of a token being generated, if you provide negative values to the parameter.

top_p and temperature - Both these parameters relate to decoding strategies. Generative models produce a distribution of token probabilities, and will use these probabilities to generate the next token. There are many strategies to choose the next token to generate given the token probability distribution. We will discuss them in detail in Chapter 5.
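Here is a sketch of how several of these parameters might be combined in a single call, using the same Chat Completions endpoint as before. The specific values are illustrative, not recommendations.

# Illustrative parameter settings; tune them for your own task and budget.
output = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are an expert storywriter."},
        {"role": "user", "content": "Write a one-paragraph story about a dog."},
    ],
    n=3,                   # generate three candidate stories
    max_tokens=200,        # cap the length of each generation
    stop=["The End"],      # stop early if the model writes this marker
    temperature=0.9,       # higher temperature suits a creative task
    presence_penalty=0.5,  # discourage repeating tokens already generated
    # logit_bias={...}     # maps token IDs to bias values (omitted here)
)

for choice in output.choices:
    print(choice.message)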

Strengths and limitations of LLMs

Knowing the strengths and limitations of LLMs is a crucial skill for building useful LLM applications. In general, LLMs are proficient at language tasks. You will almost never see them make spelling or grammar errors. They are also a vast improvement over existing techniques for understanding user requests (which we will henceforth refer to as queries). They also exhibit state-of-the-art performance on most NLP tasks like entity and relationship extraction and named entity recognition. They are also strong at generating code, which is where LLMs have arguably found their greatest success through tools like GitHub Copilot.

Most LLM limitations boil down to the fact that they are just not intelligent enough. Even state-of-the-art models suffer from significant limitations in various forms of reasoning, including arithmetic reasoning, logical reasoning, and common sense reasoning. LLMs also struggle to stay factual, because of their lack of connection to the real world. Therefore, they tend to generate text that is inconsistent with the facts and principles of the real world; such outputs are colloquially termed hallucinations. Hallucination is the bane of LLMs and one of the key reasons for hesitation in adopting them. In Chapter 10, we will engage in a deep dive into various methods to tackle hallucinations and address reasoning limitations.

Every LLM generates text with a distinct signature, some more apparent to humans than others. For example, you might have noticed that ChatGPT has a tendency to overuse certain words like delve, tapestry, and bustling. ChatGPT also tends to generate sentences with an explanatory final clause, like He ate the entire pizza, indicating he was hungry. or The vampire sent a thousand text messages in a month, suggesting effective use of digital technologies. However, it is extremely hard to detect AI-generated text with 100% accuracy. Bad actors are also employing evasion techniques, for example by asking another LLM to rephrase LLM-generated text so as to dilute the signature of the original LLM.

Thus, plagiarism detection has become even more challenging. There have also been cases of students being unfairly accused of plagiarism due to inaccurate AI-text detectors. These trends are prompting universities worldwide to rethink the means through which students are evaluated, depending less on essays. Students are among the heaviest users of LLM products, as showcased by a decline in ChatGPT usage numbers during summer months.

LLM-generated text, while grammatically impressive, can have factuality issues. Vast numbers of LLM-generated articles are uploaded to the Web every day, and many of them make their way to the top of search engine results. For example, for a short while, for the query Can you melt eggs?, Google showed Yes, an egg can be melted, due to an AI-generated web page containing the incorrect answer. This is referred to as AI-driven web pollution. Thus, there is a very strong incentive for search engines to accurately detect AI-generated text. Note that since LLMs are primarily trained on Web text, future LLMs can be contaminated by polluted text as well.

Building your first chatbot prototype

Next, let’s get into the weeds and start building!

A healthy ecosystem of libraries has sprung up that makes experimenting with and prototyping LLM applications much easier. In fact, you can build a Chat with your PDF question answering chatbot in just around a hundred lines of code!

Let’s implement a simple application that allows the user to upload a PDF document, and provides a chat interface through which the user can ask questions about the content of the PDF and receive responses in a conversational manner.

The workflow for this app is as follows:

  1. The user uploads a PDF of their choice through the user interface.

  2. The application parses the PDF using a PDF parsing library and splits the extracted text into manageable chunks.

  3. The chunks are converted into vector form, called embeddings.

  4. When a user issues a query through the chat interface, the query is also converted into vector form.

  5. The vector similarity between the query vector and each of the document vectors is calculated (see the short sketch after this list).

  6. The text chunks corresponding to the top-k most similar vectors are retrieved.

  7. The retrieved text is fed, along with the query and any other additional instructions, to an LLM.

  8. The LLM uses the given information to provide an answer to the user query.

  9. The response is provided back to the user interface. The user can now respond (clarification question, new question, gratitude etc.)

  10. The entire conversational history is fed back to the LLM during each turn of the conversation.
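Before wiring everything together, it can help to see what the embedding and similarity steps (steps 3 through 6) look like on their own. The sketch below uses the sentence-transformers library directly, with made-up chunks; in the application itself, the embedding model and vector database handle these steps for us.

from sentence_transformers import SentenceTransformer, util

# Stand-ins for chunks extracted from an uploaded PDF.
chunks = [
    "Refunds can be requested within 30 days of purchase.",
    "Our office is open Monday to Friday, 9am to 5pm.",
    "Shipping is free for orders above $50.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vectors = model.encode(chunks)                        # step 3: embed the chunks
query_vector = model.encode("How do I get my money back?")  # step 4: embed the query

scores = util.cos_sim(query_vector, chunk_vectors)[0]       # step 5: vector similarity
best = int(scores.argmax())                                 # step 6: most similar chunk
print(chunks[best], float(scores[best]))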

Let’s begin by installing the required libraries. For this setup, we are going to use:

  • LangChain: This very popular framework enables building LLM application pipelines.

  • Gradio: This library allows you to build LLM-driven user interfaces.

  • Unstructured: This is a PDF parsing suite that supports a variety of methods for extracting text from PDFs.

  • Sentence-Transformers: This library facilitates generating embeddings from text.

  • OpenAI: This API provides access to the GPT family of models from OpenAI.

Let’s import the required libraries and functions.

!pip install openai langchain langchain-community gradio unstructured sentence-transformers chromadb

from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
import gradio as gr

Next, let’s implement the PDF loading and parsing function. LangChain supports several PDF parsing libraries. PDF parsing can be done in a variety of ways, including by LLMs. For this example, we will choose the Unstructured library.

# input_file is the PDF file object received from the Gradio upload handler.
loader = UnstructuredPDFLoader(input_file.name)
data = loader.load()

The data variable contains the parsed PDF that has been split into paragraphs. We will refer to each paragraph as a chunk. Each chunk is now converted into its vector representation using an embedding model. LangChain supports a wide variety of embedding models. For this example, we will use the all-MiniLM-L6-v2 variant of sentence-transformer embeddings, available through the Hugging Face platform.

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

Now that we have loaded the embedding model, we can generate the vectors from the data and store them in a vector database. Several vector database integrations are available on LangChain. We will use Chroma for this example, as it is the simplest to use.

db = Chroma.from_documents(data, embeddings)

Now the vector database is ready with our vectors! We can issue queries and retrieve the most similar chunks. For instance:

query = "How do I request a refund?"
docs = db.similarity_search(query)
print(docs[0].page_content)

This code gives you the paragraph in the PDF whose vector is most similar to the vector representing the user query. Since the vectors encode the meaning of the text, the paragraph represented by the most similar vector has content closest in meaning to the query.

Note that it is not guaranteed that the paragraph contains the answer to the query. Using embeddings, we can only get text that is similar to the query. The matched text need not contain the answer or even be relevant to answering the query.

We can use an LLM to figure this out. We provide the LLM with the query and the retrieved text and ask it to answer the query given the provided information. This workflow can be implemented using a chain in LangChain.

conversational_chain = ConversationalRetrievalChain.from_llm(
    ChatOpenAI(temperature=0.1),
    retriever=db.as_retriever(search_kwargs={"k": 3}),
)

We use the ConversationalRetrievalChain, which supports the following workflow:

  1. Takes the previous conversational history, if it exists, and the current query/response from the user, and creates a standalone question.

  2. Uses a chosen retrieval method to retrieve the top-k chunks most similar to the question.

  3. Takes the retrieved chunks, the conversational history, the current user query, and any instructions, and feeds them to the LLM. The LLM generates the answer.

We can call the chain and append the result to the chat history as follows:

output = conversational_chain(
    {'question': query, 'chat_history': conversational_history}
)
conversational_history += [(query, output['answer'])]

Our chatbot is ready. Let’s wrap it up by connecting it with a user interface. We will use Gradio, a lightweight Python framework for building LLM-driven user interfaces.

with gr.Blocks() as app:

    with gr.Row():
        chatbot = gr.Chatbot(value=[], elem_id='qa_chatbot').style(height=500)

    with gr.Row():
        with gr.Column(scale=0.80):
            textbox = gr.Textbox(
                placeholder="Enter text"
            ).style(container=False)

        with gr.Column(scale=0.10):
            upload_button = gr.UploadButton("Upload a PDF", file_types=[".pdf"]).style()

We need some more code for writing the event handlers that wait for user events. Refer to the full code on the book’s GitHub repo.
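As a rough sketch, the handlers could be wired up along the following lines, inside the same gr.Blocks() context. The function names here are hypothetical; the actual implementations are in the repo.

    # Hypothetical wiring; process_pdf and answer_query are placeholders for
    # the handler functions in the book's repo.
    upload_button.upload(
        fn=process_pdf,          # parses the PDF and builds the vector store
        inputs=[upload_button],
        outputs=[chatbot],
    )
    textbox.submit(
        fn=answer_query,         # runs the conversational chain on the query
        inputs=[textbox, chatbot],
        outputs=[chatbot],
    )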

Finally, we initialize the application.

if __name__ == "__main__":
    app.launch()

Our chatbot application is ready!

From prototype to production

Is building LLM applications that easy? Unfortunately not. We have built a prototype, a decent one at that. In fact, for many non-critical use cases, the performance of this application might even be sufficient. However, a large number of use cases demand quality levels that this application is not able to meet. This book aims to address the gap between prototype and production.

Here are some situations you might encounter:

Consider the instances where the LLM gives an unexpected (and incorrect) answer. In order to be able to debug it, it is helpful to understand the internal workings of the LLM. In Chapter 2, we will dive into the Transformer architecture that underpins most LLMs of today, along with alternative architectures that are slowly gaining adoption.

An LLM’s behavior is entirely dependent on its training process - the source and composition of the training datasets, the preprocessing steps applied to them, the tasks (called learning objectives) the LLM is trained on, and so on. We will discuss all these aspects in Chapters 3 and 4.

The application you are building may have to satisfy multiple, sometimes even contradictory criteria. For example, your Chat-with-a-PDF application might have a limited budget or it might have very stringent latency requirements. In Chapter 5, we will show how we can reason about these various criteria and choose the right LLM to use for a given task, by showcasing various evaluation techniques.

LLMs are generalist models, but your PDFs and the questions you ask about them may be too specialized to be addressed by a generalist model. In this case, you can update the model parameters using a training process called fine-tuning, which we will explore in Chapter 6. In that chapter, we will also discuss how to generate effective training datasets for fine-tuning.

Fine-tuning a model can be compute and memory intensive. We will go through more advanced fine-tuning techniques in Chapter 7, including for cases where our data belongs to specialized domains that the LLM isn’t familiar with.

In order to answer some of your queries, your application might have to interact with its environment by connecting to external databases or software tools. In fact, your application can behave as an autonomous agent that determines all the knowledge and skills it needs to answer your questions and takes the necessary steps to acquire them (like calling a database or invoking an API). We will discuss the various paradigms to achieve this in Chapter 8.

In reality, chatting with your PDF is still nowhere near like chatting with the domain expert who created the PDF, because the LLM still lacks critical reasoning capabilities. However, for many applications only limited types of reasoning are sufficient. In Chapter 10, we will discuss how to address the reasoning limitations of LLMs and how best to induce reasoning behavior in them.

You might decide to release your humble chat-with-your-PDF application to the general public. But there is a danger of adverse effects. Perhaps the LLM exhibits racist or sexist tendencies. Or maybe it leaks private information. We will discuss how to handle safety, bias, and fairness issues in Chapter 11.

As seen in our example, a typical LLM application involves the LLM being connected with several software components to make it a robust application. There are various patterns in which these components can be put together in a pipeline. One such pattern is Retrieval-Augmented Generation, which we used for this application and will further discuss in Chapter 12. In Chapter 13, we will go through several other design patterns for LLM application development.

Summary

In this chapter, we highlighted the impact that LLMs are having in the enterprise. We also defined LLMs and the scaling laws that are driving their ever-increasing sizes. We also showed how one can build prototype applications seamlessly, and highlighted the challenges involved in taking them to production. In the rest of the book, let’s understand the internals of the LLM, learn how to update it, and make it interact with its environment.
