Chapter 4. Tokenization

This is the first chapter in our section on NLP from the ground up. In the first three chapters, we walked you through the high-level components of an NLP pipeline. From here through Chapter 9, we’ll be covering a lot of the underlying details to really understand how modern NLP systems work. The main components are:

  • Tokenization

  • Embeddings

  • Architectures

Previously, all of these steps were abstracted away in the libraries we used (spaCy, transformers, and fastai). But now, we’ll try to understand how these libraries actually work and how you can modify your code at a low level to build amazing NLP applications beyond the simple examples we presented in this book.

One thing to note: “low level” is a subjective term. While some may call PyTorch a low-level deep learning library, others may scoff at using that term for anything other than building a custom memory allocator in x86 assembly. It’s a matter of perspective. What we mean by low level here is that after learning about these things, you’ll have enough of an understanding to build useful applications with NLP in the real world and that you’ll also be able to understand and follow the latest research in the field. We won’t be discussing anything that’s too far beyond the scope of NLP. For example, learning about how CUDA works is certainly both interesting and useful, and we’ll do a bit of that in Appendix B. But CUDA itself as a tool is useful for many things outside NLP, so we’d consider that beyond the scope of this book. As much as possible, we’ll try to keep the focus on things that actually improve the performance of your models in production.

Each of the items in the list that we just saw (i.e., tokenizers, embeddings, and models) can be thought of as an independent function. They take in some input and generate some output. Each of these functions then passes on its output to the next stage of the pipeline. To be more specific, we pass tokenized text into the embedding layer, and we pass embeddings into the model. You can treat these functions as black boxes if you would like and choose to focus on only one at a time. We’ll look at each one in isolation, with tokenizers first.
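If it helps to see that idea in code, here is a rough sketch of the pipeline as function composition. The names tokenize, embed, and model are placeholders for the real components, not actual library calls:

def nlp_pipeline(text, tokenize, embed, model):
    tokens = tokenize(text)    # raw text -> token indices
    vectors = embed(tokens)    # token indices -> dense vectors
    return model(vectors)      # vectors -> predictions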

A Minimal Tokenizer

As we start thinking about low-level parts of the deep learning stack, it’s useful to understand components in terms of what their inputs and outputs are.

So what are the inputs and outputs here? The input is text. Usually, this is provided as a .txt file or something else that is read into a Python object. The output is a sequence of tokens. One of the main topics of this chapter will be a discussion of what exactly a “token” is and what it should be doing.

As always, one of the best ways to understand something is to look at the code. So here’s essentially what a tokenizer is:

text = open('example.txt', 'r').read()
words = text.split(" ")
tokens = {v: k for k, v in enumerate(words)}
tokens
{'The': 0,
 'quick': 1,
 'brown': 2,
 'fox': 3,
 'jumps': 4,
 'over': 5,
 'the': 6,
 'lazy': 7,
 'dog.': 8}

A tokenizer reads in text and returns a mapping between words and indices. Essentially, it creates a dictionary (both figuratively and literally, since the preceding example creates a Python dictionary) that maps words to numbers. This is extremely useful, because we now have a representation of the source text that can be fed into an NLP model:

token_map = map(lambda t: tokens[t], words)
list(token_map)
[0, 1, 2, 3, 4, 5, 6, 7, 8]

This was, of course, a drastically oversimplified example. In practice, you’d never want to do tokenization this way. It’s slow, for one thing, and does not account for a lot of intricacies across different languages. Furthermore, this simple tokenizer does not account for punctuation, grammar, or word structure (i.e., the fact that words ending in “-ing,” “-ify,” etc., are related) in any meaningful way. Nonetheless, it’s a start.
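For instance, a slightly less naive version could at least separate punctuation from words and keep only unique entries in the vocabulary. This is still just a toy sketch (a regex-based split, not something you would use in production):

import re

text = "The quick brown fox jumps over the lazy dog."

# Split into runs of word characters or single punctuation marks,
# so "dog." becomes the two tokens "dog" and ".".
words = re.findall(r"\w+|[^\w\s]", text)

# Build the vocabulary from unique tokens only (dict preserves order).
tokens = {word: idx for idx, word in enumerate(dict.fromkeys(words))}
tokens
{'The': 0,
 'quick': 1,
 'brown': 2,
 'fox': 3,
 'jumps': 4,
 'over': 5,
 'the': 6,
 'lazy': 7,
 'dog': 8,
 '.': 9}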

Here’s a more precise way of stating what a tokenizer should be: a tokenizer is a program that converts a sequence of characters into a sequence of tokens. Tokenizers as a general tool are very useful even outside NLP. Wherever there is a need to parse text, there is probably some form of a tokenizer. Let’s take an example from the world of compilers, because it turns out that tokenization is a very old, fundamental, and useful thing to do.

Tip

So useful, in fact, that popular tools like lex and flex were created back in the 1970s and ’80s to generate the C code for a fast tokenizer from a simple description of the tokens you wanted to parse!

When building a compiler for a programming language, one of the first things to do is identify and mark keywords like if and for to pass on to the next stage. Here, the tokenizer reads in a file and builds a new representation of the source code where the raw ASCII/Unicode characters are replaced by tokens that represent these keywords, which can then be used to construct a data structure called a parse tree.

We’re not building a compiler here, so the parse tree isn’t entirely relevant, and in practice, we’ll be using libraries instead of complex code-generation procedures. But we wanted to illustrate an example of how a tokenizer is a very useful and robust program to have, even outside of NLP.

The type of tokenizers we’re interested in as deep learning practitioners, though, usually don’t give us parse trees. What we want is a tokenizer that reads the text and generates a sequence of one-hot vectors.

That is the most important thing to understand about tokenizers from our top-down perspective. The input is raw text, and the output is a sequence of vectors. To be even more specific, the vectors, in our case, are simply one-hot encoded PyTorch tensors that we pass into an nn.Embedding layer. Once we get to that stage, where we can pass something to an embedding layer (which we’ll discuss in the next chapter), we’re done with tokenization.
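To make that concrete, here is a small PyTorch sketch of what “ready for the embedding layer” means. Note that nn.Embedding actually takes the integer indices directly; the one-hot view is the conceptual equivalent:

import torch
import torch.nn.functional as F
from torch import nn

# Token indices produced by a tokenizer (the toy example from earlier).
token_ids = torch.tensor([0, 1, 2, 3, 4, 5, 6, 7, 8])
vocab_size = 9

# The conceptual view: a sequence of one-hot vectors.
one_hot = F.one_hot(token_ids, num_classes=vocab_size)
one_hot.shape
torch.Size([9, 9])

# The practical view: nn.Embedding looks up a dense vector per index.
embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=4)
embedding(token_ids).shape
torch.Size([9, 4])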

Now that we understand the input and outputs, let’s jump straight into the implementation, after which we’ll look at some of the new ideas in this space, and examine them in more low-level detail.

In our opinion, there are two tools for tokenization that are superior to most of the others: spaCy’s tokenizer and the Hugging Face tokenizers library.

spaCy’s tokenizer is more widely used, is older, and is somewhat more reliable. It has its own unique tokenization algorithm that tends to work well for common NLP tasks. The tokenizers library is a slightly more modern package that focuses on implementing the newest algorithms from the newest research.

Warning

Some models like BERT expect certain specific tokens, so you cannot use any tokenizer you like on these models. To work around this, recent versions of spaCy include wrappers around the Hugging Face transformers library, which allows you to combine the rest of your spaCy workflow with transformers. But behind the scenes, this will still use BERT’s tokenizer, not spaCy’s.

We’ve already used spaCy in Chapters 1 and 3, and will revisit it when we deploy models in Part III. So in this chapter, we’ll focus on the Hugging Face tokenizers library.

Hugging Face Tokenizers

tokenizers is Hugging Face’s official tokenization tool, written in the Rust programming language (which happens to be Ajay’s favorite programming language at the time of writing), with bindings to Python and JavaScript. While tokenizers could be used as a general-purpose tokenizer, this being Hugging Face, it’s designed specifically for deep learning and NLP, with a particular focus on fast subword tokenizers (which we’ll look at in detail once we try out the code).

Tokenization, unlike other parts of the deep learning pipeline, is typically done on the CPU. But that doesn’t mean it has to be slow! Hugging Face’s library makes good use of the multiple cores you might have on your machine, and can tokenize large datasets at the gigabyte scale (which is fairly large for nonacademic NLP) in under a minute.

The tokenizers library further subdivides the task of tokenization into smaller, more manageable steps. Here’s Hugging Face’s description of the components of the tokenization process in its library:

Normalizer

Executes all the initial transformations over the initial input string. For example, when you need to lowercase some text, maybe strip it, or even apply one of the common Unicode normalization processes, you will add a Normalizer.

PreTokenizer

In charge of splitting the initial input string. That’s the component that decides where and how to pre-segment the origin string. The simplest example would be like we saw before, to split on spaces.

Model

Handles all the subtoken discovery and generation. This part is trainable and really dependent on your input data.

Post-Processor

Provides advanced construction features to be compatible with some of the Transformer-based SOTA models. For instance, for BERT it would wrap the tokenized sentence with [CLS] and [SEP] tokens.

Decoder

In charge of mapping back a tokenized input to the original string. The decoder is usually chosen according to the PreTokenizer we used previously.

Trainer

Provides training capabilities to each model.

Each of those logical modules has multiple options/implementations in the library:

Normalizer

Lowercase, Unicode (NFD, NFKD, NFC, NFKC), Bert, Strip…

PreTokenizer

ByteLevel, WhitespaceSplit, CharDelimiterSplit, Metaspace, …

Model

WordLevel, BPE, WordPiece, …

Post-Processor

BertProcessor, …

Decoder

WordLevel, BPE, WordPiece, …

You have some amount of freedom in choosing these, but more often than not, you’ll be restricted to the components supported by the pretrained model you’re using. In practice, you’ll want to use whatever is suggested on the documentation for your model, so we suggest going through it if/when you encounter bugs.

Installing the library is as simple as running the following:

pip install tokenizers

But of course, we have already included this in our requirements.txt and environment.yml files on the GitHub repo:

import tokenizers
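With the library imported, here is a minimal sketch of how the modules described above plug together on a Tokenizer object. The particular normalizers and pre-tokenizer here are illustrative choices, not requirements of any specific model:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import Sequence, NFD, Lowercase, StripAccents
from tokenizers.pre_tokenizers import Whitespace

# Model: handles subtoken discovery and generation (trainable).
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Normalizer: initial transformations on the raw input string.
tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])

# PreTokenizer: decides where to pre-segment (here, on whitespace and punctuation).
tokenizer.pre_tokenizer = Whitespace()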

Subword Tokenization

If you continue looking through the documentation for tokenizers, you’ll notice that there are a lot of different algorithms that are implemented in the library. But tokenization seems like a fairly straightforward task, right? What gives?

Well, it turns out that there are plenty of ways you can decide to form a “token” from a string of text.

For example, consider the strings "cat" and "cats". One valid subtokenization of "cats" would be [cat, ##s], where the double-hashtag marks the side on which the subtoken attaches to the rest of the word (so "##s" is a suffix of the original input, not a standalone word). The advantage of this approach is that you get the semantic information that word-based tokenizers provide without incurring the cost of a very large vocabulary. A training algorithm run over an English corpus might extract subtokens such as "##ing" and "##ed".

This approach has pros and cons in terms of computation cost. On the one hand, you’ll have fewer words in your vocabulary, meaning a smaller embedding matrix (discussed in Chapter 5). But on the other hand, one word will now have multiple tokens, so you’ll be able to fit fewer words into a model that accepts a fixed number of tokens.

As illustrated in Figure 4-1, the simplest character-based tokenizers will generally never produce unknown tokens but will also break up a word into many small pieces, which may cause some loss of information. On the other hand, you can fully and accurately represent words with word-level tokenization, but then you’ll need a very large vocabulary, or you risk having many unidentified tokens.

Figure 4-1. Subtokenization

So, the goal here is twofold:

  • Increase the amount of information per token.

  • Decrease the total number of distinct tokens (the vocabulary size).

Subword tokenizers achieve this effectively by finding a good balance between characters, subwords, and words.
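A quick way to feel this trade-off is to compare the two extremes on a small piece of text. This is plain Python, purely for illustration:

text = "the cat sat on the mat because the cat was tired"

# Word-level: short sequences, but every distinct word needs its own vocabulary slot.
word_tokens = text.split()
word_vocab = set(word_tokens)

# Character-level: tiny vocabulary, but much longer sequences.
char_tokens = list(text)
char_vocab = set(char_tokens)

len(word_tokens), len(word_vocab)
(11, 8)
len(char_tokens), len(char_vocab)
(48, 16)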

Note

Subword tokenization algorithms (the newer ones, at least) are not set in stone. There is a “training” phase before we can actually tokenize the text. This is not the training of the language model itself, but rather a process we run to find the optimal balance between character-level and word-level tokenization.

The idea of using prefixes and suffixes is simple enough, and you could perhaps design a somewhat effective subword tokenizer by coding in rules for common subwords like "##ing" and "##ed". However, in practice, there are a number of difficulties with this approach:

  • There are many different languages, each with its own rules. Building a good subword tokenizer would then mean understanding and implementing a new set of rules for each language.

  • There is no guarantee that the rules you make are actually any good. As an extreme example, you might decide to make a subword token for "super##", but that might never show up in the text. So you’ve essentially wasted a spot in the dictionary. You could evaluate the number of tokens matched and tune your rules again, but at that point you might as well use a training algorithm.

  • You as a human reading the text may not be able to capture intricacies in repeated language patterns. It’s much simpler to have a computer read 40+ GB of text and figure out the repeating tokens than it is to actually read 40+ GB of text yourself!

So, the goal of the training procedure, then, is to identify recurring text in a corpus and “refactor” it into a token. If a particular pattern is not repeated often, it is not included as a token.

For example, if your text corpus has an even balance of the strings "car" and "cat" (along with many other words), then the tokens you might get would be ["ca##", "##r", "##t", ...]. But if your corpus has many more occurrences of "cat" than other words, then it might be beneficial to condense that into a single token, giving the tokens ["cat", "ca##", "##r", ...]. Ideally, we want to avoid chunking entire words together into a single token like that, since it increases the vocabulary size, as shown in Figure 4-1. But when something is repeated often, like the word the, it is more efficient to chunk that information together into a single token.
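The core of a BPE-style training step, for instance, is just counting which adjacent pair of symbols occurs most often in the corpus and merging that pair into a new token, over and over. Here is a stripped-down sketch of a single counting step over a toy corpus (not the full algorithm):

from collections import Counter

# Toy corpus: each word split into characters, with its frequency.
corpus = {("c", "a", "t"): 5, ("c", "a", "r"): 2, ("c", "a", "p"): 1}

# Count how often each adjacent pair of symbols occurs.
pairs = Counter()
for symbols, freq in corpus.items():
    for a, b in zip(symbols, symbols[1:]):
        pairs[(a, b)] += freq

pairs.most_common(1)
[(('c', 'a'), 8)]

The most frequent pair ("c", "a") would be merged into the new symbol "ca", and the process repeats until the vocabulary reaches the desired size.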

Subword tokenization also reduces the impact of the issue where the model encounters a new word that it’s never seen before. If your training corpus has the strings swim, play, and playing, a word-level tokenizer would identify the string swimming as an unknown word. However, "swimming" is simply a new word constructed from primitive subwords that the model has already seen. So a subword tokenizer could identify it as ["swim", "##m##", "##ing"] and pass more relevant information to the model.
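To make the mechanics concrete, here is a toy greedy longest-match splitter in the spirit of WordPiece, using a hand-picked vocabulary and a "##" prefix for continuation pieces (real tokenizers learn the vocabulary and use more careful algorithms):

def greedy_subword_split(word, vocab, unk_token="[UNK]"):
    # Greedily take the longest piece of `word` that appears in `vocab`;
    # continuation pieces are stored with a leading "##".
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:  # no piece matched at this position
            return [unk_token]
        start = end
    return pieces

# A toy vocabulary learned from words like "swim", "play", and "playing".
vocab = {"swim", "play", "##ing", "##m", "##s"}
greedy_subword_split("swimming", vocab)
['swim', '##m', '##ing']
greedy_subword_split("plays", vocab)
['play', '##s']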

Let’s take a look at how these ideas are implemented in the tokenizers library.

Building Your Own Tokenizer

The ready-to-use subword tokenizers are great, but sometimes, you really do need to train a tokenizer that picks out nuances specific to your text domain. The canonical examples are legal and medical text. These domains usually have a specific set of frequently used terms that are important enough to deserve their own token (think of molecule names or specific sections of legal documents).

Note

Yes, we said “train,” because subword tokenizers need some criteria to decide how to split words, and learning is often the best solution.

If you want to train your own tokenizer, there are a few popular options. Here are some references to the state-of-the-art research in tokenizers:

Byte pair encoding (BPE)

See R. Sennrich et al., “Neural Machine Translation of Rare Words with Subword Units,” arXiv, 2015, https://oreil.ly/dlFNw.

WordPiece

See M. Schuster and K. Nakajima, “Japanese and Korean Voice Search,” International Conference on Acoustics, Speech and Signal Processing, IEEE (2012), https://oreil.ly/fvGTh.

SentencePiece

See T. Kudo and J. Richardson, “SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing,” arXiv, 2018, https://oreil.ly/YNFhP.

Getting real-world medical data is actually quite hard due to regulations and a lack of privacy-preserving machine learning techniques. So for now, we’ll use the WikiText-103 dataset, which is the set of Wikipedia articles we used in Chapter 2. Just know that if your text data represents the typical literary patterns on the internet, you won’t have to train your own tokenizers from scratch most of the time.

First, we need to get the dataset (in case you didn’t download it already):

wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip

Using an established tokenizer is quite simple with Hugging Face’s tokenizers library. Here, we first set up a byte-pair encoding (a form of subword tokenization) tokenizer in a single line of code:

from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

Next, we initialize a special BpeTrainer object. This is only required if you’re training a new tokenizer from scratch:

from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

Finally, we specify the files and train our BPE tokenizer:

files = [
    f"data/wikitext-103-raw/wiki.{split}.raw" for split in
    ["test", "train", "valid"]]

tokenizer.train(files, trainer)
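Once training finishes, the tokenizer behaves like any other: you can encode new text, inspect the resulting tokens and IDs, and save the whole pipeline to a single JSON file. The sample sentence and filename here are arbitrary, and the exact subword splits will depend on the trained vocabulary:

output = tokenizer.encode("The quick brown fox jumps over the lazy dog.")
output.tokens   # the learned (sub)word pieces
output.ids      # their integer indices

tokenizer.save("wikitext-bpe-tokenizer.json")
# Later: tokenizer = Tokenizer.from_file("wikitext-bpe-tokenizer.json")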

Conclusion

In this chapter, we looked at the first stage of a lower-level view of the NLP pipeline—tokenizers. Tokenizers are not the stage of the stack that most people should be optimizing, because the choice of tokenizer won’t have a significant impact on your application’s real-world performance, but they are nonetheless a vital component. In practice, you should use spaCy or tokenizers, since they implement the newest tokenization algorithms from research. If you have a custom dataset with a lot of domain-specific vocabulary (as in legal or medical applications), it makes sense to retrain an established tokenizer algorithm like WordPiece or SentencePiece.

We also explored some of the nuances of developing fast tokenizers, specifically how the choice of programming language can affect your tokenizer’s performance.

Now we have a working low-level understanding of tokenizers, and if you want, you should be able to build your own from scratch (though in practice, of course, that’s rarely necessary). This allows us to take large text files and generate tokens that our model can use to solve complex NLP problems.

But we can’t pass raw tokens into the model. Tokens are still essentially indices in dictionaries, which is not semantically useful for a deep learning model. Instead, we pass what are called “embeddings” of the tokens, which is what the next chapter is all about.
