Chapter 5. Introduction to Natural Language Processing

Natural language processing (NLP) is a field of artificial intelligence concerned with understanding human language. It uses programming techniques to build models that can understand language, classify content, and even generate new compositions in human language. We’ll be exploring these techniques over the next few chapters. Many services also use NLP to build applications such as chatbots, but those are outside the scope of this book. Instead, we’ll look at the foundations of NLP and how to model language so that you can train neural networks to understand and classify text. For a little fun, you’ll also see how to use the predictive elements of a machine learning model to write some poetry!

We’ll start this chapter by looking at how to decompose language into numbers, and how those numbers can then be used in neural networks.

Encoding Language into Numbers

You can encode language into numbers in many ways. The most common is to encode by letters, as is done naturally when strings are stored in your program. In memory, however, you don’t store the letter a but an encoding of it, perhaps an ASCII or Unicode value, or something else. For example, consider the word listen. Written in uppercase (LISTEN), it can be encoded with ASCII into the numbers 76, 73, 83, 84, 69, and 78; the lowercase form gives 108, 105, 115, 116, 101, and 110. This is good, in that you can now use numbers to represent the word. But then consider the word silent, which is an ...
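To make the letter-by-letter encoding concrete, here is a minimal sketch using Python's built-in ord(), which returns the Unicode code point of a character (identical to the ASCII value for plain English letters). The helper name encode_chars is an assumption made for illustration and is not taken from the book. Note that the values 76, 73, 83, 84, 69, and 78 correspond to the uppercase letters L, I, S, T, E, and N.

```python
def encode_chars(word: str) -> list[int]:
    """Return the code point of each character in the word (hypothetical helper)."""
    return [ord(ch) for ch in word]


listen_upper = encode_chars("LISTEN")  # [76, 73, 83, 84, 69, 78]
listen_lower = encode_chars("listen")  # [108, 105, 115, 116, 101, 110]
silent_upper = encode_chars("SILENT")  # [83, 73, 76, 69, 78, 84]

print(listen_upper)
print(listen_lower)
print(silent_upper)

# The two words are built from exactly the same codes; only the order differs.
print(sorted(listen_upper) == sorted(silent_upper))  # True
```

Because listen and silent use the same letters, the two encodings contain the same numbers in a different order, so the individual values alone do not tell the words apart.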
