
Leveraging NLP and Word Embeddings in Machine Learning Projects

Published by Pearson

Content level: Intermediate

Intermediate Natural Language Processing

The ability to manipulate text data is a critical part of any data professional's toolkit. The accessibility of language models has made it much easier to improve the performance of machine learning algorithms that operate on text data.

In this live training, we will cover how to use word embeddings in supervised machine learning tasks. We will describe key considerations in word representation, with a discussion of the different algorithms used to generate word vectors. We will also discuss approaches to representing documents as embeddings (including doc2vec vs. averaging word vectors). Finally, we will discuss how to plug embeddings into common machine learning models (as implemented in scikit-learn), along with bias-variance trade-off issues that arise with embeddings.
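To make the scikit-learn piece concrete, here is a minimal sketch, assuming gensim's downloader API and a small pretrained GloVe model; the toy reviews and labels are invented for illustration:

    import numpy as np
    import gensim.downloader as api
    from sklearn.linear_model import LogisticRegression

    # Load a small pretrained GloVe model (downloaded on first use)
    vectors = api.load("glove-wiki-gigaword-50")

    def doc_vector(text):
        """Average the vectors of in-vocabulary tokens; zeros if none match."""
        tokens = [t for t in text.lower().split() if t in vectors]
        if not tokens:
            return np.zeros(vectors.vector_size)
        return np.mean([vectors[t] for t in tokens], axis=0)

    # Toy sentiment data, purely illustrative
    docs = ["the movie was wonderful", "a dull and tedious film"]
    labels = [1, 0]

    X = np.vstack([doc_vector(d) for d in docs])
    clf = LogisticRegression().fit(X, labels)
    print(clf.predict([doc_vector("an absolutely wonderful film")]))

Averaging is the simplest document representation; its trade-offs against learned document vectors are covered in Segment 3.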

The focus of this course will be on tools for English-language models, although many of the principles can be applied to other languages.

What you’ll learn and how you can apply it

  • Decide which language model to use
  • Represent a document via word embeddings
  • Apply a machine learning algorithm to text data

This live event is for you because...

You are a data analyst, data scientist, or software engineer who:

  • Has a working understanding of the fundamentals of natural language processing (tokenization, part-of-speech tagging, topic modeling)
  • Wants to be able to use word embeddings in machine learning models

Prerequisites

  • Proficiency in Python 3, with some familiarity with interactive Python environments such as notebooks (Jupyter, Google Colab, or Kaggle Kernels)
  • Familiarity with the basics of text preprocessing, including tokenization and stemming/lemmatization
  • Familiarity with basic methods for representing text, including one-hot encoding and term frequencies

Course Set-up:

The Course GitHub Repo contains links to:

  • A hosted notebook instance
  • Instructions on how to set up the environment locally

Recommended Preparation:

Recommended Follow-up:

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Segment 1: Introduction to Language Models (30 minutes)

  • Intuition behind vector-space modeling (vs. one-hot encoding or TF-IDF); see the sketch after this list
  • Comparison of different word embedding algorithms and models (e.g., word2vec, GloVe, PPMI, SVD)
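To preview the vector-space intuition, a minimal sketch, assuming the same gensim downloader and GloVe model as the earlier sketch; the word choices are illustrative:

    import numpy as np
    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-50")

    # One-hot vectors: "good" and "great" share no dimensions,
    # so every pair of distinct words is equally dissimilar
    good_onehot, great_onehot = np.eye(2)
    print(good_onehot @ great_onehot)  # 0.0

    # Embeddings: cosine similarity reflects distributional relatedness
    print(vectors.similarity("good", "great"))   # high
    print(vectors.similarity("good", "banana"))  # much lower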

Segment 2: Translating text into numerical representations (40 minutes)

  • Ease of using pretrained embeddings
  • Design considerations in using pretrained models, including noise, sentiment, and generalization (a simple vocabulary-coverage check is sketched after this list)
  • Break (10 minutes)
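One design consideration that is easy to check directly is vocabulary coverage: how many of your corpus's tokens the pretrained model actually knows. A minimal sketch; the token list is invented and the model choice is an assumption:

    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-50")

    # Informal spellings and newer terms are often missing from
    # older pretrained vocabularies
    corpus_tokens = ["great", "movie", "lol", "gr8", "tbh"]
    oov = [t for t in corpus_tokens if t not in vectors]
    coverage = 1 - len(oov) / len(corpus_tokens)
    print(f"coverage: {coverage:.0%}, OOV tokens: {oov}")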

Segment 3: Classifying documents using embeddings (70 minutes)

  • Comparing the impact of averaging/summing word vectors vs. generating document vectors directly (contrasted in the sketch after this list)
  • Downstream performance implications of these decisions
  • Comparing bag-of-words to embeddings in different ML tasks
  • Break (10 minutes)
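To contrast the two document-representation strategies above, a minimal sketch using gensim's Doc2Vec; the corpus and hyperparameters are illustrative assumptions, not course settings:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [["the", "movie", "was", "wonderful"],
              ["a", "dull", "and", "tedious", "film"]]

    # Strategy 1: average pretrained word vectors (see the earlier sketch)

    # Strategy 2: learn document vectors jointly with word vectors (doc2vec)
    tagged = [TaggedDocument(words=doc, tags=[i]) for i, doc in enumerate(corpus)]
    model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

    print(model.dv[0][:5])  # learned vector for document 0
    print(model.infer_vector(["a", "wonderful", "film"])[:5])  # unseen document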

Segment 4: Issues in using pretrained embeddings in ML tasks (30 minutes)

  • Dominant word sense and context
  • Measuring and testing bias (a simple probe is sketched after this list)
  • Implications of poor embedding fit for downstream ML models
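As one example of the kind of bias probe this segment discusses, a minimal sketch comparing occupation words' similarity to gendered anchor words; the word lists are illustrative and this is not a validated bias test (cf. WEAT):

    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-50")

    # Positive scores lean toward "she", negative toward "he"
    for occupation in ["nurse", "engineer", "teacher", "programmer"]:
        bias = (vectors.similarity(occupation, "she")
                - vectors.similarity(occupation, "he"))
        print(f"{occupation:>12}: {bias:+.3f}")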

Q&A (10 minutes)

Your Instructor

  • Maryam Jahanshahi

    Maryam Jahanshahi is a Research Scientist at TapRecruit, a platform that uses AI and automation tools to bring efficiency and fairness to the recruiting process. She holds a PhD in Cancer Biology from the Icahn School of Medicine at Mount Sinai. Maryam's long-term research goal is to reduce bias in decision making by using a combination of NLP, data science, and decision science. She lives in New York, NY.
