Comparing production-grade NLP libraries: Training Spark-NLP and spaCy pipelines
A step-by-step guide to initialize the libraries, load the data, and train a tokenizer model using Spark-NLP and spaCy.
The goal of this blog series is to run a realistic natural language processing (NLP) scenario by utilizing and comparing the leading production-grade linguistic programming libraries: John Snow Labs’ NLP for Apache Spark and Explosion AI’s spaCy. Both libraries are open source with commercially permissive licenses (Apache 2.0 and MIT, respectively). Both are under active development with frequent releases and a growing community.
The intention is to analyze and identify the strengths of each library, how they compare for data scientists and developers, and in which situations it may be more convenient to use one or the other. While this analysis aims to be an objective run-through, it (as in every natural language understanding application, by definition) involves a good amount of subjective decision-making at several stages.
As simple as it may sound, it is tremendously challenging to compare two different libraries and produce a comparable benchmark. Remember that your application will have a different use case, data pipeline, text characteristics, hardware setup, and non-functional requirements than what's done here.
I'll be assuming the reader is familiar with NLP concepts and programming. Even without knowledge of the tools involved, I aim to make the code as self-explanatory as possible in order to make it readable without getting bogged down in too much detail. Both libraries have public documentation and are completely open source, so consider reading through spaCy 101 and the Spark-NLP Quick Start documentation first.
The libraries
Spark-NLP was open sourced in October 2017. It is a native extension of Apache Spark as a Spark library. It brings a suite of Spark ML Pipeline stages, in the shape of estimators and transformers, to process distributed data sets. Spark NLP Annotators go from fundamentals like tokenization, normalization, and part-of-speech tagging, to advanced sentiment analysis, spell checking, assertion status, and others. These are put to work within the Spark ML framework. The library is written in Scala, runs within the JVM, and takes advantage of Spark optimizations and execution planning. The library currently has APIs in Scala and in Python.
spaCy is a popular and easy-to-use natural language processing library in Python. It recently released version 2.0, which incorporates neural network models, entity recognition models, and much more. It provides current state-of-the-art accuracy and speed levels, and has an active open source community. spaCy has been around for at least three years, with its first releases on GitHub tracking back to early 2015.
Spark-NLP does not yet come with a set of pretrained models. spaCy offers pre-trained models in seven (European) languages, so the user can quickly inject target sentences and get results back without having to train models. This includes tokens, lemmas, part-of-speech (POS), similarity, entity recognition, and more.
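To give a sense of what that looks like in practice, here is a minimal spaCy sketch (it assumes the English model has already been downloaded with python -m spacy download en):

import spacy

# Assumes the English model has been installed first: python -m spacy download en
nlp = spacy.load('en')
doc = nlp(u"Google was founded in California in 1998.")

# Tokens with lemmas and both coarse and fine-grained POS tags
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_)

# Named entities detected by the pretrained model
for ent in doc.ents:
    print(ent.text, ent.label_)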
Both libraries offer customization through parameters at one level or another, allow saving trained pipelines to disk, and require the developer to wrap a program around the library for a particular use case. Spark NLP makes it easier to embed an NLP pipeline as part of a Spark ML machine learning pipeline, which also enables faster execution, since Spark can optimize the entire execution at once: data loading, NLP, feature engineering, model training, hyper-parameter optimization, and measurement.
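Saving and restoring on the spaCy side, for instance, is a couple of lines (a sketch; the folder name is arbitrary), and Spark ML pipelines expose analogous save and load methods:

# Sketch: persist a trained spaCy pipeline to an arbitrary folder and load it back
nlp.to_disk('./my_trained_pipeline')

import spacy
restored_nlp = spacy.load('./my_trained_pipeline')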
The benchmark application
The programs I am writing here will predict part-of-speech tags in raw .txt files. A lot of data cleaning and preparation is in order. Both applications will train on the same data and predict on the same data, to achieve the maximum possible common ground.
My intention here is to verify two pillars of any statistical program:
- Accuracy, which measures how well a program can predict linguistic features (a simple way to score this is sketched after this list)
- Performance, which means how long I’ll have to wait to achieve such accuracy, and how much input data I can throw at the program before it either collapses or my grandkids grow old.
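Since a prediction will only count as correct when both the token and its tag match the reference data, a simple scoring helper (purely illustrative, not part of either library) could look like this:

def pos_accuracy(predicted, reference):
    # Illustrative only: both arguments are lists of (token, tag) pairs for one document.
    # A prediction counts as a hit only if the same (token, tag) pair appears in the
    # reference; tokens split differently by the tokenizer can never be hits.
    reference_pairs = set(reference)
    hits = sum(1 for pair in predicted if pair in reference_pairs)
    return hits / len(reference) if reference else 0.0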
In order to compare these metrics, I need to make sure both libraries share a common ground. I have the following at my disposal:
- A desktop PC running Linux Mint, with 16 GB of RAM, SSD storage, and an Intel Core i5-6600K processor (4 cores at 3.5 GHz)
- Training, target, and correct results data, which follow NLTK POS format (see below)
- Jupyter Python 3 Notebook with spaCy 2.0.5 installed
- Apache Zeppelin 0.7.3 Notebook with Spark-NLP 1.3.0 and Apache Spark 2.1.1 installed
The data
Data for training, testing, and measuring has been taken from the American National Corpus, using the MASC 3.0.2 written corpora from the newspaper section.
Data is wrangled with one of their tools (ANCtool). Though I could have worked with the CoNLL data format, which contains a lot of tagged information such as lemmas, indexes, and entity labels, I preferred to use the NLTK data format with Penn POS tags, which serves my purposes well enough for this article. It looks like this:
Neither|DT Davison|NNP nor|CC most|RBS other|JJ RxP|NNP opponents|NNS doubt|VBP the|DT efficacy|NN of|IN medications|NNS .|.
As you can see, the content in the training data is:
- Sentence boundary detected (new line, new sentence)
- Tokenized (space separated)
- POS detected (pipe delimited)
Whereas in the raw text files, everything comes mixed up, dirty, and without any standard bounds.
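For reference, splitting one of those training lines back into tokens and tags takes only a few lines of plain Python (an illustrative sketch, independent of both libraries):

line = "Neither|DT Davison|NNP nor|CC most|RBS other|JJ RxP|NNP opponents|NNS doubt|VBP the|DT efficacy|NN of|IN medications|NNS .|."

# Each whitespace-separated pair is word|tag; split on the last pipe to be safe
pairs = [pair.rsplit('|', 1) for pair in line.split()]
tokens = [word for word, tag in pairs]
tags = [tag for word, tag in pairs]

print(tokens)  # ['Neither', 'Davison', 'nor', ...]
print(tags)    # ['DT', 'NNP', 'CC', ...]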
Here are key metrics about the benchmarks we’ll run:
The benchmark data sets
We’ll use two benchmark data sets throughout this article. The first is a very small one, enabling interactive debugging and experimentation:
- Training data: 36 .txt files, totaling 77 KB
- Testing data: 14 .txt files, totaling 114 KB
- 21,362 words to predict
The second data set is still not “big data” by any means, but is a larger data set and intended to evaluate a typical single-machine use case:
- Training data: 72 .txt files, totaling 150 KB
- Two testing data sets: 9,225 .txt files, totaling 75 MB; and 1,125 .txt files, totaling 15 MB
- 13+ million words
Note that we have not evaluated “big data” data sets here. This is because while spaCy can take advantage of multicore CPUs, it cannot take advantage of a cluster in the way Spark NLP natively does. Therefore, Spark NLP is orders of magnitude faster on terabyte-size data sets using a cluster, in the same way a large-scale MPP database will greatly outperform a locally installed MySQL server. Our goal here is to evaluate these libraries on a single machine, using the multicore functionality of both libraries. This is a common scenario for systems under development, and also for applications that do not need to process large data sets.
Getting started
Let’s get our hands dirty, then. First things first, we’ve got to bring the necessary imports and start them up.
spaCy
import os
import io
import time
import re
import random
import pandas as pd
import spacy

nlp_model = spacy.load('en', disable=['parser', 'ner'])
nlp_blank = spacy.blank('en', disable=['parser', 'ner'])
I've disabled some pipeline components in spaCy in order to not bloat it with unnecessary parsers. I have also kept an nlp_model for reference, which is a pre-trained NLP model provided by spaCy, but I am going to use nlp_blank, which will be more representative, as it will be the one I'll be training myself.
Spark-NLP
import org.apache.spark.sql.expressions.Window
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp._
import com.johnsnowlabs.nlp.annotators._
import com.johnsnowlabs.nlp.annotators.pos.perceptron._
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic._
import com.johnsnowlabs.nlp.util.io.ResourceHelper
import com.johnsnowlabs.util.Benchmark
The first challenge I face is that I am dealing with three types of tokenization results that are completely different, and will make it difficult to identify whether a word matched both the token and the POS tag:
- spaCy’s tokenizer, which works on a rule-based approach with an included vocabulary that saves many common abbreviations from breaking up
- Spark-NLP's tokenizer, which also has its own rules for tokenization
- My training and testing data, which is tokenized following ANC's standard and, in many cases, will split words quite differently than either library's tokenizer
So, to overcome this, I need to decide how I am going to compare POS tags that refer to completely different sets of tokens. For Spark-NLP, I am leaving it as it is, which matches somewhat the ANC open standard tokenization format with its default rules. For spaCy, I need to relax the infix rule so I can increase token accuracy matching by not breaking words on a dash ("-").
spaCy
class DummyTokenMatch:
    def __init__(self, content):
        self.start = lambda: 0
        self.end = lambda: len(content)

def do_nothing(content):
    return [DummyTokenMatch(content)]

model_tokenizer = nlp_model.tokenizer
nlp_blank.tokenizer = spacy.tokenizer.Tokenizer(nlp_blank.vocab,
                                                prefix_search=model_tokenizer.prefix_search,
                                                suffix_search=model_tokenizer.suffix_search,
                                                infix_finditer=do_nothing,
                                                token_match=model_tokenizer.token_match)
Note: I am passing vocab from nlp_blank, which is not really blank. This vocab object has English language rules and strategies that help our blank model tag POS and tokenize English words, so spaCy begins with a slight advantage. Spark-NLP doesn't know anything about the English language beforehand.
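A quick way to sanity-check the relaxed infix rule is to tokenize a hyphenated phrase and confirm it stays together (the sentence is made up; the exact output depends on the rules set above):

doc = nlp_blank(u"Davison bought a large-screen TV for $1,000 yesterday.")
print([token.text for token in doc])
# With infix splitting disabled, "large-screen" should come back as a single token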
Training pipelines
Proceeding with the training, in spaCy I need to provide a specific training data format, which follows this shape:
TRAIN_DATA = [
    ("I like green eggs", {'tags': ['N', 'V', 'J', 'N']}),
    ("Eat blue ham", {'tags': ['V', 'J', 'N']})
]
Whereas in Spark-NLP, I have to provide a folder of .txt files containing delimited word|tag data, which looks just like the ANC training data. So, I am just passing the path to the POS tagger, which is called PerceptronApproach.
Let's load the training data for spaCy. Bear with me, as I have to add a few manual exceptions and rules for some characters, since spaCy's training expects clean content.
spaCy
start = time.time()
train_path = "./target/training/"
train_files = sorted([train_path + f for f in os.listdir(train_path) if os.path.isfile(os.path.join(train_path, f))])

# Build spaCy's expected training format: (sentence, {'tags': [...]}) pairs
TRAIN_DATA = []
for file in train_files:
    fo = io.open(file, mode='r', encoding='utf-8')
    for line in fo.readlines():
        line = line.strip()
        if line == '':
            continue
        line_words = []
        line_tags = []
        for pair in re.split("\\s+", line):
            tag = pair.strip().split("|")
            line_words.append(re.sub(r'(\w+)\.', r'\1', tag[0].replace('$', '').replace('-', '').replace('\'', '')))
            line_tags.append(tag[-1])
        TRAIN_DATA.append((' '.join(line_words), {'tags': line_tags}))
    fo.close()

# Manual fix for a sentence that otherwise trips up the tagger during training
TRAIN_DATA[240] = ('The company said the one time provision would substantially eliminate all future losses at the unit .',
                   {'tags': ['DT', 'NN', 'VBD', 'DT', 'JJ', '-', 'NN', 'NN', 'MD', 'RB', 'VB', 'DT', 'JJ', 'NNS', 'IN', 'DT', 'NN', '.']})

n_iter = 5
tagger = nlp_blank.create_pipe('tagger')
tagger.add_label('-')
tagger.add_label('(')
tagger.add_label(')')
tagger.add_label('#')
tagger.add_label('...')
tagger.add_label("one-time")
nlp_blank.add_pipe(tagger)

optimizer = nlp_blank.begin_training()
for i in range(n_iter):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp_blank.update([text], [annotations], sgd=optimizer, losses=losses)
    print(losses)
print(time.time() - start)
Runtime
{'tagger': 5.773235303101046}
{'tagger': 1.138113870966123}
{'tagger': 0.46656132966405683}
{'tagger': 0.5513760568314119}
{'tagger': 0.2541630900934435}
Time to run: 122.11359786987305 seconds
I had to do some field work in order to bypass a few hurdles. The training wouldn't let me pass my tokenizer words, which contain some ugly characters within (e.g., it won't let you train a sentence with a token "large-screen" or "No." unless it exists in the vocab labels). So, I had to add those characters to the list of labels for the training to work once it encounters them.
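Once training finishes, the blank pipeline behaves like any other spaCy model, so a quick sanity check looks like this (the sentence is made up; the tags are whatever the freshly trained tagger predicts):

doc = nlp_blank(u"Davison doubts the efficacy of medications .")
print([(token.text, token.tag_) for token in doc])
# Each token is paired with its predicted Penn Treebank tag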
Let's see how to construct a pipeline in Spark-NLP.
Spark-NLP
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")
    .setPrefixPattern("\\A([^\\s\\p{L}\\d\\$\\.#]*)")
    .addInfixPattern("(\\$?\\d+(?:[^\\s\\d]{1}\\d+)*)")

val posTagger = new PerceptronApproach()
    .setInputCols("document", "token")
    .setOutputCol("pos")
    .setCorpusPath("/home/saif/nlp/comparison/target/training")
    .setNIterations(5)

val finisher = new Finisher()
    .setInputCols("token", "pos")
    .setOutputAsArray(true)

val pipeline = new Pipeline()
    .setStages(Array(
        documentAssembler,
        tokenizer,
        posTagger,
        finisher
    ))

val model = Benchmark.time("Time to train model") {
    pipeline.fit(data)
}
As you can see, constructing a pipeline is a quite linear process: you set the document assembler, which makes the target text column available to the next annotator, the tokenizer; then, the PerceptronApproach is the POS model, which takes as inputs both the document text and the tokenized form.
I had to update the prefix pattern and add a new infix pattern to match dates and numbers the same way ANC does (this will probably be made default in the next release). As you can see, every component of the pipeline is under control of the user; there is no implicit vocab or English knowledge, as opposed to spaCy.
The corpusPath from PerceptronApproach points to the folder containing the pipe-delimited text files, and the finisher annotator wraps up the results of the POS tags and tokens so they are useful in the next steps. setOutputAsArray() will return, as its name says, an array instead of a concatenated string, although that has some cost in processing.

The data passed to fit() does not really matter, since the only NLP annotator being trained is the PerceptronApproach, and it is trained with the external POS corpus.
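To make that concrete, the DataFrame handed to fit() can be as small as a single row with a text column. In PySpark (a sketch; the Scala equivalent is just as short), that would be something like:

# Any DataFrame with a "text" column will do here, since PerceptronApproach
# learns from the external corpus path rather than from this data
data = spark.createDataFrame([("dummy text",)], ["text"])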
Runtime
Time to train model: 3.167619593sec
As a side note, it would be possible to inject a SentenceDetector or a SpellChecker into the pipeline, which in some scenarios might help the accuracy of the POS tagger by letting the model know where a sentence ends.
What’s next?
So far, we have initialized the libraries, loaded the data, and trained a tokenizer model using each one. Note that spaCy comes with pretrained tokenizers, so this step may not be necessary if your text data is from a language (e.g., English) and domain (e.g., news articles) it was trained on, though the tokenizer infix alteration is significant in order to make tokens more likely to match our ANC corpus. Training was more than 38 times faster on Spark-NLP for about five iterations.
In the next installment in the blog series, we will walk through the code, accuracy, and performance for running this NLP pipeline using the models we’ve just trained.