Comparing production-grade NLP libraries: Training Spark-NLP and spaCy pipelines
A step-by-step guide to initialize the libraries, load the data, and train a tokenizer model using Spark-NLP and spaCy.
The goal of this blog series is to run a realistic natural language processing (NLP) scenario by utilizing and comparing the leading production-grade linguistic programming libraries: John Snow Labs’ NLP for Apache Spark and Explosion AI’s spaCy. Both libraries are open source with commercially permissive licenses (Apache 2.0 and MIT, respectively). Both are under active development with frequent releases and a growing community.
The intention is to analyze and identify the strengths of each library, how they compare for data scientists and developers, and in which situations it may be more convenient to use one or the other. While this analysis aims to be an objective run-through, it (as in every natural language understanding application, by definition) involves a good amount of subjective decision-making at several stages.
As simple as it may sound, it is tremendously challenging to compare two different libraries and produce a comparable benchmark. Remember that your application will have a different use case, data pipeline, text characteristics, hardware setup, and non-functional requirements than what's done here.
I'll be assuming the reader is familiar with NLP concepts and programming. Even without knowledge of the tools involved, I aim to make the code as self-explanatory as possible in order to make it readable without getting bogged down in too much detail. Both libraries have public documentation and are completely open source, so consider reading through spaCy 101 and the Spark-NLP Quick Start documentation first.
The libraries
Spark-NLP was open sourced in October 2017. It is a native extension of Apache Spark as a Spark library. It brings a suite of Spark ML Pipeline stages, in the shape of estimators and transformers, to process distributed data sets. Spark NLP Annotators go from fundamentals like tokenization, normalization, and part-of-speech tagging, to advanced sentiment analysis, spell checking, assertion status, and others. These are put to work within the Spark ML framework. The library is written in Scala, runs within the JVM, and takes advantage of Spark optimizations and execution planning. The library currently has APIs in Scala and in Python.
spaCy is a popular and easy-to-use natural language processing library in Python. It recently released version 2.0, which incorporates neural network models, entity recognition models, and much more. It provides current state-of-the-art accuracy and speed levels, and has an active open source community. spaCy has been around for at least three years, with its first releases on GitHub tracking back to early 2015.
Spark-NLP does not yet come with a set of pretrained models. spaCy offers pre-trained models in seven (European) languages, so the user can quickly inject target sentences and get results back without having to train models. This includes tokens, lemmas, part-of-speech (POS), similarity, entity recognition, and more.
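To give a sense of what that looks like in practice, here is a minimal spaCy sketch (it assumes the English model has already been downloaded with python -m spacy download en):

import spacy

# Assumes the English model has been installed first: python -m spacy download en
nlp = spacy.load('en')
doc = nlp(u"Google was founded in California in 1998.")

# Tokens with lemmas and both coarse and fine-grained POS tags
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_)

# Named entities detected by the pretrained model
for ent in doc.ents:
    print(ent.text, ent.label_)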
Both libraries offer customization through parameters at one level or another, allow saving trained pipelines to disk, and require the developer to wrap a program around the library for a particular use case. Spark NLP makes it easier to embed an NLP pipeline as part of a Spark ML machine learning pipeline, which also enables faster execution, since Spark can optimize the entire execution at once: data loading, NLP, feature engineering, model training, hyper-parameter optimization, and measurement.
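Saving and restoring on the spaCy side, for instance, is a couple of lines (a sketch; the folder name is arbitrary), and Spark ML pipelines expose analogous save and load methods:

# Sketch: persist a trained spaCy pipeline to an arbitrary folder and load it back
nlp.to_disk('./my_trained_pipeline')

import spacy
restored_nlp = spacy.load('./my_trained_pipeline')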
The benchmark application
The programs I am writing here will predict part-of-speech tags in raw .txt files. A lot of data cleaning and preparation is in order. Both applications will train on the same data and predict on the same data, to achieve the maximum possible common ground.
My intention here is to verify two pillars of any statistical program:
- Accuracy, which measures how well a program can predict linguistic features (a simple way to score this is sketched after this list)
- Performance, which means how long I’ll have to wait to achieve such accuracy, and how much input data I can throw at the program before it either collapses or my grandkids grow old.
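Since a prediction will only count as correct when both the token and its tag match the reference data, a simple scoring helper (purely illustrative, not part of either library) could look like this:

def pos_accuracy(predicted, reference):
    # Illustrative only: both arguments are lists of (token, tag) pairs for one document.
    # A prediction counts as a hit only if the same (token, tag) pair appears in the
    # reference; tokens split differently by the tokenizer can never be hits.
    reference_pairs = set(reference)
    hits = sum(1 for pair in predicted if pair in reference_pairs)
    return hits / len(reference) if reference else 0.0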
In order to compare these metrics, I need to make sure both libraries share a common ground. I have the following at my disposal:
- A desktop PC running Linux Mint, with 16 GB of RAM, SSD storage, and an Intel Core i5-6600K processor (4 cores at 3.5 GHz)
- Training, target, and correct results data, which follow NLTK POS format (see below)
- Jupyter Python 3 Notebook with spaCy 2.0.5 installed
- Apache Zeppelin 0.7.3 Notebook with Spark-NLP 1.3.0 and Apache Spark 2.1.1 installed
The data
Data for training, testing, and measuring has been taken from the American National Corpus, using the MASC 3.0.2 written corpora from the newspaper section.
Data is wrangled with one of their tools (ANCtool). Though I could have worked with the CoNLL data format, which contains a lot of tagged information such as lemmas, indexes, and entity labels, I preferred to use the NLTK data format with Penn POS tags, which serves my purposes well enough for this article. It looks like this:
Neither|DT Davison|NNP nor|CC most|RBS other|JJ RxP|NNP opponents|NNS doubt|VBP the|DT efficacy|NN of|IN medications|NNS .|.
As you can see, the content in the training data is:
- Sentence boundary detected (new line, new sentence)
- Tokenized (space separated)
- POS detected (pipe delimited)
Whereas in the raw text files, everything comes mixed up, dirty, and without any standard bounds.
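For reference, splitting one of those training lines back into tokens and tags takes only a few lines of plain Python (an illustrative sketch, independent of both libraries):

line = "Neither|DT Davison|NNP nor|CC most|RBS other|JJ RxP|NNP opponents|NNS doubt|VBP the|DT efficacy|NN of|IN medications|NNS .|."

# Each whitespace-separated pair is word|tag; split on the last pipe to be safe
pairs = [pair.rsplit('|', 1) for pair in line.split()]
tokens = [word for word, tag in pairs]
tags = [tag for word, tag in pairs]

print(tokens)  # ['Neither', 'Davison', 'nor', ...]
print(tags)    # ['DT', 'NNP', 'CC', ...]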
Here are key metrics about the benchmarks we’ll run:
The benchmark data sets
We’ll use two benchmark data sets throughout this article. The first is a very small one, enabling interactive debugging and experimentation:
- Training data: 36 .txt files, totaling 77 KB
- Testing data: 14 .txt files, totaling 114 KB
- 21,362 words to predict
The second data set is still not “big data” by any means, but is a larger data set and intended to evaluate a typical single-machine use case:
- Training data: 72 .txt files, totaling 150 KB
- Two testing data sets: 9,225 .txt files, totaling 75 MB; and 1,125 .txt files, totaling 15 MB
- 13+ million words
Note that we have not evaluated “big data” data sets here. This is because while spaCy can take advantage of multicore CPUs, it cannot take advantage of a cluster in the way Spark NLP natively does. Therefore, Spark NLP is orders of magnitude faster on terabyte-size data sets using a cluster, in the same way a large-scale MPP database will greatly outperform a locally installed MySQL server. Our goal here is to evaluate these libraries on a single machine, using the multicore functionality of both libraries. This is a common scenario for systems under development, and also for applications that do not need to process large data sets.
Getting started
Let’s get our hands dirty, then. First things first, we’ve got to bring the necessary imports and start them up.
spaCy
import os
import io
import time
import re
import random
import pandas as pd
import spacy

nlp_model = spacy.load('en', disable=['parser', 'ner'])
nlp_blank = spacy.blank('en', disable=['parser', 'ner'])
I've disabled some pipeline components in spaCy in order to not bloat it with unnecessary parsers. I have also kept an nlp_model for reference, which is a pre-trained NLP model provided by spaCy, but I am going to use nlp_blank, which will be more representative, as it will be the one I'll be training myself.
Spark-NLP
import org.apache.spark.sql.expressions.Window
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp._
import com.johnsnowlabs.nlp.annotators._
import com.johnsnowlabs.nlp.annotators.pos.perceptron._
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic._
import com.johnsnowlabs.nlp.util.io.ResourceHelper
import com.johnsnowlabs.util.Benchmark
The first challenge I face is that I am dealing with three types of tokenization results that are completely different, and will make it difficult to identify whether a word matched both the token and the POS tag:
- spaCy’s tokenizer, which works on a rule-based approach with an included vocabulary that saves many common abbreviations from breaking up
- Spark-NLP's tokenizer, which also has its own rules for tokenization
- My training and testing data, which is tokenized following ANC's standard and, in many cases, will split words quite differently than either library's tokenizer
So, to overcome this, I need to decide how I am going to compare POS tags that refer to completely different sets of tokens. For Spark-NLP, I am leaving it as it is, which matches somewhat the ANC open standard tokenization format with its default rules. For spaCy, I need to relax the infix rule so I can increase token accuracy matching by not breaking words on a dash ("-").
spaCy
class DummyTokenMatch:
    def __init__(self, content):
        self.start = lambda: 0
        self.end = lambda: len(content)

def do_nothing(content):
    return [DummyTokenMatch(content)]

model_tokenizer = nlp_model.tokenizer
nlp_blank.tokenizer = spacy.tokenizer.Tokenizer(nlp_blank.vocab,
                                                prefix_search=model_tokenizer.prefix_search,
                                                suffix_search=model_tokenizer.suffix_search,
                                                infix_finditer=do_nothing,
                                                token_match=model_tokenizer.token_match)
Note: I am passing vocab from nlp_blank, which is not really blank. This vocab object has English language rules and strategies that help our blank model tag POS and tokenize English words, so spaCy begins with a slight advantage. Spark-NLP doesn't know anything about the English language beforehand.
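A quick way to sanity-check the relaxed infix rule is to tokenize a hyphenated phrase and confirm it stays together (the sentence is made up; the exact output depends on the rules set above):

doc = nlp_blank(u"Davison bought a large-screen TV for $1,000 yesterday.")
print([token.text for token in doc])
# With infix splitting disabled, "large-screen" should come back as a single token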
Training pipelines
Proceeding with the training, in spaCy I need to provide a specific training data format, which follows this shape:
TRAIN_DATA = [
    ("I like green eggs", {'tags': ['N', 'V', 'J', 'N']}),
    ("Eat blue ham", {'tags': ['V', 'J', 'N']})
]
Whereas in Spark-NLP, I have to provide a folder of .txt files containing delimited word|tag data, which looks just like the ANC training data. So, I am just passing the path to the POS tagger, which is called PerceptronApproach.
Let's load the training data for spaCy. Bear with me, as I have to add a few manual exceptions and rules for some characters, since spaCy's training expects clean content.
spaCy
start = time.time()
train_path = "./target/training/"
train_files = sorted([train_path + f for f in os.listdir(train_path) if os.path.isfile(os.path.join(train_path, f))])

# Build spaCy's expected training format: (sentence, {'tags': [...]}) pairs
TRAIN_DATA = []
for file in train_files:
    fo = io.open(file, mode='r', encoding='utf-8')
    for line in fo.readlines():
        line = line.strip()
        if line == '':
            continue
        line_words = []
        line_tags = []
        for pair in re.split("\\s+", line):
            tag = pair.strip().split("|")
            line_words.append(re.sub(r'(\w+)\.', r'\1', tag[0].replace('$', '').replace('-', '').replace('\'', '')))
            line_tags.append(tag[-1])
        TRAIN_DATA.append((' '.join(line_words), {'tags': line_tags}))
    fo.close()

# Manual fix for a sentence that otherwise trips up the tagger during training
TRAIN_DATA[240] = ('The company said the one time provision would substantially eliminate all future losses at the unit .',
                   {'tags': ['DT', 'NN', 'VBD', 'DT', 'JJ', '-', 'NN', 'NN', 'MD', 'RB', 'VB', 'DT', 'JJ', 'NNS', 'IN', 'DT', 'NN', '.']})

n_iter = 5
tagger = nlp_blank.create_pipe('tagger')
tagger.add_label('-')
tagger.add_label('(')
tagger.add_label(')')
tagger.add_label('#')
tagger.add_label('...')
tagger.add_label("one-time")
nlp_blank.add_pipe(tagger)

optimizer = nlp_blank.begin_training()
for i in range(n_iter):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp_blank.update([text], [annotations], sgd=optimizer, losses=losses)
    print(losses)
print(time.time() - start)
Runtime
{'tagger': 5.773235303101046}
{'tagger': 1.138113870966123}
{'tagger': 0.46656132966405683}
{'tagger': 0.5513760568314119}
{'tagger': 0.2541630900934435}
Time to run: 122.11359786987305 seconds
I had to do some field work in order to bypass a few hurdles. The training wouldn't let me pass my tokenizer words, which contain some ugly characters within (e.g., it won't let you train a sentence with a token "large-screen" or "No." unless it exists in the vocab labels). So, I had to add those characters to the list of labels for the training to work once it encounters them.
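Once training finishes, the blank pipeline behaves like any other spaCy model, so a quick sanity check looks like this (the sentence is made up; the tags are whatever the freshly trained tagger predicts):

doc = nlp_blank(u"Davison doubts the efficacy of medications .")
print([(token.text, token.tag_) for token in doc])
# Each token is paired with its predicted Penn Treebank tag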
Let's see how to construct a pipeline in Spark-NLP.
Spark-NLP
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")
    .setPrefixPattern("\\A([^\\s\\p{L}\\d\\$\\.#]*)")
    .addInfixPattern("(\\$?\\d+(?:[^\\s\\d]{1}\\d+)*)")

val posTagger = new PerceptronApproach()
    .setInputCols("document", "token")
    .setOutputCol("pos")
    .setCorpusPath("/home/saif/nlp/comparison/target/training")
    .setNIterations(5)

val finisher = new Finisher()
    .setInputCols("token", "pos")
    .setOutputAsArray(true)

val pipeline = new Pipeline()
    .setStages(Array(
        documentAssembler,
        tokenizer,
        posTagger,
        finisher
    ))

val model = Benchmark.time("Time to train model") {
    pipeline.fit(data)
}
As you can see, constructing a pipeline is a quite linear process: you set the document assembler, which makes the target text column available to the next annotator, the tokenizer; then, the PerceptronApproach is the POS model, which takes as inputs both the document text and the tokenized form.
I had to update the prefix pattern and add a new infix pattern to match dates and numbers the same way ANC does (this will probably be made default in the next release). As you can see, every component of the pipeline is under control of the user; there is no implicit vocab or English knowledge, as opposed to spaCy.
The corpusPath from PerceptronApproach points to the folder containing the pipe-delimited text files, and the finisher annotator wraps up the results of the POS tags and tokens so they are useful in the next steps. setOutputAsArray() will return, as its name says, an array instead of a concatenated string, although that has some cost in processing.

The data passed to fit() does not really matter, since the only NLP annotator being trained is the PerceptronApproach, and it is trained with the external POS corpus.
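To make that concrete, the DataFrame handed to fit() can be as small as a single row with a text column. In PySpark (a sketch; the Scala equivalent is just as short), that would be something like:

# Any DataFrame with a "text" column will do here, since PerceptronApproach
# learns from the external corpus path rather than from this data
data = spark.createDataFrame([("dummy text",)], ["text"])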
Runtime
Time to train model: 3.167619593sec
As a side note, it would be possible to inject a SentenceDetector or a SpellChecker into the pipeline, which in some scenarios might help the accuracy of the POS tagger by letting the model know where a sentence ends.
What’s next?
So far, we have initialized the libraries, loaded the data, and trained a tokenizer model using each one. Note that spaCy comes with pretrained tokenizers, so this step may not be necessary if your text data is from a language (e.g., English) and domain (e.g., news articles) it was trained on, though the tokenizer infix alteration is significant in order to make tokens more likely to match our ANC corpus. Training was more than 38 times faster on Spark-NLP for about five iterations.
In the next installment in the blog series, we will walk through the code, accuracy, and performance for running this NLP pipeline using the models we’ve just trained.