Words frequently have different meanings, and this is evident even in the short description of Sphinx itself. We used to call it a full-text search engine, which is a standard term in the IT knowledge domain. Nevertheless, this occasionally delivered the wrong impression of Sphinx being either a Google-competing web service, or an embeddable software library that only hardened C++ programmers would ever manage to implement and use. So nowadays, we tend to call Sphinx a search server to stress that it’s a suite of programs running on your hardware that you use to implement and maintain full-text searches, similar to how you use a database server to store and manipulate your data. Sphinx can serve you in a variety of different ways and help with quite a number of search-related tasks, and then some. Data sets range from just a few blog posts to web-scale collections that contain billions of documents; workload levels vary from a few searches per day on a deserted personal website to about 200 million queries per day on Craigslist; and query types range from simple quick queries that need to return the top 10 matches on a given keyword to sophisticated analytical queries used for data mining tasks that combine thousands of keywords into a complex text query and add a few nontext conditions on top. So, there are a lot of things that Sphinx can do, and therefore a lot to discuss. But before we begin, let’s ensure that we’re on the same page in our dictionaries, and that the words I use mean the same to you, the reader.
Before exploring Sphinx in particular, let’s begin with a quick overview of searching in general, and make sure we share an understanding of the common terms.
Searching in general can be formally defined as choosing a subset of entries that match given criteria from a complete data set. This is clearly too vague for any practical use, so let’s look at the field to create a slightly more specific job description.
Whatever unit of text you want to return is your document. A newspaper or journal may have articles, a government agency may have memoranda and notices, a content management system may have blogs and comments, and a forum may have threads and messages. Furthermore, depending on what people want in their search results, searchable documents can be defined differently. It might be desirable to find blog postings by comments, and so a document on a blog would include not just the post body but also the comments. On the other hand, matching an entire book by keywords is not of much use, and using a subsection or a page as a searchable unit of text makes much more sense. Each individual item that can come up in a search result is a document.
Instead of storing the actual text it indexes, Sphinx creates a full-text index that lets it efficiently search through that text. Sphinx can also store a limited amount of attached string data if you explicitly tell it to. Such data could contain the document’s author, format, date of creation, and similar information. But, by default, the indexed text itself does not get stored. Under certain circumstances, it’s possible to reconstruct the original text from the Sphinx index, but that’s a complicated and computationally intensive task.
Thus, Sphinx stores a special data structure that represents the things we want to know about the document in a compressed form. For instance, because the word “programmer” appears over and over in this chapter, we wouldn’t want to store each occurrence in the database. That not only would be a waste of space, but also would fail to record the information we’re most interested in. Instead, our database would store the word “programmer” along with some useful statistics, such as the number of times it occurs in the document or the position it occupies each time.
Those journal articles, blog posts and comments, and other entities would normally be stored in a database. And, in fact, relational database terminology correlates well with a notion of the document in a full-text search system.
In a database, your data is stored in tables where you predefine a set of columns (ID, author, content, price, etc.) and then insert, update, or delete rows with data for those columns. Some of the data you store—such as author, price, or publication date—might not be part of the text itself; this metadata is called an attribute in Sphinx. Sphinx’s full-text index is roughly equivalent to your data table, the full-text document is your row, and the document’s searchable fields and attached attributes are your columns.
Database table ≈ Sphinx index
Database rows ≈ Sphinx documents
Database columns ≈ Sphinx fields and attributes
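To make the analogy concrete, here is a minimal sketch of a sphinx.conf fragment, with hypothetical table, column, and index names: the columns selected in sql_query become the document ID and the full-text fields, while the columns additionally declared via sql_attr_* directives become attributes.

    source blog_src
    {
        type      = mysql
        sql_host  = localhost
        sql_user  = test
        sql_pass  =
        sql_db    = blog
        # the first column is the document ID; title and content
        # become full-text fields; the declared columns become attributes
        sql_query = SELECT id, title, content, author_id, price FROM posts
        sql_attr_uint  = author_id
        sql_attr_float = price
    }

    index blog
    {
        source = blog_src
        path   = /var/data/blog
    }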
So, in these terms, how does a search query basically work—from a really high-level perspective?
When processing the user’s request, Sphinx uses a full-text index to quickly look at each full-text match, that is, a document that matches all the specified keywords. It can then examine additional, nonkeyword-based searching conditions, if any, such as a restriction by blog post year, product price range, and so forth, to see whether the document should be returned. The document currently being examined is called a candidate document. Candidates that satisfy all the search criteria, keyword-based or not, are called matches. (Obviously, if there are no additional restrictions, all full-text matches simply become matches.) Matches are then ranked, that is, Sphinx computes and attaches a certain relevance value, orders matches by that value, and returns the top N best matches to a calling application. Those top N most relevant matches (the top 1,000 by default) are collectively called a result set.
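For instance, in SphinxQL, the SQL dialect that Chapter 4 covers in detail, such a search could be sketched as follows (the index and attribute names are hypothetical); matches come back ordered by their relevance value unless you request a different order:

    SELECT * FROM blog
    WHERE MATCH('hello world') AND post_year >= 2000
    LIMIT 10;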
Why not just store the document data and then look for keywords in it when doing the searching? The answer is very simple: performance.
Looking for a keyword in document data is like reading an entire book cover to cover while watching out for keywords you are interested in. Books with concordances are much more convenient: with a concordance you can look up pages and sentences you need by keyword in no time.
The full-text index over a document collection is exactly such a concordance. Interestingly, that’s not just a metaphor, but a pretty accurate or even literally correct description. The most efficient approach to maintaining full-text indexes, called inverted files and used in Sphinx as well as most other systems, works exactly like a book’s index: for every given keyword, the inverted file maintains a sorted list of document identifiers, and uses that to match documents by keyword very quickly.
In order to meet modern users’ expectations, search engines must offer more than searches for a string of words. They allow relationships to be specified through a query language whose syntax allows for special search operators. For instance, virtually all search engines recognize the keywords AND and NOT as Boolean operators. Other examples of query language syntax will appear as we move through this chapter.

There is no standard query language, especially when it comes to more advanced features. Every search system uses its own syntax and defaults. For example, Google and Sphinx default to AND as an implicit operator, that is, they try to match all keywords by default; Lucene defaults to OR and matches any of the keywords submitted.
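For example, the very same two-keyword query

black cat

matches only documents that mention both “black” and “cat” when submitted to Sphinx or Google, but matches documents mentioning either keyword when submitted to Lucene with its default settings.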
Search engines use two types of criteria for matching documents to the user’s search.
Logical conditions return a Boolean result based on an expression supplied by the user.
Logical expressions can get quite complex, potentially involving multiple columns, mathematical operations on columns, functions, and so on. Examples include:
price<100
LENGTH(title)>=20
(author_id=123 AND YEAROF(date_added)>=2000)
Both text, such as the title in the second example, and metadata, such as the date_added in the third example, can be manipulated by logical expressions. The third example illustrates the sophistication permitted by logical expressions. It includes the AND Boolean operator, the YEAROF function that presumably extracts the year from a date, and two mathematical comparisons.
Optional additional conditions of a full-text criterion can be imposed based on either the existence or the nonexistence of a keyword within a row (cat AND dog BUT NOT mouse), or on the positions of the matching keywords within a matching row (a phrase search for “John Doe”).
Because a logical expression evaluates to a Boolean true or false result, we can compute that result for every candidate row we’re processing, and then either include or exclude it from the result set.
The full-text type of search breaks down into a number of subtypes, applicable in different scenarios. These all fall under the general category of keyword searching.
- Boolean search
This is a kind of logical expression, but full-text queries use a narrower range of conditions that simply check whether a keyword occurs in the document. For example, cat AND dog, where AND is a Boolean operator, matches every document that mentions both “cat” and “dog,” no matter where the keywords occur in the document. Similarly, cat AND NOT dog, where NOT is also an operator, will match every document that mentions “cat” but does not mention “dog” anywhere.
- Phrase search
This helps when you are looking for an exact match of a multiple-keyword quote such as “To be or not to be,” instead of just trying to find each keyword by itself in no particular order. The de facto standard syntax for phrase searches, supported across all modern search systems, is to put quotes around the query (e.g., “black cat”). Note how, in this case, unlike in just Boolean searching, we need to know not only that the keyword occurred in the document, but also where it occurred. Otherwise, we wouldn’t know whether “black” and “cat” are adjacent. So, for phrase searching to work, we need our full-text index to store not just keyword-to-document mappings, but keyword positions within documents as well.
- Proximity search
This is even more flexible than phrase searching, using positions to match documents where the keywords occur within a given distance to one another. Specific proximity query syntaxes differ across systems. For example, a proximity query in Sphinx would look like this:

"cat dog"~5

This means “find all documents where ‘cat’ and ‘dog’ occur within the same five keywords.”
- Field-based search
This is also known as field searching. Documents almost always have more than one field, and programmers frequently want to limit parts of a search to a given field. For example, you might want to find all email messages from someone named Peter that mention MySQL in the subject line. Syntaxes for this differ; the Sphinx phrase for this one would be:

@from Peter @subject MySQL
Most search systems let you combine these query types (or subquery types, as they are sometimes called) in the query language.
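For instance, the Sphinx subquery types just shown can be freely mixed in a single query; a sketch:

@title "black cat" -mouse

This matches documents whose title field contains the exact phrase “black cat” but does not contain the keyword “mouse” (the @title scope extends until the next field operator, if any).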
One can think of these two types of searches as follows: logical criteria use entire columns as values, while full-text criteria implicitly split the text columns into arrays of words, and then work with those words and their position, matching them to a text query.
This isn’t a mathematically correct definition. One could immediately argue that, as long as our “logical” criterion definition allows us to use functions, we can introduce a function EXPLODE() that takes the entire column as its argument and returns an array of word-position pairs. We could then express all full-text conditions in terms of set-theoretical operations over the results of EXPLODE(), therefore showing that all “full-text” criteria are in fact “logical.” A completely unambiguous distinction in the mathematical sense would be 10 pages long, but because this book is not a Ph.D. dissertation, I will omit the 10-page definition of an EXPLODE() class of functions, and just keep my fingers crossed that the difference between logical and full-text conditions is clear enough here.
Natural language processing (NLP) works very differently from keyword searches. NLP tries to capture the meaning of a user query, and answer the question instead of merely matching the keywords. For example, the query what POTUS number was JFK would ideally match a document saying “John Fitzgerald Kennedy, 35th U.S. president,” even though it does not have any of the query keywords.
Natural language searching is a field with a long history that is still evolving rapidly. Ultimately, it is all about so-called semantic analysis, which means making the machine understand the general meaning of documents and queries, an algorithmically complex and computationally difficult problem. (The hardest part is the general semantic analysis of lengthy documents when indexing them, as search queries are typically rather short, making them a lot easier to process.)
NLP is a field of science worth a bookshelf in itself, and it is not the topic of this book. But a high-level overview may help to shine light on general trends in search. Despite the sheer complexity of the problem, a number of different techniques for tackling it have already been developed.
Of course, general-purpose AI that can read a text and understand it is very hard, but a number of handy and simple tricks based on regular keyword searching and logical conditions can go a long way. For instance, we might detect “what is X” queries and rewrite them in “X is” form. We can also capture well-known synonyms, such as JFK, and replace them with jfk OR (john AND kennedy) internally. We can make even more assumptions when implementing a specific vertical search. For instance, the query 2br in reading on a property search website is pretty unambiguous: we can be fairly sure that “2br” means a two-bedroom apartment, and that the “in reading” part refers to a town named Reading rather than the act of reading a book, so we can adjust our query accordingly—say, replace “2br” with a logical condition on the number of bedrooms, and limit “reading” to location-related fields so that “reading room” in a description would not interfere.
Technically, this kind of query processing is already a form of query-level NLP, even though it is very simple.
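For instance, the JFK trick could be expressed directly in Sphinx’s query syntax, where | is the OR operator and AND is implicit; a hypothetical application-side rewrite might turn the user’s query

JFK assassination

into

(JFK | (john kennedy)) assassination

before sending it on to the search server.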
Search engines break down both documents and query text into particular keywords. This is called tokenization, and the part of the program doing it is called a tokenizer (or, sometimes, word breaker). Seemingly straightforward at first glance, tokenization has, in fact, so many nuances that, for example, Sphinx’s tokenizer is one of its most complex parts.
The complexity arises out of a number of cases that must be handled. The tokenizer can’t simply pay attention to English letters (or letters in any language), and consider everything else to be a separator. That would be too naïve for practical use. So the tokenizer also handles punctuation, special query syntax characters, special characters that need to be fully ignored, keyword length limits, and character translation tables for different languages, among other things.
We’re saving the discussion of Sphinx’s tokenizer features for later (a few of the most common features are covered in Chapter 3; a full discussion of all the advanced features is beyond the scope of this book), but one generic feature deserves to be mentioned here: tokenizing exceptions. These are individual words that you can anticipate must be treated in an unusual way. Examples are “C++” and “C#,” which would normally be ignored because individual letters aren’t recognized as search terms by most search engines, while punctuation such as plus signs and number signs are ignored. You want people to be able to search on C++ and C#, so you flag them as exceptions. A search system might or might not let you specify exceptions. This is no small issue for a jobs website whose search engine needs to distinguish C++ vacancies from C# vacancies and from pure C ones, or a local business search engine that does not want to match an “AT&T” query to the document “T-Mobile office AT corner of Jackson Rd. and Johnson Dr.”
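In Sphinx, for one, such exceptions are listed in a plain-text file that the index configuration points to; a sketch, with hypothetical file locations and mappings:

    index jobs
    {
        source     = jobs_src
        path       = /var/data/jobs
        exceptions = /usr/local/sphinx/etc/exceptions.txt
    }

where exceptions.txt maps raw token sequences to the keywords that actually get indexed, one mapping per line:

    C++ => c++
    C# => c#
    AT&T => at&t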
Sphinx currently supports most common linguistics requirements, such as stemming (finding the root in words) and keyword substitution dictionaries. In this section, we’ll explain what a language processor such as Sphinx can do for you so that you understand how to configure it and make the best use of its existing features as well as extend them if needed.
One important step toward better language support is morphology processing. We frequently want to match not only the exact keyword form, but also other forms that are related to our keyword—not just “cat” but also “cats”; not just “mouse” but also “mice”; not just “going” but also “go,” “goes,” “went,” and so on. The set of all the word forms that share the same meaning is called the lexeme; the canonical word form that the search engine uses to represent the lexeme is called the lemma. In the three examples just listed, the lemmas would be “cat,” “mouse,” and “go,” respectively. All the other variants of the root are said to “ascend” to this root. The process of converting a word to its lemma is called lemmatization (no wonder).
Lemmatization is not a trivial problem in itself, because natural languages do not strictly follow fixed rules, meaning they are rife with exceptions (“mice were caught”), tend to evolve over time (“i am blogging this”), and last but not least, are ambiguous, sometimes requiring the engine to analyze not only the word itself, but also a surrounding context (“the dove flew away” versus “she dove into the pool”). So an ideal lemmatizer would need to combine part-of-speech tagging, a number of algorithmic transformation rules, and a dictionary of exceptions.
That’s pretty complex, so frequently, people use something simpler—namely, so-called stemmers. Unlike a lemmatizer, a stemmer intentionally does not aim to normalize a word into an exactly correct lemma. Instead, it aims to output a so-called stem, which is not even necessarily a correct word, but is chosen to be the same for all the words—and only those words—that ascend to a given morphological root. Stemmers, for the sake of performance, typically apply only a small number of processing rules; have only a few, if any, prerecorded exceptions; and ultimately do not aim to achieve 100 percent correct normalization.
The most popular stemmer for the English language is the Porter stemmer, developed by Martin Porter in 1979. Although pretty efficient and easy to implement, it suffers from normalization errors. One notorious example is the stemmer’s reduction of “business” and “busy” to the same stem “busi,” even though they have very different meanings and we’d rather keep them separate. This is, by the way, an example of how exceptions in natural language win the fight against rules: many other words are formed from a verb using a “-ness” suffix (“awareness”, “forgiveness”, etc.) and properly reduce to an original verb, but “business” is an exception. A smart lemmatizer would be able to keep “business” as a form on its own.
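In Sphinx terms, for example, stemming is a per-index morphology setting, and a wordforms dictionary can override the stemmer for known exceptions; a minimal sketch with hypothetical names:

    index books
    {
        source     = books_src
        path       = /var/data/books
        morphology = stem_en
        wordforms  = wordforms.txt
    }

where wordforms.txt could contain a line such as

    business > business

which, because wordforms are applied before (and instead of) morphology for the words they list, keeps “business” from being stemmed down to “busi.”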
An even smarter lemmatizer would know that “the dove flew away” talks about a pigeon, and not diving. And this seemingly simple sample brings in a number of other linguistic concepts.
First, “dove” is a synonym for “pigeon.” The words are different, but the meaning is similar or even almost identical, and that’s exactly what synonyms are. Ornithologists can quibble, but in popular usage, these words are used interchangeably for many of the same kinds of birds. Synonyms can be less exact, such as “sick” and “ill,” or “acquisitions” and “purchases,” or they can be as complex as the pair “put up the white flag” and “surrender.”
Second, “dove” the noun is also a homonym for the simple past form of “dive” the verb. Homonyms are words that are spelled the same but have different meanings.
Third, in this example, we can’t really detect whether it’s “dove” the noun or “dove” the verb by the word itself. To do that, we need to perform part-of-speech (POS) tagging. That is, we need to analyze the entire sentence and find out whether the “dove” was a subject, a predicate, or something else—all of that to normalize our “dove” to a proper form.
Homonyms can, in fact, be an even bigger problem. POS tagging will not help to distinguish a “river bank” from a “savings bank” because both banks here are nouns. The process of telling one bank from the other is called word-sense disambiguation (WSD) and is (you bet) another open problem in computational linguistics.
Text processing of this depth is, of course, rather expensive in terms of both development costs and performance. So most of the currently available systems are limited to simpler functionality such as stemming or lemmatization, and do not do complex linguistic processing such as POS tagging or WSD. Major web search engines are one notable exception, as they strive for extreme quality—which brings us to the subject of relevance ranking.
Assume that we just found 1 million documents that match our query. We can’t even glance at all of them, so we need to further narrow down our search somehow. We might want the documents that match the query “better” to be displayed first. But how does the search engine know that document A is better than document B with regard to query Q?
It does so with the aid of relevance ranking, which computes a certain relevance value, or weight, for every given document and given query. This weight can then be used to order matching documents.
Ranking is an open problem, and actually a rather tough one. Basically, different people can and do judge different documents as relevant or irrelevant to the same query. That means there can’t be a single, ideal, one-size-fits-all relevance function that will always put an “ideal” result in the first position. It also means that generally better ranking can ultimately be achieved only by looking at lots of human-submitted grades, and trying to learn from them.
On the high end, the amount of data to process can be vast, with every document having hundreds or even thousands of ranking factors, some of which vary with every query, multiplied by millions of prerecorded human assessors’ judgments, yielding billions of values to crunch on every given iteration of a gradient descent quest for a Holy Grail of 0.01 percent better relevance. So, manually examining the grade data cannot possibly work and an improved relevance function can realistically be computed only with the aid of state-of-the-art machine learning algorithms. Then the resultant function itself has to be analyzed using so-called quality metrics, because playing “hot or not” through a million grades assigned to each document and query isn’t exactly realistic either. The bottom line is that if you want to join the Bing search quality group, learn some math, preferably lots of it, and get used to running lots of human factors labs.
At the lower end, not everyone needs all that complexity, and a simple, grokable relevance function can suffice. But you still want to know how ranking works in Sphinx, what can be tweaked, and how to evaluate the results of your tweaking.
There’s a lot to relevance in general, so I’ll dedicate a separate chapter to discussing all things ranking, including the nitty-gritty details of Sphinx ranking. For the purposes of this overview, let me just mention that Sphinx supports several ranking functions, lets you choose among them on the fly, lets you tweak the outcome, and is friendly to people trying to hack new such functions into it. And, judged by quality metrics, some of its rankers come closer to the high end than those of most other search engines.
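For a taste of it, in SphinxQL the ranking function can be switched on a per-query basis; a sketch with a hypothetical index name:

    SELECT * FROM books WHERE MATCH('black cat')
    OPTION ranker = bm25;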
Exaggerating a bit, relevance ranking is the only thing that general web search engine developers care about, because their end users only want a few pages that answer their query best, and that’s it. Nobody sorts web pages by dates, right?
But for applications that most of us work on, embedded in more complex end-user tasks, additional result set processing is also frequently involved. You don’t want to display a random iPhone to your product search engine user; he looks for the cheapest one in his area. You don’t display a highly relevant article archived from before you were born as your number one news search result, at least not on the front page; the end user is likely searching for slightly fresher data. When there are 10,000 matches from a given site, you might want to cluster them. Searches might need to be restricted to a particular subforum, or an author, or a site. And so on.
All this calls for result set postprocessing. We find the matches and rank them, like a web search engine, but we also need to filter, sort, and group them. Or, in SQL syntax, we frequently need additional WHERE, ORDER BY, and GROUP BY clauses on top of our search results.
Search engines frequently grow out of the task of indexing and searching web pages, and might not support postprocessing at all, might support only an insufficient subset, might perform poorly, or might consume too many resources. Such search engines focus on, and mostly optimize for, relevance-based ordering. But in practice, it’s definitely not enough to benchmark whether the engine quickly returns the first 10 matches sorted by relevance. Scanning 10,000 matches and ordering them by, say, price can result in a jaw-dropping difference in performance figures.
Sphinx, on the other hand, was designed to index content stored in a database from day one, and now it supports arithmetic expressions, WHERE, ORDER BY, and GROUP BY in full, very efficiently. In fact, Sphinx supports those functions literally: you can use good old SQL syntax to express your queries (refer to Chapter 4 for a detailed discussion). Moreover, Sphinx-side processing is so efficient that it can outperform a database on certain general (not just full-text!) SQL query types.
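For example, a single SphinxQL query can combine full-text matching with all three clauses at once; a sketch with hypothetical index and attribute names:

    SELECT *, COUNT(*) FROM products
    WHERE MATCH('iphone') AND price < 300
    GROUP BY vendor_id
    ORDER BY price ASC
    LIMIT 20;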
A search engine must maintain a special data structure in order to process search queries quickly. This type of structure is called a full-text index. Unsurprisingly, there’s more than one way to implement this.
In terms of storage, the index can be stored on disk or exist only in RAM. When on disk, it is typically stored in a custom file format, but sometimes engines choose to use a database as a storage backend. The latter usually performs worse because of the additional database overhead.
The most popular conceptual data structure is a so-called inverted file, which consists of a dictionary of all keywords, a list of document IDs, and a list of the positions in the documents for every keyword. All this data is kept in sorted and compressed form, allowing for efficient queries.
The reason for keeping the position is to find out, for instance, that “John” and “Kennedy” occur side by side or very close to each other, and therefore are likely to satisfy a search for that name. Inverted files that keep keyword positions are called word-level indexes, while those that omit the positions are document-level indexes. Both kinds can store additional data along with document IDs—for instance, storing the number of keyword occurrences lets us compute statistical text rankings such as BM25. However, to implement phrase queries, proximity queries, and more advanced ranking, a word-level index is required.
Lists of keyword positions can also be called occurrence lists, postings lists, or hit lists. We will mostly use “document lists” and “hit lists” in the following description.
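As an illustration, the word-level inverted file for a tiny collection might conceptually look like this, with sorted document IDs followed by in-document keyword positions in parentheses:

    cat   -> 1(3,17), 4(2), 7(5)
    dog   -> 1(9), 7(6)
    mouse -> 4(11)

A document-level index would keep only the document lists (1, 4, 7, and so on) and drop the parenthesized hit lists. Note that it’s the hit lists that tell us the phrase “cat dog” matches document 7, where the positions 5 and 6 are adjacent, but not document 1.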
Note
Another index structure, nowadays more of a historical than a practical interest, is a signature file, which keeps a bit vector of matching documents for every keyword. Signature files are very quick at answering Boolean queries with frequent keywords. However, for all the other types of queries, inverted files perform better. Also, signature files cannot contain keyword positions, meaning they don’t support phrase queries and they have very limited support for text-based ranking (even the simple and classic BM25 is barely possible). That’s a major constraint.
Depending on the compression scheme used, document-level indexes can be as compact as 7 to 10 percent of the original text size, and word-level indexes 30 to 40 percent of the text size. But in a full-text index, smaller is not necessarily better. First, more complex compression schemes take more CPU time to decompress, and might result in overall slower querying despite the savings in I/O traffic. Second, a bigger index might contain redundant information that helps specific query types. For instance, Sphinx keeps a redundant field mask in its document lists that consumes extra disk space and I/O time, but lets a fielded query quickly reject documents that match the keyword in the wrong field. So the Sphinx index format is not as compact as possible, consuming up to 60 to 70 percent of the text size at the time of this writing, but that’s a conscious trade-off to get better querying speed.
Indexes also might carry additional per-keyword payloads such as morphological information (e.g., a payload attached to a root form can be an identifier of a particular specific word form that was reduced to this root), or keyword context such as font size, width, or color. Such payloads are normally used to improve relevance ranking.
Last but not least, an index format might allow for either incremental updates of the indexed data, or nonincremental index rebuilds only. An incremental index format can take partial data updates after it’s built; a nonincremental one is essentially read-only after it’s built. That’s yet another trade-off, because structures allowing incremental updates are harder to implement and maintain, and therefore experience lower performance during both indexing and searching.
Sphinx currently supports two indexing backends that combine several of the features we have just discussed:
Our most frequently used “regular” disk index format defaults to an on-disk, nonincremental, word-level inverted file. To avoid tedious rebuilds, you can combine multiple indexes in a single search, and do frequent rebuilds only on a small index with recently changed rows. Setting that up is discussed in detail in Chapter 5.
That disk index format also lets you omit hit lists for either some or all keywords, leading to either a partial word-level index or a document-level index, respectively. This is essentially a performance versus quality trade-off.
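A minimal sketch of the multi-index scheme just mentioned, with hypothetical names: a big index over the complete data set rebuilt rarely, a small index over recently changed rows rebuilt often, and both queried together in one search:

    index main
    {
        source = src_main       # complete data set, rebuilt nightly
        path   = /var/data/main
    }

    index delta
    {
        source = src_delta      # recently changed rows only, rebuilt often
        path   = /var/data/delta
    }

and then, at search time, in SphinxQL:

    SELECT * FROM main, delta WHERE MATCH('hello world');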
The other Sphinx indexing backend, called the RT (for “real time”) index, is a hybrid solution that builds upon regular disk indexes, but also adds support for in-memory, incremental, word-level inverted files. So we try to combine the best of both worlds, that is, the instant incremental update speed of in-RAM indexes and the large-scale searching efficiency of on-disk nonincremental indexes.
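With an RT index there is no separate indexing step at all: the application pushes rows in through SphinxQL, and they become searchable immediately. A sketch, assuming a hypothetical RT index rt_posts with two fields and a float attribute:

    INSERT INTO rt_posts (id, title, content, price)
    VALUES (123, 'hello', 'hello world', 9.95);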
We’ve just done a 30,000-foot overview of different search-related areas. A modern scientific discipline called Information Retrieval (IR) studies all the areas we mentioned, and more. So, if you’re interested in learning about the theory and technology of modern search engines, including Sphinx, all the way down to the slightest details, IR books and papers are what you should refer to.
In this book we’re focusing more on practice than on theory, that is, how to use Sphinx in scenarios of every kind. So, let’s briefly review those scenarios.
Sphinx is a search engine and not a full-blown database just yet, so the raw data to be indexed is generally stored elsewhere. Usually you’d have an existing SQL database, or a collection of XML documents that you need indexed. When SQL and XML aren’t efficient enough, the data might be stored in a custom data warehouse. In all these cases, we’re talking about structured data that has preidentified text fields and nontext attributes. The columns in an SQL database and the elements in an XML document both impose some structure. The Sphinx document model is also structured, making it very easy to index and search such data. For instance, if your documents are in SQL, you just tell Sphinx what rows to fetch and what columns to index.
In the case of unstructured data, you will have to impose some structure yourself. When given a bunch of DOC, PDF, MP3, and AVI files, Sphinx is not able to automatically identify types, extract text based on type, and index that text. Instead, Sphinx needs you to pass the text and assign the field and attribute names. So you can still use it with unstructured data, but extracting the structure is up to you.
One extra requirement that Sphinx puts on your data is that every unit of data must have a unique integer document identifier, a.k.a. docID; a string won’t do. Rows in the database frequently come with the necessary identifier when their primary key (PK) is an integer. It’s not a big deal when they don’t; you can generate docIDs for Sphinx on the fly and store your string PK from the database (or XML document name) as an attribute.
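For example, assuming a MySQL source whose PK is a string GUID, one quick-and-dirty sketch is to derive the docID with a hash function and keep the original key as a string attribute:

    sql_query = SELECT CRC32(guid) AS id, guid, title, content FROM docs
    sql_attr_string = guid

Beware that CRC32 can collide on large collections, so a dedicated integer mapping table is the more robust choice.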
Different indexing approaches are best for different workflows. In a great many scenarios, it’s sufficient to perform batch indexing, that is, to occasionally index a chunk of data. The batches being indexed might contain either the complete data, which is called full reindexing, or just the recently changed data, which is delta reindexing.
Although batching sounds slow, it really isn’t. Reindexing a delta batch with a cron job every minute, for instance, means that new rows will become searchable in 30 seconds on average, and no more than 60 seconds. That’s usually fine, even for such a dynamic application as an auction website.
When even a few seconds of delay is not an option, and data must become searchable instantly, you need online indexing, a.k.a. real-time indexing. Sometimes this is referred to as incremental indexing—though that isn’t entirely formally correct.
Sphinx supports both approaches. Batch indexing is generally more efficient, but real-time indexing comes with a smaller indexing delay, and can be easier to maintain.
When there’s just too much data for a single CPU core to handle, indexes will need to be sharded or partitioned into several smaller indexes. When there’s way too much data for a single machine to handle, some of the data will have to be moved to other machines, and an index will have to become distributed across machines. This isn’t fully automatic with Sphinx, but it’s pretty easy to set up.
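Distribution is, again, mostly a matter of configuration; a sketch of a distributed index with one local and one remote shard (host, port, and index names hypothetical):

    index dist1
    {
        type  = distributed
        local = shard1
        agent = box2:9312:shard2
    }

Searching dist1 then queries both shards in parallel and merges the results.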
Finally, batch indexing does not necessarily need to be done on the same machine as the searches. It can be moved to a separate indexing server—either to avoid impacting searches while indexing takes place, or to avoid redundant indexing when several index replicas are needed for failover.
Sphinx adds a few terms to the regular RDBMS vocabulary, and it’s essential to understand them. A relational database basically has tables, which consist of rows, which in turn consist of columns, where every column has a certain type, and that’s pretty much it. Sphinx’s full-text index also has rows, but they are called documents, and—unlike in the database—they are required to have a unique integer primary key (a.k.a. ID).
As we’ve seen, documents often come with a lot of metadata such as author information, publication date, or reviewer ranking. I’ve also explained that using this metadata to retrieve and order documents usefully is one of the great advantages of using a specialized search engine such as Sphinx. The metadata, or “attributes,” are stored simply alongside the full-text fields.
Sphinx doesn’t store the exact text of a document, but indexes it and stores the necessary data to match queries against it. In contrast, attributes are handled fairly simply: they are stored in their index fields verbatim, and can later be used for additional result set manipulation, such as sorting or grouping.
Thus, if you are indexing a table of book abstracts, you probably want to declare the book title and the abstract as full-text fields (to search through them using keywords), while declaring the book price, the year it was published, and similar metadata as attributes (to sort keyword search results by price or filter them by year).
The way searches are performed is closely tied to the indexing architecture, and vice versa. In the simplest case, you would “just search”—that is, run a single search query on a single locally available index. When there are multiple indexes to be searched, the search engine needs to handle a multi-index query. Performing multiple search queries in one batch is a multi-query.
Search queries that utilize multiple cores on a single machine are parallelized—not to be confused with plain queries running in parallel with each other. Queries that need to reach out to other machines over the network are distributed.
Sphinx can do two major functional groups of search queries. First and foremost are full-text queries that match documents to keywords. Second are full scans, or scan queries, which loop through the attributes of all indexed documents and match them by attributes instead of keywords. An example of a scan is searching by just date range or author identifier and no keywords. When there are keywords to search for, Sphinx uses a full-text query.
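In SphinxQL terms, the difference is simply whether the query carries a MATCH() condition; a sketch with hypothetical names:

    SELECT * FROM posts WHERE MATCH('sphinx') AND author_id = 123;
    SELECT * FROM posts WHERE author_id = 123;

The first query is a full-text query with an extra attribute condition; the second, having no keywords at all, runs as a full scan.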
One can emulate scans by attaching a special keyword to every row and searching for that keyword. Scans were introduced by user request when it turned out that, in some cases, even that emulated approach was more efficient than an equivalent SQL query against a database server.
Full-text queries can, in turn, either be just simple bags of words, or utilize the query syntax that Sphinx provides.
Queries that Sphinx sees are not necessarily exactly what the end user types in the search box. And correspondingly, both the search box and the results the end user sees might not be exactly what come out of Sphinx. You might choose to preprocess the raw queries coming from end users somehow.
For instance, when a search for all the words does not match, the application might analyze the query, pick keywords that did not match any documents, and rerun a rewritten query built without them. An application could also automatically perform corrections to keywords in which a typo is suspected.
Sometimes the magic happens even before the full query is received: think of the query suggestions that a search box displays as you type.
Search results aren’t a list of numeric IDs either. When documents are less than ideally described by their title, abstract, or what have you, it’s useful to display snippets (a.k.a. excerpts) in the search results. Showing additional navigational information (document types, price brackets, vendors, etc.), known as facets, can also come in handy.
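Sphinx can build such excerpts for you. In SphinxQL, that’s the CALL SNIPPETS statement; a sketch with a hypothetical index name:

    CALL SNIPPETS('A dream about PHP and Sphinx', 'posts', 'php sphinx');

By default, this returns the source text with the matching keywords wrapped in <b> tags.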