Two principal stages of text searching are to specify what text to match, and then to put the result set of matches into the desired shape, which includes filtering, grouping, and ordering. In this chapter, we discuss everything that Sphinx offers to support these goals: legacy matching modes, full-text query syntax, and additional nonfull-text tools for processing result sets.
In the beginning, there were keywords, and nothing but keywords, and no query syntax was supported, and Sphinx just matched all keywords, and that was good. But even in that innocent antediluvian age, diverse people were asking for various querying patterns, and ranking methods, and we heard them, and thus and so matching modes were cast upon Sphinx. And they were four, and accessible via SphinxAPI and its younger brother, SphinxSE they were.
Nowadays, matching modes are just a legacy. Even the very concept of a “matching mode” is already deprecated internally. But we still have to quickly cover them, as two out of three searching APIs (SphinxAPI and SphinxSE) support them and default to a certain legacy mode for compatibility reasons.
Legacy modes were a predefined combination of (very simple) query parsing rules, query-to-document matching rules, and a specific ranking method (called a ranker).
There are four legacy matching modes: ALL, ANY, PHRASE, and BOOLEAN.
You could switch between modes using the SetMatchMode() call in SphinxAPI. For instance, the following call in PHP sets the PHRASE mode:
$client->SetMatchMode ( SPH_MATCH_PHRASE );
In ALL, ANY, and PHRASE modes, queries were interpreted as “bags of keywords” and then matched and ranked as specified by the mode. BOOLEAN, in addition, supported the basic Boolean operators (AND, OR, NOT, and parentheses).
- ALL
Documents that match all of the keywords are returned. Documents are ranked in the order reflecting how closely the matched words resemble the query (phrase proximity to the query).
- ANY
Documents that match any of the keywords are returned. Documents are ranked based on the degree of the phrase proximity to the query, and the number of unique matching documents in every field.
- PHRASE
Documents that match the query as an exact phrase are returned. Documents are ranked based on the fields in which the phrase occurs, and their respective user weights.
- BOOLEAN
Documents that match a Boolean expression built from keywords, parentheses, and the AND, OR, and NOT operators are returned. Documents are not ranked. It was expected that you would sort them based on a criterion other than relevance.
In addition, there’s one nonlegacy matching mode:
- EXTENDED
Documents that match an expression in Sphinx query syntax are returned. (Query syntax supports keywords, parentheses, Boolean operators, field limits, grouping keywords into phrases, proximity operators, and many more things that we will discuss in detail shortly.) Documents are ranked according to one of the available ranking functions that you can choose on the fly.
There were several problems with the legacy matching modes.
First, they were very limited. There was no way to do anything even slightly fancy, like, say, matching “Barack Obama” as an exact phrase and “senator” and “Illinois” as plain keywords at the same time.
Second, they essentially tightly coupled query syntax and a ranking function. So, for instance, when using the ALL mode, you could not ask Sphinx to just apply lightweight ranking and skip keyword positions for speed. In that mode, Sphinx always computes a rather expensive proximity rank. Or the other way around, if you liked the ranking that ANY yielded, you couldn’t get it while matching all words or matching a phrase, on the grounds that the ANY ranking function was nailed onto its matching mode with nine-inch titanium nails.
Third, once we introduced query syntax support, all the matching modes became just limited, particular subcases of that generic, all-encompassing syntax. That’s the course of progress and redundancy in the modern world. The milkman’s lot isn’t as sought after as it once was...
Last but not least, Sphinx used to have a different code path internally for every matching mode, and that was of little help when maintaining and improving it.
The EXTENDED mode fixes all of this. It decouples query syntax from ranking; you can choose a ranking function separately (using either the SetRankingMode() API call or the OPTION ranker=XXX SphinxQL clause). And adding new full-text querying features does not involve a new “matching mode” anymore; you just change your queries.
So, in version 0.9.9, we internally switched everything to use a unified matching engine, formerly exposed only under the EXTENDED matching mode. When you use one of the legacy modes, Sphinx internally converts the query to the appropriate new syntax and chooses the appropriate ranker. For instance, the query one two three will be internally rewritten as follows:
- ALL
Query: one two three
Ranker: PROXIMITY
- ANY
Query: "one two three"/1
Ranker: PROXIMITY
- PHRASE
Query: "one two three"
Ranker: PROXIMITY
- BOOLEAN
Query: one two three
Ranker: NONE
Special characters such as quotes and slashes that are reserved in query syntax will also be escaped in rewritten queries.
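For illustration, that rewrite logic, escaping included, can be sketched in a few lines of Python. This is not Sphinx source code; the function name and the exact escaped character set are my own assumptions, but the query/ranker pairs mirror the table above.

```python
# Hypothetical sketch of how a legacy matching mode maps onto extended
# query syntax plus a ranker. Illustrative only, not actual Sphinx code.
def rewrite_legacy(mode, keywords):
    # Escape characters that are special in extended query syntax
    # (assumed set; the real list lives in the Sphinx sources).
    special = '\\()|-!@~"&/^$='
    words = ["".join("\\" + c if c in special else c for c in w)
             for w in keywords]
    joined = " ".join(words)
    if mode == "ALL":
        return joined, "PROXIMITY"
    if mode == "ANY":
        return '"%s"/1' % joined, "PROXIMITY"
    if mode == "PHRASE":
        return '"%s"' % joined, "PROXIMITY"
    if mode == "BOOLEAN":
        return joined, "NONE"
    raise ValueError("unknown legacy mode: %s" % mode)
```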
For compatibility reasons, SphinxAPI and SphinxSE default to the ALL matching mode, so to use query syntax or fancier new ranking functions, you have to explicitly switch to EXTENDED mode:
$client->SetMatchMode ( SPH_MATCH_EXTENDED );
The MATCH() operator in SphinxQL always uses EXTENDED mode, so you don’t have to do anything there to get query syntax.
Sphinx text query syntax builds upon three major cornerstones:
- Keywords
- Operators
- Modifiers
Keywords are just the words you search for, which are treated as atoms, the most basic query building blocks. The chief magic happens courtesy of operators that combine keywords in different ways. Combining keywords with Boolean AND and OR, limiting searching to a given field, and phrase and proximity matching are all operators in Sphinx’s book. Operators take keywords (and sometimes other expressions) as their arguments, and transform them into expressions. We might refer to those as full-text expressions to avoid ambiguity with arithmetic expressions such as 1+2/3. Finally, modifiers are attached to keywords, and affect the way keywords match.
At the time of this writing, Sphinx supports the following operators:
- Operator AND
Default implicit operator. Matches when both of its two arguments match. Example (with three keywords and two implicit AND operators between them):
lettuce bacon tomato
- Operator OR
Matches when any of its two arguments match. Example:
one | two
"gloomy Sunday" | "Scarborough fair"
- Operator NOT
Matches when the first argument matches, but the second one does not. For compatibility reasons, both ! and - are recognized as NOT. Examples:
shaken !stirred
shaken -stirred
- Grouping operator (parentheses)
Explicitly denotes the argument boundaries. Example:
(red | green | blue) pixel
bond -(eurodollar bond)
- Field limit operator
Matches when its entire argument expression matches within a specified field, or a part of a field, or a set of fields. The operator is @ and is followed by the field name (in the most basic version). Examples:
@title hello
@title[50] cruel world
@(title,content) one (two | three)
@* match me anywhere
- Phrase operator
Matches when argument keywords match as an exact phrase. Takes only keywords as arguments. Example:
"Richard of York gave battle in vain"
"All your base are belong to us"
- Keyword proximity operator
Matches when all argument keywords that match are found within a given limited distance. Takes only keywords as arguments. Example:
"breakfast Tiffany"~5
"Achilles tortoise"~10
- Quorum operator
Matches when at least N argument keywords match, where N is a given threshold. Takes only keywords as arguments. Example:
"good fast cheap"/2
"single sane sexy smart"/3
- Strict order operator (operator BEFORE)
Matches when its two arguments not only match, but also occur in exactly the same order as in the operator. Example:
ladies << first
north << (east | west)
- NEAR operator
Matches when its two arguments not only match, but also occur within a given limited distance from each other. Example:
bill NEAR/5 monica
(red | black) NEAR/5 (hat | coat)
- SENTENCE operator
Matches when its two arguments not only match, but also occur within the same sentence. Takes only keywords and phrases as arguments. Requires the sentence and paragraph indexing feature (the index_sp directive) to be enabled. Example:
pizza SENTENCE anchovies
acquisitions SENTENCE "fiscal year"
- PARAGRAPH operator
Matches when its two arguments not only match, but also occur within the same paragraph. Takes only keywords and phrases as arguments. Requires the sentence and paragraph indexing feature to be enabled. Example:
light PARAGRAPH darkness
"harley davidson" PARAGRAPH "marlboro man"
- ZONE limit operator
Matches when its entire argument expression matches within a specified document zone, or a set of zones. Requires the zone indexing feature (the index_zones directive) to be enabled. Example:
ZONE:h1 john doe
ZONE:(h1,h2) jane doe
More operators might be implemented over time, so this list isn’t carved in stone, and you should refer to the documentation for the most recent version of Sphinx for updates.
Query syntax reserves several characters as special operators: parentheses, braces, quotes, vertical pipe, dash, exclamation point, and slash, among others. But sometimes you’ll want the query parser to treat them as ordinary characters in keywords. For example, say you want to index tweets and still be able to search for @sphinxsearch in them. How do you avoid a conflict with the field limit operator?
The answer is the standard one used in Unix utilities and generally in programming languages: escape the characters using backslashes.
@tweetdata \@sphinxsearch rocks
In this example, the first @ is a bona fide field operator, whereas the second is treated as a character to search for in the document. (And Sphinx discards the backslash itself.)
But watch out: you have to be careful lest your environment or programming language consumes the escaping character. Depending on what environments your query passes through, you might need extra backslashes. For instance, when you run this in your MySQL client:
mysql> SELECT * FROM test1 WHERE MATCH('hey \@sphinxsearch');
the MySQL client actually processes the escaping backslash itself, leaving Sphinx to receive the query without the backslash and to think that you wanted to reference a field, resulting in an error:
ERROR 1064 (42000): index test1: query error: no field 'sphinxsearch' found in schema
So, you need to have two backslashes, one for the MySQL client and one for Sphinx:
mysql> SELECT * FROM test1 WHERE MATCH('hey \\@sphinxsearch');
And speaking of programming environments, SphinxAPI provides a special call named EscapeString() that escapes all known special characters. What it does is nothing more than a straightforward string replacement, so you can reimplement it yourself if needed, but notice that the version in SphinxAPI gets continuously updated as more features and special characters are added.
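As a rough illustration, such a replacement fits in a few lines. The character set below is an assumption that may lag behind the current SphinxAPI version, so prefer the bundled EscapeString() where available.

```python
import re

# Sketch of an EscapeString()-style helper. The character set is an
# assumption based on common Sphinx query specials; the authoritative
# (and growing) list lives in the SphinxAPI sources.
def escape_query(s):
    return re.sub(r'([()|\-!@~"&/\\^$=])', r'\\\1', s)
```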
As you can see, most of the operators allow not just keywords, but also expressions as their arguments. Parentheses can be arbitrarily nested, phrases or quorums can be put into NEAR or BEFORE operators, and so on. This immediately brings up the question of operator precedence. For instance, is the query (one two | three) going to be interpreted as ((one two) | three) or as (one (two | three))? The correct answer is the latter, because OR has a higher priority than AND. The list of operators in the order of their precedence levels (highest to lowest) is:
- SENTENCE, PARAGRAPH, phrase, proximity, quorum, parentheses
- OR
- NOT
- BEFORE, NEAR
- AND
One particularly important implication of this is how OR and AND work together. It’s a frequent mistake to have a query such as:
turtle doves | French hens | calling birds | golden rings
that will, because of higher OR priority, disambiguate to:
turtle (doves|French) (hens|calling) (birds|golden) rings
Not quite the expected result! We definitely didn’t have any “turtle rings” in mind, combined with either “hens” or “calling.” The right way to express that query would be to use parentheses and explicitly group word pairs together:
(turtle doves) | (French hens) | (calling birds) | (golden rings)
Along the same lines:
(stray SENTENCE cat|dog)
would, in fact, be interpreted as:
((stray SENTENCE cat) | dog)
In this case, the “stronger” SENTENCE operator wins over the “weaker” OR operator, exactly as the “stronger” OR won over the “weaker” AND in the preceding example. So, to match “stray” in the same sentence with either “cat” or “dog”, one would need to use:
(stray SENTENCE (cat | dog))
The NOT operator likes company. So much, in fact, that it can’t bear being alone, and querying for just !alone will result in an error message mentioning that the “query is non-computable.” (There are several different flavors of this message, but they all basically mean the same thing.)
A more appropriate (but less readable) name for this operator would be AND NOT, because that’s how it actually works. The problem with handling a single-argument NOT X is that matching all documents that do not match X means matching all documents in the first place, then omitting those that do match X from the list. And that can be a huge number of documents. Although Sphinx can do this, such queries are most likely the result of human error, so Sphinx chooses to protect you from wasting a lot of server effort matching a gazillion documents by mistake.
In case you want exactly that behavior, there’s an easy workaround. Just attach and index a magic keyword to every document, and query for it explicitly:
allthedocuments !X
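One way to attach such a magic keyword is at data-fetching time. The following sphinx.conf fragment is only a sketch; the allthedocuments token and the column names are illustrative:

```
source documents
{
    # ... connection settings here ...
    # Prepend a magic keyword to every document so that
    # "allthedocuments !X" queries have something to match.
    sql_query = SELECT id, CONCAT('allthedocuments ', content) content \
        FROM documents
}
```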
Another subtlety is that NOT comes in two flavors. Both the hyphen (-) and the exclamation point (!) can be used as NOT interchangeably. Well, almost. There’s a difference in behavior between the two when NOT’s special character occurs in the middle of a word. Hyphens sometimes do occur in keywords, whereas exclamation points normally do not. So to follow the principle of least confusion, Sphinx never treats a hyphen within a keyword as an operator. However, it always handles an exclamation point as such.
For example, let’s say you’re querying for foo-bar. Sphinx will treat the hyphen as keyword data. Thus, when the dash is in charset_table, Sphinx will simply process this entire query as a single keyword with a dash. And when it is not, it will replace the dash with whitespace, resulting in a query equivalent to (foo bar), with two keywords.
On the other hand, foo!bar will always be handled as two keywords with a NOT, equivalent to (foo !bar).
The field limit operator is a commonly used operator, because people frequently want to limit their searches to an email subject header, or a forum thread title, or a more complicated set of fields.
The syntax is @fieldname, and it was conceived before Twitter. (Well, maybe not before the first public release of Twttr by Odeo, but definitely before everyone and his dog was surprised to discover himself on Twitter by Twitter Inc.) If we could do it over, we might have gone with something that wouldn’t conflict with the suddenly popular use of the at sign. Or not.
Formal precedence rules for this operator are rather cumbersome. Luckily, they’re very easy to explain informally:
The field limit operator constrains everything that follows up to either the next field operator or the next right parenthesis.
The operator comes in four different forms:
- @fieldname
Simplest form that constrains matching to a single field
- @fieldname[n]
Where n must be an integer, limits matching to the first n keywords within a given field
- @(field1,field2,...)
Field set form, which limits matching to any of the fields in a comma-separated list
- @*
Resets field matching and removes all limits
But what do I mean by “constrains everything” in the definition? This is perhaps best explained with an example that combines field limits and OR. Consider this query:
Holmes | @author Doyle | Watson | Lestrade
What will this match? As a matter of fact, it will match documents that either mention “Holmes” anywhere, or mention “Doyle”, “Watson”, or “Lestrade” in the author field. Thus, the field limit operator affects everything after the field name (or names in field set form), and the only thing that can stop it from doing so is the right parenthesis. So, if our true intent is to ransack data for documents that either were written by Sir Doyle, or mention any of our three illustrious imaginary friends anywhere, but are not required to appear to be written by the latter two, it is our understanding that it would be in our best interests to alter the query in one of the following ways, sir:
Holmes | (@author Doyle) | Watson | Lestrade
Holmes | @author Doyle | @* Watson | Lestrade
Holmes | Watson | Lestrade | @author Doyle
The first variant uses the grouping operator (parentheses) to limit the scope of the field limit operator. The second one resets field matching back to matching all fields. And the last one just puts the field limit operator at the end of the query so that there’s nothing left for it to constrain.
The phrase operator lets you match exact phrases: verbatim quotes, names of persons and locations, and other word sequences of that ilk. Both the operator and its syntax are de facto standard across modern search systems. The syntax is double quotes that enclose a set of keywords:
"to be or not to be" "Albert Einstein" "New York"
Only keywords are allowed within quote marks, and any special characters are ignored. For instance, "(red|blue) car" is equivalent to "red blue car". Neither grouping nor OR (nor any other operator) happens inside the quotes.
The phrase operator works with keyword positions, and any peculiarities that you might experience always boil down to positions of keywords either in the index or in the query. In the previous chapter, we discussed how "Microsoft Office" and "Microsoft in the office" queries produce different results when “in” and “the” are stop words that do not get indexed but still influence positions. The latter query actually matches "Microsoft ? ? office" where any keyword is allowed to appear in place of the ? placeholder. As a reminder, setting the stopword_step directive to 0 when indexing makes the two queries equivalent again.
But the phrase operator involves a few more complications, related again to the settings in effect when creating the index.
Sphinx defaults to ignoring the punctuation, flat out: periods, hyphens, question marks, exclamation points, what have you. This means a "punctuation flat" query will match the anteceding sentence, because its keywords are adjacent to each other. They are separated with a comma in the text, but that comma gets removed and ignored, and does not affect assigned keyword positions. But there are two indexing-time directives that change that.
First are the phrase_boundary and phrase_boundary_step directives. They work together to give punctuation some “positional” influence that you can choose. phrase_boundary takes, as its value, a list of characters in a format charset_table uses, while phrase_boundary_step takes a number. Characters specified in phrase_boundary incur a position increment as specified in phrase_boundary_step when indexing. In hopefully plainer English, it’s as though we insert several dummy words every time we notice a phrase boundary (a dot, comma, exclamation point, etc.). After you put common punctuation characters in phrase_boundary and rebuild the index, keywords separated by those characters obtain some personal space, and "punctuation flat" stops matching the previous paragraph:
index test1
{
    # ... other settings here ...
    phrase_boundary = ., ?, !, U+2C # 2C is a code for comma
    phrase_boundary_step = 3
}
phrase_boundary is very straightforward and merely increments the position counter in the index every time it sees a character, without paying any attention to context. Consequently, a document containing “John D. Doe” gets indexed with that extra position increment between “D” and “Doe”.
Second, there is an index_sp directive that enables some smarter sentence and paragraph indexing code that properly handles a few exceptions to the “separator is a separator is always a separator” rule when it comes to handling periods. A period will not be considered a boundary in the following situations:
- In the midst of an abbreviation (e.g., “the U.S.A. and Canada”)
- At the end of an abbreviation (e.g., “Yoyodine Inc. has existed since 1800”)
- After a middle initial (e.g., “John D. Doe”)
More empirical rules will likely be added in the future to handle further edge cases, but these three yield surprisingly good results. If nothing else, a "john d doe" query starts to deliver a phrase match again with the index_sp feature enabled.
Last but not least, phrase matches can never span multiple fields. Full keyword positions are, in fact, composed of a field ID and a position within that field, so a change in field is a major change in overall position. To emulate matching over field boundaries, you would need to concatenate fields when fetching data:
sql_query = SELECT id, CONCAT(first_name, ' ', last_name) name ... FROM documents
The keyword proximity operator matches groups of words that are not necessarily in exactly the specified order, but happen to be in close proximity to each other. The unit for counting proximity is keywords, so everything about keyword positions that we just discussed applies to the proximity operator as well. Truth be told, the original reason for implementing the phrase boundary feature was to emulate sentence-level matching with the proximity operator.
Like phrases, the keyword proximity operator works only on sets of keywords, so its syntax builds upon the phrase syntax and adds a proximity threshold with a tilde:
"Achilles tortoise"~3
This matches if both “Achilles” and “tortoise” appear in the document, and if there are no more than two words between them. For instance, “Achilles catches the tortoise” matches, and so does “Tortoise caught by Achilles”. If the proximity threshold is n, the document does not match when there are n or more extra words between the matching keywords. The document matches if and only if fewer than n keywords “dilute” the matching span.
Here are some rules regarding proximity matching:
- The order of the keywords within the operator does not impact matching. It can, however, affect ranking; the degree of phrase matching is generally computed using the order of keywords in the query.
- All words, stop words as well as keywords, contribute to the count. For instance, “Achilles never catches the tortoise” has one word too many to match the previous query, because “the” is part of the count even though it’s a stop word and is not in the index.
- When more than two words are part of the proximity match, the count applies to all words, and not each group of two. For example, "rock paper scissors"~1 will match any permutation of the three keywords, but will not match “rock, paper and scissors.” One extra “and” in this document is one filler word too many.
So, the rule of thumb is ~1 allows any permutation of keywords but absolutely no extra words inserted between them, ~2 allows fewer than two extra words between keywords, ~3 allows fewer than three extra words between keywords, and so on.
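To make the counting rules concrete, here is a brute-force sketch (my own illustration, not Sphinx code) that checks a proximity match over a tokenized document: it tries every combination of keyword occurrences and matches when some span contains fewer than n filler words.

```python
from itertools import product

# Brute-force illustration of "<k1 .. km>"~n semantics: the query
# matches when some choice of one occurrence per keyword spans fewer
# than n filler words. Positions count every token, stop words included.
def proximity_match(doc_tokens, keywords, n):
    positions = []
    for kw in keywords:
        occ = [i for i, t in enumerate(doc_tokens) if t == kw]
        if not occ:
            return False  # every keyword must be present
        positions.append(occ)
    for combo in product(*positions):
        if len(set(combo)) < len(keywords):
            continue  # one token cannot match two different keywords
        span = max(combo) - min(combo) + 1
        fillers = span - len(keywords)
        if fillers < n:
            return True
    return False
```

Note how the order of keywords is irrelevant to matching, exactly as stated above; only the total number of fillers in the chosen span matters.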
A quorum specifies how many keywords must be present in a group to qualify for a match. A quorum looks syntactically similar to the phrase and proximity operators, working like they do with a group of keywords, but unlike them, it does not care about keyword positions in the document. It only requires that “enough” distinct keywords are found there.
The syntax is:
"good fast cheap"/2
The preceding code matches documents that have at least two out of the three keywords. Documents that have all three keywords, of course, also match. Documents with just one keyword do not match.
Just like the proximity operator, keyword order does not affect matching, but can affect ranking when the ranking function uses phrase proximity as a factor.
A quorum with a threshold of 1 is completely equivalent to an OR of all keywords.
Quorum matching is particularly useful when matching all keywords does not return any results, but matching any of the keywords with an OR can return too many matches and can be too slow. Using a quorum operator with a threshold greater than 1 provides a nice trade-off for that case.
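Because quorum matching is position-blind, it is easy to model; this sketch (function name illustrative) simply counts distinct query keywords present in the document:

```python
# Illustration of quorum semantics: "k1 .. km"/N matches when at least
# N distinct query keywords occur in the document, positions ignored.
def quorum_match(doc_tokens, keywords, threshold):
    matched = set(keywords) & set(doc_tokens)
    return len(matched) >= threshold
```

With threshold 1, this degenerates to a plain OR over the keywords, matching the equivalence noted above.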
The strict order operator, also sometimes referred to as the BEFORE operator, works on keyword positions, but does not care how many words come between the specified keywords. It matches when its arguments occur anywhere in a single field, so long as they occur in the order specified.
The syntax is two consecutive less-than signs, and arbitrary subexpressions are allowed as arguments. A single less-than sign would not be recognized as an operator and would be silently ignored:
ladies << first
north << (east | west)
To match, both arguments must be in the same field. A document that has “north” in the title field and “east” in the content field does not match north << east, even if the title field preceded the content field in sql_query when indexing. (Technically, Sphinx retains the ordering of fields, but taking that into account is rather error-prone.) To match that query, the document needs both “north” and “east” in one field, and in the right order. A document titled “North America and Far East” would match. “To the east of North 19th Street,” however, would not.
The precedence of BEFORE is lower than that of OR, just like AND precedence is, so be careful about the same notorious catch:
turtle << doves | French << hens | calling << birds
The preceding query is, because of the higher precedence of OR, equivalent to:
turtle << (doves | French) << (hens | calling) << birds
That’s obviously not as expected, and you should use parentheses to group the (turtle << doves) parts together. AND, however, has a lower precedence than BEFORE. So the following query:
partridge << turtle doves << French hens
is perhaps unexpectedly equal to:
(partridge << turtle) (doves << French) hens
and should be fixed using parentheses or double quotes:
partridge << (turtle doves) << (French hens)
partridge << "turtle doves" << "French hens"
It might help to think of AND as being like addition, BEFORE like multiplication, and OR like exponentiation. In an expression such as 1+2*3^4, you raise 3 to the power of 4 first, then multiply that by 2, then finally add 1. Similarly, in a query such as this:
one two << three | four
you compute the OR first, then plug the result into the BEFORE, and then finally do the AND:
(one & (two << (three | four)))
The NEAR operator is a generalized version of the keyword proximity operator that works with two arguments, which can be arbitrary expressions. (The proximity operator accepts only plain old keywords.)
The syntax is NEAR/n, where n is an integer representing the maximum distance allowed. The syntax is case-sensitive and space-sensitive; that is, near or Near is recognized as a plain keyword rather than an operator, and NEAR / 3 (with any whitespace around the slash) is likewise parsed as a keyword plus extraneous punctuation, not as a valid operator.
Distance is counted in keywords and works exactly as in keyword proximity, that is, NEAR/1 allows any permutation of the arguments but no keywords between them, NEAR/3 allows fewer than three keywords between them, and so on.
Because phrases with multiple words may appear in NEAR, think of n as the maximum distance allowed between the end of the leftmost argument’s occurrence and the beginning of the rightmost one. So, "Saturday night" NEAR/3 "Big prizes" will match “Big prizes given away Saturday night” because there are fewer than three words between the end of “Big prizes” and the beginning of “Saturday night”.
Even though NEAR is a generalized version of proximity, we need to note how the two behave differently. Specifically, the query one NEAR/4 two NEAR/4 three is not equivalent to "one two three"~4 because of differences in gap handling. Indeed, the first query allows up to three keywords between “one” and “two” and then up to three more between “two” and “three”. So a document such as “one x x x two y y y three” matches it. In contrast, the second query just allows up to three filler words between all of the matched keywords. The sample document we just considered has six, and thus will not match the second query.
The SENTENCE and PARAGRAPH operators require the full-text index to be built with sentence and paragraph detection enabled, using the index_sp=1 directive. Paragraph boundaries are detected by the HTML stripper, so they additionally require the html_strip=1 directive. Without the proper indexing options, the resultant index will fail to store sentence or paragraph location information, and these operators will be reduced to an AND.
The syntax is:
pizza SENTENCE anchovies
SENTENCE and PARAGRAPH must be written in uppercase. Otherwise, they’re handled as keywords to search for instead of as query syntax. Our example would match documents in which “pizza” and “anchovies” occur within a sentence.
Operator arguments are limited to individual keywords, phrases, and instances of the same operator. So the following two queries are valid:
wall SENTENCE "red right hand"
apples PARAGRAPH oranges PARAGRAPH bananas
But these two are not:
(red | black) PARAGRAPH tree
apples SENTENCE oranges PARAGRAPH bananas
Sentence and paragraph boundaries get detected at indexing time using a number of predefined rules. To catch sentence boundaries, punctuation is processed in the tokenizer using the following rules:
- The question mark (?) and exclamation point (!) always indicate a boundary.
- A dot (.) is usually a boundary, with a few exceptions, which I mentioned earlier when discussing phrase boundaries. A dot is not considered the end of a sentence:
  - In the midst of an abbreviation, as in “the U.S.A. and Canada”; defined as a single inline dot followed by a capital letter
  - At the end of an abbreviation, as in “Yoyodine Inc. has existed since 1800”; defined as a single inline dot followed by whitespace and a lowercase letter
  - After a middle initial, as in “John D. Doe”; defined as a single capital letter with whitespace to the left and a dot and whitespace to the right
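The three dot exceptions can be approximated in a few lines. The classifier below is my own reconstruction of the stated rules, not the actual Sphinx tokenizer code, and real text will hit edge cases it ignores.

```python
# Reconstruction (not actual Sphinx code) of the rules above: decide
# whether the dot at text[i] ends a sentence.
def dot_is_boundary(text, i):
    assert text[i] == "."
    nxt = text[i + 1] if i + 1 < len(text) else ""
    # Mid-abbreviation: dot immediately followed by a capital ("U.S.A.").
    if nxt.isupper():
        return False
    if nxt.isspace():
        # End of abbreviation: dot, whitespace, then a lowercase letter
        # ("Yoyodine Inc. has existed since 1800").
        j = i + 2
        while j < len(text) and text[j].isspace():
            j += 1
        if j < len(text) and text[j].islower():
            return False
        # Middle initial: a single capital letter with whitespace to the
        # left and the dot plus whitespace to the right ("John D. Doe").
        if i >= 2 and text[i - 1].isupper() and text[i - 2].isspace():
            return False
    return True
```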
Every HTML tag defined as block-level in the standard triggers a paragraph boundary. In HTML 4, those tags are ADDRESS, BLOCKQUOTE, CAPTION, CENTER, DD, DIV, DL, DT, H1, H2, H3, H4, H5, LI, MENU, OL, P, PRE, TABLE, TBODY, TD, TFOOT, TH, THEAD, TR, and UL.
Keyword positions get incremented on a boundary: that is, no keyword shares a position with a boundary. Sphinx then stores boundary positions in the index and uses those at query time to check whether there was a separating position between any keywords.
Zone indexing and searching essentially adds support for hierarchical document structure, as opposed to the linear structure imposed by text fields.
Zones exist within fields, and map to HTML or XML markup elements.
A zone might be everything between <TITLE> and </TITLE>, or <H1> and </H1>, or any other XML tags. Zones can be nested, as the following XML sample illustrates:
<chapter>
<title>Compressing Inverted Files</title>
This chapter discusses a variety of
<emphasis>compression techniques</emphasis>.
</chapter>
<intermezzo>Text in the middle!</intermezzo>
<chapter>
Another chapter content.
</chapter>
Note that it isn’t necessary to declare all tags as zones. You can choose, say, to index chapter and title as zones, but not intermezzo and emphasis.
Zones are named. Valid identifiers are accepted as names, with a limit of 127 bytes per name. Every collection and every document can contain an arbitrary number of different zones, and an arbitrary number of occurrences (spans) of any given zone. Sphinx cares only about the start and end of each span, so they can nest arbitrarily, and, technically, nothing prevents them from overlapping (which is forbidden in valid HTML and XML), as long as all open spans get closed.
You define what tags are to be indexed as zones in the
configuration file using the index_zones
directive. Once indexed, zones can
be used to limit matching to specified zones only, just like
fields.
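For reference, declaring zones takes a single line in sphinx.conf. Here is a minimal sketch (the index name and the surrounding settings are made up for illustration; note that zone indexing works on top of the HTML stripper, so html_strip also needs to be enabled):

```
index books
{
    # ... source, path, and other usual settings ...
    html_strip  = 1
    index_zones = chapter, title
}
```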
Unlike full-text fields, zones can overlap and nest, and are not limited in number. However, thanks to their simple fixed structure, fields map ideally to SQL columns and also are generally much more efficient to process. So, zones complement fields but don’t render them obsolete.
Two supported variants of zone limit syntax are:
ZONE:h1 only in header
ZONE:(h1,h2) only in header
Syntax is case-sensitive and whitespace-sensitive, so ZONE
must be in uppercase, and spaces are
forbidden.
Precisely like searches with fields, searches with zones can be
limited to either a single zone or several zones at once, and the
ZONE
operator affects the rest of the
query, until either another ZONE
operator or a closing right parenthesis occurs. The ZONE
limit operator behavior mimics the field limit
operator in syntax as closely as possible.
Searches within a zone match everything in any individual span of that zone, including anything that is in a nested subzone. Consider the example document earlier in this section, indexed with chapter and title defined as zones. Querying for ZONE:chapter inverted matches, because even though the most enclosing zone for “inverted” is title, that keyword is nevertheless enclosed by the parent chapter zone as well. Querying for ZONE:chapter another variety also matches. Although “another” and “variety” occur in different instances of the chapter zone, they both occur in some chapter, and therefore match that chapter zone query. Finally, querying for ZONE:chapter middle does not match, because none of the keyword’s parent zones are chapters.
In addition to operators, Sphinx also supports the notion of
keyword modifiers. Some of the full-text operators
(notably phrase, proximity, and quorum) allow only keywords for
arguments, and exclude other operators. For instance, parentheses are
not allowed within a quorum. Modifiers, however, can appear on keywords
everywhere, including within a quorum, a phrase, or a SENTENCE
operator. There are three such
modifiers:
- Exact form modifier (=)
Matches if the keyword occurs in that exact form, as opposed to matching stems. Requires both stemming and index_exact_words=1 enabled in the index settings; has no effect otherwise. Example: =runs
- Field start modifier (^)
Matches if the keyword occurs at the very start of a text field. Example: ^hello
- Field end modifier ($)
Matches if the keyword occurs at the very end of a text field. Example: world$
The exact form modifier comes into play when you run your words
through any morphology
processing—say, through stemming that replaces a keyword with its root
form. By default, a stemmed index does not store the original keywords,
so you would also have to explicitly set index_exact_words=1
and rebuild the index to
enable the exact form modifier to work. In an index
with exact words, querying for =runs
matches only documents containing that
particular word, whereas querying for runs
without a modifier would still match any form that
reduces to the same root, be it “running”, “runs”, or “run”. In a sense,
the exact form modifier means “skip search-time stemming for this
keyword.”
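The mechanics are easy to model. The following Python sketch is purely illustrative (the one-suffix “stemmer” and the "=" token marker are toy stand-ins; Sphinx’s actual dictionary format is its own), but it captures the matching behavior just described:

```python
def stem(word):
    # toy stemmer: strips a trailing "s" (stand-in for real morphology)
    return word[:-1] if word.endswith("s") else word

def index_tokens(word, exact_words=True):
    # a stemmed index stores the root form; with index_exact_words=1
    # the original form is additionally kept, marked here with "="
    tokens = {stem(word)}
    if exact_words:
        tokens.add("=" + word)
    return tokens

def matches(query_kw, doc_word, exact_words=True):
    doc_tokens = index_tokens(doc_word, exact_words)
    if query_kw.startswith("="):
        return query_kw in doc_tokens       # exact form: skip stemming
    return stem(query_kw) in doc_tokens     # default: match by root form

print(matches("runs", "run"))    # True: stemmed search matches the root form
print(matches("=runs", "run"))   # False: exact form required
print(matches("=runs", "runs"))  # True: the document has the exact word
```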
We now know pretty much everything about full-text query syntax, but what exactly is Sphinx going to return from a query?
Sphinx’s output is called the result set, and it comes in two parts: matched documents (a.k.a. row data) and metadata. Matched documents are indeed just rows from the Sphinx database. These results always include the document ID and weight, and might also include additional attributes stored in the index and expressions computed on the fly. Metadata provides a few extra things of interest about the result set in general—the number of total matches, per-keyword frequency statistics, and so on.
When you send a query to Sphinx using a programming API, the result combines row data and metadata into a single structure. The specific structure used varies depending on the language you’re using (an associative array in PHP, Perl, and Python; a struct in pure C; a class in Java; etc.), but the structure of member names and their meanings stay the same across APIs. For instance, a result set dump in PHP looks like this:
Array
(
    [error] =>
    [warning] =>
    [status] => 0
    [fields] => Array
        (
            [0] => title
            [1] => content
        )
    [attrs] => Array
        (
            [group_id] => 1
            [date_added] => 2
        )
    [matches] => Array
        (
            [0] => Array
                (
                    [id] => 123
                    [weight] => 201
                    [attrs] => Array
                        (
                            [group_id] => 1
                            [date_added] => 1293982753
                        )
                )
        )
    [total] => 1
    [total_found] => 1
    [time] => 0.002
    [words] => Array
        (
            [test] => Array
                (
                    [docs] => 3
                    [hits] => 5
                )
            [one] => Array
                (
                    [docs] => 1
                    [hits] => 2
                )
        )
)
Typically, you would walk through the matches and process the data in them, as the following PHP pretty-printer snippet does. Don’t forget to handle errors, though.
// PHP SphinxAPI specific, return matches as a plain array
// (as opposed to an array indexed with document IDs)
$client->SetArrayResult ( true );

// do query
$result = $client->Query ( "my test query", "indexname" );
if ( !$result )
{
    // handle errors
    print "ERROR: " . $client->GetLastError();
}
else
{
    // query OK, pretty-print the result set
    // begin with general statistics
    $got = count ( $result["matches"] );
    print "Query matched $result[total_found] documents total.\n";
    print "Showing matches 1 to $got of $result[total] accessible.\n";

    // print out matches themselves now
    $n = 1;
    foreach ( $result["matches"] as $match )
    {
        // print number, document ID, and weight
        print "$n. id=$match[id], weight=$match[weight], ";
        $n++;

        // print group_id attribute value
        print "group_id=" . $match["attrs"]["group_id"] . "\n";
    }
}
We can see quite a number of things in the result besides the match data. Let’s cover them:
error
Error message for this result set. Meaningful only for multiqueries. The standalone Query() API call has a different convention (on error, it returns an empty result set, and the error message will be available through the GetLastError() call).

warning
Warning message for this result set. Meaningful only for multiqueries.

status
searchd status code. It can take one of the following constant values:
- SEARCHD_OK, meaning everything went fine
- SEARCHD_ERROR, meaning there was an error processing this query, and no valid result set was returned
- SEARCHD_WARNING, meaning the query completed and a valid result set was returned, but with warnings
- SEARCHD_RETRY, meaning there was a temporary error handling the query, and the client should retry the query later

fields
A list of full-text fields in the queried index.

attrs
A list of attributes (columns) returned in this result set, along with their associated type numbers.
This list can be different from the list of attributes stored in the index, because we might choose in our query to have Sphinx not fetch some of the attributes, compute things on the fly and return them as attributes, and so on.
The numbers attached to attribute names, such as 1 and 2 in the sample dump shown earlier, are attribute type identifiers taken from the SPH_ATTR_xxx family of constants (SPH_ATTR_INTEGER, SPH_ATTR_TIMESTAMP, etc.). They don’t have to be sequential, although by coincidence they appear that way in the dump shown.

matches
A list of matches. Each match has an ID, a weight, and a list of values for the attributes specified in attrs.

total
The total number of accessible matches. (See the upcoming discussion of totals and limits.)

total_found
The total number of matches found in the index. (Also discussed shortly.)

time
Elapsed time, in seconds, with millisecond precision.

words
Per-keyword statistics. This is a list of keywords extracted from the query, along with the total number of documents that match each keyword (docs) and the total number of keyword occurrences in those documents (hits).
Row data is pretty much covered by attrs and matches. Those are the essentials of the search result. The earlier example contained two attributes, group_id and date_added, and their respective types, which are SPH_ATTR_INTEGER and SPH_ATTR_TIMESTAMP. There is just one match (to keep the example concise), with a document ID of 123, an assigned relevance weight of 201, and some attribute values.
All attributes defined in the index are returned by default. When you only need a few of those, use the SetSelect() API call to specify just the ones you need. It takes a single string argument, whose syntax is identical to an SQL select list clause (i.e., everything between SELECT and FROM). For example:
$client->SetSelect ( "author_id, year" );
Restricting the attributes to just what you need is useful not only to avoid clutter, but for client-side performance reasons as well. Fetching just 20 rows with 100 redundant attributes per row means unpacking 2,000 extra values and putting them into a result set. And in a slower scripting language such as PHP, Perl, Python, or Ruby, that results in a very noticeable performance impact.
Frequently, either you know the attribute type up front, or your language can dynamically convert between the types, so you just access the attributes and refer to them without further effort. If you need to figure out the type of an attribute dynamically, you can check the attribute type data in attrs, enabling you to write a generalized result set handler, or just verify your type assumptions.
Everything else besides attributes and matches is metadata. You can check for errors through the error, warning, and status members of the metadata. The fields member is rarely (if ever) used in practice, but is still provided for reference and debugging purposes.
total and total_found are the trickiest part of the metadata. Formally, total is defined as the number of accessible matches, that is, matches that you can actually request and receive from searchd; and total_found is defined as the total number of matches found in the index(es) searched, or, in other words, a grand total count of all the matching documents that searchd just processed. And neither of these is the number of matches just returned in matches.
Consider, for the sake of example, the following (real-world) result set: total is 1,000, total_found is 36,123, and matches contains only 20 entries. How do these numbers relate to one another? Are they arbitrary? No. But they depend on a couple more options that we’ve been sort of hiding up our sleeve so far: query limits. The limits can be set using the following API call:
function SetLimits ( $offset, $limit, $max_matches=0, $cutoff=0 )
The limits’ defaults come into play here. offset and limit are the offset into the result set on the searchd side and the number of matches to pull from there to the application side, and they default to 0 and 20, respectively. (Think of the LIMIT offset, limit clause in MySQL.) Coming up next, max_matches is what controls the result set size on the searchd side. It defaults to 1,000, meaning searchd will keep track of the best 1,000 matches at all times, but never a single match more. cutoff is beyond the scope of the current example, but for the sake of completeness, it’s a threshold that lets you stop searching once it finds that many matches.
So, here’s what happens in the previous example. searchd runs the query, and finds and honestly processes 36,123 matches. That is reflected in total_found. However, as required by the max_matches setting, it only keeps, at most, 1,000 current-best matches in the server-side result set at all times. So, it can’t return match number 3,000; it just does not have it. This is reflected in total. Finally, offset and limit default to 0 and 20, which is why only 20 rows are returned in matches in the client-side result set.
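The interplay of those three numbers can be condensed into a few lines of Python (a model of the reporting rules just described, not actual Sphinx code):

```python
def result_set_counts(found, max_matches=1000, offset=0, limit=20):
    """Model how searchd reports counts under query limits.

    found        -- documents the query actually matched (total_found)
    max_matches  -- server-side cap on the tracked best matches
    offset/limit -- the client-side window set via SetLimits()
    """
    total_found = found                            # everything that matched
    total = min(found, max_matches)                # accessible matches
    returned = max(0, min(limit, total - offset))  # rows sent to the app
    return total, total_found, returned

# the example from the text: 36,123 matches, all defaults
print(result_set_counts(36123))   # (1000, 36123, 20)
```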
You might be wondering why we even have those defaults—that
confusing max_matches
setting that
effectively caps matching at 1,000 matches unless you explicitly bump it,
and then an equivalent of LIMIT 0,20
on
top. We have the defaults for performance reasons and to ensure clear
memory usage constraints. It’s enormously more efficient to work with the
top 1,000 matches than to process 10 million matches, keep them all in
memory, and then throw them away because the query only wanted to show the
first 10 matches on that first search results page. In our choice of a
default value of 1,000, we were mimicking Google, which never lets you
page past the 1,000th search result. (Sphinx
lets you bump that limit easily, though. Just keep in mind that the limit
needs to be bumped both in sphinx.conf and in the SetLimits()
API call. Otherwise, a server-wide
constraint will take precedence over the API call.)
When should you bump max_matches
and what are the pros and cons of doing so? Our general recommendation is
to keep max_matches
within the 1,000 to
10,000 range. If you set it much higher (e.g., a range of 100,000 to 1
million matches), not only will this result in an immediate performance
impact because Sphinx needs to preinitialize a result set with more
matches, but it’s also an indication that you are highly likely to be
doing something, ahem, suboptimal in your application. Most users only
ever need the first few pages of search results, and nobody is ever going
to actually sit and page through 1 million results. So there’s definitely
no need to set max_matches
higher than
a few thousand when it’s a real, live person who’ll be consuming the
search results. But what if it’s not a person, but a computer program that
needs to additionally process those search results? In our consulting
practice, chances are still high that you can do better than merely
raising max_matches
through the roof.
Result set processing on the Sphinx side, discussed in detail shortly, is
sophisticated enough to either fully replace or at least significantly
offload application-side processing. To supply you with an exaggerated example, there’s absolutely no
sense in pulling 100,000 matches that match ipod
and sorting them on the application side
by price, as Sphinx can sort them for you much faster. That being said,
there still are viable scenarios in which you do have to pull very many
matches. Mostly, those arise from data mining tasks, when Sphinx is
essentially only used as a low-level keyword searching tool and complex
data processing happens outside of it, in the application. Nevertheless,
in many cases Sphinx can do everything you need to that result set. And,
even with those bulky data mining tasks just mentioned, you can frequently
at least do a rough check or preprocessing pass on the Sphinx side and
reduce the number of rows that absolutely have to travel to the
application.
Back to simple little things, time
is the query time elapsed in Sphinx, with
millisecond precision, exactly as logged into the query log. So the sample
search earlier took 0.002 seconds (which is actually slow for a trivial
search that matches one row, but Sphinx was warming up, the machine was
busy swapping a bit, and I needed something other than zero for the
example). Note that it does not include the network round-trip time spent
to send the query to Sphinx and send the result set back. So, time as
measured on the client application may and will vary.
Finally, there are per-keyword statistics in the words member of a result set. We can deduce that our query had two keywords, test and one, and that in our full-text indexed document collection, test occurs five times in three different documents, and one occurs two times, but in just one document. That’s just general prerecorded statistics for the entire index, taken from the dictionary. (And, for reference, you can very quickly extract those without actually doing any searching by using the BuildKeywords() API call.) Occurrence counts within just the matched documents aren’t provided, for performance reasons.
Keyword statistics are there for fun and profit. It might be fun to display them along with search results. And then it’s useful to automatically adjust and rerun queries based on these statistics—say, remove a keyword that matches nothing to prevent a query from matching nothing and making the end user upset. We’ll discuss a few of those query rewriting techniques later.
Both results and metadata are, of course, available via SphinxQL as well. Attributes and rows are returned from the query itself:
mysql> SELECT * FROM test1 WHERE MATCH ('test one');
+------+--------+----------+------------+
| id | weight | group_id | date_added |
+------+--------+----------+------------+
| 1 | 3595 | 1 | 1293982753 |
+------+--------+----------+------------+
1 row in set (0.00 sec)
The SQL result set naturally contains a list of attributes (columns) and matches data (rows), but can’t include the metadata. So, you have to run an additional query to fetch it:
mysql> SHOW META;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total | 1 |
| total_found | 1 |
| time | 0.001 |
| keyword[0] | test |
| docs[0] | 3 |
| hits[0] | 5 |
| keyword[1] | one |
| docs[1] | 1 |
| hits[1] | 2 |
+---------------+-------+
9 rows in set (0.00 sec)
The metadata is kept until the next search query (i.e., SELECT), so you can request it several times as needed. SHOW META itself is thus very quick, but it normally results in an extra network round trip. However, if your MySQL client library allows for multiqueries and multiple result sets, you can send the SELECT and SHOW META in one batch, and eliminate that round trip. Recent enough versions of MySQLi in PHP and DBI in Perl are known to support that.
Under most circumstances, you will at some point need to maintain multiple indexes, but search through all of them simultaneously. Otherwise, you’d have to store everything in a single, possibly huge, index. And that can only work well in a scenario with a few very specific conditions: when the document collection does not get updated on a daily basis; when it’s OK to utilize a single core for every given search; when you don’t need to combine multiple entity types when searching; and so on. Most real-world tasks are different, and you will likely need more frequent index updates (counted in minutes rather than weeks), scaling across multiple cores, and so forth. Both updates and scaling, as well as a few fancier tasks, require that you be able to search through multiple indexes and combine (aggregate) the results. So, let’s look at how that works.
Searching through multiple indexes can be explicit, when you enumerate several indexes in your query call:
$client->Query ( "John Doe", "index1 index2 index3" );
Separators in the index list are ignored, so you can use spaces, commas, semicolons, or anything else.
Sphinx will internally query every index independently, create a
server-side result set (the top N best matches from
each index, where N equals max_matches
), and then combine the obtained sets, sort the combined matches once
again (to restore the order you requested), and pick the top
N best matches from all the indexes. This
“combination” phase is, by default, very quick, unless you set max_matches
rather high and
there are many actual matches. Sorting several thousand matches in RAM is
pretty quick.
The order of indexes in the query is important, however, because it can affect search results on certain occasions. That’s a nonissue when no rows are shared among indexes, that is, when every document ID is unique and occurs in exactly one index. But when a document ID is duplicated and occurs in both result sets—a case that would likely involve different weights and attribute values!—we have to pick a single version of that document. Sphinx picks the “newer” version, from the latter index in the list. For instance, if John Doe matches document 123 in both index1 and index3, and both matches make it into the respective result sets, the data from index3 wins. Note, however, that when document 123 isn’t in the intermediate result set for index3, the final combined result set will still contain data from index1, even if document 123 was actually matched. So, in a sense, matching documents from indexes specified later in the index list replace “older” matches. Therefore, in case of a conflicting duplicate row, you always get the “newer” weight and attribute data in the combined result set.
In made-up pseudo-SQL syntax, this process of eliminating duplicates and combining results can be described as follows:
CREATE TEMPORARY TABLE tmp ...
INSERT INTO tmp
    SELECT * FROM <index1> WHERE <search-condition>
    ORDER BY <order-condition> LIMIT <max-matches>
REPLACE INTO tmp
    SELECT * FROM <index2> WHERE <search-condition>
    ORDER BY <order-condition> LIMIT <max-matches>
REPLACE INTO tmp
    SELECT * FROM <index3> WHERE <search-condition>
    ORDER BY <order-condition> LIMIT <max-matches>
...
SELECT * FROM tmp ORDER BY <order-condition> LIMIT <max-matches>
Internal index search order isn’t specified. In theory, Sphinx can decide to rearrange actual searches in whatever way it deems necessary. The final result set, however, is deterministic and guaranteed to stay the same.
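The same combination phase can be modeled in Python. This is a sketch under simplifying assumptions (a match is just a document ID and a weight, and the requested order is by weight descending), not Sphinx code:

```python
def combine(result_sets, max_matches=1000):
    """Merge per-index result sets: on duplicate document IDs, the
    later index in the list wins; then everything is sorted once more
    and truncated to max_matches.  Each input result set is a list of
    (doc_id, weight) pairs already truncated by the per-index search."""
    best = {}
    for rows in result_sets:          # in the order the indexes were listed
        for doc_id, weight in rows:
            best[doc_id] = weight     # a "newer" duplicate replaces the old
    combined = sorted(best.items(), key=lambda r: -r[1])
    return combined[:max_matches]

index1 = [(123, 300), (7, 200)]
index3 = [(123, 150), (9, 100)]
print(combine([index1, index3]))  # doc 123 takes index3's weight, 150
```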
But what does this have to do with quicker updates, scaling in general, and everyday use? The thing is, when using the disk-based indexing backend, partitioning data into multiple indexes is essentially the way to achieve both goals.
Basically, to speed up indexing updates, you put most of the data in a rarely updated “main” archive index (or index set) that only needs to be reindexed once in a while, and you put the tiny “dynamic” fraction of the data that changes actively into a separate “delta” index that can then be rebuilt (very) frequently. Then you search through both the “main” and “delta” indexes.
As for scaling, searches against a single index are single-threaded, so you have to set up several indexes to take advantage of multiple cores, CPUs, and disks, and you can search through all those indexes in one go just as well.
So, in one way or another, sooner or later you are going to divide and conquer and search more than one index in one go and have Sphinx combine the results via the routine we just discussed.
Result set processing is among the most powerful of Sphinx’s features. Interestingly, it doesn’t have anything to do with full-text searching. However, it has everything to do with the searching results format that the application sees.
Despite the advertised power, and the inevitable tiny devils hidden in the numerous details ensuing from said power, it’s still eerily simple to explain. Sphinx supports SELECT. Literally. Almost all of the SQL stuff, with a few Sphinx-specific extensions, too. That’s the definition of how Sphinx can process the result set for you. Admittedly, it’s too general and rather vague, but in a sense it’s complete. Now, for all those details...
Functionality-wise, there are these five cornerstones:
- Expressions
When querying, you can access document attributes, compute arbitrary arithmetic expressions, and use the resultant values for filtering, grouping, or sorting purposes.
- Filtering (WHERE clause)
The result set can be limited to matches that satisfy a certain condition.
- Grouping and aggregates (GROUP BY clause)
The result set can be grouped by a given column. That is, a group of rows that shares a common value in any of the columns can be replaced with a single row representing that group.
- Sorting (ORDER BY clause)
The result set can be ordered by a given column or a set of columns, in either ascending or descending order.
- Miscellaneous querying options (limits, ranking weights, etc.)
These options let you request different slices of the result set, use different ranking functions, stop query processing early, and so on.
The preceding rules apply to full-text matches. So, on top of core text searching operations, you can also add arbitrary arithmetic, filtering, grouping, aggregate functions (MIN, MAX, AVG, SUM), ordering—pretty much everything SQL allows, and then some.
The classic SQL SELECT
lets you
enumerate columns and calculate things, and so does Sphinx. In SphinxQL,
you also use SELECT
, as usual:
SELECT *, price_usd*1.35 AS price_eur FROM products ...
In SphinxAPI, you would have to use a SetSelect()
call that takes everything you’d
put between SELECT
and FROM
in SQL:
$client->SetSelect ( "*, price_usd*1.35 AS price_eur" );
Expressions can use the document ID, weight (relevance value), and attributes as their arguments. The four rules of arithmetic, standard comparison operators, Boolean and bitwise operators, and a few standard mathematical functions are all supported:
- Arithmetic: +, -, *, /
- Comparison: =, <>, >, <, >=, <=
- Boolean: AND, OR, NOT
- Bitwise integer: &, |
- Standard mathematical functions: ABS, CEIL, FLOOR, SIN, COS, LN, LOG2, LOG10, EXP, SQRT, MIN, MAX, POW
Comparison operators are valid in a scalar context (as we don’t really have a Boolean one). So, (a>b)+3 is legal syntax that returns 4 when a is greater than b, or 3 otherwise. The equality and inequality comparisons (= and <>, respectively) on floats come with a feature that is rather unique to Sphinx. They compare values with a small threshold of 1e-6 (that’s approximately how much float precision is actually there when the absolute value is close to one). So, when a is 1e-7 and b is 1e-8, (a=b)+3 will return 4, even though a/b will return 10. This might be inconvenient. On the other hand, sqrt(3)*sqrt(3)=3 returns 1, and without the threshold it would return 0. This might be convenient. So, be careful about that if you’re working with extremely small 32-bit float values. (And in case you absolutely need bitwise comparison for those, IF() can help, as we will discuss shortly.)
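Here is a small Python model of that thresholded comparison (the exact internal comparison is Sphinx’s own; this just illustrates the behavior described above):

```python
import math

EPS = 1e-6  # Sphinx's comparison threshold for = and <> on floats

def feq(a, b):
    # thresholded equality: values closer than EPS compare as equal
    return abs(a - b) <= EPS

# tiny values collapse together under the threshold...
print(feq(1e-7, 1e-8))                      # True, even though a/b == 10
# ...but floating-point round-off noise is conveniently ignored
print(feq(math.sqrt(3) * math.sqrt(3), 3))  # True; exact == would say False
```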
The result type (and evaluation mode) is automatically deduced
based on argument types and operations and can be a signed 32-bit or
64-bit integer, or a 32-bit floating-point value. That’s loosely based
on how expressions work in the C language. However, Sphinx 32-bit
sql_attr_uint
attributes are
(historically) unsigned integers. When evaluated in 32-bit integer mode,
their values will be implicitly converted to signed when the operation
works on signed, loosely mimicking C again. Then they will be converted
back to unsigned when passing data back to the client. So, you might
need to reinterpret them as signed values on the application side. Also,
if you’re storing 32-bit unsigned integer values that actually utilize
the most significant bit (that’s values over 2^31-1, or 2147483647) and
do not want those to wrap around zero in the calculations, you might
need to forcibly convert them to a signed 64-bit type using the BIGINT()
function:
BIGINT(3123456789)*10
There’s one more conversion function called SINT()
that converts its argument (an unsigned
32-bit integer) to a signed integer, returning a 64-bit value to make
sure large values are preserved:
SINT(1-2)
There’s also one sort of “anti-conversion” function. Unlike C (but
like SQL), integer division such as 3/5 is forcibly computed in floats
(and returns 0.6 instead of the 0 that almost no one but a C programmer
would expect). But returning the truncated integer 0 can also sometimes
be necessary (to please C guys, if nothing else). So Sphinx supports an
IDIV()
function that takes two
integer arguments and divides them as integers:
IDIV(3,5)
IDIV(mytimestamp,86400)
IDIV(mydatecode,1000)
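The difference between plain division and IDIV() is easy to model in Python (a sketch of the described semantics, not Sphinx internals; truncation toward zero for positive arguments is assumed, mirroring C):

```python
def sphinx_div(a, b):
    # plain / in Sphinx expressions is computed in floats, like SQL
    return a / b

def idiv(a, b):
    # IDIV() divides as integers, truncating like C integer division
    return int(a / b)

print(sphinx_div(3, 5))        # 0.6
print(idiv(3, 5))              # 0
print(idiv(1293982753, 86400)) # 14976: whole days since the Unix epoch
```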
Sphinx supports a few more functions that do something beyond fourth-grade math, too. Some of them (such as IF(), IN(), and INTERVAL()) are modeled after MySQL and should be familiar to MySQL users, but might come with Sphinx-specific quirks. Other functions, such as GEODIST(), are entirely Sphinx-specific.
IF(cond,iftrue,iffalse)
IF() takes three arguments and returns the second one if the first one is nonzero, or the third one if the first argument is zero. For a floating-point first argument, though, IF(), unlike the equality operators, operates through a simple bit comparison instead of using thresholds. So, these two lines will yield different results:
IF ( sqrt(3)*sqrt(3)-3<>0, a, b )
IF ( sqrt(3)*sqrt(3)-3, a, b )
The first one uses a “thresholded” inequality comparison, which tolerates and eliminates the slight floating-point inequality, hence IF returns a. The second one makes IF itself do the bitwise comparison to zero, and because of limited floating-point precision and round-off errors, the argument isn’t exactly zero. So, that IF returns b instead.

IN(expr,val1,val2,...)
IN() takes two or more arguments, and returns either 1 if expr is found in the subsequent list of values, or 0 otherwise. The first argument can be an arbitrary expression, including a multivalue attribute (MVA), but the values must be integer constants. Sphinx presorts the list of constants and does a binary search, so even with a huge list these checks are pretty quick. When the first argument is an MVA, IN will return 1 when any of its values matches.
IN(year,2001,2009,1997)
IN(friend_ids_mva,30755,288,27353,19614,29567,3916,23445,5290)

IN(expr,@uservar)
Value lists can occasionally grow to be huge (up to millions of values), making sending them to searchd on every request a costly overhead—interestingly, (much) more expensive than the filtering itself. Sphinx lets you set a server-global user variable once (via SphinxQL, using SET @uservar=(1,2,3) syntax) and reuse it later. Global variables are shared among different connections, but are not saved between searchd runs, and their contents will be lost on shutdown.

INTERVAL(expr,point1,point2,...)
This returns the index of the earliest turning point that is less than the expression in the first argument; that is, INTERVAL() returns 0 when expr<point1, returns 1 when point1<=expr<point2, and so on. The turning point values must be in ascending order (point1<point2<...<pointN) for the function to work properly. This function is useful for partitioning values into buckets, and has a few applications. For instance, creating “facets” for a price bracket becomes trivial:
SELECT *, INTERVAL(price,30,100,300,1000) AS pricegroup
GROUP BY pricegroup

NOW()
This takes no arguments and returns a Unix timestamp representing the moment the query began to execute. (So, it’s a kind of named constant, because it gets calculated only once per query for performance reasons, and doesn’t change from row to row.)

BITDOT(intval,val0,val1,...)
This interprets the first integer argument as a bit set, and sums all the arguments where a corresponding bit is 1 in that bit set. (The second argument corresponds to bit 0, the third to bit 1, etc.) For instance, BITDOT(5,a,b,c,d) will return a+c. Any of the function’s arguments can be an arbitrary expression, but the first argument must be an integer type. The function, in a sense, is “just” syntax sugar, because theoretically, it could be emulated with bitwise integer operations, resulting in something awkward such as this:
(intval&1)*val0 + ((intval>>1)&1)*val1 + ((intval>>2)&1)*val2 + ...
BITDOT() can be useful for ranking when combined with the so-called FIELDMASK ranker function (discussed in a later chapter) that creates a bit set of matched fields. Another example is when you have object flags stored as bits and want to attach different weight boosts based on flag presence.

GEODIST(lat1,long1,lat2,long2)
This computes a geosphere distance between the two points defined by their latitudes and longitudes, using the WGS84 model. Coordinates need to be in radians and the resultant distance is in meters. Any of the four input coordinates can be an arbitrary expression, and Sphinx internally optimizes them when any of the coordinate pairs is constant.
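Two of the functions above lend themselves to a quick semantic sketch. The following Python toy models mirror the behavior described for INTERVAL() and BITDOT() (they are illustrations of the semantics, not Sphinx code):

```python
import bisect

def interval(expr, *points):
    # INTERVAL(): the bucket index expr falls into;
    # 0 when expr < points[0], 1 when points[0] <= expr < points[1], ...
    # points must be in ascending order
    return bisect.bisect_right(list(points), expr)

def bitdot(intval, *vals):
    # BITDOT(): sum of vals[i] for every bit i set in intval
    return sum(v for i, v in enumerate(vals) if intval & (1 << i))

print(interval(42, 30, 100, 300, 1000))  # 1: the 30..99 price bucket
print(bitdot(5, 10, 20, 30, 40))         # 40: bits 0 and 2 -> 10 + 30
```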
A very common application requirement is to narrow down search results: for instance, to display books published in the past 10 years, or retrieve friends’ blog posts, or list products available in the local neighborhood. With a text-only search engine that does not support storing user attributes and working with them, you would have to fetch all matches out and pick the ones you need in the application.

But in all honesty, that’s an approach coming from the Land of Slow, south of Inefficient Mountains, just north of Terminally Crawling River. This can be an especially unpleasant realization if it happens as you deploy from testing to production. A query seemingly works A-OK when there’s just a fistful of matches, but suddenly there are millions.

So, what you really want instead, and what Sphinx lets you easily do, is require the search server to filter the full-text matches based on a condition you specify. Not only does that save on sending gazillions of matches to the application, but it also empowers Sphinx to short-circuit searching as it goes. For one thing, documents that do not satisfy filtering criteria are not relevance-ranked, and relevance ranking is a rather expensive effort. The bottom line is that you should never filter on the application side. Always get the attribute data to Sphinx, and have Sphinx do it.
SphinxAPI exposes the following three calls to perform filtering:
function SetFilter ( $attribute, $values, $exclude=false )
function SetFilterRange ( $attribute, $min, $max, $exclude=false )
function SetFilterFloatRange ( $attribute, $min, $max, $exclude=false )
The SetFilter() call is the API’s primary filtering workhorse. It lets you perform equality or presence checks on integer attributes and MVAs. Specify the name of the attribute (or computed expression) to check in the attribute parameter, an array of reference constant values to check against in values, and an optional exclude flag to tell Sphinx whether to include or exclude matches that pass the check. (By default, the results are included.)
Here are a few specific examples that illustrate different kinds of SetFilter() syntax and the equivalent WHERE condition syntax in SphinxQL:
# equality check
$client->SetFilter ( "year", array(2001) );
SELECT ... WHERE year=2001

# non-equality check
$client->SetFilter ( "year", array(2001), true );
SELECT ... WHERE year<>2001

# in-set presence check
$client->SetFilter ( "year", array(1997,2001,2009) );
SELECT ... WHERE year IN (1997,2001,2009)

# in-set absence check
$client->SetFilter ( "year", array(1997,2001,2009), true );
SELECT ... WHERE year NOT IN (1997,2001,2009)
The other two calls, SetFilterRange() and SetFilterFloatRange(), let you do comparisons instead of just equality checks. That is, they check whether the attribute value falls in the allowed range (as opposed to occurring in an allowed set of values). They are very similar, the only difference being the expected type of the min and max arguments: SetFilterRange() expects only integers (either 32-bit or 64-bit), while SetFilterFloatRange() works with floating-point values. (Having two methods instead of one is mostly a host language restriction. If there were a reliable way to tell an integer value from a floating-point value in each and every popular scripting language that the API is ported to, a separate SetFilterFloatRange() method would be redundant.)
The attribute parameter is, again, the name of an attribute or an expression, and min and max are the allowed boundaries, inclusive. For instance, this is how you would check that a book was published in the 2000s and that its price is $50 or less:
$client->SetFilterRange ( "year", 2000, 2009 );
$client->SetFilterFloatRange ( "price_usd", 0, 50 );

SELECT ... WHERE year>=2000 AND year<=2009
    AND price_usd>=0 AND price_usd<=50
This example brings us to the question of what happens when you issue multiple SetFilter() calls. The answer is that all of them apply. In other words, all filters that you set via the API are ANDed together. There’s no way to OR filters.
Also, the filters get appended to any existing set of filters; they never replace previous filters. Therefore, this snippet (taken from real-world buggy code) would effectively never match anything:
$client->SetFilter ( "attr", array(1) );
$client->SetFilter ( "attr", array(2) );
As the two filters stack up, the snippet is equivalent to WHERE attr=1 AND attr=2, and that condition never holds true. If the developers simply intended to check that attr equals 2, they should have used just the second SetFilter() call. To check that attr equals either 1 or 2, enumerate all values in one call:
$client->SetFilter ( "attr", array(1,2) );
Enforced ANDing of filters can, at a glance, seem like a showstopper for queries that, say, need to match either top-rated or recent enough books. But, in fact, that’s a minor inconvenience at most; you still can do that. Remember that filters can be applied not just to the prerecorded document attributes, but to expressions computed on the fly as well. Hence, nothing prevents us from computing a condition expression and filtering on that:
$client->SetSelect ( "rating>=8.0 OR year>=2000 AS mycond" );
$client->SetFilter ( "mycond", array(1) );
That’s the “official” way to perform complex Boolean filtering via SphinxAPI. In classic SQL, the trick we just used is equivalent to this:
SELECT *, rating>=8.0 OR year>=2000 AS mycond FROM books WHERE mycond=1
And that’s exactly the syntax SphinxQL currently supports, too. However, it’s a bit clunky and pollutes the result set with a redundant mycond column that always equals 1. It would be cleaner to simply put the condition in the WHERE clause where it belongs:
SELECT * FROM books WHERE rating>=8.0 OR year>=2000
At the time of this writing, that syntax is not supported in SphinxQL, but eventually it will be. In the meantime (or if you are somehow locked to one of the earlier versions) you can always use the “filter on expression” approach shown before.
Going back to range filters, there’s another semisubtlety with the API methods that you may have spotted by now. Minimum and maximum range boundaries always come in pairs in API calls, and are inclusive. So, how would one check for a mere attr>3 condition via SphinxAPI? That depends on the attr type. When it’s an integer (we’re concentrating on integers because such precise boundary conditions don’t usually arise in floating-point situations), you can just replace “greater than 3” with “greater than or equal to 4”, and attach a redundant “is less than or equal to the maximum value of an integer” condition:
$client->SetFilterRange ( "attr", 4, 4294967295 );
Alternatively, you can go with the “filter on expressions” approach again.
An amazingly large number of questions in our current reality have more than just one answer, sometimes very many more, and search queries are no exception. And so—what mischievous little creatures they are!—they can very well return more than just one search result. Therefore, we can pose another question: how are those results to be sorted?
Oops, our very question also has more than one answer. The results of a search for a specific product are, beyond a doubt, to be sorted by price—ideally, shipping and handling and taxes included. Sorting news reports, on the other hand, should at least account for how recent the reports are, and if not, they should just sort by day posted. General web search results need to be sorted by relevance. And so on.
Shipping a million matches from Sphinx to the application does not magically become any less expensive just because we intend to sort them as opposed to filtering them, so Sphinx supports sorting on its side as well. The appropriate SphinxAPI method, dubbed SetSortMode(), comes with a few legacy modes of its own:
function SetSortMode ( $mode, $sortby="" )
The up-to-date approach to sorting is to use the SPH_SORT_EXTENDED mode and pass the sorting condition in its $sortby argument. That’s equivalent to an ORDER BY clause in SphinxQL:
$client->SetSortMode ( SPH_SORT_EXTENDED, "year DESC, @weight DESC" );

... ORDER BY year DESC, @weight DESC
One can use several attribute or expression names in this sorting condition, following each with a DESC or ASC order specification.
There are also five historic modes that can now be replaced with respective “extended” clauses:
- SPH_SORT_RELEVANCE
The default sorting mode. Sorts by relevance, with the most relevant documents first. Equivalent to:
ORDER BY @weight DESC, @id ASC
- SPH_SORT_ATTR_DESC
Sorts by the attribute specified in $sortby, in descending order. Equivalent to:
ORDER BY $sortby DESC, @id ASC
- SPH_SORT_ATTR_ASC
Sorts by the attribute specified in $sortby, in ascending order. Equivalent to:
ORDER BY $sortby ASC, @id ASC
- SPH_SORT_TIME_SEGMENTS
Sorts by a so-called time segment computed from an attribute specified in $sortby and the current time. Equivalent to:
SELECT *, INTERVAL($sortby, NOW()-90*86400, NOW()-30*86400, NOW()-7*86400, NOW()-86400, NOW()-3600) AS time_seg ... ORDER BY time_seg DESC, @weight DESC, @id ASC
Time segments were introduced to sort documents by a combination of freshness and relevance. They split matched documents into six buckets: documents posted less than one hour ago, one hour to one day ago, one day to one week ago, one week to 30 days ago, 30 days to 90 days ago, and more than 90 days ago. Matches are then sorted by bucket, and then by relevance within the bucket—so that documents posted in the past hour always rank higher than documents posted last week, but within that hour, day, or week, more relevant documents win.
With the advent of expressions, it’s now possible to replace that hardcoded time segment computation with an INTERVAL() call and customize the buckets to your liking.
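The bucketing itself is easy to picture with a toy Python sketch, assuming MySQL-style INTERVAL() semantics (which is what Sphinx’s INTERVAL() mirrors): the function returns how many of the ascending boundary points the value is greater than or equal to.

```python
import bisect

def interval(x, *points):
    # INTERVAL(x, p1..pN): 0 if x < p1, N if x >= pN (ascending points).
    return bisect.bisect_right(points, x)

NOW = 1_300_000_000  # a fixed "current" timestamp, for reproducibility
boundaries = (NOW - 90*86400, NOW - 30*86400, NOW - 7*86400,
              NOW - 86400, NOW - 3600)

fresh = interval(NOW - 600, *boundaries)      # posted 10 minutes ago
older = interval(NOW - 3*86400, *boundaries)  # posted 3 days ago
```

Sorting by this value DESC puts the freshest bucket first, exactly as SPH_SORT_TIME_SEGMENTS does.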
Just for the sake of completeness, there’s a fifth mode, SPH_SORT_EXPR, which lets you sort by a C function hardcoded at build time in sphinxcustomsort.inl. That mode was introduced before expressions to leave some room for “easy” customization, but now that runtime expressions are in place, the mode is highly unlikely to yield noticeably better performance; it is also very difficult to maintain, is deprecated, and is going to be removed some day. In short, never use it; always use runtime expressions instead.
And last but not least, on the subject of sorting, do you remember the max_matches setting that controls how many matches searchd keeps in memory? It’s important to understand that it has no effect on sorting. No matter what max_matches is set to, it’s guaranteed that the very best match (according to your sorting criteria) will be number 1; the next best will be number 2, and so on. So, when you sort by relevance, the most relevant document among the gazillion matched documents is always guaranteed to be number 1 in the result set, no matter whether max_matches is set to just 10, to 20,000, or to a gazillion.
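The guarantee holds because a bounded priority queue of the current K best matches never evicts a true top result: every incoming match either beats the worst retained row or can be discarded for good. A toy Python sketch of that idea (not Sphinx’s actual code):

```python
import heapq

def top_k(scores, k):
    # Keep only the best k scores seen so far -- the idea behind max_matches.
    heap = []  # min-heap holding the current best k
    for s in scores:
        if len(heap) < k:
            heapq.heappush(heap, s)
        elif s > heap[0]:
            heapq.heapreplace(heap, s)  # new match beats the worst kept one
    return sorted(heap, reverse=True)

scores = [5, 1, 9, 3, 7, 8, 2]
small = top_k(scores, 3)    # tight budget, like max_matches=3
large = top_k(scores, 100)  # effectively unlimited budget
```

Whatever the budget, the heads of the two result sets agree; only the tail gets truncated.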
The very first thing that springs to mind when talking of grouping, as in SQL’s GROUP BY clause, is the different kinds of reports—how many site users registered that year, how many sales we generated each week, the peak login hours throughout the past month, et cetera, ad infinitum. So, support for grouping on the search engine side might, at first glance, seem peculiar.
But a number of search-related tasks require grouping as well. What’s the average offer price for “ipod” within a 1-mile, 10-mile, or 100-mile radius? How briskly did people blog about Haiti every day last year? How do you display news entries in clusters centered on the same topic and date? The answers involve grouping “raw” matches by this or that, and just as with filtering and sorting, it’s generally much more efficient to have Sphinx process a million matches than drag them outside and work through them.
Grouping via SphinxAPI is provided by the following two calls:
function SetGroupBy ( $attr, $func, $groupsort="@groupby desc" )
function SetGroupDistinct ( $attr2 )
SetGroupBy() tells Sphinx to group rows by the value of a function $func taken from an attribute $attr, then sort the resultant grouped rows by a $groupsort condition. In SphinxQL, it’s equivalent to this clause:

GROUP BY $func($attr) ORDER BY $groupsort
SetGroupDistinct() makes Sphinx count the number of distinct values of attribute $attr2 while grouping by some other attribute. The equivalent SphinxQL is (you bet) adding COUNT(DISTINCT $attr2) to the SELECT expressions list.
The calls add a few magic columns to the result set. SetGroupBy() adds an @groupby column that contains the value used for grouping and an @count column that contains the number of rows in that group. SetGroupDistinct() returns the number of distinct $attr2 values in a magic @distinct column.
Here’s what the bundled test.php application displays when “just” searching and then when grouping (by a group_id attribute):
$ php test.php -i test1 test
Query 'test' retrieved 3 of 3 matches in 0.000 sec.
Query stats:
'test' found 5 times in 3 documents

Matches:
1. doc_id=1, weight=101, group_id=1, date_added=2011-01-02 18:39:13
2. doc_id=2, weight=101, group_id=1, date_added=2011-01-02 18:39:13
3. doc_id=4, weight=1, group_id=2, date_added=2011-01-02 18:39:13

$ php test.php -i test1 -g group_id test
Query 'test' retrieved 2 of 2 matches in 0.000 sec.
Query stats:
'test' found 5 times in 3 documents

Matches:
1. doc_id=4, weight=1, group_id=2, date_added=2011-01-02 18:39:13, @groupby=2, @count=1
2. doc_id=1, weight=101, group_id=1, date_added=2011-01-02 18:39:13, @groupby=1, @count=2
You can see how, with grouping enabled, we’re getting two groups instead of three documents now, along with associated group-by key and counts.
But hey, why aren’t the results in relevance (weight) order anymore?
Remember that the SetGroupBy() call maps to both GROUP BY and ORDER BY clauses in SphinxQL. And, by default, it chooses to order the groups that now make it into the result set by the group-by key. However, the match sorting (the one from the previous section, which can be set with the SetSortMode() call and defaults to relevance order) does not get left out either. When grouping is in effect, it’s used to pick the one row that will represent the group in the final result set. In SQL terms:
When SetGroupBy() is enabled, SetSortMode() is equivalent to the WITHIN GROUP ORDER BY clause in the SphinxQL dialect. Otherwise, SetSortMode() is equivalent to the ORDER BY clause.
WITHIN GROUP ORDER BY is a SphinxQL-specific extension. The SQL standard does not specify which representative row to pick for a group, and does not introduce any syntax to control the choice. So technically, an SQL database can even return a random row every time you repeat the query without breaking the standard. Our extension lets you specify which row to choose. For instance, as the default sorting mode is essentially @weight DESC, @id ASC, the most relevant row within each group will be picked by default. (If more than one row has the same top relevance value, the document with the smaller document ID wins. This explains why document 1 was returned in the earlier example.) But you can override that and, say, pick the most recently added row instead:
SELECT * FROM test1 GROUP BY group_id WITHIN GROUP ORDER BY date_added DESC ORDER BY @weight DESC
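The representative-row semantics can be made concrete with a small Python sketch over made-up rows; the within_key callable plays the role of the WITHIN GROUP ORDER BY clause:

```python
# Made-up rows: (doc_id, weight, group_id, date_added)
rows = [(1, 101, 1, 100), (2, 101, 1, 200), (4, 1, 2, 300)]

def group_by(rows, key, within_key):
    # Keep, per group, the row that sorts first by within_key.
    best = {}
    for row in rows:
        g = key(row)
        if g not in best or within_key(row) < within_key(best[g]):
            best[g] = row
    return best

# WITHIN GROUP ORDER BY @weight DESC, @id ASC (the default):
by_weight = group_by(rows, lambda r: r[2], lambda r: (-r[1], r[0]))
# WITHIN GROUP ORDER BY date_added DESC:
by_date = group_by(rows, lambda r: r[2], lambda r: -r[3])
```

With the default key, group 1 is represented by document 1 (weights tie, so the smaller ID wins); with the date key, document 2 wins instead.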
Unobtrusively switching from SphinxAPI to SphinxQL now, grouping isn’t only about removing “duplicate” rows that share a common value in a given column; it’s also about computing aggregate functions over such groups of rows. Two examples we’ve covered are COUNT(*) and COUNT(DISTINCT attr), but Sphinx supports more. Currently, you can use MIN(), MAX(), AVG(), and SUM(), which covers the ANSI SQL’92 standard. A few more sophisticated aggregate functions (e.g., bitwise operations, standard deviation and variance, etc.) may be added in the future, so refer to the current documentation for the most recent list.
SELECT *, AVG(price) FROM products WHERE MATCH('ipod') GROUP BY city_id
An important side note is that Sphinx’s current implementation of aggregate functions can be intentionally imprecise in favor of performance and constrained RAM use. The degree of precision depends on the value of the max_matches option. Sphinx will only keep track of the 4*max_matches best groups (as per the sorting criteria) at all times. Thus, when the result set contains fewer than that many groups, the aggregate values are guaranteed to be precise. Otherwise, the values can be (slightly) off.