Chapter 5. Analysis
Analysis is the foundation of any search library. It is the process of taking an input field and breaking it up into tokens to be added to the inverted index. So, why did we wait until now to cover this important subject? Most of the time, Ferret’s standard analyzer will do exactly what you need it to do. However, when it doesn’t, Ferret’s analysis API is very easy to extend to your needs. To understand the analysis API, you need to know about three classes:
Token
TokenStream
Analyzer
Token
The Token is the basic datatype in analysis. It is essentially a Struct with four attributes:
Text
Start offset
End offset
Position increment
The text attribute is a String holding the token’s text. Ferret allows tokens up to 255 bytes long; anything longer is truncated to that length.
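To make the shape of a token concrete, here is a minimal sketch in plain Ruby. It does not use Ferret itself: SketchToken, make_token, and MAX_TOKEN_BYTES are hypothetical names invented for illustration, and only the four attributes and the 255-byte limit come from the description above.

```ruby
# Hypothetical stand-in for a token; this is NOT Ferret's own class.
SketchToken = Struct.new(:text, :start_offset, :end_offset, :pos_inc)

# The 255-byte limit described in the text above.
MAX_TOKEN_BYTES = 255

# Build a token, truncating over-long text to the byte limit.
# Note: byteslice cuts on a byte boundary, so multibyte text could be
# split mid-character; this sketch only handles ASCII safely.
def make_token(text, start_offset, end_offset, pos_inc)
  SketchToken.new(text.byteslice(0, MAX_TOKEN_BYTES), start_offset, end_offset, pos_inc)
end

# The position increment of 1 is a placeholder value for this sketch.
short = make_token("old", 4, 7, 1)
long  = make_token("x" * 300, 0, 300, 1)

puts short.text          # the text is kept as-is
puts long.text.bytesize  # the text has been cut down to the limit
```

The offsets are left untouched when the text is truncated, so in this sketch a truncated token still spans its full extent in the original field.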
The start and end offsets hold the byte positions of the start and end of the token in the original field, with the end offset pointing to the byte immediately after the last byte in the token. For example, in the string “The Old Man and the Sea”, the “Old” token has a start offset of 4 and an end offset of 7. The difference between the start offset and the end offset is usually equal to the length of the token’s text, but not always. For example, Ferret’s standard analyzer strips possessives (’s). In the field “Jamie’s Kitchen”, for instance, the first token will be “Jamie” but the start and end offsets will be 0 and 7, respectively, also encompassing the possessive “’s”. This makes ...
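The offset arithmetic in the two examples above can be checked in plain Ruby. This is only a sketch: byte_offsets is a hypothetical helper, not part of Ferret, and the possessive is written with an ASCII apostrophe so that it occupies a single byte.

```ruby
# Byte offsets of the first occurrence of `word` in `field`.
# Returns [start, end], where end is the byte just past the match,
# or nil when the word does not occur.
def byte_offsets(field, word)
  char_index = field.index(word)
  return nil if char_index.nil?
  # Convert the character index to a byte index.
  start_offset = field[0, char_index].bytesize
  [start_offset, start_offset + word.bytesize]
end

old = byte_offsets("The Old Man and the Sea", "Old")
p old  # => [4, 7]

# Possessive case: the token text is "Jamie", but its offsets
# cover "Jamie's" in the original field.
field = "Jamie's Kitchen"
token_text = "Jamie"
span_start, span_end = byte_offsets(field, "Jamie's")
covered = field.byteslice(span_start, span_end - span_start)

p [span_start, span_end]  # => [0, 7]
p covered                 # => "Jamie's"
p token_text.bytesize     # => 5
```

Here the span is 7 bytes long while the token text is only 5, which is the mismatch between offsets and text length that the paragraph describes.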