Chapter 5. Analysis

Analysis is the foundation of any search library. It is the process of taking an input field and breaking it up into tokens to be added to the inverted index. So, why did we wait until now to cover this important subject? Most of the time, Ferret’s standard analyzer will do exactly what you need it to do. When it doesn’t, however, Ferret’s analysis API is easy to extend to suit your needs. To understand the analysis API, you need to know about three classes:

  • Token

  • TokenStream

  • Analyzer

Token

The Token is the basic datatype in analysis. It is essentially a Struct with four attributes:

  • Text

  • Start offset

  • End offset

  • Position increment

The text attribute is a String holding the token’s text. Ferret allows tokens up to 255 bytes long. Any text longer than that is truncated to 255 bytes.
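As a rough illustration of the four attributes and the length limit described above, a token can be modeled with a plain Ruby Struct. The names SketchToken, make_token, and MAX_TOKEN_BYTES are assumptions made for this sketch and are not Ferret's own Token class. The truncation here simply cuts at a byte boundary, which may differ from how Ferret treats multibyte text.

```ruby
# Hypothetical stand-in for a token; NOT Ferret's actual Token class.
SketchToken = Struct.new(:text, :start_offset, :end_offset, :pos_inc)

# Limit taken from the text above; the constant name is an assumption.
MAX_TOKEN_BYTES = 255

# Build a token, truncating its text to at most MAX_TOKEN_BYTES bytes.
# The position increment defaults to 1 in this sketch.
def make_token(text, start_offset, end_offset, pos_inc = 1)
  truncated = text.byteslice(0, MAX_TOKEN_BYTES)
  SketchToken.new(truncated, start_offset, end_offset, pos_inc)
end

short_token = make_token("Old", 4, 7)
long_token  = make_token("a" * 300, 0, 300)

puts short_token.text          # prints "Old"
puts long_token.text.bytesize  # prints 255
```

Note that in this sketch the offsets of the long token are left untouched, so they still describe the full 300-byte span in the original field even though the stored text is shorter.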

The start and end offsets hold the byte positions of the start and end of the token in the original field, with the end offset pointing to the byte immediately after the last byte in the token. For example, in the string “The Old Man and the Sea”, the “Old” token has a start offset of 4 and an end offset of 7. The difference between the start offset and the end offset is usually equal to the length of the token’s text, but not always. For example, Ferret’s standard analyzer strips possessives (’s). In the field “Jamie’s Kitchen”, the first token will be “Jamie”, but its start and end offsets will be 0 and 7, respectively, because they also encompass the possessive “’s”. This makes ...
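The offset arithmetic in the two examples above can be checked with a small pure-Ruby sketch. It is only a simplified whitespace tokenizer with possessive stripping, not Ferret's standard analyzer. OffsetToken and sketch_tokens are hypothetical names introduced for illustration. The sketch uses a plain ASCII apostrophe so that byte offsets match the numbers in the text; a typographic apostrophe occupies more than one byte in UTF-8 and would shift them.

```ruby
# Hypothetical token struct for illustration; NOT Ferret's Token class.
OffsetToken = Struct.new(:text, :start_offset, :end_offset)

# Split a field on whitespace, strip a trailing possessive ('s) from the
# token text, and keep byte offsets that span the original, unstripped word.
def sketch_tokens(field)
  tokens = []
  field.scan(/\S+/) do |word|
    match = Regexp.last_match
    start_offset = match.pre_match.bytesize     # bytes before the word
    end_offset   = start_offset + word.bytesize # byte just past the word
    text = word.sub(/'s\z/, "")                 # drop the possessive, if any
    tokens << OffsetToken.new(text, start_offset, end_offset)
  end
  tokens
end

old = sketch_tokens("The Old Man and the Sea")[1]
p [old.text, old.start_offset, old.end_offset]        # prints ["Old", 4, 7]

jamie = sketch_tokens("Jamie's Kitchen").first
p [jamie.text, jamie.start_offset, jamie.end_offset]  # prints ["Jamie", 0, 7]
```

In the second case the token text is five bytes long while the offsets span seven bytes, which is the mismatch between text length and offset difference that the paragraph describes.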
