Chapter 3. Advanced Indexing
So far, we’ve taken a black-box approach to Ferret. This chapter explains what is really going on during indexing and, in the process, explains how to tune your index for maximum performance. We conclude by explaining how locking works. It is crucial that you understand this, particularly if you want to run Ferret in a multithreaded or multiprocess environment.
How the Indexing Process Works
We are now going to show how a source document—such as an HTML
document from the Web, a row from a database, or an image from your
personal image collection—becomes a Ferret document stored in the index.
Ferret is agnostic about the source document’s type. It doesn’t matter
whether you are indexing an MP3 file, a text document, or your store’s
products; Ferret treats each one as a collection of string fields. So, the
first step is to turn source documents into Documents. This is pretty easy
with plain-text documents. With other text document types, such as PDF or
HTML, you’ll need to write a parser/reader that extracts the searchable
text from the documents. For an image file, you might have a parser that
extracts EXIF tags. Database rows usually map pretty easily to Documents.
See Chapter 6 for a framework for doing exactly this.
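To make the "collection of string fields" idea concrete, here is a minimal sketch in plain Ruby of mapping a database row to a set of string fields. It does not call Ferret at all; the helper name row_to_document and the sample row are hypothetical, chosen only for illustration.

```ruby
# Hypothetical helper (not part of Ferret): turn a database row, given as a
# Hash of column name => value, into a flat Hash of string fields.
def row_to_document(row)
  row.each_with_object({}) do |(column, value), doc|
    next if value.nil?               # skip empty columns
    doc[column.to_sym] = value.to_s  # every field value becomes a string
  end
end

# A made-up product row, as it might come back from a database driver.
row = { "id" => 42, "name" => "Garden Hose", "price" => 19.99, "notes" => nil }
doc = row_to_document(row)
```

A field hash of this shape is the kind of object you would then hand to Ferret for indexing. Whether you pass a plain hash or build a Document explicitly depends on whether you need per-document settings, so treat this as a starting point rather than the required form.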
Once you have a Document, you add it to an IndexWriter. This is where the
magic begins. The Document’s fields are passed through an analyzer (if
they are set to be tokenized) that breaks up the fields into searchable
tokens however it sees fit (see ...