With the explosion of the Internet, a huge amount of information has become available to us. But it doesn’t matter how much information is available if we can’t find what we are looking for. Luckily, companies like Google and Yahoo! have come to the rescue by helping us find the information we need with their search engines.
More recently, the same thing has been happening on our personal computers. More and more of our personal lives are being stored on hard drives—everything from work documents and email to multimedia files and family photos. Carefully categorizing all this data and scanning through large hierarchies of folders just doesn’t cut it anymore. We need a fast way to access the data we need. Presently, some of the tools commonly used for this task, such as the built-in search in Windows, leave a lot to be desired. Spotlight on OS X is much closer to what we need.
By the end of this book, you’ll have built a search application that will make searching your hard drive as easy as searching the Web. In this section, we start with plain old text files. Let’s begin by writing a command-line indexing program that takes two arguments: the name of the directory we want to index, and the name of the directory in which the index will be stored. Take a look at Example 1-2.
Example 1-2. index.rb
 0  #!/usr/bin/env ruby
 1  require 'rubygems'
 2  require 'ferret'
 3  require 'fileutils'
 4  include Ferret
 5  include Ferret::Index
 6
 7  def usage(message = nil)
 8    puts message if message
 9    puts "ruby #{File.basename(__FILE__)} <data dir> <index dir>"
10    exit(1)
11  end
12
13  usage() if ARGV.size != 2
14  usage("Directory '#{ARGV[0]}' doesn't exist.") unless File.directory?(ARGV[0])
15  $data_dir, $index_dir = ARGV
16  begin
17    FileUtils.mkdir_p($index_dir)
18  rescue
19    usage("Can't create index directory '#$index_dir'.")
20  end
21
22  index = Index.new(:path => $index_dir,
23                    :create => true)
24
25  Dir["#$data_dir/**/*.txt"].each do |file_name|
26    index << {:file_name => file_name, :content => File.read(file_name)}
27  end
28  index.optimize()
29  index.close()
Most of this code handles command-line arguments and can be safely skimmed. The interesting part begins on line 22, where we create the index. The :path parameter specifies where the index is stored. Setting the :create parameter to true tells Ferret to create a new index in the specified directory. Any index already residing in that directory will be overwritten, so be careful when setting :create to true.
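Because :create => true overwrites whatever index is already in the directory, a small guard can be useful before calling Index.new. The following is a minimal sketch, not part of Ferret: safe_to_create? and index_options are hypothetical helper names, and treating any non-empty directory as a possible existing index is an assumption made for illustration.

```ruby
# Hypothetical helper (not part of Ferret): true only when the directory
# is missing or empty, so :create => true cannot destroy anything.
def safe_to_create?(index_dir)
  !Dir.exist?(index_dir) || Dir.empty?(index_dir)
end

# Hypothetical helper: builds the options Hash passed to Index.new in
# index.rb, refusing to overwrite a non-empty directory unless forced.
def index_options(index_dir, force = false)
  unless force || safe_to_create?(index_dir)
    raise ArgumentError, "'#{index_dir}' is not empty; pass force to overwrite"
  end
  { :path => index_dir, :create => true }
end
```

With these helpers, line 22 of index.rb could become Index.new(index_options($index_dir)), so an accidental rerun against the wrong directory fails loudly instead of silently replacing an index.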
Once the index is created, we need to add documents to it. Line 25 scans the directory tree for all text files. Line 26 is where most of the action happens. Although simple Strings can be added to an index, here we use a Hash because we want each document to have two fields: a :file_name field and a :content field. Later, we’ll learn about the Document class, which lets us assign weightings (or boosts, as they are known in Ferret) to documents and fields.
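To see exactly what lines 25 and 26 hand to the index, it can help to build the document Hashes on their own, without Ferret involved. This is a sketch only: documents_for is a hypothetical helper name, and unlike index.rb it holds every file's content in memory at once, which is reasonable only for small collections.

```ruby
# Hypothetical helper: returns one Hash per .txt file found anywhere
# under data_dir, using the same two fields as index.rb.
def documents_for(data_dir)
  # '**' matches zero or more directory levels, so files directly in
  # data_dir are included as well as files in subdirectories.
  Dir[File.join(data_dir, '**', '*.txt')].sort.map do |file_name|
    { :file_name => file_name, :content => File.read(file_name) }
  end
end
```

Each Hash returned here has the same shape as the one appended with index << on line 26, so printing the :file_name values is a quick way to check which files would be indexed before running the real indexer.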
The Index#optimize method is called on line 28. This method optimizes the index for searching, and it is a good idea to call it after any batch indexing.[1] On the following line, we close the index. Index#close makes sure that any data held in RAM is flushed to the index. It then commits the index and releases any locks that the Index object might be holding on it.
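Since closing the index is what flushes data and releases locks, it is worth making sure the close happens even if indexing raises partway through (an unreadable file, for instance). One way to sketch that is a small wrapper built on Ruby's ensure. The name with_index is hypothetical, and the only assumption about the object passed in is that it responds to close, as the Index in index.rb does.

```ruby
# Hypothetical helper: yields the index to the block and always closes
# it afterwards, whether the block finishes normally or raises.
def with_index(index)
  yield index
ensure
  index.close
end

# Possible use in index.rb, replacing lines 22 through 29:
#
#   with_index(Index.new(:path => $index_dir, :create => true)) do |index|
#     Dir["#$data_dir/**/*.txt"].each do |file_name|
#       index << {:file_name => file_name, :content => File.read(file_name)}
#     end
#     index.optimize()
#   end
```

The wrapper returns the block's value and lets any exception propagate after the index has been closed.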
Creating an index is now simply a matter of running the indexer from the command line:
dave$ ruby index.rb text_files/ index_dir/
Now that we have an index, we need to be able to search it. That is why we built it, after all. The search code is as simple as the indexing code; take a look at Example 1-3.
Example 1-3. search.rb
 0  #!/usr/bin/env ruby
 1  require 'rubygems'
 2  require 'ferret'
 3  require 'fileutils'
 4  include Ferret
 5  include Ferret::Index
 6
 7  def usage(message = nil)
 8    puts message if message
 9    puts "ruby #{File.basename(__FILE__)} <index dir> <search phrase>"
10    exit(1)
11  end
12
13  usage() if ARGV.size != 2
14  usage("Index '#{ARGV[0]}' doesn't exist.") unless File.directory?(ARGV[0])
15  $index_dir, $search_phrase = ARGV
16
17  index = Index.new(:path => $index_dir)
18
19  results = []
20  total_hits = index.search_each($search_phrase) do |doc_id, score|
21    results << "#{score} - #{index[doc_id][:file_name]}"
22  end
23
24  puts "#{total_hits} matched your query:\n" + results.join("\n")
25
26  index.close()
On line 21 we simply write each result to a string. The document ID is used to look up the document in the index, and the document itself acts like a Hash object. If you would like to build an index of a large number of text files, check out Project Gutenberg (http://www.gutenberg.org/). Go ahead and try out the search script:
dave$ ruby search.rb index_dir/ "Moby Dick"
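The raw scores written on line 21 are floating-point numbers of varying length, which can make the output hard to scan. As a rough sketch of one way to tidy it up, the results could be collected as [score, file_name] pairs and formatted afterwards. Here format_hits is a hypothetical helper, and the sample scores and paths are made up for illustration rather than taken from a real search.

```ruby
# Hypothetical helper: turns [score, file_name] pairs into display
# lines with three decimal places, highest score first.
def format_hits(hits)
  hits.sort_by { |score, _file_name| -score }.map do |score, file_name|
    format('%.3f - %s', score, file_name)
  end
end

# Made-up sample data standing in for what search_each would yield.
hits = [[0.25, 'text_files/b.txt'], [0.875, 'text_files/a.txt']]
puts format_hits(hits)
# prints:
# 0.875 - text_files/a.txt
# 0.250 - text_files/b.txt
```

In search.rb, the pairs could be gathered inside the block on line 21 with results << [score, index[doc_id][:file_name]] and passed to the helper before printing.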
[1] When doing incremental indexing, as you might do in a Rails application, it is better not to call the optimize method. You’ll learn more about this in the “Optimizing the Index” section in Chapter 3.