As of Version 3.23.23, MySQL has support for full-text indexing and searching. Full-text indexes in MySQL are an index of type FULLTEXT. FULLTEXT indexes can be created from VARCHAR and TEXT columns at CREATE TABLE time or added later with ALTER TABLE or CREATE INDEX. For large datasets, it will be much faster to load your data into a table that has no FULLTEXT index, then create the index with ALTER TABLE (or CREATE INDEX). Loading data into a table that already has a FULLTEXT index will be slower.
Full-text searching is performed with the MATCH( ) function:
mysql> CREATE TABLE articles ( -> id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY, -> title VARCHAR(200), -> body TEXT, -> FULLTEXT (title,body) -> ); Query OK, 0 rows affected (0.00 sec) mysql> INSERT INTO articles VALUES -> (0,'MySQL Tutorial', 'DBMS stands for DataBase ...'), -> (0,'How To Use MySQL Efficiently', 'After you went through a ...'), -> (0,'Optimising MySQL','In this tutorial we will show ...'), -> (0,'1001 MySQL Trick','1. Never run mysqld as root. 2. ...'), -> (0,'MySQL vs. YourSQL', 'In the following database comparison ...'), -> (0,'MySQL Security', 'When configured properly, MySQL ...'); Query OK, 6 rows affected (0.00 sec) Records: 6 Duplicates: 0 Warnings: 0 mysql> SELECT * FROM articles -> WHERE MATCH (title,body) AGAINST ('database'); +----+-------------------+------------------------------------------+ | id | title | body | +----+-------------------+------------------------------------------+ | 5 | MySQL vs. YourSQL | In the following database comparison ... | | 1 | MySQL Tutorial | DBMS stands for DataBase ... | +----+-------------------+------------------------------------------+ 2 rows in set (0.00 sec)
The MATCH( ) function performs a natural-language search for a string against a text collection (a set of one or more columns included in a FULLTEXT index). The search string is given as the argument to AGAINST( ). The search is performed in case-insensitive fashion. For every row in the table, MATCH( ) returns a relevance value—that is, a similarity measure between the search string and the text in that row in the columns named in the MATCH( ) list.
When MATCH( ) is used in a WHERE clause (see the previous example) the rows returned are automatically sorted with highest relevance first. Relevance values are non-negative floating-point numbers. Zero relevance means no similarity. Relevance is computed based on the number of words in the row, the number of unique words in that row, the total number of words in the collection, and the number of documents (rows) that contain a particular word.
It is also possible to perform a boolean mode search. This is explained later in this section.
The preceding example is a basic illustration showing how to use the MATCH( ) function. Rows are returned in order of decreasing relevance.
The next example shows how to retrieve the relevance values explicitly. As neither WHERE nor ORDER BY clauses are present, returned rows are not ordered.
mysql> SELECT id,MATCH (title,body) AGAINST ('Tutorial') FROM articles; +----+-----------------------------------------+ | id | MATCH (title,body) AGAINST ('Tutorial') | +----+-----------------------------------------+ | 1 | 0.64840710366884 | | 2 | 0 | | 3 | 0.66266459031789 | | 4 | 0 | | 5 | 0 | | 6 | 0 | +----+-----------------------------------------+ 6 rows in set (0.00 sec)
The following example is more complex. The query returns the relevance and still sorts the rows in order of decreasing relevance. To achieve this result, you should specify MATCH( ) twice. This will cause no additional overhead because the MySQL optimiser will notice that the two MATCH( ) calls are identical and invoke the full-text search code only once.
mysql> SELECT id, body, MATCH (title,body) AGAINST -> ('Security implications of running MySQL as root') AS score -> FROM articles WHERE MATCH (title,body) AGAINST -> ('Security implications of running MySQL as root'); +----+-------------------------------------+-----------------+ | id | body | score | +----+-------------------------------------+-----------------+ | 4 | 1. Never run mysqld as root. 2. ... | 1.5055546709332 | | 6 | When configured properly, MySQL ... | 1.31140957288 | +----+-------------------------------------+-----------------+ 2 rows in set (0.00 sec)
MySQL uses a very simple parser to split text into words. A “word”
is any sequence of characters consisting of letters, numbers, ',
and _
. Any “word” that is present in the stopword list or is just
too short (3 characters or less) is ignored.
Every correct word in the collection and in the query is weighted according to its significance in the query or collection. This way, a word that is present in many documents will have lower weight (and may even have a zero weight) because it has lower semantic value in this particular collection. Otherwise, if the word is rare, it will receive a higher weight. The weights of the words are then combined to compute the relevance of the row.
Such a technique works best with large collections (in fact, it was carefully tuned this way). For very small tables, word distribution does not reflect adequately their semantic value, and this model may sometimes produce bizarre results.
mysql> SELECT * FROM articles WHERE MATCH (title,body) AGAINST ('MySQL'); Empty set (0.00 sec)
The search for the word MySQL produces no results in the previous example because that word is present in more than half the rows. As such, it is effectively treated as a stopword (that is, a word with zero semantic value). This is the most desirable behavior—a natural language query should not return every second row from a 1GB table.
A word that matches half of the rows in a table is less likely to locate relevant documents. In fact, it will most likely find plenty of irrelevant documents. We all know this happens far too often when we are trying to find something on the Internet with a search engine. It is with this reasoning that such rows have been assigned a low semantic value in this particular dataset.
As of Version 4.0.1, MySQL can also perform boolean full-text searches using the IN BOOLEAN MODE modifier.
mysql> SELECT * FROM articles WHERE MATCH (title,body) -> AGAINST ('+MySQL -YourSQL' IN BOOLEAN MODE); +----+------------------------------+-------------------------------------+ | id | title | body | +----+------------------------------+-------------------------------------+ | 1 | MySQL Tutorial | DBMS stands for DataBase ... | | 2 | How To Use MySQL Efficiently | After you went through a ... | | 3 | Optimising MySQL | In this tutorial we will show ... | | 4 | 1001 MySQL Trick | 1. Never run mysqld as root. 2. ... | | 6 | MySQL Security | When configured properly, MySQL ... | +----+------------------------------+-------------------------------------+
This query retrieved all the rows that contain the word MySQL (note: the 50% threshold is not used), but that do not contain the word YourSQL. Note that a boolean mode search does not auto-magically sort rows in order of decreasing relevance. You can see this from the result of the preceding query, where the row with the highest relevance (the one that contains MySQL twice) is listed last, not first. A boolean full-text search can also work even without a FULLTEXT index, although it would be slow.
The boolean full-text search capability supports the following operators:
- +
A leading plus sign indicates that this word must be present in every row returned.
- -
A leading minus sign indicates that this word must not be present in any row returned.
By default (when neither plus nor minus is specified) the word is optional, but the rows that contain it will be rated higher. This mimics the behaviour of MATCH( ) ... AGAINST( ) without the IN BOOLEAN MODE modifier.
- < >
These two operators are used to change a word’s contribution to the relevance value that is assigned to a row. The < operator decreases the contribution and the > operator increases it.
- ( )
Parentheses are used to group words into subexpressions.
- ~
A leading tilde acts as a negation operator, causing the word’s contribution to the row relevance to be negative. It’s useful for marking noise words. A row that contains such a word will be rated lower than others, but will not be excluded altogether, as it would be with the - operator.
- *
An asterisk is the truncation operator. Unlike the other operators, it should be appended to the word, not prepended.
- "
The phrase, that is enclosed in double quotes ", matches only rows that contain this phrase literally, as it was typed.
And here are some examples:
- apple banana
find rows that contain at least one of these words.
- +apple +juice
... both words.
- +apple macintosh.
... word “apple”, but rank it higher if it also contain “macintosh”.
- +apple -macintosh
... word “apple” but not “macintosh”.
- +apple +(>pie <strudel)
... “apple” and “pie”, or “apple” and “strudel” (in any order), but rank “apple pie” higher than “apple strudel”.
- apple*
... “apple”, “apples”, “applesauce”, and “applet”.
- "some words"
... “some words of wisdom”, but not “some noise words”.
All parameters to the MATCH( ) function must be columns from the same table that is part of the same FULLTEXT index, unless the MATCH( ) is IN BOOLEAN MODE.
The MATCH( ) column list must exactly match the column list in some FULLTEXT index definition for the table, unless this MATCH( ) is IN BOOLEAN MODE.
The argument to AGAINST( ) must be a constant string.
Unfortunately, full-text search has few user-tunable parameters yet, although adding some is very high on the TODO. If you have a MySQL source distribution (see Section 2.3), you can exert more control over full-text searching behavior.
Note that full-text search was carefully tuned for the best searching effectiveness. Modifying the default behavior will, in most cases, only make the search results worse. Do not alter the MySQL sources unless you know what you are doing!
The minimum length of words to be indexed is defined by the MySQL variable ft_min_word_length. See Section 4.5.6.4. Change it to the value you prefer, and rebuild your FULLTEXT indexes.
The stopword list is defined in
myisam/ft_static.c
Modify it to your taste, recompile MySQL, and rebuild your FULLTEXT indexes.The 50% threshold is determined by the particular weighting scheme chosen. To disable it, change the following line in
myisam/ftdefs.h
:#define GWS_IN_USE GWS_PROB
to:
#define GWS_IN_USE GWS_FREQ
Then recompile MySQL. There is no need to rebuild the indexes in this case. Note: by doing this you severely decrease MySQL’s ability to provide adequate relevance values for the MATCH( ) function. If you really need to search for such common words, it would be better to search using IN BOOLEAN MODE instead, which does not observe the 50% threshold.
Sometimes the search engine maintainer would like to change the operators used for boolean full-text searches. These are defined by the ft_boolean_syntax variable. See Section 4.5.6.4. Still, this variable is read-only, so its value is set in
myisam/ft_static.c
.
Make all operations with FULLTEXT index faster.
Proximity operators.
Support for “always-index words”. They could be any strings the user wants to treat as words. Examples are “C++”, “AS/400”, “TCP/IP”, etc.
Support for full-text search in MERGE tables.
Support for multi-byte charsets.
Make stopword list to depend on the language of the data.
Stemming (dependent on the language of the data, of course).
Generic user-suppliable UDF preparser.
Make the model more flexible (by adding some adjustable parameters to FULLTEXT in CREATE/ALTER TABLE).
Get MySQL Reference Manual now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.