Summary: in this tutorial, you will learn how to use the MySQL ngram full-text parser to support full-text searches for ideographic languages such as Chinese, Japanese, and Korean.
Introduction to MySQL ngram full-text parser
The built-in MySQL full-text parser determines the start and end of words using white space as delimiters.
However, in the case of ideographic languages such as Chinese, Japanese, and Korean, a limitation arises as these languages do not use word delimiters.
To address this issue, MySQL provides the ngram full-text parser.
Starting from version 5.7.6, MySQL includes the ngram full-text parser as a built-in server plugin, which means that MySQL loads this plugin automatically when the database server starts.
MySQL supports the ngram full-text parser for both InnoDB and MyISAM storage engines.
By definition, an ngram is a contiguous sequence of characters from a text sequence. The primary function of the ngram full-text parser is to tokenize a text sequence into contiguous n-character sequences.
The following illustrates how the ngram full-text parser tokenizes the text mysql for different values of n:
n = 1: 'm', 'y', 's', 'q', 'l'
n = 2: 'my', 'ys', 'sq', 'ql'
n = 3: 'mys', 'ysq', 'sql'
n = 4: 'mysq', 'ysql'
n = 5: 'mysql'
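The listing above can be reproduced with a few lines of Python. This is a minimal sketch of the sliding-window idea only, not MySQL's implementation; the function name ngrams is a hypothetical helper chosen for illustration.

```python
def ngrams(text: str, n: int) -> list[str]:
    """Return every contiguous n-character slice of text, left to right."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A text of length L has L - n + 1 windows of width n (none if n > L).
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Print the tokens of 'mysql' for n = 1 through 5.
for n in range(1, 6):
    print(f"n = {n}: {ngrams('mysql', n)}")
```

Because the window slides one character at a time, a text of length L yields L - n + 1 tokens, and no tokens at all when n is larger than L.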
Creating FULLTEXT indexes with ngram parser
To create a FULLTEXT index that uses the ngram full-text parser, add the WITH PARSER ngram clause to the CREATE TABLE, ALTER TABLE, or CREATE INDEX statement. Consider the following example:
First, create a new database called test and select it:
CREATE DATABASE test;
USE test;
Second, create a new table called posts with a full-text index that includes the title and body columns:
CREATE TABLE posts (
    id INT PRIMARY KEY AUTO_INCREMENT,
    title VARCHAR(255),
    body TEXT,
    FULLTEXT (title, body) WITH PARSER ngram
);
Third, set the client character set to utf8mb4 using the SET NAMES statement:
SET NAMES utf8mb4;
Fourth, insert two rows into the posts table:
INSERT INTO posts(title,body)
VALUES('MySQL全文搜索','MySQL提供了具有许多好的功能的内置全文搜索'),
('MySQL教程','学习MySQL快速,简单和有趣');
Finally, show how the ngram parser tokenizes the text using the following statements:
SET GLOBAL innodb_ft_aux_table = 'test/posts';

SELECT
    *
FROM
    information_schema.innodb_ft_index_cache
ORDER BY
    doc_id,
    position;
This query is valuable for troubleshooting purposes. For instance, if a word is not included in the search results, it may not be indexed due to being a stopword or for some other reason.
Setting ngram token size
As seen in the previous example, the default token size (n) for ngram is 2. To change the token size, use the ngram_token_size configuration option, which accepts values between 1 and 10.
Typically, you set ngram_token_size to the size of the largest token that you want to search for. Note that a smaller token size produces a smaller full-text search index and allows faster searches.
Because ngram_token_size is a read-only variable, you can set its value in only one of two ways:
First, set the ngram_token_size when the database server starts:
mysqld --ngram_token_size=1
Second, set the ngram_token_size in the configuration file:
[mysqld]
ngram_token_size=1
ngram parser phrase search
MySQL converts a phrase search into an ngram phrase search. For example, with a token size of 2, "abc" is converted into "ab bc", which returns documents that contain "ab bc" and "abc".
The following example shows you how to search for the phrase 搜索 in the posts table:
SELECT
    id, title, body
FROM
    posts
WHERE
    MATCH (title, body) AGAINST ('搜索');
Processing search results with ngram
Natural language mode
In NATURAL LANGUAGE MODE searches, the search term is converted to a union of ngram terms. For example, with a token size of 2 (bigram), the search term mysql is converted to my, ys, sq, and ql, and a document matches if it contains any of these tokens. The following query searches for 简单和有趣 in natural language mode:
SELECT
    *
FROM
    posts
WHERE
    MATCH (title, body)
    AGAINST ('简单和有趣' IN NATURAL LANGUAGE MODE);
Boolean mode
In BOOLEAN MODE searches, the search term is converted to an ngram phrase search, so a document matches only if it contains the ngram tokens in sequence. For example:
SELECT
    *
FROM
    posts
WHERE
    MATCH (title, body)
    AGAINST ('简单和有趣' IN BOOLEAN MODE);
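To make the difference between the two modes concrete, the following Python sketch models them on two tiny documents, assuming a token size of 2. It is a simplified model, not MySQL's actual code: it ignores relevance ranking, stopwords, and Boolean operators, and the helper names (ngrams, matches_natural, matches_boolean) are hypothetical.

```python
def ngrams(text: str, n: int = 2) -> list[str]:
    """Return every contiguous n-character slice of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def matches_natural(doc: str, term: str, n: int = 2) -> bool:
    """Union semantics: the document matches if it shares any ngram with the term."""
    return bool(set(ngrams(doc, n)) & set(ngrams(term, n)))


def matches_boolean(doc: str, term: str, n: int = 2) -> bool:
    """Phrase semantics: the term's ngrams must appear consecutively in the document."""
    doc_tokens = ngrams(doc, n)
    term_tokens = ngrams(term, n)
    k = len(term_tokens)
    if k == 0:
        return False
    return any(doc_tokens[i:i + k] == term_tokens
               for i in range(len(doc_tokens) - k + 1))


# Search for 'abc' in one document containing 'ab' and another containing 'abc'.
for doc in ["ab", "abc"]:
    print(doc, matches_natural(doc, "abc"), matches_boolean(doc, "abc"))
```

In this model, the natural language search matches both documents because each contains the token ab, while the Boolean search matches only abc because the document ab has no bc token to complete the phrase.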
ngram wildcard search
The ngram FULLTEXT index contains only ngrams and does not record where terms begin. As a result, wildcard searches may return unexpected results.
The following rules apply to wildcard searches that use ngram FULLTEXT indexes:
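If the prefix term in the wildcard is shorter than the ngram token size, the query returns all documents that contain ngram tokens starting with the prefix term. For example, with the default token size of 2, the following query returns all documents that contain a token starting with m. Note that the wildcard operator (*) is recognized only IN BOOLEAN MODE:
SELECT
    id,
    title,
    body
FROM
    posts
WHERE
    MATCH (title, body)
    AGAINST ('m*' IN BOOLEAN MODE);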
If the prefix term in the wildcard is longer than the ngram token size, MySQL converts the prefix term into an ngram phrase and ignores the wildcard operator. For example:
SELECT
    id,
    title,
    body
FROM
    posts
WHERE
    MATCH (title, body)
    AGAINST ('mysqld*' IN BOOLEAN MODE);
In this example, the term "mysqld" is converted into the ngram phrase "my ys sq ql ld" and the wildcard is ignored. Therefore, the query returns only documents that contain these tokens in sequence, that is, documents that contain mysqld.
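The two wildcard rules can be sketched in Python as follows, assuming a token size of 2 and lowercase input. This is an illustrative model rather than MySQL's implementation; wildcard_matches and ngrams are hypothetical helpers, and case folding, ranking, and stopwords are left out.

```python
def ngrams(text: str, n: int = 2) -> list[str]:
    """Return every contiguous n-character slice of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def wildcard_matches(doc: str, prefix: str, n: int = 2) -> bool:
    """Model a 'prefix*' search against the ngram tokens of doc."""
    doc_tokens = ngrams(doc, n)
    if len(prefix) < n:
        # Rule 1: match any indexed token that starts with the prefix.
        return any(tok.startswith(prefix) for tok in doc_tokens)
    # Rule 2: drop the wildcard and run a phrase search on the prefix's ngrams.
    # When len(prefix) == n, the phrase is a single token, so in this model
    # both rules would give the same answer.
    phrase = ngrams(prefix, n)
    k = len(phrase)
    return any(doc_tokens[i:i + k] == phrase
               for i in range(len(doc_tokens) - k + 1))


print(wildcard_matches("mysql", "m"))           # 'my' starts with 'm'
print(wildcard_matches("sumo", "m"))            # 'mo' starts with 'm'
print(wildcard_matches("mysql", "mysqld"))      # no 'ld' token follows 'ql'
print(wildcard_matches("mysqldump", "mysqld"))  # full phrase is present
```

The sumo case shows why wildcard results can be surprising: the token mo begins with m even though the word itself does not, because the index has no notion of where a term starts.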
Handling stopwords
The ngram parser excludes tokens that contain a stopword from the stopword list. This differs from the built-in parser, which excludes only words that are equal to a stopword.
Suppose ngram_token_size is 2 and the document contains "abc". The ngram parser tokenizes the document into "ab" and "bc".
If "b" is a stopword, the ngram parser excludes both "ab" and "bc" because they contain "b".
Please note that the default stopword list contains English stopwords, so for other languages you must define your own stopword list. Additionally, any stopwords longer than ngram_token_size are ignored.
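The stopword rule described above can be modeled with a short Python sketch. It assumes a token size of 2 and uses hypothetical helper names (ngrams, index_tokens); it illustrates the "contains a stopword" rule and the length cutoff only, and is not how MySQL stores or loads its stopword list.

```python
def ngrams(text: str, n: int = 2) -> list[str]:
    """Return every contiguous n-character slice of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def index_tokens(text: str, stopwords: set[str], n: int = 2) -> list[str]:
    """Return the ngrams of text that would survive stopword filtering."""
    # Stopwords longer than the token size are ignored.
    usable = [s for s in stopwords if len(s) <= n]
    # A token is dropped if it contains any usable stopword.
    return [tok for tok in ngrams(text, n)
            if not any(s in tok for s in usable)]


print(index_tokens("abc", {"b"}))    # both 'ab' and 'bc' contain 'b'
print(index_tokens("abcd", {"d"}))   # only 'cd' contains 'd'
print(index_tokens("abc", {"abc"}))  # stopword is longer than n, so it is ignored
```

A single-character stopword can therefore remove up to two bigrams around each occurrence, which is worth keeping in mind when building a custom stopword list.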
In this tutorial, you have learned how to use the MySQL ngram full-text parser to handle full-text searches for ideographic languages.