MySQL ngram Full-Text Parser

Summary: in this tutorial, you will learn how to use the MySQL ngram full-text parser to support full-text searches for ideographic languages such as Chinese, Japanese, and Korean.

Introduction to MySQL ngram full-text parser

The built-in MySQL full-text parser determines the start and end of words using white space as delimiters.

However, ideographic languages such as Chinese, Japanese, and Korean do not use word delimiters, so the built-in parser cannot determine where words begin and end.

To address this issue, MySQL provides the ngram full-text parser.

Starting from version 5.7.6, MySQL includes the ngram full-text parser as a built-in server plugin, which means that MySQL loads this plugin automatically when the MySQL database server starts.

MySQL supports ngram full-text parser for both InnoDB and MyISAM storage engines.

By definition, an ngram is a contiguous sequence of n characters from a text sequence. The primary function of the ngram full-text parser is to tokenize a text sequence into contiguous n-character sequences.

The following illustrates how the ngram full-text parser tokenizes the text mysql for different values of n:

n = 1: 'm', 'y', 's', 'q', 'l'
n = 2: 'my', 'ys', 'sq', 'ql'
n = 3: 'mys', 'ysq', 'sql'
n = 4: 'mysq', 'ysql'
n = 5: 'mysql'
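
To make the sliding window concrete, the following is a minimal sketch that emulates the tokenization in plain SQL. It is an illustration only and not how the parser is implemented. It assumes MySQL 8.0 or later (for recursive common table expressions), and the user variables @txt and @n are names chosen for this example:

```sql
-- Illustration only: emulate ngram tokenization with a recursive CTE.
-- @txt and @n are example names, not MySQL settings.
SET @txt = 'mysql';
SET @n = 2;

WITH RECURSIVE positions (pos) AS (
    -- The first token starts at character 1.
    SELECT 1
    UNION ALL
    -- Slide the window one character at a time until the last full token.
    SELECT pos + 1
    FROM positions
    WHERE pos + 1 <= CHAR_LENGTH(@txt) - @n + 1
)
SELECT
    pos,
    SUBSTRING(@txt, pos, @n) AS token
FROM
    positions
ORDER BY
    pos;
```

With @n set to 2, the query returns the four tokens my, ys, sq, and ql. Changing @n reproduces the other rows of the list above.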

Creating FULLTEXT indexes with ngram parser

To create a FULLTEXT index that uses the ngram full-text parser, add the WITH PARSER ngram clause to the CREATE TABLE, ALTER TABLE, or CREATE INDEX statement. Consider the following example:

First, create a new database called test and switch to it:

CREATE DATABASE test;
USE test;

Second, create a new table called posts with a full-text index that includes the title and body columns:

CREATE TABLE posts (
    id INT PRIMARY KEY AUTO_INCREMENT,
    title VARCHAR(255),
    body TEXT,
    FULLTEXT ( title , body ) WITH PARSER NGRAM
);

Third, change the character set to utf8mb4 using the SET NAMES statement:

SET NAMES utf8mb4;

Fourth, insert two new rows into the posts table:

INSERT INTO posts(title,body)
VALUES('MySQL全文搜索','MySQL提供了具有许多好的功能的内置全文搜索'),
      ('MySQL教程','学习MySQL快速，简单和有趣');

Finally, inspect how the ngram parser tokenized the text using the following statements:

SET 
  GLOBAL innodb_ft_aux_table = "test/posts";

SELECT 
  * 
FROM 
  information_schema.innodb_ft_index_cache 
ORDER BY 
  doc_id, 
  position;

This query is valuable for troubleshooting purposes. For instance, if a word is not included in the search results, it may not be indexed due to being a stopword or for some other reason.
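
The walkthrough above defines the index inside CREATE TABLE. For a table that already exists, the same WITH PARSER ngram clause can be attached through ALTER TABLE or CREATE INDEX. The following is a minimal sketch. The table articles and the index names ft_title_body and ft_title are hypothetical names used only for this example:

```sql
-- Hypothetical table, created without any full-text index.
CREATE TABLE articles (
    id INT PRIMARY KEY AUTO_INCREMENT,
    title VARCHAR(255),
    body TEXT
) ENGINE = InnoDB CHARACTER SET utf8mb4;

-- Option 1: add the ngram full-text index with ALTER TABLE.
ALTER TABLE articles
    ADD FULLTEXT INDEX ft_title_body (title, body) WITH PARSER ngram;

-- Option 2: add an ngram full-text index with CREATE INDEX.
-- A different column list is used here so the two indexes are not duplicates.
CREATE FULLTEXT INDEX ft_title ON articles (title) WITH PARSER ngram;
```

The column list in MATCH() must correspond to the column list of one of the full-text indexes, so a query against this sketch would use either MATCH (title, body) or MATCH (title).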

Setting ngram token size

As seen in the previous example, the default token size (n) for ngram is 2. To modify the token size, you can use the ngram_token_size configuration option, which accepts values between 1 and 10.

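Note that a smaller token size produces a smaller full-text search index and faster searches. Typically, you set ngram_token_size to the length of the largest token that you want to search for.

Because ngram_token_size is a read-only variable, you can set its value only at server startup, in one of two ways: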

First, set the ngram_token_size when the database server starts:

mysqld --ngram_token_size=1

Second, set the ngram_token_size in the configuration file:

[mysqld]
ngram_token_size=1
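
After restarting the server, it can be useful to confirm which token size is in effect. A short sketch using standard system variable syntax:

```sql
-- Both statements report the token size the running server was started with.
SELECT @@ngram_token_size;

SHOW VARIABLES LIKE 'ngram_token_size';
```

Because the variable is read-only, an attempt to change it at runtime with SET GLOBAL is rejected by the server.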

ngram parser phrase search

MySQL converts a phrase search into an ngram phrase search. For example, with a token size of 2, the phrase "abc" is converted into "ab bc", which returns documents that contain "abc" or "ab bc".

The following example shows you how to search for 搜索 in the posts table. With the default token size of 2, this two-character term corresponds to a single ngram token:

SELECT 
    id, title, body
FROM
    posts
WHERE
    MATCH (title , body) AGAINST ('搜索' );
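
A phrase can also be requested explicitly by enclosing it in double quotation marks inside a boolean mode search, which is standard MySQL full-text syntax. The following sketch assumes the posts table and the default token size of 2 from the earlier steps. The search phrase 全文搜索 is chosen only for illustration:

```sql
-- Explicit phrase search: the double quotes inside the string mark a phrase.
-- With a token size of 2, the phrase is split into the bigrams 全文, 文搜, and 搜索.
SELECT
    id, title, body
FROM
    posts
WHERE
    MATCH (title, body)
    AGAINST ('"全文搜索"' IN BOOLEAN MODE);
```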

Processing search results with ngram

Natural language mode

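In NATURAL LANGUAGE MODE searches, the search term is converted to a union of ngram terms. For example, with a token size of 2 (bigram), the search term mysql is converted to my, ys, sq, and ql, and a document matches if it contains any of these tokens. The following query searches for 简单和有趣 in natural language mode: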

SELECT 
    *
FROM
    posts
WHERE
    MATCH (title , body)  
    AGAINST ('简单和有趣' IN NATURAL LANGUAGE MODE);

Boolean mode

In BOOLEAN MODE searches, the search term is converted to an ngram phrase search, so a document matches only if it contains the tokens as one contiguous sequence. For example:

SELECT 
    *
FROM
    posts
WHERE
    MATCH (title , body) 
    AGAINST ('简单和有趣' IN BOOLEAN MODE);
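
The difference between the two modes is easier to see when both scores are selected side by side. The sketch below assumes the posts table and the default token size of 2. The search term 搜索有趣 is a made-up string, chosen because its first bigram appears only in the first row and its last bigram only in the second row:

```sql
-- Compare the relevance scores of the same term in both modes.
-- A score of 0 means the row does not match in that mode.
SELECT
    id,
    title,
    MATCH (title, body)
        AGAINST ('搜索有趣' IN NATURAL LANGUAGE MODE) AS natural_score,
    MATCH (title, body)
        AGAINST ('搜索有趣' IN BOOLEAN MODE) AS boolean_score
FROM
    posts;
```

If the two modes behave as described above, natural_score should be greater than zero for both rows, because each row contains at least one of the bigrams. boolean_score should be zero for both rows, because neither row contains the whole sequence 搜索有趣.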

ngram wildcard search

The ngram FULLTEXT index contains only n-grams, and thus, it doesn’t identify the start of terms. As a result, when you conduct wildcard searches, you may get unexpected results.

The following rules are applied to wildcard search using ngram FULLTEXT search indexes:

If the prefix term in the wildcard is shorter than the ngram token size, the query returns all documents that contain ngram tokens starting with the prefix term. For example, with the default token size of 2, the following query returns the documents that contain a token starting with m. Note that the wildcard operator is recognized only in boolean mode:

SELECT 
    id, 
    title, 
    body
FROM
    posts
WHERE
    MATCH (title , body) 
    AGAINST ('m*' IN BOOLEAN MODE);

If the prefix term in the wildcard is longer than the ngram token size, MySQL converts the prefix term into an ngram phrase and ignores the wildcard operator. For example:

SELECT 
    id, 
    title, 
    body
FROM
    posts
WHERE
    MATCH (title , body) 
    AGAINST ('mysqld*' IN BOOLEAN MODE);

In this example, the term "mysqld" is converted into the ngram phrase "my ys sq ql ld" and the wildcard operator is ignored. Therefore, the query returns only documents that contain the whole sequence mysqld.

Handling stopwords

The ngram parser excludes tokens that contain a stopword from the stopword list.

Suppose the ngram_token_size is 2 and the document contains "abc". The ngram parser will tokenize the document to "ab" and "bc".

If "b" is a stopword, ngram will exclude both "ab" and "bc" because they contain "b".

Please note that the default stopword list contains English stopwords only, so for any other language you must define your own stopword list. Additionally, any stopwords with lengths greater than ngram_token_size are ignored.
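
As a rough sketch of how a custom stopword list can be defined for InnoDB, the statements below create a stopword table and register it through the innodb_ft_server_stopword_table system variable. The table names my_stopwords and notes, and the choice of 的 and 了 as stopwords, are assumptions made only for this example. Setting a global variable also requires sufficient privileges:

```sql
-- Inspect the built-in stopword list that InnoDB uses by default.
SELECT * FROM information_schema.INNODB_FT_DEFAULT_STOPWORD;

-- Hypothetical custom list: an InnoDB table with a single VARCHAR column named value.
CREATE TABLE test.my_stopwords (
    value VARCHAR(30)
) ENGINE = InnoDB CHARACTER SET utf8mb4;

INSERT INTO test.my_stopwords (value)
VALUES ('的'), ('了');

-- Register the custom list, using the db_name/table_name format.
SET GLOBAL innodb_ft_server_stopword_table = 'test/my_stopwords';

-- Hypothetical table whose ngram index is built after the list is registered,
-- so tokens that contain 的 or 了 are left out of its index.
CREATE TABLE test.notes (
    id INT PRIMARY KEY AUTO_INCREMENT,
    body TEXT,
    FULLTEXT (body) WITH PARSER ngram
) ENGINE = InnoDB CHARACTER SET utf8mb4;
```

A stopword list is applied when a full-text index is built, which is why this sketch creates a new table instead of reusing the existing posts index.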

In this tutorial, you have learned how to use the MySQL ngram full-text parser to handle full-text searches for ideographic languages.
