
Locale-sensitive text segmentation in JavaScript with Intl.Segmenter

Text segmentation is a way to divide text into units like characters, words, and sentences.
Let’s say you have the following Japanese text and you’d like to perform a word count:

吾輩は猫である。名前はたぬき。

If you’re unfamiliar with Japanese, you might try built-in string methods in your first attempt.
For English strings, a rough way to count the words is to split by space characters:

const str = "How many words. Are there?";
const words = str.split(" ");
console.log(words);
// ["How","many","words.","Are","there?"]
console.log(words.length);
// 5

The punctuation is mixed in with the words, so the result is imprecise, but it’s a reasonable approximation for English.
The Japanese string, however, has no spaces separating its words.
Your next idea might be to reach for str.length to count the characters.
The string length is 15, and if you remove the two full stops (。), you might guess 13 words.

In fact, the string contains 8 words once the punctuation is excluded: '吾輩' 'は' '猫' 'で' 'ある' '名前' 'は' 'たぬき'.
If you rely on string methods for a word count, you’ll quickly run into trouble: you can’t reliably split on a specific character, and you can’t use spaces as separators as you can in English.
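
To make the failure concrete, here is a small sketch that applies the same string methods to the Japanese text. The variable names (jaString, bySpace, withoutStops) are only illustrative.

```javascript
const jaString = "吾輩は猫である。名前はたぬき。";

// There are no spaces, so splitting on " " returns the whole string as one item.
const bySpace = jaString.split(" ");
console.log(bySpace.length);
// 1

// length counts UTF-16 code units, including the two full stops.
console.log(jaString.length);
// 15

// Stripping the full stops still counts characters, not words.
const withoutStops = jaString.replace(/。/g, "");
console.log(withoutStops.length);
// 13
```

None of these numbers matches the 8 words a Japanese reader would count.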

This is what locale-sensitive segmentation is built for.
The format for creating a segmenter in the Intl namespace is as follows:

new Intl.Segmenter(locales, options);
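
As a rough sketch of how the two arguments behave, the snippet below checks that the API exists, creates a segmenter with no arguments, and reads back the resolved granularity. The variable names are illustrative, and the feature check is an assumption about how you might guard older runtimes.

```javascript
// Guard for runtimes that do not ship Intl.Segmenter.
if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
  throw new Error("Intl.Segmenter is not available in this runtime");
}

// Both arguments are optional; the granularity defaults to "grapheme".
const defaultSegmenter = new Intl.Segmenter();
const defaultGranularity = defaultSegmenter.resolvedOptions().granularity;
console.log(defaultGranularity);
// "grapheme"

// granularity accepts "grapheme", "word", or "sentence".
const sentenceSegmenter = new Intl.Segmenter("en", { granularity: "sentence" });
const sentenceGranularity = sentenceSegmenter.resolvedOptions().granularity;
console.log(sentenceGranularity);
// "sentence"

// Any other granularity value is rejected with a RangeError.
let granularityError = null;
try {
  new Intl.Segmenter("en", { granularity: "paragraph" });
} catch (error) {
  granularityError = error;
}
console.log(granularityError instanceof RangeError);
// true
```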

Let’s create a segmenter with the ja-JP locale for Japanese, explicitly set its granularity to word, and pass the string to it:

const jaSegmenter = new Intl.Segmenter("ja-JP", { granularity: "word" });
const segments = jaSegmenter.segment("吾輩は猫である。名前はたぬき。");

console.log(Array.from(segments));

This example logs the following array to the console:

[
  {
    "segment": "吾輩",
    "index": 0,
    "input": "吾輩は猫である。名前はたぬき。",
    "isWordLike": true
  },
  {
    "segment": "は",
    "index": 2,
    "input": "吾輩は猫である。名前はたぬき。",
    "isWordLike": true
  },
  {
    "segment": "猫",
    "index": 3,
    "input": "吾輩は猫である。名前はたぬき。",
    "isWordLike": true
  },
  // etc.
]

For each item in the array, we get the segment, its index in the original string, the full input string, and a Boolean isWordLike that distinguishes words from punctuation and other non-word segments.
Now we have a robust, structured, and locale-aware way to interact with the words.
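
You don’t have to convert the result to an array first. As a minimal sketch, the object returned by segment() can be iterated directly with for...of, and its containing() method looks up the segment that covers a given index. The variable names here are illustrative, and the exact segment boundaries you see depend on the segmentation data in your runtime.

```javascript
const input = "吾輩は猫である。名前はたぬき。";
const wordSegmenter = new Intl.Segmenter("ja-JP", { granularity: "word" });
const iterableSegments = wordSegmenter.segment(input);

// Iterate lazily and print each segment with its position.
for (const { segment, index, isWordLike } of iterableSegments) {
  console.log(index, segment, isWordLike);
}

// Look up the segment that covers a given index in the input string.
const firstSegment = iterableSegments.containing(0);
console.log(firstSegment.segment, firstSegment.index);

// The segments cover the input without gaps, so joining them restores it.
const rebuilt = Array.from(iterableSegments, (item) => item.segment).join("");
console.log(rebuilt === input);
// true
```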
The segmenter’s granularity is word in this example, so we can filter the items on their isWordLike value to ignore punctuation:

const jaString = "吾輩は猫である。名前はたぬき。";

const jaSegmenter = new Intl.Segmenter("ja-JP", { granularity: "word" });
const segments = jaSegmenter.segment(jaString);

const words = Array.from(segments)
  .filter((item) => item.isWordLike)
  .map((item) => item.segment);

console.log(words);
// ["吾輩","は","猫","で","ある","名前","は","たぬき"]
console.log(words.length);
// 8

This looks much better.
Using the segmenter, we now have an array of Japanese words, ready for adding a locale-aware word count to our application.
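
One way to package the filtering step is a small helper. The sketch below uses a hypothetical function named countWords, which is not part of the Intl API, and assumes the caller passes a locale that matches the text.

```javascript
// Hypothetical helper (not part of the Intl API):
// counts the word-like segments of `text` for the given locale.
function countWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  let count = 0;
  for (const { isWordLike } of segmenter.segment(text)) {
    if (isWordLike) {
      count += 1;
    }
  }
  return count;
}

console.log(countWords("How many words. Are there?", "en"));
// 5
console.log(countWords("吾輩は猫である。名前はたぬき。", "ja-JP"));
```

Unlike the split(" ") approach, the English count here ignores punctuation and whitespace, and the same function works for the Japanese string.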
We’ll explore that use case a bit more with a small example in the following sections.
Before that, we’ll take a look at the rest of the options that you can pass into a segmenter.
