How to split JavaScript strings into sentences, words or graphemes with "Intl.Segmenter" (#tilPost)

I’ve been reading Axel Rauschmayer’s post on the new regular expression flag /v, which explains a way to split emoji strings into graphemes using Intl.Segmenter.
I haven’t used this Intl object before. Let’s find out what it’s about…


This content originally appeared on Stefan Judis Web Development and was authored by Stefan Judis

Say you want to split user input into sentences. It looks like a quick split() task, but there's a lot of nuance in this problem.

Here's a naive approach:

'Hello! How are you?'.split(/[.!?]/);
// ['Hello', ' How are you', '']

Using split(), you lose the separators themselves, the whitespace that followed them stays glued to the start of the next entry, and you end up with a trailing empty string. And because it relies on hardcoded delimiters, it's not language-sensitive.

I don't speak Japanese, but how would you try to split the following string into words or sentences?

// I am a cat. My name is Tanuki.
'吾輩は猫である。名前はたぬき。'

Common string methods won't be helpful here, but the Intl JavaScript API is always good for a surprise!

Intl.Segmenter to the rescue

According to MDN, Intl.Segmenter allows you to split strings into meaningful parts:

The Intl.Segmenter object enables locale-sensitive text segmentation, enabling you to get meaningful items (graphemes, words or sentences) from a string.

Define a locale and granularity (sentence, word or grapheme) and throw any string at it to split strings into segments.

const segmenterDe = new Intl.Segmenter('de', { 
  granularity: 'word'
});
const segmentsDe = segmenterDe.segment('Was geht ab, Freunde?');
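Coming back to the Japanese string from earlier, here is a minimal sketch of how the same pattern could be applied to it. The variable names are my own, and I'm assuming a runtime with Intl.Segmenter and full ICU data. The word-level result depends on the engine's dictionary, so I'm not showing an expected output for it.

```javascript
// The Japanese example string from earlier in the post.
const text = '吾輩は猫である。名前はたぬき。';

// Sentence granularity: the segmenter recognizes the ideographic full stop.
const sentenceSegmenter = new Intl.Segmenter('ja', { granularity: 'sentence' });
const sentences = Array.from(sentenceSegmenter.segment(text), (s) => s.segment);
console.log(sentences);
// [ '吾輩は猫である。', '名前はたぬき。' ]

// Word granularity: there are no spaces to rely on, so the engine falls back
// to its own language data. The exact word boundaries may differ by runtime.
const wordSegmenter = new Intl.Segmenter('ja', { granularity: 'word' });
const words = Array.from(wordSegmenter.segment(text), (s) => s.segment);
console.log(words);
```

Whatever the boundaries turn out to be, joining the segments gives back the original string because nothing is dropped.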

Heads-up: Firefox doesn't support Intl.Segmenter at the time of writing. On the server side, it's been supported since Node.js 16.

MDN Compat Data (source)
Browser support info for Intl.Segmenter
chrome chrome_android edge firefox firefox_android safari safari_ios samsunginternet_android webview_android
87 87 87 No 14.1 14.1 14.0 87
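Given the support gaps, you may want to feature-detect before relying on the API. The following is only a sketch: splitSentences is a hypothetical helper of mine, and the regex fallback is a crude assumption that isn't locale-aware.

```javascript
// Hypothetical helper: prefers Intl.Segmenter and falls back to a simple
// regular expression when the API isn't available.
function splitSentences(text, locale = 'en') {
  if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
    const segmenter = new Intl.Segmenter(locale, { granularity: 'sentence' });
    return Array.from(segmenter.segment(text), (s) => s.segment);
  }

  // Fallback (an assumption, not locale-aware): a run of non-terminator
  // characters, followed by optional terminators and trailing whitespace.
  return text.match(/[^.!?]+[.!?]*\s*/g) ?? [];
}

console.log(splitSentences('Hello! How are you?'));
// [ 'Hello! ', 'How are you?' ]
```

The fallback only knows about ., ! and ?, so it will still fail for strings like the Japanese example.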

Play around with a tl;dr demo below. 🫵

[Interactive component: visit the article to see it...]

But let's look at some Intl.Segmenter details.

Segmenter.segment returns an iterable

You might have noticed the Array.from call in the demo above. Segmenter.segment doesn't return an array but an iterable. To access all segments, use array spreading, Array.from or a for-of loop.

const segmenterDe = new Intl.Segmenter('de', {
  granularity: 'sentence'
});
const segmentsDe = segmenterDe.segment('Was geht ab?');

// ----
// Access the segments via array spreading
console.log([...segmentsDe]);
// [
//   { segment: 'Was geht ab?', index: 0, input: 'Was geht ab?' }
// ] 

// ----
// Access the segments via Array.from
console.log(Array.from(segmentsDe));
// [
//   { segment: 'Was geht ab?', index: 0, input: 'Was geht ab?' }
// ] 

// ----
// Access the segments via for...of
for (const segment of segmentsDe) {
  console.log(segment);
}
// { segment: 'Was geht ab?', index: 0, input: 'Was geht ab?' }

Each segment object includes the original string (input), the position at which the segment starts in that string (index, counted in UTF-16 code units) and the actual segment string (segment).
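To make the index property more tangible, here's a small sketch with variable names of my own. It slices the input at each reported index and also uses containing(), a method of the object returned by segment() that looks up the segment covering a given position.

```javascript
const input = 'Was geht ab?';
const segments = new Intl.Segmenter('de', { granularity: 'word' }).segment(input);

for (const { segment, index } of segments) {
  // `index` is where the segment starts in the input (in UTF-16 code units),
  // so slicing the input at that position yields the segment again.
  const sliced = input.slice(index, index + segment.length);
  console.log(index, JSON.stringify(segment), sliced === segment);
}

// containing() returns the segment that covers a given code unit index.
// Index 5 is the "e" in "geht".
const hit = segments.containing(5);
console.log(hit.segment, hit.index);
// geht 4
```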

To map the segments to their string values, the demo uses the lesser-known second argument of Array.from, which provides built-in mapping.

const segmenterDe = new Intl.Segmenter('de', {
  granularity: 'sentence'
});
const segmentsDe = segmenterDe.segment('Was geht ab?');

console.log(Array.from(segmentsDe, s => s.segment));
// [ 'Was geht ab?' ]

Word granularity comes with an extra isWordLike property

If you split a string into words, the resulting segments also include the spaces and line breaks between the words. Filter them out using the isWordLike property.

const segmenterDe = new Intl.Segmenter('de', {
  granularity: 'word'
});
const segmentsDe = segmenterDe.segment('Was geht ab?');

console.log([...segmentsDe]);
// [
//   { segment: 'Was', index: 0, input: 'Was geht ab?', isWordLike: true },
//   { segment: ' ', index: 3, input: 'Was geht ab?', isWordLike: false },
//   ...
// ]

console.log([...segmentsDe].filter(s => s.isWordLike));
// [
//   { segment: 'Was', index: 0, input: 'Was geht ab?', isWordLike: true },
//   { segment: 'geht', index: 4, input: 'Was geht ab?', isWordLike: true },
//   { segment: 'ab', index: 9, input: 'Was geht ab?', isWordLike: true }
// ]

Note that filtering by isWordLike removes punctuation such as ., -, or ?.
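One thing this makes easy is a word count that ignores whitespace and punctuation. A minimal sketch, where countWords is a hypothetical helper and not part of the Intl API:

```javascript
// Hypothetical helper: counts only the segments flagged as word-like.
function countWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  let count = 0;

  for (const { isWordLike } of segmenter.segment(text)) {
    if (isWordLike) count += 1;
  }

  return count;
}

console.log(countWords('Was geht ab, Freunde?', 'de'));
// 4
```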

Use Intl.Segmenter to split emojis

And lastly, here's Axel's example that led me down this rabbit hole. I won't get into Unicode specifics, but if you want to split a string into visual emojis, Intl.Segmenter is a great help, too.

const emojis = '🫣🫵👨‍👨‍👦‍👦';

// ----
// Split by code units
console.log(emojis.split(''));
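// [
//   '\uD83E', '\uDEE3', '\uD83E', '\uDEF5',
//   '\uD83D', '\uDC68', '\u200D', '\uD83D', '\uDC68', '\u200D',
//   '\uD83D', '\uDC66', '\u200D', '\uD83D', '\uDC66'
// ]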

// ----
// Split by code points
console.log([...emojis]);
// ['🫣', '🫵', '👨', '‍', '👨', '‍', '👦', '‍', '👦']

// ----
// Split by graphemes
const segmenter = new Intl.Segmenter('en', {
  granularity: 'grapheme'
});
const segments = segmenter.segment(emojis);

console.log(Array.from(segments, s => s.segment));
// ['🫣', '🫵', '👨‍👨‍👦‍👦']

Note that graphemes also include spaces and "normal" characters.
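Two things grapheme granularity is handy for are counting what a user perceives as characters and truncating a string without cutting an emoji in half. The sketch below uses two hypothetical helpers, countGraphemes and truncateGraphemes. The emoji string is the same one as above, written with escapes so that the zero-width joiners are visible.

```javascript
const graphemeSegmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

// Hypothetical helper: number of user-perceived characters.
function countGraphemes(text) {
  return [...graphemeSegmenter.segment(text)].length;
}

// Hypothetical helper: keep at most `max` graphemes, never splitting one.
function truncateGraphemes(text, max) {
  return Array.from(graphemeSegmenter.segment(text), (s) => s.segment)
    .slice(0, max)
    .join('');
}

// Two single emojis followed by a family emoji (four emojis glued with U+200D).
const emojis =
  '\u{1FAE3}\u{1FAF5}\u{1F468}\u200D\u{1F468}\u200D\u{1F466}\u200D\u{1F466}';

console.log(emojis.length); // 15 (UTF-16 code units)
console.log(countGraphemes(emojis)); // 3
console.log(truncateGraphemes(emojis, 2) === '\u{1FAE3}\u{1FAF5}'); // true

// Spaces and "normal" characters are graphemes, too.
console.log(Array.from(graphemeSegmenter.segment('a 🫵'), (s) => s.segment));
// [ 'a', ' ', '🫵' ]
```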

Conclusion

I continue to be amazed by the Intl feature set. There's always new functionality to discover. Intl.Segmenter enables fairly easy string splitting that considers locales and keeps the delimiters. 🎉

It's yet another Intl API to make language-dependent string handling easier! I wonder what I'll discover next!


Stefan Judis | Sciencx (2022-11-26T23:00:00+00:00) How to split JavaScript strings into sentences, words or graphemes with "Intl.Segmenter" (#tilPost). Retrieved from https://www.scien.cx/2022/11/26/how-to-split-javascript-strings-into-sentences-words-or-graphemes-with-intl-segmenter-tilpost/
