Dealing with Unicode strings, done right and better.

TL;DR

I created a library called "unicode-segmenter" to handle Unicode grapheme clusters with good performance and reasonable bundle size. Check it out!

But it is not just about promoting my library, it is also about the process of creating a high-quality library, optimization, and testing.

Intro

Have you ever heard of the "grapheme cluster"? If you're using non-English languages or supporting emoji in your service, you've probably heard of it.

The lesson from Unicode is that strings almost always take more space internally than the number of characters displayed.

If we used UCS-4 (or UTF-32), it would be enough to represent every character humanity has invented (so far), and counting characters would be easy. But that would waste most of the space!

Therefore, most programming languages process characters in smaller encodings such as UTF-8 or UTF-16. JavaScript uses UTF-16, which represents characters like "👋" as surrogate pairs (two code units). So in JavaScript, its .length property is 2.
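
You can see the mismatch directly in any JavaScript console:

const wave = '👋';
console.log(wave.length);      // 2 (UTF-16 code units)
console.log([...wave].length); // 1 (the spread operator iterates by code point)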

Some characters can be even longer by using special characters called "joiners".

  • "🇰🇷" displays 1, but its .length is 4
  • "👩‍👩‍👦‍👦" displays 1, but its .length is 11
  • "अनुच्छेद" displays 4, but its .length is 8
  • Some demonic characters displays 6, but its .length is 75! 🤯

A unit internally expressed as multiple characters but logically treated as one character is called a "Grapheme Cluster."

Unicode® standardizes how to handle these grapheme clusters in UAX #29, the "Unicode Text Segmentation" rules.

The problem

Handling graphemes may be important when creating custom input components, text editors, or code analyzers that count characters.

The Web platform provides the Intl.Segmenter API, which implements the Unicode Segmentation standard. It's available not only in Web browsers like Chrome and Safari, but also in JS runtimes like Node.js, Deno, and Bun.
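
For example, counting graphemes with Intl.Segmenter is straightforward:

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const family = '👩‍👩‍👦‍👦';

console.log(family.length);                         // 11 UTF-16 code units
console.log([...segmenter.segment(family)].length); // 1 grapheme cluster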

However, it's so new that I couldn't choose it for my company's app, Karrot.

One day, a colleague was analyzing their production bundle size and reported an issue with our design system library, as it seemed unexpectedly large.

[Figure: Bundle size of graphemer. Stat size: 459.49 KB, parsed size: 89.08 KB]

Actually, the culprit was the "graphemer" library inside: it accounted for more than 95% of the problematic input component's size, a large enough portion of the overall app to be visually noticeable.

Today, graphemer is the most popular way to handle grapheme clusters in JavaScript (20M weekly downloads on npm!!).

But graphemer is an old library that is no longer actively maintained. That means it can diverge from the latest Unicode version (mostly affecting Hindi text), and it has no ESM support.

Custom Unicode implementations are never trivial in bundle size because they must include Unicode data. Also, graphemer generates tons of if-else statements for performance reasons. (Spoiler: it's not performant.)

Well, isn't there a better alternative?

Can WebAssembly (and Rust) save us?

Probably YES!

I already knew there was a high-quality Rust library, unicode-segmentation, and that Rust has a great WebAssembly toolchain called wasm-bindgen.

I created a simple binding using wasm-bindgen.

use unicode_segmentation::UnicodeSegmentation;
use wasm_bindgen::prelude::*;

// Count the extended grapheme clusters in the input string
#[wasm_bindgen]
pub fn count(val: &str) -> usize {
    val.graphemes(true).count()
}

// Collect each grapheme cluster as an owned string
#[wasm_bindgen]
pub fn collect(val: &str) -> Vec<String> {
    val.graphemes(true).map(|s| s.to_string()).collect()
}

The above code generates a WASM binary with a JavaScript binding. It was surprisingly easy.

After a few manual optimizations, I got a 52KB WASM binary. Combined with the bindings, it totals around 55KB. WASM binaries also compress very well, coming to just 23KB in total under gzip!

This is already smaller than the parsed size of the graphemer library, and since WASM supports streaming compilation, it appeared to be a better option to me.

However, the feedback from my colleagues wasn't positive: the size was still too large, and adopting WebAssembly was not yet feasible for us.

RIIJS (Rewrite It In JavaScript)

Rust people have a cultural idiom called RIIR (Rewrite It In Rust).

Although WebAssembly is known for its performance, its size can sometimes be unacceptable in mobile apps, or the performance might not be as good as expected due to binding overhead.

A high-quality JavaScript library is still valuable. So I decided to do the opposite: Rewrite It in JavaScript.

First, I reviewed the Rust library’s license (it's dual-licensed under MIT and Apache 2.0, I chose MIT).

I started by modifying its code generator, written in Python, to produce JavaScript instead of Rust. Since I'm not familiar with Python, this was the most time-consuming task.

By drawing on my knowledge of both Rust and JavaScript, I ported the implementation of the Iterator trait to JavaScript's iteration protocol.
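
As an illustration of that mapping (a simplified sketch, not the library's actual code; nextBoundary stands in for the UAX #29 boundary search):

// Rust: impl Iterator for Graphemes<'_>, where next() returns Option<&str>
// JavaScript: the same shape, expressed via the iteration protocol
function graphemes(input) {
  let cursor = 0;
  return {
    [Symbol.iterator]() {
      return this;
    },
    next() {
      // Option::None maps to { done: true }
      if (cursor >= input.length) return { done: true, value: undefined };
      const end = nextBoundary(input, cursor); // hypothetical boundary search
      const value = input.slice(cursor, end);
      cursor = end;
      // Option::Some(&str) maps to { done: false, value }
      return { done: false, value };
    },
  };
}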

Since JavaScript doesn't have pattern matching like Rust, it could be hard to replicate the same logic. I used the ReScript compiler to preserve the original logic as much as possible, which let me port it confidently. Specifically, the check_pair function can be converted this way.

After some test-fix loops, I got the first working version, and the result was way better than I expected: 3x smaller in size and 2.5x faster than graphemer 😲

(It was the initial result and even better now!)

Reducing bundle size

The library's major win comes from choosing the simplest structure for storing the Unicode data table, which wins on both size and performance. Beyond that, various techniques and efforts went into achieving an even smaller size.

Using JavaScript generator

JavaScript has syntax for stateful logic known as "generators". Generators are highly expressive, making the implementation 40% smaller (from 1,682 bytes down to 1,011 bytes) than an equivalent class with the same functionality!

You can see both the class version and the generator version in the old commits.
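
For a sense of why, here is a simplified comparison (an illustrative counter, not the actual segmenter):

// class version: state lives in fields, and the protocol is spelled out
class Counter {
  constructor(limit) {
    this.limit = limit;
    this.i = 0;
  }
  [Symbol.iterator]() {
    return this;
  }
  next() {
    return this.i < this.limit
      ? { done: false, value: this.i++ }
      : { done: true, value: undefined };
  }
}

// generator version: the same behavior, with state kept in locals
function* counter(limit) {
  for (let i = 0; i < limit; i++) yield i;
}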

Considering compression

JavaScript code is mostly transmitted compressed with either gzip or brotli (zstd soon).

The minified graphemer library is 89.08KB, but after gzip compression it shrinks to 13.15KB. This is because the library's size is significantly inflated by a bunch of if-else statements, and compression algorithms excel at such repetitive content.

unicode-segmenter aims to be small even before compression, but it also carefully considers the size after compression.

For instance, when dealing with character data like "\u{1F680}", a minifier like Terser tries to unescape it to "🚀". Unescaping can reduce code size, since escape sequences require additional characters. However, for data tables with a significant number of escaped characters, gzip compression is highly effective on the escaped form, making it better not to unescape them. The results are summarized in this pull request.
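
The two forms look like this (the size numbers above come from the linked PR; this snippet only shows the shapes):

// unescaped: fewer bytes before compression
const rocket = '🚀';

// escaped: more bytes before compression, but large tables of such
// escapes are repetitive ASCII, which gzip compresses very well
const rocketEscaped = '\u{1F680}';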

Considering tree-shaking

JavaScript production apps use bundlers. These tools can exclude unused modules from the final bundle or completely remove unused code through static analysis.

One goal of unicode-segmenter is to provide a complete polyfill for Intl.Segmenter. However, when only grapheme segmentation is needed, its well-considered modular structure ensures that unnecessary parts are not included.

The exposed APIs have independent entries based on their respective topics (see the usage sketch after the list).

  • unicode-segmenter/emoji (1KB): Provides emoji matchers
  • unicode-segmenter/general (5.9KB): Provides alphabet/number matchers
  • unicode-segmenter/grapheme (9.6KB): Provides text segmentation by extended grapheme cluster
  • unicode-segmenter/intl-polyfill (9.9KB): Provides an Intl.Segmenter polyfill (currently grapheme only)
  • unicode-segmenter/utils (300B): Provides utilities for UTF-16 and Unicode code points
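
For example, a minimal grapheme-only usage looks like this (assuming the graphemeSegments API shape shown in the project's README):

// only unicode-segmenter/grapheme ends up in the bundle
import { graphemeSegments } from 'unicode-segmenter/grapheme';

for (const { segment } of graphemeSegments('👋 hello')) {
  console.log(segment);
}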

Optimizing performance

Thanks to the excellent unicode-rs implementation I referenced, the first version was already 2.5x faster than graphemer. It's mostly a hand-written binary search over a compressed Unicode data table. It's simple but very efficient.

However, for even better results, it is important to fully consider the characteristics of the language.

Leverage the compiler

As mentioned above, the ReScript compiler was very helpful in the initial porting process.

I ended up rewriting it by hand for the best performance, but until then the compiler let me focus on other areas first, since it always guaranteed correct and sufficiently efficient output.

Using the optimal data type

JavaScript engines also have a compilation process called JIT (just-in-time compilation). One well-known fact is that they are very efficient when dealing with 32-bit integers (aka SMIs; small integers).

Unicode data is a list of ranges of Unicode code points, which fit in 32-bit integers. Therefore, I was able to make all internal operations SMI-based.

Unlike Rust, where the compiler checks data types, in JavaScript, this responsibility falls entirely on the developer. I managed to do it with careful attention.
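
A simplified sketch of the idea (a hypothetical flat table, not the library's actual layout): ranges are stored as plain integers, and every operation stays within SMI territory.

// hypothetical flat table of inclusive ranges: [start0, end0, start1, end1, ...]
const ranges = [0x0000, 0x001f, 0x007f, 0x009f];

function inTable(cp) {
  let lo = 0;
  let hi = (ranges.length >> 1) - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1; // integer midpoint, stays an SMI
    const start = ranges[mid << 1];
    const end = ranges[(mid << 1) + 1];
    if (cp < start) hi = mid - 1;
    else if (cp > end) lo = mid + 1;
    else return true;
  }
  return false;
}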

Then, performance more than doubled in the second release.

Explicit bindings

When using generators or regular functions instead of classes, it is natural to implicitly capture external variables through "closures". However, implicit bindings can negatively impact optimization and garbage collection.

To avoid this, simply hoisting functions resulted in an additional 10% improvement. (See the change)
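
Conceptually, the change looks like this (an illustrative before/after, not the exact code from that commit):

// before: the helper is re-created per call and captures `table` implicitly
function makeLookup(table) {
  return function lookup(cp) {
    return table.includes(cp);
  };
}

// after: hoisted to module scope; state is passed explicitly,
// so nothing is captured through a closure
function lookup(table, cp) {
  return table.includes(cp);
}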

Avoiding copying and constructing structs

When dealing with more than one piece of data, it's tempting to handle it as a structure.

However, JavaScript lacks efficient records/tuples, and both Object and Array always come with a cost. While this isn’t an issue for most applications, it can be critical in the micro-benchmarks of libraries.

It is also very important to avoid implicit copying. If necessary, pass references directly and perform the copy explicitly.

// retains the `cache` reference across calls
function cat(cp, cache) {
  // reuse the cached range while `cp` still falls inside it
  if (cp < cache[0] || cp > cache[1]) {
    let result = searchGraphemeCategory(cp);
    // update its values by explicit copying
    cache[0] = result[0];
    cache[1] = result[1];
    cache[2] = result[2];
  }
  return cache[2];
}

Continuous benchmarking and efforts!

What helped most in getting faster was the benchmark tool set up from the very beginning, and the continuous measurements.

I use mitata (previously tinybench). If you care about performance, I highly recommend doing the same.
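
A minimal mitata setup looks like this (a sketch, not the project's actual benchmark suite):

import { bench, run } from 'mitata';

const sample = '👩‍👩‍👦‍👦 abc अनुच्छेद '.repeat(100);

bench('Intl.Segmenter grapheme count', () => {
  const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
  let count = 0;
  for (const _ of segmenter.segment(sample)) count++;
  return count;
});

await run();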

However, poorly designed benchmarks can be as dangerous as having none. They might be completely invalidated by the compiler, too small to reflect real-world scenarios, or fail to identify scalability issues.

To avoid these pitfalls, it is crucial to design benchmarks to be as close to the real world as possible. unicode-segmenter gets help from ChatGPT to construct tests that resemble real code!

Benchmarks can yield very different results depending on the environment. Therefore, I run it not only in my local environment but also across various versions of Node.js and Bun, on various OS and CPU architectures, and I also configure them to run in browsers to verify performance in both Chrome and Safari. Through this process, I confirmed that the performance of Intl.Segmenter has been significantly improving in recent versions of Node.js, Chrome, and Safari. This is very exciting news!

But efforts are still worth it since not everyone is using the latest version or ideal environment :)

Testing strategy

Testing is crucial. Since Unicode inevitably involves characters I might not be familiar with, more tests were necessary.

I set up initial test suites with Node.js test runner (It's great!) and manually wrote a few known cases, but it was never enough.

Property-based testing

I tried a more precise method called "Property-based testing".

fast-check is the most popular PBT tool in the JavaScript ecosystem. With fast-check, I can easily write test code that defines properties and verifies them against automatically generated inputs.

Then, how is the verification done? The idea is to check whether the results match those of Intl.Segmenter, which is already available in the Node.js runtime!
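
Here is roughly what such a property looks like with fast-check (a sketch, assuming the graphemeSegments API from unicode-segmenter/grapheme):

import fc from 'fast-check';
import assert from 'node:assert';
import { graphemeSegments } from 'unicode-segmenter/grapheme';

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

// property: for any Unicode string, the library and Intl.Segmenter agree
fc.assert(
  fc.property(fc.fullUnicodeString(), (input) => {
    const expected = [...segmenter.segment(input)].map((s) => s.segment);
    const actual = [...graphemeSegments(input)].map((s) => s.segment);
    assert.deepStrictEqual(actual, expected);
  }),
);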

Through PBT, I found and fixed numerous bugs in the initial implementation. The first bug input that PBT identified was "", an empty string. 😅 This is a surprisingly common mistake.

I completely switched my primary testing method to PBT and set up additional tests to debug the counterexamples found.

Generating test suites

Unicode provides official test data. I used the data to generate test suites and check for spec compliance.
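
For grapheme breaks, the official data (GraphemeBreakTest.txt) marks permitted breaks with ÷ and forbidden breaks with ×. A sketch of one way to turn those lines into test cases:

// a test line looks like: ÷ 0020 × 0308 ÷ 0020 ÷ # comment
function parseTestLine(line) {
  const body = line.split('#')[0].trim();
  if (!body) return null;
  // split into expected grapheme clusters at each ÷ boundary
  const clusters = body
    .split('÷')
    .map((part) => part.trim())
    .filter(Boolean)
    .map((part) =>
      part
        .split('×')
        .map((hex) => String.fromCodePoint(parseInt(hex.trim(), 16)))
        .join(''),
    );
  return { input: clusters.join(''), expected: clusters };
}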

During this process, I discovered that the Rust library I referenced had not yet implemented the "GB9c" rule, so I implemented it myself. (It seems they have implemented it recently.)

Production testing

I understand that despite all these efforts, it might still be insufficient. Reality can sometimes be more extreme than theory.

I tried creating migration PRs directly for major dependents of graphemer. I found a few issues related to the implementation, and many others related to TypeScript configuration, sourcemap setup, etc.

By supporting both large and small-scale projects, I was able to update the library and gain confidence in its final quality.

Conclusion

As a result, my app achieved a smaller bundle size and better performance.

[Figure: Bundle size of unicode-segmenter/grapheme. Stat size: 42.83 KB, parsed size: 29.5 KB, gzipped size: 9.15 KB]

I also learned that what it does is not that simple. This library should be a temporary alternative to Intl.Segmenter.

I recommend using Intl.Segmenter where possible. The unicode-segmenter library might become another graphemer after 10 years. Who knows!

But if you are in a special environment where Intl.Segmenter is not available, try unicode-segmenter. It provides good performance in a reasonable size.

Also, there is more coming shortly:

  • It will soon provide a full Intl.Segmenter polyfill, including word/sentence support. (issue #25)
  • It will support more advanced patterns that Intl.Segmenter doesn't, like backward iteration. (issue #26)

Are you struggling with the quality of the library in another area? Consider taking some time to investigate. We still have many opportunities for further improvement in the JavaScript ecosystem.


This content originally appeared on DEV Community and was authored by Hyeseong Kim

