Advanced Data Structures and Algorithms: Building Your First Suffix Tree

Efficient pattern matching in JavaScript

Photo by aiokr chen on Unsplash

When it comes to pattern matching inside a string, sometimes doing string.indexOf is not enough.

The method works, but if you’re dealing with a big-enough string (say, a full chapter inside a book, or why not, the entire book), it might take a while to get a valid response, and if you have to do it often, then your algorithm will not be efficient.

Instead, in this article I want to show you an alternative method, one that lets you do efficient pattern matching, without having to worry about the size of the string you’re dealing with.

Let’s take a look at Suffix trees.

What are suffix trees?

Let’s start at the beginning, what are suffix trees?

They are, in fact, fancy tries. If you don’t know what tries are, check out this other article where I show you how to build a search engine using them.

You start by building a trie with all the suffixes of the string, and then you start collapsing certain nodes (more on this in a second) to make a smaller, more compact tree.

Once it’s ready, the main benefit of using a suffix tree, is that you can find any suffix inside an arbitrarily large string in O(m) time, where “m” is the size of the suffix you’re looking for, not the size of the string where you’re looking for it.

Where most of the search algorithms will depend on the size of the string, this one depends on the size of the suffix (or pattern). That’s a huge performance boost.

Let’s take a look at how we could implement one.

If you want to make these standalone functions implementing your trie data structures shareable and reusable, use Bit to track, manage, publish, and automatically version them so other developers can import, install, extend, and use them in their own projects, with zero copying and pasting code involved. You can learn more here, and here.

Implementing a suffix tree

While the theory dictates that we can build a trie, and then go from there, there are easier ways to think about it.

In my case, nailing this implementation took me a while, so I’ll show you how I went about it.

The first thing to do is start with an example. I’ll go with one that you can find everywhere, the word “mississippi”.

Other than being a fun word to say out loud, it also provides some interesting examples.

Let’s first take a look at what the final suffix tree for this string looks like:

(root)-->|---mississippi
|
|---i-->|---ssi-->|---ssippi
| | |
| | |---ppi
| |
| |---ppi
|
|---s-->|---si-->|---ssippi
| | |
| | |---ppi
| |
| |---i-->|---ssippi
| |
| |---ppi
|
|---p-->|---pi
|
|---i

That’s what we have to aim for.

The first thing you do is calculate every single suffix for the word, and then sort them alphabetically.

In the case of the word “mississippi”, you’d get the following list:

i
ippi
issippi
ississippi
mississippi
pi
ppi
sippi
sissippi
ssippi
ssissippi

Now, why did I ask you to sort them? Because that way you can easily visualize that many of these suffixes have common prefixes.

And if you look at the final tree from above, you’ll notice that we’re splitting our suffixes using the biggest common prefix we can find.

So let’s continue. We’ll pick the first suffix, “i”, and we’ll start our algorithm:

for the current prefix
find the largest prefix that matches the most suffixes
if the result is just one prefix (the current one)
then we're dealing with a leaf node (we can't break it up into smaller chunks
if the result provides 2 or more prefixes
then use the largest prefix found as a root node on the tree
under this new node, put all the prefixes you found
make sure you remove the largest prefix from the words you're putting as its children

In case the idea is still not clear, let’s first look at the letter “i”. That’s our current prefix at this point.

So we’ll look at how many other suffixes start with the letter “i” and we’ll strip the starting “i” from those words, and we’ll put them under a new node called “i”.

So far, our tree looks like this:

(root) --> i |--> ppi
|--> ssippi
|--> ssissippi

Good, but not good enough, we can see that “ssippi” and “ssissippi” have a common prefix. We need to keep going down this tree branch and do the same processing:

  • We’ll find the largest common prefix for “ppi”, which is itself.
  • And since only itself starts with those letters, we’ll know “ppi” is a leaf node.
  • We’ll then move to “ssippi” and we’ll find that “ssi” is the largest prefix that helps match both remaining words.
  • We’ll create the node “ssi” under “i”, and we’ll put “ppi” and “ssippi” under this new branch.

If you keep going, you’ll figure out that both, “ppi” and “ssippi” under the “ssi” node are leaf nodes.

And at this point, we should go all the way back to the “i”, since we’ve finished processing this branch, and pick the next suffix: “mississippi”.

Notice that throughout this process, we also remove the used suffixes from our original list. This is key, since we don’t want to process these words anymore.

Now for this word, we can see that there are no common prefixes with any other word, which means it’s a leaf node right at the root level.

Fantastic! Moving on.

If you do the same with “pi” and then with “sippi”, you’ll start seeing how the final tree is formed.

In the end, you’ll end up with a tree like the one I already showed:

(root)-->|---mississippi
|
|---i-->|---ssi-->|---ssippi
| | |
| | |---ppi
| |
| |---ppi
|
|---s-->|---si-->|---ssippi
| | |
| | |---ppi
| |
| |---i-->|---ssippi
| |
| |---ppi
|
|---p-->|---pi
|
|---i

So, let’s see how we can do this, but with code this time.

Writing the code

We’ll start by calculating the list of suffixes for our given word, and then we’ll sort them alphabetically.

Doing that in JavaScript is easy, here is the code:

https://medium.com/media/498cb470076b6ab78dd3dfb4b91ce342/href

I’ve also set up 2 other variables, the tree which is where I’m going to be saving the structure (it’s going to be a JSON essentially), and the root which is going to indicate the current node on the tree.

The getAllSortedSuffixes function simply slices the word from the 0th position up to the last character.

With our list ready, I’m going to call the function that takes care of creating the tree:

https://medium.com/media/03245144a6d1fd03d5a281031eea9267/href

This is the main function, and here is what it does:

  1. Find all the prefixes for the list of suffixes we created initially. We’ll be updating this list often. This calculates the biggest prefix that matches the most suffixes. We then take the list of matched suffixes and make a note of them (we’ll use them in a second).
  2. Since we’ve matched multiple suffixes from the suffixes list, we now remove them (after all, we’ll be dealing with them right now).
  3. We get the root, which is the first word returned by thefindPrefixes function.
  4. And we put the rest of our prefixes, inside a list of “pending” words to be processed for this particular root.
  5. As long as there are “pending” words, we do the following:
  6. We find the largest prefix that matches the most pending words.
  7. If we only find 1 single prefix, that means we’re dealing with a leaf element, so we mark it with a “*” and move on.
  8. If there are more, we grab the first one and call it the new root, and the rest are added as leaf nodes for now.
  9. We then update the list of pending words by removing the ones we grabbed in step 6.

This process will take care of the first “branch” of words. Essentially, for our “mississippi” example, the “i” and everything inside it.

But we still have many words inside our suffixes list.

So we’ll do the following:

https://medium.com/media/b0a2a58d7fb099dcce8a1784b7625c71/href

As createTree keeps updating the suffixes list, once it has run through all words, this loop will end, and we’ll have our complete tree.

If you want to print it out, you’ll get the following JSON:

{
"i": {
"pending": [],
"ppi": "*",
"ssi": {
"ppi": "*",
"ssippi": "*"
}
},
"mississippi": {
"pending": []
},
"p": {
"pending": [],
"i": "*",
"pi": "*"
},
"s": {
"pending": [],
"i": {
"ppi": "*",
"ssippi": "*"
},
"si": {
"ppi": "*",
"ssippi": "*"
}
}
}

You can ignore the pending properties, they’re not used at this point.

Let’s now look at how I find the largest prefix that matches the most suffixes. I’m not gonna line, I had to try several alternatives before getting it right:

https://medium.com/media/528dfe17bed5d3f3c8e912c7d35103c8/href

The process is:

  1. Grab the first word (line 2) and make it our sample .
  2. We’ll start going from the full word and down to the first letter testing all potential prefixes.
  3. For every prefix, we’ll calculate how many words in our list — of suffixes — start with this prefix.
  4. We’ll also track the list of matched suffixes that has the most elements inside and the prefix that generated that list (those are the lastCommonSet and lastValidPrefix ). This way if we generate first only 1 match, then 0, and then 10, we’ll make sure to only keep the list of 10, and the prefix that generated it.
  5. We’ll then return an object with 2 attributes: the group which contains the prefix and all the words that started with it and the usedWords which is a list of the words that matched the prefix, so we can remove them from the suffixes list.

With that, you have seen the major functions around the creation of the tree. But creating the tree is only step 1, we wanted to build this because it’s supposed to be very efficient when it comes to searching inside the string, right?

So let’s look at how we can implement the search feature.

Did you like what you read? Consider subscribing to my FREE newsletter where I share my 2 decades’ worth of wisdom in the IT industry with everyone. Join “The Rambling of an old developer” !

Searching inside a suffix tree

Searching is actually easy when you look at the structure that we’re dealing with (our suffix tree).

Imagine we’re searching for the pattern “ippi” inside the suffix tree we created above.

We have to find the root node that our pattern starts with. In other words, at the first level of our tree, we have the following root nodes:

  • i
  • mississippi
  • p
  • s

We know that our pattern starts with an “i”, so we’ll make that our current root, and we’ll also remove it from the pattern.

We now do the same thing, and we look at the root nodes inside “i”, which are:

  • ppi
  • ssi

I can see that “ppi” is already the rest of my pattern. So we’re done!

Let’s take a look at how we would go about implementing that function:

https://medium.com/media/d21f75802895032e44eb5f39be33e216/href

The function finds the first root node in the current level that our pattern (the word parameter) starts with. It then removes the prefix from the word, and moves the current root of the tree to that node.

It does that until there is no root to be found.

In the end, if we’ve managed to find all prefixes for our pattern, it means we’re left with an empty string.

Otherwise, there is no full match inside the tree.

And as you can see, the do…while loop will run as long as there are chars in word , which is our pattern. So the running time depends on the length of the pattern being searched, and not on the length of the string where the search is being performed.

Fantastic!

Suffix trees are a bit hard to wrap your head around at first, but once you do, the results are great.

Coding the search, once you have the structure ready is really easy, and the speed is wonderful.

Have you ever used a suffix tree for a real-world use case? What was it?

Build apps with reusable components like Lego

Bit’s open-source tool help 250,000+ devs to build apps with components.

Turn any UI, feature, or page into a reusable component — and share it across your applications. It’s easier to collaborate and build faster.

Learn more

Split apps into components to make app development easier, and enjoy the best experience for the workflows you want:

Micro-Frontends

Design System

Code-Sharing and reuse

Monorepo

Learn more


Advanced Data Structures and Algorithms: Building Your First Suffix Tree was originally published in Bits and Pieces on Medium, where people are continuing the conversation by highlighting and responding to this story.


This content originally appeared on Bits and Pieces - Medium and was authored by Fernando Doglio

Efficient pattern matching in JavaScript

Photo by aiokr chen on Unsplash

When it comes to pattern matching inside a string, sometimes doing string.indexOf is not enough.

The method works, but if you’re dealing with a big-enough string (say, a full chapter inside a book, or why not, the entire book), it might take a while to get a valid response, and if you have to do it often, then your algorithm will not be efficient.

Instead, in this article I want to show you an alternative method, one that lets you do efficient pattern matching, without having to worry about the size of the string you’re dealing with.

Let’s take a look at Suffix trees.

What are suffix trees?

Let’s start at the beginning, what are suffix trees?

They are, in fact, fancy tries. If you don’t know what tries are, check out this other article where I show you how to build a search engine using them.

You start by building a trie with all the suffixes of the string, and then you start collapsing certain nodes (more on this in a second) to make a smaller, more compact tree.

Once it’s ready, the main benefit of using a suffix tree, is that you can find any suffix inside an arbitrarily large string in O(m) time, where “m” is the size of the suffix you’re looking for, not the size of the string where you’re looking for it.

Where most of the search algorithms will depend on the size of the string, this one depends on the size of the suffix (or pattern). That’s a huge performance boost.

Let’s take a look at how we could implement one.

If you want to make these standalone functions implementing your trie data structures shareable and reusable, use Bit to track, manage, publish, and automatically version them so other developers can import, install, extend, and use them in their own projects, with zero copying and pasting code involved. You can learn more here, and here.

Implementing a suffix tree

While the theory dictates that we can build a trie, and then go from there, there are easier ways to think about it.

In my case, nailing this implementation took me a while, so I’ll show you how I went about it.

The first thing to do is start with an example. I’ll go with one that you can find everywhere, the word “mississippi”.

Other than being a fun word to say out loud, it also provides some interesting examples.

Let’s first take a look at what the final suffix tree for this string looks like:

(root)-->|---mississippi
|
|---i-->|---ssi-->|---ssippi
| | |
| | |---ppi
| |
| |---ppi
|
|---s-->|---si-->|---ssippi
| | |
| | |---ppi
| |
| |---i-->|---ssippi
| |
| |---ppi
|
|---p-->|---pi
|
|---i

That’s what we have to aim for.

The first thing you do is calculate every single suffix for the word, and then sort them alphabetically.

In the case of the word “mississippi”, you’d get the following list:

i
ippi
issippi
ississippi
mississippi
pi
ppi
sippi
sissippi
ssippi
ssissippi

Now, why did I ask you to sort them? Because that way you can easily visualize that many of these suffixes have common prefixes.

And if you look at the final tree from above, you’ll notice that we’re splitting our suffixes using the biggest common prefix we can find.

So let’s continue. We’ll pick the first suffix, “i”, and we’ll start our algorithm:

for the current prefix
find the largest prefix that matches the most suffixes
if the result is just one prefix (the current one)
then we're dealing with a leaf node (we can't break it up into smaller chunks
if the result provides 2 or more prefixes
then use the largest prefix found as a root node on the tree
under this new node, put all the prefixes you found
make sure you remove the largest prefix from the words you're putting as its children

In case the idea is still not clear, let’s first look at the letter “i”. That’s our current prefix at this point.

So we’ll look at how many other suffixes start with the letter “i” and we’ll strip the starting “i” from those words, and we’ll put them under a new node called “i”.

So far, our tree looks like this:

(root) --> i |--> ppi
|--> ssippi
|--> ssissippi

Good, but not good enough, we can see that “ssippi” and “ssissippi” have a common prefix. We need to keep going down this tree branch and do the same processing:

  • We’ll find the largest common prefix for “ppi”, which is itself.
  • And since only itself starts with those letters, we’ll know “ppi” is a leaf node.
  • We’ll then move to “ssippi” and we’ll find that “ssi” is the largest prefix that helps match both remaining words.
  • We’ll create the node “ssi” under “i”, and we’ll put “ppi” and “ssippi” under this new branch.

If you keep going, you’ll figure out that both, “ppi” and “ssippi” under the “ssi” node are leaf nodes.

And at this point, we should go all the way back to the “i”, since we’ve finished processing this branch, and pick the next suffix: “mississippi”.

Notice that throughout this process, we also remove the used suffixes from our original list. This is key, since we don’t want to process these words anymore.

Now for this word, we can see that there are no common prefixes with any other word, which means it’s a leaf node right at the root level.

Fantastic! Moving on.

If you do the same with “pi” and then with “sippi”, you’ll start seeing how the final tree is formed.

In the end, you’ll end up with a tree like the one I already showed:

(root)-->|---mississippi
|
|---i-->|---ssi-->|---ssippi
| | |
| | |---ppi
| |
| |---ppi
|
|---s-->|---si-->|---ssippi
| | |
| | |---ppi
| |
| |---i-->|---ssippi
| |
| |---ppi
|
|---p-->|---pi
|
|---i

So, let’s see how we can do this, but with code this time.

Writing the code

We’ll start by calculating the list of suffixes for our given word, and then we’ll sort them alphabetically.

Doing that in JavaScript is easy, here is the code:

I’ve also set up 2 other variables, the tree which is where I’m going to be saving the structure (it’s going to be a JSON essentially), and the root which is going to indicate the current node on the tree.

The getAllSortedSuffixes function simply slices the word from the 0th position up to the last character.

With our list ready, I’m going to call the function that takes care of creating the tree:

This is the main function, and here is what it does:

  1. Find all the prefixes for the list of suffixes we created initially. We’ll be updating this list often. This calculates the biggest prefix that matches the most suffixes. We then take the list of matched suffixes and make a note of them (we’ll use them in a second).
  2. Since we’ve matched multiple suffixes from the suffixes list, we now remove them (after all, we’ll be dealing with them right now).
  3. We get the root, which is the first word returned by thefindPrefixes function.
  4. And we put the rest of our prefixes, inside a list of “pending” words to be processed for this particular root.
  5. As long as there are “pending” words, we do the following:
  6. We find the largest prefix that matches the most pending words.
  7. If we only find 1 single prefix, that means we’re dealing with a leaf element, so we mark it with a “*” and move on.
  8. If there are more, we grab the first one and call it the new root, and the rest are added as leaf nodes for now.
  9. We then update the list of pending words by removing the ones we grabbed in step 6.

This process will take care of the first “branch” of words. Essentially, for our “mississippi” example, the “i” and everything inside it.

But we still have many words inside our suffixes list.

So we’ll do the following:

As createTree keeps updating the suffixes list, once it has run through all words, this loop will end, and we’ll have our complete tree.

If you want to print it out, you’ll get the following JSON:

{
"i": {
"pending": [],
"ppi": "*",
"ssi": {
"ppi": "*",
"ssippi": "*"
}
},
"mississippi": {
"pending": []
},
"p": {
"pending": [],
"i": "*",
"pi": "*"
},
"s": {
"pending": [],
"i": {
"ppi": "*",
"ssippi": "*"
},
"si": {
"ppi": "*",
"ssippi": "*"
}
}
}

You can ignore the pending properties, they’re not used at this point.

Let’s now look at how I find the largest prefix that matches the most suffixes. I’m not gonna line, I had to try several alternatives before getting it right:

The process is:

  1. Grab the first word (line 2) and make it our sample .
  2. We’ll start going from the full word and down to the first letter testing all potential prefixes.
  3. For every prefix, we’ll calculate how many words in our list — of suffixes — start with this prefix.
  4. We’ll also track the list of matched suffixes that has the most elements inside and the prefix that generated that list (those are the lastCommonSet and lastValidPrefix ). This way if we generate first only 1 match, then 0, and then 10, we’ll make sure to only keep the list of 10, and the prefix that generated it.
  5. We’ll then return an object with 2 attributes: the group which contains the prefix and all the words that started with it and the usedWords which is a list of the words that matched the prefix, so we can remove them from the suffixes list.

With that, you have seen the major functions around the creation of the tree. But creating the tree is only step 1, we wanted to build this because it’s supposed to be very efficient when it comes to searching inside the string, right?

So let’s look at how we can implement the search feature.

Did you like what you read? Consider subscribing to my FREE newsletter where I share my 2 decades’ worth of wisdom in the IT industry with everyone. Join “The Rambling of an old developer” !

Searching inside a suffix tree

Searching is actually easy when you look at the structure that we’re dealing with (our suffix tree).

Imagine we’re searching for the pattern “ippi” inside the suffix tree we created above.

We have to find the root node that our pattern starts with. In other words, at the first level of our tree, we have the following root nodes:

  • i
  • mississippi
  • p
  • s

We know that our pattern starts with an “i”, so we’ll make that our current root, and we’ll also remove it from the pattern.

We now do the same thing, and we look at the root nodes inside “i”, which are:

  • ppi
  • ssi

I can see that “ppi” is already the rest of my pattern. So we’re done!

Let’s take a look at how we would go about implementing that function:

The function finds the first root node in the current level that our pattern (the word parameter) starts with. It then removes the prefix from the word, and moves the current root of the tree to that node.

It does that until there is no root to be found.

In the end, if we’ve managed to find all prefixes for our pattern, it means we’re left with an empty string.

Otherwise, there is no full match inside the tree.

And as you can see, the do...while loop will run as long as there are chars in word , which is our pattern. So the running time depends on the length of the pattern being searched, and not on the length of the string where the search is being performed.

Fantastic!

Suffix trees are a bit hard to wrap your head around at first, but once you do, the results are great.

Coding the search, once you have the structure ready is really easy, and the speed is wonderful.

Have you ever used a suffix tree for a real-world use case? What was it?

Build apps with reusable components like Lego

Bit’s open-source tool help 250,000+ devs to build apps with components.

Turn any UI, feature, or page into a reusable component — and share it across your applications. It’s easier to collaborate and build faster.

Learn more

Split apps into components to make app development easier, and enjoy the best experience for the workflows you want:

Micro-Frontends

Design System

Code-Sharing and reuse

Monorepo

Learn more


Advanced Data Structures and Algorithms: Building Your First Suffix Tree was originally published in Bits and Pieces on Medium, where people are continuing the conversation by highlighting and responding to this story.


This content originally appeared on Bits and Pieces - Medium and was authored by Fernando Doglio


Print Share Comment Cite Upload Translate Updates
APA

Fernando Doglio | Sciencx (2023-02-22T06:12:34+00:00) Advanced Data Structures and Algorithms: Building Your First Suffix Tree. Retrieved from https://www.scien.cx/2023/02/22/advanced-data-structures-and-algorithms-building-your-first-suffix-tree/

MLA
" » Advanced Data Structures and Algorithms: Building Your First Suffix Tree." Fernando Doglio | Sciencx - Wednesday February 22, 2023, https://www.scien.cx/2023/02/22/advanced-data-structures-and-algorithms-building-your-first-suffix-tree/
HARVARD
Fernando Doglio | Sciencx Wednesday February 22, 2023 » Advanced Data Structures and Algorithms: Building Your First Suffix Tree., viewed ,<https://www.scien.cx/2023/02/22/advanced-data-structures-and-algorithms-building-your-first-suffix-tree/>
VANCOUVER
Fernando Doglio | Sciencx - » Advanced Data Structures and Algorithms: Building Your First Suffix Tree. [Internet]. [Accessed ]. Available from: https://www.scien.cx/2023/02/22/advanced-data-structures-and-algorithms-building-your-first-suffix-tree/
CHICAGO
" » Advanced Data Structures and Algorithms: Building Your First Suffix Tree." Fernando Doglio | Sciencx - Accessed . https://www.scien.cx/2023/02/22/advanced-data-structures-and-algorithms-building-your-first-suffix-tree/
IEEE
" » Advanced Data Structures and Algorithms: Building Your First Suffix Tree." Fernando Doglio | Sciencx [Online]. Available: https://www.scien.cx/2023/02/22/advanced-data-structures-and-algorithms-building-your-first-suffix-tree/. [Accessed: ]
rf:citation
» Advanced Data Structures and Algorithms: Building Your First Suffix Tree | Fernando Doglio | Sciencx | https://www.scien.cx/2023/02/22/advanced-data-structures-and-algorithms-building-your-first-suffix-tree/ |

Please log in to upload a file.




There are no updates yet.
Click the Upload button above to add an update.

You must be logged in to translate posts. Please log in or register.