Trie, Suffix Tree, Suffix Array

In this Codemonk article, I am going to write about 3 very useful string data structures. The topic is very broad, and since all these data structures are very well studied and explained in external resources, the article will have a different format that previous ones. Rather than describing implementation details, I am going to provide a well structured overview about the topic, link related detailed descriptions to each structure, and give you some intuitions when to use which structure and what are advantages and tradeoffs while using each of them. From nowΣwill denote the alphabet we working with, for example latin lowercase letters, and|Σ|will be the size ofΣ.TrieTrie is probably the most basic and intuitive tree based data structure designed to use with strings.LetSbe a set ofkstrings, in other wordsS = {s1, s2, …, sk}. In addition, letPbe a pattern we want to match with any of strings inS. The question is how to build a very basic tree based data structure, which allows us to decide if givenPmatches any string inS. How to model such a data structure? Well, we can model the setSas a rooted treeTin such a way, that each path from the root ofTto any of its nodes, corresponds to a prefix of at least one string ofS. The best way to present this construction, is to consider an example set. LetS = {abc, abd, ca, ab}. Letεcorresponds to an empty string. Then a trie forSlooks like this:Notice that each edge of an internal nodeuto its childvis marked by a letter from our alphabet denoting the extension of string represented byuto a string represented byv. Moreover, each node corresponding to a string fromSis marked orange. One observation is that if somesiis a prefix of another string fromS, then a node corresponding tosiis an internal node ofT, otherwise it is a leaf. Sometimes, it is useful to have all nodes corresponding to strings fromSas leaves, and it is very common to append to each strings fromSa character which is guaranteed not to be inΣ, in our case, we will denote$as this special character. Then such a modified trie from our example will look like this:Notice that since now there is no string inSwhich is a prefix another string fromS, all nodes inTcorresponding to strings fromSare leaves.Building a trieTis very simple. At the beginning, you start with just one node representing the empty stringε. If you want to insert a stringstoT, you start at the root ofTand consider the first charactercofs. If there is an edge marked withcfrom the current node to any of its children, you consume the charactercand get down to this child. If at some point there is no such an edge and child, you have to create them, consumecand continue the process until wholesis consumed.Searching a stringPinTis even simpler, you just have to iterate through the characters ofPand follow corresponding edges to these characters inTstarting from the root. If at some point there is no transition to a children, or if you consume all the letter ofP, but the node in which you end the process does not correspond to any string fromS, thenPdoes not belong toSeither. Otherwise, you end the process in a node corresponding toP, so we know thatPbelongs toS.One other basic application of a trie is to decide ifPis a substring ofs, in other words, the pattern matching problem. How we can solve it using trie? Well, letSbe the set of all suffixes ofs. Now it is sufficient to build a trieTfromSand searchPinT. Sounds good, the search operation is linear in terms of length ofP. However, buildingTis linear in terms of the sum of lengths of strings inS, but in our case, it is quadratic in terms of length ofs, so the whole algorithm is far from optimal.It is also worth to mention that there are stringsssuch that a trie for all suffixes ofshas quadratic number of nodes in terms of the length ofs. Can you find any suchs? It is a good exercise to find one in order to get familiar with what can a trie solve efficiently and what it cannot.Here are some trie related articles:TopCoder tutorial with related problemsGeeksforGeeks article with sample implementationSolving problems related to bitwise operations with triesSuffix treeA suffix treeTis a natural improvement over trie used in pattern matching problem, the one defined over a set of substrings of a strings. The idea is very simple here. Such a trie can have a long paths without branches. If we only can reduce these long paths into one jump, we will reduce the size of the trie significantly, so this is a great first step in improving the complexity of operations on such a tree. This reduced trie defined over a subset of suffixes of a stringsis called a suffix tree ofsFor better understanding, let’s consider the suffix treeTfor a strings = abakan. A word abakan has 6 suffixes{abakan , bakan, akan, kan, an, n}and its suffix tree looks like this:There is a famousalgorithm by Ukkonenfor building suffix tree forsin linear time in terms of the length ofs. However, because it may look quite complicated at first sight, many people are discouraged to learn how it works. Fortunately, there is a great, I mean an excellent,description of Ukkonen’s algorithm given on StackOverflow. Please refer to it for better understanding what a suffix tree is and how to build it in linear time.Suffix trees can solve many complicated problems, because it contain so many information about the string itself. Fo example, in order to know how many times a patternPoccurs ins, it is sufficient to findPinTand return the size of a subtree corresponding to its node. Another well known application is finding the number of distinct substrings ofs, and it can be solved easily with suffix tree, while the problem looks very complicated at first sight.The post I linked from StackOverflow is so great, that you simple must read it. After that, you will be able to identify problems solvable with suffix trees easily.If you want to know more about when to use a suffix tree, you should read this paper about theapplications of suffix trees.Suffix ArraySuffix array is a very nice array based structure. Basically, it is a lexicographically sorted array of suffixes of a strings. For example, let’s consider a strings = abakan. A word abakan has 6 suffixes{abakan , bakan, akan, kan, an, n}and its suffix tree looks like this:Of course, in order to reduce space, we do not store the exact suffixes. It is sufficient to store their indices.Suffix arrays, especially combined withLCP table(which stands for Longest Common Prefix of neighboring suffixes table), are very very useful for solving many problems. I recommend reading this niceprogramming oriented paper about suffix arrays, their applications and related problemsby Stanford University.Suffix arrays can be build easily inO(n * log2n)time, wherenis the length ofs, using the algorithm proposed in the paper from the previous paragraph. This time can be improved toO(n * log n)using linear time sorting algorithm.However, there is so extraordinary, cool and simplelinear time algorithm for building suffix arrays by Kärkkäinen and Sanders, that reading it is a pure pleasure and you cannot miss it.Correspondence between suffix tree and suffix arrayIt is also worth to mention, that a suffix array can be constructed directly from a suffix tree in linear time using DFS traversal. Suffix tree can be also constructed from the suffix array and LCP table as describedhere.

News Reporter

Leave a Reply

Your email address will not be published. Required fields are marked *