Prefix tree with bitmap indexes

Not so long ago I needed to implement a functionality that was already implemented by other people and more than once, but did not suit for some characteristics. In this case, some kind of data structure was required with the ability to search by a string key or several initial characters of the key. The key is a string of Latin letters in any case, spaces and numbers.

Other implementations did not suit, first of all, on the consumed memory. And in any case, another implementation of the prefix tree will not be superfluous.

The idea was probably inspired by this article , since it was decided to use the bitmap index for indexing links to subsequent nodes in the node.

Obviously, the number of links in one node can be no more than 63 (10 digits, 26 letters in each case, plus a space). Therefore, for a bitmap index, a 64-bit number is used at each node. The set bit in any position means the presence of the corresponding symbol and node (subtree).

The distribution in the bitmap index is as follows: numbers from 0 to 9 occupy from 0 to 9 bits, letters az from 10 to 35 bits, letters AZ from 36 to 62 bits and a space occupies 63 bits. To get the bit number, the ASCII character code is simply used, reduced by a certain number. For each range of characters, the number is different: for the numbers 48 (the range of digits in ASCII codes begins with 48), for the letters az 87 (the range of letters az starts with minus 10 bits already occupied by 10 digits) and so on.

The corresponding table is shown below.

Character distribution
Symbol01...9ab...zAB...Zspace
Bit number01...910eleven...353637...6162

The node is represented by the code below:

type Trie struct { rw sync.RWMutex characters uint64 subTries []*Trie data interface{} } 

Thus, each node stores the bitmap index of the following characters, a slice with links to subtrees and the data itself. I will tell briefly about the mechanisms of search, insertion and removal.

When entering the input, the key symbolically passes through the tree, starting from the root node. For each character from its ASCII code, the bit number is calculated, and on its basis the index is calculated in the slice of links (subTries) to subtrees. The index value is the number of bits before the desired bit. For example, we have such an index: 00 ... 101. This means that in the bitmap index there are numbers 0 and 2. Then the index for the number 0 will be zero, and for the number 2 it will be 1, that is, subTries will store 2 links to subtrees: subTries [0] and subTries [1] .

The code to get the bit number for the character:

 func calcBitNum(char rune) (uint64, bool) { // Characters az use bit positions 10-35 if char > 96 && char < 123 { return uint64(char - 87), true } // Characters AZ use bit positions 36-61 if char > 64 && char < 91 { return uint64(char - 29), true } // digits 0-9 use bit positions 0-9 if char > 47 && char < 58 { return uint64(char - 48), true } // space uses 62 bit position if char == 32 { return 62, true } return 0, false } 

The code to get the node index for the character:

 func (trie *Trie) getSubTreeIndex(bitNum uint64) int { shifted := trie.characters << (64 - bitNum) return bits.OnesCount64(shifted) } 

The code to get the node for the given character:

 func (trie *Trie) getSubTrie(char rune) (uint64, int, *Trie) { bitNum, _ := calcBitNum(char) subTrieId := trie.getSubTreeIndex(bitNum) // There is no subTrie under given character since the corresponding bit is zero if trie.characters&(1<<bitNum) == 0 { return bitNum, subTrieId, nil } return bitNum, subTrieId, trie.subTries[subTrieId] } 

Insert


A tree is traversed to insert an element. If any character (and therefore all subsequent ones) is missing, it is inserted into the bitmap index. A subtree is created, the link to which is inserted into the right place of the slice subTries (the right place is determined by the number of bits to the desired). When they reached the last character of the key, data is inserted into the last subtree.

Search


To search, the key symbolically "runs" through the tree. As soon as it ends, the result is returned. If there is no data for the key, but there are further subtrees, then the results are selected from all subsequent subtrees that have data. The limit on the number of returned data can be set. Thus, it is possible to search for a single node by the coincidence of the entire key, or a set of nodes by the coincidence of several initial characters.

Delete


The removal procedure came out more complicated (as expected) than the rest. To delete a key, a tree is traversed, during which nodes that do not have data are written in a separate slice. If a node with data is found, then the slice is reset to zero, and the passage through the tree continues further. That is, the slice of nodes for deletion should contain a sequence of nodes that do not have data. If in the end the slice is not empty, then it goes in the reverse order and the corresponding bits are removed from the bitmap index and the nodes are deleted.

Total


The result is a prefix tree that is superior to other analogs in speed and memory consumption (an average of 30% less memory and as much faster). Also, I was not too lazy to make multithreaded operations with a tree possible, which significantly increases productivity, especially in search operations. Of course, the solution is quite specialized (limitation in the form of Latin characters, numbers and a space as key components).

Based on the purpose of the prefix tree, this solution can be used to create a dictionary, search index.

I did not cite the insert, search and delete codes, so as not to clutter up the article, in addition, the source code is available on the github .

Source: https://habr.com/ru/post/479244/


All Articles