资源说明:FM-Index full-text index implementation using RRR Wavelet trees (libcds) and fast suffix sorting (libdivsufsort) including experimental results.
FM-Index - Compressed full-text Index ===================================== A simple C++ based FM-Index [1] implementation using RRR [4] wavelet trees [5] which allows to build a full-text index over a given text `T` of size `n` supporting the following operations: * `count(P,m)` : count the number of occurences of pattern `P` of size `m` in `T`. * `locate(P,m)` : locate the text positions of all occurences of `P` of size `m` in `T`. * `extract(A,B)` : extract `T[A,B]` from the index. * `recover()` : recover `T` from the index. The constructed index uses `nH_k + o(n log sigma)` bits of space [3] which is roughly the size of the **compressed** representation of `T` and can perform the above operations without the need to store `T`. An empirical evaluation of the index is sown in the Benchmark section below. Drawbacks of the FM-Index is long construction time and high memory requirements during construction. Usage ----- ### Compiling the Index make ### Building an Index ./fmbuild alice29.txt alice29.txt.fm Builds and writes the FM-Index `alice29.txt.fm`. ### Running count() queries ./fmcount -i alice29.txt.fm alice.qry The queries are stored in a new line seperated file: the house keep Alice and The index returns the **number of occurrences** for each query: ./fmcount -i alice29.txt.fm alice.qry the : 2101 house : 20 keep : 11 Alice : 395 and : 880 ### Running locate() queries ./fmlocate -i alice29.txt.fm alice.qry The index returns a sorted list of the **locations of all occurences** for each query: ./fmlocate -i alice29.txt.fm alice.qry Read 3 queries keep (11) : 46385 51125 69491 74680 81562 83046 104830 105180 133621 149966 151623 poison (3) : 8151 8619 8731 tomorrow (1) : 63637 ### Running extract() queries ./fmextract -i alice29.txt.fm alice.extract The queries are stored in a new line seperated file: 118 147 1213 1245 24 55 The index returns the **extracted text snippet** for each query: ./fmextract -i alice29.txt.fm alice.extract 118 - 147 : 'THE MILLENNIUM FULCRUM EDITION' 1213 - 1245 : 'TOOK A WATCH OUT OF ITS WAISTCOAT' 24 - 55 : 'ALICE'S ADVENTURES IN WONDERLAND' ### Recover the original text from the index ./fmrecover -i alice29.txt.fm The index outputs the original text to `stdout`. ### Verbose output The `-v` command line parameter enables verbose messages: ./fmbuild -v alice29.txt building index. - remapping alphabet. - creating cumulative counts C[]. - performing bwt. - sample SA locations. - creating bwt output. - create RRR wavelet tree over bwt. build FM-Index done. (0.101 sec) space usage: - remap_reverse: 75 bytes (0.07%) - C: 1028 bytes (0.90%) - Suffixes: 9508 bytes (8.31%) - Positions: 9512 bytes (8.31%) - Sampled: 7948 bytes (6.95%) - T_bwt: 86088 bytes (75.23%) input Size n = 152090 bytes index Size = 114431 bytes (0.75 n) writing FM Index to file 'alice29.txt.fm' Using the FM-Index to provide full-text search on a given text `T` ------------------------------------------------------------------ A small code example illustrating the use of the FM class: #include#include #include #include "FM.h" int main(int argc,char** argv) { uint8_t* T = /* read text */ uint32_t n = /* sizeof T */ FM* = new FM(T,n); if(FM) { /* count the occurences of 'house' in T */ uint32_t cnt = FM->count("house",strlen("house")); /* get all locations offsets of 'house' in T */ uint32_t matches; /* number of matches */ uint32_t* locs; /* list of location offsets */ locs = FM->locate("house",strlen("house"),&matches); /* extract text snippet from T */ uint8_t* snippet = FM->extract(5,251); /* recover T from index */ uint32_t nnew; /* size of Tnew */ uint8_t* Tnew = FM->reconstructText(&nnew); delete FM; } } Compile with `g++ -o test main.cpp FM.cpp util.c libcds.a libdivsufsort.a` Benchmarks ---------- Experiments are similar to the ones described in [1]. All experiments were run on a `Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz` using 4GB of RAM. The default samplerate `s=64` was used in all experiments. ### Test data Test data was taken from the [The Pizza&Chili Site](http://pizzachili.di.unipi.it/) and TREC. ### Construction Peak memory usage was measured using the `valgrind --tool=massif` tool. Running time was measured using the `gettimeofday()` system call.
File Description Alphabet size Entropy (bps) wsj English Text taken from the TREC wsj collection 90 4.60 src Concatenated source code (.c,.h,.C,.java) of linux-2.6.11.6 and gcc-4.0.0 230 5.47 proteins Sequence of newline-separated protein sequences 25 4.20 dna Gene DNA sequences 4 1.97 xml XML that provides bibliographic info on compsci pubs(dblp) 96 5.26 Construction time increases linearly `O(n)` with size `n` of `T`. Memory requirement is roughly `6n` independent of the file type.
File Size [MB] Time [sec] Memory [MB] wsj 3 2.157 20.02 wsj 6 4.487 39.73 wsj 12 9.293 79.15 wsj 25 19.028 158.0 wsj 50 38.699 315.7 wsj 100 78.287 631.2 wsj 200 158.341 1233 src 50 39.909 315.7 src 100 80.489 631.2 src 200 163.072 1233 proteins 50 35.374 315.7 proteins 100 72.824 631.2 proteins 200 146.173 1233 dna 50 25.941 315.7 dna 100 53.287 631.2 dna 200 109.449 1233 xml 50 36.601 315.7 xml 100 74.104 631.2 xml 200 150.050 1233 Index size depends on the compressability of `T`. For each specific file type, as `n` increases, the size of the auxillary data becomes less significant which leads to better overall compression ratio. ### Count 50000 random patterns for each length `m=5 to 50` were extracted from each 200MB file.  The graph shows the time in seconds it took the FM-Index to answer all 50000 queries on different file types. Note that the query time grows linearly `O(m)` with the pattern length. Files with smaller alphabet show faster query times overall. Also note that the query time (`m=20`) does **not** depend on the size `n` of the text `T`:
File Size [MB] Index Size [MB] Index Size [relative to file size] wsj 25 15 0.6 wsj 50 29 0.58 wsj 100 57 0.57 wsj 200 112 0.56 src 50 30 0.77 src 100 59 0.77 src 200 117 0.75 proteins 50 36 0.72 proteins 100 71 0.71 proteins 200 137 0.685 dna 50 24 0.48 dna 100 48 0.48 dna 200 95 0.475 xml 50 25 0.50 xml 100 50 0.50 xml 200 99 0.495 Overall, searching for 50000 patterns in `T` using a FM-Index is fast but requieres considerable amount of upfront work (time+space). ### Locate `k` queries pf length `5` are randomly selected from `T` so they roughly amount to a certain number of total occurences over all queries.  Note the running time increases linearly with the number of occurrences. Similar to `count()`, files with small alphabet size perform better most likely due to decreased height of the wavelet tree. Comparing the running times of both `count()` and `locate()` we observe that `locate()` is significantly slower than `count()` as patterns (especially short ones), on large text collections, tend to have many occurrences. The samplerate `s` (default = 64) determines how often the underlying suffix array is sampled to enable faster `locate()` calls. For different samplerates the index achieves different time and space trade-offs during the `locate()` operation:  The graph above shows the time required to retrieve the locations of 50000 occurrences compared to the size of the FM-Index for different suffix samplerates. Smaller samplerate decrease the `locate()` running time but the space used by the index increases. For `s=8` the index is 1.5 times the sizes of the original text. For `s=512`, that is only every 512th position in the underlying suffix array is sampled, the index is less than 50% of the original text. The above graph shows that the best time-space trade-off is achieved for samplerates between 64 and 128. ### Extract Extracting snippets of `T` from the index shows the same properties as the `locate()` operation. Like `locate()`, the running time of `extract()` grows linearly with the size of the size of the snippet. The running time is also influenced by the samplerate `s` which provides the same time and space trade-offs described above. Libraries --------- The following libraries are needed to use the index. Sourcecode for both libaries is included. * [libcds](http://libcds.recoded.cl/) -- succinct low level data structures * [libdivsufsort](http://code.google.com/p/libdivsufsort/) -- fast suffix sorting (requires cmake to build) Licence -------- GPL v3 (see LICENCE file) References ----------- 1. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552-581, 2005. 2. Francisco Claude and Gonzalo Navarro. Practical Rank/Select Queries over Arbitrary Sequences. SPIRE'08 176-187, 2008. 3. Veli M"akinen and Gonzalo Navarro. Implicit Compression Boosting with Applications to Self-Indexing. SPIRE'07 214-226. 4. R. Raman, V. Raman, and S. Srinivasa Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. SODA'02, 233-242. 5. R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. SODA'03, 841-850. 6. [The Pizza&Chili Site](http://pizzachili.di.unipi.it/). Author ------ Matthias Petri
File Size [MB] Query Time [sec] wsj 3 13.37 wsj 6 11.96 wsj 12 12.74 wsj 25 12.74 wsj 50 12.07 wsj 100 14.59 wsj 200 14.34 see [github](http://github/com/mpetri/FM-Index).
本源码包内暂不包含可直接显示的源代码文件,请下载源码包。
English
