FM-Index
文件大小: unknow
源码售价: 5 个金币 积分规则     积分充值
资源说明:FM-Index full-text index implementation using RRR Wavelet trees (libcds) and fast suffix sorting (libdivsufsort) including experimental results.
FM-Index - Compressed full-text Index
=====================================

A simple C++ based FM-Index [1] implementation using RRR [4] wavelet trees [5]
which allows to build a full-text index over a given text `T` of size `n`
supporting the following operations:

  * `count(P,m)`   : count the number of occurences of pattern `P`  of size `m` in `T`.
  * `locate(P,m)`  : locate the text positions of all occurences of `P` of size `m` in `T`.
  * `extract(A,B)` : extract `T[A,B]` from the index.
  * `recover()`    : recover `T` from the index.
  
The constructed index uses `nH_k + o(n log sigma)` bits of space [3] which is roughly the size of
the **compressed** representation of `T` and can perform the above operations without the need
to store `T`. An empirical evaluation of the index is sown in the Benchmark section below.

Drawbacks of the FM-Index is long construction time and high memory requirements during construction.
  
Usage
-----

### Compiling the Index

	make

### Building an Index

	./fmbuild alice29.txt alice29.txt.fm
	
Builds and writes the FM-Index `alice29.txt.fm`.

### Running count() queries

	./fmcount -i alice29.txt.fm alice.qry

The queries are stored in a new line seperated file:

	the
	house
	keep
	Alice
	and
	
The index returns the **number of occurrences** for each query:

	./fmcount -i alice29.txt.fm alice.qry
	the : 2101
	house : 20
	keep : 11
	Alice : 395
	and : 880
	
### Running locate() queries

	./fmlocate -i alice29.txt.fm alice.qry

	
The index returns a sorted list of the **locations of all occurences** for each query:

	./fmlocate -i alice29.txt.fm alice.qry
	Read 3 queries
	keep (11) : 46385 51125 69491 74680 81562 83046 104830 105180 133621 149966 151623
	poison (3) : 8151 8619 8731
	tomorrow (1) : 63637

### Running extract() queries
	
	./fmextract -i alice29.txt.fm alice.extract
	
The queries are stored in a new line seperated file:

	118 147
	1213 1245
	24 55

The index returns the **extracted text snippet** for each query:

	./fmextract -i alice29.txt.fm alice.extract
	118 - 147 : 'THE MILLENNIUM FULCRUM EDITION'
	1213 - 1245 : 'TOOK A WATCH OUT OF ITS WAISTCOAT'
	24 - 55 : 'ALICE'S ADVENTURES IN WONDERLAND'
	
### Recover the original text from the index

	./fmrecover -i alice29.txt.fm 
	
The index outputs the original text to `stdout`.

### Verbose output

The `-v` command line parameter enables verbose messages:

	./fmbuild -v alice29.txt
	building index.
	- remapping alphabet.
	- creating cumulative counts C[].
	- performing bwt.
	- sample SA locations.
	- creating bwt output.
	- create RRR wavelet tree over bwt.
	build FM-Index done. (0.101 sec)
	space usage:
	- remap_reverse: 75 bytes (0.07%)
	- C: 1028 bytes (0.90%)
	- Suffixes: 9508 bytes (8.31%)
	- Positions: 9512 bytes (8.31%)
	- Sampled: 7948 bytes (6.95%)
	- T_bwt: 86088 bytes (75.23%)
	input Size n = 152090 bytes
	index Size = 114431 bytes (0.75 n)
	writing FM Index to file 'alice29.txt.fm'
	
	
Using the FM-Index to provide full-text search on a given text `T`
------------------------------------------------------------------

A small code example illustrating the use of the FM class:

	#include 
	#include 
	#include 
	#include "FM.h"

	int
	main(int argc,char** argv) {
		uint8_t* T = /* read text */
		uint32_t n = /* sizeof T */
		FM* = new FM(T,n);
		if(FM) {
			/* count the occurences of 'house' in T */
			uint32_t cnt = FM->count("house",strlen("house"));
			
			/* get all locations offsets of 'house' in T */
			uint32_t matches; /* number of matches */
			uint32_t* locs; /* list of location offsets */
			locs = FM->locate("house",strlen("house"),&matches);
			
			/* extract text snippet from T */
			uint8_t* snippet = FM->extract(5,251);
			
			/* recover T from index */
			uint32_t nnew; /* size of Tnew */
			uint8_t* Tnew = FM->reconstructText(&nnew);
			
			delete FM;
		}
	}
	
Compile with `g++ -o test main.cpp FM.cpp util.c libcds.a libdivsufsort.a`

Benchmarks
----------

Experiments are similar to the ones described in [1]. All experiments were run on a `Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz` using 4GB of RAM. 
The default samplerate `s=64` was used in all experiments.

### Test data

Test data was taken from the [The Pizza&Chili Site](http://pizzachili.di.unipi.it/) and TREC. 

FileDescriptionAlphabet sizeEntropy (bps)
wsjEnglish Text taken from the TREC wsj collection904.60
srcConcatenated source code (.c,.h,.C,.java) of linux-2.6.11.6 and gcc-4.0.02305.47
proteinsSequence of newline-separated protein sequences254.20
dnaGene DNA sequences41.97
xmlXML that provides bibliographic info on compsci pubs(dblp)965.26
### Construction Peak memory usage was measured using the `valgrind --tool=massif` tool. Running time was measured using the `gettimeofday()` system call.
FileSize [MB]Time [sec]Memory [MB]
wsj32.15720.02
wsj64.48739.73
wsj129.29379.15
wsj2519.028158.0
wsj5038.699315.7
wsj10078.287631.2
wsj200158.3411233
src5039.909315.7
src10080.489631.2
src200163.0721233
proteins5035.374315.7
proteins10072.824631.2
proteins200146.1731233
dna5025.941315.7
dna10053.287631.2
dna200109.4491233
xml5036.601315.7
xml10074.104631.2
xml200150.0501233
Construction time increases linearly `O(n)` with size `n` of `T`. Memory requirement is roughly `6n` independent of the file type.
FileSize [MB]Index Size [MB]Index Size [relative to file size]
wsj25150.6
wsj50290.58
wsj100570.57
wsj2001120.56
src50300.77
src100590.77
src2001170.75
proteins50360.72
proteins100710.71
proteins2001370.685
dna50240.48
dna100480.48
dna200950.475
xml50250.50
xml100500.50
xml200990.495
Index size depends on the compressability of `T`. For each specific file type, as `n` increases, the size of the auxillary data becomes less significant which leads to better overall compression ratio. ### Count 50000 random patterns for each length `m=5 to 50` were extracted from each 200MB file. ![FM-Index count() - 200MB - 50000 Queries](https://github.com/mpetri/FM-Index/raw/master/benchmark/FM_count_200MB_Q50000.png) The graph shows the time in seconds it took the FM-Index to answer all 50000 queries on different file types. Note that the query time grows linearly `O(m)` with the pattern length. Files with smaller alphabet show faster query times overall. Also note that the query time (`m=20`) does **not** depend on the size `n` of the text `T`:
FileSize [MB]Query Time [sec]
wsj313.37
wsj611.96
wsj1212.74
wsj2512.74
wsj5012.07
wsj10014.59
wsj20014.34
Overall, searching for 50000 patterns in `T` using a FM-Index is fast but requieres considerable amount of upfront work (time+space). ### Locate `k` queries pf length `5` are randomly selected from `T` so they roughly amount to a certain number of total occurences over all queries. ![FM-Index locate() - 200MB - Query length 5](https://github.com/mpetri/FM-Index/raw/master/benchmark/FM_locate_200MB_QL5.png) Note the running time increases linearly with the number of occurrences. Similar to `count()`, files with small alphabet size perform better most likely due to decreased height of the wavelet tree. Comparing the running times of both `count()` and `locate()` we observe that `locate()` is significantly slower than `count()` as patterns (especially short ones), on large text collections, tend to have many occurrences. The samplerate `s` (default = 64) determines how often the underlying suffix array is sampled to enable faster `locate()` calls. For different samplerates the index achieves different time and space trade-offs during the `locate()` operation: ![FM-Index locate() - 50MB - Query length 6 - 50000 occurrences - variable samplerate](https://github.com/mpetri/FM-Index/raw/master/benchmark/FM_locate_samplerate_50MB_QL6.png) The graph above shows the time required to retrieve the locations of 50000 occurrences compared to the size of the FM-Index for different suffix samplerates. Smaller samplerate decrease the `locate()` running time but the space used by the index increases. For `s=8` the index is 1.5 times the sizes of the original text. For `s=512`, that is only every 512th position in the underlying suffix array is sampled, the index is less than 50% of the original text. The above graph shows that the best time-space trade-off is achieved for samplerates between 64 and 128. ### Extract Extracting snippets of `T` from the index shows the same properties as the `locate()` operation. Like `locate()`, the running time of `extract()` grows linearly with the size of the size of the snippet. The running time is also influenced by the samplerate `s` which provides the same time and space trade-offs described above. Libraries --------- The following libraries are needed to use the index. Sourcecode for both libaries is included. * [libcds](http://libcds.recoded.cl/) -- succinct low level data structures * [libdivsufsort](http://code.google.com/p/libdivsufsort/) -- fast suffix sorting (requires cmake to build) Licence -------- GPL v3 (see LICENCE file) References ----------- 1. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552-581, 2005. 2. Francisco Claude and Gonzalo Navarro. Practical Rank/Select Queries over Arbitrary Sequences. SPIRE'08 176-187, 2008. 3. Veli M"akinen and Gonzalo Navarro. Implicit Compression Boosting with Applications to Self-Indexing. SPIRE'07 214-226. 4. R. Raman, V. Raman, and S. Srinivasa Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. SODA'02, 233-242. 5. R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. SODA'03, 841-850. 6. [The Pizza&Chili Site](http://pizzachili.di.unipi.it/). Author ------ Matthias Petri see [github](http://github/com/mpetri/FM-Index).

本源码包内暂不包含可直接显示的源代码文件,请下载源码包。