FM-Index - 源码 - 源码 - 免费下载

FM-Index

文件大小： unknow

源码售价： 5 个金币积分规则积分充值

资源说明：FM-Index full-text index implementation using RRR Wavelet trees (libcds) and fast suffix sorting (libdivsufsort) including experimental results.

FM-Index - Compressed full-text Index
=====================================

A simple C++ based FM-Index [1] implementation using RRR [4] wavelet trees [5]
which allows to build a full-text index over a given text `T` of size `n`
supporting the following operations:

  * `count(P,m)`   : count the number of occurences of pattern `P`  of size `m` in `T`.
  * `locate(P,m)`  : locate the text positions of all occurences of `P` of size `m` in `T`.
  * `extract(A,B)` : extract `T[A,B]` from the index.
  * `recover()`    : recover `T` from the index.
  
The constructed index uses `nH_k + o(n log sigma)` bits of space [3] which is roughly the size of
the **compressed** representation of `T` and can perform the above operations without the need
to store `T`. An empirical evaluation of the index is sown in the Benchmark section below.

Drawbacks of the FM-Index is long construction time and high memory requirements during construction.
  
Usage
-----

### Compiling the Index

	make

### Building an Index

	./fmbuild alice29.txt alice29.txt.fm
	
Builds and writes the FM-Index `alice29.txt.fm`.

### Running count() queries

	./fmcount -i alice29.txt.fm alice.qry

The queries are stored in a new line seperated file:

	the
	house
	keep
	Alice
	and
	
The index returns the **number of occurrences** for each query:

	./fmcount -i alice29.txt.fm alice.qry
	the : 2101
	house : 20
	keep : 11
	Alice : 395
	and : 880
	
### Running locate() queries

	./fmlocate -i alice29.txt.fm alice.qry

	
The index returns a sorted list of the **locations of all occurences** for each query:

	./fmlocate -i alice29.txt.fm alice.qry
	Read 3 queries
	keep (11) : 46385 51125 69491 74680 81562 83046 104830 105180 133621 149966 151623
	poison (3) : 8151 8619 8731
	tomorrow (1) : 63637

### Running extract() queries
	
	./fmextract -i alice29.txt.fm alice.extract
	
The queries are stored in a new line seperated file:

	118 147
	1213 1245
	24 55

The index returns the **extracted text snippet** for each query:

	./fmextract -i alice29.txt.fm alice.extract
	118 - 147 : 'THE MILLENNIUM FULCRUM EDITION'
	1213 - 1245 : 'TOOK A WATCH OUT OF ITS WAISTCOAT'
	24 - 55 : 'ALICE'S ADVENTURES IN WONDERLAND'
	
### Recover the original text from the index

	./fmrecover -i alice29.txt.fm 
	
The index outputs the original text to `stdout`.

### Verbose output

The `-v` command line parameter enables verbose messages:

	./fmbuild -v alice29.txt
	building index.
	- remapping alphabet.
	- creating cumulative counts C[].
	- performing bwt.
	- sample SA locations.
	- creating bwt output.
	- create RRR wavelet tree over bwt.
	build FM-Index done. (0.101 sec)
	space usage:
	- remap_reverse: 75 bytes (0.07%)
	- C: 1028 bytes (0.90%)
	- Suffixes: 9508 bytes (8.31%)
	- Positions: 9512 bytes (8.31%)
	- Sampled: 7948 bytes (6.95%)
	- T_bwt: 86088 bytes (75.23%)
	input Size n = 152090 bytes
	index Size = 114431 bytes (0.75 n)
	writing FM Index to file 'alice29.txt.fm'
	
	
Using the FM-Index to provide full-text search on a given text `T`
------------------------------------------------------------------

A small code example illustrating the use of the FM class:

	#include 
	#include 
	#include 
	#include "FM.h"

	int
	main(int argc,char** argv) {
		uint8_t* T = /* read text */
		uint32_t n = /* sizeof T */
		FM* = new FM(T,n);
		if(FM) {
			/* count the occurences of 'house' in T */
			uint32_t cnt = FM->count("house",strlen("house"));
			
			/* get all locations offsets of 'house' in T */
			uint32_t matches; /* number of matches */
			uint32_t* locs; /* list of location offsets */
			locs = FM->locate("house",strlen("house"),&matches);
			
			/* extract text snippet from T */
			uint8_t* snippet = FM->extract(5,251);
			
			/* recover T from index */
			uint32_t nnew; /* size of Tnew */
			uint8_t* Tnew = FM->reconstructText(&nnew);
			
			delete FM;
		}
	}
	
Compile with `g++ -o test main.cpp FM.cpp util.c libcds.a libdivsufsort.a`

Benchmarks
----------

Experiments are similar to the ones described in [1]. All experiments were run on a `Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz` using 4GB of RAM. 
The default samplerate `s=64` was used in all experiments.

### Test data

Test data was taken from the [The Pizza&Chili Site](http://pizzachili.di.unipi.it/) and TREC. 


  
    File Description Alphabet size Entropy (bps)
  
  
    wsj English Text taken from the TREC wsj collection 90 4.60
  
  
	src Concatenated source code (.c,.h,.C,.java) of linux-2.6.11.6 and gcc-4.0.0 230 5.47
  
  
	proteins Sequence of newline-separated protein sequences 25 4.20
  
  
	dna Gene DNA sequences 4 1.97
  
  
	xml XML that provides bibliographic info on compsci pubs(dblp) 96 5.26
  


### Construction

Peak memory usage was measured using the `valgrind --tool=massif` tool. Running time was measured using the `gettimeofday()` system call.


  
    File Size [MB] Time [sec] Memory [MB]
  
  
    wsj 3 2.157 20.02
  
  
	wsj 6 4.487 39.73
  
  
	wsj 12 9.293 79.15
  
  
	wsj 25 19.028 158.0
  
  
	wsj 50 38.699 315.7
  
  
	wsj 100 78.287 631.2
  
  
	wsj 200 158.341 1233
  
  
	src 50 39.909 315.7
  
  
	src 100 80.489 631.2
  
  
	src 200 163.072 1233
  
  
	proteins 50 35.374 315.7
  
  
	proteins 100 72.824 631.2
  
  
	proteins 200 146.173 1233
  
  
	dna 50 25.941 315.7
  
  
	dna 100 53.287 631.2
  
  
	dna 200 109.449 1233
    
  
	xml 50 36.601 315.7
  
  
	xml 100 74.104 631.2
  
  
	xml 200 150.050 1233
    


Construction time increases linearly `O(n)` with size `n` of `T`. Memory requirement is roughly `6n` independent of the file type.


  
    File Size [MB] Index Size [MB] Index Size [relative to file size]
  
  
    wsj 25 15 0.6
  
  
	wsj 50 29 0.58
  
  
	wsj 100 57 0.57
  
  
	wsj 200 112 0.56
  
  
	src 50 30 0.77
  
  
	src 100 59 0.77
  
  
	src 200 117 0.75
  
  
	proteins 50 36 0.72
  
  
	proteins 100 71 0.71
  
  
	proteins 200 137 0.685
  
  
	dna 50 24 0.48
  
  
	dna 100 48 0.48
  
  
	dna 200 95 0.475
  
  
	xml 50 25 0.50
  
  
	xml 100 50 0.50
  
  
	xml 200 99 0.495	
    


Index size depends on the compressability of `T`. For each specific file type, as `n` increases, 
the size of the auxillary data becomes less significant which leads to better overall compression ratio.

### Count

50000 random patterns for each length `m=5 to 50` were extracted from each 200MB file. 

![FM-Index count() - 200MB - 50000 Queries](https://github.com/mpetri/FM-Index/raw/master/benchmark/FM_count_200MB_Q50000.png)

The graph shows the time in seconds it took the FM-Index to answer all 50000 queries on different file types. Note that the query
time grows linearly `O(m)` with the pattern length. Files with smaller alphabet show faster query times overall.

Also note that the query time (`m=20`) does **not** depend on the size `n` of the text `T`:


  
    File Size [MB] Query Time [sec]
  
  
    wsj 3 13.37
  
  
    wsj 6 11.96
  
  
    wsj 12 12.74
  
  
    wsj 25 12.74
  
  
    wsj 50 12.07
  
  
    wsj 100 14.59
  
  
    wsj 200 14.34
  


Overall, searching for 50000 patterns in `T` using a FM-Index is fast but requieres considerable amount of upfront work (time+space).

### Locate

`k` queries pf length `5` are randomly selected from `T` so they roughly amount to a certain number of total occurences over all queries.

![FM-Index locate() - 200MB - Query length 5](https://github.com/mpetri/FM-Index/raw/master/benchmark/FM_locate_200MB_QL5.png)

Note the running time increases linearly with the number of occurrences. Similar to `count()`, files with small alphabet size perform 
better most likely due to decreased height of the wavelet tree. Comparing the running times of both `count()` and `locate()` we observe
that `locate()` is significantly slower than `count()` as patterns (especially short ones), on large text collections, tend to have many occurrences.

The samplerate `s` (default = 64) determines how often the underlying suffix array is sampled to enable faster `locate()` calls. For different samplerates
the index achieves different time and space trade-offs during the `locate()` operation:

![FM-Index locate() - 50MB - Query length 6 - 50000 occurrences - variable samplerate](https://github.com/mpetri/FM-Index/raw/master/benchmark/FM_locate_samplerate_50MB_QL6.png)

The graph above shows the time required to retrieve the locations of 50000 occurrences compared to the size of the FM-Index for different suffix samplerates.
Smaller samplerate decrease the `locate()` running time but the space used by the index increases. For `s=8` the index is 1.5 times the sizes of the original text. 
For `s=512`, that is only every 512th position in the underlying suffix array is sampled, the index is less than 50% of the original text. The above graph shows that the
best time-space trade-off is achieved for samplerates between 64 and 128.

### Extract

Extracting snippets of `T` from the index shows the same properties as the `locate()` operation. Like `locate()`, the running time of `extract()` grows linearly with the size of the
size of the snippet. The running time is also influenced by the samplerate `s` which provides the same time and space trade-offs described above. 

Libraries
---------

The following libraries are needed to use the index. Sourcecode for both
libaries is included.

 * [libcds](http://libcds.recoded.cl/) -- succinct low level data structures
 * [libdivsufsort](http://code.google.com/p/libdivsufsort/) -- fast suffix sorting (requires cmake to build)

Licence
--------

GPL v3 (see LICENCE file)

References
-----------

 1. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. 
    Journal of the ACM, 52(4):552-581, 2005.
 2. Francisco Claude and Gonzalo Navarro. Practical Rank/Select Queries over Arbitrary Sequences. 
    SPIRE'08 176-187, 2008.
 3. Veli M"akinen and Gonzalo Navarro. Implicit Compression Boosting with Applications to Self-Indexing. 
    SPIRE'07 214-226.
 4. R. Raman, V. Raman, and S. Srinivasa Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. 
    SODA'02, 233-242.
 5. R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. 
    SODA'03, 841-850.
 6. [The Pizza&Chili Site](http://pizzachili.di.unipi.it/).

Author
------

Matthias Petri  see [github](http://github/com/mpetri/FM-Index).

File	Description	Alphabet size	Entropy (bps)
wsj	English Text taken from the TREC wsj collection	90	4.60
src	Concatenated source code (.c,.h,.C,.java) of linux-2.6.11.6 and gcc-4.0.0	230	5.47
proteins	Sequence of newline-separated protein sequences	25	4.20
dna	Gene DNA sequences	4	1.97
xml	XML that provides bibliographic info on compsci pubs(dblp)	96	5.26

File	Size [MB]	Time [sec]	Memory [MB]
wsj	3	2.157	20.02
wsj	6	4.487	39.73
wsj	12	9.293	79.15
wsj	25	19.028	158.0
wsj	50	38.699	315.7
wsj	100	78.287	631.2
wsj	200	158.341	1233
src	50	39.909	315.7
src	100	80.489	631.2
src	200	163.072	1233
proteins	50	35.374	315.7
proteins	100	72.824	631.2
proteins	200	146.173	1233
dna	50	25.941	315.7
dna	100	53.287	631.2
dna	200	109.449	1233
xml	50	36.601	315.7
xml	100	74.104	631.2
xml	200	150.050	1233

File	Size [MB]	Index Size [MB]	Index Size [relative to file size]
wsj	25	15	0.6
wsj	50	29	0.58
wsj	100	57	0.57
wsj	200	112	0.56
src	50	30	0.77
src	100	59	0.77
src	200	117	0.75
proteins	50	36	0.72
proteins	100	71	0.71
proteins	200	137	0.685
dna	50	24	0.48
dna	100	48	0.48
dna	200	95	0.475
xml	50	25	0.50
xml	100	50	0.50
xml	200	99	0.495

File	Size [MB]	Query Time [sec]
wsj	3	13.37
wsj	6	11.96
wsj	12	12.74
wsj	25	12.74
wsj	50	12.07
wsj	100	14.59
wsj	200	14.34

部分文件列表（点击文件名可查看文件内容）

					
									本源码包内暂不包含可直接显示的源代码文件，请下载源码包。