Search - 源码 - 源码 - 免费下载

文件大小： unknow

源码售价： 5 个金币积分规则积分充值

资源说明：

Simulation of a search engine on an acronym database.


Module			Unimplemented			Unit tested	Integrated			Multi threaded
======================================================================================================================

thread pool		threadpool kill+more		lightly		with disk search		na
rbt			delete				yes		with dictionary words		no
skip list		add				lightly		with sorted edge list		no
suffix tree		auto complete			yes		needs integration w/ graph	yes
disk search fallback	mostly complete			yes		with thread pool		no
cut-nodes/kruskals 	mostly complete			yes		with graph			no

======
Goals:
======
1. reverse lookup of a dictionary word
2. recommendations of other acronyms
3. places where recommendation list might be poisoned for lexical analysis.
4. regular, non reverse lookup

eg input: federal
1. reverse lookup: fbi, fbh
2. recommendations L337
3. fbh is the only vertex that connectes words in component (fbi, fbh, fsl) to words in component (l337, HACK, cracker)
4. L337: hacker slang for elite (leet)

Additional Goals: scalabitity, roughly linear to nlogn time for operations, multi threaded access to data structures.
Applications: one might imagine such a feature going into phones (message acronyms) or social networks (local slang continuously added to graph, providing example usage from friends public news feed). 
My sole aim however is to understand algorithms and data structures.

==========
Algorithm:
==========
build_graph:
create a graph of all the acronyms
go through acronyms definition and eviscerate filler words
look for each word in suffix tree. each leaf of the suffix tree has a list of acronyms with this word in its definition.
eg: aadvark and aadwolf will have
	 aard
        /   \
      vark  wolf - aardwolf acronym list: (AAD,AAC,AAR..)	

if not found insert word into suffix tree, add acronym to list.
add an edge between current acronym and all other acronyms in leaf node list. 
weight of edge = sum of ascii in word (i wanted something proportional to the length of the word).

Acronym recommendations: 
1. highest weight edges connected to an acronym - these are 'stupid' recommendations because they only contain acronyms with the exact same words.
2. dfs of spanning forest this node is a part of. this will link related words like fbi and fbh in the example.
vague recommendations creep in when there are sets of acronyms that have many links between them, and then only 1 acronym that connects this set with another similar set. the vertex that connects the 2 strongly connected components usually needs closer inspection.  

Data Structures:
1. My computer refuses to accomodate the array for an adjacency list representation of all the acronyms. the base data structure will have to be some non contiguous memory, like a tree. Ive chose a red black tree for reasons related to multi threading. 
2. The 'sorted edge list' in kruskals algorithm gets prohibitively slow on all acronyms if a linked list is used. I replaced it with a skip list.
3. There wasnt anough memory to store all definitions of acronyms on the heap, mostly because words were constantly being repeated. this led me to a suffix tree.
4. The edges connected to a vertex are linked lists. 

Fallback:
we can get acronym search working even if we're not ready to make recommendations yet (say the graph takes 2 minutes to build and users type 20 queries in that time) by looking straight into the database. I do a binary search in the file using a thread/request and 1 thread to initialize the indices.  

Multi threading:
Implemented a basic thread pool using pthreads, a fair amount of the thread pool functionality is yet to be implemented. I use the thread pool to do 'expensive searches' while the other data structures are building themselves. additional functionality might be necessary as i start multi threading different parts.

部分文件列表（点击文件名可查看文件内容）

					
									本源码包内暂不包含可直接显示的源代码文件，请下载源码包。