资源说明:
Simulation of a search engine on an acronym database.
Module Unimplemented Unit tested Integrated Multi threaded
======================================================================================================================
thread pool threadpool kill+more lightly with disk search na
rbt delete yes with dictionary words no
skip list add lightly with sorted edge list no
suffix tree auto complete yes needs integration w/ graph yes
disk search fallback mostly complete yes with thread pool no
cut-nodes/kruskals mostly complete yes with graph no
======
Goals:
======
1. reverse lookup of a dictionary word
2. recommendations of other acronyms
3. places where recommendation list might be poisoned for lexical analysis.
4. regular, non reverse lookup
eg input: federal
1. reverse lookup: fbi, fbh
2. recommendations L337
3. fbh is the only vertex that connectes words in component (fbi, fbh, fsl) to words in component (l337, HACK, cracker)
4. L337: hacker slang for elite (leet)
Additional Goals: scalabitity, roughly linear to nlogn time for operations, multi threaded access to data structures.
Applications: one might imagine such a feature going into phones (message acronyms) or social networks (local slang continuously added to graph, providing example usage from friends public news feed).
My sole aim however is to understand algorithms and data structures.
==========
Algorithm:
==========
build_graph:
create a graph of all the acronyms
go through acronyms definition and eviscerate filler words
look for each word in suffix tree. each leaf of the suffix tree has a list of acronyms with this word in its definition.
eg: aadvark and aadwolf will have
aard
/ \
vark wolf - aardwolf acronym list: (AAD,AAC,AAR..)
if not found insert word into suffix tree, add acronym to list.
add an edge between current acronym and all other acronyms in leaf node list.
weight of edge = sum of ascii in word (i wanted something proportional to the length of the word).
Acronym recommendations:
1. highest weight edges connected to an acronym - these are 'stupid' recommendations because they only contain acronyms with the exact same words.
2. dfs of spanning forest this node is a part of. this will link related words like fbi and fbh in the example.
vague recommendations creep in when there are sets of acronyms that have many links between them, and then only 1 acronym that connects this set with another similar set. the vertex that connects the 2 strongly connected components usually needs closer inspection.
Data Structures:
1. My computer refuses to accomodate the array for an adjacency list representation of all the acronyms. the base data structure will have to be some non contiguous memory, like a tree. Ive chose a red black tree for reasons related to multi threading.
2. The 'sorted edge list' in kruskals algorithm gets prohibitively slow on all acronyms if a linked list is used. I replaced it with a skip list.
3. There wasnt anough memory to store all definitions of acronyms on the heap, mostly because words were constantly being repeated. this led me to a suffix tree.
4. The edges connected to a vertex are linked lists.
Fallback:
we can get acronym search working even if we're not ready to make recommendations yet (say the graph takes 2 minutes to build and users type 20 queries in that time) by looking straight into the database. I do a binary search in the file using a thread/request and 1 thread to initialize the indices.
Multi threading:
Implemented a basic thread pool using pthreads, a fair amount of the thread pool functionality is yet to be implemented. I use the thread pool to do 'expensive searches' while the other data structures are building themselves. additional functionality might be necessary as i start multi threading different parts.
本源码包内暂不包含可直接显示的源代码文件,请下载源码包。
English
