*Deprecated* A short and sweet Python web crawler using Redis as the process queue, seen set, and Memcache-style rate limiter for robots.txt
PubCrawl
========

`As the world doesn't have enough horrible crawlers`

This is a minimalist web crawler written in Python for fun over two evenings. PubCrawl currently uses Redis as a persistent backend for the URL frontier (queue), the seen URL set, and Memcache-style key expiration for use with `robots.txt`. On a commodity laptop with a standard ADSL2+ connection the download rate is a sustained 25-50 pages per second (or 2-4 million crawled pages per day).

- Multiple Python clients can be started and run at the same time, either as multiple processes (`multiprocessing`) or as threads
- The crawl can be stopped and restarted with only a minimal loss of URLs (zero with the future addition of an in-progress set of URLs)
- The web crawler respects `robots.txt` even when the crawl is stopped and restarted (as long as the Redis database layer persists)
- PubCrawl depends only on Redis, which plays the roles of message queue, URL set curator and Memcache server

The web crawler respects `robots.txt` through the use of a slightly modified robot exclusion rules parser by Philip Semanchuk. Currently, when a website is retrieved its domain is added as a key to Redis and given an expiry time: either the delay provided by `robots.txt` or a one second program default. A sketch of how these Redis pieces fit together is shown after the Todo list below.

For actual production use it is suggested to install a local DNS caching server such as `dnsmasq` or `nscd` for performance reasons.

## Dependencies

- Python 2.6 (due to `multiprocessing`)
- Redis
- redis-py (`sudo easy_install redis`)

## Todo

- Investigate alternative methods for fetching URLs in order to improve the speed and concurrency over `urllib2.urlopen`
- Better busy queue implementation for handling links that are constrained by the `robots.txt` delay
- The link graph may be better stored on disk (or at the very least there should be an interface for storage/manipulation)
- Implement a flat file storage system for the page contents (as the crawl is currently only useful for the link graph)
- The cache for `robots.txt` should be globally accessible and not a localised Python object
- Allow for modifying the structure, the database layer and the location of the Redis database server
- For longer crawls, exceptions and their tracebacks are ignored but should be stored for later review
- Make `CrawlRequest` easily extensible so that a new layer of processing can be added arbitrarily
- Improve the `robots.txt` parser to handle Unicode and generally more complex formats
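## Sketch of the Redis scheme

The snippet below is a minimal sketch, not code from PubCrawl itself, of how the three Redis roles described above could fit together: a list as the URL frontier, a set as the seen-URL filter, and an expiring per-domain key as the Memcache-style `robots.txt` rate limiter. The key names (`frontier`, `seen`, `busy:<domain>`), the function names, and the use of redis-py's `ex` keyword to `set()` (present in recent redis-py releases) are illustrative assumptions.

```python
# Minimal sketch of a Redis-backed frontier, seen set and per-domain
# rate limiter. Key and function names are illustrative, not PubCrawl's.
from urlparse import urlparse  # Python 2, matching PubCrawl's target

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

DEFAULT_DELAY = 1  # seconds; the program default when robots.txt gives no delay


def enqueue(url):
    """Add a URL to the frontier only if it has never been seen before."""
    # SADD returns 1 only when the member is new, so the seen set also
    # acts as the duplicate filter for the frontier.
    if r.sadd("seen", url):
        r.lpush("frontier", url)


def domain_is_busy(domain):
    """True while the expiring per-domain key is still alive."""
    return bool(r.exists("busy:" + domain))


def mark_domain_busy(domain, delay=DEFAULT_DELAY):
    """Set an expiring key so the domain is left alone until it lapses."""
    # Memcache-style expiry: Redis deletes the key itself after `delay` seconds.
    r.set("busy:" + domain, "1", ex=int(delay))


def next_url():
    """Pop the next crawlable URL, pushing back ones whose domain is busy."""
    item = r.brpop("frontier", timeout=1)
    if item is None:
        return None  # frontier currently empty
    url = item[1]
    domain = urlparse(url).netloc
    if domain_is_busy(domain):
        r.lpush("frontier", url)  # crude busy queue: put it back and move on
        return None
    mark_domain_busy(domain)
    return url
```

Because all of the shared state lives in Redis, stopping the crawl loses at most the URLs that were popped but not yet fetched, which is what the in-progress set mentioned above would address.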
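In the same spirit, the sketch below shows how several clients can share that frontier: each worker is an independent process (threads would work the same way) that loops on `next_url()` and `enqueue()`, with Redis doing all of the coordination. It assumes the previous sketch was saved as `crawl_sketch.py` (a hypothetical name) and elides link extraction and page storage.

```python
# Minimal sketch of running several crawler clients against the shared
# Redis frontier; assumes the previous sketch was saved as crawl_sketch.py.
import multiprocessing
import urllib2  # Python 2, matching the stated dependency on Python 2.6

from crawl_sketch import enqueue, next_url  # hypothetical module name


def fetch(url):
    """Placeholder fetch step: download the page and return outgoing links."""
    urllib2.urlopen(url, timeout=10).read()  # contents unused here; see Todo
    return []  # link extraction elided in this sketch


def crawl_worker():
    while True:
        url = next_url()
        if url is None:
            continue  # frontier empty or the domain is rate limited
        try:
            for link in fetch(url):
                enqueue(link)
        except Exception:
            pass  # see the Todo list: tracebacks should really be stored


if __name__ == "__main__":
    enqueue("http://example.com/")
    workers = [multiprocessing.Process(target=crawl_worker) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```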