== WWWClient 1.0.0
== Advanced web browsing, scraping and automation
-- Author: Sebastien Pierre
-- Created: 21-Sep-2006
-- Updated: 19-Mar-2012
Python has some well-known web automation and processing tools such as
[Mechanize](http://wwwsearch.sourceforge.net/mechanize/), [Twill](http://twill.idyll.org/) and
[BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/). All provide
powerful operations to automatically browse and retrieve information from the
web.
However, we experienced limitations using Twill (which is based on Mechanize and
BeautifulSoup), notably that it was difficult to *fine-tune the HTTP
requests*, or to cope when the *HTML file was broken* (and you can't imagine
how many HTML files are broken).
We decided to address these limitations by building a library that lets you
write web clients using a high-level programming interface, while also
allowing fine-grained control over the HTTP communication level.
WWWClient is a web browsing, scraping and automation client and library that can
easily be used from an interpreter (like 'ipython') or embedded within a
program. WWWClient offers both a high-level API and fine-grained control over
low-level HTTP and web-specific elements, as well as a powerful scraping API
that lets you manipulate your HTML document using string, list and tree
operations at the same time.
WWWClient is separated into four main modules:
- The `wwwclient.client` module defines the abstract interface for an HTTP
  client. Two implementations are available: one using the Python httplib
  module, the other using the Curl Python bindings.
- The `wwwclient.browse` module defines the high-level, browser-like interface
  that allows you to easily browse a website. This includes session and cookie
  management.
- The `wwwclient.scrape` module offers a set of objects and operations to easily
  parse and extract information from an HTML or XML document. It is made to be
  versatile and very tolerant of malformed HTML.
- The `wwwclient.forms` module offers a set of objects to represent and manipulate
  forms easily, while remaining as flexible as possible.
The forthcoming sections will present the *browsing* and *scraping* modules in
detail. For information on the rest of WWWClient, the best source is the API
documentation (included as [wwwclient-api.html] in the distribution).
Quickstart
==========
You should start by importing the `Session` object, which allows you to
create browsing/scraping sessions.
> from wwwclient import Session
Then start to do your queries:
> print map(lambda _:_.text(), Session(verbose=1).get("http://ffctn.com/services").query("h4"))
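The same query can also be written step by step, which is easier to read when
working in an interactive interpreter (this is just the one-liner above,
unrolled):
> from wwwclient import Session
> session = Session(verbose=1)
> # get() returns the transaction; query() selects elements, text() extracts their text
> for heading in session.get("http://ffctn.com/services").query("h4"):
>     print heading.text()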
Browsing
========
The _browsing module_ (`wwwclient.browse`) is the module you will probably
use most often, because it allows you to mimic a web browser and to post and
retrieve web data.
Before going in more details, it is important to understand the basic
concepts behind HTTP and how they are reflected in the browsing API.
One can express a conceptual model of the elements of a WWW client-server
interaction as follows:
- _Requests_ and _responses_ are the atomic elements of communication.
Requests have a method, a request URL, headers and a body. Responses have
a response code, headers and a body.
- A _transaction_ is the sequence of messages starting with a request and
all the related (provisional and final) responses.
- A _session_ is a set of transactions that are "conceptually linked".
Session cookies are the usual way to express this link in requests and
responses.
These concepts are implemented respectively as the `Request`, `Response`,
`Transaction` and `Session` classes within the browse module. The `Session`
class is the highest-level object, and the one you are most likely to use.
Accessing a website
-------------------
To access a website, you first need to create a new `Session` instance and
give it a URL:
> from wwwclient import browse
> session = browse.Session("www.google.com")
Alternatively, you can create a blank session, and browse later:
> session = browse.Session()
> session.get("www.google.com")
Once you have initiated your session, you usually call any of the following
operations:
- `get()` will send a `GET` request to the given URL
- `post()` will send a `POST` request to the given URL
- `page()` will return you the HTML data of the current (last) page
- `last()` will return you the current (last) transaction
You also have convenience methods such as `url()`, `referer()`,
`status()`, `headers()` and `cookies()`, which give you instant access to
the last transaction or session information. See the API for all the details.
So usually, the usage pattern is as follows:
> session = browse.Session("http://www.mysite.com")
> some_page = session.get("some/page.html").data()
> ...
> other_page = session.get("some/other/page.html").data()
> ...
Note that every time you do a `get` or a `post`, a `Transaction`
instance is returned. Transactions give you interesting information about
"what happened" in the response:
- the `newCookies()` method will tell you if cookies were set in the response
- the `cookies()` method will return the current cookie jar (the
session cookies merged with the new cookies)
- the `redirect()` method will tell you if the response redirected you to
somewhere.
And you also have a bunch of other useful methods documented in the API.
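For example, here is a minimal sketch that inspects the transaction returned
by a `get`, using only the methods described above:
> t = session.get("http://www.mysite.com/")
> if t.redirect():
>     print "Redirected to:", t.redirect()
> if t.newCookies():
>     print "New cookies were set:", t.newCookies()
> print t.cookies()   # session cookies merged with the new ones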
Note ________________________________________________________________
You can also directly print a transaction or pass it to the `str`
function to get its response data.
It is important to note that you can give `get` and `post` two parameters
that will influence the way the resulting transactions are processed:
- The `follow` parameter tells whether redirections should be followed. If not,
you will have to do something like this manually:
> t = session.get("page.html",follow=False)
> while t.redirect(): t = session.get(t.redirect())
- The `do` parameter tells whether the transaction should be executed now or
later. If the transaction is not executed, you can call the transaction's
`do()` method at any time. This allows you to prepare transactions
and execute them at will, as shown in the sketch below.
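For instance, here is a minimal sketch of a deferred transaction, using the
`do` parameter and the `do()` method described above:
> # Prepare the transaction now, but do not execute it yet
> t = session.get("page.html", do=False)
> # ... later, execute it at will and read the response data
> t.do()
> print t.data()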
Now that you know the basics of browsing with WWWClient, let's see how to
post data.
Posting data
------------
Posting data is usually the most complex thing you have to do when working
on web automation. Because you can post data in many different ways, and
because the server you post to may react differently depending on what you
post and how you post it, we worked hard to give you maximum flexibility
here.
There are different ways to communicate data to an HTTP server. The WWWClient
browsing and HTTP client modules offer several ways of doing so,
depending on the type of HTTP request you want to issue:
1) Posting with GET and values as parameters::
|
> session.get("http://www.google.com", params={"name":"value", ...})
> GET http://www.google.com?name=value
|
Here you simply give your parameters as arguments, and they are
automatically url-encoded into the request URL.
2) Posting with POST and values as parameters::
|
> session.post("http://www.google.com", params={"name":"value", ...})
> POST http://www.google.com?name=value
|
Just as for the `GET` request, you give the parameters as arguments, and
they get url-encoded in the request URL.
3) Posting with POST and values as url-encoded data::
|
> session.post("http://www.google.com", data={"name":"value", ...})
> POST http://www.google.com
> ...
> Content-Length: 10
> name=value
|
By giving your values to the `data` argument instead of the `params`
argument, you ensure that they get url-encoded and passed as the request
body.
4) Posting with POST and values as form-encoded data::
|
> session.post("http://www.google.com", fields={"name":"value", ...})
> POST http://www.google.com
> ...
> Content-Type: multipart/form-data; boundary= ...
> ------------fbb6cc131b52e5a980ac702bedde498032a88158$
> Content-Disposition: form-data; name="name"
>
> value
> ------------fbb6cc131b52e5a980ac702bedde498032a88158$
> ...
|
Here the given fields are directly converted into a `multipart/form-data`
body.
5) Posting with POST and values as custom data::
|
> session.post("http://www.google.com", data="name=value")
> POST http://www.google.com
> ...
> Content-Length: 10
> name=value
|
You can always submit your own data manually if you prefer. In this case,
simply give a string with the desired request body.
6) Posting files as attachment::
> attach = session.attach(name="photo", filename="/path/to/myphoto.jpg")
> session.post("http://www.mysite/photo/submit", attach=attach)
This enables sending a file as an attachment to the given URL. This is a
rather *low-level* functionality, and you will most likely want to use the
session's `submit()` method, which allows you to submit form data. This is
the purpose of the next section.
In some cases, you will want your data/arguments to be posted in a specific
order. To do so, WWWClient offers the `Pairs` class, which is actually
used internally to represent headers, cookies and parameters.
`Pairs` are simply ordered sets of (key, value) pairs. Using pairs, you can
very easily specify an order for your elements, and then ensure that the
requests you send are *exactly* how you want them to be.
Note ___________________________________________________________________
When specifying the `data` argument to `post()`, you cannot use
the `fields` or `attach` arguments: they are exclusive.
|
Also, for more details on the arguments and/or behaviour of any of
these functions, have a look at the `wwwclient.client` API
documentation.
Submitting forms
----------------
We've seen how to post data to web servers using the session `post()`
method. In addition, WWWClient offers a `submit()` method that
interfaces with the _scraping module_ to retrieve the form descriptions and
prepare the data to be posted.
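As a hedged sketch only (the exact `submit()` signature is not detailed here,
so the form name and the `values` argument below are assumptions), a typical
submission could look like this:
> session.get("http://www.mysite.com/login")
> print session.forms()    # inspect the available forms first
> # Hypothetical call: the form name and the 'values' argument are illustrative only
> session.submit("login", values={"user": "john", "password": "secret"})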
To get the forms available in your current session, you can do:
> >>> session.forms()
> {'formname': ...
Scraping
========
In this respect, it would be very easy to get the table by looking for the
`"MOST POPULAR PROJECTS"` string and then for the end of the table:
> print "", link.attribute("href")
> else:
> print "---------"
We've seen the basic `cut`, `filter` and `find` functions of the tag tree.
They are the most useful operations you can do with the tag tree, and the
ones you are most likely to use in your everyday scraping duties.
Web automation
--------------
In addition to the basic tag tree operations, you have access to higher-level
functions that will automatically extract _forms_ and _links_ for you.
forms::
   The forms method scrapes the HTML document for forms and inputs (including
   `textarea`, `select` and the like). The scraping algorithm will manage
   some borderline cases, such as the definition of fields outside of a form.
   To use it, simply do:
   > HTML.forms(tagtree or taglist or string)
links::
   Pretty much like the `forms()` method, the `links()` method will return
   the list of links as `[(tagname, href), ...]`. Links cover every HTML
   element that defines an `href` or `src` attribute, so this includes images
   and iframes. In the above example:
   > >>> HTML.links(link)
   > [(u'a', u'http://www.gossamer-threads.com/lists/python/python/516801')]
The web automation procedures all work with strings, tag lists and tag trees,
and are optimized for very fast information retrieval.
Text processing functions
-------------------------
As we've seen, it is important to be able to easily convert between the
string, the tag list and the tag tree. When manipulating the string
representation, we may want to remove the HTML tags, normalize the spacing or
expand the entities. WWWClient offers functions to cover all these needs:
expand::
   This function takes a string with HTML entities, and returns a version of
   the string with expanded entities.
   > >>> scrape.HTML.expand("&quot;&quot;&quot;Python")
   > '"""Python'
norm::
   The `norm()` function replaces multiple spaces, tabs and newlines by a
   single space. This ensures that spacing in your text is consistent.
text::
   The `text()` method automatically strips the tags from your document (it
   is a bit like stripping the tags from a tag list), and returns the text
   version of it. You can set the `expand` and `norm` parameters to pass the
   result to the `expand` and `norm` functions.
html::
   The `html()` method allows you to convert your taglist or tagtree to a
   string of HTML data. If you join a taglist created from an HTML string, it
   will be strictly identical to this string.
In addition to these basic functions, you also have the following convenience
functions:
textcut::
   Textcut allows you to specify `cutfrom` and `cutto` markers (as strings)
   that delimit the text range to be returned. For instance, if you are
   looking for the text between a `cutfrom` and a `cutto` marker, you simply
   have to do:
   > HTML.textcut(text, cutfrom, cutto)
   When the markers are not found, the start or end bounds of the text are
   used instead.
textlines::
   Textlines allows you to split your text into lines, optionally stripping
   the lines (`strip`) and filtering out empty lines (`empty`).
Tips
====
1) Split your HTML into sections::
   Most HTML documents consist of different parts: headers, footers,
   navigation sections, advertising, legal info, and actual content. It will
   be easier for you if you start by cutting and splitting your HTML
   document, getting rid of what you don't need and keeping what you are
   looking for.
2) Get rid of HTML when you don't need structure::
   Manipulating HTML data can be difficult, especially when the document is
   not really well-formed. For instance, if you are scraping a document with
   table-formatted data, it may be complex to access the elements you need.
   In this case, it may be a better option to cut and split your HTML so that
   you have your ''sections of data'' at hand, then convert them to text
   using `HTML.text(HTML.expand(html))`, and simply process the rest with the
   scraper module's text processing tools or regular Python ones.
Text encoding support
=====================
As text data may come from various sources, some of it may already be encoded
and some not. To ensure that proper conversion is made, the following
WWWClient elements feature an `encoding` property or argument to the relevant
methods, which allows you to specify in which encoding the given text data is
encoded. This ensures that the data is properly converted to raw strings
before being passed to the transport layer (provided by the Curl library).
Session::
   The session encoding defines the default encoding for all the requests,
   cookies, headers, parameters, fields and provided data. Setting a session
   encoding will set every transaction, request and underlying curl handler
   encoding to the session one.
Form::
   When filling values within a form, the values will be converted to strings
   when necessary. The given encoding tells in which encoding the strings
   should be encoded.
Acknowledgments
===============
I would like to thank Marc Carignan from [Xprima.com](http://www.xprima.com)
for giving me the opportunity to work on the WWWClient project and to let it
be open-sourced. This project could not have happened without his help,
thanks Marc!
I would also like to thank the whole Python community for having created such
an expressive, powerful language that has pleased me for years.
# vim: ts=4 sw=4 et syn=kiwi