TERNIP: Temporal Expression Recognition and Normalisation in Python
===================================================================

Created by Chris Northwood as part of an MSc in Computer Science with
Speech and Language Processing at The University of Sheffield's Department
of Computer Science.

[![Build Status](https://travis-ci.org/cnorthwood/ternip.svg?branch=master)](https://travis-ci.org/cnorthwood/ternip)
[![Documentation Status](https://readthedocs.org/projects/ternip/badge/?version=latest)](http://ternip.readthedocs.org/en/latest/?badge=latest)

WHAT IS TERNIP?
---------------

TERNIP is a library which can recognise and normalise temporal expressions
in text. A temporal expression is one such as '8th July 2010' - an
expression which refers to some form of time (a point in time, a duration,
etc).

The two functions TERNIP performs are first to identify these expressions
in some text, and then to figure out, as closely as it can, the absolute
date that each one refers to.

TERNIP can handle a number of formats for representing documents and the
metadata associated with a TIMEX. The most common form is TimeML.

INSTALLING
----------

TERNIP was developed on Python 2.7 and has not been tested on earlier
versions nor on Python 3. Therefore, Python 2.x, where x >= 7, is
recommended, but your mileage may vary on other versions.

TERNIP also depends on NLTK and dateutil. Please ensure both of these Python
packages are installed.

TERNIP uses Python's distutils to install itself. To install TERNIP, please
run:

    python setup.py install

USING ANNOTATE_TIMEX
--------------------

The annotate_timex command provides a simple front-end to TERNIP. Running
'annotate_timex' with no arguments shows you the default usage for the
script.

USING THE API
-------------

### Doing The Recognition/Normalisation

Two functions are provided which return instances of the default
recogniser and normaliser:

* ternip.recogniser()
* ternip.normaliser()

You can also manually load the recognition and normalisation rule
engines (currently the only modules for recognition and normalisation).

This can be done by instantiating the objects:

* ternip.rule_engine.recognition_rule_engine()
* ternip.rule_engine.normalisation_rule_engine()

And then calling load_rules(path) with the path to where the rules to
be loaded are stored.

Once this has been done, the recogniser supports a single method:

* tag(sents): This takes a list of sentences (in the format detailed
            in the section below) and returns a list of sentences in
            the same format, with the third element of the token
            tuple filled with ternip.timex objects indicating the
            type and extent of the expression covered.

Once this has been done, the normaliser can be used on the recognised
time expression extents to fill the other attributes. Again, a single
method exists on the normaliser class:

* annotate(sents, dct): This takes sentences where the third element
                      in the token tuple is filled with timex
                      extents and the document creation time (or the
                      context to be considered when computing
                      relative date offsets) and fills the
                      attributes of the timex objects where it can.
                      The DCT is expected to be a string in ISO 8601
                      (basic) format.
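
Putting these steps together, a minimal sketch (the sentence, POS tags
and printed attribute are illustrative; annotate() is assumed to fill
the timex objects in place, as described above):

    import ternip

    recogniser = ternip.recogniser()
    normaliser = ternip.normaliser()

    # One sentence in the internal format: (token, POS-tag, set-of-timexes)
    sents = [[('The', 'DT', set()), ('meeting', 'NN', set()),
              ('is', 'VBZ', set()), ('on', 'IN', set()),
              ('8th', 'JJ', set()), ('July', 'NNP', set()),
              ('2010', 'CD', set()), ('.', '.', set())]]

    sents = recogniser.tag(sents)
    normaliser.annotate(sents, '20100704')  # DCT in ISO 8601 basic format

    # The timexes now sit in the third element of each token tuple
    for token, pos, timexes in sents[0]:
        for timex in timexes:
            print token, timex.value  # 'value' as set by normalisation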

### Handling TIMEX-annotated documents

The annotation functions expect input in the format of a list of
sentences, where a sentence is a list of tuples consisting of the token,
its part-of-speech tag and the set of timexes covering that token,

i.e., [[('token', 'POS-tag', set([timex1, timex2, ...])), ...], ...]

In the ternip.formats package, a number of classes exist which can
convert to and from document formats and this internal format.

These classes can be instantiated by passing a string containing the
document to the constructor, with different classes also supporting
optional keyword arguments which can specify additional metadata (as
fully documented in the API documentation). These classes will also use
NLTK to tokenise and part-of-speech tag the document text.

These document classes then support a standard interface for accessing
the data:

* get_sents(): Get the text from the document in the format required
             for the annotator
* get_dct_sents(): If the document format contains the document
                 creation time, then get that in the format ready
                 for the annotator
* reconcile(sents): Add the TIMEX metadata from the annotated sents to
                  the document - XML documents also allow adding
                  part-of-speech and tokenisation metadata to the
                  document
* reconcile_dct(sents): Add TIMEX metadata to the document creation
                      time information (from get_dct_sents())
* __str__(): Returns the document in a string, ready to be written
           out or similar.

Some classes also support a create static method, which can be used to
create new instances of that document format from the internal
representation. This can be useful, even without the annotation
functions, for converting TIMEX-annotated documents between two formats.

The supported formats included with TERNIP are:

* ternip.formats.tern: An XML parser for the TERN dataset (note, the
                     TERN dataset is SGML, a superset of XML, so
                     some documents may not correctly parse as XML)
* ternip.formats.timeml: Documents in TimeML format
* ternip.formats.timex2: Generic XML documents annotated with the
                       TIMEX2 tag
* ternip.formats.timex3: Generic XML documents annotated with TimeML's
                       TIMEX3 tag
* ternip.formats.tempeval2: The tabulated format used for the
                          TempEval-2 competition
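
For example, reading, annotating and writing back a TimeML document
might look like the following sketch (the class name TimeMlDocument is
an assumption; check the API documentation for the exact name exposed
by ternip.formats.timeml):

    import ternip
    from ternip.formats.timeml import TimeMlDocument  # class name assumed

    with open('article.tml') as f:
        doc = TimeMlDocument(f.read())  # NLTK tokenises and POS-tags here

    sents = ternip.recogniser().tag(doc.get_sents())
    ternip.normaliser().annotate(sents, '20100708')  # DCT, ISO 8601 basic

    doc.reconcile(sents)  # write the TIMEX metadata back into the document
    print str(doc)        # the annotated document, via __str__()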

### Changing How TERNIP Handles Warnings

TERNIP logs all warnings at 'warn' level under the 'ternip' namespace
using Python's logging module. You are responsible for handling these
however you'd like.
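
For example, to suppress these warnings entirely (standard logging
usage, nothing TERNIP-specific):

    import logging

    logging.getLogger('ternip').setLevel(logging.ERROR)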

EXTENDING TERNIP
----------------

### Writing Your Own Rules

The rule engines (normalisation and recognition) in TERNIP support three
types of files: single rules, rule blocks and complex rules. Single
rules and rule blocks consist of files with lines in the format:

    Key: Value

Where the acceptable keys and value formats depend on the exact type of
rule (recognition or normalisation) and are defined further below.

Rule blocks can contain many rules, separated by three dashes on a line.
Additionally, the first section of the file is a header for the rule
block.

Complex rules are Python files which contain a class called 'rule',
which is instantiated. These classes must implement an interface which
depends on the type of rule (recognition or normalisation).

Rule regular expressions undergo some preprocessing. Except when
specified using the 'Tokenise' option on normalisation rules, sentences
are converted into the form <token~POS><token~POS>... with no spaces, and
this is what the rules are matched against. Additionally, < and >, which
indicate token boundaries, are preprocessed, and the token open bracket
must be at the same parenthesis nesting level as the closing one.

For example,

    (<token~POS>)? is valid
    <token~(POS>)? is not, and will not match as expected

Finally, the quantifiers + and ? on the matching character . will not
match across token boundaries, except when matching deliminated number
word sequences (i.e., NUM_START.+NUM_END).

When number delimination is enabled, sequences of number words will be
surrounded with NUM_START and NUM_END, and sequences of ordinal words
with NUM_ORD_START and NUM_ORD_END, e.g.,

    NUM_START<thirty~CD><five~CD>NUM_END
    NUM_ORD_START<fifth~JJ>NUM_ORD_END

Additionally, in regular expressions, the following words will be
replaced with predefined regular expression groups:

* $ORDINAL_WORDS: word forms of ordinal values
* $ORDINAL_NUMS: number forms (including suffixes) of ordinal values
* $DAYS: day names
* $MONTHS: month names
* $MONTH_ABBRS: three-letter abbreviations of month names
* $RELATIVE_DAYS: relative expressions referring to days
* $DAY_HOLIDAYS: holidays that have "day" in the name
* $NTH_DOW_HOLIDAYS: holidays which always appear on a particular day
                   in the nth week of a given month
* $FIXED_HOLIDAYS: holidays which have a fixed date (including token
                 boundaries)
* $LUNAR_HOLIDAYS: holidays which are relative to Easter (including
                 token boundaries)

The exact format of regular expressions is as implemented in the Python
're' module: http://docs.python.org/library/re.html

When dealing with guard regular expressions, if the first character of
the regular expression is a !, this makes the regular expression
negative - the rule will only execute if that regular expression does
not match.
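
For instance, a guard such as the following (the token expression is
illustrative) only allows a rule to run when the sentence does not
contain the token 'every':

    Guard: !<every~.+>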

#### Rule Blocks

Rule blocks consist of sections separated by three dashes (---) on
a line by themselves. The first section in a rule block is the
header of the block and is in the following format, regardless of
whether it's a recognition or normalisation rule. The format of the
following sections is that of the single rules described below,
except that keys relating to ordering (ID and After) are invalid,
as ordering is defined by the rule block.

The following keys are valid in the header:

* Block-Type: this can be either 'run-all' or 'run-until-success'.
            With 'run-all', all rules are run regardless of
            whether or not previous rules succeeded, whereas
            'run-until-success' runs rules until the first rule
            successfully applies.
* ID: This is an (optional) string containing an identifier which
    can be referred to by other rules to express ordering.
* After: This can exist multiple times in a header and defines
       an ID which must have executed (successfully or not)
       before this rule block runs.
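
A hypothetical rule block might therefore look like this, with the
individual rules using the single-rule keys described in the sections
below (the regular expressions are illustrative, not taken from the
distributed rules):

    Block-Type: run-until-success
    ID: example-dates
    ---
    Type: date
    Match: <$MONTHS~.+><([0-9]{1,2})~.+>
    ---
    Type: date
    Match: <$MONTHS~.+>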

#### Single Recognition Rule

The following keys are valid in recognition rules:

* ID: This is an (optional) string containing an identifier which
    can be referred to by other rules to express ordering.
* After: This can exist multiple times in a rule and defines
       an ID which must have executed (successfully or not)
       before this rule runs.
* Type: This is a compulsory field which indicates the type of
      temporal expression this rule matches.
* Match: A compulsory regular expression, where the part of a
       sentence that matches is marked as the extent of a new
       timex
* Squelch: Defaults to false, but if set to true, then removes any
         timexes in the matched extent. True/false are allowed
         values.
* Case-Sensitive: true/false, defaults to false. Indicates whether
                or not the regular expressions should be case
                sensitive
* Deliminate-Numbers: true/false, defaults to false, whether or
                    not number sequences are deliminated as
                    described above
* Guard: multiple allowed, a regular expression the entire
       sentence must match for the rule to be allowed to execute
* Before-Guard: multiple allowed, as a guard, but only matches
              on the tokens before the extent that was matched
              by the 'Match' rule. (Anchors such as $ can be
              useful here)
* After-Guard: multiple allowed. As a Before-Guard, but instead
             matches on the tokens after the extent matched by
             Match (Anchors such as ^ can be useful here).
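
As a sketch, a single recognition rule matching expressions such as
'July 2010' might look like this (the regular expression is
illustrative, not taken from the distributed rules):

    ID: example-month-year
    Type: date
    Match: <$MONTHS~.+><([12][0-9]{3})~.+>
    Case-Sensitive: false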

#### Complex Recognition Rule

Complex recognition rules are Python classes with a single method
and two static variables:

* id: A string (or None) containing an identifier for this rule
    which can be used for ordering
* after: A list of strings containing identifiers which must
       have run before this rule is executed
* apply(sent): This function is called when this rule is executed.
             'sent' is a single sentence in the internal
             format described above, and the function is
             expected to return a tuple where the first element
             is the sentence with timex objects added and the
             second element is a Boolean indicating whether or
             not the rule altered the sentence.
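
A minimal sketch of such a class (the matching logic and the
ternip.timex constructor arguments are assumptions for illustration):

    import ternip

    class rule(object):
        id = 'example-complex-rule'  # identifier used for ordering, or None
        after = []                   # IDs which must run before this rule

        def apply(self, sent):
            altered = False
            for token, pos, timexes in sent:
                # Naive illustration: mark bare occurrences of 'tomorrow'
                if token.lower() == 'tomorrow' and not timexes:
                    timexes.add(ternip.timex(type='date'))  # args assumed
                    altered = True
            return sent, altered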

#### Single Normalisation Rule

In the Python expressions described below, you can use the shortcut
text {#X}, which is replaced with the matched regular expression
group X, e.g., {#1} will be the part of the sentence that matched
the first parenthesis group in the text. {#0} would be the entire
matched expression (equivalent to group(0) on a Python re match
object).

Additionally, a number of variables and support functions are
available to these Python expressions which can assist the writing
of normalisation rules.

The following variables are available:

* timex: The timex object which is currently being annotated.
* cur_context: The ISO 8601 basic string containing the current
             date-time context of this sentence.
* dct: The ISO 8601 basic string containing the document creation
     time of this document.
* body: The part of the sentence which is covered by the extent of
      this timex, in internal format (self._toks_to_str() can be
      useful to convert this into a string format described
      above).
* before: The part of the sentence preceding the timex, in
        internal format.
* after: The part of the sentence following the timex, in
       internal format.

The functions in the ternip.rule_engine.normalisation_functions
package are all imported in the same namespace as the expression
being evaluated, so you can call the functions directly. You can
find more details about these functions and their signatures in the
API documentation.

The timex fields are fully documented in the ternip.timex class,
and are related to their meaning in the TimeML specification.

The following keys are valid in normalisation rule definitions:

* ID: This is an (optional) string containing an identifier which
    can be referred to by other rules to express ordering.
* After: This can exist multiple times in a rule and defines
       an ID which must have executed (successfully or not)
       before this rule runs.
* Type: an optional string which the type of the timex must match
      for this rule to be applied
* Match: A regular expression which the body of the timex must
       match for this rule to be applied. The groups in this
       regular expression are available in the annotation
       expressions below.
* Guard: A regular expression which the body of the timex must
       match for this rule to be applied. Unlike 'Match', the
       regular expression groups are not available in other
       expressions.
* After-Guard: A regular expression like Guard, except it matches
             the part of the sentence after the timex.
* Before-Guard: A regular expression like Guard, except it matches
              the part of the sentence before the timex.
* Sent-Guard: A regular expression like Guard, except that it
            matches against the entire sentence.
* Value: A Python expression; the result of evaluating it is
       set as the 'value' attribute of the timex.
* Change-Type: A Python expression which, if set, changes the type
             of the timex to what it evaluates to.
* Freq: A Python expression which, if set, sets the freq attribute
      on the timex.
* Quant: A Python expression which, if set, sets the quant
       attribute on the timex.
* Mod: A Python expression which, if set, sets the mod attribute
     on the timex.
* Tokenise: Whether or not to prepare the sentence into the form
          described above for the regular expressions. If set to
          true (the default), then the sentence is converted into
          the tokenised string format described above. Otherwise,
          the value is used as the separator between the tokens
          when detokenising. The special values 'space' and
          'null' can be used to indicate the token separator
          should be a single space, or no gap at all. Note, if
          Tokenise is not true, then Deliminate-Numbers cannot
          be used, and part-of-speech tags are not available to
          the regular expressions.
* Deliminate-Numbers: If set to true (defaults to false), then
                    sequences of number words are delimited with
                    the tokens NUM_START and NUM_END, and
                    ordinals with NUM_ORD_START and NUM_ORD_END.
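
As a sketch, a normalisation rule which sets the value of a timex
consisting of a bare four-digit year might look like this
(illustrative, not taken from the distributed rules):

    ID: example-year
    Type: date
    Match: <([12][0-9]{3})~.+>
    Value: {#1}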

#### Complex Normalisation Rule

Complex normalisation rules are Python classes with a single method
and two static variables:

* id: A string (or None) containing an identifier for this rule
    which can be used for ordering
* after: A list of strings containing identifiers which must
       have been executed (successfully or not) before this rule
       is executed
* apply(timex, cur_context, dct, body, before, after):
    The function that is called when this rule is being executed.
    The first argument is the TIMEX to be annotated (the fields
    of the timex object which could be annotated are detailed in
    the API documentation for the ternip.timex class), the
    second argument is a string in ISO 8601 basic format
    representation of the current context of the document. The
    'dct' argument is the creation time of the document and
    'body', 'before' and 'after' contain the list of tokens
    (in the internal form) of the extent of the timex,
    preceding and following the timex extent. This function is
    expected to return a tuple where the first element
    consists of a Boolean indicating whether or not this rule
    successfully ran, and the second element consists of the
    current date/time context (in ISO 8601 basic form), which
    may have been changed by this rule.
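
A minimal sketch of such a class (the matching logic is illustrative;
the 'value' attribute is the one described in the single-rule section
above):

    class rule(object):
        id = 'example-normalisation-rule'  # identifier used for ordering
        after = []                         # IDs which must run first

        def apply(self, timex, cur_context, dct, body, before, after):
            # body is the list of (token, POS, timexes) tuples in the extent
            tokens = [token for token, pos, timexes in body]
            if len(tokens) == 1 and tokens[0].isdigit() and len(tokens[0]) == 4:
                timex.value = tokens[0]  # a bare year, e.g. '2010'
                return True, cur_context
            return False, cur_context    # rule did not apply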

### Writing New Tagging or Annotating Modules

New tagging and annotation modules are expected to implement the same
interface as the rule engines described above.

### Writing New Document Formats

New document formats are expected to contain the same interface as
described above. If you are writing a new document format based around
XML, the ternip.formats.xml_doc.xml_doc class may provide useful
functionality.

### Enabling Debug Functionality

The classes normalisation_rule and recognition_rule have a member called
_DEBUG which is a Boolean to help in debugging rules. When _DEBUG is set
to True, then the comment attribute of the timex is set to the
identifier of the rule which tagged/annotated it.
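
For example (the import path is an assumption, based on the rule engine
classes living in the same package):

    from ternip.rule_engine import recognition_rule, normalisation_rule

    recognition_rule._DEBUG = True
    normalisation_rule._DEBUG = True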

EXTRAS
------

### sample_data

In the sample_data folder you will find various corpora of documents
annotated with TIMEX tags in varying formats. You can use these
(perhaps stripping the TIMEX tags first) to test the system, as well
as to aid development of your own rules, modules, etc.

### extras/terneval.py

This handy little script runs TERNIP against the TERN sample data and
reports the performance at the end. This also requires Perl to be
installed and on your path, as that's what the TERN scorer uses.

(NOTE: The TERN scorer appears to give very low results on Linux)

### extras/tempeval2.py

As with the TERN script, this runs TERNIP against the TempEval-2 corpus
provided in the sample data, and reports its performance at the end.
Both this and terneval.py demonstrate samples on how to use the TERNIP
API. 

### extras/add_rule_numbers.py

This handy little script takes a ruleblock file and outputs the same
ruleblock but with a comment at the top of each rule indicating its
index in the file. It is highly recommended to run this if you write
your own rules, as it makes quickly identifying faulty rules easy.

### extras/preprocesstern.py

This will take the TERN corpus and annotate it with tokenisation and
part-of-speech metadata to make document loading quicker.

### extras/performance.py

This takes the pre-processed documents (produced by the script above)
and annotates them all, giving speed statistics at the end.

### runtests.py

This file executes the unit test suite for TERNIP.

### gate/ternip.xgapp

This provides a .xgapp file which can be loaded into GATE to use TERNIP
as a processing resource.
