Description: Work with text files describing PDF bookmarks (stable)
# bmconverter.py

[http://github.com/goerz/bmconverter.py](http://github.com/goerz/bmconverter.py)

Author: [Michael Goerz](http://michaelgoerz.net)

`bmconverter.py` converts between the bookmark description formats used by
different pdf and djvu bookmarking tools such as [pdftk][1], the [iText
toolbox][2], [pdfLaTeX][3], [pdfWriteBookmarks][4], [jpdftweak][5], [djvused][6], and the
[DJVU Bookmark Tool][7].

This code is licensed under the [GPL](http://www.gnu.org/licenses/gpl.html).

[1]: http://www.accesspdf.com/pdftk/
[2]: http://sourceforge.net/projects/itext/files/
[3]: http://www.tug.org/texlive/
[4]: http://github.com/goerz/pdfWriteBookmarks
[5]: http://jpdftweak.sourceforge.net/
[6]: http://djvu.sourceforge.net/doc/index.html
[7]: http://sourceforge.net/projects/windjview/files/Bookmark%20Tool/

## Install ##

Store the `bmconverter.py` script anywhere in your `$PATH`.

## Usage ##

The script operates on text files in the various supported formats that
describe the bookmark structure in pdf or djvu files. You can then use the
appropriate tools to add the bookmarks to the pdf or djvu file.

The script includes some rudimentary functionality to extract the bookmarks
directly from a pdf file (writing to any of the supported text formats). This
functionality depends on the [pdfminer][8] library. Advanced features like
formatting of the bookmark titles are not supported in the extraction.

In addition to converting between the different formats, the script can also
shift the page numbers associated with the bookmarks. This is useful if you need
to work on a file obtained from a table of contents, where the page numbers in
the pdf might not match the page numbers in the original document, for example.
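
As a rough illustration of the offset idea (not the script's actual code), the sketch below shifts the page number at the end of simple `title :: page` lines. The name `shift_pages` is hypothetical, and lines that carry a destination after the page number are left untouched for brevity.

```python
import re

# Matches the "::" separator and a page number at the very end of a line.
PAGE_AT_END = re.compile(r'(::[ ]*)([0-9]+)[ ]*$')


def shift_pages(lines, offset):
    """Return a copy of lines with every trailing page number shifted."""
    def bump(match):
        return match.group(1) + str(int(match.group(2)) + offset)

    return [PAGE_AT_END.sub(bump, line) for line in lines]


toc = ["Preface :: 1", "Chapter 1 :: 3", "    Section 1.1 :: 4"]
for line in shift_pages(toc, 12):
    print(line)
```

With an offset of 12, a table of contents that starts counting at the first numbered page lines up with a pdf that has twelve pages of front matter.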

When used as a module from Python, this script provides a toolbox for making
arbitrary modifications to the bookmark data.

    Usage: bmconverter.py options inputfile [outputfile]


    Command Line Options

     --mode in2out        Sets the script's operation mode. This option is required.
     -m  in2out           Short for --mode

     --offset integer     Shifts all page numbers by integer
     -o integer           Short for --offset

     --long               When used with the 'text' output format, enables the use
                          of full destinations, instead of just page numbers
     -l                   Short for --long

     --pdf FILENAME       Set metadata['pdf'] to the given filename. When used
                          together with the latex output mode, the resulting tex
                          file will reference the given filename.

     --help               Displays full help
     -h                   Short for --help

    In the mode option, 'in' and 'out' can be any of the supported formats:
    'xml', 'text', 'pdftk', 'csv', 'djvused', 'latex', or 'html'

    Additionally, 'in' can be 'pdf', in which case the bookmarks are read directly
    from the given pdf file. The pdfminer library must be installed for this to
    work.

An example usage is

    bmconverter.py --offset 2 --mode xml2text bm.xml bm.txt

All data is read and written in UTF-8 encoding, with the exception of xml files,
which are read in the encoding declared in their header, but always written in
UTF-8.

[8]: http://www.unixuser.org/~euske/python/pdfminer/index.html

### The XML Format ###

The XML format supports more pdf bookmark features than any of the other formats.
It is used by the iText toolbox.

Two examples of such XML files would be

    
    
       root
         <Title Action="GoTo" Open="false" Page="1 FitH 500" >sub 1
           <Title Action="GoTo" Page="1 FitBV 100" >sub 2.1
           sub 2.2
         
         sub 2
       
    


    
    
       Go to the top of the page
       
         Toggle the state of the answers
       Useful links
         <Title Action="URI" URI="http://www.lowagie.com/iText" >
           Bruno's iText site
         
           Paulo's iText site
         
           iText @ SourceForge
       
       빈집
       
         What's on page 2?
    

For more details, see the iText documentation.


### The Text Format ###

The text format's purpose is to provide a format that is easier to write by hand
than the XML format that iText can put in a PDF file. The text format cannot
handle all the features the XML format can. It is intended to be used for only
basic bookmarks: a hierarchy of bookmarks, each pointing to a page, without further
formatting, external destinations, etc.

The format is used by the pdfWriteBookmarks tool.

The format of the text file is simple: each bookmark is represented by a single
line. The bookmark's level is taken from the indentation. There must be exactly
4 spaces of indentation per level. Next is the title of the bookmark, then a double
colon, and lastly the page number, optionally followed by a destination.

An example bookmark text file is:

    Page 1 :: 1
    Page 2 :: 2 XYZ null null null
    Page 3 :: 3
    Sublevels on page 4 :: 4
        Sub1 :: 5 XYZ 0 10 null
        Sub2 :: 5 XYZ 0 20 null
        Sub3 :: 6 XYZ 0 30 null
            SubSub1 :: 6
            SubSub2:: 6
        Sub4 :: 7
            SubSub1 :: 8
            SubSub2 :: 9
        Sub5 :: 10
    Page 11 :: 11

Specifically, each line is matched by the following regular expression:

          (?P<indent>\s*)
          (?P<title>\S.*)   ::  [ ]*  (?P<page>[0-9]*)
          [ ]* (?P<destination> (XYZ.*) | (Fit.*))?  [ ]*

The full destinations (e.g. 'XYZ 0 10 null') are only printed if the --long
option is used.

Note that this format is very limited: it cannot express actions other than
GoTo, cannot preserve leading or trailing whitespace in a title, and cannot
express titles that consist only of whitespace.
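
To make the line structure concrete, here is a minimal parsing sketch built around the regular expression given above. The group names, the helper name `parse_line`, and the zero-based level are assumptions made for illustration; the script's own reader may differ in detail.

```python
import re

# Verbose-mode pattern; group names are assumed for this sketch.
LINE = re.compile(r'''
    ^(?P<indent>\s*)
    (?P<title>\S.*)  ::  [ ]*  (?P<page>[0-9]*)
    [ ]* (?P<destination> (XYZ.*) | (Fit.*))?  [ ]*$
    ''', re.VERBOSE)


def parse_line(line):
    """Return (level, title, page, destination), or None if the line does not match."""
    match = LINE.match(line)
    if match is None:
        return None
    level = len(match.group('indent')) // 4   # exactly 4 spaces per level
    title = match.group('title').strip()      # surrounding whitespace is not preserved
    page = int(match.group('page')) if match.group('page') else None
    return level, title, page, match.group('destination')


print(parse_line("Page 1 :: 1"))
print(parse_line("    Sub1 :: 5 XYZ 0 10 null"))
```

Because the title is stripped, the sketch also shows why leading or trailing whitespace in a title cannot survive a round trip through this format.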


### The pdftk Format ###

In the pdftk format, each bookmark is described by three lines, like this:

    BookmarkTitle: Page1
    BookmarkLevel: 1
    BookmarkPageNumber: 1

Lines not belonging to this structure are discarded.
The format is the direct output of the pdftk utility, when run as

    $ pdftk file.pdf dumpdata
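
The three-line structure can be read with a few lines of Python. The sketch below is only an illustration: `parse_pdftk` is a hypothetical name, and it assumes the three keys appear in the order shown above, with `BookmarkPageNumber` closing each record.

```python
def parse_pdftk(text):
    """Collect (title, level, page) tuples from pdftk dump output."""
    bookmarks = []
    current = {}
    for line in text.splitlines():
        key, sep, value = line.partition(':')
        if not sep or not key.startswith('Bookmark'):
            continue  # lines outside the bookmark structure are discarded
        current[key] = value.strip()
        if key == 'BookmarkPageNumber':
            # The page number line completes one bookmark record.
            bookmarks.append((current.get('BookmarkTitle', ''),
                              int(current.get('BookmarkLevel', 1)),
                              int(current['BookmarkPageNumber'])))
            current = {}
    return bookmarks


sample = """InfoKey: Creator
InfoValue: TeX
BookmarkTitle: Page1
BookmarkLevel: 1
BookmarkPageNumber: 1
"""
print(parse_pdftk(sample))
```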


### The html Format ###

This format is an HTML file with a special structure. Such files are produced by
Adobe Acrobat when you export a PDF file to HTML. They are also used as the
input for the DJVU Bookmark Tool.

An example of the format is the following:

      
      
      
      
      


### The csv Format ###

The csv format is read and written by the jpdftweak program. Each bookmark is a
line of fields separated by semicolons. Specifically, the structure of each line
is described by the following extended regular expression:

        (?P<level>         -?[0-9]+);
        (?P<flags>         O?B?I?);    # open, bold, italic
        (?P<title>         [^;]*);
        (?P<page>          -?[0-9]+)
        (?P<destination>   [ ][^;]+)?  # e.g. FitBV 100
        (?P<moreopts>      ;[^;]*)?    # key1=value1 key2=value2 ...

moreopts keys can be:

      Action        if action is not GoTo
      File          for GoToR actions
      Page          for GoToR actions (the page group is 0 in this case,
                    Page consists of page number and destination)
      URI           for URI actions
      Color

Also, the contents of all fields in the csv are escaped: all nonprintable
characters (`ascii < 32`) and the characters `[\:"']` are replaced by `\HH`,
where HH is the two digit ascii hex code (in upper case) for that character.
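
A possible reading of this escaping rule is sketched below. It assumes that the escaped set consists of the backslash, colon, double quote, and single quote, plus all control characters below ASCII 32; the function names `escape_field` and `unescape_field` are hypothetical and not taken from the script.

```python
import re

# Assumed escape set: control characters, backslash, colon, and both quotes.
_NEEDS_ESCAPE = re.compile(r'[\x00-\x1f\\:"\']')
# An escape is a backslash followed by two upper-case hex digits.
_ESCAPED = re.compile(r'\\([0-9A-F]{2})')


def escape_field(text):
    """Replace each special character by \\HH (upper-case hex code)."""
    return _NEEDS_ESCAPE.sub(lambda m: '\\%02X' % ord(m.group(0)), text)


def unescape_field(text):
    """Invert escape_field by decoding every \\HH sequence."""
    return _ESCAPED.sub(lambda m: chr(int(m.group(1), 16)), text)


original = 'C:\\docs "draft"'
escaped = escape_field(original)
print(escaped)
print(unescape_field(escaped) == original)
```

Escaping the backslash itself is what makes the transformation reversible: after escaping, every backslash in a field introduces a `\HH` sequence.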


### The djvused Format ###

This format is read and written by the djvused program.

The outline syntax is a single list of the form

    (bookmarks ...)

The first element of the list is the symbol `bookmarks`. The subsequent elements are
lists representing the toplevel outline entries. Each outline entry is
represented by a list with the following form:

    (title url ... )

The string title is the title of the outline entry. The string url is composed
of the hash character ("#") followed by either the component file identifier or
the page number corresponding to the outline entry. The remaining expressions
describe subentries of this outline entry.

An example of the format is the following:

    (bookmarks
     ("level1"
      "#1"
      ("level2"
       "#11"
       ("level3"
        "#20" ) ) )
     ("Bookmark \"In Quotes\""
      "../external.djvu#2"
      ("Unicode \303\215\303\261\305\244\304\230\320"
       "www.google.com" ) ) )

Note how the target url can be the page number, an external reference, or a url.
Quotes inside the title have to be escaped. Non-ascii characters are written as
escaped octal UTF-8.
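
The quoting rules and the nested list structure can be sketched as follows. This is an illustration under stated assumptions: the names `quote`, `write_outline`, and `dump` are hypothetical, entries are modeled as `(title, url, children)` tuples, and escaping the backslash itself is an addition of this sketch that the description above does not mention.

```python
def quote(text):
    """Quote a string: escape quotes, write non-ASCII bytes as octal UTF-8."""
    out = []
    for byte in text.encode('utf-8'):
        char = chr(byte)
        if char in '"\\':
            out.append('\\' + char)          # escaped quote or backslash
        elif 32 <= byte < 127:
            out.append(char)                 # printable ASCII stays as-is
        else:
            out.append('\\%03o' % byte)      # e.g. 0xC3 -> \303
    return '"' + ''.join(out) + '"'


def write_outline(entries, depth=1):
    """Return the lines for a list of (title, url, children) tuples."""
    lines = []
    for title, url, children in entries:
        pad = ' ' * depth
        lines.append('%s(%s' % (pad, quote(title)))
        lines.append('%s %s' % (pad, quote(url)))
        lines.extend(write_outline(children, depth + 1))
        lines[-1] += ' )'                    # close this entry's list
    return lines


def dump(entries):
    """Wrap the entries in the top-level (bookmarks ...) list."""
    return '\n'.join(['(bookmarks'] + write_outline(entries)) + ' )'


print(dump([("level1", "#1", [("level2", "#11", [])])]))
```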


### The latex Format ###

The latex format results in a standalone tex file that adds the bookmarks to
the target pdf when compiled with pdflatex (after a few edits). This is
possible by using the pdfpages, hyperref, and bookmark packages. See especially
the documentation of the bookmark package to see how bookmarks are expressed.

An example of the format is the following:

    \bookmark[view={XYZ null null null}, page=1,level=0]{level 1 bookmark}

For parsing the latex format, each `\bookmark` entry must be written entirely
on a single line.
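
The single-line requirement makes a regular-expression based reader feasible. The following sketch shows one way such an entry could be taken apart; `parse_bookmark` is a hypothetical helper that only looks at the `page`, `level`, and `view` options from the example above, not the script's actual parser.

```python
import re

# One complete \bookmark[options]{title} entry on a single line.
ENTRY = re.compile(r'^\s*\\bookmark\[(?P<options>.*)\]\{(?P<title>.*)\}\s*$')


def parse_bookmark(line):
    """Return a dict with title, page, level and view, or None if no match."""
    match = ENTRY.match(line)
    if match is None:
        return None
    options = match.group('options')
    page = re.search(r'\bpage\s*=\s*(\d+)', options)
    level = re.search(r'\blevel\s*=\s*(\d+)', options)
    view = re.search(r'\bview\s*=\s*\{([^}]*)\}', options)
    return {
        'title': match.group('title'),
        'page': int(page.group(1)) if page else None,
        'level': int(level.group(1)) if level else None,
        'view': view.group(1) if view else None,
    }


line = r'\bookmark[view={XYZ null null null}, page=1,level=0]{level 1 bookmark}'
print(parse_bookmark(line))
```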


### Interactive Usage ###

This script was designed to provide a toolbox for working on bookmark
structures when used as a module from Python.

The Bookmark class is the central data structure, representing a bookmark tree.
Each node holds all the attributes of the bookmark, and a list of all its
children. A number of methods are provided to modify the tree structure.

Note that each bookmark tree has a dummy root, which does not hold any data
(and is ignored in output).

The tree is iterable in preorder traversal.

For more information, read the Bookmark class documentation.
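
For readers unfamiliar with this kind of structure, the sketch below shows a tree with a data-less dummy root and preorder iteration. The class `Node` and its methods are a hypothetical stand-in written for this illustration; they are not the module's Bookmark class, whose attributes and method names may differ.

```python
class Node:
    """Hypothetical stand-in for a bookmark tree node."""

    def __init__(self, title=None, page=None):
        self.title = title
        self.page = page
        self.parent = None
        self.children = []

    def add(self, child):
        """Append child below this node and return it."""
        child.parent = self
        self.children.append(child)
        return child

    def level(self):
        """Depth below the dummy root (the root itself is level 0)."""
        depth, node = 0, self
        while node.parent is not None:
            depth, node = depth + 1, node.parent
        return depth

    def __iter__(self):
        """Yield nodes in preorder, skipping the data-less dummy root."""
        if self.parent is not None:
            yield self
        for child in self.children:
            yield from child


root = Node()                                # dummy root, carries no data
chapter = root.add(Node("Chapter 1", 1))
chapter.add(Node("Section 1.1", 2))
root.add(Node("Chapter 2", 10))
print([(n.level(), n.title) for n in root])
```

Preorder means a parent is always visited before its children, which matches the order in which bookmarks appear in a viewer's outline panel.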

Apart from the Bookmark data structure, the module provides importers and
exporters for all the supported formats.

An example of interactive usage is shown below. It reads the bookmark
structure from a text file, sets the appearance of all bookmarks at a level
deeper than 2 to 'closed' in Acrobat Reader, and writes the resulting structure
to an iText xml file.

    >>> from bmconverter import *
    >>> bm, md = read_text("bookmarks.txt")
    >>> for node in bm:
    ...     if node.level() > 2:
    ...         node.open = False
    ...     else:
    ...         node.open = True
    ...
    >>> write_xml(bm, "bookmarks.xml")

For the full documentation, run

    >>> import bmconverter
    >>> help(bmconverter)

inside the Python interpreter.