资源说明:ETL Utilities for Clojure
h1. clj-etl-utils
ETL Utilities for Clojure. This library began with functions that worked with data on disk, such as database dumps and log files, at least that was the original purpose of the library, it has since grown to include other utilities.
h2. Modules
h3. clj-etl-utils.io
IO and File utilities.
h4. string-reader, string-input-stream
Returns a Reader or an InputStream, respectively, that will read from the given string.
h4. read-fixed-length-string
Reads a fixed-length string.
h4. chmod
Changes the permissions on a file by shelling out to the @chmod@ command.
h4. mkdir
Creates the given directory, just returning true if the given directory already exists (as opposed to throwing an exception).
h4. exists?
Tests if a file exists.
h4. symlink
Establishes a symlink for a file.
h4. freeze, thaw
freeze invokes the java serialization and returns a byte array. Thaw does the opposite: takes a byte array and deserializes it.
h4. object->file
Uses Java serialization to write an object to the given file, truncating if it exists.
h4. file->object
Deserializes a serialized object from a file.
h4. ensure-directory
Ensures a directory path exists (recursively), doing nothing if it already exists.
h4. string-gzip
Compress a string, returning the bytes.
h4. byte-partitions-at-line-boundaries
This can be used in divide and conquer scenarios where you want to process different segments of a single file in parallel. It takes an input file name and a desired block size. Block boundaries will be close to the desired size - the size is used as a seek position, any line remnant present at that position is read, such that a given block will end cleanly at a line boundary.
h4. random-access-file-line-seq-with-limit
Returns a lazy sequence of lines from a RandomAccessFile up to a given limit. If a line spans the limit, the entire line will be returned, so that a valid line is always returned.
h4. read-lines-from-file-segment
Returns a sequence of lines from the file across the given starting and ending positions.
h3. clj-etl-utils.landmark_parser
h3. clj-etl-utils.lang
@lang/make-periodic-invoker@ can be used to easily create 'progress' indicators or bars
h4. Example
pre.. (let [total 1000
period 100
progress (lang/make-periodic-invoker
period
(fn [val & [is-done]]
(if (= is-done :done)
(printf "All Done! %d\n" val)
(printf "So far we did %d, we are %3.2f%% complete.\n" val (* 100.0 (/ val 1.0 total))))))]
(dotimes [ii total]
;; do some work / processing here
(progress))
(progress :final :done))
p. Produces the following output:
pre.. So far we did 100, we are 10.00% complete.
So far we did 200, we are 20.00% complete.
So far we did 300, we are 30.00% complete.
So far we did 400, we are 40.00% complete.
So far we did 500, we are 50.00% complete.
So far we did 600, we are 60.00% complete.
So far we did 700, we are 70.00% complete.
So far we did 800, we are 80.00% complete.
So far we did 900, we are 90.00% complete.
So far we did 1000, we are 100.00% complete.
All Done! 1000
h3. clj-etl-utils.ref_data
h3. clj-etl-utils.regex
h3. clj-etl-utils.sequences
h3. clj-etl-utils.text
h3. clj-etl-utils.indexer
Module for working with line-oriented data files in-situ on disk. These tools allow you to create (somewhat) arbitrary indexes into a file and walk through the indexed values.
h4. Example
Given the tab delimited file @file.txt@:
pre.. 99 line with larger key
1 is is the second line
2 this is a line
3 this is another line
99 duplicated line for key
p. We can create an index on the @id@ column id:
pre.. (index-file! "file.txt" ".file.txt.id-idx" #(first (.split % "\t")))
p. That index can then be used to read groups of records from the file with
the same key values:
pre.. (record-blocks-via-index "file.txt" ".file.txt.id-idx")
pre.. ( [ "1\tis is the second line" ]
[ "2\tthis is a line" ]
[ "3\tthis is another line" ]
[ "99\tline with larger key"
"99\tduplicated line for key" ] )
h2. Installation
@clj-etl-utils@ is available via Clojars:
https://clojars.org/com.github.kyleburton/clj-etl-utils
h2. References
UTF and BOM
http://unicode.org/faq/utf_bom.html
h2. Random Sampling
"How to pick a random sample from a list":http://www.javamex.com/tutorials/random_numbers/random_sample.shtml
h2. Reference Data
h3. US Zip5 Codes
"Fun with Zip Codes":http://www.mattcutts.com/blog/fun-with-zip-codes/
"US Census Tigerline Data: Zip Codes":http://www.census.gov/tiger/tms/gazetteer/zips.txt
h2. License
This code is covered under the same as Clojure.
h1. Authors
Kyle Burton
Paul Santa Clara
Tim Visher
本源码包内暂不包含可直接显示的源代码文件,请下载源码包。
English
