kolumny
=======

`kolumny` - declarative multi-file column oriented command line processing engine


What is it?
-----------

It is a simple command line tool designed to preprocess and process scientific
data files, and to be used together with the `gnuplot` program.

It can be used for your various processing needs when you have text files with
numeric data, especially organized in rows. It is excellent for processing `tsv`
and `csv` files, processing outputs of other tools, time series analysis, etc. It
was initially (2010-2012) used for preprocessing and processing spectroscopic
data in biochemical research.

It nicely augments other Unix text processing tools like `grep`, `awk`, `sed`,
`cat`, `cut`, `paste`, `bc` and ad-hoc solutions with `perl` or `python`, and in
some cases completely replaces them for numerical data processing. Most notably
it is easier and more flexible to use than `awk` or `paste`, especially when
working with multiple files that should be processed in parallel, with some
computations between the files performed at the same time. Usually this can't be
done easily in `gnuplot` either (e.g. plotting a sum of two time series, with
each time series coming from a different file).


Magic
-----

Here is a very quick glimpse of what it can do, if you are already familiar with
`gnuplot`.

Let's say two files, `d1.txt` and `d2.txt`, are both text files with some
measurement data. They have a two-line metadata header, followed by 1000 rows of
data, each row having 6 columns. Columns are separated by tabs or a single
space. The first column is a time and the remaining columns are energies in
various channels, measured in the same units. Let's additionally assume that
each file corresponds to a different total number of measurements (5 and 7
respectively), and that they need to be normalized before being merged.

Something like this maybe:

d1.txt
```raw
# MEASUREMENT 1, 2010-01-01 15:00
# Count 5
0.1 0.57 0.59 0.20 0.30 0.99
0.2 0.80 0.33 0.02 0.73 0.74
0.3 0.48 0.22 0.15 0.57 0.81
0.4 0.10 0.10 0.38 0.73 0.36
0.5 0.87 0.85 0.41 0.66 0.85
0.6 0.01 0.65 0.02 0.80 0.22
0.7 0.89 0.89 0.60 0.17 0.82
0.8 0.20 0.78 0.56 0.35 0.30
...
```


d2.txt
```
# MEASUREMENT 1, 2010-01-01 15:00
# Count 7
0.1 0.78 0.34 0.50 0.54 0.45
0.2 0.37 0.65 0.52 0.23 0.10
0.3 0.95 0.13 0.17 0.53 0.21
0.4 0.25 0.55 0.99 0.55 0.94
0.5 0.19 1.00 0.01 0.72 0.29
0.6 0.23 0.32 0.91 0.53 0.30
0.7 0.44 0.77 0.76 0.56 0.66
0.8 0.33 0.40 0.61 0.30 0.84
...
```


If we want to merge these two independent measurement data files to obtain better
statistics, and plot the result in `gnuplot`, we could do this:

```gnuplot
plot " output.txt
```
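
To make the intended computation concrete, here is a minimal sketch of the same
kind of merge in plain Python. It is not how `kolumny` is implemented. It
assumes one plausible reading of "normalize, then merge": sum the channels of
each row, divide by that file's measurement count, and average the two results.
The helper names `read_rows` and `merge` are made up for this illustration.

```python
import io

def read_rows(stream, skip=2):
    """Yield rows of floats, ignoring the first `skip` header lines."""
    for number, line in enumerate(stream):
        if number < skip or not line.strip():
            continue
        yield [float(field) for field in line.split()]

def merge(stream1, count1, stream2, count2):
    """Yield (time, merged energy) for two row-aligned measurement files."""
    for row1, row2 in zip(read_rows(stream1), read_rows(stream2)):
        assert row1[0] == row2[0], "time columns differ"
        # Assumption: normalize each channel sum by its count, then average.
        yield row1[0], (sum(row1[1:]) / count1 + sum(row2[1:]) / count2) / 2

# Two tiny in-memory stand-ins for d1.txt and d2.txt.
d1 = io.StringIO("# header\n# Count 5\n0.1 0.57 0.59 0.20 0.30 0.99\n"
                 "0.2 0.80 0.33 0.02 0.73 0.74\n")
d2 = io.StringIO("# header\n# Count 7\n0.1 0.78 0.34 0.50 0.54 0.45\n"
                 "0.2 0.37 0.65 0.52 0.23 0.10\n")

merged = list(merge(d1, 5, d2, 7))
for time, energy in merged:
    print(time, energy)
```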

`kolumny` has many optional features and shortcuts that allow processing to be
expressed concisely and flexibly. It also helps with debugging by giving
reasonable error messages, by letting you comment out (and disable) any part of
the argument list, and by letting you do computations without printing them, so
that they can be used as input to other expressions.

The interface is heavily influenced by `gnuplot` `plot` and `fit` commands.

`kolumny` is fast and memory efficient (its memory usage is not dependent on the
size of input data).

## Usage

```text
Usage: kolumny [options]... inputspec...

options:
	--begin python-code
	       : execute Python code after startup
	--end python-code
	       : execute Python code at shutdown
	--gnuplot_begin gnuplot-code
	       : execute gnuplot code after startup
	--fit fit-expression
	       : perform simplified internal fitting (linear, with weights)
	--gnuplot_plot gnuplot-plot-expressions
	       : plot using gnuplot
	--gnuplot_fit gnuplot-fit-expressions
	       : fit using gnuplot
	--tmpdir path
	       : use path as temporary directory if needed
	--
	       : end options processing here

inputspec - one of:
	filespec [skip N] [using usingspec]
	":expression"

filespec - one of:
	"filepath"       : named file
	"-"              : standard input
	…
```

Additional integer bit operators: `&`, `|`, `^`, `~`

Comparison operators: `==`, `!=`, `>`, `>=`, `<`, `<=`, `<>`

Additional operators: `[]`.

Logic operators: `and`, `or`, `not`.

Parentheses: `()`

Data conversion functions:
 * `int`
 * `float`
 * `complex`

Math functions and constants:
 * `sqrt`
 * `sin`, `cos`, `tan`, `asin`, `acos`, `atan`
 * `atan2`, `hypot`
 * `cosh`, `sinh`, `tanh`, `acosh`, `asinh`, `atanh`
 * `pow`, `exp`, `expm1`, `log`, `log10`, `log2`, `log1p`, `ldexp`
 * `erf`, `erfc`, `gamma`, `lgamma`, `factorial`
 * `gcd`
 * `isinf`, `isnan`, `isfinite`
 * `pi`, `e`, `tau`, `inf`
 * `floor`, `ceil`, `trunc`, `fabs`, `fmod`, `frexp`, `modf`
 * `copysign`
 * `degrees`, `radians`
 * `fsum`

See the [documentation of the Python math module](https://docs.python.org/3/library/math.html)
for details.

Additional functions:
 * `sum`
 * `avg`, `stddev`
 * `eq`
 * `vec_add`
 * `min`, `max`, `count`

Numbers can have underscores for grouping, for example: `12_345`,
`1.131_411e+33`.

Hexadecimal integers are supported, for example: `0xdeadbeef`.


Chained comparisons, like `a < b < c`, are guaranteed to work. Note that `1 != 2
!= 1` is true and is guaranteed to keep working in future versions. Strange
comparisons like `1 < 3 > 2` will work, but are not guaranteed to continue
working the same way (or even work at all) in future major versions of `kolumny`.
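
Since expressions follow Python semantics, the chaining rules can be checked
directly in Python. A chained comparison is evaluated pairwise and combined with
`and`, which is why `1 != 2 != 1` is true: the two `1`s are never compared with
each other.

```python
a, b, c = 1, 2, 3

chained = a < b < c            # same as (a < b) and (b < c)
not_transitive = 1 != 2 != 1   # (1 != 2) and (2 != 1)
mixed = 1 < 3 > 2              # (1 < 3) and (3 > 2)

print(chained, not_transitive, mixed)
```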

### Python compatibility notes

In general the entire power of Python is available in `kolumny`. However, some
features are not guaranteed to work in future versions, as indicated below.

Other Python operators, data types and functions are available, but are not
guaranteed to work in future major versions of `kolumny`. For these reasons try
to keep any custom complex processing to a minimum, including use of `map`,
`filter`, `reduce`, `zip` and custom `lambda` expressions. These are useful in
many applications and make `kolumny` an extremely powerful tool, so use your
good judgment.

Note: Array/vector concatenation operator (`+`) is not guaranteed to work in
future new major versions of `kolumny`.

Note: Use of tuples (like `(1,2,3)`) is discouraged, and tuples are not
guaranteed to work in future new major versions of `kolumny`.

Note: Use of complex numbers (like `1+2j`) is supported, and in general will be
preserved in future new major versions of `kolumny`.

Note: Octal and binary literals, like `0o377` or `0b11001110101`, will work, but
are not guaranteed to work in future major versions of `kolumny` (binary ones are
more likely to be supported for a longer time).

Note: `Decimal` and `str` conversions are generally available, if used together
with `--import`, but are not guaranteed to work in new major versions of
`kolumny`.

Note: Use of functions that modify variables (especially vectors) in-place (like
`append`, `sort`, `extend`, `reverse`) is strongly discouraged, and can break,
change behaviour or not work even between minor versions of `kolumny`.

Note: These rules are mainly here to make it possible to port `kolumny` to a
different programming language if needed, without re-implementing all the quirks
of Python in it. Depending on user feedback about which features are used,
different priorities will be assigned to what to keep supported and what to drop
without big user impact.

Note: Comments inside expressions are generally supported, for example
`":a:=sum(x) # Add all columns."` will work.


### No input - command line calculator with variables / spreadsheet

If no input files are specified at all, `kolumny` will process all the expressions
one time and print results as requested:

```
kolumny :x:=sin(pi/7) :y:=cos(pi/7) :x*x+y*y
```

Will display `0.433883739118 0.900968867902 1.0`.


### Dependencies

As seen previously, expressions can use variables defined by other expressions or
`using` variables.

```
kolumny "file.txt" using "a:=1,b:=(column(2)+column(3))" \
	":~x:=a/b" \
	":~y:=b/a" \
	":x+y"
```


### Reordering

Variables in all modes can be defined in arbitrary order.

This enables you to use `kolumny` as a calculator or a spreadsheet.

```
kolumny ":x:=3" ":y+x*100" ":y:=4*z" ":z:=100000"
```

Will display `3 400300 400000 100000`, despite `y` using `z` that is defined only
further in the command line.

Note that only one line will be printed, and expressions can be provided in any
order. Values printed will be printed in the exact order specified by the user.


```
kolumny \
	":x+y" \
	":~x:=a/b" \
	":~y:=b/a" \
	"file1.txt" using "~a:=1,~b:=(column(2)+column(3))"
```

This is perfectly valid too, and will produce exactly the same output as the
example in the previous section ([Dependencies](#dependencies)).


### Cycles

Cyclic dependencies are an error:

```
kolumny ":x:=3" ":y:=z+1" ":z:=y+1"
```

Will result in an error because `y` and `z` reference each other.
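
The reordering and cycle rules amount to a topological sort of the variable
definitions. The sketch below shows the idea using the `graphlib` module from
the Python standard library (Python 3.9 or newer). It is an illustration only
and makes no claim about how `kolumny` resolves dependencies internally. The
function `evaluation_order` and its crude name-matching regular expression are
assumptions made for this example.

```python
import re
from graphlib import CycleError, TopologicalSorter

def evaluation_order(definitions):
    """Return variable names ordered so that dependencies come first.

    `definitions` maps a variable name to its expression text.
    """
    graph = {
        name: {token for token in re.findall(r"[A-Za-z_]\w*", expr)
               if token in definitions}
        for name, expr in definitions.items()
    }
    return list(TopologicalSorter(graph).static_order())

# Definitions given out of order, as in the reordering example.
order = evaluation_order({"x": "3", "y": "4*z", "z": "100000"})
print(order)

# Mutually dependent definitions are rejected.
try:
    evaluation_order({"x": "3", "y": "z+1", "z": "y+1"})
except CycleError as error:
    print("cycle detected:", error.args[1])
```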


### Shell preprocessing / generation

```
kolumny … \
	":~check(a >= 0)" \
	":sqrt(a)"
```

Will check that the first column is non-negative, then compute its square root
and print the result.


### Checking multiple inputs

When combining multiple files it might be beneficial to make sure that they are
matched correctly for parallel processing.

```
kolumny \
	"file1.txt" using t1:=1,x1:=2 \
	"file2.txt" using t2:=1,x2:=2 \
	":~check(t1==t2)" \
	":t1" \
	":x1" ":x2"
```

Will output 3 columns: the first column (`t1`) from `file1.txt`, and the second
column from each input file (`x1`, `x2`). It will also make sure that the first
column is identical in both input files (`t1 == t2`).

This is a good way to check that the multiple files we are combining conform to
the same form (e.g. they were exported from other software the same way, or the
measurement equipment used to capture the data was set up the same way).
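
Conceptually this is lockstep iteration over two inputs with an assertion per
row. A minimal Python sketch of that idea follows. The function name
`paired_columns` is hypothetical, and raising an exception on the first mismatch
is an assumption of this sketch, not a statement about what `check` does in
`kolumny`.

```python
import io

def paired_columns(stream1, stream2):
    """Yield (t1, x1, x2) from two whitespace-separated inputs read in lockstep."""
    for line_number, (line1, line2) in enumerate(zip(stream1, stream2), start=1):
        t1, x1 = line1.split()[:2]
        t2, x2 = line2.split()[:2]
        if float(t1) != float(t2):
            raise ValueError(f"row {line_number}: t1={t1} differs from t2={t2}")
        yield float(t1), float(x1), float(x2)

# In-memory stand-ins for file1.txt and file2.txt.
file1 = io.StringIO("0.1 0.57\n0.2 0.80\n")
file2 = io.StringIO("0.1 0.78\n0.2 0.37\n")

rows = list(paired_columns(file1, file2))
for row in rows:
    print(*row)
```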


### Combining skip and check

Especially when dealing with complex headers, or data whose rows are not
correctly aligned, one would use `skip` and `check` together:

```
kolumny \
	"file1.txt" skip 11 using t1:=1,x1:=2 \
	"file2.txt" skip 9 using t2:=1,x2:=2 \
	":~check(t1==t2)" \
	":t1" \
	":x1" ":x2"
```

Will do the same as the previous example, but will first skip the first 11 lines
of `file1.txt` and the first 9 lines of `file2.txt`. This can be useful when the
data produced do not consistently start at the same value.

In many processing scenarios some other code (for example in Bash script) will
determine a correct value of a skip and pass it to `kolumny`.

This example can't be simply recreated using `tail`, `paste` and `awk`, without
creating additional temporary files.


### Termination

`kolumny` will terminate when its input file has no more rows to process.

When using multiple input files `kolumny` will terminate processing as soon as
one of the files finishes, or it can not be processed correctly any longer (i.e.
no more correct columns to be used via `using`).

TODO(baryluk): Add a feature, to allow continuing processing until all input
files finish, and use implicit column values for already finished files.
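
The difference between the current behaviour and the one proposed in the TODO
is the same as between `zip` and `itertools.zip_longest` in Python. The sketch
below only illustrates that difference. The fill value `None` is a placeholder
chosen for this example, not a value that `kolumny` defines.

```python
from itertools import zip_longest

short_input = [1, 2]
long_input = [10, 20, 30]

# Stop as soon as the shortest input is exhausted.
stop_at_shortest = list(zip(short_input, long_input))

# Hypothetical alternative: continue and substitute a fill value.
fill_missing = list(zip_longest(short_input, long_input, fillvalue=None))

print(stop_at_shortest)
print(fill_missing)
```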


### Complex numbers

`1j` is the imaginary unit. Example of a complex number: `3+4j`.

When reading data, one can easily create complex numbers and use them in
expressions:

```
kolumny "data1.txt" u "z:=(column(1)+1j*column(2))" ":z**2"
```

Will form a complex number from the real and imaginary parts in columns 1 and 2,
and display this complex number (like this: `(1.2+3.4j)`) and its square.

Standard mathematical functions with support for complex arguments can be
accessed via the `cmath` module using the `--import` option (see the [Importing
Python modules](#importing-python-modules) section for details).
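
For reference, this is what the same arithmetic looks like in plain Python, with
two literal values standing in for `column(1)` and `column(2)`. The functions
used are standard members of the `cmath` module.

```python
import cmath

real, imag = 1.2, 3.4       # stand-ins for column(1) and column(2)
z = real + 1j * imag

print(z, z**2)
print(abs(z), cmath.phase(z), cmath.sqrt(z))
```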


### Custom initialization

Arbitrary Python code can be executed at the start of `kolumny` using the
`--begin` option, and the results of this execution can be used in the
processing stages:

```
kolumny --begin a=3 "file1.txt" u ~x:=1 ":x*a"
```

or

```
kolumny --begin a=3 "file1.txt" u "(a*column(1))"
```

Will both output first column multiplied by a constant `a`, which is equal to 3.

Multiple `--begin` can be specified, and they will be executed in order:

```
kolumny --begin a=3 --begin a=a*a "file1.txt" u "(a*column(1))"
```

Will output first column multiplied by 9.

One can also use semicolons and comments, as in normal Python code:

```
kolumny --begin "a=3; b=4 # Init" "file1.txt" u "(a*column(1)+b)"
```

Or use statements with side effects:

```
kolumny --begin "print('Processing...')" "file1.txt" u 1
```

This feature is especially useful together with the ability to modify global
variables defined by `--begin` in expressions:

```
kolumny --begin "a=0.0" "file1.txt" u "~x:=1" ":x" ":~a+=x" ":a"
```

Will output two columns: first a copy of the input column, and second a running
(cumulative) sum of it.
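
The running sum is ordinary accumulator logic. In plain Python the same result
can be written either with an explicit loop or with `itertools.accumulate`. The
list `column` below is made-up sample data standing in for the first column of
`file1.txt`.

```python
from itertools import accumulate

column = [1.0, 2.0, 3.0, 4.0]   # stand-in for the first column of file1.txt

# Explicit accumulator, mirroring a=0.0 at startup and a+=x for every row.
a = 0.0
rows = []
for x in column:
    a += x
    rows.append((x, a))

# The same running sum using the standard library.
running = list(accumulate(column))

for x, total in rows:
    print(x, total)
```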


### Importing Python modules

Arbitrary Python modules can be imported to be used in expressions.


### Accumulators and other statistical operations across rows

```
kolumny \
	--begin 'maximum1=float("-inf")' \
	--begin 'maximum2=float("-inf")' \
	"file1" using 1,v1:=4,~v2:=5 \
	":maximum1=max(maximum1, v1)" \
	":~maximum2=max(maximum2, v2)" \
	--end   'print("MAX: %f %f" % (maximum1, maximum2))'
```

Will output 3 columns: columns 1 and 4 from file `file1`, and the maximal value
of column 4 accumulated so far. At the end it will additionally print the
maximal values of the 4th and 5th columns.


### Empty file name `""`

A special empty filename instructs `kolumny` to read the previous file again.
In most shells this can be done using `""`.

```
kolumny "file1.txt" using 1,2 "" using 1,3
```

Is essentially the shortcut to:

```
kolumny "file1.txt" using 1,2 "file1.txt" using 1,3
```

And that is essentially similar to:

```
kolumny "file1.txt" using 1,2,1,3
```


### Standard input file name `"-"`

A special filename `-` can be used to read data from standard input:

```
seq 10 | ./kolumny '-' using "~x:=1" ":x" ":x*x"
```

Will output 10 rows with consecutive numbers and their squares.

```
seq 10 | ./kolumny --import random '-' using "~x:=1" ":'%.3f'%random.random()" ":'%.3f'%random.random()"
```

Will output 10 rows with 2 random values in each row (with just 3 digits after
the decimal point).


### Commenting out files

A file can be skipped from processing by prepending it with `#`:

```text
kolumny "#file1.txt" using 1,3 "file2.txt" using 1,3
```

Will only show 2 columns: columns 1 and 3 from `file2.txt`. `kolumny` will ignore
`file1.txt` and its `using` statements. The file doesn't even need to exist. The
`using` must be syntactically correct though.

```text
kolumny "#file1.txt" using foo "file2.txt" using 1,3
```

will produce an error.

### Commenting out expressions

Similarly, expressions can be skipped from processing by prepending them with `#`:

```text
kolumny ":7" "#:8"
```

Will only show one column with value `7`.

The expression can be malformed and will still be ignored:

```text
kolumny ":7" "#:foo"
```

Will only show one column with value `7`.


Commenting out an expression disables its evaluation and the assignment to any
variables defined by it, so those variables can no longer be used in other
expressions.


```text
kolumny "#:a:=7" ":2*a"
```

Will result in evaluation error and produce no results, because `a` is undefined.

### Commenting in the expressions

It is possible to add custom comments in expressions:

```text
./kolumny ":a:=1 # One" :2*a
```

Will print `1 2`.

`#` and anything after it in any given expression will be ignored.

This style of commenting is not supported in `using` expressions, but can be
emulated using disabled expressions:

```text
./kolumny "file1.txt" using 1,3 "#: Read first and third column" \
	"file2.txt" using 2,4 "#: Read two columns"
```

The text in these expressions is not a valid expression, but because it is
disabled it doesn't really matter.

A different form:

```text
./kolumny "file1.txt" using 1,3 "# Two columns" \
	"file2.txt" using 2,4 "# Extra two other columns"
```

Will also work, but is strongly discouraged, because this is interpreted as a
disabled input file name, and even for disabled files the `using` statements are
processed, which can lead to nasty surprises:

```text
./kolumny "file.txt" "# Two columns" using 1,3
```

Will output all columns of `file.txt`, not just columns 1 and 3. The `using 1,3`
is attached to a hypothetical input file `" Two columns"` (starting with space
and having spaces) and is being ignored.


### Quoting

In many cases quotes around arguments are not needed:

```
kolumny :42*123
```

Will show `5166`.

```
kolumny :a:=2**13 :~b:=a//3 :a*b
```

Will show `8192 22364160`.


The interface was designed to limit the use of special characters that could
interfere with the standard Unix shell.

```
kolumny myfiles/file.txt using ~a:=1,~b:=2 :a :b/a :b*a
```

In general, quoting is only needed in a conventional interactive shell or script
when:
 * there are spaces or brackets in file paths (or file names or paths come from
   unknown sources),
 * using the commenting feature (`#` in an input spec or expression),
 * using brackets in `using` or in expressions,
 * referencing some other special characters (e.g. `$`; in such cases single
   quotes are recommended).

In the example below each quoted argument must be quoted as presented when
executed from Bash and other shells or scripts.

```
kolumny "My Documents/some data(1).txt" using "1,~b:=(column(3)+column(5))" "#:b*2" ":sqrt(b)"
```

The quotes themselves are not passed to `kolumny`, and `kolumny` does not handle
them in the above example. It is just a feature of most shells to interpret `#`
as a comment, space as an argument separator, and brackets for various features.
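
What the program actually receives can be previewed with Python's `shlex`
module, which approximates POSIX shell word splitting (it is not a full shell,
so treat this as a rough check only). The sketch shows that quotes are removed
before the arguments are passed on, and that an unquoted `#` at the start of a
word swallows the rest of the line.

```python
import shlex

# Quoted: spaces, brackets and '#' survive; the quotes themselves are removed.
quoted = shlex.split(
    'kolumny "My Documents/some data(1).txt" using '
    '"1,~b:=(column(3)+column(5))" "#:b*2"',
    comments=True,
)

# Unquoted: '#' starts a comment, so '#:8' never reaches the program.
unquoted = shlex.split("kolumny :7 #:8", comments=True)

print(quoted)
print(unquoted)
```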

Of course, various techniques can be used to combine shell scripting with
`kolumny` and pass information between them, for example the file name, special
(per-file) constants for expressions, or the value for `skip`.

### Time related values handling

`kolumny` doesn't support handling of time values (i.e. timestamps, ISO 8601 dates,
or time periods like `2h`), but these can mostly be handled using Python functions
such as those from the [datetime module](https://docs.python.org/3/library/datetime.html).
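
As a small illustration of the kind of conversion `datetime` provides, the
sketch below turns two ISO 8601 timestamps into an elapsed time in seconds,
which is a plain number and therefore easy to process further. The timestamps
are made-up sample values, and `datetime.fromisoformat` requires Python 3.7 or
newer.

```python
from datetime import datetime

start = datetime.fromisoformat("2010-01-01T15:00:00")
sample = datetime.fromisoformat("2010-01-01T16:30:00")

# Subtracting two datetimes gives a timedelta; convert it to seconds.
elapsed_seconds = (sample - start).total_seconds()
print(elapsed_seconds)
```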


### Using kolumny as input to gnuplot

### Using kolumny to generate fits and plots via gnuplot

```
kolumny \
	--gnuplot_begin 'set terminal png' \
	--gnuplot1_begin 'set output "chart1.png"' \
	--gnuplot2_begin 'set output "chart2.png"' \
	--gnuplot_begin 'f(x) = k1*x + k2' \
	--gnuplot_begin 'g(x) = A*x + B' \
	"file1" using y1a:=3,y2a:=4,~xa:=1 \
	"file2" using y1b:=3,y2b:=4,xb:=1 \
	":~check(xa==xb)" \
	":diff1=y1a-y1b" \
	":diff2=y2a-y2b" \
	--gnuplot1_fit ":xa,y1"     "f(x)" via "k1,k2" \
	--gnuplot2_fit ":xa,diff1"  "g(x)" via "A,B" \
	--gnuplot1_plot "f(x)" ":xa,y2b" ":xa,y1a" ";" \
	--gnuplot2_plot ":xa,diff2" title "Data2" "g(x)" title 'Fit2' ";" \
	--end 'print("A: %f  B: %f " % (A, B))'
```

Perform some fitting and plotting on multiple files.


### Performance

`kolumny` can process about 450000 lines per second on a modern desktop-class
CPU when using only simple features and processing files with only a few columns.

When using more complex processing, complex expressions, vectors and dozens of
total columns (across all files), this will usually drop to about 25000-50000
lines per second in practical scenarios.

So, having each file with 500 rows of data, one can process about 50-100 files
per second. Just an example.

The memory usage shouldn't exceed a few megabytes, even when processing extremely
large files (~10 gigabytes and more).

There is plenty of room for performance improvements in `kolumny`, and it is
believed it can process more than 1 million lines per second with suitable
optimizations, without switching to another programming language.

`kolumny` is single threaded, and will use at most one CPU core. If you are
processing massive amounts of data, either split the input data into multiple
smaller files, process them in parallel, and then join the results back using
`cat`; or, if you have many data sets to begin with that take hours to process,
process them in parallel instead. Good tools for doing so are
[GNU parallel](https://www.gnu.org/software/parallel/), `xargs`, `make`,
`find -exec`, or simply the `&` operator in the shell for a small number of
files. See
[parallel alternatives](https://www.gnu.org/software/parallel/parallel_alternatives.html#DIFFERENCES-BETWEEN-pyargs-AND-GNU-Parallel)
for some more (but not all) alternatives (be aware of possible bias in this
document, as it is written by the GNU parallel developers).


### Crazy

```
kolumny \
	"file1" using a:=1,~c:=3...7 \
	"file2" s 11 u ~b:=1,d:=2 \
	"#file3" skip 21 using ~e:=1,g:=2 \
	"= 10" \
	":'somestring'" \
	":1e3*xy" \
	":4,5,b" \
	":4,5,c" \
	":~xy:=x+y" \
	"file4" skip 11 u 3
```

Will output lines of the form:

```
a d g h S=sum(c) d+g*h sqrt(d)-c[3]*S z True/False somestring 1e3*xy=1000*(x+y) (4,5,b) (4,5,[c1,...]) column-3-of-file4
```

Notes:

 - order of files and expressions intermixed
 - multiple columns and files read at once
 - silent checks (asserts) that some columns between input files are equal
 - not printed columns using `~` both in expressions and file inputs
 - forward reference of variable `x` and `xy` before they are defined
 - comments (disabled files and expressions) using `#`
 - Python expressions, including math operations, checks, tuple/array printing
 - vector columns like `c:=3...7`. `c[3]` means 3+3+1=7th column from "file1"
 - usage of subcommands to generate data and remove empty lines using `grep`
 - usage of the same file multiple times
 - short forms of `using` (`u`) and `skip` (`s`)


Installing
----------

Either clone this git repository (`git clone
https://github.com/baryluk/kolumny.git`) and add it to your `PATH` (for example
using `export PATH=~/kolumny:$PATH` at the end of your `~/.bashrc` file).

Or [download the main
executable](https://raw.githubusercontent.com/baryluk/kolumny/master/kolumny) and
put it, for example, in your `~/bin/` directory (make sure it is in your `PATH`).

The only required dependency is Python 3. If you are running Linux you probably
already have it installed.

To test quickly if it works, execute in terminal:

```
kolumny :x:=3 :y:=x**x
```

You should see this output:

```
3 27
```


Future work
-----------

The majority of future work on `kolumny` will focus on bug fixing, tests and
adding some additional features like:

* automatic row alignment, to eliminate manual `skip` and `check`.
* row interpolation
* subsampling similar to `set sample` and `every` in `gnuplot`
* computations of statistics across rows
* multi-pass algorithms and data caching from sub-processes


Contributing
------------

Simply open an issue (bug, feature request) or a pull request via this GitHub
project. If possible please provide all input files and command line to reproduce
the problem, and try to keep it as minimal as possible.


Authors and License
-------------------

* Witold Baryluk

This project is licensed under [BSD License](https://choosealicense.com/licenses/bsd-3-clause/).

Copyright, Witold Baryluk - 2010, 2012, 2018.
