Description: Column-based spreadsheet for the command line. Advanced multi-file processing and plotting.
kolumny
=======

`kolumny` - declarative multi-file column oriented command line processing engine

What is it?
-----------

It is a simple command line tool designed to preprocess and process scientific data files, and to be used together with the `gnuplot` program.

It can be used for various processing needs when you have text files with numeric data, especially data organized in rows. It is excellent for processing `tsv` and `csv` files, processing outputs of other tools, time series analysis, etc.

It was initially (2010-2012) used for preprocessing and processing spectroscopic data in biochemical research.

It nicely augments other Unix text processing tools like `grep`, `awk`, `sed`, `cat`, `cut`, `paste`, `bc` and ad-hoc solutions with `perl` or `python`, and in some cases completely replaces them for numerical data processing. Most notably it is easier and more flexible to use than `awk` or `paste`, especially with multiple files that should be processed in parallel, with computations between the files performed at the same time. Usually this can't be done easily in `gnuplot` either (e.g. plotting a sum of two time series, with each time series coming from a different file).

Magic
-----

Here is a very quick glimpse of what it can do, if you are already familiar with `gnuplot`.

Let's say two files, `d1.txt` and `d2.txt`, are both text files with some measurement data. They have a two-line metadata header, followed by 1000 rows of data with 6 columns in each row. Columns are separated by tabs or a single space. The first column is a time, and the remaining columns are energies in various channels, measured in the same units.

Let's additionally assume that each file corresponds to a different total number of measurements (5 and 7 respectively), and that they need to be normalized before being merged.
Something like this maybe:

d1.txt

```raw
# MEASUREMENT 1, 2010-01-01 15:00
# Count 5
0.1 0.57 0.59 0.20 0.30 0.99
0.2 0.80 0.33 0.02 0.73 0.74
0.3 0.48 0.22 0.15 0.57 0.81
0.4 0.10 0.10 0.38 0.73 0.36
0.5 0.87 0.85 0.41 0.66 0.85
0.6 0.01 0.65 0.02 0.80 0.22
0.7 0.89 0.89 0.60 0.17 0.82
0.8 0.20 0.78 0.56 0.35 0.30
...
```

d2.txt

```
# MEASUREMENT 1, 2010-01-01 15:00
# Count 5
0.1 0.78 0.34 0.50 0.54 0.45
0.2 0.37 0.65 0.52 0.23 0.10
0.3 0.95 0.13 0.17 0.53 0.21
0.4 0.25 0.55 0.99 0.55 0.94
0.5 0.19 1.00 0.01 0.72 0.29
0.6 0.23 0.32 0.91 0.53 0.30
0.7 0.44 0.77 0.76 0.56 0.66
0.8 0.33 0.40 0.61 0.30 0.84
...
```

If we want to merge these two independent measurement data files to obtain better statistics, and plot the result in `gnuplot`, we could do this:

```gnuplot
plot "output.txt
```

`kolumny` has many optional features and shortcuts that allow processing to be expressed concisely and flexibly. It also helps with debugging by giving reasonable error messages, by allowing any part of the argument list to be commented out (disabled), and by allowing computations to be performed without printing them, so that they can be used as input to other expressions.

The interface is heavily influenced by the `gnuplot` `plot` and `fit` commands.

`kolumny` is fast and memory efficient (its memory usage does not depend on the size of the input data).

## Usage

```text
Usage: kolumny [options]... inputspec...

options:
  --begin python-code : execute Python code after startup
  --end python-code : execute Python code at shutdown
  --gnuplot_begin gnuplot-code : execute gnuplot code after startup
  --fit fit-expression : perform simplified internal fitting (linear, with weights)
  --gnuplot_plot gnuplot-plot-expressions : plot using gnuplot
  --gnuplot_fit gnuplot-fit-expressions : fit using gnuplot
  --tmpdir path : use path as temporary directory if needed
  -- : end options processing here

inputspec - one of:
  filespec [skip N] [using usingspec]
  ":expression"

filespec - one of:
  "filepath" : named file
  "-" : standard input
  " >`.
```
Additional integer bit operators: `&`, `|`, `^`, `~`

Comparison operators: `==`, `!=`, `>`, `>=`, `<`, `<=`, `<>`

Additional operators: `[]`.

Logic operators: `and`, `or`, `not`.

Parentheses: `()`

Data conversion functions:

* `int`
* `float`
* `complex`

Math functions and constants:

* `sqrt`
* `sin`, `cos`, `tan`, `asin`, `acos`, `atan`
* `atan2`, `hypot`
* `cosh`, `sinh`, `tanh`, `acosh`, `asinh`, `atanh`
* `pow`, `exp`, `expm1`, `log`, `log10`, `log2`, `log1p`, `ldexp`
* `erf`, `erfc`, `gamma`, `lgamma`, `factorial`
* `gcd`
* `isinf`, `isnan`, `isfinite`
* `pi`, `e`, `tau`, `inf`
* `floor`, `ceil`, `trunc`, `fabs`, `fmod`, `frexp`, `modf`
* `copysign`
* `degrees`, `radians`
* `fsum`

See the [documentation of the Python math module](https://docs.python.org/3/library/math.html) for details.

Additional functions:

* `sum`
* `avg`, `stddev`
* `eq`
* `vec_add`
* `min`, `max`, `count`

Numbers can have underscores for grouping, for example: `12_345`, `1.131_411e+33`.

Hexadecimal integers are supported, for example: `0xdeadbeef`.

Chained comparisons, like `a < b < c`, are guaranteed to work. Note that `1 != 2 != 1` is true and is guaranteed to keep working in future versions. Note that strange comparisons like `1 < 3 > 2` will work, but are not guaranteed to continue working the same way (or even work at all) in future major versions of `kolumny`.

### Python compatibility notes

In general the entire power of Python is available in `kolumny`. However, some features are not guaranteed to work in future versions, as indicated below.

Other Python operators, data types and functions are available, but are not guaranteed to work in future major versions of `kolumny`. For these reasons, try to keep any custom complex processing to a minimum, including use of `map`, `filter`, `reduce`, `zip` and custom `lambda` expressions. These are useful in many applications and make `kolumny` an extremely powerful tool, so use your good judgment.
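The literal and comparison rules listed above can be checked in a plain Python interpreter. The following is a minimal sketch, assuming the expressions are evaluated with ordinary Python 3.6+ semantics; it does not call `kolumny` itself, and the variable names are made up for illustration.

```python
# Plain-Python check of the numeric literal and chained comparison rules.
# Assumption: kolumny expressions follow ordinary Python semantics here.

# Underscores are only visual grouping; the values are unchanged.
grouped_int = 12_345
grouped_float = 1.131_411e+33
hex_int = 0xdeadbeef

# A chained comparison `a op1 b op2 c` means `(a op1 b) and (b op2 c)`,
# with the middle operand evaluated only once.
chain_neq = 1 != 2 != 1    # (1 != 2) and (2 != 1)
chain_mixed = 1 < 3 > 2    # (1 < 3) and (3 > 2)

print(grouped_int, grouped_float, hex_int, chain_neq, chain_mixed)
```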
Note: The array/vector concatenation operator (`+`) is not guaranteed to work in future major versions of `kolumny`.

Note: Use of tuples (like `(1,2,3)`) is discouraged, and tuples are not guaranteed to work in future major versions of `kolumny`.

Note: Use of complex numbers (like `1+2j`) is supported, and in general will be preserved in future major versions of `kolumny`.

Note: Octal and binary literals, like `0o377` or `0b11001110101`, will work, but are not guaranteed to work in future major versions of `kolumny` (binary ones are more likely to be supported for a longer time).

Note: `Decimal` and `str` conversions are generally available if used together with `--import`, but are not guaranteed to work in new major versions of `kolumny`.

Note: Use of functions that modify variables (especially vectors) in place (like `append`, `sort`, `extend`, `reverse`) is strongly discouraged, and can break, change behaviour or stop working even between minor versions of `kolumny`.

Note: These rules are mainly here to make it possible to port `kolumny` to a different programming language if needed, without re-implementing all the quirks of Python in it. Depending on user feedback about which features are used, different priorities will be assigned to what to keep supported and what to drop without a big impact on users.

Note: Comments inside expressions are generally supported, for example `":a:=sum(x) # Add all columns."` will work.

### No input - command line calculator with variables / spreadsheet

If no input files are specified at all, `kolumny` will process all the expressions once and print the results as requested:

```
kolumny ":x:=sin(pi/7)" ":y:=cos(pi/7)" ":x*x+y*y"
```

Will display `0.433883739118 0.900968867902 1.0`.

### Dependencies

As seen previously, expressions can use variables defined by other expressions or by `using` variables.

```
kolumny "file.txt" using "a:=1,b:=(column(2)+column(3))" \
  ":~x:=a/b" ":~y:=b/a" ":x+y"
```

### Reordering

Variables, in all modes, can be defined in arbitrary order.
This enables you to use `kolumny` as a calculator or a spreadsheet.

```
kolumny ":x:=3" ":y+x*100" ":y:=4*z" ":z:=100000"
```

Will display `3 400300 400000 100000`, despite `y` using `z`, which is defined only later on the command line.

Note that only one line will be printed, and expressions can be provided in any order. Values will be printed in the exact order specified by the user.

```
kolumny \
  ":x+y" \
  ":~x:=a/b" \
  ":~y:=b/a" \
  "file1.txt" using "~a:=1,~b:=(column(2)+column(3))"
```

Is perfectly valid too, equivalent, and will produce exactly the same output as the example in the previous section ([Dependencies](#dependencies)).

### Cycles

Cyclic dependencies are an error:

```
kolumny ":x:=3" ":y:=z+1" ":z:=y+1"
```

Will result in an error because `y` and `z` reference each other.

### Shell preprocessing / generation

```
kolumny " = 0)" ":sqrt(a)"
```

Will check that the first column is non-negative, then compute its square root and print the result.

### Checking multiple inputs

When combining multiple files it might be beneficial to make sure that they are matched correctly for parallel processing.

```
kolumny \
  "file1.txt" using t1:=1,x1:=2 \
  "file2.txt" using t2:=1,x2:=2 \
  ":~check(t1==t2)" \
  ":t1" \
  ":x1" ":x2"
```

Will output 3 columns: the first column (`t1`) from `file1.txt`, and the second column of each input file (`x1`, `x2`). It will also make sure that the first column is identical in both input files.

This is a good way to check that the files being combined conform to the same form (e.g. they were exported from other software the same way, or the measurement equipment used to capture the data was set up the same way).
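The idea behind reading two inputs in lockstep and checking a shared column can be sketched in plain Python. This is only an illustration under assumptions, not `kolumny`'s implementation: `merge_checked` is a hypothetical helper, the sample data is made up, and raising `ValueError` on a mismatch is this sketch's own choice.

```python
import io


def merge_checked(file_a, file_b):
    """Yield (t, x_a, x_b) rows from two inputs read in lockstep.

    Hypothetical helper for illustration; not part of kolumny.
    Raises ValueError as soon as the first columns disagree.
    """
    # zip() stops at the shorter input, so trailing rows are ignored.
    for line_no, (line_a, line_b) in enumerate(zip(file_a, file_b), start=1):
        cols_a = line_a.split()
        cols_b = line_b.split()
        t1, x1 = float(cols_a[0]), float(cols_a[1])
        t2, x2 = float(cols_b[0]), float(cols_b[1])
        # Similar in spirit to ":~check(t1==t2)": validate, print nothing.
        if t1 != t2:
            raise ValueError(f"row {line_no}: t1={t1} differs from t2={t2}")
        yield t1, x1, x2


# Made-up sample data standing in for file1.txt and file2.txt.
file1 = io.StringIO("0.1 0.57\n0.2 0.80\n0.3 0.48\n")
file2 = io.StringIO("0.1 0.78\n0.2 0.37\n0.3 0.95\n")

for t, x1, x2 in merge_checked(file1, file2):
    print(t, x1, x2)
```

Because the rows are consumed pairwise from both inputs, no temporary merged file is needed.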
### Combining skip and check

Especially when dealing with complex headers, or with data whose rows are not correctly aligned, one would use `skip` and `check` together:

```
kolumny \
  "file1.txt" skip 11 using t1:=1,x1:=2 \
  "file2.txt" skip 9 using t2:=1,x2:=2 \
  ":~check(t1==t2)" \
  ":t1" \
  ":x1" ":x2"
```

Will do the same as the previous example, but will first skip the first 11 lines of `file1.txt` and the first 9 lines of `file2.txt`. This can be useful when the data does not consistently start at the same value. In many processing scenarios some other code (for example a Bash script) will determine the correct value of the skip and pass it to `kolumny`.

This example can't be simply recreated using `tail`, `paste` and `awk` without creating additional temporary files.

### Termination

`kolumny` will terminate when its input file has no more rows to process.

When using multiple input files, `kolumny` will terminate processing as soon as one of the files finishes, or can no longer be processed correctly (i.e. there are no more correct columns to be used via `using`).

TODO(baryluk): Add a feature to allow continuing processing until all input files finish, and use implicit column values for already finished files.

### Complex numbers

`1j` is the imaginary unit. An example of a complex number: `3+4j`.

When reading data, one can easily create complex numbers and use them in expressions:

```
kolumny "data1.txt" u "z:=(column(1)+1j*column(2))" ":z**2"
```

Will form a complex number from the real and imaginary parts in columns 1 and 2, then display this complex number (like this: `(1.2+3.4j)`) and its square.

Standard mathematical functions with support for complex arguments can be accessed via the `cmath` module using the `--import` option (see the [Importing Python modules](#importing-python-modules) section for details).
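To see why `cmath` is needed for functions of complex arguments, here is a minimal plain-Python sketch. It assumes only the standard library; the two input values are made up and stand in for columns 1 and 2 of a data row.

```python
import cmath

# Two made-up values standing in for the real and imaginary columns.
re_part, im_part = 3.0, 4.0

# Same construction as `z:=(column(1)+1j*column(2))`.
z = re_part + 1j * im_part

squared = z ** 2             # built-in operators already handle complex values
root = cmath.sqrt(squared)   # math.sqrt would raise TypeError for complex input
magnitude = abs(z)           # modulus of z
angle = cmath.phase(z)       # argument of z, in radians

print(z, squared, root, magnitude, angle)
```

Arithmetic operators work on complex values without any import; only the named math functions need their `cmath` counterparts.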
### Custom initialization

Arbitrary Python code can be executed at the start of `kolumny` with the `--begin` option, and the results of this execution can be used in the processing stages:

```
kolumny --begin a=3 "file1.txt" u ~x:=1 ":x*a"
```

or

```
kolumny --begin a=3 "file1.txt" u "(a*column(1))"
```

Will both output the first column multiplied by a constant `a`, which is equal to 3.

Multiple `--begin` options can be specified, and they will be executed in order:

```
kolumny --begin a=3 --begin a=a*a "file1.txt" u "(a*column(1))"
```

Will output the first column multiplied by 9.

One can also use semicolons and comments, as in normal Python code:

```
kolumny --begin "a=3; b=4 # Init" "file1.txt" u "(a*column(1)+b)"
```

Or use statements with side effects:

```
kolumny --begin "print('Processing...')" "file1.txt" u 1
```

This feature can be combined with the ability to modify, from within expressions, global variables defined by `--begin`:

```
kolumny --begin "a=0.0" "file1.txt" u "~x:=1" ":x" ":~a+=x" ":a"
```

Will output two columns: first a copy of the input, and second a running (cumulative) sum of the first one.

### Importing Python modules

Arbitrary Python modules can be imported to be used in expressions.

### Accumulators and other statistical operations across rows

```
kolumny \
  --begin 'maximum1=float("-inf")' \
  --begin 'maximum2=float("-inf")' \
  "file1" using 1,v1:=4,~v2:=5 \
  ":maximum1=max(maximum1, v1)" \
  ":~maximum2=max(maximum2, v2)" \
  --end 'print("MAX: %f %f" % (maximum1, maximum2))'
```

Will output 3 columns from file "file1": column 1, column 4, and the maximal value of column 4 accumulated so far (column 5 and its running maximum are hidden by `~`). At the end it will additionally print the maximal values of the 4th and 5th columns.

### Empty file name `""`

A special empty filename instructs `kolumny` to read the previous file again. In most shells this can be done using `""`.
```
kolumny "file1.txt" using 1,2 "" using 1,3
```

Is essentially a shortcut for:

```
kolumny "file1.txt" using 1,2 "file1.txt" using 1,3
```

And that is essentially similar to:

```
kolumny "file1.txt" using 1,2,1,3
```

### Standard input file name `"-"`

A special filename `-` can be used to read data from standard input:

```
seq 10 | ./kolumny '-' using "~x:=1" ":x" ":x*x"
```

Will output 10 rows with consecutive numbers and their squares.

```
seq 10 | ./kolumny --import random '-' using "~x:=1" ":'%.3f'%random.random()" ":'%.3f'%random.random()"
```

Will output 10 rows with 2 random values in each row (with just 3 digits after the decimal point).

### Commenting out files

A file can be skipped from processing by prepending `#` to its name:

```text
kolumny "#file1.txt" using 1,3 "file2.txt" using 1,3
```

Will only show 2 columns: columns 1 and 3 from `file2.txt`. `kolumny` will ignore `file1.txt` and its `using` statements. The file doesn't even need to exist. The `using` must still be syntactically correct, though.

```text
kolumny "#file1.txt" using foo "file2.txt" using 1,3
```

will produce an error.

### Commenting out expressions

Similarly, an expression can be skipped from processing by prepending `#` to it:

```text
kolumny ":7" "#:8"
```

Will only show one column with value `7`.

The expression can be malformed and will still be ignored:

```text
kolumny ":7" "#:foo"
```

Will only show one column with value `7`.

Commenting out an expression disables its evaluation and the assignment to any variables it defines, so those variables can no longer be used in other expressions.

```text
kolumny "#:a:=7" ":2*a"
```

Will result in an evaluation error and produce no results, because `a` is undefined.

### Commenting in the expressions

It is possible to add custom comments in expressions:

```text
./kolumny ":a:=1 # One" :2*a
```

Will print `1 2`. `#` and anything after it in any given expression will be ignored.
This style of commenting is not supported in `using` specifications, but can be emulated using disabled expressions:

```text
./kolumny "file1.txt" using 1,3 "#: Read first and third column" \
  "file2.txt" using 2,4 "#: Read two columns"
```

The text in these expressions is not a valid expression, but because it is disabled it doesn't really matter.

A different form:

```text
./kolumny "file1.txt" using 1,3 "# Two columns" \
  "file2.txt" using 2,4 "# Extra two other columns"
```

Will also work, but is strongly discouraged, because each comment is interpreted as a disabled input file name. Even for disabled files the `using` statements are processed, and that can lead to nasty surprises:

```text
./kolumny "file.txt" "# Two columns" using 1,3
```

Will output all columns of `file.txt`, not just columns 1 and 3. The `using 1,3` is attached to a hypothetical input file `" Two columns"` (starting with a space and containing spaces) and is ignored.

### Quoting

In many cases quotes around arguments are not needed:

```
kolumny :42*123
```

Will show `5166`.

```
kolumny :a:=2**13 :~b:=a/3 :a*b
```

Will show `8192 22364160`.

The interface was designed to limit the use of special characters that could interfere with the standard Unix shell.

```
kolumny myfiles/file.txt using ~a:=1,~b:=2 :a :b/a :b*a
```

In general, in a conventional interactive shell or script, quoting is only needed when:

* there are spaces or brackets in file paths (or file names or paths come from unknown sources)
* using the commenting feature (`#` in an input spec or expression)
* using brackets in `using` or in expressions
* referencing some other special characters (e.g. `$`; in such cases single quotes are recommended).

In the example below, each quoted argument must be quoted as presented when executed from Bash or other shells and scripts.

```
kolumny "My Documents/some data(1).txt" using "1,~b:=(column(3)+column(5))" "#:b*2" ":sqrt(b)"
```

The quotes themselves are not passed to `kolumny`, and `kolumny` does not handle them in the above example.
It is just a feature of most shells to interpret `#` as a comment, a space as an argument separator, and brackets for various features.

Of course various techniques can be used to combine shell scripting capabilities with `kolumny` and pass information between the two: for example the filename, special (per-file) constants for expressions, or the value for `skip`.

### Time related values handling

`kolumny` doesn't support handling of time values (e.g. timestamps, ISO 8601 dates, or time periods like `2h`), but these can mostly be handled using Python functions such as those from the [datetime module](https://docs.python.org/3/library/datetime.html).

### Using kolumny as input to gnuplot

### Using kolumny to generate fits and plots via gnuplot

```
kolumny \
  --gnuplot_begin 'set terminal png' \
  --gnuplot1_begin 'set output "chart1.png"' \
  --gnuplot2_begin 'set output "chart2.png"' \
  --gnuplot_begin 'f(x) = k1*x + k2' \
  --gnuplot_begin 'g(x) = A*x + B' \
  "file1" using y1a:=3,y2a:=4,~xa:=1 \
  "file2" using y1b:=3,y2b:=4,xb:=1 \
  ":~check(xa==xb)" \
  ":diff1=y1a-y1b" \
  ":diff2=y2a-y2b" \
  --gnuplot1_fit ":xa,y1" "f(x)" via "k1,k2" \
  --gnuplot2_fit ":xa,diff1" "g(x)" via "A,B" \
  --gnuplot1_plot "f(x)" ":xa,y2b" ":xa,y1a" ";" \
  --gnuplot2_plot ":xa,diff2" title "Data2" "g(x)" title 'Fit2' ";" \
  --end 'print("A: %f B: %f " % (A, B))'
```

Performs some fitting and plotting on multiple files.

### Performance

`kolumny` can process about 450000 lines per second on a modern desktop-class CPU when using only simple features and processing files with only a few columns.

When using more complex processing, complex expressions, vectors and dozens of columns in total (across all files), this will usually drop to about 25000-50000 lines per second in practical scenarios. So, with each file having 500 rows of data, one can process about 50-100 files per second. This is just an example.

The memory usage shouldn't exceed a few megabytes, even when processing extremely large files (~10 gigabytes and more).
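One common way to keep memory use independent of input size is to stream rows one at a time and keep only small accumulators. The sketch below illustrates that pattern with a running sum in plain Python; whether `kolumny` is implemented exactly this way is an assumption, and `running_sums` is a hypothetical helper.

```python
import io


def running_sums(lines):
    """Yield (value, cumulative_sum) for the first column of each line.

    Hypothetical helper for illustration; not kolumny's actual code.
    Only the current line and one accumulator are held in memory.
    """
    total = 0.0
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        value = float(fields[0])
        total += value
        yield value, total


# Made-up input; a file object returned by open() streams the same way.
sample = io.StringIO("1\n2\n3\n4\n")
for value, total in running_sums(sample):
    print(value, total)
```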
There is plenty of room for performance improvements in `kolumny`, and it is believed that it can process more than 1 million lines per second with suitable optimizations, without switching to another programming language.

`kolumny` is single threaded, and will use at most one CPU core of your processor. If you are processing massive amounts of data, either split the input data into multiple smaller files, process them in parallel, and then join the results back using `cat`; or, if you are processing many data sets that take hours to process to begin with, process them in parallel instead. Good tools for doing so are [GNU parallel](https://www.gnu.org/software/parallel/), `xargs`, `make`, `find -exec`, or simply the `&` operator in the shell for a small number of files. See [parallel alternatives](https://www.gnu.org/software/parallel/parallel_alternatives.html#DIFFERENCES-BETWEEN-pyargs-AND-GNU-Parallel) for some more (but not all) alternatives (be aware of possible bias in this document, as it is written by the GNU parallel developers).

### Crazy

```
kolumny \
  "file1" using a:=1,~c:=3...7 \
  "file2" s 11 u ~b:=1,d:=2 \
  "#file3" skip 21 using ~e:=1,g:=2 \
  " = 10" \
  ":'somestring'" \
  ":1e3*xy" \
  ":4,5,b" \
  ":4,5,c" \
  ":~xy:=x+y" \
  "file4" skip 11 u 3
```

Will output lines of the form:

```
a d g h S=sum(c) d+g*h sqrt(d)-c[3]*S z True/False somestring 1e3*xy=1000*(x+y) (4,5,b) (4,5,[c1,...]) column-3-of-file4
```

Notes:

- order of files and expressions intermixed
- multiple columns and files read at once
- silent checks (asserts) that some columns between input files are equal
- unprinted columns using `~`, both in expressions and in file inputs
- forward references to variables `x` and `xy` before they are defined
- comments (disabled files and expressions) using `#`
- Python expressions, including math operations, checks, tuple/array printing
- vector columns like `c:=3...7`.
  `c[3]` means 3+3+1=7th column from "file1"
- usage of subcommands to generate data and remove empty lines using `grep`
- usage of the same file multiple times
- short forms of `using` (`u`) and `skip` (`s`)

Installing
----------

Either clone this git repository (`git clone https://github.com/baryluk/kolumny.git`) and add it to your `PATH` (for example using `export PATH=~/kolumny:$PATH` at the end of your `~/.bashrc` file), or [download the main executable](https://raw.githubusercontent.com/baryluk/kolumny/master/kolumny) and put it, for example, in your `~/bin/` directory (make sure it is in your `PATH`).

The only required dependency is Python 3. If you are running Linux you probably already have it installed.

To test quickly whether it works, execute in a terminal:

```
kolumny :x:=3 :y:=x**x
```

You should see this output:

```
3 27
```

Future work
-----------

The majority of future work on `kolumny` will focus on bug fixing, tests and adding some additional features like:

* automatic row alignment, to eliminate manual `skip` and `check`
* row interpolation
* subsampling similar to `set sample` and `every` in `gnuplot`
* computation of statistics across rows
* multi-pass algorithms and data caching from sub-processes

Contributing
------------

Simply open an issue (bug, feature request) or a pull request via this GitHub project.

If possible, please provide all input files and the command line needed to reproduce the problem, and try to keep it as minimal as possible.

Authors and License
-------------------

* Witold Baryluk

This project is licensed under the [BSD License](https://choosealicense.com/licenses/bsd-3-clause/).

Copyright, Witold Baryluk - 2010, 2012, 2018.