Sample CSV data file:
POKI_RUN_COMMAND{{cat example.csv}}HERE
mlr cat is like cat ...
POKI_RUN_COMMAND{{mlr --csv cat example.csv}}HERE
... but it can also do format conversion (here, to pretty-printed tabular format):
POKI_RUN_COMMAND{{mlr --icsv --opprint cat example.csv}}HERE
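The same idea works for any other output format. For example, something like this should give TSV (tab-separated values):

mlr --icsv --otsv cat example.csv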
mlr head and mlr tail count records rather than lines. The CSV
header is included either way:
POKI_RUN_COMMAND{{mlr --csv head -n 4 example.csv}}HERE
POKI_RUN_COMMAND{{mlr --csv tail -n 4 example.csv}}HERE
Sort primarily alphabetically on one field, then secondarily
numerically descending on another field:
POKI_RUN_COMMAND{{mlr --icsv --opprint sort -f shape -nr index example.csv}}HERE
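Sorting numerically ascending works the same way; for example, something along these lines:

mlr --icsv --opprint sort -nf quantity example.csv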
Use cut to retain only specified fields, in input-data order:
POKI_RUN_COMMAND{{mlr --icsv --opprint cut -f flag,shape example.csv}}HERE
Use cut -o to retain only specified fields, in your specified order:
POKI_RUN_COMMAND{{mlr --icsv --opprint cut -o -f flag,shape example.csv}}HERE
Use cut -x to omit specified fields:
POKI_RUN_COMMAND{{mlr --icsv --opprint cut -x -f flag,shape example.csv}}HERE
Use filter to retain specified records:
POKI_RUN_COMMAND{{mlr --icsv --opprint filter '$color == "red"' example.csv}}HERE
POKI_RUN_COMMAND{{mlr --icsv --opprint filter '$color == "red" && $flag == 1' example.csv}}HERE
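filter also has a -x flag to keep everything except the matching records; for example, something like:

mlr --icsv --opprint filter -x '$color == "red"' example.csv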
Use put to add/replace fields which are computed from other fields:
POKI_RUN_COMMAND{{mlr --icsv --opprint put '$ratio = $quantity / $rate; $color_shape = $color . "_" . $shape' example.csv}}HERE
JSON output:
POKI_RUN_COMMAND{{mlr --icsv --ojson put '$ratio = $quantity/$rate; $shape = toupper($shape)' example.csv}}HERE
JSON output with vertical-formatting flags:
POKI_RUN_COMMAND{{mlr --icsv --ojson --jvstack --jlistwrap tail -n 2 example.csv}}HERE
Use then to pipe commands together. Also, the
-g option for many Miller commands is for group-by: here, head -n
1 -g shape outputs the first record for each distinct value of the
shape field. Since we've just sorted by index descending, this means we're
finding the record with the highest index for each distinct shape:
POKI_RUN_COMMAND{{mlr --icsv --opprint sort -f shape -nr index then head -n 1 -g shape example.csv}}HERE
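Chains can be as long as you like. For instance, something along these lines filters, computes a new field, and then sorts on it:

mlr --icsv --opprint filter '$color == "red"' then put '$ratio = $quantity / $rate' then sort -nr ratio example.csv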
Statistics can be computed with or without group-by field(s). Also, the first of these
examples uses the --oxtab output format, which is a nice alternative to --opprint when you
have lots of columns:
POKI_RUN_COMMAND{{mlr --icsv --oxtab --from example.csv stats1 -a p0,p10,p25,p50,p75,p90,p99,p100 -f rate}}HERE
POKI_RUN_COMMAND{{mlr --icsv --opprint --from example.csv stats1 -a count,min,mean,max -f quantity -g shape}}HERE
POKI_RUN_COMMAND{{mlr --icsv --opprint --from example.csv stats1 -a count,min,mean,max -f quantity -g shape,color}}HERE
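You can also ask for several value fields at once; for example, something like:

mlr --icsv --opprint --from example.csv stats1 -a mean,sum -f quantity,rate -g shape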
Choices for printing to files
Often we want to print output to the screen. Miller does this by default, as we’ve
seen in the previous examples.
Sometimes we want to print output to another file: just use '>
outputfilenamegoeshere' at the end of your command:
% mlr --icsv --opprint cat example.csv > newfile.csv
# Output goes to the new file;
# nothing is printed to the screen.
% cat newfile.csv
color shape flag index quantity rate
yellow triangle 1 11 43.6498 9.8870
red square 1 15 79.2778 0.0130
red circle 1 16 13.8103 2.9010
red square 0 48 77.5542 7.4670
purple triangle 0 51 81.2290 8.5910
red square 0 64 77.1991 9.5310
purple triangle 0 65 80.1405 5.8240
yellow circle 1 73 63.9785 4.2370
yellow circle 1 87 63.5058 8.3350
purple square 0 91 72.3735 8.2430
Other times we just want our files to be changed in-place: just use 'mlr -I'.
% mlr -I --icsv --opprint cat newfile.csv
% cat newfile.csv
color shape flag index quantity rate
yellow triangle 1 11 43.6498 9.8870
red square 1 15 79.2778 0.0130
red circle 1 16 13.8103 2.9010
red square 0 48 77.5542 7.4670
purple triangle 0 51 81.2290 8.5910
red square 0 64 77.1991 9.5310
purple triangle 0 65 80.1405 5.8240
yellow circle 1 73 63.9785 4.2370
yellow circle 1 87 63.5058 8.3350
purple square 0 91 72.3735 8.2430
Also, using mlr -I you can bulk-operate on lots of files, e.g.:
mlr -I --csv cut -x -f unwanted_column_name *.csv
If you like, you can first copy off your original data somewhere else, before doing in-place operations.
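For example, something along these lines (the backup directory here is just a placeholder):

cp *.csv /path/to/backups/
mlr -I --csv cut -x -f unwanted_column_name *.csv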
Lastly, using tee within put, you can split your input data into separate files
per one or more field names:
POKI_RUN_COMMAND{{mlr --csv --from example.csv put -q 'tee > $shape.".csv", $*'}}HERE
POKI_RUN_COMMAND{{cat circle.csv}}HERE
POKI_RUN_COMMAND{{cat square.csv}}HERE
POKI_RUN_COMMAND{{cat triangle.csv}}HERE
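Splitting on more than one field name works the same way, with the filenames built up by concatenation. For example, something like this would write files such as red_circle.csv, yellow_triangle.csv, and so on, for whatever combinations appear in your data:

mlr --csv --from example.csv put -q 'tee > $color."_".$shape.".csv", $*'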
Other-format examples
What’s a CSV file, really? It’s an array of rows, or
records, each being a list of key-value pairs, or fields: for CSV
it so happens that all the keys are shared in the header line and the values
vary data line by data line.
For example, if you have a data line like shape=circle,flag=1,index=24, then each value carries its key right along with it on the same line.
Data written this way are called DKVP, for delimited key-value pairs.
We’ve also already seen other ways to write the same data:
CSV:
shape,flag,index
circle,1,24
square,0,36

PPRINT:
shape  flag index
circle 1    24
square 0    36

DKVP:
shape=circle,flag=1,index=24
shape=square,flag=0,index=36

XTAB:
shape circle
flag  1
index 24

shape square
flag  0
index 36

JSON:
[
  {
    "shape": "circle",
    "flag": 1,
    "index": 24
  },
  {
    "shape": "square",
    "flag": 0,
    "index": 36
  }
]
Anything we can do with CSV input data, we can do with input data in any
other format. And you can read from one format, do any
record-processing, and write back out either in the same format as the input or in a
different output format.
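For example, something like either of these reads the same CSV and writes it back out as DKVP or XTAB:

mlr --icsv --odkvp cat example.csv
mlr --icsv --oxtab cat example.csv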
SQL-output examples
I like to produce SQL-query output with column headers and a tab delimiter:
this is CSV but with a tab instead of a comma, also known as TSV. Then I
post-process with mlr --tsv or mlr --tsvlite. This
means I can do some (or all, or none) of my data processing within SQL queries,
and some (or none, or all) of my data processing using Miller — whichever
is most convenient for my needs at the moment.
For example, using default output formatting in mysql we get
formatting like Miller’s --opprint --barred:
$ mysql --database=mydb -e 'show columns in mytable'
+------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------+--------------+------+-----+---------+-------+
| id | bigint(20) | NO | MUL | NULL | |
| category | varchar(256) | NO | | NULL | |
| is_permanent | tinyint(1) | NO | | NULL | |
| assigned_to | bigint(20) | YES | | NULL | |
| last_update_time | int(11) | YES | | NULL | |
+------------------+--------------+------+-----+---------+-------+
Using mysql’s -B we get TSV output:
$ mysql --database=mydb -B -e 'show columns in mytable' | mlr --itsvlite --opprint cat
Field Type Null Key Default Extra
id bigint(20) NO MUL NULL -
category varchar(256) NO - NULL -
is_permanent tinyint(1) NO - NULL -
assigned_to bigint(20) YES - NULL -
last_update_time int(11) YES - NULL -
Since Miller reads TSV, we can do as much or as little processing
as we want in the SQL query, then send the rest on to Miller. This includes
outputting as JSON, doing further selects/joins in Miller, doing stats, and so on.
$ mysql --database=mydb -B -e 'select * from mytable' > query.tsv
$ mlr --from query.tsv --t2p stats1 -a count -f id -g category,assigned_to
category assigned_to id_count
special 10000978 207
special 10003924 385
special 10009872 168
standard 10000978 524
standard 10003924 392
standard 10009872 108
...
Again, all the examples in the CSV section apply here — just change the input-format
flags.
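For instance, to get JSON out of the query results, something like:

mlr --itsv --ojson head -n 2 query.tsv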
SQL-input examples
One use of NIDX (value-only, no keys) format is for loading up SQL tables.
Create and load SQL table:
mysql> CREATE TABLE abixy(
  a VARCHAR(32),
  b VARCHAR(32),
  i BIGINT(10),
  x DOUBLE,
  y DOUBLE
);
Query OK, 0 rows affected (0.01 sec)
bash$ mlr --onidx --fs comma cat data/medium > medium.nidx
mysql> LOAD DATA LOCAL INFILE 'medium.nidx' REPLACE INTO TABLE abixy FIELDS TERMINATED BY ',' ;
Query OK, 10000 rows affected (0.07 sec)
Records: 10000 Deleted: 0 Skipped: 0 Warnings: 0
mysql> SELECT COUNT(*) AS count FROM abixy;
+-------+
| count |
+-------+
| 10000 |
+-------+
1 row in set (0.00 sec)
mysql> SELECT * FROM abixy LIMIT 10;
+------+------+------+---------------------+---------------------+
| a | b | i | x | y |
+------+------+------+---------------------+---------------------+
| pan | pan | 1 | 0.3467901443380824 | 0.7268028627434533 |
| eks | pan | 2 | 0.7586799647899636 | 0.5221511083334797 |
| wye | wye | 3 | 0.20460330576630303 | 0.33831852551664776 |
| eks | wye | 4 | 0.38139939387114097 | 0.13418874328430463 |
| wye | pan | 5 | 0.5732889198020006 | 0.8636244699032729 |
| zee | pan | 6 | 0.5271261600918548 | 0.49322128674835697 |
| eks | zee | 7 | 0.6117840605678454 | 0.1878849191181694 |
| zee | wye | 8 | 0.5985540091064224 | 0.976181385699006 |
| hat | wye | 9 | 0.03144187646093577 | 0.7495507603507059 |
| pan | wye | 10 | 0.5026260055412137 | 0.9526183602969864 |
+------+------+------+---------------------+---------------------+
Aggregate counts within SQL:
mysql> SELECT a, b, COUNT(*) AS count FROM abixy GROUP BY a, b ORDER BY count DESC;
+------+------+-------+
| a | b | count |
+------+------+-------+
| zee | wye | 455 |
| pan | eks | 429 |
| pan | pan | 427 |
| wye | hat | 426 |
| hat | wye | 423 |
| pan | hat | 417 |
| eks | hat | 417 |
| pan | zee | 413 |
| eks | eks | 413 |
| zee | hat | 409 |
| eks | wye | 407 |
| zee | zee | 403 |
| pan | wye | 395 |
| wye | pan | 392 |
| zee | eks | 391 |
| zee | pan | 389 |
| hat | eks | 389 |
| wye | eks | 386 |
| wye | zee | 385 |
| hat | zee | 385 |
| hat | hat | 381 |
| wye | wye | 377 |
| eks | pan | 371 |
| hat | pan | 363 |
| eks | zee | 357 |
+------+------+-------+
25 rows in set (0.01 sec)
Aggregate counts within Miller:
$ mlr --opprint uniq -c -g a,b then sort -nr count data/medium
a b count
zee wye 455
pan eks 429
pan pan 427
wye hat 426
hat wye 423
pan hat 417
eks hat 417
eks eks 413
pan zee 413
zee hat 409
eks wye 407
zee zee 403
pan wye 395
hat pan 363
eks zee 357
Pipe SQL output to aggregate counts within Miller:
$ mysql -D miller -B -e 'select * from abixy' | mlr --itsv --opprint uniq -c -g a,b then sort -nr count
a b count
zee wye 455
pan eks 429
pan pan 427
wye hat 426
hat wye 423
pan hat 417
eks hat 417
eks eks 413
pan zee 413
zee hat 409
eks wye 407
zee zee 403
pan wye 395
wye pan 392
zee eks 391
zee pan 389
hat eks 389
wye eks 386
hat zee 385
wye zee 385
hat hat 381
wye wye 377
eks pan 371
hat pan 363
eks zee 357
Log-processing examples
Another of my favorite use-cases for Miller is doing ad-hoc processing of
log-file data. Here's where the DKVP format really shines: for one thing, since the
field names and field values are present on every line, every line stands on
its own. That means you can grep it or what have you. For another, not
every line needs to have the same list of field names (the same "schema").
Again, all the examples in the CSV section apply here — just change
the input-format flags. But there’s more you can do when not all the
records have the same shape.
When you write a program — in any language whatsoever — you can have
it print out log lines as it goes along, with items for various events jumbled
together. After the program has finished running, you can sort it all out,
filter it, analyze it, and learn from it.
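For instance, in a shell script such a print statement might look something like this (the op and hit field names here are just for illustration):

echo "time=$(date +%s),op=cache_lookup,hit=true"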
Suppose your program has printed something like this:
POKI_RUN_COMMAND{{cat log.txt}}HERE
Each print statement simply contains local information: the current
timestamp, whether a particular cache was hit or not, etc. Then using either
the system grep command, or Miller’s having-fields, or
is_present, we can pick out the parts we want and analyze them:
POKI_INCLUDE_AND_RUN_ESCAPED(10-1.sh)HERE
POKI_INCLUDE_AND_RUN_ESCAPED(10-2.sh)HERE
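Or, to keep only the records that have a time field before formatting it, something along these lines:

mlr --opprint having-fields --at-least time then sec2gmt time log.txt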
Alternatively, we can simply group the similar data for a better look:
POKI_RUN_COMMAND{{mlr --opprint group-like log.txt}}HERE
POKI_RUN_COMMAND{{mlr --opprint group-like then sec2gmt time log.txt}}HERE
More
Please see the reference for complete
information, as well as the FAQ and the cookbook for more tips.