0.134.0
Released: 2024-09-10 20:11:27
Latest release of jqnatividad/qsv: 0.134.0 (2024-09-10 20:11:27)
qsv pro v1 is here! 🎉
If you've been using qsv for a while, even if you're a command-line ninja, you'll find a lot of new capabilities in qsv pro that can make your data wrangling experience even better!
Apart from making qsv easier to use, qsv pro has a multitude of features including: view interactive data tables; browse stats/frequency/metadata; run recipes and tools (scripts); run Polars SQL queries; use Natural Language queries (using Retrieval Augmented Generation (RAG) techniques); regular expression search; export to multiple file formats; download/upload from/to compatible CKAN instances; design custom node-based flows and data pipelines; interact with a local API from external programs including the qsv pro command; run various qsv commands in a graphical user interface; and the list goes on!
And that's just the beginning; there's more to come! You just have to try it!
Download qsv pro v1 now at qsvpro.dathere.com.
Other highlights include:
- `pro`: new command to allow qsv to interact with the qsv pro API to tap into qsv pro exclusive features.
- `lens`: new command to interactively view CSVs using the csvlens crate.
- The ludicrously fast `diff` command is now easier to use with its `--drop-equal-fields` option. @janriemer continues to work on his `csv-diff` crate, and there are more `diff` UX improvements coming soon!
- `stats` adds `sum_length` and `avg_length` "streaming" statistics in addition to the existing `min_length` and `max_length` metrics. These are especially useful for datasets with a lot of "free text" columns.
- `stats` also got "smarter" and "faster" by dog-fooding its own statistics to make it run faster!

  It's a little complicated, but the way `stats` works is that it first compiles the "streaming" statistics on the fly as it loads the data in parallel across several threads, and the more expensive advanced statistics are "lazily" computed at the end.

  Since we now compile "sort order" in a streaming manner, we use this info when deriving cardinality at the end to see if we can skip sorting, an otherwise necessary step because cardinality is computed by "scanning" all the sorted values of a column. Every time two neighboring values in a sorted column differ, the cardinality count is incremented.

  Apart from this "sort order" optimization, we also improved the "cardinality scan" algorithm, halving its memory footprint and making it faster still for larger datasets by parallelizing the computation. This, in turn, makes the `frequency` command faster and more memory efficient.

  Thanks to performance tweaks like these, despite adding six metrics (`is_ascii`, `sort_order`, `sum_length`, `avg_length`, `sem` - standard error of the mean & `cv` - coefficient of variation) in recent releases, `stats` is still able to compile 35 statistics and do GUARANTEED data type inferences of a million row, 41 column, 520 MB sample of NYC's 311 data in 1.327 seconds (753,580 records per second)![^1]
- We now also use our own fork of the `csv` crate, featuring SIMD-accelerated UTF-8 validation and other minor perf tweaks, making the entire qsv suite faster still!
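
To make the idea of "streaming" statistics more concrete, here is a minimal Python sketch of a single-pass accumulator that tracks the length metrics and the sort order of one column. It is only an illustration of the technique, not qsv's Rust implementation: the class name `StreamingLengthStats`, the choice to measure lengths in UTF-8 bytes, and the `"Ascending"`/`"Descending"`/`"Unsorted"` labels are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamingLengthStats:
    """Hypothetical single-pass accumulator for one column (not qsv's code)."""

    count: int = 0
    min_length: Optional[int] = None
    max_length: Optional[int] = None
    sum_length: int = 0
    ascending: bool = True   # stays True while no value is smaller than its predecessor
    descending: bool = True  # stays True while no value is larger than its predecessor
    _prev: Optional[str] = None

    def add(self, value: str) -> None:
        # Assumption: length is measured in UTF-8 bytes.
        n = len(value.encode("utf-8"))
        self.count += 1
        self.sum_length += n
        self.min_length = n if self.min_length is None else min(self.min_length, n)
        self.max_length = n if self.max_length is None else max(self.max_length, n)
        # Sort order only needs the previous value, so it can be tracked on the fly.
        if self._prev is not None:
            if value < self._prev:
                self.ascending = False
            elif value > self._prev:
                self.descending = False
        self._prev = value

    @property
    def avg_length(self) -> float:
        return self.sum_length / self.count if self.count else 0.0

    @property
    def sort_order(self) -> str:
        # Labels are illustrative; an all-equal or empty column reports "Ascending".
        if self.ascending:
            return "Ascending"
        if self.descending:
            return "Descending"
        return "Unsorted"


stats = StreamingLengthStats()
for v in ["apple", "banana", "cherry"]:
    stats.add(v)
print(stats.min_length, stats.max_length, stats.sum_length, stats.sort_order)
# prints: 5 6 17 Ascending
```

Every field is updated from the current value and at most one previous value, which is why metrics like these can be computed while the data is still being read, without holding the whole column in memory.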
[^1]: see the `stats_everything_index` benchmark
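
The "cardinality scan" and the "sort order" shortcut described in the `stats` highlight can be sketched in a few lines of Python. This is a simplified, sequential illustration under stated assumptions, not the actual qsv algorithm (which is written in Rust and parallelized for larger datasets): the function name `cardinality`, its `sort_order` parameter, and the order labels are hypothetical, and treating a descending column as already "sorted enough" is an assumption of this sketch.

```python
from typing import Sequence


def cardinality(values: Sequence[str], sort_order: str = "Unsorted") -> int:
    """Count distinct values by comparing neighbors in sorted order (illustrative)."""
    if not values:
        return 0
    # If a streaming pass already established that the column is ordered,
    # equal values are guaranteed to be adjacent and the sort can be skipped.
    if sort_order in ("Ascending", "Descending"):
        ordered = values
    else:
        ordered = sorted(values)
    count = 1  # the first value is always distinct
    prev = ordered[0]
    for curr in ordered[1:]:
        # Each change between neighboring values marks a new distinct value.
        if curr != prev:
            count += 1
        prev = curr
    return count


print(cardinality(["b", "a", "b", "c", "a"]))               # prints: 3
print(cardinality(["a", "a", "b", "c", "c"], "Ascending"))  # prints: 3
```

The scan itself is linear in the number of values, so the sort dominates the cost on unsorted columns; knowing the sort order ahead of time removes that step entirely.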
Added
- `pro`: add `qsv pro` command to interact with qsv pro API by @rzmk in https://github.com/jqnatividad/qsv/pull/2039
- `lens`: new command to interactively view CSVs using the csvlens crate https://github.com/jqnatividad/qsv/pull/2117
- `apply`: add crc32 operation https://github.com/jqnatividad/qsv/pull/2121
- `count`: add --delimiter option https://github.com/jqnatividad/qsv/pull/2120
- `diff`: add flag `--drop-equal-fields` by @janriemer in https://github.com/jqnatividad/qsv/pull/2114
- `stats`: add `sum_length` and `avg_length` columns https://github.com/jqnatividad/qsv/pull/2113
- `stats`: smarter cardinality computation - added new parallel algorithm for large datasets (10,000+ rows) and updated sequential algorithm for smaller datasets https://github.com/jqnatividad/qsv/commit/4e63fec61a394ef2ddfa499c0cdd0958e677ad17
Changed
- `count`: added comment to justify magic number https://github.com/jqnatividad/qsv/commit/5241e3972c05f024a0791be04632d03a06b2f9ce
- `stats`: use simdjson for faster JSONL parsing; micro-optimize `compute` hot loop https://github.com/jqnatividad/qsv/commit/0e8b73451999a3e95bfd52246b1088aecd64b88f
- `stats`: standardized OVERFLOW and UNDERFLOW messages https://github.com/jqnatividad/qsv/commit/38c61285704e5064a63c9dbb1ac866f18fa130fd
- `sort`: renamed symbol to eliminate devskim lint false positive warning https://github.com/jqnatividad/qsv/commit/12db7397f68d3199e3311f402d5c7afed586b88c
- enable `lens` feature in GH workflows https://github.com/jqnatividad/qsv/pull/2122
- `deps`: bump polars 0.42.0 to latest upstream at time of release https://github.com/jqnatividad/qsv/commit/3c17ed12c3c763d644d9713afcc8442964f22de3
- `deps`: use our own optimized fork of csv crate, with simdutf8 validation and other minor perf tweaks https://github.com/jqnatividad/qsv/commit/e4bcd7123172fa8d8094c305d7780e151c120db1
- build(deps): bump serde from 1.0.209 to 1.0.210 by @dependabot in https://github.com/jqnatividad/qsv/pull/2111
- build(deps): bump serde_json from 1.0.127 to 1.0.128 by @dependabot in https://github.com/jqnatividad/qsv/pull/2106
- build(deps): bump qsv-stats from 0.19.0 to 0.22.0 https://github.com/jqnatividad/qsv/pull/2107 https://github.com/jqnatividad/qsv/pull/2112 https://github.com/jqnatividad/qsv/commit/cb1eb60a0a9fb3b9ba381183a2c29909f82efa42
- apply select clippy lint suggestions
- updated several indirect dependencies
- made various doc and usage text improvements
Fixed
- `schema`: Print an error if the `qsv stats` invocation fails by @abrauchli in https://github.com/jqnatividad/qsv/pull/2110
New Contributors
- @abrauchli made their first contribution in https://github.com/jqnatividad/qsv/pull/2110
Full Changelog: https://github.com/jqnatividad/qsv/compare/0.133.1...0.134.0
1. qsv-0.134.0-aarch64-apple-darwin.zip 146.46 MB
2. qsv-0.134.0-aarch64-unknown-linux-gnu.zip 37.1 MB
3. qsv-0.134.0-geocode-index.bincode 14.27 MB
4. qsv-0.134.0-geocode-index.bincode.cities15000 14.27 MB
5. qsv-0.134.0-geocode-index.bincode.cities15000.sz 5.65 MB
6. qsv-0.134.0-x86_64-apple-darwin.zip 111.89 MB
7. qsv-0.134.0-x86_64-pc-windows-gnu.zip 72.73 MB
8. qsv-0.134.0-x86_64-pc-windows-msvc.zip 162.86 MB
9. qsv-0.134.0-x86_64-unknown-linux-gnu.zip 241.36 MB
10. qsv-0.134.0-x86_64-unknown-linux-musl.zip 91.16 MB
11. qsv-0.134.0.msi 38.45 MB