mindstalk: (Default)
At work today, Boss suggested I look at sqlite a bit, since our client code uses it. What I thought might be a brief glance turned into hours of reading, as it became rather fascinating. For those who don't know, it's an embedded SQL database, with not much code, unlike the client/server databases of Oracle or anything else you've probably heard of. As their docs put it, they're not competing with such databases, they're competing with fopen() and other filesystem access.

They call their testing "aviation grade", possibly without hyperbole: 100% branch coverage, and beyond that 100% MC/DC (modified condition/decision coverage, a stronger criterion); over 700x more testing code than actual library code, much of which generates tests parametrically... it sounds pretty nuts. They worship Valgrind but find compiler warnings mostly useless: historically, driving the warning count to zero introduced more bugs than it fixed. https://www.sqlite.org/testing.html

They claim "billions and billions of deployments", which sounded like humorous hyperbole until they added that it's on every iPhone or Android phone, every Mac or Windows 10 machine, every major browser install... There are over 2 billion smartphones, so just from the phone OS and the phone browser, you've got 4 billion installs...

They also make a pitched case for considering a sqlite database any time you'd be considering some complex file format. With almost no code to write, you get consistency, robustness, complex queries, machine and language independence, and at least some ability to do partial writes[1], compared to throwing a bunch of files into a zipfile.
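To get a feel for how little code that is, here's a minimal Python sketch of "SQLite as application file format", using the standard sqlite3 module; the file name and schema are invented for illustration:

```python
# Minimal sketch of using a SQLite file as an application file format.
# The file name and schema here are invented for illustration.
import sqlite3

con = sqlite3.connect("doc.sqlite")  # the whole "document" is one file
con.execute("CREATE TABLE IF NOT EXISTS meta (key TEXT PRIMARY KEY, value)")
con.execute("CREATE TABLE IF NOT EXISTS pages (n INTEGER PRIMARY KEY, body TEXT)")
with con:  # one transaction: both writes land, or neither does
    con.execute("INSERT OR REPLACE INTO meta VALUES ('title', 'My Doc')")
    con.execute("INSERT OR REPLACE INTO pages VALUES (1, 'first page text')")
# partial read: pull one page without parsing the rest of the file
body = con.execute("SELECT body FROM pages WHERE n = 1").fetchone()[0]
print(body)
con.close()
```

The `with con:` block is where the robustness claim lives: a crash mid-write rolls back to the last committed state, which a zipfile of parts doesn't give you for free.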


They also had a nicely educational description of their rollback and write-ahead models. https://www.sqlite.org/atomiccommit.html
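Both commit models that page describes are selectable per database; a quick Python check (the file name is invented):

```python
# The journal mode from the atomic-commit docs is switchable via PRAGMA.
import sqlite3

con = sqlite3.connect("journal_demo.sqlite")
# 'delete' on a fresh database: the rollback-journal model
default_mode = con.execute("PRAGMA journal_mode").fetchone()[0]
print(default_mode)
# switch to the write-ahead-log model
wal_mode = con.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(wal_mode)
con.close()
```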

[1] I do wonder about this. One odd thing about sqlite is a looseness about types, and AIUI cramming numeric values into the smallest range that will hold them. So I'd think that if you UPDATED a value 100 to a value 1000000000000, you'd have to shuffle the trailing part of the file, compared to a format that e.g. reserved 8 bytes for a numeric type. But maybe they do buffer numeric or string storage. And not having to write the whole file, or not having to read the whole file (e.g. to decompress it) seem like at least partial wins.
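For what it's worth, that guess matches the documented record format: integers are stored in 1, 2, 3, 4, 6, or 8 bytes depending on magnitude (https://www.sqlite.org/fileformat.html). The type looseness itself is easy to see from Python:

```python
# SQLite types the value, not the column: a declared type is only an
# "affinity", and a column can end up holding anything.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.execute("INSERT INTO t VALUES (100)")
con.execute("INSERT INTO t VALUES (1000000000000)")
con.execute("INSERT INTO t VALUES ('not a number')")  # accepted anyway
types = [row[1] for row in con.execute("SELECT x, typeof(x) FROM t")]
print(types)  # ['integer', 'integer', 'text']
```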
In April a friend introduced me to csvkit, a suite of command line tools for manipulating CSV files, including doing SQL queries against them, and that sounded cool so I made a note. A bit later, friend Z posted on Facebook about q (possibly the worst software name ever), which also runs SQL queries against CSV files. I made another note.

My use case is my finances, which I'd been keeping in ad hoc text files like "May2015", with some awk scripts to sum up categories in a month, and crosscheck that the overall sum matched the sum of all categories, to detect miscategorization. It worked well for that task but wasn't very flexible, and late last year I had the idea of finally going to 'proper' software. At first I assumed a spreadsheet, because spreadsheets = finances, right? But then I realized that for the queries I wanted to do, SQL was more appropriate.

So I wrote a Python script to convert my years of files into one big CSV file, with the date broken down into year, month, and day for easy queries, and my text tags converted into a category column. Then I imported it into MySQL and it was good.
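For flavor, a sketch of that kind of converter. The input format here (files named like "May2015" with lines of "day amount category notes") is my invention for illustration, not necessarily what the original files looked like:

```python
# Hypothetical converter: assumes files named like "May2015" whose lines
# look like "12 34.50 rent monthly rent" (day, amount, category, notes).
import csv
import os
from datetime import datetime

def rows_from(path):
    name = os.path.basename(path)            # e.g. "May2015"
    dt = datetime.strptime(name, "%b%Y")     # month name + year
    for line in open(path):
        parts = line.split(None, 3)
        if len(parts) < 3:
            continue                         # skip blank/malformed lines
        day, amount, category = parts[:3]
        notes = parts[3].strip() if len(parts) > 3 else ""
        date = f"{dt.year}-{dt.month:02d}-{int(day):02d}"
        yield [date, dt.year, dt.month, int(day), amount, category, notes]

def convert(paths, out):
    w = csv.writer(out)
    w.writerow(["date", "year", "month", "day", "amount", "category", "notes"])
    for path in paths:
        w.writerows(rows_from(path))
```

Something like `convert(glob.glob("???20??"), open("money.csv", "w"))` would then build the one big file.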

But what about going forward? I spend more, and make new text files... making notes in the full format (date, year, month, day, amount, category, notes) is a pain, and I kept forgetting how to import more into MySQL, and I just let things slide.

Last night I decided to get back to it, as part of checking my spending and savings, and checked out the old tools, with this year's spending in a simpler (date, amount, notes) CSV file.

Both programs work, and I figured out sqlite's strftime() for extracting the month on the fly (so I can group sums by month, or compare, say, power spending across all Junes). Sample queries:

q -H -d, "select sum(amount) from ./mon where code like '%rent%'"

q -H -d, "select strftime('%m', date) as month, sum(amount) from ./mon where code like '%transport%' group by month"

csvsql --query "select Year, sum(amount) from money2 where Month='06' group by year" money2.csv
#that's against the more complex CSV

How do they compare? Probably the most important difference is that q is way faster, perceptually instantaneous on a 7000+ line file, while csvsql has noticeable startup time. Both are Python, but csvkit also requires Java, so maybe it's starting a JVM in the background.

q is much lighter, an 1800 line Python program; csvkit has a long dependency list. I tried using the Arch AUR package, but don't have an AUR dependency tracer, so ended up using 'pip install csvkit' instead.

q needs to be told that the CSV file is actually comma separated, not space-separated, and has a header; OTOH csvsql needs to be told if you want to do a query, and the file you're querying.

It looks like both only do SELECT, not UPDATE; I'd wanted to do UPDATE in cleaning up my booklog CSV file but ended up resorting to another Python script. (After trying to push everything into a real sqlite database, but failing to get the weird CSV imported correctly.)
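Lacking UPDATE, the read-modify-rewrite pattern in Python is short enough; a sketch (the column names in the usage are invented examples):

```python
# UPDATE-equivalent for a CSV file: read every row, change the matches,
# write the whole file back.
import csv

def csv_update(path, where, changes):
    """where and changes are dicts mapping column name -> value."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        fields = reader.fieldnames
        rows = list(reader)
    for row in rows:
        if all(row[k] == v for k, v in where.items()):
            row.update(changes)
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```

Used like `csv_update("booklog.csv", {"title": "Dune"}, {"status": "read"})`, hypothetically.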

q only does queries; csvkit does more: the suite includes tools like in2csv, csvcut, csvstat, and csvjoin, though I haven't explored them.

q has a man page, csvkit docs are entirely online.

I'll probably be using q.

Why not use an actual database? Mostly to cut out steps: new expenditures or books read are easy to update in a text file, and if I can treat that as a database, I don't need a step to update some other DB.

mysql felt heavy and clunky, though thanks to work I now know about the '~/.my.cnf' file which can store authentication. You still need a mysqld up. sqlite3 can run directly off a file and is certainly worth considering -- though as noted, I never got it actually working.
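If I ever revisit that, doing the import from Python rather than sqlite3's .import command would give full control over the CSV parsing; a sketch:

```python
# Sketch: load a CSV into a sqlite3 database, letting Python's csv
# module handle the quoting that can trip up a raw shell import.
import csv
import sqlite3

def load_csv(csv_path, db_path, table):
    con = sqlite3.connect(db_path)
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{h}"' for h in header)
        marks = ", ".join("?" * len(header))
        con.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
        with con:  # commit all rows in one transaction
            con.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    return con
```

Then `load_csv("mon.csv", "money.db", "mon")` would leave a queryable database file behind.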
