SQL and text files

Sat, 01 Jul 2017 01:39:02 GMT

In April a friend introduced me to csvkit, a suite of command line tools for manipulating CSV files, including doing SQL queries against them, and that sounded cool so I made a note. A bit later, friend Z Facebooked about q, which is the worst software name ever, which also ran queries against CSV files. I made another note.

My use case is my finances, which I'd been keeping in ad hoc text files like "May2015", with some awk scripts to sum up categories in a month, and crosscheck that the overall sum matched the sum of all categories, to detect miscategorization. It worked well for that task but wasn't very flexible, and late last year I had the idea of finally going to 'proper' software. At first I assumed a spreadsheet, because spreadsheets = finances, right? But then I realized that for the queries I wanted to do, SQL was more appropriate.

So I wrote a Python script to convert my years of files into one big CSV files, with date broken down into year and day for easy queries, and my text tags converted into a category column. Then I imported it into MySQL and it was good.

But what about going forward? I spend more, and make new text files... making notes in the full format (date, year, month, day, amount, category, notes) is a pain, and I kept forgetting how to import more into MySQL, and I just let things slide.

Last night I decided to get back to it, as part of checking my spending and savings, and checked out the old tools, with this year's spending in a simpler (date, amount, notes) CSV file.

Both programs work, and I figured out sqlite for extracting month on the fly (so I can group sums by month, or compare power spending across all Junes, say.) Sample queries:

q -H -d, "select sum(amount) from ./mon where code like '%rent%'"

q -H -d, "select strftime('%m', date) as month, sum(amount) from ./mon where code like '%transport%' group by month"

csvsql --query "select Year, sum(amount) from money2 where Month='06' group by year" money2.csv
#that's against the more complex CSV

How do they compare? Probably the more important is that q is way faster, perceptually instantaneous on a 7000+ line file, while csvsql has notable startup time. Both are Python, but csvkit also requires Java, so maybe it's starting a JVM in the background.

q is much lighter, an 1800 line Python program; csvkit has a long dependency list. I tried using the Arch AUR package, but don't have an AUR dependency tracer, so ended up using 'pip install csvkit' instead.

q needs to be told that the CSV file is actually comma separated, not space-separated, and has a header; OTOH csvsql needs to be told if you want to do a query, and the file you're querying.

It looks like both only do SELECT, not UPDATE; I'd wanted to do UPDATE in cleaning up my booklog CSV file but ended up resorting to another Python script. (After trying to push everything into a real sqlite database, but failing to get the weird CSV imported correctly.)

q only does queries; csvsql does more, I dunno exactly.

q has a man page, csvkit docs are entirely online.

I'll probably be using q.

Why not use an actual database? Mostly to cut out steps: new expenditures or books read are easy to update in a text file, and if I can treat that as a database, I don't need a step to update some other DB.

mysql felt heavy and clunky, though thanks to work I now know about the '~/.my.cnf' file which can store authentication. You still need a mysqld up. sqlite3 can run directly off a file and is certainly worth considering -- though as noted, I never got it actually working.

comments

Rich and Strange Aeons

SQL and text files