I’m pleased to announce the release of zeptodb 3.0.
zeptodb is a small set of tools for working with DBM databases (GDBM, specifically). DBM databases are simple disk-based, key-value databases (like disk-based hash tables). GDBM, for example, comes with a command-line tool for manipulating its databases; however, it is not practical to use in simple scripts. zeptodb aims to fix that by providing simplified tools that work nicely in shell pipelines. For look-ups in large tables, the performance (O(1)) rapidly improves on that of, say, grepping a tab-separated text file (O(n)), but without having to deal with constructing database queries.
Example
A real-life use that I often employ it for is translating between
different biological entity identifiers (genes, proteins, etc.). For
example, the genome assembly of the parasite P. falciparum has
undergone some major updates in the last few years, which resulted in
a complete overhaul of its gene identifiers (from, e.g., PF13_0083 to
PF3D7_1314600). However, some old datasets that you find might still
use the old identifiers. So, we can get an ID mapping from PlasmoDB,
which has a row for each gene, in which the first column is the
current ID and any following columns are previous IDs (there may be
more than one). Let’s convert that first to a two-column file, with
the old ID in the first column and the new ID in the second column:
$ wget http://plasmodb.org/common/downloads/release-28/Pfalciparum3D7/txt/PlasmoDB-28_Pfalciparum3D7_GeneAliases.txt
$ awk '{for(i=2; i<=NF; i++){printf("%s\t%s\n", $i, $1)}}' PlasmoDB-28_Pfalciparum3D7_GeneAliases.txt >pfalc_gene_aliases.tsv
$ head pfalc_gene_aliases.tsv
PF02_0090 PF3D7_0209400
PFB0423c PF3D7_0209400
PF13_0270 PF3D7_1351800.2
MAL5P1.73 PF3D7_0507200
MAL4P2.52 PF3D7_0412000
1791.m00049 PF3D7_0732200
MAL3P7.54 PF3D7_0324800
2277.t00266 PF3D7_1227500
MAL12P1.266 PF3D7_1227500
MAL3P8.16 PF3D7_0301800
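To see what the awk one-liner above does, here is a minimal sketch on a made-up alias table. The IDs (NEW_A, OLD_A1, and so on) are invented placeholders, not real PlasmoDB identifiers. Each alias in columns 2 onwards becomes its own row, paired with the current ID from column 1; a gene with no aliases produces no rows.

```shell
# Toy alias table: current ID first, then any number of previous IDs (all invented).
aliases=$(printf 'NEW_A\tOLD_A1\tOLD_A2\nNEW_B\tOLD_B1\n')

# Same awk program as above: emit one "old<TAB>new" row per alias.
pairs=$(printf '%s\n' "$aliases" |
  awk '{for(i=2; i<=NF; i++){printf("%s\t%s\n", $i, $1)}}')

# Prints three rows: OLD_A1 -> NEW_A, OLD_A2 -> NEW_A, OLD_B1 -> NEW_B.
printf '%s\n' "$pairs"
```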
Next we’ll create our DBM database using the command zdbc.
$ zdbc pfalc_gene_aliases.db
We can now fill it with values using the zdbs command:
$ zdbs --delimiter=' ' pfalc_gene_aliases.db <pfalc_gene_aliases.tsv
OK, we’re ready to put it to use. As a bit of a contrived example, let’s grab a dataset from an old version of PlasmoDB (contrived because they actually update the data files with each release to use the latest identifiers; an actual need usually arises when taking data sets from old publications or other databases).
$ wget http://plasmodb.org/common/downloads/release-7.0/Pfalciparum/transcriptExpression/Pf_Cowman_Invasion_KO/molecular_mechanism_invasion.txt
$ head molecular_mechanism_invasion.txt
ID W2mef EBA175 KO (late T) rep1 W2mef WT (late T) rep1 W2mef/c4/Nm (late T) rep1
MAL13P1.1 3.053661 3.107224 3.248607
MAL13P1.100 2.297358 2.32458 2.290255
MAL13P1.102 4.67976 3.612763 4.019086
MAL13P1.103 3.686802 3.726557 3.685324
MAL13P1.105 2.347508 2.234413 2.078631
MAL13P1.106 3.386187 4.404499 2.485488
MAL13P1.107 2.431189 2.430622 2.514539
MAL13P1.11 2.335451 2.320175 2.212009
MAL13P1.111 3.913259 3.53488 3.478326
I just grabbed this file at random, so I have no idea what the data is. Let’s say that, for whatever reason, we’re interested in the genes for which the second column has a value greater than 12. We can do this:
$ awk '{if (NR==1){next};if ($2 > 12){print $1}}' molecular_mechanism_invasion.txt
PF08_0119
PF11_0040
PF11_0224
PF13_0058
PF14_0016
PFB0120w
PFB0915w
…which isn’t useful because they’re the old IDs. Enter zdbf:
$ awk '{if (NR==1){next};if ($2 > 12){print $1}}' molecular_mechanism_invasion.txt | zdbf pfalc_gene_aliases.db
PF3D7_0805200
PF3D7_1102800
PF3D7_1121600
PF3D7_1310700
PF3D7_1401400
PF3D7_0202500
PF3D7_0220000
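For comparison, the same kind of translation can be approximated without zeptodb using an awk associative array. This is only a sketch of the idea with invented IDs, not a description of how zdbf is implemented. It has to re-read the whole mapping file on every run, which is the cost that a DBM file avoids.

```shell
# Toy old->new mapping written to a temporary file (all IDs invented).
mapping_file=$(mktemp)
printf 'OLD_A1\tNEW_A\nOLD_A2\tNEW_A\nOLD_B1\tNEW_B\n' >"$mapping_file"

# First input (the mapping file) fills the array; second input ("-", i.e.
# standard input) supplies the old IDs to translate, one per line.
translated=$(printf 'OLD_B1\nOLD_A2\n' |
  awk -F'\t' 'NR==FNR { new_id[$1] = $2; next }
              ($1 in new_id) { print new_id[$1] }' "$mapping_file" -)

rm -f "$mapping_file"

# Prints NEW_B then NEW_A, in the order the queries were given.
printf '%s\n' "$translated"
```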
We can go one step further if we’re interested in the actual gene products:
$ wget http://plasmodb.org/common/downloads/release-28/Pfalciparum3D7/fasta/data/PlasmoDB-28_Pfalciparum3D7_AnnotatedProteins.fasta
$ zdbc pfalc_gene_products.db
$ sed -n '/^>/p' PlasmoDB-28_Pfalciparum3D7_AnnotatedProteins.fasta | cut -f1,3 -d'|' | sed 's/product=\(.*\)/\1/' | sed 's/>//;s/ | /|/' | zdbs pfalc_gene_products.db
I just took the annotated proteins file, found all the sequence header
lines (starting with ">") and then did a few manipulations to convert
it to "ID|product" form. These were then stored via zdbs.
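The header manipulation can be traced step by step on a single line. The header below is invented, and its "ID | key=value | key=value" layout is an assumption about the FASTA file rather than something taken from it; the final zdbs step is left out so that the sketch runs without zeptodb.

```shell
# Invented FASTA header; the " | key=value" layout is an assumption.
header='>GENE_0001 | organism=Example_organism | product=example protein, putative | length=100'

# Same steps as the pipeline above, minus the final zdbs:
#   1. keep only header lines, 2. keep fields 1 and 3,
#   3. drop the "product=" label, 4. drop ">" and tidy the separator.
record=$(printf '%s\n' "$header" |
  sed -n '/^>/p' | cut -f1,3 -d'|' | sed 's/product=\(.*\)/\1/' | sed 's/>//;s/ | /|/')

# Brackets make any leading or trailing whitespace visible.
printf '[%s]\n' "$record"
```

With this assumed layout the value keeps a trailing space left over from the " | " separator; adding `s/ *$//` to the last sed expression would trim it.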
Finally, we can redo our previous query, only this time also showing the gene products:
$ awk '{if (NR==1){next};if ($2 > 12){print $1}}' molecular_mechanism_invasion.txt | zdbf pfalc_gene_aliases.db | zdbf -d' ' pfalc_gene_products.db
PF3D7_0805200 gamete release protein, putative (GAMER)
PF3D7_1102800 early transcribed membrane protein 11.2 (ETRAMP11.2)
PF3D7_1121600 exported protein 1 (EXP1)
PF3D7_1310700 RNA-binding protein, putative
PF3D7_1401400 early transcribed membrane protein 14.1 (ETRAMP14)
PF3D7_0202500 early transcribed membrane protein 2 (ETRAMP2)
PF3D7_0220000 liver stage antigen 3 (LSA3)
OK, now, that might look crazy and like a lot of work, but of course you only have to do it once (and of course you’re putting all of that into a Makefile to be easily reproducible). Once it’s made, you can easily pipe IDs into your new database at any point. So, it’s very quick to integrate into pipelines.
Download
News
This is a major update. Please note that there have been significant changes to the interface via several changes to the options.
- Removed support for Kyoto Cabinet. In order to be able to better support the program in the long-term, the decision was made to only support one DBM library.
- Overhaul of options. All zeptodb programs now support a set of common options: --mmap-size, --cache-size, --block-size, --no-mmap, and --no-lock (in addition to the usual help, usage, version and verbose options). These options control how the database is opened. Please note that the --num-buckets option for zdbc has been removed since it was only appropriate for Kyoto Cabinet and was a misnomer for GDBM (it’s effectively been replaced by --cache-size). The --sync option has also been added for commands that write changes to the database. Please see the documentation for a full explanation of these options.
- zdbc always creates a new database. zdbc now overwrites a database if it’s called on an existing file.
- New command zdbi. The new command zdbi prints out some basic information about a database file.