Brandon Invergo

Release of zeptodb 3.0

I’m pleased to announce the release of zeptodb 3.0.

zeptodb is a small set of tools for working with DBM databases (GDBM, specifically). DBM databases are simple disk-based key-value databases (like disk-based hash tables). GDBM comes with a command-line tool for manipulating its databases, but it is not practical to use in simple scripts. zeptodb aims to fix that by providing simplified tools that work nicely in shell pipelines. For look-ups in large tables, a hashed look-up (roughly O(1)) quickly outperforms, say, grepping a tab-separated text file (O(n)), and you don’t have to deal with constructing database queries.
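
To make the O(n) baseline concrete, here is a minimal sketch of a linear-scan look-up over a tab-separated file. This is the approach being compared against, not zeptodb itself; the function name lookup_scan and the three-row table are made up for illustration.

```shell
# Tiny two-column, tab-separated table with made-up keys and values.
tmpdir=$(mktemp -d)
printf 'k1\tv1\nk2\tv2\nk3\tv3\n' > "$tmpdir/table.tsv"

# Linear scan: awk reads lines until the first column equals the key,
# so the work grows with the number of rows that precede the match.
lookup_scan() {
    awk -F'\t' -v key="$1" '$1 == key { print $2; exit }' "$2"
}

lookup_scan k2 "$tmpdir/table.tsv"
```

Every look-up starts again from the top of the file, which is what a hashed key-value store avoids.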

Example

One thing I use it for quite often is translating between different biological entity identifiers (genes, proteins, etc.). For example, the genome assembly of the parasite P. falciparum has undergone some major updates in the last few years, which resulted in a complete overhaul of its gene identifiers (from, e.g., PF13_0083 to PF3D7_1314600). However, some old datasets that you find might still use the old identifiers. So, we can get an ID mapping from PlasmoDB, which has a row for each gene, in which the first column is the current ID and any following columns are previous IDs (there may be more than one). Let’s first convert that to a two-column file, with the old ID in the first column and the new ID in the second column:

$ wget http://plasmodb.org/common/downloads/release-28/Pfalciparum3D7/txt/PlasmoDB-28_Pfalciparum3D7_GeneAliases.txt
$ awk '{for(i=2; i<=NF; i++){printf("%s\t%s\n", $i, $1)}}' PlasmoDB-28_Pfalciparum3D7_GeneAliases.txt >pfalc_gene_aliases.tsv
$ head pfalc_gene_aliases.tsv
PF02_0090   PF3D7_0209400
PFB0423c    PF3D7_0209400
PF13_0270   PF3D7_1351800.2
MAL5P1.73   PF3D7_0507200
MAL4P2.52   PF3D7_0412000
1791.m00049 PF3D7_0732200
MAL3P7.54   PF3D7_0324800
2277.t00266 PF3D7_1227500
MAL12P1.266 PF3D7_1227500
MAL3P8.16   PF3D7_0301800
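
To see what the awk reshaping step does without downloading anything, here is a small sketch that runs the same awk program on a three-row sample. The sample rows are assembled by hand from IDs shown above, the third ID is made up, and the tab-separated layout of the input is an assumption.

```shell
# Hand-made sample in the alias-file layout: current ID first, then previous IDs.
# The last row is a made-up gene with no previous IDs.
aliases=$(printf '%s\n' \
    "$(printf 'PF3D7_0209400\tPF02_0090\tPFB0423c')" \
    "$(printf 'PF3D7_1227500\t2277.t00266\tMAL12P1.266')" \
    'PF3D7_EXAMPLE')

# For each row, emit one "old ID <TAB> current ID" line per previous ID.
pairs=$(printf '%s\n' "$aliases" |
    awk '{ for (i = 2; i <= NF; i++) printf("%s\t%s\n", $i, $1) }')

printf '%s\n' "$pairs"
```

The sample yields four pairs. A gene with no previous IDs produces no output at all, because the loop starts at the second column.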

Next we’ll create our DBM database using the command zdbc.

$ zdbc pfalc_gene_aliases.db

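We can now fill it with values with the zdbs command, using a tab as the delimiter (written here as $'\t', which bash expands to a literal tab character):

$ zdbs --delimiter=$'\t' pfalc_gene_aliases.db <pfalc_gene_aliases.tsv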

OK, we’re ready to put it to use. As a bit of a contrived example, let’s grab a dataset from an old version of PlasmoDB (contrived because they actually update the data files with each release to use the latest identifiers; an actual need usually arises when taking data sets from old publications or other databases).

$ wget http://plasmodb.org/common/downloads/release-7.0/Pfalciparum/transcriptExpression/Pf_Cowman_Invasion_KO/molecular_mechanism_invasion.txt
$ head molecular_mechanism_invasion.txt
ID  W2mef EBA175 KO (late T) rep1   W2mef WT (late T) rep1  W2mef/c4/Nm (late T) rep1
MAL13P1.1   3.053661    3.107224    3.248607
MAL13P1.100 2.297358    2.32458 2.290255
MAL13P1.102 4.67976 3.612763    4.019086
MAL13P1.103 3.686802    3.726557    3.685324
MAL13P1.105 2.347508    2.234413    2.078631
MAL13P1.106 3.386187    4.404499    2.485488
MAL13P1.107 2.431189    2.430622    2.514539
MAL13P1.11  2.335451    2.320175    2.212009
MAL13P1.111 3.913259    3.53488 3.478326

I just grabbed this file at random, so I have no idea what the data is. Let’s say that, for whatever reason, we’re interested in the genes for which the second column has a value greater than 12. We can do this:

$ awk '{if (NR==1){next};if ($2 > 12){print $1}}' molecular_mechanism_invasion.txt
PF08_0119
PF11_0040
PF11_0224
PF13_0058
PF14_0016
PFB0120w
PFB0915w
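
The same filter can also be written in awk's pattern-action form, which some find easier to read. The sketch below runs it on a small stand-in table: the first data row is copied from the listing above, while the short column names and the other two rows' values are made up for illustration.

```shell
# Stand-in for the expression table: a header line plus three data rows.
table=$(printf '%s\n' \
    "$(printf 'ID\trep1\trep2\trep3')" \
    "$(printf 'MAL13P1.1\t3.053661\t3.107224\t3.248607')" \
    "$(printf 'PF08_0119\t12.5\t11.0\t10.2')" \
    "$(printf 'PF11_0040\t13.1\t9.8\t12.4')")

# Skip the header (NR > 1) and print column 1 when column 2 exceeds 12.
hits=$(printf '%s\n' "$table" | awk -F'\t' 'NR > 1 && $2 > 12 { print $1 }')

printf '%s\n' "$hits"
```

Skipping the header matters: its second field is not a number, so awk would compare it with 12 as a string and could print the header's first column as a false hit.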

…which isn’t useful because they’re the old IDs. Enter zdbf:

$ awk '{if (NR==1){next};if ($2 > 12){print $1}}' molecular_mechanism_invasion.txt | zdbf pfalc_gene_aliases.db
PF3D7_0805200
PF3D7_1102800
PF3D7_1121600
PF3D7_1310700
PF3D7_1401400
PF3D7_0202500
PF3D7_0220000

We can go one step further if we’re interested in the actual gene products:

$ wget http://plasmodb.org/common/downloads/release-28/Pfalciparum3D7/fasta/data/PlasmoDB-28_Pfalciparum3D7_AnnotatedProteins.fasta
$ zdbc pfalc_gene_products.db
$ sed -n '/^>/p' PlasmoDB-28_Pfalciparum3D7_AnnotatedProteins.fasta | cut -f1,3 -d'|' | sed 's/product=\(.*\)/\1/' | sed 's/>//;s/ | /|/' | zdbs pfalc_gene_products.db 

I just took the annotated proteins file, found all the sequence header lines (starting with ">") and then did a few manipulations to convert it to "ID|product" form. These were then stored via zdbs.
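
To make those manipulations easier to follow, here is a sketch that applies the same commands one stage at a time to a single record. The header layout (ID, organism, product, then further fields, separated by " | ") is inferred from the cut and sed commands above; the length field and the sequence line are placeholders.

```shell
# One hand-made FASTA record. Only the ID and product text appear in this post;
# the other header fields and the sequence line are placeholders.
fasta=$(printf '%s\n' \
    '>PF3D7_1121600 | organism=Plasmodium_falciparum_3D7 | product=exported protein 1 (EXP1) | length=0' \
    'PLACEHOLDERSEQUENCE')

# Step 1: keep header lines only.
step1=$(printf '%s\n' "$fasta" | sed -n '/^>/p')
# Step 2: keep the 1st and 3rd "|"-separated fields (ID and product).
step2=$(printf '%s\n' "$step1" | cut -f1,3 -d'|')
# Step 3: drop the "product=" label, the leading ">" and the spaces around "|".
step3=$(printf '%s\n' "$step2" | sed 's/product=\(.*\)/\1/' | sed 's/>//;s/ | /|/')

# Brackets make trailing whitespace visible.
printf '[%s]\n' "$step3"
```

This prints [PF3D7_1121600|exported protein 1 (EXP1) ]. Note the space before the closing bracket: cut keeps the space that preceded the next "|", so the stored product ends with a trailing space.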

Finally, we can redo our previous query, only this time also showing the gene products:

$ awk '{if (NR==1){next};if ($2 > 12){print $1}}' molecular_mechanism_invasion.txt | zdbf pfalc_gene_aliases.db | zdbf -d$'\t' pfalc_gene_products.db
PF3D7_0805200   gamete release protein, putative (GAMER) 
PF3D7_1102800   early transcribed membrane protein 11.2 (ETRAMP11.2) 
PF3D7_1121600   exported protein 1 (EXP1) 
PF3D7_1310700   RNA-binding protein, putative 
PF3D7_1401400   early transcribed membrane protein 14.1 (ETRAMP14) 
PF3D7_0202500   early transcribed membrane protein 2 (ETRAMP2) 
PF3D7_0220000   liver stage antigen 3 (LSA3)

OK, now, that might look crazy and like a lot of work, but of course you only have to do it once (and of course you’re putting all of that into a Makefile to be easily reproducible). Once it’s made, you can easily pipe IDs into your new database at any point. So, it’s very quick to integrate into pipelines.
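
If you would rather keep the build steps in a plain shell script than in a Makefile, something along these lines could work. It is only a sketch: needs_rebuild and build_db are made-up helper names, and it assumes zdbc and zdbs are on your PATH and behave as in the commands above.

```shell
# Hypothetical helper: succeed if the database is missing or older than its source.
needs_rebuild() {
    [ ! -e "$1" ] || [ "$2" -nt "$1" ]
}

# Hypothetical helper: recreate and refill a database from a tab-separated file.
# Assumes zdbc and zdbs are installed and used exactly as shown earlier.
build_db() {
    db=$1
    src=$2
    if needs_rebuild "$db" "$src"; then
        zdbc "$db" &&
            zdbs --delimiter="$(printf '\t')" "$db" < "$src"
    fi
}

# Example call (not run here):
# build_db pfalc_gene_aliases.db pfalc_gene_aliases.tsv
```

The timestamp check plays the role of a Makefile dependency: the database is only rebuilt when the source file is newer. Rebuilding in place relies on zdbc overwriting an existing file, which is the behaviour described in the news for this release.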

Download

Download [sig]

News

This is a major update. Please note that the interface has changed significantly: several options have been added, removed or replaced.

  • Removed support for Kyoto Cabinet. To make the program easier to support in the long term, the decision was made to support only one DBM library (GDBM).
  • Overhaul of options. All zeptodb programs now support a set of common options: --mmap-size, --cache-size, --block-size, --no-mmap, and --no-lock (in addition to the usual help, usage, version and verbose options). These options control how the database is opened. Please note that the --num-buckets option for zdbc has been removed since it was only appropriate for Kyoto Cabinet and was a misnomer for GDBM (it’s effectively been replaced by --cache-size). The --sync option has also been added for commands that write changes to the database. Please see the documentation for a full explanation of these options.
  • zdbc always creates a new database. zdbc now overwrites a database if it’s called on an existing file.
  • New command zdbi. The new command zdbi prints out some basic information about a database file.