Brandon Invergo

Data formatting and distribution tips for biologists

Incredible amounts of data are now being generated in all corners of the biological sciences. Given the immensity of these datasets, we must turn to bioinformatic analyses in order to derive knowledge from our experiments. This process typically involves significant amounts of data reformatting as well as computationally intensive statistical analyses. Thus, the most efficient and natural way to process the data is via computer programming.

Unfortunately, large datasets are still quite often being distributed in a manner that is unfriendly to bioinformatic analysis. While it is sometimes a trivial task to transform the data into a usable format, other times it results in the poor bioinformatician wanting to put his or her fist through the monitor while cursing each name on the author list in succession.

Here are a few tips for preparing data for bioinformatic analysis. Please consider these in your collaborations and for distributing your supplementary materials. After all, you put all that work into generating that amazing dataset; it would be a shame to put up roadblocks to amazing analyses of it!

Note that this is coming only from my experiences on the receiving end of data. I have not (yet) generated any large datasets myself, so I may indeed be missing some perspective of that side of the process.

If anyone has any other suggestions, please let me know and I'll add them to this list!

Make your data available in text format

This is probably the number one complaint: the only file formats appropriate for computational analysis are text formats. Sure, it might be possible to load other formats with special libraries for certain programming languages or to convert them to text, but you should not demand that if you don't absolutely need it (hint: you don't). Other formats are OK for small data tables that can be easily reproduced by hand, but anything large (say, >50 rows) should not be distributed as anything other than text.

First and foremost: never, ever, under any circumstances, distribute a large data table in PDF format! Never. While PDF tables look nice and it's convenient to have them in the same document as your supplementary figures, it is an absolute chore to extract tabular data from them. Try it yourself sometime. When a table is copied and pasted from a PDF file, the whitespace between columns is typically lost. Depending on the structure of the text in the table, it might be essentially impossible to reliably separate each of the columns without having to visually verify against the PDF file at every step. I'll repeat: your nice proteomics dataset is completely useless inside a PDF file.

Second: spreadsheets (MS Excel, LibreOffice Calc, etc.) are not good data exchange formats either. They are (essentially) closed to programmatic access outside of their own macro languages (yes, some libraries exist, but again, don't make me download and learn to use a library just to parse your text-only data, which is otherwise simple to parse). Of course, a spreadsheet can always be exported to a comma-separated values (CSV) file, but because spreadsheets often have some "helpful" formatting (see below), a lot of work often needs to be done to properly format the CSV file, which can be an error-prone process. Also, don't be so sure that everyone can open your file. For example, MS Excel produces files that cannot be reliably read on other systems. Since a lot of bioinformatics research happens on GNU/Linux systems, it's very likely that the people who will analyze your data will first find themselves struggling against either the old, proprietary Excel format or the new, obfuscated one.

The best solution is to make your data available as a CSV file (or a tab-delimited file). This will make sure that a programmer can immediately start working with your dataset without introducing any potential errors by unnecessary reformatting.
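To give a sense of how little effort a CSV file demands from the person on the receiving end, here is a minimal sketch in Python using only the standard library. The column names and values are made up for illustration, and the file contents are held in a string so the example is self-contained; with a real file you would pass an opened file handle instead.

```python
import csv
import io

# Hypothetical data standing in for a supplementary table saved as CSV.
csv_text = """protein_id,peptide_count,abundance
P12345,12,3.4
Q67890,7,0.8
"""

def read_table(handle):
    """Return the rows of a CSV file as a list of dictionaries keyed by column name."""
    return list(csv.DictReader(handle))

rows = read_table(io.StringIO(csv_text))

# Every value arrives as text, so numeric columns are converted explicitly.
abundances = [float(row["abundance"]) for row in rows]
print(len(rows), "rows; highest abundance:", max(abundances))
```

No extra libraries, no manual clean-up: the analysis can start on the very next line.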

This also goes for non-tabular data. Please don't send sequencing results in a word processor format! A simple text file in an appropriate sequence format (e.g. FASTA) is immensely more useful.
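As a rough illustration of why a plain FASTA file is so convenient, the sketch below reads one with a few lines of standard-library Python. The record names and sequences are invented, and the parser is deliberately simple (it assumes header lines start with ">" and ignores blank lines); in practice one would more likely reach for an established library, but the point is that nothing special is required.

```python
import io

# Hypothetical FASTA content; a real file would be opened with open(path).
fasta_text = """>seq1 hypothetical example record
ACGTACGT
ACGT
>seq2
GGGCCC
"""

def parse_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

records = dict(parse_fasta(io.StringIO(fasta_text)))
for name, sequence in records.items():
    print(name, len(sequence))
```

Doing the same with sequences pasted into a word processor document would first require someone to get the text out of it by hand.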

A note on this: there are presumably some journals that require uploading supplementary materials in one of these inappropriate formats. The first thing that we should do is make them aware of the inappropriateness of their policy. Second, if they will not let you also upload the text file, make sure that it is available for download somewhere on the web (and that you make this fact known).

Don't format your tables to be visually appealing

By this, I mean don't try to format your data table to make browsing it easier. For example, one might put table super-headers spanning two columns to visually connect them, or one could have "sub-rows" or other summary rows adding information to a row above them. All rows should have the same meaning, i.e. if you know how to read one row, you know you can read any other row in the same way.

The key to understanding this is understanding the difference between how a human reads a table and how a computer reads it. The above tricks are really great for humans. Proper table structure allows us to take in a lot of information at a glance. Computers, however, have no good way of analyzing a table in that way. A computer reads a table line by line. Each line is then split into cells (say, by splitting at all the commas, or at all the tab characters), and then each column is handled independently. Thus, any context established between columns or contained in previous lines (or, worse, in lines not yet read) requires needless programming effort to maintain.
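The line-by-line reading described above can be sketched in a few lines of Python. Both tables here are invented for illustration: the first has one header line and uniform rows, while the second has a super-header and a summary row of the kind that looks helpful on screen. The reader function is a hypothetical, deliberately naive one that simply splits each line at tab characters.

```python
# A tidy table: one header line, then rows that all have the same meaning.
tidy_text = "gene\tsample\texpression\ngeneA\tcontrol\t1.5\ngeneA\ttreated\t4.0\n"

# A "pretty" table: a super-header spanning columns, plus a summary row.
pretty_text = "\tExpression\t\ngene\tcontrol\ttreated\ngeneA\t1.5\t4.0\nSummary: 1 gene\n"

def read_rows(text, delimiter="\t"):
    """Split a table into rows, treating the first line as the header."""
    lines = text.splitlines()
    header = lines[0].split(delimiter)
    rows = []
    for number, line in enumerate(lines[1:], start=2):
        cells = line.split(delimiter)
        if len(cells) != len(header):
            raise ValueError(f"line {number} has {len(cells)} cells, expected {len(header)}")
        rows.append(dict(zip(header, cells)))
    return rows

rows = read_rows(tidy_text)
print(rows[1]["expression"])

try:
    read_rows(pretty_text)
    pretty_failed = False
except ValueError as error:
    pretty_failed = True
    print("Could not read the formatted table:", error)
```

The tidy table is read without any special handling. The formatted one trips over its summary row, and even before that the super-header has silently been mistaken for the real column names, so every such table needs its own custom clean-up code.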

Also, please do not hide information by depending on formatting tricks such as cell coloring! For example, if the rows can be classified by applying a threshold to some column of data, do not simply color those cells of data to indicate the results of the classification. A program will have no way of getting that information. Instead, you should add a column to the table specifying the classification.
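Adding such a column is cheap. The following is a minimal sketch, assuming a hypothetical table with a p_value column and an arbitrary threshold of 0.05; the column names, the threshold and the "yes"/"no" labels are all placeholders for whatever your own classification is.

```python
import csv
import io

THRESHOLD = 0.05  # hypothetical cut-off, chosen only for this example

# Hypothetical input table; a real one would be read from a file.
input_text = "gene,p_value\ngeneA,0.01\ngeneB,0.20\n"

def add_classification(in_handle, out_handle, threshold=THRESHOLD):
    """Copy a CSV table, appending a column that states the classification explicitly."""
    reader = csv.DictReader(in_handle)
    fieldnames = reader.fieldnames + ["significant"]
    writer = csv.DictWriter(out_handle, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # The information that cell coloring would have hidden goes in its own column.
        row["significant"] = "yes" if float(row["p_value"]) < threshold else "no"
        writer.writerow(row)

output = io.StringIO()
add_classification(io.StringIO(input_text), output)
result = output.getvalue()
print(result)
```

The original values are left untouched, so anyone who prefers a different threshold can still apply their own.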

So remember, if your data will be analyzed computationally, make it beautiful for computers, not for humans!

Don't throw out data

Of course, it's difficult to be sure just how often this happens, but I have definitely seen it. When a lot of biological data is generated, often much of it is not at all necessary for your immediate analysis needs. However, one scientist's junk is another one's treasure. So, don't haphazardly delete columns from your table just because you don't think they're necessary! Likewise, leave in all of the raw data from which you made your calculations. Let the future users of your data decide what they need.

Make your data available elsewhere

While having your data available alongside your publication as supplementary information is useful, a lot of the above problems can be alleviated by also making your data available from public repositories. These will usually have fairly strict standards, so that anyone downloading from them can safely expect the data to be in a useful format. This also neatly sidesteps any unreasonable format restrictions imposed by the publisher.