8 Common file types and other software

8.1 Docker

One thing I have not covered here, but which would take reproducibility to yet another level, is the use of Docker. It would allow you to share not just your data and scripts, but also the exact software versions that you used. Frankly, I did not include it because I do not use it myself (I have not had the chance to learn it yet).

8.2 File types

Just a very superficial list of the most frequently used file types that you should probably know. Apart from the binary bam and bcf formats mentioned below, all of these are plain text files and you can open them in a regular text editor (don’t do this if the files are large).

The difference is just in the formatting conventions and in the expected content.

txt : any type of text
md : markdown - text with minimal layout markup

csv : comma separated values (example with three columns)

A,important text,3

tsv : tab separated values (the tabs are written as \t in the example)

A\timportant text\t3
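To make the "difference is just in the formatting conventions" point concrete, here is a minimal sketch in Python that reads the two example lines with the standard library csv module. The helper name parse_delimited is my own and only serves this illustration.

```python
import csv
import io


def parse_delimited(text, delimiter):
    """Split every line of `text` into a list of fields."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# The same record, once comma separated and once tab separated
csv_rows = parse_delimited("A,important text,3\n", ",")
tsv_rows = parse_delimited("A\timportant text\t3\n", "\t")

print(csv_rows)              # [['A', 'important text', '3']]
print(csv_rows == tsv_rows)  # True
```

Both files carry the same three columns; only the separator differs. Note that every field comes back as text, so the 3 would still need to be converted to a number.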

8.2.1 Genetic data

fa : aka. fasta - plain genetic sequences, with a header line per sequence starting with >. (Below is an example with two sequences, seq1 and seq2.)

>seq1
ATGCGT
GCATGG
>seq2
ATGTAA
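Because a sequence can be wrapped over several lines, reading a fasta file means collecting all lines up to the next header. A minimal sketch in Python, assuming a small file that fits into memory (read_fasta is a hypothetical helper, not part of any library):

```python
def read_fasta(lines):
    """Return a dict that maps each sequence name to its full sequence."""
    sequences = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore empty lines
        if line.startswith(">"):
            # A header line starts a new sequence
            name = line[1:].strip()
            sequences[name] = ""
        elif name is not None:
            # Sequence lines are appended to the current sequence
            sequences[name] += line
    return sequences


example = """>seq1
ATGCGT
GCATGG
>seq2
ATGTAA
"""

records = read_fasta(example.splitlines())
print(records)  # {'seq1': 'ATGCGTGCATGG', 'seq2': 'ATGTAA'}
```

For real projects an established parser (for example the one in Biopython) is the safer choice; the sketch is only meant to show how little structure the format has.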

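fq : aka. fastq - sequencing data with quality scores. Each sequence takes four lines: a header starting with @, the sequence itself, a separator line starting with +, and one quality character per base. (Below is an example of a single sequence, seq1.)

@seq1
ATGCGTGCATGG
+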
*55CCF>>''))
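The quality line looks cryptic, but every character simply encodes a number. A minimal sketch in Python, assuming the standard four-line layout (with a + separator line) and the common Phred+33 encoding, in which the score is the ASCII code of the character minus 33. Older data may use a different offset, and read_fastq is a hypothetical helper.

```python
def read_fastq(lines):
    """Yield (name, sequence, scores) for every four-line fastq record."""
    # Drop empty lines so that the records line up in blocks of four
    lines = [line.strip() for line in lines if line.strip()]
    for i in range(0, len(lines), 4):
        header, sequence, _separator, quality = lines[i:i + 4]
        # Assumption: Phred+33 encoding (score = ASCII code - 33)
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores


example = """@seq1
ATGCGTGCATGG
+
*55CCF>>''))
"""

records = list(read_fastq(example.splitlines()))
for name, sequence, scores in records:
    print(name, sequence, scores)
# seq1 ATGCGTGCATGG [9, 20, 20, 34, 34, 37, 29, 29, 6, 6, 8, 8]
```

A higher score means a more reliable base call, so in this example the bases in the middle of the read are the most trustworthy ones.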

sam/bam : sequence alignment format. Genetic sequences mapped to a reference, header lines start with @.
(sam is human readable, bam is binary - only for computers)
vcf/bcf : variant call format. Genotypes + metadata in table form, metadata lines start with ##, followed by a single column header line starting with #.
(vcf is human readable, bcf is binary - only for computers)

bed : Browser Extensible Data. Ranges on a reference genome. Includes at least three tab separated columns: chromosome, start and end (example with three ranges).

LG02    0      400
LG02    1500   3000
LG15    555    1200
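One detail that is easy to trip over: bed coordinates are 0-based and the end position is not part of the range, so the length of a range is simply end minus start. A minimal sketch in Python that sums the covered bases per chromosome for the example above. The helper covered_bases is hypothetical and does not merge overlapping ranges.

```python
from collections import defaultdict


def covered_bases(lines):
    """Sum the range lengths (end - start) per chromosome."""
    totals = defaultdict(int)
    for line in lines:
        # Skip empty lines and the optional comment/track/browser lines
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        # Only the first three columns are needed here
        chrom, start, end = line.split()[:3]
        totals[chrom] += int(end) - int(start)
    return dict(totals)


example = "LG02\t0\t400\nLG02\t1500\t3000\nLG15\t555\t1200\n"

totals = covered_bases(example.splitlines())
print(totals)  # {'LG02': 1900, 'LG15': 645}
```

For anything beyond such a quick check (merging, intersecting, subtracting ranges) a dedicated tool like bedtools is the better choice.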

gff : general feature format. It describes exons, genes and other features of DNA. The structure is similar to a bed file with additional columns.

8.2.2 Code

sh : code written in bash
R : code written in R
py : code written in Python
pl : code written in Perl
nf : code written in Nextflow

8.3 Software

Below I list what I think are the must-haves for any bioinformatics tool shed:

fastqc : tool for quality checking of sequencing data (first step of any project using new sequencing data)
multiqc : summarize the fastqc reports for all your samples in a single report
samtools : tool set for working with sam files
vcftools : tool set for working with vcf files (reformatting & population statistics)
vcflib : convenience scripts for working with vcf files
bedtools : tool set for working with bed files