3 Bash

The Bourne-again shell (bash) is usually the scripting language that you will use when working on the command line.

To be efficient you will therefore need some knowledge of this language. I do not intend to rewrite the 1254th tutorial on “How to use bash” since there are already lots of tutorials online (again - google is your friend here…).

Here I will give a small overview and simply list the common patterns/issues/programs and the way I deal with them in bash:

3.1 Commands/Programs (I/II)

The first thing you need to know is how to navigate using the command line.

The following commands help you with that:

  • pwd is short for “print working directory” and reports the location on your file system where the command line is currently operating (this tells you where you are)
  • ls will print the content (the files & folders) of the current directory (what is around you)
  • cd is short for “change directory” and allows you to change your working directory (move around)
  • echo allows you to print some text as output (it allows you to speak)
  • ./ is the current directory
  • .. is the parent directory (the directory which contains the current directory)

With this you can do the most basic navigation. In our assumed project this could look like this:
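A minimal sketch (the folder names follow the example project used throughout this section):

```bash
pwd                  # /home/khench - where am I?
cd root_folder       # move into the project folder
ls                   # data  docs  sh - what is around me?
cd data              # move into the data folder
cd ..                # move back up into root_folder
echo "hello world"   # print some text
```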

One thing that makes your life easier when using the command line is the <tab> key on your keyboard. If you don’t know what it does try it. It’s basically an auto-complete for your input.

If eg. you are sitting in the root_folder and want to move to root_folder/data, all you need to type is root_folder/da<tab> and it will auto-complete to root_folder/data. Typing root_folder/d<tab> will not be enough, since this is ambiguous - it might complete to either root_folder/data or root_folder/docs. Of course, in this example we only saved a single keystroke, but in real life this can save you a lot of (mis-)typing.

3.2 Paths

A path is a location on your file system - it is quite similar to a URL in your web browser. It can either point to a file or to a folder.

We have seen this before: the command pwd prints the path to the current working directory.

Generally we need to be aware of two different types - absolute vs. relative paths:

An absolute path looks something like this:

/home/khench/root_folder (Note the leading slash /home....)

This is the type of path that pwd reports, and it will always point to the same location on your file system regardless of the directory you are currently operating in. That is because the leading slash points to the root folder of your file system (not of your project), which is an absolute position.

The disadvantage of absolute paths is that things can change: you might move your project folder to a different location (backup on an external hard drive) or share it with collaborators. In these cases your path will point to a file/location that does not exist - it is basically a broken link.

Therefore in some cases the use of relative paths is useful. Relative paths indicate the location of a file/folder relative to your current working directory.

We used this for example in the command cd data (short for cd ./data). Here, there is no leading slash - instead the path starts directly with the folder name or with the current directory (./).

So to get from the root_folder into the data folder we can use either of the two commands:

  • cd data (relative path)
  • cd /home/khench/root_folder/data (absolute path)

3.3 Variables

Variables are containers for something else. You can for example store some text in a variable and access this “information” later. To do so, you address the variable by its name prefixed with a dollar sign:
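For example:

```bash
VAR="some text"   # store text in the variable VAR
echo $VAR         # prints: some text
```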

(There are two equivalent notations: $VAR and ${VAR}. The notation with curly brackets helps if your variable name is more complex, since the start and end of the name are exactly defined.)

Variables are often used to store important paths. You could for example store the (absolute) location of your project folder in a variable (PROJ="/home/khench/root_folder") in some sort of configuration script and then use the variable to navigate (cd $PROJ/data). That way, if the location of the project folder changes, you only need to update it in a single place.

There are quite a few variables that are already defined on your computer and it is good to be aware of these. Two important ones are $HOME and $PATH.

$HOME points to the home folder of the current user (often you can also use ~/ as an equivalent to $HOME/):
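For example (the exact path depends on your user name):

```bash
echo $HOME   # /home/khench
cd ~/        # equivalent to cd $HOME/
```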

$PATH is often not a single directory but a collection of directories separated by colons:
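On a typical Linux system it might look something like this (the exact content differs from system to system):

```bash
echo $PATH
# /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
```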

The $PATH is a super important variable - it is the register of directories where your computer looks for installed software. Every command that you type into the command line (eg. cd or ls) is a program that is located in one of the folders within your $PATH.

You can still run programs that are located elsewhere, but if you do so you need to specify the absolute path of this program (/home/khench/own/scripts/weird_script.py instead of just weird_script.py).

We will discuss later how to extend the $PATH in case you want to include a custom software folder if you need to install stuff manually.

3.4 Scripts

So far, we have been working interactively on the command line. That is, we typed directly into the terminal and observed the direct output. But I claimed before that one of the advantages of the command line is its reproducibility and the possibility to protocol your work. One aspect of this is the ability to store your workflow in scripts. If you use a script to store bash commands, the conventional suffix is .sh (eg: script.sh). Additionally, it is useful to add a header line that points to the location of bash itself (usually one of the two):
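```bash
#!/bin/bash
```

or

```bash
#!/usr/bin/env bash
```

The second variant asks env to locate bash for you, which makes the script slightly more portable.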

A full (admittedly quite silly) script might look like this:
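A sketch of what sh/script.sh could contain (the content is made up):

```bash
#!/usr/bin/env bash
# script.sh - a quite silly example script
echo "Hello! You are currently sitting in:"
pwd
echo "This folder contains:"
ls
```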

Provided the script is located in our sh folder you can run it like this:
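```bash
bash sh/script.sh
```

(Alternatively, you can make the script executable with chmod +x sh/script.sh and then run it as ./sh/script.sh.)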

The big benefit of using bash scripts is that you will be able to remember later what you are doing right now. A workflow that you do interactively is basically gone in the sense that you will never be able to remember it exactly.

As with the paths, there is one script that you should be aware of - your bash start-up script. This is a hidden file (.bashrc or .bash_profile) located in your $HOME folder. This script is run every time you open a terminal and will be important later.

3.5 Commands/Programs (II/II)

3.5.1 Flags

One important feature of most command line programs is the usage of flags. These are (optional) parameters that alter the way a program operates and are invoked with - or -- (depending on the program):
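For example, with ls:

```bash
ls -l        # the short flag -l switches to a detailed, line-per-file listing
ls --all     # the long flag --all also lists hidden files (short form: -a)
```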

Arguably one of the most important flags for most programs is -help/--help. As you might guess this will print the documentation for most programs. Often this includes an example of the input the program expects, as well as all the options available.

3.5.2 More commands/programs

Apart from the most basic commands needed for navigating within the command line, I want to list the commands I personally use most frequently to actually do stuff:

3.5.2.4 Text manipulation (table like text files)

In my opinion especially sed & awk are extremely powerful commands. I will give a few usage examples at the very end of this section.

3.5.2.5 Text editors

There are several actual text editors for the command line - common ones include nano, vim and emacs.

All have their own fan base. It makes sense to learn how to use at least one of them - basically all of them will require you to learn a handful of key-combinations.

3.6 Loops

Sometimes you need to do one thing multiple times. Or almost the same thing, just modifying a single parameter. In this case you can use loops instead of manual copy-and-paste. There are two loops that you should know (both sketched below):

  • for
  • while
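A minimal sketch of each:

```bash
for i in 1 2 3; do
    echo "round ${i}"      # runs once per element of the sequence
done

COUNT=0
while [ $COUNT -lt 3 ]; do
    echo "round ${COUNT}"  # runs until the condition fails
    COUNT=$((COUNT + 1))
done
```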

You can use bash commands to create the sequence to loop over by using $(command) (eg: $(ls *.txt)).
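For example, to loop over all text files in the current directory:

```bash
for FILE in $(ls *.txt); do
    echo "found ${FILE}"
done
```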

You can also loop over the content of a file:
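For example (samples.txt is a made-up file with one sample name per line):

```bash
for LINE in $(cat samples.txt); do
    echo "${LINE}"
done
```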

Yet this might give unexpected results when your file contains whitespace (looping over a table):
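A sketch of the problem, assuming a made-up two-column file table.txt:

```bash
# table.txt:
#   sample_1 pop_A
#   sample_2 pop_B
for LINE in $(cat table.txt); do
    echo "${LINE}"
done
# prints four lines (sample_1, pop_A, sample_2, pop_B) instead of two -
# the for loop splits on any whitespace, not just on line breaks
```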

In this case you can switch to while:
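```bash
while read LINE; do
    echo "${LINE}"
done < table.txt
# now prints the two complete lines
```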

I use this pattern to read in parameters eg. for a function call within the loop:
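A sketch (report_sample is a made-up function; read splits each line of table.txt into the two variables):

```bash
report_sample() {
    echo "sample: ${1} - population: ${2}"
}

while read SAMPLE POP; do
    report_sample ${SAMPLE} ${POP}
done < table.txt
```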

3.7 Installing new software

This is hell! Installing new software is what can take up a huge portion of your time & sanity. That is because most new programs that you want to install will not work straight out of the box but depend on 17 other programs, packages and libraries - all of course in specific versions which contradict each other. These dependencies will then of course not work straight out of the box either, but depend on … (You get the idea - right?)

Ok, I might exaggerate a little here but not much…

Therefore, whenever possible try to use package managers. These are programs that try to keep track of your software and - well - to manage the whole dependency hell for you.

Depending on your OS, there are different package managers available for you:

(Conda & especially bioconda are generally pretty useful, but lately they seem a little buggy. That is, lately they take ages to load and their ‘catalog’ might not contain the bleeding-edge version of the program you’re looking for.)

Yet, sometimes the program you need is not available via package managers. In those cases you will have to bite the bullet and install it manually. How to do this exactly varies case by case, but a common theme when compiling a program from source is the following:

  • ./configure
  • make
  • make install

One example here would be the installation of Stacks:
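A sketch from memory - the exact version number and download URL will likely be outdated:

```bash
wget http://catchenlab.life.illinois.edu/stacks/source/stacks-2.5.tar.gz
tar -xzvf stacks-2.5.tar.gz
cd stacks-2.5
./configure
make
sudo make install
```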

You might notice that here we used sudo make install instead of make install. This means that to execute this command we need admin rights - no problem on your laptop, but on the cluster this is a no-go. We’ll talk about this later.

The (naive) way I like to picture the installation is by comparing it to building IKEA furniture:

  • ./configure: Your computer reads the IKEA manual and checks that it has all the tools needed to build the piece
  • make: Your computer unpacks the pieces and puts the closet together (it is now standing in your workshop)
  • make install: You put the closet into the bedroom (which is where you will be looking for your clothes)

The point with make install is that it puts the compiled software into one of the standard folders within your $PATH so that the computer can find it. But since these folders are quite important the average user is not allowed to modify them - you need admin rights (sudo ...,“become root”) to do this.

As mentioned before, this is possible if you own the computer, but on the cluster this will not be the case.

The work-around is to create a custom folder that you can access and to collect the manually installed software there instead.

Let’s say your software folder is /home/khench/software. This would change your installation procedure to the following:
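Assuming the same ./configure / make workflow as above:

```bash
./configure --prefix=/home/khench/software
make
make install    # no sudo needed - we own the target folder
```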

This will place the compiled software into /home/khench/software or /home/khench/software/bin. Our problem is only half solved at this point, since the computer still does not find the software. Assuming the program is called new_program, running which new_program will still return a blank line at this point.

Therefore, we need to add /home/khench/software & /home/khench/software/bin to our $PATH. To do this we add the following line to our bash start-up script ($HOME/.bashrc or $HOME/.bash_profile):
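```bash
export PATH=$PATH:/home/khench/software:/home/khench/software/bin
```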

We append the existing $PATH with our two custom directories. Note that the start-up script is a start-up script - the changes will come into effect once we restart the terminal, but they will not affect the running session. To update the current session we need to run the start-up script manually (source $HOME/.bashrc or source $HOME/.bash_profile).

At this point the following should work:
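```bash
which new_program
# /home/khench/software/bin/new_program
```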

I usually like to double check that the program will open properly by calling the program help after installing.
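The exact flag depends on the program - often it is one of -h, -help or --help:

```bash
new_program --help
```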

If this does not produce an error but displays the help text, the program should generally work.


3.8 Appendix (sed and awk examples)

A large portion of bioinformatics is dealing with plain text files. The programs sed and awk are very powerful for this and can really help you to work efficiently. I am sure you can do way more using these two commands if you dig into their manuals, but here is how I usually use them:

3.8.1 sed

This is my go-to search & replace function. I use it to reformat sample names from genotype files, to reformat variables within a bash pipeline, to transform whitespace (eg. turning spaces into semicolons or tabs) and for similar tasks:
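For example, cleaning up a made-up sample name:

```bash
echo "sample_1" | sed 's/_//'
# sample1
```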

3.8.1.1 Basics

The basic structure looks like sed 's/pattern/replace/g' <input file>. Here the s/ activates the search & replace mode, /pattern/ is the old content (search), /replace/ is the new content (replace) and the /g indicates that you want to replace all occurrences (globally) of the pattern within every line of the text. In contrast, sed 's/pattern/replace/' would only replace the first occurrence of the pattern within each line.
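For example:

```bash
echo "A A A" | sed 's/A/B/'     # B A A - only the first occurrence per line
echo "A A A" | sed 's/A/B/g'    # B B B - all occurrences
```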

In cases where you need to replace a slash (“/”) you can use a different delimiter. The following commands are equivalent and I believe there are many more options:

  • s/pattern/replace/g
  • s=pattern=replace=g
  • s#pattern#replace#g

You can also replace several patterns in one command (one after the other) by separating them with a semicolon (sed 's/pattern1/replace1/g; s/pattern2/replace2/g').

When using sed in a bash pipeline it looks like this:
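A sketch, turning a made-up “name sequence” pair into fasta format (GNU sed interprets \n in the replacement as a line break):

```bash
echo "seq1 ACGTC" | sed 's/^/>/; s/ /\n/'
# >seq1
# ACGTC
```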

Generally (also eg. when using grep), there are two important special characters:

  • ^: the start of a line
  • $: the end of a line

So, in the example above (sed 's/^/>/') we introduced a “>” at the start of each line (before creating new lines by introducing a line break \n).

3.8.1.2 Wildcards

So far we have replaced patterns in a destructive manner. By this I mean that after using sed the pattern is replaced and thus gone. But sometimes you don’t want to delete the pattern you are looking for but to merely modify it. Of course you could argue that all you need to do is to write eg:
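```bash
echo "sample 1" | sed 's/sample 1/sample_1/'
# sample_1
```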

But this only works when we know exactly what we are looking for. To be more precise, so far we did not really do s/pattern/replace/g but rather something like s/name/replace/g. By this I mean that we searched for an exact string (what I call a name), while a pattern can be more ambiguous by using wildcards and regular expressions (.*, [123], [0-9], [0-9]*, [A-Z]*, [a-z]*):

We could for example replace all (lowercase) words followed by a combination of numbers starting with “1”:
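A sketch with made-up content:

```bash
echo "apple 12 banana 7 cherry 145" | sed 's/[a-z]* 1[0-9]*/X/g'
# X banana 7 X
```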

3.8.1.3 Modifying matches

Now in this case, we don’t know the exact string that we are going to replace - we only know the pattern. So if we want to modify it without deleting it, we need a different method to capture the detected pattern. To do this we fragment the search pattern using \(pattern\) or \(pat\)\(tern\). Patterns declared like s/\(pattern\)/ will still be found just like in s/pattern/, but now we can access the match using \1 (pattern), or \1 (pat) & \2 (tern), in the replace section:
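For example, swapping the two parts of a made-up sample name:

```bash
echo "sample_01" | sed 's/\([a-z]*\)_\([0-9]*\)/\2_\1/'
# 01_sample
```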

3.8.2 awk

I feel like awk is more its own programming language than just a unix command - it can be super useful. I usually use it when “I need to work with columns” within a bash pipeline. This could be eg. adding up two columns or adding a string to a column based on a condition. I’m afraid this is almost an insult to the program because I am sure you can do waaay cooler things than this - alas, so far I could not get past RTFM.

3.8.2.1 Basics

The basic structure of awk looks like:
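```bash
awk 'condition {action}' <input file>
```

The condition is optional - without it, the action is applied to every line.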

The simplest (and most useless) version is to emulate cat:
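```bash
awk '{print $0}' file.txt    # prints every line unchanged - just like cat file.txt
```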

The first thing to know about awk is how to address the individual columns. By default awk uses any whitespace (" " & "\t") as column delimiter. $0 is the whole line, $1 is the first column, $2 is the second …

Reading a "\t" delimited table:
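For example (the two-column content is made up):

```bash
printf 'sample_1\tpop_A\nsample_2\tpop_B\n' | awk '{print $2}'
# pop_A
# pop_B
```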

A space (" ") also works as delimiter:
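```bash
echo "sample_1 pop_A" | awk '{print $2}'
# pop_A
```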

3.8.2.2 Conditions

The second thing to know is how to add a condition to an action:
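For example, using a made-up three-line table:

```bash
printf '1 A\n2 B\n3 C\n' | awk '$1 > 1 {print $2}'
# B
# C
```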

Combining conditions with logical and:
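```bash
printf '1 A\n2 B\n3 C\n' | awk '$1 > 1 && $2 == "B" {print $0}'
# 2 B
```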

Combining conditions with logical or:
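```bash
printf '1 A\n2 B\n3 C\n' | awk '$1 == 1 || $2 == "C" {print $0}'
# 1 A
# 3 C
```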

Different actions for different cases:
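```bash
printf '1 A\n2 B\n3 C\n' | awk '$1 < 2 {print $2, "is small"} $1 >= 2 {print $2, "is big"}'
# A is small
# B is big
# C is big
```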

In awk, "NR" stands for the row number and "NF" stands for the number of fields (columns) within that row:
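```bash
printf 'a b\nc d e\n' | awk '{print NR, NF}'
# 1 2
# 2 3
```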

3.8.2.3 Variables

One thing to be aware of is the use of variables within awk. You might have noticed that the columns within awk look like bash variables (eg. $1). But if you try to use a bash variable within awk, this will fail:
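A sketch of the problem (COLUMN is a made-up bash variable):

```bash
COLUMN=2
echo "a b c" | awk '{print $COLUMN}'
# a b c   <- the whole line, not the second column
```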

This is not what we expected - that would have been:
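```
b
```

That is, just the second column.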

The issue here is the use of different quotation marks (single '' vs. double ""). In short - the awk command needs to be wrapped in single quotes, but within these, bash variables don’t work:
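Sticking with the made-up $COLUMN variable from above:

```bash
echo 'The column is: $COLUMN'
# The column is: $COLUMN   <- the variable is not expanded
```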

vs.
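```bash
echo "The column is: $COLUMN"
# The column is: 2
```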

To get around this we can pass the variable to awk before the use of the single quotes:
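```bash
echo "a b c" | awk -v col=$COLUMN '{print $col}'
# b
```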

Basically you can store anything within an awk variable and use it within awk:
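For example (PREFIX is made up):

```bash
PREFIX="pop_"
echo "A B" | awk -v p=$PREFIX '{print p $1}'
# pop_A
```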

One special awk variable is “OFS”. This is the output field separator and it can be set like any other variable. The following two examples are equivalent (but one is way easier to read/write):
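A sketch, joining two columns with a tab:

```bash
printf '1 2\n' | awk '{print $1 "\t" $2}'
printf '1 2\n' | awk -v OFS='\t' '{print $1, $2}'
# both print the two columns separated by a tab
```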

3.8.3 Bash one-liners

By combining the programs awk, cat, cut, grep and sed using the pipe | you can build quite concise and specific commands (one-liners) to deal with properly formatted data. Over the years I collected some combinations that I regularly use to deal with different types of bioinformatic data. You can find them together with some more useful bash examples within the oneliners.md file in the source repository of this document.