Twin package pt1 - hypogen

Twin package pt1 - hypogen

R package to ease working with genomic data of hamlets

What: R package

R:

Over the last years in the Puebla lab I spent quite some time on the analysis of genetic data of Caribbean Hamlets. During this time I kept rewriting the same type of R code over and over again.

After the publication of the study Inter-chromosomal coupling between vision and pigmentation genes during genomic divergence, I wanted to formalize these bits and pieces and wrote two small packages - hypogen and hypoimg.

The first (hypogen) one is a collection of all the code snippets related to genetics of hamlets, while the second one (hypoimg) deals with code related to graphical annotaion of plots using hamlet images.

This is the first of two blog entries presenting the bysic functionality of these packages.

Hypogen

The main focus of hypogen is the import and visualization of genetic data mapped to the hamlet reference genome.

library(tidyverse)
library(hypogen)

Import data

Often when working with genetic data we have a set of markers for which we want to display a specific statistic. This might come as a dataset that includes the chromosome the marker is placed on, the marker position and the statistics (eg. FST). The package hypogen provides the function hypo_import_snps() to easily import this type of data and add the genomic context of the marker (the genomic start of the chromosome and the genomic position of the marker):

system('echo "CHROM\tPOS\tVALUE\nLG05\t10\t2" > test.tsv')
hypogen::hypo_import_snps('test.tsv')
## # A tibble: 1 x 5
##   CHROM   POS VALUE    GSTART      GPOS
##   <chr> <dbl> <dbl>     <dbl>     <dbl>
## 1 LG05     10     2 105714998 105715008

Another classic type of data is a summary of a genomic window (eg. a sliding window analysis). This data can be imported using the function hypo_import_windows().

system('echo "CHROM\tBIN_START\tBIN_END\tVALUE\nLG05\t10\t15\t2\nLG18\t1000\t1500\t4" > test.tsv')
hypogen::hypo_import_windows('test.tsv')
## # A tibble: 2 x 7
##   CHROM BIN_START BIN_END VALUE    GSTART    POS       GPOS
##   <chr>     <dbl>   <dbl> <dbl>     <dbl>  <dbl>      <dbl>
## 1 LG05         10      15     2 105714998   12.5 105715010.
## 2 LG18       1000    1500     4 421775982 1250   421777232
system('rm test.tsv')

Apart from this data import functions, a sumary of the hamlet genome assembly is stored within the dataset hypo_karyotype.

hypogen::hypo_karyotype %>%
  head() %>%
  knitr::kable(align = 'ccccc')
CHROM LENGTH GSTART GEND GROUP
LG01 25890264 0 25890264 odd
LG02 31702000 25890264 57592264 even
LG03 23122023 57592264 80714287 odd
LG04 25000711 80714287 105714998 even
LG05 20912025 105714998 126627023 odd
LG06 24210077 126627023 150837100 even

Plot data

A second focus of the package is to provide help with data visualization along the hamlet genome within the ggplot2 framework. To indicate the individual linkage groups (LG), the function geom_hypo_LG() can be used in combination with scale_fill_hypo_LG_bg().

The function scale_x_hypo_LG() will adjust the ticks of the x-axis to indicate the LG number.

ggplot()+
  coord_cartesian(ylim = c(0, 1))+
  geom_hypo_LG() +
  scale_fill_hypo_LG_bg()+
  scale_x_hypo_LG()+
  theme_hypo()+
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank())

As an example, we can plot some RAD data published by Puebla et al. (2014). The data was mapped onto the hamlet genome and re-analysed using STACKS (Catchen et al. 2013).

loci_pos <- read_tsv("stacks/batch_2.phistats.tsv", skip = 4)  %>%
    mutate( CHROM =
              str_replace(Chr, pattern = 'chr', replacement = 'LG')) %>%
    select( `Locus ID`, CHROM,BP ) %>%
    filter( grepl('LG', CHROM) ) %>%
    left_join(hypo_chrom_start) %>%
    mutate(GPOS = BP + GSTART )

global_fst <- read_tsv('stacks/Fst_SPEC_genepop.tsv') %>%
  setNames(.,nm = c('Locus ID', 'FIS', 'FST', 'FIT')) %>%
  left_join( loci_pos ) %>%
  filter( !is.na(CHROM)) %>%
  mutate( window = 'global_fst')

Now, we can plot the RAD data in the context of the hamlet genome.

global_fst %>%
  ggplot(aes(x = GPOS, y = FST))+
  geom_hypo_LG() +
  geom_point(aes( color = CHROM ), alpha = .6, size = .3) +
  geom_smooth(aes(group = CHROM), color = rgb(1, 1, 1, .6),
              se = FALSE, size = 2,method = 'loess', span = .2)+
  geom_smooth(aes(color = CHROM, group = CHROM),
              se = FALSE, method = 'loess', span = .2)+
  scale_fill_hypo_LG_bg()+
  scale_x_hypo_LG()+
  scale_color_hypo_LG(guide = FALSE)+
  theme_hypo()

Besides plotting genome wide data, the package hypogen provides methods to plot data in the context of the genome annotation. For this, we first select a specific LG and define a range of interest.

LG <- "LG09"
XR <- c(200000, 275000)
anno_data <- global_fst %>% filter(CHROM == LG , BP > XR[1], BP < XR[2])
clr <- c('darkgray', RColorBrewer::brewer.pal(3, 'Set1')[c(1,3)])

Then, we plot the data subset together with the genome annotation.

hypo_annotation_baseplot(searchLG = LG, xrange = XR,
                         genes_of_interest = "AKT3_0",
                         genes_of_sec_interest = "Pus10",
                         anno_rown = 2) +
  geom_point(data = anno_data, aes(x = BP, y = FST), size=.5) +
  coord_cartesian(xlim = XR) +
  facet_grid(window~., scales = 'free_y',
             switch = 'y', labeller = label_parsed) +
  scale_color_manual(values = c('black', 'gray', clr), guide = FALSE) +
  scale_fill_manual(values = clr, guide = FALSE) +
  scale_x_continuous(name = "Hypo AnnoPlot Title",
                     expand = c(0, 0),position = 'top') +
  theme_hypo() +
  theme_hypo_anno_extra()

Genotype frequencies

The last feature of the hypogen package is the use of ternary diagrams for genotype frequencies. This is done to control for heterozygote excess or deficit.

The implementation of ternary diagrams within the ggplot2 framework is done using the ggtern package.

library(ggtern)

The genotypes of the data should be coded AA/Aa/aa:

n <- 350

data <- tibble(AA = runif(n)) %>%
  mutate(aa = (1-AA)*runif(n),
         Aa = 1-AA-aa)

data %>%
  head() %>%
  knitr::kable()
AA Aa aa
0.7620661 0.0660756 0.1718583
0.4350251 0.1445212 0.4204537
0.5044092 0.3341618 0.1614290
0.0365411 0.6628914 0.3005675
0.8947430 0.0148808 0.0903762
0.4489359 0.1179725 0.4330917

As reference, the function hypo_hwe() can be used to indicate the genotype frequencies that are in Hardy–Weinberg equilibrium.

hypogen::hypo_hwe(n = 25) %>%
  ggplot(aes(x = AA, y = Aa, z = aa )) +
  coord_tern() +
  ggtern::stat_density_tern(data = data,
        geom='polygon',
        aes(fill=..level..))+
  geom_point(data = data) +
  geom_line()+
  scale_fill_viridis_c(guide = FALSE)

We can use this for example to demonstate the development of the inbreeding coefficient (FIS) across the genotype frequencies space.

nmax <- 15
sq <- 0:nmax
cross_df(list(AA = sq, Aa = sq, aa = sq)) %>%
  mutate(n = AA+Aa+aa) %>%
  filter(n == nmax) %>%
  mutate( p = (AA+.5*Aa)/n,
          q = (aa+.5*Aa)/n,
          He = 2*p*q,
          Ho = Aa/n,
          Fis = ifelse(Aa > 0,((He - Ho)/(He)),1)) %>%
  ggplot(aes(x = AA, y = Aa, z = aa )) +
  coord_tern() +
  stat_interpolate_tern(geom = "polygon",
                     formula = value~x+y, method = lm, n = 200,
                     breaks=seq(-1.4, 1.4, length.out = 500),
                     aes( value = Fis, fill = ..level..), expand=10)+
  geom_line(data = hypogen::hypo_hwe(n = 25),color=rgb(0,0,0,.3))+
  scico::scale_fill_scico('Fis',palette = 'roma')

References

Catchen, Julian, Paul A. Hohenlohe, Susan Bassham, Angel Amores, and William A. Cresko. 2013. “Stacks: An Analysis Tool Set for Population Genomics.” Molecular Ecology 22 (11): 3124–40. https://doi.org/10.1111/mec.12354.

Puebla, O., E. Bermingham, and W. O. McMillan. 2014. “Genomic Atolls of Differentiation in Coral Reef Fishes (Hypoplectrus Spp., Serranidae).” Molecular Ecology 23 (21): 5291–5303. https://doi.org/10.1111/mec.12926.


© 2024. All rights reserved. KH.