5 (git 4) Analysis II (d_XY & pi)

This pipeline can be executed as follows:

cd $BASE_DIR/nf/04_analysis_dxy
source ../../sh/nextflow_alias.sh
nf_run_dxy

5.1 Summary

The genetic divergence, as well diversity, are computed within the nextflow script analysis_dxy.nf (located under $BASE_DIR/nf/04_analysis_dxy/). It takes the all BP data set and computes $d_{XY}$ and $\pi$. Below is an overview of the steps involved in the analysis. (The green dot indicates the genotype input, red arrows depict output that is exported for further use.)

5.2 Details of `analysis_dxy.nf`

5.2.1 Data preparation

The nextflow script starts by opening the genotype data and feeding it into two different streams (one for d_XY and one for $\pi$).

#!/usr/bin/env nextflow
// This pipeline includes the analysis run on the
//   all callable sites data sheet (dxy).

// git 4.1
// load genotypes
Channel
    .fromFilePairs("../../1_genotyping/3_gatk_filtered/filterd.allBP.vcf.{gz,gz.tbi}")
    .into{ vcf_ch; vcf_pi_ch }

The computation of d_XY is split by linkage group, so we need to initialize a channel for the LGs.

// git 4.2
// initialize LGs
Channel
    .from( ('01'..'09') + ('10'..'19') + ('20'..'24') )
    .set{ lg_ch }

Now, we can subset the data set and convert it to a custom genotype format.

// git 4.3
// split by LG and reformat the genotypes
process split_allBP {
    label 'L_32g15h_split_allBP'
    tag "LG${lg}"

    input:
    set val( lg ), vcfId, file( vcf ) from lg_ch.combine( vcf_ch )

    output:
    set val( lg ), file( 'filterd.allBP.vcf.gz' ), file( "allBP.LG${lg}.geno.gz" ) into geno_ch

    script:
    """
   module load openssl1.0.2

   vcftools --gzvcf ${vcf[0]} \
       --chr LG${lg} \
       --recode \
       --stdout | bgzip  > allBP.LG${lg}.vcf.gz

   python \$SFTWR/genomics_general/VCF_processing/parseVCF.py \
       -i allBP.LG${lg}.vcf.gz  | gzip > allBP.LG${lg}.geno.gz
   """
}

Since the species composition differs between locations, we need to initialize three separate sets of species. These are going to be used to create the divergence species pairs.

// git 4.4
// define location specific sepcies set
Channel.from( [[1, "ind"], [2, "may"], [3, "nig"], [4, "pue"], [5, "uni"]] ).into{ bel_spec1_ch; bel_spec2_ch }
Channel.from( [[1, "abe"], [2, "gum"], [3, "nig"], [4, "pue"], [5, "ran"], [6, "uni"]] ).into{ hon_spec1_ch; hon_spec2_ch }
Channel.from( [[1, "nig"], [2, "pue"], [3, "uni"]] ).into{ pan_spec1_ch; pan_spec2_ch }

For the diversity, we also initialize the full set of populations of the study.

// git 4.5
// init all sampled populations (for pi)
Channel
    .from('indbel', 'maybel', 'nigbel', 'puebel', 'unibel', 'abehon', 'gumhon', 'nighon', 'puehon', 'ranhon', 'unihon', 'nigpan', 'puepan', 'unipan')
    .set{spec_dxy}

We want to run a sliding window at different resolutions, so we set up a channel for these.

// git 4.6
// init slining window resolutions
Channel
    .from( 1, 5 )
    .into{ kb_ch; kb_ch2; kb_ch3 }

5.2.2 d_XY

To prepare all species comparisons used to estimate divergence, we combine each species witch all other species within a location.

// git 4.7
// prepare pair wise dxy
// ------------------------------
// create all possible species pairs depending on location
//   and combine with genotype subset (for the respective location)
// ------------------------------
// channel content after joining:
// set [0:val(loc), 1:file(vcf), 2:file(pop), 3:val(spec1), 4:val(spec2)]
// ------------------------------
bel_pairs_ch = Channel.from( "bel" )
    .combine( bel_spec1_ch )
    .combine( bel_spec2_ch )
    .filter{ it[1] < it[3] }
    .map{ it[0,2,4]}
hon_pairs_ch = Channel.from( "hon" )
    .combine( hon_spec1_ch )
    .combine(hon_spec2_ch)
    .filter{ it[1] < it[3] }
    .map{ it[0,2,4]}
pan_pairs_ch = Channel.from( "pan" )
    .combine( pan_spec1_ch )
    .combine(pan_spec2_ch)
    .filter{ it[1] < it[3] }
    .map{ it[0,2,4]}

Now we attach the genotypes and the resolution level to the species pairs.

// git 4.8
// combine species pair with genotypes (and window size)
bel_pairs_ch
    .concat( hon_pairs_ch, pan_pairs_ch )
    .combine( geno_ch )
    .combine( kb_ch )
    .into { all_dxy_pairs_ch; random_dxy_pairs_ch }

At this point, we can calculate d_XY.

// git 4.9
// compute the dxy values
process dxy_lg {
    label 'L_G32g15h_dxy_lg'
    tag "${spec1}${loc}-${spec2}${loc}_LG${lg}"

    input:
    set val( loc ), val( spec1 ), val( spec2 ), val( lg ), file( vcf ), file( geno ), val( kb ) from all_dxy_pairs_ch

    output:
    set val( "${spec1}${loc}-${spec2}${loc}-${kb}" ), file( "dxy.${spec1}${loc}-${spec2}${loc}.LG${lg}.${kb}0kb-${kb}kb.txt.gz" ), val( lg ), val( "${spec1}${loc}" ), val( "${spec2}${loc}" ), val( kb ) into dxy_lg_ch

    script:
    """
   module load openssl1.0.2
   module load intel17.0.4 intelmpi17.0.4

   zcat ${geno} | \
       head -n 1 | \
       cut -f 3- | \
       sed 's/\\t/\\n/g' | \
       awk -v OFS='\\t' '{print \$1, substr( \$1, length(\$1) - 5, 6)}' > pop.txt

   mpirun \$NQSII_MPIOPTS -np 1 \
       python \$SFTWR/genomics_general/popgenWindows.py \
       -w ${kb}0000 -s ${kb}000 \
       --popsFile pop.txt \
       -p ${spec1}${loc} -p ${spec2}${loc} \
       -g ${geno} \
       -o dxy.${spec1}${loc}-${spec2}${loc}.LG${lg}.${kb}0kb-${kb}kb.txt.gz \
       -f phased \
       --writeFailedWindows \
       -T 1
    """
}

Since the calculation was split across LGs, we now need to collect all LGs of a particular species pair…

// git 4.10
// collect all LGs for each species pair
dxy_lg_ch
  .groupTuple()
  .set{ tubbled_dxy }

… and merge the results.

// git 4.11
// concatenate all LGs for each species pair
process receive_tuple {
    label 'L_20g2h_receive_tuple'
    publishDir "../../2_analysis/dxy/${kb[0]}0k/", mode: 'copy'
    tag "${pop1[0]}-${pop2[0]}"

    input:
    set val( comp ), file( dxy ), val( lg ), val( pop1 ), val( pop2 ), val( kb ) from tubbled_dxy

    output:
    file( "dxy.${pop1[0]}-${pop2[0]}.${kb[0]}0kb-${kb[0]}kb.tsv.gz" ) into dxy_output_ch

    script:
    """
   zcat dxy.${pop1[0]}-${pop2[0]}.LG01.${kb[0]}0kb-${kb[0]}kb.txt.gz | \
   head -n 1 > dxy.${pop1[0]}-${pop2[0]}.${kb[0]}0kb-${kb[0]}kb.tsv;

   for j in {01..24};do
       echo "-> LG\$j"
       zcat dxy.${pop1[0]}-${pop2[0]}.LG\$j.${kb[0]}0kb-${kb[0]}kb.txt.gz | \
           awk 'NR>1{print}' >> dxy.${pop1[0]}-${pop2[0]}.${kb[0]}0kb-${kb[0]}kb.tsv;
   done

   gzip dxy.${pop1[0]}-${pop2[0]}.${kb[0]}0kb-${kb[0]}kb.tsv
   """
}

For control, we also create a “random” d_XY run, where we take the most diverged species pair and randomize the population assignment of the samples. Therefore, we first pick the most diverged species pair.

// git 4.12
// collect a species pair to randomize
Channel
    .from( [['bel', 'ind', 'may']] )
    .set{ random_run_ch }

Then we set the sliding window resolution.

// git 4.13
// setup channel content for random channel
Channel
    .from( 1 )
    .combine( random_run_ch )
    .combine( kb_ch2 )
    .filter{ it[4] == 5 }
    .set{ random_sets_ch }

Now, we randomize the population assignment of the samples and calculate differentiation.

// git 4.14
// permute the population assignment (the randomization)
process randomize_samples {
    label 'L_20g15h_randomize_samples'
    publishDir "../../2_analysis/fst/${kb}0k/random", mode: 'copy' , pattern: "*_windowed.weir.fst.gz"
    module "R3.5.2"

    input:
    set val( random_set ), val( loc ), val(spec1), val(spec2), val( kb ) from random_sets_ch

    output:
    set random_set, file( "random_pop.txt" ) into random_pops_ch
    file( "*_windowed.weir.fst.gz") into random_fst_out

    script:
    """
   cut -f 2,3 \$BASE_DIR/metadata/sample_info.txt | \
       grep "${loc}" | \
       grep "${spec1}\\|${spec2}" > pop_prep.tsv

   Rscript --vanilla \$BASE_DIR/R/randomize_pops.R

   grep A random_pop.txt | cut -f 1  > pop1.txt
   grep B random_pop.txt | cut -f 1  > pop2.txt

   vcftools \
     --gzvcf \$BASE_DIR/1_genotyping/3_gatk_filtered/filterd_bi-allelic.allBP.vcf.gz \
     --weir-fst-pop pop1.txt \
     --weir-fst-pop pop2.txt \
     --fst-window-step ${kb}0000 \
     --fst-window-size ${kb}0000 \
     --stdout | gzip > ${loc}-aaa-bbb.${kb}0k.random_${spec1}_${spec2}_windowed.weir.fst.gz

   """
}

We set up another channel to prepare the random d_XY…

// git 4.15
// pick random pair of interest
random_dxy_pairs_ch
    .filter{ it[0] == 'bel' && it[1] == 'ind' && it[2] == 'may'  && it[6] == 5 }
    .combine( random_pops_ch )
    .set{ random_assigned_ch }

.. and compute the divergence.

// git 4.16
// compute the dxy values
process dxy_lg_random {
    label 'L_G32g15h_dxy_lg_random'
    tag "aaa${loc}-bbb${loc}_LG${lg}"
    module "R3.5.2"

    input:
    set val( loc ), val( spec1 ), val( spec2 ), val( lg ), file( vcf ), file( geno ), val( kb ), val( random_set ), file( pop_file ) from random_assigned_ch

    output:
    set val( "aaa${loc}-bbb${loc}-${kb}0kb" ), file( "dxy.aaa${loc}-bbb${loc}.LG${lg}.${kb}0kb-${kb}kb.txt.gz" ), val( lg ), val( "aaa${loc}" ), val( "bbb${loc}" ), val( kb ) into dxy_random_lg_ch

    script:
    """
   module load openssl1.0.2
   module load intel17.0.4 intelmpi17.0.4

   mpirun \$NQSII_MPIOPTS -np 1 \
       python \$SFTWR/genomics_general/popgenWindows.py \
       -w ${kb}0000 -s ${kb}000 \
       --popsFile ${pop_file} \
       -p A -p B \
       -g ${geno} \
       -o dxy.aaa${loc}-bbb${loc}.LG${lg}.${kb}0kb-${kb}kb.txt.gz \
       -f phased \
       --writeFailedWindows \
       -T 1
    """
}

Again, we collect the output of the individual LGs….

// git 4.17
// collect all LGs of random run
dxy_random_lg_ch
.groupTuple()
.set{ tubbled_random_dxy }

… and we merge them

// git 4.18
// concatinate all LGs of random run
process receive_random_tuple {
    label 'L_20g2h_receive_random_tuple'
    publishDir "../../2_analysis/dxy/random/", mode: 'copy'

    input:
    set val( comp ), file( dxy ), val( lg ), val( pop1 ), val( pop2 ), val( kb )  from tubbled_random_dxy

    output:
    file( "dxy.${pop1[0]}-${pop2[0]}.${kb[0]}0kb-${kb[0]}kb.tsv.gz" ) into dxy_random_output_ch

    script:
    """
   zcat dxy.${pop1[0]}-${pop2[0]}.LG01.${kb[0]}0kb-${kb[0]}kb.txt.gz | \
   head -n 1 > dxy.${pop1[0]}-${pop2[0]}.${kb[0]}0kb-${kb[0]}kb.tsv;

   for j in {01..24};do
       echo "-> LG\$j"
       zcat dxy.${pop1[0]}-${pop2[0]}.LG\$j.${kb[0]}0kb-${kb[0]}kb.txt.gz | \
           awk 'NR>1{print}' >> dxy.${pop1[0]}-${pop2[0]}.${kb[0]}0kb-${kb[0]}kb.tsv;
   done

   gzip dxy.${pop1[0]}-${pop2[0]}.${kb[0]}0kb-${kb[0]}kb.tsv
   """
}

5.2.3 $\pi$

Estimating the diversity is quite straight forward: We take the prepared population identifier, the data set and the window resolutions and run VCFtools on each combination.

// ---------------------------------------------------------------
// The pi part need to be run AFTER the global fst outlier
// windows were selected (REMEMBER TO CHECK FST OUTLIER DIRECTORY)
// ---------------------------------------------------------------

// git 4.19
// calculate pi per species
process pi_per_spec {
    label 'L_32g15h_pi'
    tag "${spec}"
    publishDir "../../2_analysis/pi/${kb}0k", mode: 'copy'

    input:
    set val( spec ), vcfId, file( vcf ), val( kb ) from spec_dxy.combine( vcf_pi_ch ).combine( kb_ch3 )

    output:
   file( "*.${kb}0k.windowed.pi.gz" ) into pi_50k

    script:
    """
   module load openssl1.0.2

   vcfsamplenames ${vcf[0]} | \
       grep ${spec} > pop.txt

   vcftools --gzvcf ${vcf[0]} \
       --keep pop.txt \
       --window-pi ${kb}0000 \
       --window-pi-step ${kb}000 \
       --out ${spec}.${kb}0k 2> ${spec}.pi.log
   gzip ${spec}.${kb}0k.windowed.pi

   tail -n +2 \$BASE_DIR/2_analysis/summaries/fst_outliers_998.tsv | \
       cut -f 2,3,4 > outlier.bed

   vcftools --gzvcf ${vcf[0]} \
       --keep pop.txt \
       --exclude-bed outlier.bed \
       --window-pi ${kb}0000 \
       --window-pi-step ${kb}000\
       --out ${spec}_no_outlier.${kb}0k 2> ${spec}_${kb}0k_no_outllier.pi.log
   gzip ${spec}_no_outlier.${kb}0k.windowed.pi
   """
}

At this step we are done with divergence and diversity.

Script repository

5 (git 4) Analysis II (d_XY & pi)

5.1 Summary

5.2 Details of `analysis_dxy.nf`

5.2.1 Data preparation

5.2.2 d_XY

5.2.3 \(\pi\)

5 (git 4) Analysis II (dXY & pi)

5.1 Summary

5.2 Details of analysis_dxy.nf

5.2.1 Data preparation

5.2.2 dXY

5.2.3 \(\pi\)

5 (git 4) Analysis II (d_XY & pi)

5.2 Details of `analysis_dxy.nf`

5.2.2 d_XY