6 (git 5) Analysis III (topology weighting)

This pipeline can be executed as follows:

cd $BASE_DIR/nf/05_analysis_fasttree_twisst
source ../../sh/nextflow_alias.sh
nf_run_phylo

6.1 Summary

A superseded version of the whole-genome phylogeny and the topology weighting are prepared within the nextflow script analysis_fasttree_twisst.nf (located under $BASE_DIR/nf/05_analysis_fasttree_twisst/), which runs on the SNPs-only data set. Below is an overview of the steps involved in the process. (The green dot indicates the genotype input; red arrows depict output that is exported for further use.)

6.2 Details of analysis_fasttree_twisst.nf

6.2.1 Setup

The nextflow script starts by opening the genotype data and feeding it into two different streams (one for the phylogeny and one for topology weighting).

#!/usr/bin/env nextflow
// git 5.1
// open genotype data
Channel
    .fromFilePairs("../../1_genotyping/4_phased/phased_mac2.vcf.{gz,gz.tbi}")
    .into{ vcf_fasttree_whg; vcf_locations }

6.2.2 Whole genome phylogeny

During data exploration, various subsets of the data set were investigated, including the different sampling locations, whole genome vs. non-outlier regions and all samples vs. hamlets only. (In the final version, most of these options have been muted by commenting them out.)

Here, we set up a channel managing the subset by location.

// git 5.2
// setting the sampling location
// (the script is set up to run on different subsets of samples)
Channel
    .from( "all" ) //, "bel", "hon", "pan" )
    .set{ locations4_ch }

This channel toggles the inclusion of \(F_{ST}\) outlier regions.

// git 5.3
// setting loci restrictions
// (keep vs remove outlier regions)
Channel
    .from( "whg" ) //, "no_outl" )
    .set{ whg_modes }

This channel toggles the inclusion of the non-hamlet outgroup.

// git 5.4
// setting the sampling mode
// (the script is set up to run on different subsets of samples)
Channel
    .from( "no_outgroups" ) //, "all" )
    .set{ sample_modes }

We combine the different selector channels to create all possible combinations of the settings.

// git 5.5
// compile the config settings and add data file
locations4_ch
    .combine( vcf_fasttree_whg )
    .combine( whg_modes )
    .combine( sample_modes )
    .set{ vcf_fasttree_whg_location_combo }

To prepare the input for the phylogeny, the fine-tuned selection is applied to subset the genotypes.

// git 5.6
// apply sample filter, subset and convert genotypes
process subset_vcf_by_location_whg {
    label "L_28g5h_subset_vcf_whg"

    input:
    set val( loc ), vcfId, file( vcf ), val( mode ), val( sample_mode ) from vcf_fasttree_whg_location_combo

    output:
    set val( mode ), val( loc ), val( sample_mode ), file( "${loc}.${mode}.${sample_mode}.whg.geno.gz" ) into snp_geno_tree_whg

    script:
    """
   DROP_CHRS=" "

   # check if samples need to be dropped based on location
   if [ "${loc}" == "all" ];then
       vcfsamplenames ${vcf[0]} > prep.pop
   else
       vcfsamplenames ${vcf[0]} | \
           grep ${loc}  > prep.pop
   fi

   # check if outgroups need to be dropped
   if [ "${sample_mode}" == "all" ];then
       mv prep.pop ${loc}.pop
   else
       cat prep.pop | \
           grep -v tor | \
           grep -v tab > ${loc}.pop
   fi

   # check if diverged LGs need to be dropped
   if [ "${mode}" == "no_outl" ];then
       DROP_CHRS="--not-chr LG04 --not-chr LG07 --not-chr LG08 --not-chr LG09 --not-chr LG12 --not-chr LG17 --not-chr LG23"
   fi

   vcftools --gzvcf ${vcf[0]} \
       --keep ${loc}.pop \
       \$DROP_CHRS \
       --mac 3 \
       --recode \
       --stdout | gzip > ${loc}.${mode}.${sample_mode}.vcf.gz

   python \$SFTWR/genomics_general/VCF_processing/parseVCF.py \
       -i  ${loc}.${mode}.${sample_mode}.vcf.gz | gzip > ${loc}.${mode}.${sample_mode}.whg.geno.gz
   """
}

The subset is converted to fasta format, creating two whole-genome pseudo-haplotypes per sample.

// git 5.7
// convert genotypes to fasta
process fasttree_whg_prep {
    label 'L_190g4h_fasttree_whg_prep'
    tag "${mode} - ${loc} - ${sample_mode}"

    input:
    set val( mode ), val( loc ), val( sample_mode ), file( geno ) from snp_geno_tree_whg

    output:
    set val( mode ), val( loc ), val( sample_mode ), file( "all_samples.${loc}.${mode}.${sample_mode}.whg.SNP.fa" ) into ( fasttree_whg_prep_ch )

    script:
    """
   python \$SFTWR/genomics_general/genoToSeq.py -g ${geno} \
       -s  all_samples.${loc}.${mode}.${sample_mode}.whg.SNP.fa \
       -f fasta \
       --splitPhased
   """
}
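What --splitPhased does can be illustrated with a toy phased genotype row (made-up data; the real geno format carries a header and coordinate columns, omitted here):

```shell
# Toy illustration of the pseudo-haplotype split: for each phased genotype
# (left|right), the left alleles form haplotype _A and the right alleles
# form haplotype _B of the same sample.
printf 'sample1\tA|T\tC|C\tG|A\n' | \
    awk '{
        a = ""; b = ""
        for (i = 2; i <= NF; i++) { split($i, h, "[|]"); a = a h[1]; b = b h[2] }
        print ">" $1 "_A\n" a "\n>" $1 "_B\n" b
    }'
```

This yields two fasta records, sample1_A (ACG) and sample1_B (TCA) — the two pseudo-haplotypes per sample mentioned above.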

Then, the phylogeny is reconstructed based on the converted genotypes.

// git 5.8
// create phylogeny
process fasttree_whg_run {
    label 'L_300g30h_fasttree_run'
    tag "${mode} - ${loc} - ${sample_mode}"
    publishDir "../../2_analysis/fasttree/", mode: 'copy'

    input:
    set val( mode ), val( loc ), val( sample_mode ), file( fa ) from fasttree_whg_prep_ch

    output:
    file( "${sample_mode}.${loc}.${mode}.SNP.tree" ) into ( fasttree_whg_output )

    script:
    """
   fasttree -nt ${fa} > ${sample_mode}.${loc}.${mode}.SNP.tree
   """
}

6.2.3 Topology weighting

The complexity of topology weighting increases non-linearly with the number of included populations, as the number of possible unrooted topologies skyrockets. The maximum number of populations allowed within twisst is eight, so we run the topology weighting for Belize and Honduras independently (for the three species at Panama only one unrooted topology is possible, so we don’t run twisst there).
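To put numbers on this: the count of unrooted binary topologies for \(n\) taxa is \((2n-5)!!\), which can be checked with a quick shell sketch (not part of the pipeline):

```shell
# Number of possible unrooted binary topologies for n taxa: (2n-5)!!
# (double factorial over the odd numbers). Sketch only, not pipeline code.
unrooted_topologies() {
    local n=$1 count=1 k
    for (( k = 3; k <= 2 * n - 5; k += 2 )); do
        count=$(( count * k ))
    done
    echo "$count"
}

unrooted_topologies 3   # 1     (Panama: a single possible topology)
unrooted_topologies 5   # 15
unrooted_topologies 8   # 10395 (at the twisst maximum of eight groups)
```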

We start by setting up a channel for the sampling location.

// git 5.9
// initialize the locations for topology weighting
Channel
    .from( "bel", "hon" )
    .set{ locations_ch }

Then, we attach the genotypes to the location.

// git 5.10
locations_ch
    .combine( vcf_locations )
    .set{ vcf_location_combo }

The analysis is split by linkage group, so we need to initialize the LGs.

// git 5.11
// initialize LGs
Channel
    .from( ('01'..'09') + ('10'..'19') + ('20'..'24') )
    .map{ "LG" + it }
    .set{ lg_twisst }
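As a sanity check, the same list of linkage group labels can be generated in the shell (zero-padded two-digit numbers with an "LG" prefix, matching the groovy ranges above):

```shell
# Generate the linkage group labels LG01 … LG24, mirroring the groovy
# ranges above.
printf 'LG%02d\n' $(seq 1 24)
```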

The data is subset to include only the samples of the respective location.

// git 5.12
// subset the genotypes by location
process subset_vcf_by_location {
    label "L_20g2h_subset_vcf"

    input:
    set val( loc ), vcfId, file( vcf ) from vcf_location_combo

    output:
    set val( loc ), file( "${loc}.vcf.gz" ), file( "${loc}.pop" ) into ( vcf_loc_twisst )

    script:
    """
   vcfsamplenames ${vcf[0]} | \
       grep ${loc} | \
       grep -v tor | \
       grep -v tab > ${loc}.pop

   vcftools --gzvcf ${vcf[0]} \
       --keep ${loc}.pop \
       --mac 3 \
       --recode \
       --stdout | gzip > ${loc}.vcf.gz
   """
}

While running twisst, we ran into an issue regarding an incompatibility between the scheduling system used on our computing cluster and the threading within the python scripts used in the preparation of twisst. Unfortunately, this prevented us from running the preparation in place (as part of the nextflow script). Instead, we ran the preparation separately on a local computer and clumsily plugged the results into the nextflow process.

Below we include the originally intended workflow, which we muted so that the script is runnable using the plug-in approach. Still, the original workflow describes the steps executed locally and conveys the intermediate steps more clearly. Also, on a different computing cluster, the original script should work alright.

The next step is to attach the LGs to the genotype subsets.

// ---------------------------------------------------------------
// Unfortunately the twisst preparation did not work on the cluster
// ('in place'), so I had to setup the files locally and then plug
// them into this workflow.
// Below is the originally intended clean workflow (commented out),
// while the plugin version picks up at git 5.19.
// ---------------------------------------------------------------

/* MUTE:
// git 5.13
// add the lg channel to the genotype subset
vcf_loc_twisst
   .combine( lg_twisst )
   .set{ vcf_loc_lg_twisst }
*/

Based on the LGs the genotypes are split.

/* MUTE: python thread conflict - run locally and feed into ressources/plugin
// git 5.14
// subset genotypes by LG
process vcf2geno_loc {
   label 'L_20g15h_vcf2geno'

   input:
   set val( loc ), file( vcf ), file( pop ), val( lg ) from vcf_loc_lg_twisst

   output:
   set val( loc ), val( lg ), file( "${loc}.${lg}.geno.gz" ), file( pop ) into snp_geno_twisst

   script:
   """
   vcftools \
       --gzvcf ${vcf} \
       --chr ${lg} \
       --recode \
       --stdout |
       gzip > intermediate.vcf.gz

   python \$SFTWR/genomics_general/VCF_processing/parseVCF.py \
     -i intermediate.vcf.gz | gzip > ${loc}.${lg}.geno.gz
   """
}
*/
We initialize the window size (in SNPs) used for the topology weighting…

/* MUTE: python thread conflict - run locally and feed into ressources/plugin
// git 5.15
// initialize SNP window size
Channel.from( 50, 200 ).set{ twisst_window_types }
*/

…and attach them to the genotypes.

/* MUTE: python thread conflict - run locally and feed into ressources/plugin
// git 5.16
// add the SNP window size to the genotype subset
snp_geno_twisst.combine( twisst_window_types ).set{ twisst_input_ch }
*/

To conduct topology weighting, we need some underlying phylogenies, so we run PhyML along the sliding window.

/* MUTE: python thread conflict - run locally and feed into ressources/plugin
// git 5.17
// create the phylogenies along the sliding window
process twisst_prep {
    label 'L_G120g40h_prep_twisst'
    publishDir "../../2_analysis/twisst/positions/${loc}/", mode: 'copy'

    input:
    set val( loc ), val( lg ), file( geno ), file( pop ), val( twisst_w ) from twisst_input_ch.filter { it[0] != 'pan' }

    output:
    set val( loc ), val( lg ), file( geno ), file( pop ), val( twisst_w ), file( "*.trees.gz" ), file( "*.data.tsv" ) into twisst_prep_ch

    script:
    """
    module load intel17.0.4 intelmpi17.0.4

    mpirun \$NQSII_MPIOPTS -np 1 \
    python \$SFTWR/genomics_general/phylo/phyml_sliding_windows.py \
        -g ${geno} \
        --windType sites \
        -w ${twisst_w} \
        --prefix ${loc}.${lg}.w${twisst_w}.phyml_bionj \
        --model HKY85 \
        --optimise n \
        --threads 1
    """
}
*/

The last step then is to run twisst on the prepared phylogenies.

/* MUTE: python thread conflict - run locally and feed into ressources/plugin
// git 5.18
// run the topology weighting on the phylogenies
process twisst_run {
   label 'L_G120g40h_run_twisst'
   publishDir "../../2_analysis/twisst/weights/", mode: 'copy'

   input:
   set val( loc ), val( lg ), file( geno ), file( pop ), val( twisst_w ), file( tree ), file( data ) from twisst_prep_ch

   output:
   set val( loc ), val( lg ), val( twisst_w ), file( "*.weights.tsv.gz" ), file( "*.data.tsv" ) into ( twisst_output )

   script:
   """
   module load intel17.0.4 intelmpi17.0.4

   awk '{print \$1"\\t"\$1}' ${pop} | \
   sed 's/\\(...\\)\\(...\\)\$/\\t\\1\\t\\2/g' | \
   cut -f 1,3 | \
   awk '{print \$1"_A\\t"\$2"\\n"\$1"_B\\t"\$2}' > ${loc}.${lg}.twisst_pop.txt

   TWISST_POPS=\$( cut -f 2 ${loc}.${lg}.twisst_pop.txt | sort | uniq | paste -s -d',' | sed 's/,/ -g /g; s/^/-g /' )

   mpirun \$NQSII_MPIOPTS -np 1 \
   python \$SFTWR/twisst/twisst.py \
     --method complete \
     -t ${tree} \
     -T 1 \
     \$TWISST_POPS \
     --groupsFile ${loc}.${lg}.twisst_pop.txt | \
     gzip > ${loc}.${lg}.w${twisst_w}.phyml_bionj.weights.tsv.gz
   """
}
*/

Here, the plug-in approach picks up. At this point we have run PhyML locally and deposited the results under $BASE_DIR/ressources/plugin/trees.

To restart, we need to emulate the settings needed for the twisst process.

// git 5.19
// emulate settings
Channel
    .from(50, 200)
    .combine( vcf_loc_twisst )
    .combine( lg_twisst )
    .set{ twisst_modes }

Then, we feed the phylogenies from an external directory into twisst.

// git 5.20
// run the topology weighting on the phylogenies
process twisst_plugin {
    label 'L_G120g40h_twisst_plugin'
    publishDir "../../2_analysis/twisst/weights/", mode: 'copy'
    tag "${loc}-${lg}-${mode}"

    input:
    set val( mode ), val( loc ), file( vcf ), file( pop ), val( lg ) from twisst_modes

    output:
    set val( loc ), val( lg ), file( "*.weights.tsv.gz" ) into ( twisst_output )

    script:
    """
   module load intel17.0.4 intelmpi17.0.4

   awk '{print \$1"\\t"\$1}' ${pop} | \
   sed 's/\\(...\\)\\(...\\)\$/\\t\\1\\t\\2/g' | \
   cut -f 1,3 | \
   awk '{print \$1"_A\\t"\$2"\\n"\$1"_B\\t"\$2}' > ${loc}.${lg}.twisst_pop.txt

   TWISST_POPS=\$( cut -f 2 ${loc}.${lg}.twisst_pop.txt | sort | uniq | paste -s -d',' | sed 's/,/ -g /g; s/^/-g /' )

   mpirun \$NQSII_MPIOPTS -np 1 \
   python \$SFTWR/twisst/twisst.py \
     --method complete \
     -t \$BASE_DIR/ressources/plugin/trees/${loc}/${loc}.${lg}.w${mode}.phyml_bionj.trees.gz \
     \$TWISST_POPS \
     --groupsFile ${loc}.${lg}.twisst_pop.txt | \
     gzip > ${loc}.${lg}.w${mode}.phyml_bionj.weights.tsv.gz
   """
}
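The shell gymnastics that build the groups file and the -g arguments are easier to follow on a toy sample list (hypothetical sample IDs; as in the real data, the trailing six characters encode species and site, e.g. pue + bel):

```shell
# Hypothetical sample IDs; the trailing six characters encode the species
# (e.g. pue, nig) and the site (e.g. bel), as assumed by the sed call below.
printf '18152puebel\n18161nigbel\n' > toy.pop

# Same pipeline as inside the twisst processes: duplicate the ID, split off
# the species code, and emit one line per phased haplotype (_A and _B).
awk '{print $1"\t"$1}' toy.pop | \
    sed 's/\(...\)\(...\)$/\t\1\t\2/g' | \
    cut -f 1,3 | \
    awk '{print $1"_A\t"$2"\n"$1"_B\t"$2}' > toy.twisst_pop.txt

# Collapse the unique species codes into the -g arguments for twisst.py.
TWISST_POPS=$( cut -f 2 toy.twisst_pop.txt | sort | uniq | paste -s -d',' | sed 's/,/ -g /g; s/^/-g /' )

cat toy.twisst_pop.txt   # four lines: both haplotypes of each sample, keyed by species
echo "$TWISST_POPS"      # -g nig -g pue
```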

At this point we are done with the phylogeny and the topology weighting.