6 (git 5) Analysis III (topology weighting)
This pipeline can be executed as follows:
cd $BASE_DIR/nf/05_analysis_fasttree_twisst
source ../../sh/nextflow_alias.sh
nf_run_phylo
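The nf_run_phylo shorthand comes from the sourced alias file. Purely as a hypothetical sketch (the actual definition lives in $BASE_DIR/sh/nextflow_alias.sh and may use different flags), such an alias could look like:

# hypothetical sketch only -- see sh/nextflow_alias.sh for the real definition
alias nf_run_phylo="nextflow run analysis_fasttree_twisst.nf -c ../../nextflow.config -resume"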
6.1 Summary
A superseded version of the whole genome phylogeny and the topology weighting are prepared within the nextflow script analysis_fasttree_twisst.nf (located under $BASE_DIR/nf/05_analysis_fasttree_twisst/), which runs on the SNPs-only data set.
Below is an overview of the steps involved in the process.
(The green dot indicates the genotype input, red arrows depict output that is exported for further use.)
6.2 Details of analysis_fasttree_twisst.nf
6.2.1 Setup
The nextflow script starts by opening the genotype data and feeding it into two different streams (one for the phylogeny and one for topology weighting).
#!/usr/bin/env nextflow
// git 5.1
// open genotype data
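// (fromFilePairs emits tuples of the form [ common_prefix, [ file1, file2 ] ])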
Channel
    .fromFilePairs("../../1_genotyping/4_phased/phased_mac2.vcf.{gz,gz.tbi}")
    .into{ vcf_fasttree_whg; vcf_locations }
6.2.2 Whole genome phylogeny
During data exploration, various subsets of the data set were investigated, including the different sampling locations, whole genome vs. non-outlier regions, and all samples vs. hamlets only. (By now, most of these options have been muted by commenting them out.)
Here, we set up a channel managing the subset by location.
// git 5.2
// setting the sampling location
// (the script is set up to run on different subsets of samples)
Channel
    .from( "all" ) //, "bel", "hon", "pan" )
    .set{ locations4_ch }
This channel toggles the inclusion of \(F_{ST}\) outlier regions.
// git 5.3
// setting loci restrictions
// (keep vs remove outlier regions)
Channel
    .from( "whg" ) //, "no_outl" )
    .set{ whg_modes }
This channel toggles the inclusion of the non-hamlet outgroup.
// git 5.4
// setting the sampling mode
// (the script is set up to run on different subsets of samples)
Channel
    .from( "no_outgroups" ) //, "all" )
    .set{ sample_modes }
We combine the different selector channels to create all possible combinations of the settings.
// git 5.5
// compile the config settings and add data file
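// resulting tuples: ( loc, vcfId, [ vcf, tbi ], mode, sample_mode )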
locations4_ch
    .combine( vcf_fasttree_whg )
    .combine( whg_modes )
    .combine( sample_modes )
    .set{ vcf_fasttree_whg_location_combo }
To prepare the input for the phylogeny, this fine-tuned selection is applied to subset the genotypes.
// git 5.6
// apply sample filter, subset and convert genotypes
process subset_vcf_by_location_whg {
    label "L_28g5h_subset_vcf_whg"

    input:
    set val( loc ), vcfId, file( vcf ), val( mode ), val( sample_mode ) from vcf_fasttree_whg_location_combo

    output:
    set val( mode ), val( loc ), val( sample_mode ), file( "${loc}.${mode}.${sample_mode}.whg.geno.gz" ) into snp_geno_tree_whg

    script:
    """
    DROP_CHRS=" "

    # check if samples need to be dropped based on location
    if [ "${loc}" == "all" ]; then
        vcfsamplenames ${vcf[0]} > prep.pop
    else
        vcfsamplenames ${vcf[0]} | \
            grep ${loc} > prep.pop
    fi

    # check if outgroups need to be dropped
    if [ "${sample_mode}" == "all" ]; then
        mv prep.pop ${loc}.pop
    else
        cat prep.pop | \
            grep -v tor | \
            grep -v tab > ${loc}.pop
    fi

    # check if diverged LGs need to be dropped
    if [ "${mode}" == "no_outl" ]; then
        DROP_CHRS="--not-chr LG04 --not-chr LG07 --not-chr LG08 --not-chr LG09 --not-chr LG12 --not-chr LG17 --not-chr LG23"
    fi

    vcftools --gzvcf ${vcf[0]} \
        --keep ${loc}.pop \
        \$DROP_CHRS \
        --mac 3 \
        --recode \
        --stdout | gzip > ${loc}.${mode}.${sample_mode}.vcf.gz

    python \$SFTWR/genomics_general/VCF_processing/parseVCF.py \
        -i ${loc}.${mode}.${sample_mode}.vcf.gz | gzip > ${loc}.${mode}.${sample_mode}.whg.geno.gz
    """
}
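For reference, the exported .geno.gz files follow the tabular genotype format of the genomics_general toolkit. Assuming phased input, the uncompressed content looks roughly like this (hypothetical sample IDs and positions):

#CHROM  POS   sample1  sample2
LG01    1205  A|A      A|T
LG01    1338  C|G      C|C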
The subset is converted to fasta format, creating two whole genome pseudo-haplotypes per sample.
// git 5.7
// convert genotypes to fasta
process fasttree_whg_prep {
    label 'L_190g4h_fasttree_whg_prep'
    tag "${mode} - ${loc} - ${sample_mode}"

    input:
    set val( mode ), val( loc ), val( sample_mode ), file( geno ) from snp_geno_tree_whg

    output:
    set val( mode ), val( loc ), val( sample_mode ), file( "all_samples.${loc}.${mode}.${sample_mode}.whg.SNP.fa" ) into ( fasttree_whg_prep_ch )

    script:
    """
    python \$SFTWR/genomics_general/genoToSeq.py -g ${geno} \
        -s all_samples.${loc}.${mode}.${sample_mode}.whg.SNP.fa \
        -f fasta \
        --splitPhased
    """
}
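Because of --splitPhased, every sample contributes two sequences to the fasta file, one per pseudo-haplotype. Schematically (hypothetical sample ID, following the genomics_general convention of suffixing the two haplotypes with _A and _B):

>sample1_A
ATGCGT...
>sample1_B
ATGCCT...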
Then, the phylogeny is reconstructed based on the converted genotypes.
// git 5.8
// create phylogeny
process fasttree_whg_run {
    label 'L_300g30h_fasttree_run'
    tag "${mode} - ${loc} - ${sample_mode}"
    publishDir "../../2_analysis/fasttree/", mode: 'copy'

    input:
    set val( mode ), val( loc ), val( sample_mode ), file( fa ) from fasttree_whg_prep_ch

    output:
    file( "${sample_mode}.${loc}.${mode}.SNP.tree" ) into ( fasttree_whg_output )

    script:
    """
    fasttree -nt ${fa} > ${sample_mode}.${loc}.${mode}.SNP.tree
    """
}
6.2.3 Topology weighting
The complexity of topology weighting increases non-linearly with the number of included populations, as the number of possible unrooted topologies skyrockets. The maximum number of populations allowed within twisst is eight, so we run the topology weighting for Belize and Honduras independently (for the three species at Panama, only one unrooted topology is possible, so we don't run twisst there).
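For reference, \(n\) populations allow for \((2n - 5)!!\) possible unrooted topologies: a single one for \(n = 3\), but already 3, 15, 105, 945 and 10395 topologies for \(n = 4\) up to \(n = 8\).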
We start by setting up a channel for the sampling location.
// git 5.9
// initialize the locations for topology weighting
Channel
    .from( "bel", "hon" )
    .set{ locations_ch }
Then, we attach the genotypes to the location.
// git 5.10
locations_ch
    .combine( vcf_locations )
    .set{ vcf_location_combo }
The analysis is split by linkage group, so we need to initialize the LGs.
// git 5.11
// initialize LGs
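// (the LG range is assembled in blocks because Groovy string ranges only
//  increment the last character, so a single range cannot cross '09' -> '10')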
Channel
    .from( ('01'..'09') + ('10'..'19') + ('20'..'24') )
    .map{ "LG" + it }
    .set{ lg_twisst }
The data is subset to include only the samples of the respective location.
// git 5.12
// subset the genotypes by location
process subset_vcf_by_location {
    label "L_20g2h_subset_vcf"

    input:
    set val( loc ), vcfId, file( vcf ) from vcf_location_combo

    output:
    set val( loc ), file( "${loc}.vcf.gz" ), file( "${loc}.pop" ) into ( vcf_loc_twisst )

    script:
    """
    # keep the samples of this location, dropping the outgroups
    # (sample IDs containing 'tor' or 'tab')
    vcfsamplenames ${vcf[0]} | \
        grep ${loc} | \
        grep -v tor | \
        grep -v tab > ${loc}.pop

    vcftools --gzvcf ${vcf[0]} \
        --keep ${loc}.pop \
        --mac 3 \
        --recode \
        --stdout | gzip > ${loc}.vcf.gz
    """
}
While running twisst, we ran into an issue regarding incompatibilities between the scheduling system used on our computing cluster and the threading within the python scripts used in the preparation of twisst. Unfortunately, this prevented us from running the preparation in place (as part of the nextflow script). Instead, we ran the preparation separately on a local computer and clumsily plugged the results into the nextflow process.
Below, we include the originally intended workflow, which we muted so that the script is runnable using the plug-in approach. Still, the original workflow describes the steps executed locally and conveys the intermediate steps more clearly. Also, on a different computing cluster, the original script should work alright.
The next step is to attach the LGs to the genotype subsets.
// ---------------------------------------------------------------
// Unfortunately the twisst preparation did not work on the cluster
// ('in place'), so I had to setup the files locally and then plug
// them into this workflow.
// Below is the originally intended clean workflow (commented out),
// while the plugin version picks up at git 5.19.
// ---------------------------------------------------------------
/* MUTE:
// git 5.13
// add the lg channel to the genotype subset
vcf_loc_twisst
    .combine( lg_twisst )
    .set{ vcf_loc_lg_twisst }
*/
Based on the LGs, the genotypes are split.
/* MUTE: python thread conflict - run locally and feed into ressources/plugin
// git 5.14
// subset genotypes by LG
process vcf2geno_loc {
    label 'L_20g15h_vcf2geno'

    input:
    set val( loc ), file( vcf ), file( pop ), val( lg ) from vcf_loc_lg_twisst

    output:
    set val( loc ), val( lg ), file( "${loc}.${lg}.geno.gz" ), file( pop ) into snp_geno_twisst

    script:
    """
    vcftools \
        --gzvcf ${vcf} \
        --chr ${lg} \
        --recode \
        --stdout |
        gzip > intermediate.vcf.gz

    python \$SFTWR/genomics_general/VCF_processing/parseVCF.py \
        -i intermediate.vcf.gz | gzip > ${loc}.${lg}.geno.gz
    """
}
*/
We initialize the window size (in SNPs) used for the topology weighting…
/* MUTE: python thread conflict - run locally and feed into ressources/plugin
// git 5.15
// initialize SNP window size
Channel.from( 50, 200 ).set{ twisst_window_types }
*/
…and attach them to the genotypes.
/* MUTE: python thread conflict - run locally and feed into ressources/plugin
// git 5.16
// add the SNP window size to the genotype subset
snp_geno_twisst.combine( twisst_window_types ).set{ twisst_input_ch }
*/
To conduct topology weighting, we need some underlying phylogenies, so we run PhyML along the sliding window.
/* MUTE: python thread conflict - run locally and feed into ressources/plugin
// git 5.17
// create the phylogenies along the sliding window
process twisst_prep {
    label 'L_G120g40h_prep_twisst'
    publishDir "../../2_analysis/twisst/positions/${loc}/", mode: 'copy'

    input:
    set val( loc ), val( lg ), file( geno ), file( pop ), val( twisst_w ) from twisst_input_ch.filter { it[0] != 'pan' }

    output:
    set val( loc ), val( lg ), file( geno ), file( pop ), val( twisst_w ), file( "*.trees.gz" ), file( "*.data.tsv" ) into twisst_prep_ch

    script:
    """
    module load intel17.0.4 intelmpi17.0.4

    mpirun \$NQSII_MPIOPTS -np 1 \
        python \$SFTWR/genomics_general/phylo/phyml_sliding_windows.py \
        -g ${geno} \
        --windType sites \
        -w ${twisst_w} \
        --prefix ${loc}.${lg}.w${twisst_w}.phyml_bionj \
        --model HKY85 \
        --optimise n \
        --threads 1
    """
}
*/
The last step then is to run twisst on the prepared phylogenies.
/* MUTE: python thread conflict - run locally and feed into ressources/plugin
// git 5.18
// run the topology weighting on the phylogenies
process twisst_run {
    label 'L_G120g40h_run_twisst'
    publishDir "../../2_analysis/twisst/weights/", mode: 'copy'

    input:
    set val( loc ), val( lg ), file( geno ), file( pop ), val( twisst_w ), file( tree ), file( data ) from twisst_prep_ch

    output:
    set val( loc ), val( lg ), val( twisst_w ), file( "*.weights.tsv.gz" ), file( "*.data.tsv" ) into ( twisst_output )

    script:
    """
    module load intel17.0.4 intelmpi17.0.4

    awk '{print \$1"\\t"\$1}' ${pop} | \
        sed 's/\\(...\\)\\(...\\)\$/\\t\\1\\t\\2/g' | \
        cut -f 1,3 | \
        awk '{print \$1"_A\\t"\$2"\\n"\$1"_B\\t"\$2}' > ${loc}.${lg}.twisst_pop.txt

    TWISST_POPS=\$( cut -f 2 ${loc}.${lg}.twisst_pop.txt | sort | uniq | paste -s -d',' | sed 's/,/ -g /g; s/^/-g /' )

    mpirun \$NQSII_MPIOPTS -np 1 \
        python \$SFTWR/twisst/twisst.py \
        --method complete \
        -t ${tree} \
        -T 1 \
        \$TWISST_POPS \
        --groupsFile ${loc}.${lg}.twisst_pop.txt | \
        gzip > ${loc}.${lg}.w${twisst_w}.phyml_bionj.weights.tsv.gz
    """
}
*/
Here, the plug-in approach picks up. At this point, we have run PhyML locally and deposited the results under $BASE_DIR/ressources/plugin/trees.
To restart, we need to emulate the settings needed for the twisst process.
// git 5.19
// emulate setting
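// resulting tuples: ( mode = SNP window size, loc, vcf, pop, lg )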
Channel
    .from( 50, 200 )
    .combine( vcf_loc_twisst )
    .combine( lg_twisst )
    .set{ twisst_modes }
Then, we feed the phylogenies from an external directory into twisst.
// git 5.20
// run the topology weighting on the phylogenies
process twisst_plugin {
    label 'L_G120g40h_twisst_plugin'
    publishDir "../../2_analysis/twisst/weights/", mode: 'copy'
    tag "${loc}-${lg}-${mode}"

    input:
    set val( mode ), val( loc ), file( vcf ), file( pop ), val( lg ) from twisst_modes

    output:
    set val( loc ), val( lg ), file( "*.weights.tsv.gz" ) into ( twisst_output )

    script:
    """
    module load intel17.0.4 intelmpi17.0.4
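
    # Build the twisst groups file: duplicate each sample ID, split its
    # trailing six characters into species (3) and location (3) codes, keep
    # the species code, and emit one line per pseudo-haplotype (_A / _B).
    # TWISST_POPS then turns the unique species codes into '-g' flags
    # (e.g. '-g nig -g pue -g uni', depending on the location).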
    awk '{print \$1"\\t"\$1}' ${pop} | \
        sed 's/\\(...\\)\\(...\\)\$/\\t\\1\\t\\2/g' | \
        cut -f 1,3 | \
        awk '{print \$1"_A\\t"\$2"\\n"\$1"_B\\t"\$2}' > ${loc}.${lg}.twisst_pop.txt

    TWISST_POPS=\$( cut -f 2 ${loc}.${lg}.twisst_pop.txt | sort | uniq | paste -s -d',' | sed 's/,/ -g /g; s/^/-g /' )

    mpirun \$NQSII_MPIOPTS -np 1 \
        python \$SFTWR/twisst/twisst.py \
        --method complete \
        -t \$BASE_DIR/ressources/plugin/trees/${loc}/${loc}.${lg}.w${mode}.phyml_bionj.trees.gz \
        \$TWISST_POPS \
        --groupsFile ${loc}.${lg}.twisst_pop.txt | \
        gzip > ${loc}.${lg}.w${mode}.phyml_bionj.weights.tsv.gz
    """
}
At this point, we are done with the phylogeny and the topology weighting.