15 (git 14) Analysis XII (Outlier Region Phylogenies)
This pipeline can be executed as follows:
cd $BASE_DIR/nf/14_analysis_phylo_regions
nextflow run analysis_phylo_regions.nf
15.1 Summary
The phylogenies specific to particular differentiation outlier regions are reconstructed within the nextflow script analysis_phylo_regions.nf
(located under $BASE_DIR/nf/14_analysis_phylo_regions/
).
This includes both the sample-level as well as the population-level phylogenies.
15.2 Details of analysis_phylo_regions.nf
This part of the analysis was actually manged manually and not via
nextflow
. We still report the analysis as a.nf
script as we believe this is a cleaner and more concise report of the conducted analysis.
15.2.1 Setup
The nextflow script starts by opening the two specific linkage groups of the all_bp genotype data set and binding it to differentiation outlier IDs as well as to a reference table containing the genomic coordinates of all differentiation outlier regions.
#!/usr/bin/env nextflow
// ----------------------- DISCLAIMER ----------------------
// this pipeline was not actually run using nexflow,
// but managed manually
// ---------------------------------------------------------
// Region-specific phylogenies
// ---------------------------
// git 14.1
// bundle allBP files and outlier table
Channel
.fromFilePairs("../../1_genotyping/3_gatk_filtered/byLG/filterd.allBP.LG04.vcf.{gz,gz.tbi}")
.concat(Channel.fromFilePairs("../../1_genotyping/3_gatk_filtered/byLG/filterd.allBP.LG12.vcf.{gz,gz.tbi}"))
.concat(Channel.fromFilePairs("../../1_genotyping/3_gatk_filtered/byLG/filterd.allBP.LG12.vcf.{gz,gz.tbi}"))
.merge(Channel.from("LG04_1", "LG12_3", "LG12_4"))
.combine(Channel.fromPath("../../ressources/focal_outlier.tsv"))
.set{ vcf_lg_ch }
Then, two different sample lists (both excluding hybrid samples, one with and one without Serranus outgroup samples) are loaded and bound to a sample mode identifier.
// git 14.2
// toggle sample modes (with / without Serranus outgroup)
Channel.fromPath("../../ressources/samples_155.txt")
.concat(Channel.fromPath("../../ressources/samples_hybrids.txt"))
.merge(Channel.from("155", "hyS"))
.set{ sample_mode_ch }
Next, for each outlier region, the genotype data is subset to the respective outlier region.
// git 14.3
// subset genotypes to outlier region
process extract_regions {
input:
set val( vcfIdx ), file( vcf ), val( outlierId ), file( outlier_file ), file( sample_file ), val( sample_mode ) from vcf_lg_ch.combine( sample_mode_ch )
output:
set file( "*_${sample_mode}.vcf" ), val( outlierId ), val( sample_mode ) into ( vcf_raxml_ch, vcf_pomo_ch )
script:
"""
# Extract regions of interest from genotype data (allBP),
# remove hybrid / Serranus samples and indels; simplify headers
head -n 1 ${outlier_file} | cut -f 1-3 > outlier.bed
grep ${outlierId} ${outlier_file} | cut -f 1-3 >> outlier.bed
OUT_ALT=\$(echo ${outlierId} | tr '[:upper:]' '[:lower:]' | sed 's/_/./')
vcftools --gzvcf \
${vcf[0]} \
--bed outlier.bed \
--remove-indels \
--remove ${sample_file} \
--recode \
--stdout | \
grep -v '##' > \${OUT_ALT}_${sample_mode}.vcf
"""
}
Then, for the population-level phylogenies, the genotypes are first converted to fasta
format and then to a allele frequency format (.cf
).
At that point, iqtree2
is run to create the population-level phylogenies.
// git 14.4
// run iqtree under pomo model
process run_pomo {
publishDir "../../2_analysis/revPoMo/outlier_regions/", mode: 'copy'
input:
set file( vcf ), val( outlierId ), val( sample_mode ) from vcf_raxml_ch
output:
file( "*_pop.cf.treefile" ) into pomo_results_ch
script:
"""
OUT_ALT=\$(echo ${outlierId} | tr '[:upper:]' '[:lower:]' | sed 's/_/./')
# Convert to fasta format (Python scripts available at https://github.com/simonhmartin/genomics_general), picked up from 6.1.1 output
python \$SFTWR/genomics_general/VCF_processing/parseVCF.py -i ${vcf} > \${OUT_ALT}_${sample_mode}.geno
python \$SFTWR/genomics_general/genoToSeq.py \
-g \${OUT_ALT}_${sample_mode}.geno \
-s \${OUT_ALT}_${sample_mode}.fas \
-f fasta \
--splitPhased
# Reformat sample ids to provide population prefixes for cflib
sed -e 's/-/_/g' -e 's/>\(.*\)\([a-z]\{6\}\)_\([AB]\)/>\2-\1_\3/g' \${OUT_ALT}_${sample_mode}.fas > \${OUT_ALT}_${sample_mode}_p.fas
# Convert to allele frequency format (cflib library available at https://github.com/pomo-dev/cflib)
\$SFTWR/cflib/FastaToCounts.py \${OUT_ALT}_${sample_mode}_p.fas \${OUT_ALT}_${sample_mode}_pop.cf
# IQTREE analysis under PoMo model
iqtree2 \
-nt 16 \
-s \${OUT_ALT}_${sample_mode}_pop.cf \
-m HKY+F+P+N9+G4 \
-b 100
"""
}
For the sample-level phylogenies, the genotypes are also converted to fasta
format.
// git 14.5
// convert genotypes to fasta for raxml
process conversion_raxml {
input:
set file( vcf ), val( outlierId ), val( sample_mode ) from vcf_pomo_ch
output:
set val( outlierId ), val( sample_mode ), file( "*N.fas" ) into outlier_regions_ch
script:
"""
OUT_ALT=\$(echo ${outlierId} | tr '[:upper:]' '[:lower:]' | sed 's/_/./')
# Replace unknown character states and asterisks (deletions as encoded by GATK) with "N"
vcf-to-tab < ${vcf} | sed -e 's/\\.\\/\\./N\\/N/g' -e 's/[ACGTN\\*]\\/\\*/N\\/N/g' > \${OUT_ALT}_${sample_mode}N.tab
# Convert to fasta format (Perl script available at https://github.com/JinfengChen/vcf-tab-to-fasta)
wget https://raw.githubusercontent.com/JinfengChen/vcf-tab-to-fasta/master/vcf_tab_to_fasta_alignment.pl
perl ~/apps/vcf-tab-to-fasta/vcf_tab_to_fasta_alignment.pl -i \${OUT_ALT}_${sample_mode}N.tab > \${OUT_ALT}_${sample_mode}N.fas
"""
}
Then, raxml
is run directly on the fasta
files.
// git 14.6
// run raxml
process run_raxml {
publishDir "../../2_analysis/raxml/", mode: 'copy'
input:
set val( outlierId ), val( sample_mode ), file( fas ) from outlier_regions_ch
output:
file( "*.raxml.support" ) into outlier_results_ch
script:
"""
OUT_ALT=\$(echo ${outlierId} | tr '[:upper:]' '[:lower:]' | sed 's/_/./')
# Reconstruct phylogenies
raxml-NG --all \
--msa ${fas} \
--model GTR+G \
--tree pars{10},rand{10} \
--bs-trees 100 \
--threads 24 \
--worker 8 \
--seed 123 \
--prefix \${OUT_ALT}_${sample_mode}N
"""
}