7 (git 6) Analysis IV (rho)
This pipeline can be executed as follows:
cd $BASE_DIR/nf/06_analysis_recombination
source ../../sh/nextflow_alias.sh
nf_run_recombination
7.1 Summary
The population recombination rate is estimated within the nextflow script analysis_recombination.nf
(located under $BASE_DIR/nf/06_analysis_recombination/
), which runs on the SNPs only data set.
Below is an overview of the steps involved in the analysis.
(The green dot indicates the genotype input, red arrows depict output that is exported for further use.)
7.2 Details of analysis_recombination.nf
7.2.1 Data preparation
The nextflow script starts by opening the genotype data.
#!/usr/bin/env nextflow
// This pipeline includes the recombination anlysis
// git 6.1
// load genotypes
Channel
.fromFilePairs("../../1_genotyping/4_phased/phased_mac2.vcf.{gz,gz.tbi}")
.set{ vcf_ch }
The estimation of the population recombination rate using FastEPRR
happens in three steps.
The first step is run independently for all linkage groups, so we set up a channel for the LGs.
// git 6.2
// initialize LGs
Channel
.from( 1..24 )
.map{ it.toString().padLeft(2, "0") }
.set{ lg_ch }
To prepare the input, the genotypes are split by LG.
// git 6.3
// split genotypes by LG
process split_allBP {
label 'L_20g2h_split_by_lg'
tag "LG${lg}"
input:
set val( lg ), vcfId, file( vcf ) from lg_ch.combine( vcf_ch )
output:
set val( lg ), file( "phased_mac2.LG${lg}.vcf.gz" ) into vcf_by_lg_ch
script:
"""
module load openssl1.0.2
vcftools --gzvcf ${vcf[0]} \
--chr LG${lg} \
--recode \
--stdout | bgzip > phased_mac2.LG${lg}.vcf.gz
"""
}
The prepared data is then fed to the first step of FastEPRR
.
// git 6.4
// run fasteprr step 1
process fasteprr_s1 {
label 'L_20g2h_fasteprr_s1'
tag "LG${lg}"
module "R3.5.2"
input:
set val( lg ), file( vcf ) from vcf_by_lg_ch
output:
file( "step1_LG${lg}" ) into step_1_out_ch
script:
"""
mkdir step1_LG${lg}
Rscript --vanilla \$BASE_DIR/R/fasteprr_step1.R ./${vcf} step1_LG${lg} LG${lg} 50
"""
}
Since nextflow
manages the results of its processes in a complex file structure, we need to collect all results of step 1 and bundle them before proceeding.
// git 6.5
// collect step 1 output
process fasteprr_s1_summary {
label 'L_loc_fasteprr_s1_summmary'
input:
file( step1 ) from step_1_out_ch.collect()
output:
file( "step1" ) into ( step1_ch1, step1_ch2 )
script:
"""
mkdir step1
cp step1_LG*/* step1/
"""
}
The second step of FastEPRR
is parallelized over an arbitrary number of sub-processes.
Here, we initialize 250 parallel processes and combine the parallelization index with the results from step 1.
// git 6.6
// initialize fasteperr subprocesses and attach them to step 1 output
Channel
.from( 1..250 )
.map{ it.toString().padLeft(3, "0") }
.combine( step1_ch1 )
.set{ step_2_run_ch }
Taking this prepared bundle, we now can start the second step of FastEPRR
.
// git 6.7
// run fasteprr step 2
process fasteprr_s2 {
label 'L_long_loc_fasteprr_s2'
tag "run_${idx}"
module "R3.5.2"
input:
set val( idx ), file( step1 ) from step_2_run_ch
output:
set val( idx ), file( "step2_run${idx}" ) into ( step_2_indxs, step_2_files )
script:
"""
mkdir -p step2_run${idx}
Rscript --vanilla \$BASE_DIR/R/fasteprr_step2.R ${step1} step2_run${idx} ${idx}
"""
}
We collect both clones of the step 2 results and bundle the results in a single directory.
// git 6.8
// collect step 2 output
process fasteprr_s2_summary {
label 'L_loc_fasteprr_s2_summmary'
input:
val( idx ) from step_2_indxs.map{ it[0] }.collect()
file( files ) from step_2_files.map{ it[1] }.collect()
output:
file( "step2" ) into ( step2_ch )
script:
"""
mkdir step2
for k in \$( echo ${idx} | sed 's/\\[//g; s/\\]//g; s/,//g'); do
cp -r step2_run\$k/* step2/
done
"""
}
Then we feed the bundled results into the third step of FastEPRR
.
// git 6.9
// run fasteprr step 3
process fasteprr_s3 {
label 'L_32g4h_fasteprr_s3'
module "R3.5.2"
input:
set file( step1 ), file( step2 ) from step1_ch2.combine( step2_ch )
output:
file( "step3" ) into step_3_out_ch
script:
"""
mkdir step3
Rscript --vanilla \$BASE_DIR/R/fasteprr_step3.R ${step1} ${step2} step3
"""
}
To ease the usage of the FastEPRR
results downstream, we reformat them and compile a tidy table.
// git 6.10
// reformat overall fasteprr output
process fasteprr_s3_summary {
label 'L_loc_fasteprr_s3_summmary'
publishDir "../../2_analysis/fasteprr", mode: 'copy'
input:
file( step3 ) from step_3_out_ch
output:
file( "step4/fasteprr.all.rho.txt.gz" ) into ( step3_ch )
script:
"""
mkdir step4
# ------ rewriting the fasteprr output into tidy format --------
for k in {01..24};do
j="LG"\$k;
echo \$j;
\$BASE_DIR/sh/fasteprr_trans.sh step3/chr_\$j \$j step4/fasteprr.\$j
done
# --------- combining all LGs into a single data set -----------
cd step4
head -n 1 fasteprr.LG01.rho.txt > fasteprr.all.rho.txt
for k in {01..24}; do
echo "LG"\$k
awk 'NR>1{print \$0}' fasteprr.LG\$k.rho.txt >> fasteprr.all.rho.txt
done
gzip fasteprr.all.rho.txt
cd ..
"""
}
Finally, we are done with preparing the recombination rate for plotting.