7 (git 6) Analysis IV (rho)

This pipeline can be executed as follows:

cd $BASE_DIR/nf/06_analysis_recombination
source ../../sh/nextflow_alias.sh
nf_run_recombination

7.1 Summary

The population recombination rate is estimated within the nextflow script analysis_recombination.nf (located under $BASE_DIR/nf/06_analysis_recombination/), which runs on the SNPs only data set. Below is an overview of the steps involved in the analysis. (The green dot indicates the genotype input, red arrows depict output that is exported for further use.)

7.2 Details of `analysis_recombination.nf`

7.2.1 Data preparation

The nextflow script starts by opening the genotype data.

#!/usr/bin/env nextflow
// This pipeline includes the recombination anlysis
// git 6.1
// load genotypes
Channel
    .fromFilePairs("../../1_genotyping/4_phased/phased_mac2.vcf.{gz,gz.tbi}")
    .set{ vcf_ch }

The estimation of the population recombination rate using FastEPRR happens in three steps. The first step is run independently for all linkage groups, so we set up a channel for the LGs.

// git 6.2
// initialize LGs
Channel
    .from( 1..24 )
    .map{ it.toString().padLeft(2, "0") }
    .set{ lg_ch }

To prepare the input, the genotypes are split by LG.

// git 6.3
// split genotypes by LG
process split_allBP {
    label 'L_20g2h_split_by_lg'
    tag "LG${lg}"

    input:
    set val( lg ), vcfId, file( vcf ) from lg_ch.combine( vcf_ch )

    output:
    set val( lg ), file( "phased_mac2.LG${lg}.vcf.gz" ) into vcf_by_lg_ch

    script:
    """
   module load openssl1.0.2

   vcftools --gzvcf ${vcf[0]} \
       --chr LG${lg} \
       --recode \
       --stdout | bgzip  > phased_mac2.LG${lg}.vcf.gz
   """
}

The prepared data is then fed to the first step of FastEPRR.

// git 6.4
// run fasteprr step 1
process fasteprr_s1 {
    label 'L_20g2h_fasteprr_s1'
    tag "LG${lg}"
    module "R3.5.2"

    input:
    set val( lg ), file( vcf ) from vcf_by_lg_ch

    output:
    file( "step1_LG${lg}" ) into step_1_out_ch

    script:
    """
   mkdir step1_LG${lg}
   Rscript --vanilla \$BASE_DIR/R/fasteprr_step1.R ./${vcf} step1_LG${lg} LG${lg} 50
   """
}

Since nextflow manages the results of its processes in a complex file structure, we need to collect all results of step 1 and bundle them before proceeding.

// git 6.5
// collect step 1 output
process fasteprr_s1_summary {
    label 'L_loc_fasteprr_s1_summmary'

    input:
    file( step1 ) from step_1_out_ch.collect()

    output:
    file( "step1" ) into ( step1_ch1, step1_ch2 )

    script:
    """
   mkdir step1
   cp step1_LG*/* step1/
   """
}

The second step of FastEPRR is parallelized over an arbitrary number of sub-processes. Here, we initialize 250 parallel processes and combine the parallelization index with the results from step 1.

// git 6.6
// initialize fasteperr subprocesses and attach them to step 1 output
Channel
    .from( 1..250 )
    .map{ it.toString().padLeft(3, "0") }
    .combine( step1_ch1 )
    .set{ step_2_run_ch }

Taking this prepared bundle, we now can start the second step of FastEPRR.

// git 6.7
// run fasteprr step 2
process fasteprr_s2 {
    label 'L_long_loc_fasteprr_s2'
    tag "run_${idx}"
    module "R3.5.2"

    input:
    set val( idx ), file( step1 ) from step_2_run_ch

    output:
    set val( idx ), file( "step2_run${idx}" ) into ( step_2_indxs, step_2_files )

    script:
    """
   mkdir -p step2_run${idx}
   Rscript --vanilla \$BASE_DIR/R/fasteprr_step2.R ${step1} step2_run${idx} ${idx}
   """
}

We collect both clones of the step 2 results and bundle the results in a single directory.

// git 6.8
// collect step 2 output
process fasteprr_s2_summary {
    label 'L_loc_fasteprr_s2_summmary'

    input:
    val( idx ) from step_2_indxs.map{ it[0] }.collect()
    file( files ) from step_2_files.map{ it[1] }.collect()

    output:
    file( "step2" ) into ( step2_ch )

    script:
    """
   mkdir step2

   for k in \$( echo ${idx} | sed 's/\\[//g; s/\\]//g; s/,//g'); do
       cp -r step2_run\$k/* step2/
   done
   """
}

Then we feed the bundled results into the third step of FastEPRR.

// git 6.9
// run fasteprr step 3
process fasteprr_s3 {
    label 'L_32g4h_fasteprr_s3'
    module "R3.5.2"

    input:
    set file( step1 ), file( step2 ) from step1_ch2.combine( step2_ch )

    output:
    file( "step3" ) into step_3_out_ch

    script:
    """
   mkdir step3
   Rscript --vanilla \$BASE_DIR/R/fasteprr_step3.R ${step1} ${step2} step3
   """
}

To ease the usage of the FastEPRR results downstream, we reformat them and compile a tidy table.

// git 6.10
// reformat overall fasteprr output
process fasteprr_s3_summary {
    label 'L_loc_fasteprr_s3_summmary'
    publishDir "../../2_analysis/fasteprr", mode: 'copy'

    input:
    file( step3 ) from step_3_out_ch

    output:
    file( "step4/fasteprr.all.rho.txt.gz" ) into ( step3_ch )

    script:
    """
   mkdir step4

   # ------ rewriting the fasteprr output into tidy format --------

   for k in  {01..24};do
       j="LG"\$k;
       echo \$j;
       \$BASE_DIR/sh/fasteprr_trans.sh step3/chr_\$j \$j step4/fasteprr.\$j
   done

   # --------- combining all LGs into a single data set -----------
   cd step4

   head -n 1 fasteprr.LG01.rho.txt > fasteprr.all.rho.txt

   for k in {01..24}; do
   echo "LG"\$k
   awk 'NR>1{print \$0}' fasteprr.LG\$k.rho.txt >> fasteprr.all.rho.txt
   done
   gzip fasteprr.all.rho.txt
   cd ..
   """
}

Finally, we are done with preparing the recombination rate for plotting.

7 (git 6) Analysis IV (rho)

7.1 Summary

7.2 Details of analysis_recombination.nf

7.2.1 Data preparation

7.2 Details of `analysis_recombination.nf`