11 (git 10) Analysis VIII (admixture)

This pipeline can be executed as follows:

cd $BASE_DIR/nf/10_analysis_admixture
source ../../sh/nextflow_alias.sh
nf_run_admixture

11.1 Summary

The degree of admixture is estimated within the nextflow script analysis_admixture.nf (located under $BASE_DIR/nf/10_analysis_admixture/), which runs on the SNPs only data set. Below is an overview of the steps involved in the analysis. (The green dot indicates the genotype input, red arrows depict output that is exported for further use.)

11.2 Details of analysis_admixture.nf

11.2.1 Data preparation

The nextflow script starts by opening the genotype data.

#!/usr/bin/env nextflow
// git 10.1
// open genotype data
Channel
    .fromFilePairs("../../1_genotyping/4_phased/phased_mac2.vcf.{gz,gz.tbi}")
    .set{ vcf_ch }

Next, we select the range k values that we we want to explore within the admixture analysis.

// git 10.2
// Set different k values for the admixture analysis
Channel
    .from( 2..15 )
    .set{ k_ch }

Then, we open the \(F_{ST}\) outlier regions within which we want to run the admixture analysis.

// git 10.3
// load Fst outlier regions
Channel
    .fromPath("../../ressources/plugin/poptrees/outlier.bed")
    .splitCsv(header:true, sep:"\t")
    .map{ row -> [ chrom:row.chrom, start:row.start, end:row.end, gid:row.gid ] }
    .combine( vcf_ch )
    .set{ vcf_admx }

Then, we subset the genotypes to each outlier regions of interest respectively and reformat them to the plink genotype format.

// git 10.4
// subset genotypes to the outlier region and reformat
process plink12 {
    label 'L_20g2h_plink12'
    tag "${grouping.gid}"

    input:
    set val( grouping ),  val( vcfidx ), file( vcf ) from vcf_admx

    output:
    set val( grouping ), file( "hapmap.*.ped" ), file( "hapmap.*.map" ), file( "hapmap.*.nosex" ), file( "pop.txt" ) into admx_plink

    script:
    """
   echo -e "CHROM\\tSTART\\tEND" > outl.bed
   echo -e "${grouping.chrom}\\t${grouping.start}\\t${grouping.end}" >> outl.bed

   vcfsamplenames ${vcf[0]} | \
       grep -v "tor\\|tab\\|flo" | \
       awk '{print \$1"\\t"\$1}' | \
       sed 's/\\t.*\\(...\\)\\(...\\)\$/\\t\\1\\t\\2/g' > pop.txt

   vcftools \
       --gzvcf ${vcf[0]} \
       --keep pop.txt \
       --bed outl.bed \
       --plink \
       --out admx_plink

   plink \
       --file admx_plink \
       --recode12 \
       --out hapmap.${grouping.gid}
   """
}

Then, we combine the reformatted genotypes with the k values…

// git 10.5
// combine genoutype subsets with k values
admx_prep  = k_ch.combine( admx_plink )

… and run admixture.

// git 10.6
// run admixture
process admixture_all {
    label 'L_20g4h_admixture_all'
    publishDir "../../2_analysis/admixture/", mode: 'copy'
    tag "${grouping.gid}.${k}"

    input:
    set  val( k ), val( grouping ), file( ped ), file( map ), file( nosex ), file( pop ) from admx_prep

    output:
    set val( "dummy" ), file( "*.out" ), file( "*.Q" ), file( "*.txt" ) into admx_log

    script:
    """
   mv ${pop} pop.${grouping.gid}.${k}.txt
   admixture --cv ${ped} ${k} | tee log.${grouping.gid}.${k}.out
   """
}