2024 GBS tutorial flowchart

1. Open a UNIX shell

  1. Connect with PuTTY to the server slurm-24.

  2. Obtain a Kerberos ticket to access our filer resources.

    kinit
    klist (1)
    1 A ticket is valid for seven days.
  3. You will see the prompt of the UNIX command-line interpreter, the shell. It is called a "shell" because it forms the outermost layer of the system, wrapped around its core, the "kernel".

  4. Check out your home directory.

    pwd
    ls

2. Set up your analysis directory

  1. Let’s have a look at the GBS data.

    cd /filer/projekte/gbs_course
    cd data
    ls
    ls -lhS gbs_reads | less (1)
    man ls
    cd gbs_reads
    ls | wc -l
    zcat HOR_337.fastq.gz | less -S (2)
    cd ..
    du -ch gbs_reads
    1 To close less, press q.
    2 The reads are in FASTQ format.
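As a side note, the four-line FASTQ record structure can be checked with standard shell tools. Below is a minimal sketch on a made-up two-read file (`toy.fastq` and its contents are invented for illustration):

```shell
# Build a hypothetical two-read FASTQ file inline.
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nFFFF\n' > toy.fastq
# Each record spans four lines: ID, sequence, separator, qualities.
awk 'NR % 4 == 1' toy.fastq        # print only the ID lines
awk 'END {print NR/4}' toy.fastq   # number of reads = line count / 4
```

Dividing the line count by four is exactly what the AWK step later in this section does.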
  2. Have a look at the passport data.

    less core200_passport_data.tsv
    column -t core200_passport_data.tsv | less -S
  3. Use WinSCP to copy the XLSX file with the passport data to your local computer.

  4. Create your working directory and change into it.

    cd /filer/projekte/gbs_course
    mkdir -p mascher/gbs (1)
    cd mascher/gbs
    1 Change mascher to your name.
  5. Create symbolic links to the GBS reads and the phenotype data.

    pwd (1)
    
    mkdir gbs_reads
    datadir="/filer/projekte/gbs_course/data/"
    ln -s $datadir/gbs_reads/*fastq.gz gbs_reads
    ls -l gbs_reads | less -S
    ls gbs_reads | wc -l
    1 Check that you are in your working directory (/filer/projekte/gbs_course/USER/gbs).
  6. Count the reads in one FASTQ file.

    zcat gbs_reads/HOR_337.fastq.gz | wc -l > HOR_337.len.txt
    cat HOR_337.len.txt
  7. We use a for-loop to get the read counts of all samples in one go.

    for i in gbs_reads/*fastq.gz; do
     name=`basename $i | cut -d . -f 1`
     count=`zcat $i | wc -l`
     echo $name $count
    done > fastq_line_counts.txt
  8. Add a column with the read count (as opposed to the line count).

    cat fastq_line_counts.txt | awk '{print $1,$2/4}' | sort -nk 2 > fastq_read_counts.txt (1)
    
    1 We use AWK to divide the second column by 4.
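The loop-and-divide logic above can be tried out safely on toy data first. Here is a sketch with two made-up gzipped FASTQ files (the directory `toy_reads` and the sample names A and B are invented):

```shell
# Create two hypothetical gzipped FASTQ files with one and two reads.
mkdir -p toy_reads
printf '@r1\nACGT\n+\nIIII\n' | gzip > toy_reads/A.fastq.gz
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nFFFF\n' | gzip > toy_reads/B.fastq.gz
# Same pattern as the loop above: line count per file, then divide by 4.
for i in toy_reads/*fastq.gz; do
 name=`basename $i | cut -d . -f 1`
 count=`zcat $i | wc -l`
 echo $name $count
done | awk '{print $1, $2/4}'
```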

3. Read mapping

  1. We will use the tools BWA and samtools for read mapping and processing of alignment records, respectively.

    module load bwa samtools
    module list
  2. We will first run a toy example with 10,000 reads of the HOR 337 sample.

    zcat gbs_reads/HOR_337.fastq.gz | head -n 40000 > HOR_337_10k.fastq
    
    ref='/filer/projekte/gbs_course/data/reference/Hordeum_vulgare_MorexV3_assembly.fasta' (1)
    bwa mem $ref HOR_337_10k.fastq > HOR_337_10k.sam (2)
    samtools sort HOR_337_10k.sam > HOR_337_10k.bam (2)
    samtools view HOR_337_10k.bam | less -S
    1 We use the MorexV3 reference genome sequence.
    2 Alignments are stored in the Sequence Alignment Map format (SAM) or its binary equivalent BAM.
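To get a feel for the SAM columns before inspecting real alignments, a single made-up alignment line can be fed to awk. The read name, coordinates, and values below are all invented:

```shell
# A hypothetical SAM alignment line; the first columns are
# QNAME, FLAG, RNAME (reference), POS and MAPQ.
printf 'r1\t0\tchr1H\t1000\t60\t4M\t*\t0\t0\tACGT\tIIII\n' |
awk -F '\t' '{print "read:", $1, "maps to", $3, "at", $4, "with MAPQ", $5}'
```

The fifth column, MAPQ, is the mapping quality that samtools view -q 20 filters on later.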
  3. Combine mapping and sorting into one command.

    bwa mem $ref HOR_337_10k.fastq | samtools sort > HOR_337_10k.bam
  4. Open a new PuTTY session. Start tmux to keep your commands running in the background.

    k5reauth tmux (1)
    1 k5reauth is used to prolong the Kerberos ticket for as long as the tmux session is running.
  5. Try out tmux functionality.

    tmux ls
    tmux rename mapping
    tmux ls
    tmux detach
    tmux ls
    tmux attach -t mapping
  6. Map all samples.

    ref='/filer/projekte/gbs_course/data/reference/Hordeum_vulgare_MorexV3_assembly.fasta'
    for i in gbs_reads/*fastq.gz; do
     name=`echo $i | cut -d . -f 1` (1)
     bwa mem -t 12 $ref $i | samtools sort > $name.bam
    done 2> bwa.err (2)
    1 Strip the extension: gbs_reads/HOR_337.fastq.gz becomes gbs_reads/HOR_337, so the BAM files are written into the gbs_reads directory.
    2 To detach the tmux session, press Ctrl-B followed by D.
    If you forget to start bwa inside a tmux session, there is no way to prevent your job from aborting when you shut down your laptop. Also, without k5reauth, programs cannot access filer resources after a maximum of ten hours.
  7. Open a new terminal. Look at your jobs in the table of processes (top).

    top -u mascher (1)
    1 Replace mascher with your username.
  8. Once the mapping is finished, count the number of BAM files.

    ls gbs_reads/*bam | wc -l
  9. If you have fewer than 200 files and the mapping didn’t work for some reason, use the pre-computed files.

    ln -fst gbs_reads /filer-dg/agruppen/dg6/mascher/gbs_course2024_231222/data/bam_files/*bam
  10. Count the number of mapped reads with a minimum quality score of 20 for the HOR 337 sample and compare the result to the input read count of that sample.

    samtools view -cq 20 gbs_reads/HOR_337.bam
    grep HOR_337 fastq_read_counts.txt
  11. Get mapped read counts for all samples.

    for i in gbs_reads/*bam; do
     name=`basename $i | cut -d . -f 1`
     count=`samtools view -cq 20 $i`
     echo $name $count
    done > mapped_reads.txt
  12. Combine the raw read counts and the mapping rates into one table.

    LC_ALL=C sort fastq_read_counts.txt > tmp1 (1)
    LC_ALL=C sort mapped_reads.txt > tmp2 (1)
    
    join tmp1 tmp2 | awk '{print $0,$3/$2*100}' | sort -nk 4 | column -t > mapping_stats.txt (2)
    
    rm -f tmp1 tmp2
    
    column -t mapping_stats.txt  | less -S
    1 To combine two lists with join, both lists need to be sorted on the common ID column.
    2 column is used to align columns.
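The sorted-input requirement of join can be illustrated on two tiny made-up count lists; the numbers are invented, and the awk step mirrors the percentage calculation above:

```shell
# Two hypothetical, already-sorted lists sharing an ID column.
printf 'A 100\nB 200\n' > counts1.txt
printf 'A 80\nB 50\n'  > counts2.txt
# Join on the ID, then compute column 3 as a percentage of column 2.
join counts1.txt counts2.txt | awk '{print $0, $3/$2*100}'
```

If either input were unsorted, join would miss or misorder matches, which is why both real lists are passed through sort first.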

4. Variant calling and filtration

  1. Open a new tmux session and load bcftools.

    tmux (1)
    tmux rename variant_call
    module load bcftools
    1 The variant calling will run for some time, so run it inside tmux.
  2. Get a list of all BAM files.

    ls gbs_reads/*bam | sort > bam_list.txt
  3. Run the variant calling.

    ref='/filer/projekte/gbs_course/data/reference/Hordeum_vulgare_MorexV3_assembly.fasta'
    bamlist='/filer-dg/agruppen/dg6/mascher/gbs_course2024_231222/try_231222/bam_list.txt' (1)
    vcf='bcftools_SNP_calling.vcf' (2)
    
    bcftools mpileup --bam-list $bamlist --skip-indels --fasta-ref $ref --min-MQ 20 --annotate AD | bcftools call --consensus-caller --variants-only --output $vcf (3)
    1 List of pre-computed BAM files.
    2 Output file in variant call format (VCF), described in the official VCF specification.
    3 We ignore insertions and deletions (--skip-indels), use only reads with a mapping quality of at least 20 (--min-MQ 20) and add allelic depth information (--annotate AD) for all genotype calls.
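Before opening a real VCF, the column layout can be previewed with a single made-up record; all values below are invented:

```shell
# A hypothetical VCF data line: CHROM, POS, ID, REF, ALT, QUAL, FILTER,
# INFO, FORMAT, then one genotype column per sample.
printf 'chr1H\t100\t.\tA\tG\t50\t.\t.\tGT:AD\t0/0:5,0\t1/1:0,7\n' |
awk -F '\t' '{print "pos:", $1":"$2, "ref:", $4, "alt:", $5}'
```

With FORMAT GT:AD, a sample entry like 0/0:5,0 means a homozygous-reference call backed by five reads for the reference allele and none for the alternate.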
  4. Filter the variant calls.

    filter='/filer/projekte/gbs_course/scripts/filter_vcf.zsh'
    vcf='/filer-dg/agruppen/dg6/mascher/gbs_course2024_231222/try_231222/bcftools_SNP_calling.vcf' (1)
    fvcf='bcftools_SNP_calling_filtered.vcf'
    
    $filter --vcf $vcf --dphom 2 --dphet 4 --minmaf 0.2 --minpresent 0.9 --minhomp 0.9 > $fvcf (2)
    1 Path to pre-computed VCF file.
    2 We keep homozygous genotype calls if they have at least two supporting reads; heterozygous calls are accepted if they are supported by no fewer than four reads. SNPs with a minor allele frequency below 20 % or less than 90 % present calls or less than 90 % homozygous calls are discarded.
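filter_vcf.zsh is a course-provided script; as a rough illustration of what thresholds such as --minmaf and --minhomp mean, here is a toy awk calculation on invented per-SNP genotype counts (the input layout is made up and is not the script's actual input format):

```shell
# Hypothetical genotype counts per SNP: ID, hom-ref, het, hom-alt, missing.
printf 'snp1 140 10 50 0\nsnp2 40 120 40 0\n' | awk '{
 present = $2 + $3 + $4
 af = ($4 * 2 + $3) / (2 * present)   # alternate allele frequency
 maf = (af < 0.5) ? af : 1 - af       # minor allele frequency
 homp = ($2 + $4) / present           # fraction of homozygous calls
 print $1, maf, homp
}'
```

With the thresholds used above (--minmaf 0.2, --minhomp 0.9), the second toy SNP would be discarded: only 40 % of its calls are homozygous.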
  5. Review the VCF file.

    grep -v '^##' bcftools_SNP_calling_filtered.vcf | column -t | less -S

5. Principal component analysis

  1. Open R.

    module load R/3.5.1
    R
  2. R is a widely used programming language in data science. Many introductory tutorials are available online.

  3. Load the required libraries.

    .libPaths(c("/filer-dg/agruppen/seq_shared/mascher/Rlib/3.5.1", "/opt/Bio/R_LIBS/3.5.1")) (1)
    
    library(data.table) (2)
    library(SeqArray) (3)
    library(SNPRelate) (3)
    library(countrycode) (4)
    
    1 Set the paths where the R libraries are located.
    2 data.table extends R core functionality when handling large tables.
    3 SeqArray and SNPRelate are two R packages to store and analyze SNP matrices.
    4 The countrycode package will be used to map country abbreviations to geographic regions.
  4. Convert the VCF file to the binary GDS (Genomic Data Structure) format used by SeqArray and SNPRelate.

    seqVCF2GDS(vcf.fn='/filer-dg/agruppen/dg6/mascher/gbs_course2024_231222/try_231222/bcftools_SNP_calling_filtered.vcf', out.fn='bcftools_SNP_calling_filtered.gds') (1)
    
    1 This creates the GDS file in the current working directory.
  5. Open the GDS file and get summary statistics.

    seqOpen('bcftools_SNP_calling_filtered.gds') -> gds
    seqSummary(gds)
  6. Run a principal components analysis (PCA) on the data and extract the eigenvectors.

    snpgdsPCA(gds, autosome.only=F) -> pca (1)
    
    data.table(pca$sample.id, pca$eigenvect[, 1:2]) -> ev
    setnames(ev, c("accession", "PC1", "PC2")) (2)
    ev[, accession := sub(".bam$", "", sub(".*/", "", accession))] (3)
    
    1 autosome.only=F is needed because chromosomes are named chr1H, chr2H … instead of 1, 2, …
    2 Set proper column names.
    3 Change the sample names inherited from the VCF file (BAM file names).
  7. Read the passport data for the core200 panel and merge them with the PCA results.

    library(openxlsx) (1)
    data.table(read.xlsx("/filer-dg/agruppen/dg6/mascher/gbs_course2024_231222/data/core200_passport_data.xlsx")) -> pp
    pp[ev, on="accession"] -> ev
    1 read.xlsx is provided by the openxlsx package, which has not been loaded yet.
  8. Plot the first two principal components (PCs) with samples coloured by row type.

    ev[, col := "gray"] (1)
    ev[row_type == "6-rowed", col := "black"] (2)
    ev[row_type == "2-rowed", col := "red"]
    
    pdf("PCA1.pdf") (3)
    ev[, plot(PC1, PC2, col=col, pch=19, xlab="PC1", ylab="PC2")] (4)
    dev.off() (3)
    
    1 Add a color column. The default color is gray.
    2 If the row type is six-rowed, set the color to black.
    3 Open a PDF file as the plot device and close it after the plot function has been called.
    4 pch specifies the plotting symbol. 19 means solid circles.
  9. Repeat with samples colored by annual growth habit.

    ev[, col2 := "gray"]
    ev[annual_growth_habit == "spring", col2 := "black"]
    ev[annual_growth_habit  == "winter", col2 := "red"]
    
    pdf("PCA2.pdf")
    ev[, plot(PC1, PC2, col=col2, pch=19, xlab="PC1", ylab="PC2")]
    dev.off()
    
  10. Map the countries to continents.

    ev[, countrycode(country_of_origin, "iso3c", "continent")] (1)
    ev[country_of_origin %in% c("DDR", "GER"), country_of_origin := "DEU"]
    ev[country_of_origin == "SUN", country_of_origin := "RUS"]
    ev[country_of_origin == "CSK", country_of_origin := "CZE"]
    
    ev[, continent := countrycode(country_of_origin, "iso3c", "continent")]
    ev[is.na(continent)] (2)
    
    1 Some country codes are invalid. Correct these.
    2 Check for missing data.
  11. Color according to country and plot the PCA again.

    ev[, col3 := "gray"]
    ev[continent == "Europe", col3 := "black"]
    ev[continent == "Asia", col3 := "red"]
    ev[continent == "Africa", col3 := "blue"]
    ev[continent == "Americas", col3 := "green"]
    
    pdf("PCA3.pdf")
    ev[, plot(PC1, PC2, col=col3, pch=19, xlab="PC1", ylab="PC2")]
    dev.off()
  12. Add a title and legend. Change the orientation of the y-axis labels.

    pdf("PCA4.pdf")
    ev[, plot(PC1, PC2, col=col3, pch=19, xlab="PC1", ylab="PC2", las=1, main="PCA colored by geography")] (1)
    legend("bottomright", col=c("black", "red", "blue", "green"), pch=19, legend=c("Europe", "Asia", "Africa", "Americas"))
    dev.off()
    1 las=1 means "always horizontal".