2024 GBS tutorial flowchart

1. Open a UNIX shell

  1. Connect with PuTTY to the server slurm-24.

  2. Obtain a Kerberos ticket to access our filer resources.

    kinit
    klist (1)
    1 A ticket is valid for seven days.
  3. You will see the prompt of the UNIX command-line interpreter, the shell. It is called a "shell" because it forms the outermost layer of the system, wrapped around its core, the "kernel".

  4. Check out your home directory.

    pwd
    ls

2. Set up your analysis directory

  1. Let’s have a look at the GBS data.

    cd /filer/projekte/gbs_course
    cd data
    ls
    ls -lhS gbs_reads | less (1)
    man ls
    cd gbs_reads
    ls | wc -l
    zcat HOR_337.fastq.gz | less -S (2)
    cd ..
    du -ch gbs_reads
    1 To close less, press q.
    2 The reads are in FASTQ format.
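As a side note, the four-line FASTQ record structure can be checked with standard shell tools. Below is a minimal sketch on a made-up two-read file (`toy.fastq` and its contents are invented for illustration):

```shell
# Build a hypothetical two-read FASTQ file inline.
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nFFFF\n' > toy.fastq
# Each record spans four lines: ID, sequence, separator, qualities.
awk 'NR % 4 == 1' toy.fastq        # print only the ID lines
awk 'END {print NR/4}' toy.fastq   # number of reads = line count / 4
```

Dividing the line count by four is exactly what the AWK step later in this section does.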
  2. Have a look at the passport data.

    less core200_passport_data.tsv
    column -t core200_passport_data.tsv | less -S
  3. Use WinSCP to copy the XLSX file with the passport data to your local computer.

  4. Create your working directory and change into it.

    cd /filer/projekte/gbs_course
    mkdir -p mascher/gbs (1)
    cd mascher/gbs
    1 Change mascher to your name.
  5. Create symbolic links to the GBS reads and the phenotype data.

    pwd (1)
    
    mkdir gbs_reads
    datadir="/filer/projekte/gbs_course/data/"
    ln -s $datadir/gbs_reads/*fastq.gz gbs_reads
    ls -l gbs_reads | less -S
    ls gbs_reads | wc -l
    1 Check that you are in your working directory (/filer/projekte/gbs_course/USER/gbs).
  6. Count the reads in one FASTQ file.

    zcat gbs_reads/HOR_337.fastq.gz | wc -l > HOR_337.len.txt
    cat HOR_337.len.txt
  7. We use a for-loop to get the read counts of all samples in one go.

    for i in gbs_reads/*fastq.gz; do
     name=`basename $i | cut -d . -f 1`
     count=`zcat $i | wc -l`
     echo $name $count
    done > fastq_line_counts.txt
  8. Add a column with the read count (as opposed to the line count).

    cat fastq_line_counts.txt | awk '{print $1,$2/4}' | sort -nk 2 > fastq_read_counts.txt (1)
    
    1 We use AWK to divide the second column by 4.
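The loop-and-divide logic above can be tried out safely on toy data first. Here is a sketch with two made-up gzipped FASTQ files (the directory `toy_reads` and the sample names A and B are invented):

```shell
# Create two hypothetical gzipped FASTQ files with one and two reads.
mkdir -p toy_reads
printf '@r1\nACGT\n+\nIIII\n' | gzip > toy_reads/A.fastq.gz
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nFFFF\n' | gzip > toy_reads/B.fastq.gz
# Same pattern as the loop above: line count per file, then divide by 4.
for i in toy_reads/*fastq.gz; do
 name=`basename $i | cut -d . -f 1`
 count=`zcat $i | wc -l`
 echo $name $count
done | awk '{print $1, $2/4}'
```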

3. Read mapping

  1. We will use the tools BWA and samtools for read mapping and processing of alignment records, respectively.

    module load bwa samtools
    module list
  2. We will first run a toy example with 10,000 reads of the HOR 337 sample.

    zcat gbs_reads/HOR_337.fastq.gz | head -n 40000 > HOR_337_10k.fastq
    
    ref='/filer/projekte/gbs_course/data/reference/Hordeum_vulgare_MorexV3_assembly.fasta' (1)
    bwa mem $ref HOR_337_10k.fastq > HOR_337_10k.sam (2)
    samtools sort HOR_337_10k.sam > HOR_337_10k.bam (2)
    samtools view HOR_337_10k.bam | less -S
    1 We use the MorexV3 reference genome sequence.
    2 Alignments are stored in the Sequence Alignment Map format (SAM) or its binary equivalent BAM.
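To get a feel for the SAM columns before inspecting real alignments, a single made-up alignment line can be fed to awk. The read name, coordinates, and values below are all invented:

```shell
# A hypothetical SAM alignment line; the first columns are
# QNAME, FLAG, RNAME (reference), POS and MAPQ.
printf 'r1\t0\tchr1H\t1000\t60\t4M\t*\t0\t0\tACGT\tIIII\n' |
awk -F '\t' '{print "read:", $1, "maps to", $3, "at", $4, "with MAPQ", $5}'
```

The fifth column, MAPQ, is the mapping quality that samtools view -q 20 filters on later.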
  3. Combine mapping and sorting into one command.

    bwa mem $ref HOR_337_10k.fastq | samtools sort > HOR_337_10k.bam
  4. Open a new PuTTY session. Start tmux to keep your commands running in the background.

    k5reauth tmux (1)
    1 k5reauth is used to prolong the Kerberos ticket for as long as the tmux session is running.
  5. Try out tmux functionality.

    tmux ls
    tmux rename mapping
    tmux ls
    tmux detach
    tmux ls
    tmux attach -t mapping
  6. Map all samples.

    ref='/filer/projekte/gbs_course/data/reference/Hordeum_vulgare_MorexV3_assembly.fasta'
    for i in gbs_reads/*fastq.gz; do
     name=`echo $i | cut -d . -f 1` (1)
     bwa mem -t 12 $ref $i | samtools sort > $name.bam
    done 2> bwa.err (2)
    1 Strip the extension: gbs_reads/HOR_337.fastq.gz becomes gbs_reads/HOR_337, so the BAM files are written into the gbs_reads directory.
    2 To detach the tmux session, press Ctrl-B followed by D.
    If you forget to start bwa inside a tmux session, there is no way to prevent your job from aborting when you shut down your laptop. Also, without k5reauth, programs cannot access filer resources after a maximum of ten hours.
  7. Open a new terminal. Look at your jobs in the table of processes (top).

    top -u mascher (1)
    1 Replace mascher with your username.
  8. Once the mapping is finished, count the number of BAM files.

    ls gbs_reads/*bam | wc -l
  9. If you have fewer than 200 files and the mapping didn’t work for some reason, use the pre-computed files.

    ln -fst gbs_reads /filer-dg/agruppen/dg6/mascher/gbs_course2024_231222/data/bam_files/*bam
  10. Count the number of mapped reads with a minimum quality score of 20 for the HOR 337 sample and compare the result to the input read count of that sample.

    samtools view -cq 20 gbs_reads/HOR_337.bam
    grep HOR_337 fastq_read_counts.txt
  11. Get mapped read counts for all samples.

    for i in gbs_reads/*bam; do
     name=`basename $i | cut -d . -f 1`
     count=`samtools view -cq 20 $i`
     echo $name $count
    done > mapped_reads.txt
  12. Combine the raw read counts and the mapping rates into one table.

    LC_ALL=C sort fastq_read_counts.txt > tmp1 (1)
    LC_ALL=C sort mapped_reads.txt > tmp2 (1)
    
    join tmp1 tmp2 | awk '{print $0,$3/$2*100}' | sort -nk 4 | column -t > mapping_stats.txt (2)
    
    rm -f tmp1 tmp2
    
    column -t mapping_stats.txt  | less -S
    1 To combine two lists with join, both lists need to be sorted on the common ID column.
    2 column is used to align columns.
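The sorted-input requirement of join can be illustrated on two tiny made-up count lists; the numbers are invented, and the awk step mirrors the percentage calculation above:

```shell
# Two hypothetical, already-sorted lists sharing an ID column.
printf 'A 100\nB 200\n' > counts1.txt
printf 'A 80\nB 50\n'  > counts2.txt
# Join on the ID, then compute column 3 as a percentage of column 2.
join counts1.txt counts2.txt | awk '{print $0, $3/$2*100}'
```

If either input were unsorted, join would miss or misorder matches, which is why both real lists are passed through sort first.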

4. Variant calling and filtration

  1. Open a new tmux session and load bcftools.

    tmux (1)
    tmux rename variant_call
    module load bcftools
    1 The variant calling will run for some time, so run it inside tmux.
  2. Get a list of all BAM files.

    ls gbs_reads/*bam | sort > bam_list.txt
  3. Run the variant calling.

    ref='/filer/projekte/gbs_course/data/reference/Hordeum_vulgare_MorexV3_assembly.fasta'
    bamlist='/filer-dg/agruppen/dg6/mascher/gbs_course2024_231222/try_231222/bam_list.txt' (1)
    vcf='bcftools_SNP_calling.vcf' (2)
    
    bcftools mpileup --bam-list $bamlist --skip-indels --fasta-ref $ref --min-MQ 20 --annotate AD | bcftools call --consensus-caller --variants-only --output $vcf (3)
    1 List of pre-computed BAM files.
    2 Output file in variant call format (VCF), described in the official VCF specification.
    3 We ignore insertions and deletions (--skip-indels), use only reads with a mapping quality of at least 20 (--min-MQ 20) and add allelic depth information (--annotate AD) for all genotype calls.
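Before opening a real VCF, the column layout can be previewed with a single made-up record; all values below are invented:

```shell
# A hypothetical VCF data line: CHROM, POS, ID, REF, ALT, QUAL, FILTER,
# INFO, FORMAT, then one genotype column per sample.
printf 'chr1H\t100\t.\tA\tG\t50\t.\t.\tGT:AD\t0/0:5,0\t1/1:0,7\n' |
awk -F '\t' '{print "pos:", $1":"$2, "ref:", $4, "alt:", $5}'
```

With FORMAT GT:AD, a sample entry like 0/0:5,0 means a homozygous-reference call backed by five reads for the reference allele and none for the alternate.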
  4. Filter the variant calls.

    filter='/filer/projekte/gbs_course/scripts/filter_vcf.zsh'
    vcf='/filer-dg/agruppen/dg6/mascher/gbs_course2024_231222/try_231222/bcftools_SNP_calling.vcf' (1)
    fvcf='bcftools_SNP_calling_filtered.vcf'
    
    $filter --vcf $vcf --dphom 2 --dphet 4 --minmaf 0.2 --minpresent 0.9 --minhomp 0.9 > $fvcf (2)
    1 Path to pre-computed VCF file.
    2 We keep homozygous genotype calls if they have at least two supporting reads; heterozygous calls are accepted if they are supported by no fewer than four reads. SNPs with a minor allele frequency below 20 % or less than 90 % present calls or less than 90 % homozygous calls are discarded.
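filter_vcf.zsh is a course-provided script; as a rough illustration of what thresholds such as --minmaf and --minhomp mean, here is a toy awk calculation on invented per-SNP genotype counts (the input layout is made up and is not the script's actual input format):

```shell
# Hypothetical genotype counts per SNP: ID, hom-ref, het, hom-alt, missing.
printf 'snp1 140 10 50 0\nsnp2 40 120 40 0\n' | awk '{
 present = $2 + $3 + $4
 af = ($4 * 2 + $3) / (2 * present)   # alternate allele frequency
 maf = (af < 0.5) ? af : 1 - af       # minor allele frequency
 homp = ($2 + $4) / present           # fraction of homozygous calls
 print $1, maf, homp
}'
```

With the thresholds used above (--minmaf 0.2, --minhomp 0.9), the second toy SNP would be discarded: only 40 % of its calls are homozygous.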
  5. Review the VCF file.

    grep -v '^##' bcftools_SNP_calling_filtered.vcf | column -t | less -S

5. Principal component analysis

  1. Open R.

    module load R/3.5.1
    R
  2. R is a widely used programming language in data science. Many introductory tutorials are available online.

  3. Load the required libraries.

    .libPaths(c("/filer-dg/agruppen/seq_shared/mascher/Rlib/3.5.1", "/opt/Bio/R_LIBS/3.5.1")) (1)
    
    library(data.table) (2)
    library(SeqArray) (3)
    library(SNPRelate) (3)
    library(countrycode) (4)
    
    1 Set the paths where the R libraries are located.
    2 data.table extends R core functionality when handling large tables.
    3 SeqArray and SNPRelate are two R packages to store and analyze SNP matrices.
    4 The countrycode package will be used to map country abbreviations to geographic regions.
  4. Convert the VCF file to the binary GDS (Genomic Data Structure) format used by SeqArray and SNPRelate.

    seqVCF2GDS(vcf.fn='/filer-dg/agruppen/dg6/mascher/gbs_course2024_231222/try_231222/bcftools_SNP_calling_filtered.vcf', out.fn='bcftools_SNP_calling_filtered.gds') (1)
    
    1 This creates the GDS file in the current working directory.
  5. Open the GDS file and get summary statistics.

    seqOpen('bcftools_SNP_calling_filtered.gds') -> gds
    seqSummary(gds)
  6. Run a principal components analysis (PCA) on the data and extract the eigenvectors.

    snpgdsPCA(gds, autosome.only=F) -> pca (1)
    
    data.table(pca$sample.id, pca$eigenvect[, 1:2]) -> ev
    setnames(ev, c("accession", "PC1", "PC2")) (2)
    ev[, accession := sub(".bam$", "", sub(".*/", "", accession))] (3)
    
    1 autosome.only=F is needed because chromosomes are named chr1H, chr2H … instead of 1, 2, …
    2 Set proper column names.
    3 Change the sample names inherited from the VCF file (BAM file names).
  7. Read the passport data for the core200 panel and merge them with the PCA results.

    library(openxlsx) (1)
    data.table(read.xlsx("/filer-dg/agruppen/dg6/mascher/gbs_course2024_231222/data/core200_passport_data.xlsx")) -> pp
    pp[ev, on="accession"] -> ev
    1 read.xlsx is provided by the openxlsx package, which has not been loaded yet.
  8. Plot the first two principal components (PCs) with samples coloured by row type.

    ev[, col := "gray"] (1)
    ev[row_type == "6-rowed", col := "black"] (2)
    ev[row_type == "2-rowed", col := "red"]
    
    pdf("PCA1.pdf") (3)
    ev[, plot(PC1, PC2, col=col, pch=19, xlab="PC1", ylab="PC2")] (4)
    dev.off() (3)
    
    1 Add a color column. The default color is gray.
    2 If the row type is six-rowed, set the color to black.
    3 Open a PDF file as the plot device and close it after the plot function has been called.
    4 pch specifies the plotting symbol. 19 means solid circles.
  9. Repeat with samples colored by annual growth habit.

    ev[, col2 := "gray"]
    ev[annual_growth_habit == "spring", col2 := "black"]
    ev[annual_growth_habit  == "winter", col2 := "red"]
    
    pdf("PCA2.pdf")
    ev[, plot(PC1, PC2, col=col2, pch=19, xlab="PC1", ylab="PC2")]
    dev.off()
    
  10. Map the countries to continents.

    ev[, countrycode(country_of_origin, "iso3c", "continent")] (1)
    ev[country_of_origin %in% c("DDR", "GER"), country_of_origin := "DEU"]
    ev[country_of_origin == "SUN", country_of_origin := "RUS"]
    ev[country_of_origin == "CSK", country_of_origin := "CZE"]
    
    ev[, continent := countrycode(country_of_origin, "iso3c", "continent")]
    ev[is.na(continent)] (2)
    
    1 Some country codes are invalid. Correct these.
    2 Check for missing data.
  11. Color according to country and plot the PCA again.

    ev[, col3 := "gray"]
    ev[continent == "Europe", col3 := "black"]
    ev[continent == "Asia", col3 := "red"]
    ev[continent == "Africa", col3 := "blue"]
    ev[continent == "Americas", col3 := "green"]
    
    pdf("PCA3.pdf")
    ev[, plot(PC1, PC2, col=col3, pch=19, xlab="PC1", ylab="PC2")]
    dev.off()
  12. Add a title and legend. Change the orientation of the y-axis labels.

    pdf("PCA4.pdf")
    ev[, plot(PC1, PC2, col=col3, pch=19, xlab="PC1", ylab="PC2", las=1, main="PCA colored by geography")] (1)
    legend("bottomright", col=c("black", "red", "blue", "green"), pch=19, legend=c("Europe", "Asia", "Africa", "Americas"))
    dev.off()
    1 las=1 means "always horizontal".