1. Open a UNIX shell
-
Connect with PuTTY to the server slurm-24.
-
Obtain a Kerberos ticket to access our filer resources.
kinit
klist (1)
(1) A ticket is valid for seven days.
-
You will see the prompt of the UNIX command-line interpreter, the shell. It is called a "shell" because it forms the outermost layer of the system: it wraps around the operating system core, the "kernel", and passes your commands to it.
-
Check out your home directory.
pwd
ls
2. Set up your analysis directory
-
Let’s have a look at the GBS data.
cd /filer/projekte/gbs_course
cd data
ls
ls -lhS gbs_reads | less (1)
man ls
cd gbs_reads
ls | wc -l
zcat HOR_337.fastq.gz | less -S (2)
cd ..
du -ch gbs_reads
(1) To close less, press q.
(2) The reads are in FASTQ format.
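As noted in (2), FASTQ is a plain-text format in which every read occupies exactly four lines. A quick way to inspect a single record:
zcat gbs_reads/HOR_337.fastq.gz | head -n 4 (1)
(1) The four lines of a record are: a header starting with "@", the base calls, a "+" separator, and the per-base quality scores.
-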
Have a look at the passport data.
less core200_passport_data.tsv
column -t core200_passport_data.tsv | less -S
-
Use WinSCP to copy the XLSX file with the passport data to your local computer.
-
Create your working directory and change into it.
cd /filer/projekte/gbs_course
mkdir -p mascher/gbs (1)
cd mascher/gbs
(1) Change mascher to your name.
-
Create symbolic links to the GBS reads and the phenotype data.
pwd (1)
mkdir gbs_reads
datadir="/filer/projekte/gbs_course/data/"
ln -s $datadir/gbs_reads/*fastq.gz gbs_reads
ls -l gbs_reads | less -S
ls gbs_reads | wc -l
(1) Check that you are in your working directory (/filer/projekte/gbs_course/USER/gbs).
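A symbolic link takes up almost no space of its own; it merely points to the original file. To see where a link points, resolve it with readlink:
readlink -f gbs_reads/HOR_337.fastq.gz (1)
(1) -f prints the fully resolved target path.
-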
Count the reads in one FASTQ file.
zcat gbs_reads/HOR_337.fastq.gz | wc -l > HOR_337.len.txt
cat HOR_337.len.txt
-
We use a for-loop to get the read counts of all samples in one go.
for i in gbs_reads/*fastq.gz; do
  name=`basename $i | cut -d . -f 1`
  count=`zcat $i | wc -l`
  echo $name $count
done > fastq_line_counts.txt
-
Add a column with the read count (as opposed to the line count).
cat fastq_line_counts.txt | awk '{print $1,$2/4}' | sort -nk 2 > fastq_read_counts.txt (1)
(1) We use AWK to divide the line count in the second column by 4, since each read occupies four lines.
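As a sanity check, compare both tables for one sample: the read count must be exactly one quarter of the line count.
grep HOR_337 fastq_line_counts.txt
grep HOR_337 fastq_read_counts.txt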
3. Read mapping
-
We will use the tools BWA and samtools for read mapping and processing of alignment records, respectively.
module load bwa samtools
module list
-
We will first run a toy example with 10,000 reads of the HOR 337 sample.
zcat gbs_reads/HOR_337.fastq.gz | head -n 40000 > HOR_337_10k.fastq
ref='/filer/projekte/gbs_course/data/reference/Hordeum_vulgare_MorexV3_assembly.fasta' (1)
bwa mem $ref HOR_337_10k.fastq > HOR_337_10k.sam (2)
samtools sort HOR_337_10k.sam > HOR_337_10k.bam (2)
samtools view HOR_337_10k.bam | less -S
(1) We use the MorexV3 reference genome sequence.
(2) Alignments are stored in the Sequence Alignment Map format (SAM) or its binary equivalent, BAM.
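For a quick summary of the alignment records, e.g. how many of the 10,000 toy reads were mapped, samtools provides the flagstat subcommand:
samtools flagstat HOR_337_10k.bam
-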
Combine mapping and sorting into one command.
bwa mem $ref HOR_337_10k.fastq | samtools sort > HOR_337_10k.bam
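The pipe avoids writing the intermediate SAM file to disk. Comparing the two files from the toy example shows how much smaller the compressed binary BAM representation is:
ls -lh HOR_337_10k.sam HOR_337_10k.bam
-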
Open a new PuTTY session. Start tmux to keep your commands running in the background.
k5reauth tmux (1)
(1) k5reauth is used to prolong the Kerberos ticket for as long as the tmux session is running.
-
Try out tmux functionality.
tmux ls
tmux rename mapping
tmux ls
tmux detach
tmux ls
tmux attach -t mapping
-
Map all samples.
ref='/filer/projekte/gbs_course/data/reference/Hordeum_vulgare_MorexV3_assembly.fasta'
for i in gbs_reads/*fastq.gz; do
  name=`echo $i | cut -d . -f 1` (1)
  bwa mem -t 12 $ref $i | samtools sort > $name.bam
done 2> bwa.err (2)
(1) Strip the extension: gbs_reads/HOR_337.fastq.gz becomes gbs_reads/HOR_337, so the BAM files are written next to the reads.
(2) To detach the tmux session, press Ctrl-B followed by D. If you forget to start bwa inside a tmux session, there is no way to prevent your job from aborting when you shut down your laptop. Also, without k5reauth, programs cannot access filer resources after a maximum of ten hours.
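While the loop is running inside tmux, you can follow its progress from a second shell, for example:
tail -f bwa.err (1)
ls gbs_reads/*bam | wc -l (2)
(1) Follow the log as bwa appends to it; Ctrl-C stops the watching, not the mapping.
(2) Count how many BAM files have been written so far.
-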
Open a new terminal. Look at your jobs in the table of processes (top).
top -u mascher (1)
(1) Replace mascher with your username.
-
Once the mapping is finished, count the number of BAM files.
ls gbs_reads/*bam | wc -l
-
If you have fewer than 200 files and the mapping didn’t work for some reason, use the pre-computed files.
ln -fst gbs_reads /filer-dg/agruppen/dg6/mascher/gbs_course2024_231222/data/bam_files/*bam
-
Count the number of mapped reads with a minimum mapping quality of 20 for the HOR 337 sample and compare the result to the input read count of that sample.
samtools view -cq 20 gbs_reads/HOR_337.bam
grep HOR_337 fastq_read_counts.txt
-
Get mapped read counts for all samples.
for i in gbs_reads/*bam; do
  name=`basename $i | cut -d . -f 1`
  count=`samtools view -cq 20 $i`
  echo $name $count
done > mapped_reads.txt
-
Combine the raw and mapped read counts into one table and compute the mapping rates.
LC_ALL=C sort fastq_read_counts.txt > tmp1 (1)
LC_ALL=C sort mapped_reads.txt > tmp2 (1)
join tmp1 tmp2 | awk '{print $0,$3/$2*100}' | sort -nk 4 | column -t > mapping_stats.txt (2)
rm -f tmp1 tmp2
column -t mapping_stats.txt | less -S
(1) join requires both input files to be sorted in the same (here: byte-wise) order.
(2) The mapping rate is the mapped read count divided by the raw read count, expressed in percent.
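Because the table is sorted by mapping rate, the extremes are easy to inspect; samples with unusually low rates may point to failed libraries or contamination.
head -n 5 mapping_stats.txt
tail -n 5 mapping_stats.txt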
4. Variant calling and filtration
-
Open a new tmux session and load bcftools.
tmux (1)
tmux rename variant_call
module load bcftools
(1) The variant calling will run for some time, so run it inside tmux.
-
Get a list of all BAM files.
ls gbs_reads/*bam | sort > bam_list.txt
-
Run the variant calling.
ref='/filer/projekte/gbs_course/data/reference/Hordeum_vulgare_MorexV3_assembly.fasta'
bamlist='/filer-dg/agruppen/dg6/mascher/gbs_course2024_231222/try_231222/bam_list.txt' (1)
vcf='bcftools_SNP_calling.vcf' (2)
bcftools mpileup --bam-list $bamlist --skip-indels --fasta-ref $ref --min-MQ 20 --annotate AD | bcftools call --consensus-caller --variants-only --output $vcf (3)
(1) List of pre-computed BAM files.
(2) Output file in variant call format (VCF); see the VCF specification for details.
(3) We ignore insertions and deletions (--skip-indels), use only reads with a mapping quality no smaller than 20 (--min-MQ 20), and add allelic depth information (--annotate AD) for all genotype calls.
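Once the call has finished, bcftools stats summarizes the VCF; its SN lines report, among other things, the total number of records and SNPs:
bcftools stats $vcf | grep '^SN'
-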
Filter the variant calls.
filter='/filer/projekte/gbs_course/scripts/filter_vcf.zsh'
vcf='/filer-dg/agruppen/dg6/mascher/gbs_course2024_231222/try_231222/bcftools_SNP_calling.vcf' (1)
fvcf='bcftools_SNP_calling_filtered.vcf'
$filter --vcf $vcf --dphom 2 --dphet 4 --minmaf 0.2 --minpresent 0.9 --minhomp 0.9 > $fvcf (2)
(1) Path to the pre-computed VCF file.
(2) We keep homozygous genotype calls if they have at least two supporting reads; heterozygous calls are accepted if they are supported by no fewer than four reads. SNPs with a minor allele frequency below 20 %, less than 90 % present calls, or less than 90 % homozygous calls are discarded.
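To see how strict the filters were, compare the number of non-header records (lines not starting with "#") in the raw and the filtered VCF:
grep -vc '^#' $vcf
grep -vc '^#' $fvcf
-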
Review the VCF file.
grep -v '^##' bcftools_SNP_calling_filtered.vcf | column -t | less -S
5. Principal component analysis
-
Open R.
module load R/3.5.1
R
-
R is a widely used programming language in data science. Many tutorials are available online.
-
Load the required libraries.
.libPaths(c("/filer-dg/agruppen/seq_shared/mascher/Rlib/3.5.1", "/opt/Bio/R_LIBS/3.5.1")) (1)
library(data.table) (2)
library(SeqArray) (3)
library(SNPRelate) (3)
library(countrycode) (4)
library(openxlsx) (5)
(1) Set the paths where the R libraries are located.
(2) data.table extends R core functionality for handling large tables.
(3) SeqArray and SNPRelate are two R packages to store and analyze SNP matrices.
(4) The countrycode package will be used to map country abbreviations to geographic regions.
(5) openxlsx provides read.xlsx(), used below to import the XLSX passport data.
-
Convert the VCF file to the binary GDS (Genomic Data Structure) format used by SeqArray and SNPRelate.
seqVCF2GDS(vcf.fn='/filer-dg/agruppen/dg6/mascher/gbs_course2024_231222/try_231222/bcftools_SNP_calling_filtered.vcf', out.fn='bcftools_SNP_calling_filtered.gds') (1)
(1) This creates the GDS file in the current working directory.
-
Open the GDS file and get summary statistics.
seqOpen('bcftools_SNP_calling_filtered.gds') -> gds
seqSummary(gds)
-
Run a principal components analysis (PCA) on the data and extract the eigenvectors.
snpgdsPCA(gds, autosome.only=F) -> pca (1)
data.table(pca$sample.id, pca$eigenvect[, 1:2]) -> ev
setnames(ev, c("accession", "PC1", "PC2")) (2)
ev[, accession := sub(".bam$", "", sub(".*/", "", accession))] (3)
(1) autosome.only=F is needed because chromosomes are named chr1H, chr2H, … instead of 1, 2, …
(2) Set proper column names.
(3) Change the sample names inherited from the VCF file (BAM file names).
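To see how much of the genetic variance the leading components capture, inspect the varprop element of the snpgdsPCA result:
head(round(pca$varprop * 100, 2)) (1)
(1) Percent of variance explained by the first principal components.
-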
Read the passport data for the core200 panel and merge them with the PCA results.
data.table(read.xlsx("/filer-dg/agruppen/dg6/mascher/gbs_course2024_231222/data/core200_passport_data.xlsx")) -> pp
pp[ev, on="accession"] -> ev
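The join keeps all rows of the PCA table; accessions without a matching passport record would end up with NA in the merged columns. A quick check, using row_type (one of the passport columns) as an indicator:
ev[is.na(row_type)] (1)
(1) An empty result means every accession was found in the passport table.
-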
Plot the first two principal components (PCs) with samples coloured by row type.
ev[, col := "gray"] (1)
ev[row_type == "6-rowed", col := "black"] (2)
ev[row_type == "2-rowed", col := "red"]
pdf("PCA1.pdf") (3)
ev[, plot(PC1, PC2, col=col, pch=19, xlab="PC1", ylab="PC2")] (4)
dev.off() (3)
(1) Add a color column. The default color is gray.
(2) If the row type is six-rowed, set the color to black.
(3) Open a PDF file as the plot device and close it after the plot function has been called.
(4) pch specifies the plotting symbol; 19 means solid circles.
-
Repeat with samples colored by annual growth habit.
ev[, col2 := "gray"]
ev[annual_growth_habit == "spring", col2 := "black"]
ev[annual_growth_habit == "winter", col2 := "red"]
pdf("PCA2.pdf")
ev[, plot(PC1, PC2, col=col2, pch=19, xlab="PC1", ylab="PC2")]
dev.off()
-
Map the countries to continents.
ev[, countrycode(country_of_origin, "iso3c", "continent")] (1)
ev[country_of_origin %in% c("DDR", "GER"), country_of_origin := "DEU"]
ev[country_of_origin == "SUN", country_of_origin := "RUS"]
ev[country_of_origin == "CSK", country_of_origin := "CZE"]
ev[, continent := countrycode(country_of_origin, "iso3c", "continent")]
ev[is.na(continent)] (2)
(1) Some country codes are invalid. Correct these.
(2) Check for missing data.
-
Color according to continent and plot the PCA again.
ev[, col3 := "gray"]
ev[continent == "Europe", col3 := "black"]
ev[continent == "Asia", col3 := "red"]
ev[continent == "Africa", col3 := "blue"]
ev[continent == "Americas", col3 := "green"]
pdf("PCA3.pdf")
ev[, plot(PC1, PC2, col=col3, pch=19, xlab="PC1", ylab="PC2")]
dev.off()
-
Add a title and legend. Change the orientation of the y-axis labels.
pdf("PCA4.pdf") ev[, plot(PC1, PC2, col=col3, pch=19, xlab="PC1", ylab="PC2", las=1, main="PCA colored by geography")] (1) legend("bottomright", col=c("black", "red", "blue", "green"), pch=19, legend=c("Europe", "Asia", "Africa", "Americas")) dev.off()1 las=1means "always horizontal".