Practical course in pan-genomics, transcriptomics and functional genomics

Martin Mascher, IPK Gatersleben, November 2025 mascher@ipk-gatersleben.de

1. Comparing two rice genome assemblies

1.1. Preparing the data

Enter the folder with the data for this course.
```
cd /home/iamz/materials_Mascher
```
Have a look at the IR64 assembly. It is described in this paper. Have a look at the file.
```
ls IR64.fasta
less IR64.fasta
```
Have a look at the IR64 assembly. It is described in this paper. Have a look at the file.
```
ls Nipponbare.fasta
less Nipponbare.fasta
```

1.2. Chromosome-wise alignment

Create the working directory and change into it.

mkdir -p ~/rice_assemblies
cd ~/rice_assemblies

Create symbolic links to the FASTA files of the assemblies in the downloads folder.

ln -s /home/iamz/materials_Mascher/IR64.fasta
ln -s /home/iamz/materials_Mascher/Nipponbare.fasta

Create index files that record the chromosome lengths in each assembly.
```
samtools faidx IR64.fasta
samtools faidx Nipponbare.fasta
```

Align the two assemblies with minimap2.

minimap2 -I500M -x asm5 Nipponbare.fasta IR64.fasta > Nipponbare_IR64.paf 2> Nipponbare_IR64.paf.err  (1) (2) (3)

1	The parameter preset `-x asm5` is suited to align assemblies from the same species.
2	The value of `-I` is reduced to 500 MB from the default 4 GB to reduce the memory footprint.
3	`>` and `2>` are used to direct the standard output (stdout) and standard error (stderr).

Send the minimap2 process to the background with Ctrl-Z. Resume it with bg. Check with top that it’s running.
In the meantime, take a look the definition of the PAF format at the end of the minimap2 manpage.
```
man minimap2
```

Once the minimap2 is done, have a look at output.

cat Nipponbare_IR64.paf
head Nipponbare_IR64.paf
less Nipponbare_IR64.paf
less -S Nipponbare_IR64.paf
column -t Nipponbare_IR64.paf | less -S

Load the PAF file into the Dotplotly app.

1.3. Comparing a single gene in two assemblies

Create a symbolic link to the FASTA file with the sequence of the PLASTOCHRON1 gene (PLASTOCHRON1). Take a look at the sequence.
```
ln -s /home/iamz/materials_Mascher/pla1.fasta
less pla1.fasta
```

Create BLAST database for the two genome assemblies.

makeblastdb -dbtype nucl -in Nipponbare.fasta
makeblastdb -dbtype nucl -in IR64.fasta

Run the BLAST alignment and output to tabular format.

blastn -db Nipponbare.fasta -query pla1.fasta -outfmt 6 > pla1_Nipponbare.txt (1)
blastn -db IR64.fasta -query pla1.fasta -outfmt 6 > pla1_IR64.txt

1	`-outfmt 6` means tabular output.

Extract the aligned sequence of the first exon with samtools faidx [man page].

samtools faidx IR64.fasta chr10:10413299-10414334 > pla1_IR64.fasta
samtools faidx Nipponbare.fasta Chr10:13659508-13660543 > pla1_Nipponbare.fasta

Extract the sequence of all BLAST hits.

cat pla1_IR64.txt | awk '$9 < $10 {print $2":"$9"-"$10} $10 < $9 {print $2":"$10"-"$9}' | sort | xargs samtools faidx  IR64.fasta  > pla1_IR64_all_hits.fasta

Align the two sequence using MAFFT and find SNPs between them with SNP-sites:

cat pla1_IR64.fasta pla1_Nipponbare.fasta | mafft - > pla1_mafft.fasta
snp-sites -v pla1_mafft.fasta
snp-sites pla1_mafft.fasta

Now we use GMAP for spliced alignment to extract and compare alignments of of the full transcript, not only the first exon.

Build the GMAP indices for both genomes.

gmap_build Nipponbare.fasta -D . -d Nipponbare_db > Nipponbare_build.out 2> Nipponbare_build.err & (1)
gmap_build IR64.fasta -D . -d IR64_db > IR64_build.out 2> IR64_build.err &

1	The `&` at the end of the line sends the process immediately to the background.

Align the PLA1 sequence to both genomes.

gmap -f 2 -D . -d Nipponbare_db pla1.fasta > pla1_Nipponbare_gmap.gff (1)
gmap -f 2 -D . -d IR64_db pla1.fasta > pla1_IR64_gmap.gff

1	`-f 2` generate GFF output. A description of the GFF format can be found here.

Extract the matched sequence with gffread (part of Cufflinks).

cat pla1_Nipponbare_gmap.gff | gffread -g Nipponbare.fasta -w pla1_Nipponbare_gmap_mRNA.fasta
cat pla1_IR64_gmap.gff | gffread -g IR64.fasta -w pla1_IR64_gmap_mRNA.fasta

Run the multiple sequence aligment and SNP extraction with the full transcript sequences.

cat pla1_Nipponbare_gmap_mRNA.fasta pla1_IR64_gmap_mRNA.fasta | mafft - > pla1_gmap_mafft.fasta
snp-sites -v pla1_gmap_mafft.fasta

2. Genebank genomics portal

Visit BRIDGE, the barley genebank genomics portal.

3. Analysis of RNA-seq of a wheat mutant and its wildtype

3.1. Creating a kallisto index

Create the project directory and change to it.
```
mkdir ~/wheat_rnaseq
cd ~/wheat_rnaseq
```

Create symbolic links to the annotation files and decompress them.

ln -s /home/iamz/materials_Mascher/iwgsc_refseqv1.0_HighConf_CDS_2017Mar13.fa.zip
ln -s /home/iamz/materials_Mascher/Scripting/iwgsc_refseqv1.0_FunctionalAnnotation_v1.zip

Decompress the zip archives.

unzip iwgsc_refseqv1.0_HighConf_CDS_2017Mar13.fa.zip
unzip iwgsc_refseqv1.0_FunctionalAnnotation_v1.zip

Count the number of sequences in the file.

grep -c '^>' iwgsc_refseqv1.0_HighConf_CDS_2017Mar13.fa

Create an index for alignment with kallisto.

kallisto index --index wheat_index iwgsc_refseqv1.0_HighConf_CDS_2017Mar13.fa > kallisto_index.out 2>  kallisto_index.err &

3.2. Quantifying transcript abundance

Create symbolic links to read files.

ln -s /home/iamz/materials_Mascher/GA*gz .
ln -s /home/iamz/materials_Mascher/WA*gz .

Run the quantification for a single sample.

kallisto quant --index wheat_index GA_0908-N_1_R1.fastq.gz GA_0908-N_1_R2.fastq.gz --output GA_0908-N_1_kallisto > GA_0908-N_1_kallisto.out 2> GA_0908-N_1_kallisto.err &

Run the quantification for all samples using a loop. This will take about three hours.

 find | grep R1 | cut -d _ -f -3 | sort | while read i; do
  kallisto quant --index wheat_index  ${i}_R1.fastq.gz ${i}_R2.fastq.gz --output ${i}_kallisto > ${i}_kallisto.out 2>  ${i}_kallisto.err
 done

Check that there are results for all samples.

head GA_0908-N_1_kallisto/abundance.tsv | column -t
find | grep abundance.tsv | wc  -l
find | grep abundance.tsv | xargs wc -l
grep -c '^>' iwgsc_refseqv1.0_HighConf_CDS_2017Mar13.fa

If there are issues, download the pre-computed kallisto results. Delete the current output directories. Unzip the archive.

wget 'https://files.ipk-gatersleben.de/file/RBWEFkcdORboj1QM/zrsvWx9hzVxeEWN2/kallisto.zip'
rm -rf *_kallisto
unzip kallisto.zip

Further analyses will be run 3D RNA-seq (paper). We will now create the necessary input file.

Create tables with the metadata.

grep '^>' iwgsc_refseqv1.0_HighConf_CDS_2017Mar13.fa | tr -d '>' | cut -d ' ' -f 1 > transcripts.txt
cut -d . -f 1 transcripts.txt > genes.txt
paste -d , transcripts.txt genes.txt  > transcript_genes.csv (1)

find -type d | grep kallisto  | cut -d / -f 2 | sort > kallisto.txt (2)

cat kallisto.txt  | tr _- '\t' | awk '{print $1","$3","$4}' | paste -d , - kallisto.txt | awk 'BEGIN{print "stage,allele,rep,folder"} {print}' > sample_info.csv (3)

1	Assignment of transcript isoforms to genes.
2	List of Kallisto output directories.
3	Table with experimental design.

Create a zip archiv containing the kallisto output folders. Not needed if you use the download "kallisto.zip" file.
```
zip -r kallisto.zip *_kallisto (1)
```
1 -r instruct zip to travel the directory structure recursively, that is, to include all files in the folders.

Open the 3D RNA-seq app.
Upload the datasets in the Data generation tab.
Follow the 3D DNA-seq steps. Click on the question mark symbol to get more guidance.

4. Transcriptome atlases

Visit BAR, the Bio-Analytic Resource for Plant Biology (paper).

5. Barley epigenome browser

Visit the Barley epigenome browser.

6. Cereal pangenomes at grain genes

Cereal pangenomes are hosted at GrainGenes.