Martin Mascher, IPK Gatersleben, November 2025 mascher@ipk-gatersleben.de
1. Comparing two rice genome assemblies
1.1. Preparing the data
-
Enter the folder with the data for this course.
cd /home/iamz/materials_Mascher -
Have a look at the IR64 assembly. It is described in this paper. Have a look at the file.
ls IR64.fasta less IR64.fasta -
Have a look at the IR64 assembly. It is described in this paper. Have a look at the file.
ls Nipponbare.fasta less Nipponbare.fasta
1.2. Chromosome-wise alignment
-
Create the working directory and change into it.
mkdir -p ~/rice_assemblies cd ~/rice_assemblies -
Create symbolic links to the FASTA files of the assemblies in the downloads folder.
ln -s /home/iamz/materials_Mascher/IR64.fasta ln -s /home/iamz/materials_Mascher/Nipponbare.fasta -
Create index files that record the chromosome lengths in each assembly.
samtools faidx IR64.fasta samtools faidx Nipponbare.fasta -
Align the two assemblies with minimap2.
minimap2 -I500M -x asm5 Nipponbare.fasta IR64.fasta > Nipponbare_IR64.paf 2> Nipponbare_IR64.paf.err (1) (2) (3)1 The parameter preset -x asm5is suited to align assemblies from the same species.2 The value of -Iis reduced to 500 MB from the default 4 GB to reduce the memory footprint.3 >and2>are used to direct the standard output (stdout) and standard error (stderr). -
Send the
minimap2process to the background with Ctrl-Z. Resume it withbg. Check withtopthat it’s running. -
In the meantime, take a look the definition of the PAF format at the end of the minimap2 manpage.
man minimap2 -
Once the
minimap2is done, have a look at output.cat Nipponbare_IR64.paf head Nipponbare_IR64.paf less Nipponbare_IR64.paf less -S Nipponbare_IR64.paf column -t Nipponbare_IR64.paf | less -S
-
Load the PAF file into the Dotplotly app.
1.3. Comparing a single gene in two assemblies
-
Create a symbolic link to the FASTA file with the sequence of the PLASTOCHRON1 gene (PLASTOCHRON1). Take a look at the sequence.
ln -s /home/iamz/materials_Mascher/pla1.fasta less pla1.fasta -
Create BLAST database for the two genome assemblies.
makeblastdb -dbtype nucl -in Nipponbare.fasta makeblastdb -dbtype nucl -in IR64.fasta -
Run the BLAST alignment and output to tabular format.
blastn -db Nipponbare.fasta -query pla1.fasta -outfmt 6 > pla1_Nipponbare.txt (1) blastn -db IR64.fasta -query pla1.fasta -outfmt 6 > pla1_IR64.txt1 -outfmt 6means tabular output. -
Extract the aligned sequence of the first exon with
samtools faidx[man page].samtools faidx IR64.fasta chr10:10413299-10414334 > pla1_IR64.fasta samtools faidx Nipponbare.fasta Chr10:13659508-13660543 > pla1_Nipponbare.fasta -
Extract the sequence of all BLAST hits.
cat pla1_IR64.txt | awk '$9 < $10 {print $2":"$9"-"$10} $10 < $9 {print $2":"$10"-"$9}' | sort | xargs samtools faidx IR64.fasta > pla1_IR64_all_hits.fasta -
Align the two sequence using MAFFT and find SNPs between them with SNP-sites:
cat pla1_IR64.fasta pla1_Nipponbare.fasta | mafft - > pla1_mafft.fasta snp-sites -v pla1_mafft.fasta snp-sites pla1_mafft.fasta -
Now we use GMAP for spliced alignment to extract and compare alignments of of the full transcript, not only the first exon.
-
Build the GMAP indices for both genomes.
gmap_build Nipponbare.fasta -D . -d Nipponbare_db > Nipponbare_build.out 2> Nipponbare_build.err & (1) gmap_build IR64.fasta -D . -d IR64_db > IR64_build.out 2> IR64_build.err &1 The &at the end of the line sends the process immediately to the background. -
Align the PLA1 sequence to both genomes.
gmap -f 2 -D . -d Nipponbare_db pla1.fasta > pla1_Nipponbare_gmap.gff (1) gmap -f 2 -D . -d IR64_db pla1.fasta > pla1_IR64_gmap.gff1 -f 2generate GFF output. A description of the GFF format can be found here. -
Extract the matched sequence with gffread (part of Cufflinks).
cat pla1_Nipponbare_gmap.gff | gffread -g Nipponbare.fasta -w pla1_Nipponbare_gmap_mRNA.fasta cat pla1_IR64_gmap.gff | gffread -g IR64.fasta -w pla1_IR64_gmap_mRNA.fasta -
Run the multiple sequence aligment and SNP extraction with the full transcript sequences.
cat pla1_Nipponbare_gmap_mRNA.fasta pla1_IR64_gmap_mRNA.fasta | mafft - > pla1_gmap_mafft.fasta snp-sites -v pla1_gmap_mafft.fasta
2. Genebank genomics portal
-
Visit BRIDGE, the barley genebank genomics portal.
3. Analysis of RNA-seq of a wheat mutant and its wildtype
3.1. Creating a kallisto index
-
Create the project directory and change to it.
mkdir ~/wheat_rnaseq cd ~/wheat_rnaseq -
Create symbolic links to the annotation files and decompress them.
ln -s /home/iamz/materials_Mascher/iwgsc_refseqv1.0_HighConf_CDS_2017Mar13.fa.zip ln -s /home/iamz/materials_Mascher/Scripting/iwgsc_refseqv1.0_FunctionalAnnotation_v1.zip -
Decompress the zip archives.
unzip iwgsc_refseqv1.0_HighConf_CDS_2017Mar13.fa.zip unzip iwgsc_refseqv1.0_FunctionalAnnotation_v1.zip -
Count the number of sequences in the file.
grep -c '^>' iwgsc_refseqv1.0_HighConf_CDS_2017Mar13.fa -
Create an index for alignment with kallisto.
kallisto index --index wheat_index iwgsc_refseqv1.0_HighConf_CDS_2017Mar13.fa > kallisto_index.out 2> kallisto_index.err &
3.2. Quantifying transcript abundance
-
Create symbolic links to read files.
ln -s /home/iamz/materials_Mascher/GA*gz . ln -s /home/iamz/materials_Mascher/WA*gz . -
Run the quantification for a single sample.
kallisto quant --index wheat_index GA_0908-N_1_R1.fastq.gz GA_0908-N_1_R2.fastq.gz --output GA_0908-N_1_kallisto > GA_0908-N_1_kallisto.out 2> GA_0908-N_1_kallisto.err & -
Run the quantification for all samples using a loop. This will take about three hours.
find | grep R1 | cut -d _ -f -3 | sort | while read i; do kallisto quant --index wheat_index ${i}_R1.fastq.gz ${i}_R2.fastq.gz --output ${i}_kallisto > ${i}_kallisto.out 2> ${i}_kallisto.err done -
Check that there are results for all samples.
head GA_0908-N_1_kallisto/abundance.tsv | column -t find | grep abundance.tsv | wc -l find | grep abundance.tsv | xargs wc -l grep -c '^>' iwgsc_refseqv1.0_HighConf_CDS_2017Mar13.fa -
If there are issues, download the pre-computed kallisto results. Delete the current output directories. Unzip the archive.
wget 'https://files.ipk-gatersleben.de/file/RBWEFkcdORboj1QM/zrsvWx9hzVxeEWN2/kallisto.zip' rm -rf *_kallisto unzip kallisto.zip -
Further analyses will be run 3D RNA-seq (paper). We will now create the necessary input file.
-
Create tables with the metadata.
grep '^>' iwgsc_refseqv1.0_HighConf_CDS_2017Mar13.fa | tr -d '>' | cut -d ' ' -f 1 > transcripts.txt cut -d . -f 1 transcripts.txt > genes.txt paste -d , transcripts.txt genes.txt > transcript_genes.csv (1) find -type d | grep kallisto | cut -d / -f 2 | sort > kallisto.txt (2) cat kallisto.txt | tr _- '\t' | awk '{print $1","$3","$4}' | paste -d , - kallisto.txt | awk 'BEGIN{print "stage,allele,rep,folder"} {print}' > sample_info.csv (3)1 Assignment of transcript isoforms to genes. 2 List of Kallisto output directories. 3 Table with experimental design. -
Create a zip archiv containing the kallisto output folders. Not needed if you use the download "kallisto.zip" file.
zip -r kallisto.zip *_kallisto (1)1 -rinstructzipto travel the directory structure recursively, that is, to include all files in the folders.
-
Open the 3D RNA-seq app.
-
Upload the datasets in the Data generation tab.
-
Follow the 3D DNA-seq steps. Click on the question mark symbol to get more guidance.
5. Barley epigenome browser
-
Visit the Barley epigenome browser.
6. Cereal pangenomes at grain genes
-
Cereal pangenomes are hosted at GrainGenes.