Martin Mascher, IPK Gatersleben, November 2025 mascher@ipk-gatersleben.de

1. Comparing two rice genome assemblies

1.1. Preparing the data

  1. Enter the folder with the data for this course.

    cd /home/iamz/materials_Mascher
  2. Have a look at the IR64 assembly. It is described in this paper. Have a look at the file.

    ls IR64.fasta
    less IR64.fasta
  3. Have a look at the IR64 assembly. It is described in this paper. Have a look at the file.

    ls Nipponbare.fasta
    less Nipponbare.fasta

1.2. Chromosome-wise alignment

  1. Create the working directory and change into it.

    mkdir -p ~/rice_assemblies
    cd ~/rice_assemblies
  2. Create symbolic links to the FASTA files of the assemblies in the downloads folder.

    ln -s /home/iamz/materials_Mascher/IR64.fasta
    ln -s /home/iamz/materials_Mascher/Nipponbare.fasta
  3. Create index files that record the chromosome lengths in each assembly.

    samtools faidx IR64.fasta
    samtools faidx Nipponbare.fasta
  4. Align the two assemblies with minimap2.

    minimap2 -I500M -x asm5 Nipponbare.fasta IR64.fasta > Nipponbare_IR64.paf 2> Nipponbare_IR64.paf.err  (1) (2) (3)
    1 The parameter preset -x asm5 is suited to align assemblies from the same species.
    2 The value of -I is reduced to 500 MB from the default 4 GB to reduce the memory footprint.
    3 > and 2> are used to direct the standard output (stdout) and standard error (stderr).
  5. Send the minimap2 process to the background with Ctrl-Z. Resume it with bg. Check with top that it’s running.

  6. In the meantime, take a look the definition of the PAF format at the end of the minimap2 manpage.

    man minimap2
  7. Once the minimap2 is done, have a look at output.

    cat Nipponbare_IR64.paf
    head Nipponbare_IR64.paf
    less Nipponbare_IR64.paf
    less -S Nipponbare_IR64.paf
    column -t Nipponbare_IR64.paf | less -S
  1. Load the PAF file into the Dotplotly app.

1.3. Comparing a single gene in two assemblies

  1. Create a symbolic link to the FASTA file with the sequence of the PLASTOCHRON1 gene (PLASTOCHRON1). Take a look at the sequence.

    ln -s /home/iamz/materials_Mascher/pla1.fasta
    less pla1.fasta
  2. Create BLAST database for the two genome assemblies.

    makeblastdb -dbtype nucl -in Nipponbare.fasta
    makeblastdb -dbtype nucl -in IR64.fasta
  3. Run the BLAST alignment and output to tabular format.

    blastn -db Nipponbare.fasta -query pla1.fasta -outfmt 6 > pla1_Nipponbare.txt (1)
    blastn -db IR64.fasta -query pla1.fasta -outfmt 6 > pla1_IR64.txt
    1 -outfmt 6 means tabular output.
  4. Extract the aligned sequence of the first exon with samtools faidx [man page].

    samtools faidx IR64.fasta chr10:10413299-10414334 > pla1_IR64.fasta
    samtools faidx Nipponbare.fasta Chr10:13659508-13660543 > pla1_Nipponbare.fasta
  5. Extract the sequence of all BLAST hits.

    cat pla1_IR64.txt | awk '$9 < $10 {print $2":"$9"-"$10} $10 < $9 {print $2":"$10"-"$9}' | sort | xargs samtools faidx  IR64.fasta  > pla1_IR64_all_hits.fasta
  6. Align the two sequence using MAFFT and find SNPs between them with SNP-sites:

    cat pla1_IR64.fasta pla1_Nipponbare.fasta | mafft - > pla1_mafft.fasta
    snp-sites -v pla1_mafft.fasta
    snp-sites pla1_mafft.fasta
  7. Now we use GMAP for spliced alignment to extract and compare alignments of of the full transcript, not only the first exon.

  8. Build the GMAP indices for both genomes.

    gmap_build Nipponbare.fasta -D . -d Nipponbare_db > Nipponbare_build.out 2> Nipponbare_build.err & (1)
    gmap_build IR64.fasta -D . -d IR64_db > IR64_build.out 2> IR64_build.err &
    1 The & at the end of the line sends the process immediately to the background.
  9. Align the PLA1 sequence to both genomes.

    gmap -f 2 -D . -d Nipponbare_db pla1.fasta > pla1_Nipponbare_gmap.gff (1)
    gmap -f 2 -D . -d IR64_db pla1.fasta > pla1_IR64_gmap.gff
    1 -f 2 generate GFF output. A description of the GFF format can be found here.
  10. Extract the matched sequence with gffread (part of Cufflinks).

    cat pla1_Nipponbare_gmap.gff | gffread -g Nipponbare.fasta -w pla1_Nipponbare_gmap_mRNA.fasta
    cat pla1_IR64_gmap.gff | gffread -g IR64.fasta -w pla1_IR64_gmap_mRNA.fasta
  11. Run the multiple sequence aligment and SNP extraction with the full transcript sequences.

    cat pla1_Nipponbare_gmap_mRNA.fasta pla1_IR64_gmap_mRNA.fasta | mafft - > pla1_gmap_mafft.fasta
    snp-sites -v pla1_gmap_mafft.fasta

2. Genebank genomics portal

  1. Visit BRIDGE, the barley genebank genomics portal.

3. Analysis of RNA-seq of a wheat mutant and its wildtype

3.1. Creating a kallisto index

  1. Create the project directory and change to it.

    mkdir ~/wheat_rnaseq
    cd ~/wheat_rnaseq
  2. Create symbolic links to the annotation files and decompress them.

    ln -s /home/iamz/materials_Mascher/iwgsc_refseqv1.0_HighConf_CDS_2017Mar13.fa.zip
    ln -s /home/iamz/materials_Mascher/Scripting/iwgsc_refseqv1.0_FunctionalAnnotation_v1.zip
  3. Decompress the zip archives.

    unzip iwgsc_refseqv1.0_HighConf_CDS_2017Mar13.fa.zip
    unzip iwgsc_refseqv1.0_FunctionalAnnotation_v1.zip
  4. Count the number of sequences in the file.

    grep -c '^>' iwgsc_refseqv1.0_HighConf_CDS_2017Mar13.fa
  5. Create an index for alignment with kallisto.

    kallisto index --index wheat_index iwgsc_refseqv1.0_HighConf_CDS_2017Mar13.fa > kallisto_index.out 2>  kallisto_index.err &

3.2. Quantifying transcript abundance

  1. Create symbolic links to read files.

    ln -s /home/iamz/materials_Mascher/GA*gz .
    ln -s /home/iamz/materials_Mascher/WA*gz .
  2. Run the quantification for a single sample.

    kallisto quant --index wheat_index GA_0908-N_1_R1.fastq.gz GA_0908-N_1_R2.fastq.gz --output GA_0908-N_1_kallisto > GA_0908-N_1_kallisto.out 2> GA_0908-N_1_kallisto.err &
  3. Run the quantification for all samples using a loop. This will take about three hours.

     find | grep R1 | cut -d _ -f -3 | sort | while read i; do
      kallisto quant --index wheat_index  ${i}_R1.fastq.gz ${i}_R2.fastq.gz --output ${i}_kallisto > ${i}_kallisto.out 2>  ${i}_kallisto.err
     done
  4. Check that there are results for all samples.

    head GA_0908-N_1_kallisto/abundance.tsv | column -t
    find | grep abundance.tsv | wc  -l
    find | grep abundance.tsv | xargs wc -l
    grep -c '^>' iwgsc_refseqv1.0_HighConf_CDS_2017Mar13.fa
  5. If there are issues, download the pre-computed kallisto results. Delete the current output directories. Unzip the archive.

    wget 'https://files.ipk-gatersleben.de/file/RBWEFkcdORboj1QM/zrsvWx9hzVxeEWN2/kallisto.zip'
    rm -rf *_kallisto
    unzip kallisto.zip
  6. Further analyses will be run 3D RNA-seq (paper). We will now create the necessary input file.

  7. Create tables with the metadata.

    grep '^>' iwgsc_refseqv1.0_HighConf_CDS_2017Mar13.fa | tr -d '>' | cut -d ' ' -f 1 > transcripts.txt
    cut -d . -f 1 transcripts.txt > genes.txt
    paste -d , transcripts.txt genes.txt  > transcript_genes.csv (1)
    
    find -type d | grep kallisto  | cut -d / -f 2 | sort > kallisto.txt (2)
    
    cat kallisto.txt  | tr _- '\t' | awk '{print $1","$3","$4}' | paste -d , - kallisto.txt | awk 'BEGIN{print "stage,allele,rep,folder"} {print}' > sample_info.csv (3)
    1 Assignment of transcript isoforms to genes.
    2 List of Kallisto output directories.
    3 Table with experimental design.
  8. Create a zip archiv containing the kallisto output folders. Not needed if you use the download "kallisto.zip" file.

    zip -r kallisto.zip *_kallisto (1)
    1 -r instruct zip to travel the directory structure recursively, that is, to include all files in the folders.
  1. Open the 3D RNA-seq app.

  2. Upload the datasets in the Data generation tab.

  3. Follow the 3D DNA-seq steps. Click on the question mark symbol to get more guidance.

4. Transcriptome atlases

  1. Visit BAR, the Bio-Analytic Resource for Plant Biology (paper).

5. Barley epigenome browser

6. Cereal pangenomes at grain genes

  1. Cereal pangenomes are hosted at GrainGenes.