Cryptococcus siRNAs¶

Cryptococcus is one of the few fungi that can become virulent, especially in immunocompromised individuals. Because this fungus has the capacity to grow at temperatures of around 37^oC it has been proposed that, apart from a developed immune system, sustaining a high body temperature prevents fungal invasion [Casadevall-2020].

One of the driving forces to develop Coalispr was to create an alternative tool for counting RNA-Seq data obtained for siRNA libraries. We had generated a large number (>100) of such libraries for Cryptococcus deneoformans JEC21 (see About). HTSEQ, the commonly used program to obtain read-counts, required the assembly of an annotation file for siRNA-targets to direct the counting. Small non-coding RNAs have not been annotated, so this had to be done from scratch. A GTF was prepared by manually scanning bedgraph traces in a genome browser like IGB and copying genome coordinates over to a text file and name these regions. From this file a GTF was constructed with a feature useful for counting. Apart from being laborious, this method is error-prone and not very scalable to other organisms or when an update of the genome is published. The counting was also very time-consuming at the time.

This tutorial is based around published data for the related fungus, C. neoformans H99 [Dumesic-2013], [Burke-2019]. As shown below, it can be demonstrated that exons in transcripts, rather than introns [Dumesic-2013], are the target of the RNAi machinery. In C. deneoformans JEC21, siRNAs also align to exons and for both fungi siRNAs crossing splice junctions are identified after mapping with STAR, e.g. at locus CNAG_03231. These findings demonstrate that RNAi acts on spliced transcripts, thus downstream from (instead of parallel to) splicing.

**Figure 1.** C. *neoformans* siRNAs target exons, not introns¶

Left: Comparison of Fig. 1E from [Dumesic-2013] to data from [Dumesic-2013] and [Burke-2019] aligned with STAR as done in this tutorial and displayed in the IGB genome browser. Bars represent annotations in GTFs from [Janbon-2014] and [Wallace-2020].

Right: Chromosme 8 context of CNAG_03231 and RNA-Seq by [Wallace-2020] for wild type (H99) and strains with an *AGO1* deletion (Δago1; Top). Skipped introns have canonical splice site sequences (*GU..AG*; Bottom [CNAG_03231]).

Compared to other eukaryotes, and possibly related to their ability to grow at 37^oC, the GC-content in C. (de)neoformans is very high (~48%). Almost all transcripts carry introns (~98%), often about 5-6 per gene, which appear essential for stable expression [Goebbels-2013], [Janbon-2018]. These unique characteristics [1] make it difficult to predict genes and splicing patterns, which we have encountered when designing tagging strategies. This might also be the reason for the inaccurate model Dumesic et al. put forward in their seminal paper of 2013 [Dumesic-2013]. They proposed that in C. neoformans H99 a pre-mRNA was either spliced or redirected to the RNAi machinery, which was based on the observation that many siRNAs mapped to introns. The annotation file they had used must have been a premature version, as almost all siRNAs actually align to exon sequences described in GTFs published since, including those [2] for the original genomic sequence of this fungus [Janbon-2014]. For example, one of their signature genes, CNAG_07721, has been re-annotated, with the chromosome 8 locus [3] renamed to CNAG_03231.

Bedgraphs, count data accumulated with HTSEQ and hand-made annotations for siRNA loci, supported development of Coalispr and the creation of images for the ‘How-to guides’. This tutorial will show how these images were made. The data will be from publications [Burke-2019] and [Dumesic-2013] which analyzed various genes for their contribution to siRNA accumulation after their disruption with NAT or HYG ([Burke-2019]).

Disrupted Factor	Reference	Gene::Marker	Chromosome	Strand	Coordinates	Short
Ago1	[Burke-2019]	CNAG_04609::NATR	`10`	plus	`89445-899067`	`a1`
Rdp1	[Burke-2019]	CNAG_03466::NATR	`8`	minus	`1025052-1030297`	`r1`
Rde1	[Burke-2019]	CNAG_01848::NATR	`11`	minus	`1077327-1082277`	`re1`
Rde2	[Burke-2019]	CNAG_04954::NATR	`4`	plus	`68624-73005`	`re2`
Rde3	[Burke-2019]	CNAG_06643::NATR	`7`	plus	`370190-371978`	`re3`
Rde4	[Burke-2019]	CNAG_01157::NATR	`5`	minus	`1052269-1054443`	`re4`
Rde5	[Burke-2019]	CNAG_04791::NATR	``10`	plus	`391068-392582`	`re5`
Rrp6	[Burke-2019]	CNAG_02773::NATR	`3`	minus	`799012-802788`	`r6`
Clr4	[Burke-2019]	CNAG_05404_Cterm∆::HYG	`14`	plus	`218693-225267`	`c4`
Ezh2	[Burke-2019]	CNAG_07553::NATR	`3`	minus	`1186967-1190031`	`e2`

The organization of reference files will be different from that in the yeast and mouse tutorials. Most downloaded and processed references used for the running of Coalispr are placed in a subfolder, source/, of the work directory.

Note

Outcome of below data-preparation can be downloaded from xenodo.org. See/skip to here

First, - create a work directory (say, Burke-2019/)

Then, move some files that have been shipped with the program:

copy the contents of this folder in the source distribution: coalispr/docs/_source/tutorials/H99/shared/ to Burke-2019/.

In a terminal, change directory to the created work environment (from which all scripts and commands will be run):

cd /<path to>/Burke-2019/

Dataset¶

Following the mouse and yeast tutorials we first generate bedgraph files for C. neoformans data that has been collapsed and uncollapsed. We can obtain bedgraphs after aligning the data to the reference genome. To do this, the raw sequencing data is used and downloaded from the Gene Expression Omnibus (GEO) database with the relevant accession numbers retrieved from the literature:

H99-data	Reference	GEO acc. no.	SRA table	SRA study
WT, mutants	[Burke-2019]	`GSE128009`	GSE128009	SRP187879
WT 2010	[Dumesic-2013]	`GSE43363`	GSE43363	SRP017845

As described for the yeast tutorial:

Open the GEO Accession Display page for the GSE128009 experiment of [Burke-2019].
Enter the accession no. into the field GEO accession and press GO.
Access SRA Run Selector from the bottom of the GEO accession display page that has been opened.
On the SRA Run Selector webpage, select all Assay Type lanes for miRNA-Seq and choose Selected in the Select pane.
Then click on the Accession List button in the Selected row of the Select pane.
Save the file to the work directory and, because it will be combined with other data, add a prefix.

Save as 1_SRR_Acc_List.txt.
For detailed info, click the Metadata button for a file describing the experiment.

Save as SraRunTable.txt.
Do the same for the [Dumesic-2013] sample.

Save the Accession List from the Total Select panel as 2_SRR__Acc_List.txt.

Save the Metadata as 2_SraRunTable.txt.

Collate the lists:

cat 1-SRR_Acc_List.txt 2-SRR_Acc_List.txt > SRR_Acc_List.txt

Download, extract fastq and compress the data (These steps create a directory structure (see Mouse miRNAs) in the working folder the scripts rely on).

sh 0_0-SRAaccess.sh

This takes awhile, and so does:
sh 0_1-SRAaccess.sh
sh 0_2-gzip-fastq.sh

Clean and collapse¶

Adapters present on the reads are removed with:

sh 1_0-flexbar-trim.sh

and a fasta file with collapsed reads is prepared with:

sh 2_0-collapse.sh

Reference traces¶

Traces for gene-expression in cells can be included in Coalispr. Comparing siRNA signals to mRNA traces is possible because various RNA-Seq datasets are available. Here, we use the bedgraphs prepared by [Wallace-2020] and their annotation file. The reference files will be saved in subfolders of the folder source in the work directory Burke-2019/:

mkdir source

The mRNA reads for parental strain H99 and deletion strains for GWO1 and AGO1 are at GEO accession no. GSE133125 [Wallace-2020]; other samples - not used here - are at GSE206758 [Freitas-2022].

RNA-Seq	SRA acc. no.	Run	Bedgraph
H99r1	`SRX6102296`	`SRR9336390`	GSM3900523
H99r2	`SRX6102300`	`SRR9336394`	GSM3900527
HdGWO1	`SRX6102298`	`SRR9336392`	GSM3900525
HdAGO1	`SRX6102302`	`SRR9336396`	GSM3900529

Bedgraphs are downloaded and saved to their own folders; you could create these first:

cd source
mkdir GSM3900523 GSM3900527 GSM3900525 GSM3900529
cd ..

Now, for each experiment get the compressed bedgraph files for the PLUS [4] and MINUS strand from the GSM3900523 to GSM3900529 linked pages and save these, ending up with the following folder-hiearchy:

burke-2019
└── source
    ├── GSM3900523
    |   ├── GSM3900523_PN5_index1_ATCACG_L004_R1_001_hisat2_minus.bedgraph.gz
    |   └── GSM3900523_PN5_index1_ATCACG_L004_R1_001_hisat2_plus.bedgraph.gz
    ...
    └── GSM3900529
        ├── GSM3900529_PN10_index18_GTCCGC_L008_R1_001_hisat2_minus.bedgraph.gz
        └── GSM3900529_PN10_index18_GTCCGC_L008_R1_001_hisat2_plus.bedgraph.gz

Then decompress and clean up the files [5]:

for i in source/GSM39*/GSM39*.bedgraph.gz; do gunzip -c $i | sed 's/chr//g' > ${i%.*}

Alignment¶

Although siRNA reads are expected to be short (~19-24nt) the high density of introns in Cryptococcus makes it feasible that these RNAs can be complementary to regions that cross splice junctions. In other words, it is possible that siRNAs target transcript sections that have undergone splicing. An RNA-focused mapper like STAR can insert gaps that coincide with possible introns when aligning reads to a reference genome. This is the reason why this aligner features in these tutorials.

References¶

For accurate assignment of introns by STAR it is important to use all information available. One of these resources will be the annotation files describing exons for genes. A second resource is a database of possible splice-junctions that STAR accumulates while alligning reads in an initial pass. Maybe redundant, but we will use both. When generating a genome index, splice-junctions will be retrieved from a recent annotation file; while those found previously on the fly have been collected in a specific file. This file, SpliceJunctionDataBase-sorted-uniq.txt, informs STAR during alignments made in this tutorial. It will also be shown how this file can be updated.

Genome¶

Download a reference genome for aligning reads to the source/ subfolder:

wget -cN --show-progress -P source "https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-55/fungi/fasta/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99_gca_000149245/dna/Cryptococcus_neoformans_var_grubii_h99_gca_000149245.CNA3.dna_sm.toplevel.fa.gz"

Decompress and simplify seq_id:

gunzip -c source/Cryptococcus_neoformans_var_grubii_h99_gca_000149245.CNA3.dna_sm.toplevel.fa.gz | sed 's/ dna_sm:chromosome [a-z0-9A-Z: ]*$//g' > source/h99.fasta

GTFs¶

A GFF3 file with transcript annotations used by [Wallace-2020] can be obtained from their github repository:

wget -cN --show-progress -P source "https://github.com/ewallace/CryptoTranscriptome2018/raw/master/CnGFF/H99.10p.aATGcorrected.longestmRNA.2019-05-15.RiboCode.WithStart.gff3

The file has the format of a GTF file with gene-id entries, but gene names can be simplified [6]:

sort -k 1n,1n -k 4n,4n -k 5n,5n source/H99.10p.aATGcorrected.longestmRNA.2019-05-15.RiboCode.WithStart.gff3 | sed 's/"CN[A-N]:/"/g' > source/h99.gtf

The file lacks descriptions for the mitochondria which is obtained below.

To get a reference for non-coding RNAs and pseudo transcripts, get the GenBank annotations

wget -cNrnd --show-progress --timestamping -P source "ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/fungi/Cryptococcus_neoformans/latest_assembly_versions/GCA_000149245.3_CNA3/GCA_000149245.3_CNA3_genomic.gbff.gz"

Copy coalispr/coalispr/resources/share/genbank2gtf.py to the work folder [7] and run:

python3 genbank2gtf.py -f source/GCA_000149245.3_CNA3_genomic.gbff.gz -p /<path to source folder>/coalispr/coalispr/resources/share/conversion_tables -c H99 -k 4 -o source/ncRNA-pseu.gtf

Get descriptions for the MT chromosome and add these to the general GTF.

python3 genbank2gtf.py -f source/GCA_000149245.3_CNA3_genomic.gbff.gz -p /<path to source folder>/coalispr/coalispr/resources/share/conversion_tables -c H99 -k 3 -o source/mito.gtf
cat source/mito.gtf | sed '1,2d' >> source/h99.gtf

A reference file for counting with HTSeq had been prepared by copy pasting IGB information to a tsv file, source/h99-siRNAsegments.tsv from which a GTF with annotations was created for counting specific reads. In this GTF a specific feature (GTFFEAT) characterizes siRNA segments. The file forms a positive-control reference (GTFSPEC) in Coalispr displays of bedgraph traces. To make this GTF, copy coalispr/coalispr/resources/share/gtf_frm_tsv.py to the work directory (Burke-2019/) and run

python3 gtf_frm_tsv.py -f source/h99-siRNAsegments.tsv -c 2

The option -c 2 indicates that the input file (-f) relates to H99.

Two other files, h99-xtra.gtf and h99-ncRNAxtra.gtf are shipped with the tutorial that contain additions based on:

possible extraneous DNAs added to the genomes of knockout mutants (i.e. marker genes conferring resistance to nourseothricin (NAT), neomycin/geneticin/G418 (G418) or hygromycin (HYG).
observed differences between data traces and downloaded annotations.

These files will be automatically included when referred to (via GTFXTRA, GTFXTRANAM and GTFUNXTR, GTFUNXTNAM) in the settings file 3.h99.txt and loaded with coalispr showgraphs.

Index¶

With the annotation file for intron definitions, we can now generate with STAR a genome index for mapping reads to the reference genome. This creates a new folder, the name of which depends on the provided argument. Best practice is that the name for this folder contains EXP, which is set during coalispr init; here we take h99:

sh 3_0-createSTARindex.sh h99

Note that this file import the fasta file with the sequence for the extra annotations as set in h99-xtra.gtf.

The file with chromosome lengths (LENGTHSNAM), created by STAR, can be linked to; place the symbolic shortcut in source/, with other resources for Coalispr:

cd source; ln -sf ../star-h99/chrNameLength.txt h99-chr-lengths.txt; cd ..

Mapping¶

Map collapsed reads first, this should be fairly quick. Add argument EXP to find the folder with the correct genome information. A second argument would define the number of allowed mismatches; for stringent end-to-end mapping use 0. A third argument can be added to indicate that splice junctions have to be re-recorded from gapped alignments. Set this to 1 if needed; by default this is skipped because such a record comes with the tutorial (also see H99-shared).

sh 3_1-run-star-collapsed-14mer-strict.sh h99 0

If splice junction data have been collected [8], prepare the new database with:

sh 3_1_1-run-make-SJDB.sh h99

If this all finishes ok, map the uncollapsed data; by default the available database of spliced junctions will be used.

sh 4_1-run-star-uncollapsed-14mer-strict.sh h99 [9]

Then, bedgraphs can be made. The scripts take two parameters, for EXP and the number (0 by default) of mismatches.

sh 3_2-run-make-bedgraphs.sh h99
sh 4_2-run-make-bedgraphs-uncollapsed.sh h99

For quality control information, the following script (with the same two parameters) is available:

sh 5_1-QC-reports.sh h99

As a check for the mapping we inspect in a genome browser the accumulation of cDNAs, i.e. the collapsed data, at the CNAG_03231 locus on chromosome 8. The number of sense cDNAs is in a comparable range as the antisense cDNAs, but far more copies of antisense siRNAs are expressed as shown in Figure 1 where the uncollapsed sense signals are not coming above the base line.

cDNAs at CNAG_03231 — **Figure 2.** C. *neoformans* siRNA cDNAs target exons, not introns¶

Bedgraph traces for a chromosome 8 section in collapsed data from [Dumesic-2013] and [Burke-2019] aligned with STAR as done in this tutorial and displayed in the IGB genome browser. RNA-Seq by [Wallace-2020] for wild type (H99) and strains with an *AGO1* deletion (Δago1) form reference traces. Bars represent annotations in GTFs from [Janbon-2014] and [Wallace-2020]. Top: two snapshots with different zoom-levels. Gray frames delineate the region with CNAG_03231 and CNAG_03232. Bottom: CNAG_03231 region targeted by siRNAs.

Note

Datasets prepared above were uploaded to xenodo.org (DOI 10.5281/zenodo.12822543). These can also be used for testing Coalispr.

cd <path to workfolder>
wget -cNrnd --show-progress --timestamping "https://zenodo.org/records/12822544/files/Burke-2019.tar.gz?download=1"
tar -xvzf Burke-2019.tar.gz

Coalispr¶

The input materials, the bedgraph files, are available. The analysis begins with setting up the program for this experiment (EXP), which will be named to h99.

cd /<path to>/Burke-2019

coalispr init

bash-5.2$ coalispr init

Please provide a name for this new session/experiment.

This name will be used in commands and in file paths.
(therefore best to keep it short and memorable, without
spaces or special characters).

Experiment name:
('stop' to cancel)
h99


Experiment name will be set to: 'h99'
Is that OK?

1: Yes
2: Cancel

?:
1

Experiment name has been set to 'h99'.

Now, set the work folder for this session/experiment.

Files are stored in folder 'Coalispr' either in the user's
home directory, or in the current folder (before running the
 `init` command, change directory to this destination first).
Alternatively keep files near the source code, in the
coalispr installation folder. (The directory with the
installation folder will be shown as option in this case.)

Configuration files, figures and data go in sub-folders:
'constant_in' for configuration files, 'figures', and
'data' for storage of .pkl and .tsv count files.

Please choose a folder.
('stop' to cancel)

    1: home (rob)
    2: current (Burke-2019)
    3: source (COALISPR)

Enter an option: 2


Path to Coalispr folder will be set to: '/<path to>/Burke-2019/Coalispr'
Is that OK?

1: Yes
2: Cancel

?:
1

Configuration files to edit are in:
'/<path to>/Burke-2019/Coalispr/config/constant_in'
The path '/<path to>/Burke-2019/Coalispr' will be set as 'SAVEIN' in 3_h99.txt.

Configuration¶

Explained in the How-to guides is the preparation of configuration files with settings, especially the 3_EXP.txt, and the file (EXPFILE) describing the data for the experiment.

Note

When using the alignments downloaded from xenodo.org, the EXPFILE, /<path to>/Burke-2019/source/h99-experiment_table.tsv, is ready and only the shipped configuration file /<path to>/Burke-2019/Coalispr/constant_in/3_h99_zenodo.txt needs to be adapted to your local setup:

From 3_h99.txt created by coalispr init, copy relevant [12] values for:
- SETBASE
- SAVEIN (BASEDIR / PRGNAM).
and replace those in 3_h99_zenodo.txt
Remove/rename 3_h99.txt
Rename/link 3_h99_zenodo.txt to 3_h99.txt

Experiment file¶

The file describing the experiment has been assembled from the SraRunTable.txt for the two GEO datasets as follows:

Short   Run       Category  Method   Group  Fraction  Condition  Read-density  Experiment  Variation  GEO_Accession (exp)  DOI
wt_0    SRR646636      S    total             WCE                   NaN        SRX215659    WT 2010      GSM1061025        10.1016/j.cell.2013.01.046
a1_1    SRR8697580     U    total    a1       WCE                   NaN        SRX5493436    ago1        GSM3660325        10.25387/g3.7829429
a1_2    SRR8697581     U    total    a1       WCE                   NaN        SRX5493437    ago1        GSM3660326        10.25387/g3.7829429
c4_1    SRR8697582     M    total    meth     WCE                   NaN        SRX5493438    clr4        GSM3660327        10.25387/g3.7829429
c4_2    SRR8697583     M    total    meth     WCE                   NaN        SRX5493439    clr4        GSM3660328        10.25387/g3.7829429
e2_1    SRR8697584     M    total    meth     WCE                   NaN        SRX5493440    ezh2        GSM3660329        10.25387/g3.7829429
e2_2    SRR8697585     M    total    meth     WCE                   NaN        SRX5493441    ezh2        GSM3660330        10.25387/g3.7829429
re1_1   SRR8697586     M    total    re1      WCE                   NaN        SRX5493442    rde1        GSM3660331        10.25387/g3.7829429
re1_2   SRR8697587     M    total    re1      WCE                   NaN        SRX5493443    rde1        GSM3660332        10.25387/g3.7829429
re2_1   SRR8697588     M    total    re2      WCE                   NaN        SRX5493444    rde2        GSM3660333        10.25387/g3.7829429
re2_2   SRR8697589     M    total    re2      WCE                   NaN        SRX5493445    rde2        GSM3660334        10.25387/g3.7829429
re3_1   SRR8697590     M    total    re3      WCE                   NaN        SRX5493446    rde3        GSM3660335        10.25387/g3.7829429
re3_2   SRR8697591     M    total    re3      WCE                   NaN        SRX5493447    rde3        GSM3660336        10.25387/g3.7829429
re4_1   SRR8697592     M    total    re4      WCE                   NaN        SRX5493448    rde4        GSM3660337        10.25387/g3.7829429
re4_2   SRR8697593     M    total    re4      WCE                   NaN        SRX5493449    rde4        GSM3660338        10.25387/g3.7829429
re5_1   SRR8697594     M    total    re5      WCE                   NaN        SRX5493450    rde5        GSM3660339        10.25387/g3.7829429
re5_2   SRR8697595     M    total    re5      WCE                   NaN        SRX5493451    rde5        GSM3660340        10.25387/g3.7829429
r1_1    SRR8697596     U    total    r1       WCE                   NaN        SRX5493452    rdp1        GSM3660341        10.25387/g3.7829429
r1_2    SRR8697597     U    total    r1       WCE                   NaN        SRX5493453    rdp1        GSM3660342        10.25387/g3.7829429
r6_1    SRR8697598     M    total    r6       WCE                   NaN        SRX5493454    rrp6        GSM3660343        10.25387/g3.7829429
r6_2    SRR8697599     M    total    r6       WCE                   NaN        SRX5493455    rrp6        GSM3660344        10.25387/g3.7829429
wt_1    SRR8697600     S    total             WCE                   NaN        SRX5493456    Kn99alpha   GSM3660345        10.25387/g3.7829429
wt_2    SRR8697601     S    total             WCE                   NaN        SRX5493457    Kn99alpha   GSM3660346        10.25387/g3.7829429
h99_1   GSM3900523     R    rnaseq                                  1          SRX6102296    RNA_H99r1   GSM3900523        10.1093/nar/gkaa060
h99_2   GSM3900527     R    rnaseq                                  4          SRX6102300    RNA_H99r2   GSM3900527        10.1093/nar/gkaa060
h99-a1  GSM3900529     R    rnaseq                                  5          SRX6102302    RNA_HdAGO1  GSM3900529        10.1093/nar/gkaa060
h99-g1  GSM3900525     R    rnaseq                                  2          SRX6102298    RNA_HdGWO1  GSM3900525        10.1093/nar/gkaa060

For columns METHOD, FRACTION or GROUP only values relating to particular methods, fractions or conditions are filled in. When one value applies to all samples it can be omitted; but then the group will not be shown in coalispr showgraphs. Thus, in the example above, METHOD and FRACTION will be visible but CONDITION will not be indicated. The numbers in the Read-density (READDENS) column for the mRNA references lead to an adaptation of the traces so that their overall display is at a comparable level. This can be done because the expression of most transcripts do not seem to be affected by the deletions. In other words, it is an application of the principle guiding normalization when differential expression of genes is assessed (see About) [10]. Adjusting the relative Read-density makes it easier to observe putative effects of the absence of Ago1 or Gwo1 on expression of transcripts targeted by siRNAs.

Settings file¶

For the configuration, in the file 3_h99.txt (copied from 3_EXP.txt in the init step) the following fields [11] were set to the given values:

EXP : “h99” [12]
CONFNAM : “3_h99.txt” [12]
EXPNAM : “C. neoformans”
UNSPECLOG10 : 0.78
MAXGAP : 400
XLIM0 : 14
XLIM1 : 36
XLIM00 : 18
SETBASE : “/<path to>/Burke-2019/”
REFS : SOURCE
EXPFILNAM : “h99-experiment_table.tsv”
RIP1 : “”
RIP2 : “”
NOTAGIP : “”
MUTGROUPS{

“a1”:”ago1\u0394”, “re1”:”rde1\u0394”, “re2”:”rde2\u0394” [13]

“re3”:”rde3\u0394”, “re4”:”rde4\u0394”, “re5”:”rde5\u0394”,

“r1”:”rdp1\u0394”, “r6”:”rrp6\u0394”, OTHR:”methyl\u0394”,

}
FRACTIONS{

WCE:”Cell extract”,

}
UNSPECIFICS[

“a1”, “r1”,

]
MUTANTS[

“re1”, “re2”, “re3”, “re4”, “re5”, “r6”, OTHR,

]
LENGTHSNAM : “h99-chr-lengths.txt”
DNAXTRNAM : “NAT_G418_HYG.fa”
LENXTRA : “7062”
CHRXTRA : XTRA
GTFSPECNAM : “h99-siRNAsegments.gtf”
GTFUNSPNAM : “ncRNA-pseu.gtf”
GTFREFNAM : “h99.gtf”
GTFXTRANAM : “h99-xtra.gtf”
GTFUNXTNAM : “h99-ncRNAxtra.gtf”
# SAVEIN : Path(“/<path to>/Burke-2019/Coalispr”) [12]

SAVEIN : BASEDIR / PRGNAM [14]
LENGTHSFILE : BASEDIR / SOURCE / LENGTHSNAM
DNAXTRA : BASEDIR / SOURCE / DNAXTRNAM
EXPFILE : BASEDIR / SOURCE / EXPFILNAM
GTFSPEC : BASEDIR / SOURCE / GTFSPECNAM
GTFUNSP : BASEDIR / SOURCE / GTFUNSPNAM
GTFREF : BASEDIR / SOURCE / GTFREFNAM
GTFXTRA : BASEDIR / SOURCE / GTFXTRANAM
GTFUNXTR : BASEDIR / SOURCE / GTFUNXTNAM

Analysis¶

Bedgraph data are loaded and compared in Coalispr by means of Pandas dataframes. For repeated use, the data are stored in a binary (pickle) format specific for the version of Python used to run Coalispr. The coalispr storedata steps below accomplish this.

Note

When using the archive with alignments downloaded from xenodo.org, this step can be skipped and binary, processed data that is used in Coalispr can be obtained with coalispr storedata -d4 after which subsequent commands (like coalispr showgraphs or coalispr countbams) can be tested. After this, or when generating your own data files through coalispr storedata -d{1,2} -t{1,2} remove or rename the backup_from_pickled link (to prevent that the contents shipped with the archive are overwritten by subsequent coalispr storedata -d3 commands).

Store uncollapsed, then collapsed data (-t2), while processing both data (-d1) and reference (-d2) bedgraph files.

coalispr storedata -d1
coalispr storedata -d2
coalispr storedata -d1 -t2
coalispr storedata -d2 -t2

Example output of last run, skipping files already processed.

bash-5.2$ coalispr storedata -d2 -t2

Processing references ...

Binned bedgraph-frames available for 'h99-a1'; will be skipped
Binned bedgraph-frames available for 'h99_1'; will be skipped
Binned bedgraph-frames available for 'h99_2'; will be skipped
Binned bedgraph-frames available for 'h99-g1'; will be skipped
Nothing left to bin, skip this.
Pickle-file for 'reference_merged' saved
Merged references to filter: ['h99_1', 'h99_2', 'h99-a1', 'h99-g1']
Using ['a1_2', 'r1_2', 'a1_1', 'r1_1'] for removing aspecific signals from references
Pickle-file for 'reference_specific_0-78_collapsed' saved
Pickle-file for 'reference_unspecific_0-78_collapsed' saved

The outputs written to the terminal also end up in the log file in /Burke-2019/Coalispr/logs/h99/. Check the logs to get an idea about the involved steps, whereby contiguous indices for regions with signals are combined to segments. Further, files are saved for use by other parts of the program based on the threshold UNSPECLOG10 (of 0.78; representing a minimal ~6 fold difference in the case of overlapping signals for SPECIFIC/MUTANT samples vs. UNSPECIFIC samples in bins marked by the indices).

Centromeres — **Figure 3.** Visualization of centromeres.¶

Normally, centromeres are not transcribed, as indicated by the gap in mRNA transcripts (black). Inactivation of histone H3 methylation by deletions in genes for Clr4 (c4) or Ezh2 (e2) leads to loss of silencing of centromeric transcripts to which siRNAs are raised [Burke-2019]. With: `coalispr showgraphs -c 8 -w2`.

Graphs¶

Check traces:

coalispr showgraphs -c 8 -w2

Proteins involved in methylation of DNA and of histone H3, leading to heterochromatin formation and thereby silencing of genes in these chromosome regions, are important to reduce the production of siRNAs complementary to transcripts expressed from centromeres [Burke-2019]. Centromeres in Cryptococcus are long and consist of repetitive DNA derived from transposable elements [1]. The length of centromeres, together with the absence of RNAi against transposons in an organism, appears related to virulent behaviour of fungi [Yadav-2018]. The results from [Burke-2019] are helping to light up centromeric regions in bedgraph displays by Coalispr.

Reads at CNAG_03231 and CNAG_03387 — **Figure 4.** CNAG_03231 and CNAG_03387¶

Segments representing an siRNA locus have been annotated as a ncRNA. Not all mutants have an equal effect on siRNA formation. One group appears as wild type (middle) and another as the negative controls (bottom). Traces for Specific reads are shown (with: `coalispr showgraphs -c8 -w4`).

To revisit CNAG_03231, and see the impact of mutants on siRNA formation (see Fig. 4, run:

coalispr showgraphs -c8 -w4

And zoom in to regions 399000-403600 for CNAG_03387 or 814700-821200 for CNAG_03387. Alternatively, using the region option -r, for CNAG_03387 run:

coalispr showgraphs -c8 -w4 -r 399000-403600

For CNAG_03387:

coalispr showgraphs -c8 -w4 -r 814700-821200

This will show traces after filtering out the unspecific reads [15]. The C-terminal encoding regions of the transcripts are targeted by siRNAs. That introns have been removed by splicing is better visible in a genome browser; the 50-fold reduced resolution in the Coalispr display can only hint at this. Note that in absence of RNAi (in the h99-a1 mutant lacking Ago1) relatively higher mRNA expression levels are detectable for these loci compared to wild type (h99) and surrounding loci.

In the other tutorials, to support the article , rDNA loci have been analyzed.

Reads at the rDNA locus — **Figure 5.** C. *neoformans* rDNA locus [16]¶

Reads representing rRFs are abundant in all samples to a comparable extent. In some clones, but not all, reads complementary to rRNA are formed. For the rde1Δ (top mid) and rde2Δ (top right) strains these reads counter regions of large subunit rRNAs 25S and 5.8S, while in wild type (top left) or rrp6Δ (bottom mid) the small subunit 18S rRNA region and flanking transcribed spacers are targeted. No effect is visible for strains lacking Rde3, Rde4 or Rde5 (bottom left) or in strains without DNA or histone methylation activity (bottom right). Although a contamination with rDNA might have caused this, the signals are typical for the strains and could have another source of origin. Note the rRNA peaks of the reference RNA-Seq libraries that run off the scale (black). Traces for all reads are shown (with: `coalispr showgraphs -c2 -w2` ).

In Cryptococcus the rDNA is on chromosome 2, on the minus strand. All reads are shown with:

coalispr showgraphs -c2 -w2
Using the zoom tool, exposes the rDNA region from 270000 to 284000

This command produces a display of traces that shows that rRFs are common to all samples. In some strains, including the wild type (from [Dumesic-2013]), reads are mapped that are antisense to the rRNA. Can these be the result of an RNAi response? Or are they generated by another, possibly artefactual mechanism?

Add count data¶

To obtain a more detailed insight in the kind of reads in the various libraries, we start with getting count data.

Counts for the input:

coalispr countbams -rc 1

(for uncollapsed data)
coalispr countbams -rc 2

(for collapsed data)

Counts for aligned reads:

coalispr countbams

(for specific data)
coalispr countbams -k2 -u 1

(for unspecific data (-k2) including unselected reads (-u1))

With this information, we can inspect cDNA traces (of collapsed reads) with siRNA-like lengths and start-nucleotide [17] that are present in the negative data:

coalispr showgraphs -c2 -t2 -u1

This shows that such common small RNAs can derive from highly expressed transcripts (top left panel in the figure, including rRNA. Fitting the finding that siRNA populations consist of sense and antisense reads, such small RNAs are found to map to the rDNA, although mostly in a similar manner as found for the negative control lacking Rrp1 (r1). A fraction of reads that does not overlap with negative control reads is found in the wild type and a strain with mutant RRP6. Does this mean that rRFs are actually part of the siRNA population associated with Ago1 in a biologically relevant manner? Or is it that rRNA, rather than supplier, becomes a target of RNAi? Because of the mix-up with rRNA as common contaminant and the possibility that unselected siRNAs obtained their features by chance, these questions cannot be answered on the basis of these observations. For example, there is no idea about the length-distributions of the mapped reads; do the siRNA-like reads stand out from the remainder? This question is answered below.

Mapped cDNAs at rDNA locus — **Figure 6.** C. *neoformans* rDNA locus [16], mapped cDNAs.¶

cDNAs of unselected reads adhere to siRNA characteristics but fall in the class of unspecific sequences due to an overlap with common signals found in negative control samples. Many of these reads derive from highly expressed transcripts (orange traces on top of black mRNA signals, top left). This is also the case for the DNA regions with genes for rRNA (top middle), although a diverse population of cDNAs is found to be anti-sense to 25S and 5.8S or 18S rRNAs, which is also observed for the negative control lacking a functional *RRP1* gene (top right). Anti-sense cDNAs against the 18S rRNA region and flanking transcribed spacers are specific for the wild-type strain or seen in the absence of Rrp6 (bottom) (with: `coalispr showgraphs -c2 -t2 -u1` ).

Count diagrams¶

Coalispr is a tool to check libraries in a data set, and obtain counts for specified reads as done above. How different are the libraries with respect to coverage and number of mapped reads? For this, we can look at the uncollapsed and collapsed reads, respectively, using alignment information:

coalispr showcounts -rc 1
coalispr showcounts -rc 2

Mapped vs. unmapped reads — **Figure 7.** C. *neoformans* libraries, mapped (stranded) vs. unmapped reads.¶

Uncollapsed read data on the left, collapsed data (cDNAs) on the right.

According to these diagrams, the libraries are fairly comparable although the original data (wt_0) [Dumesic-2013] contain a high number of unmapped reads, possibly due to adapter remnants that interfere with the stringent EndToEnd mapping [9]. In general the background is fairly high. This observation is based on the overlap between the positive and negative control samples, with respect to a) the large fraction of common minus-strand reads (the rDNA gene is on that strand on chromosome 2) in the positive and mutant samples, while b) negative control samples contain reads that are specified as specific [18]. The characteristics for the libraries of mutant rde1Δ (re1) are very similar to that of the negative controls, suggesting that RNAi has been inactivated in the absence of Rde1. The strains lacking Rrp6 (r6) yielded a relatively low number of reads. According to Burke et al [Burke-2019] for all samples 20 μg RNA was used for library preparation of siRNAs. Assuming that experimental parameters for library preparation were comparable, are the low siRNA yields for rrp6Δ strains indicative for a poor state of these cells (slow growth, reduced RNA synthesis or minimal RNAi activity)?

Checking the overall counts for specific and unspecific reads gives other information:

coalispr showcounts -lc {1-7}

Library counts — **Figure 8.** C. *neoformans* library counts, specific vs. unspecific reads.¶

Top: Uncollapsed read data (`-lc 1`; left) and collapsed data, cDNAs (`-lc 2`; right).

Middle: Introns in uncollapsed read data (`-lc 4`; left), and in cDNAs (`-lc 5`; right).

Bottom: Unspecified reads in unspecific data vs specific reads (`-lc 6`; left); reads skipped during counting (`-lc 7`; right).

The results obtained with option -lc 1 indicate that RNAi activity seems almost completely abolished not only in mutant rde1Δ, but also in rde4Δ, rde5Δ and possibly rde3Δ (re1, re4, re5, and re3, resp.). This can be inferred from the finding that (nearly) all mapped reads are part of a common background (top left). In mutant rde2Δ (re2) RNAi appears to be affected as well, although partially, in view of the enhanced number of unspecific reads at the expense of those mapping to specific siRNA targets. Note that an opposite observation, more specific and less unspecific reads, can be made for mutant clr4Δ (c4)

The fact that Cryptococcus is intron-rich is exemplified by the many specific reads with gaps that could be introns (about one in 25 cDNAs; middle).

One to three percent of mapped reads were skipped during the counting because their end-to-end alignment was not solely based on matches with, possibly, one intron-sized gap (i.e. larger than MININTRON; cf. bottom right to top-left).

Length distributions of the mapped reads are visualized with:

coalispr showcounts -lo {1-5}

for an overview for all libraries, and, for individual samples:
coalispr showcounts -ld {1-6}

with option -k2 lengths for unspecific reads are shown and with option -ma lengths for uniq reads or multimappers (see below).

The distribution of read lengths to a narrow range of 20-24 nt with the majority starting with a T (U in RNA) fits expectations for siRNAs bound to Argonaute proteins in fungi [Dumesic-2013]. Furthermore, the separation of specific reads from unspecific reads seems to be adequate in view of the hugely different length distributions for these categories of reads. That putative siRNAs are still present among the unspecific reads in the rde1Δ, rde2Δ or rrp6Δ strains (re1, re2, or r6, bottom middle) appears to be supported by length-distributions for these unselected reads (bottom right) and observations for the rDNA region described below (see Region-analysis).

Multimappers¶

Transposons that elicit an RNAi response are often present in multiple copies, although some loci might be more active than others, possibly depending on an accumulation of mutations. Therefore, it is not obvious how to account for such repeats during mapping. In the alignment scripts, STAR was set up to map uncollapsed multimappers randomly to the possible loci, while collapsed reads are aligned with each possible position (to keep these hits visible in the bedgraphs) but counted as 1/NH (with NH the number of hits found in the bam alignment line).

The above figure indicates that most of the reads and cDNAs align to a single site (’unique reads’, top) in the reference genome rather than to multiple loci (‘multimappers’, bottom). Centromeres are normally silenced by packaging of nucleosomes into heterochromatin, which coincides with methylation of H3K9 (lysine 9 in histone H3) by Clr4. When this methyltransferase is absent (Mutants c4), the number of multimappers increases (relative to total and unique populations) compared to wild type (Positive wt). Centromeres in Cryptococcus consist mainly of repetitive sequences derived from retrotransposons Tcn1–Tcn5 (that belong to the Ty3-gypsy family of retroelements) and Tcn6 (like Ty1-copia) [Yadav-2018]. Therefore, the increase of multimappers in the absence of Clr4 fits a scenario whereby centromeric transcripts, the expression of which is no longer suppressed by methylation, are silenced by RNAi instead. Notably, genes on euchromatic regions of the chromosomes to which large numbers of siRNAs align to often share homology with domains found in transposable elements but are not clearly related to Tcn1-6.

Lengths for unique reads and multimappers — **Figure 11.** C. *neoformans* length distributions of unique siRNAs and multimappers.¶

Unique reads (`-ma 1`, top) and multimappers (`-ma 2`, bottom);

Shown are unspecific (left), or specific (middle) reads, and cDNAs (right).

In the absence of Rde1 only a few siRNAs are found (see Figure 8), defined by cDNAs with a wild-type length distribution (coalispr showcounts -ld 2 -ma {1,2}). Still, most of the multimapping siRNAs (coalispr showcounts -ld 1 -ma 2) in this mutant appear to be 1 nt shorter, peaking with a length of 21 nt, instead of 22 nt (Fig. 11, middle panel, bottom row). Section ‘Multimapping 21-mers in rde1Δ’ tries to find loci responsible for this. One candidate turns out to be the rDNA locus, where such siRNAs map against the pre-rRNA transcript.

Intron-skipping siRNAs¶

With options -lo {4,5} or -ld {4,5} intron lengths are displayed in uncollapsed reads or collapsed cDNAs, respectively [19]. For a large range of lengths no hits are present. The resolution of the distribution can be enhanced by skipping this empty section. For this:

Edit the field SKIPINT in 3_h99.txt, set it to = 190, 350 and save the file.
Activate the new configuration by rerunning the setexp -e h99 -p2 command.
Redraw the figures.

Of the intron-skipping reads, both unique and multimappers bridging an intron of ~62 nt are enriched in the absence of Rde1 (lanes re1 middle section); the number of cDNAs is also relatively increased for these reads, predominantly for those that align to one locus (bottom section, left and middle vs. right panels. In cells without Rde2, multimappers (reads as well as cDNAs) over a spliced intron of ~73 nt stand out (lanes re2 middle and bottom, right). Based on this information, the section ‘Intron-skipping siRNAs’ describes how putative loci can be identified for these reads.

The unique locus from which siRNAs with a 62 nt gap are accumulating in the absence of Rde1 turns out to harbour a unique pseudo-gene, CNAG_07650, on chromosome 6. Centromeric loci account for the multimapper in rde2Δ cells.

Region-analysis¶

Sometimes, read counts and length-distributions that are valid for just a small section of the overall transcriptome can be highly infomative. Coalispr facilitates this with the coalispr region call. For example, let’s analyze in more detail the rDNA unit in C. neoformans.

rDNA¶

Obtain images and coordinates for the complete rDNA in H99 from a graph displaying all reads:
- Run coalispr showgraphs -c2 -w2.
- Use the zoom-tool to select the region covered by the large peak near position 300.000 [16].
- Activate mutant traces with comparable traces (rde1Δ, rde2Δ, rde3Δ, rde4Δ, rde5Δ, methylΔ, or rrp6Δ) and save each image as png.
- Clicking the x-axis will generate in the terminal the genomic coordinates for the region displayed; e.g. 2 (271985, 279663).
- Highlight the coordinates (271985, 279663) and copy these with Shift+Ctrl+C to the clip-board.
Get counts for the rDNA and count-displays by:
- In another terminal type ``coalispr region -c 2 -r ``
- After option -r ``, paste the coordinates copied above with ``Shift+Ctrl+V.
- To use all available samples, include option -s 2.
- Set up strand-specific analysis by including option -st.
- Run coalispr region -c2 -r 271985, 279663 -s2 -st 2 for PLUS reads and associated diagrams.
The terminal will display which samples are counted; the number of skipped reads, a dataframe showing total numbers and feedback for the figures that are made and automatically saved to the indicated folders [21]. The generated count data are saved as separate files in /<path_to/Burke-2019/Coalispr/data/h99/tsvfiles/region_readcounts_collapsed-bam/region_2_271985-279663 [22].
- Run coalispr region -c2 -r 271985, 279663 -s2 -st 3 for MINUS reads and associated diagrams.

The obtained diagrams have been combined into Figure 13:

Analysis rDNA locus — **Figure 13.** Region analysis of C. *neoformans* rDNA on chromosome 2.¶

Graphs, with traces for various mutants highlighted (top) and count diagrams for plus (middle) and minus strand (bottom);

Various bits of information can be distilled from this figure. When focusing on reads mapping to the rDNA sense strand (minus strand):

RNA fragments formed from ribosomal RNA (the rRFs) are very abundant in negative controls and rde mutants affecting RNAi (re1 to re5) and more so than in strains producing siRNAs (Positive) that could participate in reverse transcription reactions that create the input for the sequencing (see bottom row, left panel).
rRFs form the majority of overall unspecific library counts (cf. Fig. 8, top left panel)
The rRFs do not meet the criteria for Ago-associated small RNAs that have a length of 19-23 nt and start with a U (T in cDNA) (see bottom row, middle panel for reads and right panel for cDNAs [23]).
That, despite the high number of counts, rRFS represent uninformative background is illustrated by the coverage pattern of these RNAs, which is the same for all samples (see top panel).

Interestingly, RNAs mapping to the antisense rDNA strand, suggest that siRNAs have been generated against rRNA:

A large number of antisense, plus strand reads were counted for strains rde1Δ (re1), rde2Δ (re2) and a positive control (wt_0) (Middle row, left panel). See also Fig. 16 (top) below.
In all samples, these antisense reads conform to siRNA characteristics.
Different from wild type siRNAs, those in mutant strains appear in general to be 1 nt shorter.

It cannot be ruled out that for samples with low counts these putative siRNAs are an experimental artefact or a kind of background. Still, the high levels of these reads for rde1Δ and rde2Δ strains suggest that in these mutants the RNAi activity could have been redirected to rRNA from transcripts that, somehow, no longer form a target. Thus, a step that would have lead to the generation of specific siRNAs against transcripts from pseudo-genes or transposable-elements is no longer happening because of these mutations.

21-mers¶

Are the siRNAs only shorter for the rDNA locus in strains rde1Δ or rde2Δ? We can check this for CNAG_07650, one of the few loci with siRNAs in the absence of Rde1 (see Figure 12). The coverage graphs in the section ‘Intron-skipping siRNAs’ show that the major target is the transcript from this pseudo-gene on the plus strand. After zooming the display to only show the region with siRNA traces, obtain the coordinates (say 1371800-1372650) by clicking the x-axis and copy-paste the numbers to run:

Run coalispr region -c6 -r 1371800-1372650 -s2 -st {2,3}

The combined output indicates that at this locus siRNAs are not shorther in mutant cells, compared to wild type:

Analysis CNAG_07650 locus — **Figure 14.** Region analysis of C. *neoformans* CNAG_07650 on chromosome 6.¶

Count diagrams for plus (top) and minus strand (bottom).

CNAG_07650 sequences are not repeated in the H99 genome and this outcome fits the length-distribution for unique loci (Fig. 11, middle panel, top row).

For rde1Δ cells a multimapper was responsible for relative accumulation of siRNAs of 21 nt (see Fig. 11 middle panel, bottom row). This quick analysis suggest that shortening of siRNAs could be linked to the rDNA locus.

Note that for strains with inactivated RNAi (ago1Δ or rdp1Δ), RNAs are still mapping to the CNAG_07650 locus with siRNA length-distributions. For other RNAi loci such background is observed as well. Whether incomplete RNAi-inactivation or library contamination could be the reason for this, is unclear.

auto-RNAi¶

Above analysis for the rDNA showed that in some strains siRNAs are detected that are antisense to rRFs (Fig. 13). Are other common ncRNAs prone to be targeted by RNAi? It looks that this can happen in the case of SRP RNA as well, another prominent source of unspecific reads. SRP RNA gets expressed from the minus strand on chromosome 1 and is part of a cytoplasmic RNP that associates with ribosomes to regulate translation of proteins that localize to the ER.

Run coalispr showgraphs -c1 -r 801150-801450 -w2 for the coverage display.
Run coalispr region -c1 -r 801150-801450 -s2 -st {2,3} for the count diagrams.

Analysis SRP RNA locus — **Figure 15.** Region analysis of C. *neoformans* SRP RNA locus.¶

Coverage display (on the left) and count diagrams for plus (top) and minus strand (bottom).

Note that, for this unspecific locus in RNAi capable strains, cDNAs are found representing CORB sense siRNAs and the more abundant MUNR antisense siRNAs. In strains where RNAi has been affected, the cDNAs for CORB cDNAs do not adhere to a siRNA length-distribution. At loci with specific reads, such reads were observed in low amounts for strains where RNAi had been inactivated (for example CNAG_07650). Such ‘background’(?) signals may have been too low at unspecific loci to stand out as siRNAs in these strains.

If these siRNAs against ncRNAs are really formed, these siRNAs show activity against molecules representing ‘self’, like a sort of auto-immune response but then for RNAi, ‘auto-RNAi’. This also occurs in C. deneoformans, against similar ncRNAs (rRNA, SRP RNA). Such events might indicate that processes involving these ncRNAs triggered RNAi. A common factor is translation. [26]

Annotate count files¶

The available reference files with genome annotations can be scanned for overlap with segments to which siRNAs map. For this run the coalispr annotate command. Include the mRNA references with -rf 1 an, to easier assess large count numbers, take their log2 values:

coalispr annotate -lg 1 -rf 1

And for unspecific reads (-k 2):

coalispr annotate -lg 1 -rf 1 -k2

To get counts for unspecific reads opposite rRNA sequences, include the CORB strand option (-st 3):

coalispr annotate -lg 1 -rf 1 -k2 -st 3

**Figure 16.** Annotated counts (log2 values).¶

Top: Unspecific reads, hits opposite major strand like on the plus strand opposite the rDNA on the minus strand (highlighted)

Mid: All specific reads, siRNAs against targets on the opposite strand are annotated with ‘@’.

Bottom: All unspecific reads.

(only top-section of each table is shown; duplicates are omitted to reduce figure width.)

As found for mouse, in these small RNA-Seq libraries for C. neoformans the counts for rRFs are among the highest read numbers. Most unspecific reads relate to rRNA, tRNA and mitochondrial transcripts; most specific reads are opposite uncharacterized genes.

Conclusion¶

In the analysis of siRNAs and unspecific RNAs for various libraries prepared by [Dumesic-2013] and [Burke-2019] many observations were linked to rde1Δ and rde2Δ. RDE1 (CNAG_01848) codes for a G-patch protein similar to Pfa1/Sqs1 in Saccharomyces cerevisiae, one of the cofactors for Prp43. This multi-functional RNA helicase relies on G-patch proteins for functional specificity [Bohnsack-2022]. In yeast, the Prp43/Pfa1 complex is linked to the last step in the maturation of 18S rRNA, cytoplasmic cleavage of 20S pre-rRNA [Lebaron-2009, Pertschy-2009]. In association with different G-patch proteins, Prp43 is further involved in dissociation of spliceosomes and the removal of snoRNPs responsible for 2’-O methylation of 18S and 25S rRNA [Bohnsack-2009].

Other RNA-processing factors were identified by Burke et al., namely the RNAse III Rnt1 (CNAG_06643, or RDE3), which, in yeast, is a nuclear protein (for a review see [Elela-2018]) and involved in co-transcriptional pre-rRNA processing, analogous to bacterial RNAse III [Abou Elela-1996], and maturation of snoRNAs such as U14 [Chanfreau-1998a, Chanfreau-1998b]. Overall, 13% of the yeast genes are differentially overexpressed in rnt1Δ cells. Yeast Rnt1 carries a characteristic C-terminal RNA-binding domain (RBM0) not present in homologous proteins in fungi that harbour an RNAi machinery. Cryptococcus Rnt1/Rde3 lacks this domain [25] and therefore might not be constrained to double strand RNA substrates formed by NGNN stemloops recognized by RBM0 as is the case for yeast Rnt1 [Elela-2018]. Rde5 (CNAG_04791), with no known orthologues, appeared to be a Rnt1-interacting protein [Burke-2019]. Rde2 (CNAG_04954) is placed among the orthologues of human EDC4, an enhancer of mRNA decapping enzymes, while Rde4 (CNAG_01157) contains domains that suggest it is an orthologue of the TENT5 family of terminal nucleotidyltransferases involved in polyadenylation during 3’ end processing of RNA and linked to the regulation of the innate immune response in animals [Warkocki-2018, Liudkovska-2022].

Model RNAi target learning — **Figure 17.** RNAi onset occurs after splicing.¶

Because siRNAs are raised against spliced transcripts, the

RNAi response must be triggered by events downstream.

Blocked translation could provide a template to create

siRNAs, namely a piece of mRNA protected by piled-up

ribosomes. With these siRNAs problematic transcripts can

be intercepted, evading further rounds of costly removal

of unproductive translation complexeses [27].

Disruption of these factors activated the Harbinger transposon to jump from the context of a URA5 reporter gene, and resulted in reduced siRNA production. Because such deletions can be expected to primarily affect ribosome synthesis or mRNA degradation, it cannot be ruled out that a subdued RNAi response could be an indirect result of these deletions. Information from experiments that assess splicing, RNA turnover, and maturation of pre-rRNA and ribosomal subunits in rdeΔ strains will be needed to understand what is actually happening in these cells.

Either way, in rdeΔ strains an RNAi response would be alleviated when this process is triggered by stalled ribosomes and thus directed to ‘difficult’ mRNAs that cause this (Fig. 17). After ribosomes have stalled, a quality control mechanism involving mRNA degradation is initiated by decapping enzymes and 3’ end processing. A lack of sufficient numbers of ribosomes might reduce their chance of collapsing into each other and thereby the onset of clearing processes that also involve RNAi and the generation of siRNAs.

The model cannot explain how siRNAs are maintained and propagated during cell division if the relevant transcripts are not synthesized [24]. That translation might provide a relevant context for the onset of RNAi in Cryptococcus is indicated by the fact that many siRNAs mapped to pseudo-genes and TE-loci, but - in C. deneoformans - not to T1 [Janbon-2010], an active, non-coding transposable element [van.Nues-202?]. Recently it has been found that “piRNA generation is associated with the pioneer round of translation in stem cells” of planarians (a kind of flatworm), directly involving the argonaute-like PIWI protein SMEDWI-1 [Parambil-2023]. Thus, route 1 in the model, priming RNAi on problematic transcripts, might occur during such a pioneer round or through common translation on cytoplasmic ribosomes [27]. Another hint in the case of Cryptococcus might be given by the amounts of siRNAs that map anti-sense to mature rRNA (Figure 13), or SRP RNA (Figure 15), in rde1Δ strains and in repair strains where RNAi has been re-established by restoration of genes for Ago1 or Rdp1 [van.Nues-202?]. Maybe such RNAs are mistakenly turned into targets when stalled translation complexes are resolved. Would this also mean that cells need to “learn” that some RNAs, like rRNA, are self?

Notes¶

[1] (1,2)

For about 18% of Cryptococcus genes alternative splicing has been observed, mostly due to intron-retention (61%), which appears related to virulence [Loftus-2005], [Janbon-2014], [Huff-2014], [Gruetzmann-2014]. Chromosomes in both JEC21 and H99 have large centromeres mainly consisting of repetitive sequences derived from Tcn-like transposable elements [Loftus-2005], [Janbon-2014]. This repetitiveness might have lead to an underestimation of centromere length by RNA-Seq [Yadav-2018]. Compared to viral and bacterial infections, fungal infections are relatively rare, possibly related to the body temperature of mammals as described by [Casadevall-2020]. The high GC content of Cryptococcus would be in line with this, strengthening interactions between nucleic acid strands, thereby stabilizing genomic DNA at a higher temperature. Stronger binding between nucleic acids could, however, make it more difficult to accommodate dynamic interactions that occur during transcription, splicing or RNAi. The propensity for R-loop or G-quadruplex (G4) formation in GC-rich nucleic acids could complicate these processes, and possibly compensatory mechanisms will be in place in Cryptococcus not immediately found in other organisms.

[3]

The assessment that CNAG_03231 was previously called CNAG_07721, is based on genome coordinates provided for RNA-Seq reads that mapped to CNAG_07721 by [Kim-2014] in their supplementary Table S4 (sheet 3, entry 123).

no	Feature ID	DMSO	Chromosome	Strand	Start	End	Function
123	CNAG_07721	1.20	8	plus	399834	400159	conserved hypothetical protein

[9] (1,2)

Output of STAR during mapping.

bash-5.2$ sh 4_1-run-star-uncollapsed-14mer-strict.sh h99
One parameter given (for experiment name); no mismatches.
Using alignments with 0 mismatches.
Splice data found, will be used for making alignments..
Checking STAR-folder for SRR646636 ..
Checking STAR-folder for SRR8697580 ..
Checking STAR-folder for SRR8697581 ..
Checking STAR-folder for SRR8697582 ..
...
mapping SRR8697601
STAR --runThreadN 4 --genomeDir /<path to>/Burke-2019/star-h99 \
 --readFilesCommand gunzip -c \
 --readFilesIn /<path to>/Burke-2019/SRR8697601/SRR8697601-trimmed.fastq.gz \
 --outFilterType BySJout --outSAMprimaryFlag OneBestScore
 --outReadsUnmapped Fastx --outMultimapperOrder Random
 --outSAMtype BAM Unsorted --outSAMmultNmax 1 --outFilterMismatchNmax 0 \
 --outFilterMatchNmin 14 --outFilterScoreMinOverLread 0 \
 --outFilterMatchNminOverLread 0 --outSJfilterOverhangMin 6 4 4 4 \
 --outFilterMultimapNmax 50 --seedSearchLmax 28 --seedSplitMin 9 \
 --alignIntronMin 20 --alignIntronMax 150 --alignSJoverhangMin 3 \
 --alignSJDBoverhangMin 2 --alignEndsType EndToEnd --twopassMode None
STAR version: 2.7.10a_alpha_220818   compiled: 2022-09-21
Oct 29 21:53:43 ..... started STAR run
Oct 29 21:53:43 ..... loading genome
Oct 29 21:53:43 ..... started mapping
Oct 29 21:57:02 ..... finished mapping
Oct 29 21:57:02 ..... finished successfully

[21]

Trace for the run that generated count panels for the MINUS strand of the figure:

bash-5.2$ coalispr region -c2 -r 271985, 279663 -st 3 -s2

  a1_1
        Skipped: 7818

  a1_2
        Skipped: 2810

  r1_1
        Skipped: 2145

  r1_2
        Skipped: 6073

  wt_0
        Skipped: 850

  wt_1
        Skipped: 1696

  wt_2
        Skipped: 2714

  c4_1
        Skipped: 1932

  c4_2
        Skipped: 733

  e2_1
        Skipped: 1200

  e2_2
        Skipped: 420

  re1_1
        Skipped: 3099

  re1_2
        Skipped: 2328

  re2_1
        Skipped: 5418

  re2_2
        Skipped: 9850

  re3_1
        Skipped: 10652

  re3_2
        Skipped: 18727

  re4_1
        Skipped: 5889

  re4_2
        Skipped: 20248

  re5_1
        Skipped: 5768

  re5_2
        Skipped: 3520

  r6_1
        Skipped: 487

  r6_2
        Skipped: 196
Sample            Library      cDNA
a1_1           9567157.67  74940.15
a1_2           8280898.83  49650.90
r1_1           3500283.50  56307.24
r1_2           5818569.31  66155.82
wt_0           1429203.39  29205.80
wt_1           2483356.90  43138.73
wt_2           2793111.66  46670.00
c4_1           1685970.08  44513.99
c4_2            828864.22  31669.26
e2_1            923545.79  39107.84
e2_2            354180.50  30570.28
re1_1          6038546.63  56257.64
re1_2          4826315.55  51447.41
re2_1          8137399.37  63894.66
re2_2          7681477.36  76206.80
re3_1         11405806.14  70610.67
re3_2         12266882.70  82820.17
re4_1          7813346.08  62386.71
re4_2         15429512.61  86797.03
re5_1          8570380.14  66718.64
re5_2          8382610.76  56833.83
r6_1            526533.19  29702.34
r6_2            119248.39  18477.06

Showing  counts of Library, cDNA reads for minus strand(s) of 2:271985-279663 23 samples, grouped by category.
Building figure...
Saving to file in /<path_to>/Burke-2019/Coalispr/figures/h99/svgfigures/regions

Figure finishing.

Showing library length distributions for minus strands of region 2:271985-279663 as bar graphs for 23 separate libraries, grouped by category.
Building figure...
Saving to file in /<path_to>/Burke-2019/Coalispr/figures/h99/svgfigures/lengthcounts

Figure finishing.

Showing cDNA length distributions for minus strands of region 2:271985-279663 as bar graphs for 23 separate libraries, grouped by category.
Building figure...
Saving to file in /<path_to>/Burke-2019/Coalispr/figures/h99/svgfigures/lengthcounts

Figure finishing.

[27] (1,2)

[Parambil-2023] describes a model and findings where poorly translating transcripts are detetcted/enriched on monosomes during a pioneering round of translation involving CBP (Cap-binding proteins) and EJC (Exon-junction complex) proteins of which some co-purify with SMEDWI-1, like CBP80, ALY/REF and POLD3P/SKAR. These proteins, like Eif4A3, Y14, MAGOH, and PABPN are not present on polysomes during normal translation, which is also the case for SMEDWI-1. Thus not stalling by collapsing ribosomes on polysomes but a translation defect at an earlier stage selects target transcripts in stem-cells of planarian Schmidtea mediterranea. SMEDWI-1 generates sense piRNAs directly from these transcripts, which also relies on the nuclease Zucchini that is bound to the mitochondrial outer membrane. In Cryptococcus there are no piRNAs and no obvious homologues for Zucchini, while siRNAs appear to be sense plus antisense. Still, the pioneering round of translation is coupled to nuclear transport of spliced mRNA given the EJC factors involved; these are conserved in Cryptococcus. In yeast, a kind of monosomes are formed prior to the final maturation step of 18S in the cytosol [Lebaron-2012] and pioneer rounds of translation linked to NMD are occurring on both CBP80/EJC bound mRNA exported from the nucleus and on eIF4E-bound mRNA translated in the cytosol [Gao-2005]. If similar translation checks happen in Cryptococcus, problematic transcripts could stall ribosomes and induce RNAi both prior to and during the modeled events (CBC-bound mRNAs can be occupied by more than one ribosome although not by as many as eIF4E-bound mRNA).