Bioinformatics of RNA structure and transcriptome regulation

Introduction and Lay Summary:

When the human genome sequence was released more than a decade ago, it came as a surprise to many that the number of protein-coding genes was not radically different from the corresponding gene count for the more humble organism Caenorhabditis elegans (C. elegans). The current gene counts are stunningly similar: 20313 for human (GRCh38.p5) versus 20447 for C. elegans (WBcel235). The gene count alone is thus a poor measure of the complexity of an organism.

The genome in any cell of a living organism is not a static entity, but generates diverse functional products as a function of time. These products define the status of that cell and also respond to internal and external changes. The primary products of the genome are transcripts (RNA sequences). These are typically not the final functional products, but need to be processed before they yield the final products (proteins and RNAs). From all we know, complex organisms like ours have developed more refined ways of processing their primary transcripts into different functional products according to the status of the corresponding cell.

Another surprise in the wake of the human genome sequencing project was the realisation that only a small fraction (<2%) of the genome encodes protein information. Moreover, many genes do not encode a protein product (25180 so-called RNA genes in GRCh38.p5), and even the primary transcripts of protein-coding genes contain a seemingly disproportionate fraction of non-coding nucleotides (introns, untranslated regions). This is surprising, given that introns are excised before the remaining, shortened RNA sequence is translated into the corresponding protein.

My group develops dedicated computational methods and algorithms to understand how the transcriptome is regulated. Our main goal is to understand the role that RNA structure and interactions between different transcripts play in regulating transcripts in the living cell. For this, we study transcriptome data from model organisms such as the fruit fly, the mouse and the human.

Beyond the one-dimensional view of transcripts:

More often than not, figures in textbooks illustrate the Central Dogma of Biology by depicting transcripts as linear molecules inside a eukaryotic cell, with transcription and splicing seemingly happening consecutively. What we know from many dedicated experiments, however, is that processes such as splicing, RNA editing and RNA structure formation can happen co-transcriptionally, i.e. while the transcript emerges from the genome. Like protein information, information on RNA structure can be directly encoded in the transcript itself. We thus expect that RNA structural features and RNA-RNA interactions are widely used for regulating gene expression at the transcript level.

Modelling RNA structures in vivo:

In order to devise computational methods for detecting the RNA structural features that are functionally relevant in vivo, it is worth acknowledging the complexity of the cellular environment and the impact this may have on the structure formation process (Lai, Proctor and Meyer, RNA 2013). By devising the new RNA secondary structure prediction program CoFold (Proctor and Meyer, Nucl. Acids Res. 2013), we showed that it is possible to capture the overall effects of the speed and directionality of transcription in vivo. Our method yields significantly improved predictions, especially for long transcripts (> 200 nt) such as ribosomal RNAs. We know already that the sequences of structured RNAs not only encode information on their final RNA structure, but also on transient structural features of their co-transcriptional folding pathway (Meyer and Miklós, BMC Mol Biol 2004).
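To make the idea of a distance-dependent folding bias concrete, here is a minimal sketch in Python. It is not CoFold: it uses a Nussinov-style base-pair maximisation instead of a thermodynamic energy model, and the function names, the exponential form of the weight and the parameters alpha and tau are illustrative assumptions. The only point it illustrates is that base pairs between positions far apart in the sequence, which can only form late during transcription, can be down-weighted relative to local pairs.

```python
import math

# Canonical and wobble base pairs, listed in both orientations.
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def pair_weight(d, alpha=0.5, tau=640.0):
    """Hypothetical weight of a base pair whose partners are d positions apart.

    Equals 1 at d = 0 and decays towards 1 - alpha for large d.
    alpha and tau are illustrative parameters, not fitted values.
    """
    return alpha * (math.exp(-d / tau) - 1.0) + 1.0


def max_weighted_pairs(seq, min_loop=3, alpha=0.5, tau=640.0):
    """Nussinov-style dynamic programme maximising the summed pair weights.

    score[i][j] holds the best total weight for the subsequence seq[i..j].
    min_loop is the minimum number of unpaired bases enclosed by a pair.
    """
    n = len(seq)
    if n == 0:
        return 0.0
    score = [[0.0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = score[i][j - 1]  # case 1: position j stays unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = score[i][k - 1] if k > i else 0.0
                    inner = score[k + 1][j - 1]
                    best = max(best, left + inner + pair_weight(j - k, alpha, tau))
            score[i][j] = best
    return score[0][n - 1]
```

With alpha set to 0 every pair has weight 1 and the function reduces to plain base-pair counting; with alpha > 0 long-range pairs contribute less, so competing local helices become relatively more favourable.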

Figure 1: Arc-plot for the HDV ribozyme made using R-Chie. Each arc represents one pair of base-paired alignment columns. Arcs and the alignment at the top show the alternative structure and the active structure; those at the bottom the inhibitory alternative structure. The left legend specifies the percentage of canonical base-pairs for each arc. The right legend colour-codes the nucleotides and specifies the evolutionary evidence supporting each arc.

It turns out that orthologous transcripts from related organisms also have similar co-transcriptional folding pathways and that distinct transient RNA structure features can be as conserved and functionally relevant as those of the final RNA structure (see Figure 1; Steif and Meyer, RNA Biology 2012; Zhu, Steif, Proctor and Meyer, Nucl. Acids Res. 2013; Zhu and Meyer, RNA Biology 2015). This has obvious implications for many state-of-the-art methods in RNA secondary structure prediction, as these typically assume that any given transcript folds into exactly one functional RNA structure. Transat, a probabilistic method that we developed earlier (Wiebe and Meyer, PLoS Comput. Biol. 2010), aims to address this problem and has allowed us to detect individual, conserved RNA secondary structure features of pseudoknotted structures, riboswitches and transient structures, which are otherwise notoriously difficult to predict.

RNA structure features involved in splicing regulation:

Viral genomes such as those of hepatitis C virus and HIV-1 are known to encode functional RNA structures in protein-coding regions, as one major constraint on these genomes is to remain short. We contributed early on to these studies by showing that these RNA structures can be reliably predicted provided the known protein context is explicitly taken into account (Pedersen, Meyer, Forsberg, Simmonds and Hein, Nucl. Acids Res. 2004; Pedersen, Forsberg, Meyer and Hein, Mol. Biol. Evol. 2004; Watts et al., Nature 2009). Functional RNA structures overlapping protein-coding regions, however, are not the preserve of viral genomes, but can also regulate the alternative splicing and translation of eukaryotic protein-coding genes, e.g. in Arabidopsis thaliana, mouse and human (Meyer and Miklós, Nucl. Acids Res. 2005; Schöning et al., Nucl. Acids Res. 2008). In order to explore the link between RNA structure and alternative splicing on a transcriptome-wide scale, we recently analysed tissue-specific high-throughput transcriptome data from the fruit fly. Using a new analysis pipeline that explicitly captures the requirement for double-stranded regions, we identified around 2000 novel editing sites as well as 244 regions where RNA editing and alternative splicing are likely to influence each other (see Figure 2).

Figure 2: (A) Genomic context of identified editing sites. (B) Distribution of conversion types for four tissue types. (C) Percentage of common editing sites between pairs of tissues. (Bottom) Gene CG5850 is differentially expressed between head (blue) and digestive system (red) and editing and splicing may affect each other. X-axis: exons of the gene, y-axis: number of reads normalized by library size. Arrows show editing sites. The purple box is predicted to be alternatively expressed.

The detected RNA editing sites are approximately three times more likely to occur in exons with multiple splice sites than in exons with unique splice sites, and conserved RNA structure features overlap 39% (96/244) of these regions (see Figure 3). We therefore conclude that local RNA structure features have the potential to regulate alternative splicing via structural changes induced by RNA editing (Mazloomian and Meyer, RNA Biology 2015).

Figure 3: (Top) Arc-plot for the highlighted region of the Cip4 gene containing a predicted, conserved RNA secondary structure overlapping RNA editing sites (red arrows) that could influence alternative splicing via structural changes. The left legend colour-codes the nucleotides according to the evidence supporting each arc, see also Figure 1. Figure made using R-Chie. (Bottom) Gene structure of the Cip4 gene with grey box highlighting the structure-containing part at the top.

Trans RNA-RNA interactions regulating the transcriptome:

RNAs not only have the potential to form RNA structure, but can also interact with other RNAs in trans. This involves the same simple structural building blocks, namely hydrogen bonds between complementary nucleotides ({G,C}, {A,U} and {G,U}). It is thus much more straightforward to evolve or design a desired RNA structure or trans RNA interaction using an RNA than a protein. We thus expect many trans RNA-RNA interactions beyond those already known (miRNA-mRNA, snoRNA-rRNA, etc.) to be discovered. We have shown how the power of the comparative approach can be harnessed for coding gene prediction, RNA gene prediction, RNA structure prediction and the prediction of trans RNA-RNA interactions. Based on our recent survey of methods for predicting de novo trans RNA-RNA interactions (Lai and Meyer, RNA Biology 2015), we intend to focus on computational methods that can better handle full-length transcripts and that can differentiate between different kinds of evolutionary constraints.
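The pairing rules mentioned above ({G,C}, {A,U} and {G,U}) are simple enough to write down directly. The following Python sketch is a toy illustration only, not one of the prediction methods surveyed: the function names are hypothetical, and the search is restricted to the longest ungapped, antiparallel stretch of consecutive base pairs between two RNAs, ignoring intramolecular structure, energies and evolutionary evidence.

```python
# Unordered nucleotide pairs that can form hydrogen-bonded base pairs.
CANONICAL = {frozenset("GC"), frozenset("AU")}
WOBBLE = {frozenset("GU")}
ALLOWED = CANONICAL | WOBBLE


def can_pair(x, y):
    """True if nucleotides x and y form a canonical or wobble base pair."""
    return frozenset((x, y)) in ALLOWED


def longest_ungapped_duplex(a, b):
    """Longest run of consecutive base pairs between RNAs a and b.

    Both sequences are given 5'->3'. Because the two strands of a duplex
    run antiparallel, b is reversed before the comparison.
    Returns (length, segment of a, segment of b), segments written 5'->3'.
    """
    rb = b[::-1]
    best = (0, "", "")
    prev = [0] * (len(rb) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(rb) + 1)
        for j in range(1, len(rb) + 1):
            if can_pair(a[i - 1], rb[j - 1]):
                # Extend the run of pairs ending at (i - 1, j - 1).
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    length = cur[j]
                    best = (length, a[i - length:i], rb[j - length:j][::-1])
        prev = cur
    return best
```

For example, `longest_ungapped_duplex("CCCGGAUU", "AAUCC")` reports a five-base-pair duplex between `GGAUU` and `AAUCC`. Realistic methods additionally have to weigh such trans interactions against the structures each transcript can form on its own.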

In the last few years, transcriptomics has made dramatic advances through the invention of new transcriptome-wide investigation protocols (e.g. CLASH, SeqZip) as well as a new generation of sequencing techniques (e.g. nanopore and SMRT sequencing) that allow for significantly longer read lengths. We want to combine the dedicated computational analyses and methods that we have developed with transcriptome-wide data sets generated via state-of-the-art sequencing and experimental protocols to gain a global and detailed understanding of how RNA structure and RNA-RNA interactions regulate eukaryotic gene expression at the transcript level.

Future collaborations:

I joined the MDC in early 2016 from the University of British Columbia in Vancouver, Canada, and look forward to setting up collaborations with new colleagues at the MDC and in the wider scientific community in Berlin, Germany and Europe. If you are interested, just get in touch via email.