Postdoctoral scholar Briana Van Treeck examines how scientists are exploiting the unique abilities of an enzyme to increase throughput and decrease bias in RNA sequencing.
Innovations in biotechnology often exploit elegant solutions already devised by nature. For example, green fluorescent protein (GFP), the protein responsible for the green glow of jellyfish under blue light, was first observed in 1962 and has since become one of the most widely used tools in bioscience. The identification and isolation of GFP opened the door for visualizing any protein of interest within the cell. More recently, CRISPR technology, the powerful genome editing tool co-discovered by UC Berkeley’s Jennifer Doudna, was adapted from a defense mechanism in bacteria that confers immunity against viruses. Because different viral sequences can be targeted with remarkable ease by the predictable change of a few letters in the bacterial immune system code, the system was quickly re-tasked to target cellular genes of interest instead. Again and again, progress has relied on creative ways to exploit, tame, and harness the unique abilities discovered by scientists in the lab.
Recent research from Kathleen Collins’ lab at UC Berkeley has once again pushed the biotechnological frontier forward through the ingenious manipulation of activities found in nature. This time, the unique capabilities of a single enzyme were co-opted to alleviate biases in the production of libraries used for high-throughput sequencing.
Researchers sequence RNA to see which genes are turned “on” or “off” within a given sample, what their level of expression is, and how their expression changes over time. These sequences are used to study a host of scientific mysteries, from how a cancerous mutation can alter various pathways in the cell to how an organism responds to changes in its environment. It has been a consistent challenge to obtain reliable data from samples composed of short sequences or samples from which very little RNA can be extracted. With the Collins lab’s advance in producing RNA sequencing libraries that increase throughput and avoid bias, scientists will have access to data that has previously been tantalizingly out of reach.
What is RNA—and why sequence it?
The central dogma of biology describes the process by which the information stored in DNA, with an alphabet of only four letters, is converted to messenger RNA, which can then be translated into a functional protein product. While the information stored in DNA is always present, the abundance of any given RNA message can fluctuate in response to different cellular stimuli. By reading the RNA messages present in a sample at any given time, through a process called RNA sequencing, researchers are able to better understand how our cells respond to genetic mutations, medicinal treatments, and other cellular disturbances.
While some researchers are focused solely on these messenger RNAs, it has long been appreciated that in some cases RNA itself is the functional product and will never be deciphered into a protein. Consequently, adequately reading these “working” RNA molecules is also of high priority.
Due to the vast array of roles RNA can play in the cell, RNA sequencing technology has become a pivotal tool used by researchers to begin to tease apart the inner workings of the cell. Sequencing technology has progressed quickly while also becoming cheaper and more accessible. The outcome is the production of enormous datasets that have been instrumental in driving new biological discoveries. However, don’t let the accessibility of RNA sequencing fool you: there is still plenty of room for improvement.
Bias in RNA sequencing
Lucas Ferguson, a third-year graduate student working jointly in the Collins and Nicholas Ingolia labs, spends a lot of time analyzing RNA sequencing data and loves teasing apart all the information that can be gleaned from a single sequencing experiment. Ferguson is quick to defend the powerful tool that is RNA sequencing, but he is also well aware of the technical biases that exist in these datasets. “The process of converting RNA to sequence-able material is such a journey that you really have to consider how this process is shaping your data,” Ferguson says.
When asked where RNA sequencing can be biased, Heather Upton, a postdoctoral researcher in the Collins lab, says: “I think the more relevant question to ask here is where isn’t there bias in RNA sequencing?”
Upton explains that even if you get rid of all human-introduced error, there is still a substantial amount of bias that exists simply because of the ways the molecules in the tube interact with each other. Multiple steps must occur to take the RNA expressed in the cell and convert it into a form that can be sequenced and properly analyzed. Enzymes involved in each step of the process display preferences for some sequences over others, leading to disproportionate representation at each subsequent step. In addition, each purification step leads to sample loss that can have a compounding effect on this bias.
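To see how small per-step preferences can compound, consider a rough back-of-the-envelope sketch. The per-step efficiencies below are invented purely for illustration; they are not measurements from the Collins lab or from any real library prep, and the function name is hypothetical.

```python
# Illustrative only: these efficiencies are made-up numbers, not measured values.

def overall_recovery(step_efficiencies):
    """Multiply per-step capture efficiencies into one overall recovered fraction."""
    recovered = 1.0
    for efficiency in step_efficiencies:
        recovered *= efficiency
    return recovered

# Hypothetical efficiencies for two RNAs across four library-prep steps.
favored_rna = [0.9, 0.9, 0.9, 0.9]      # an RNA the enzymes handle well
disfavored_rna = [0.6, 0.6, 0.6, 0.6]   # e.g. a structured or modified RNA

favored = overall_recovery(favored_rna)
disfavored = overall_recovery(disfavored_rna)
fold_skew = favored / disfavored

print(f"favored: {favored:.3f}, disfavored: {disfavored:.3f}, skew: {fold_skew:.1f}x")
```

Under these assumed numbers, a 1.5-fold preference at each of four steps grows to roughly a 5-fold skew in the final library, even though both RNAs started at equal abundance.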
The Collins lab was able to see this bias head-on when three completely different projects led to the study of tRNA fragments, the processed forms of a type of small functional RNA molecule that is characteristically difficult to capture in sequencing experiments.
“The sequencing library preps did not recover sizes that corresponded to what we could see were the dominant sized molecules of the samples we wanted to sequence, so we knew we were missing information,” Collins explains. In this case, the data was not an accurate representation of the sample because an RNA population of interest was severely underrepresented.
In the RNA sequencing world, it is well accepted that the capture of small RNAs is particularly biased due in part to their highly structured nature and rampant chemical modifications. In addition, the normalization methods that have been developed to correct for variation in long RNA datasets cannot generally be extended to small RNA datasets.
“At the end of the day, every sequencing experiment is just like any statistical survey,” Ferguson says. “When the Census Bureau wants to survey the state of American households, they don’t have to survey every single home, but they need to be reasonably certain whoever they do survey yields a fair and unbiased estimation of the population mean.” With this in mind, the Collins lab sought to better represent small RNAs in sequencing data.
The right enzyme for the job
In order to generate massive amounts of sequencing data from a sample of RNA, a library must be created in which each RNA is converted to complementary DNA (cDNA) and flanked with specific universal sequences that allow for data generation from all molecules in parallel. Regardless of the sequence of events used to prepare your cDNA libraries, at some point you will need to use a reverse transcriptase (RT), an enzyme capable of converting RNA to DNA.
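As a toy illustration of what a library molecule looks like, the sketch below copies an RNA into its complementary DNA and attaches flanking sequences. It is a simplified model, not the Collins lab’s protocol: the function names are hypothetical, and the flank sequences are short placeholders rather than real sequencing adapters.

```python
# Toy model of cDNA library construction; names and flank sequences are placeholders.

# Base-pairing rules for copying RNA into DNA.
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}

def reverse_transcribe(rna):
    """Return the cDNA strand (written 5'->3') complementary to an RNA written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))

def build_library_molecule(rna, left_flank, right_flank):
    """Flank the cDNA copy with the universal sequences shared by every library molecule."""
    return left_flank + reverse_transcribe(rna) + right_flank

rna = "AUGGCU"
cdna = reverse_transcribe(rna)
molecule = build_library_molecule(rna, left_flank="ACGT", right_flank="TTGG")

print(cdna)      # AGCCAT
print(molecule)  # ACGTAGCCATTTGG
```

Because every molecule carries the same two flanks regardless of the RNA in the middle, a sequencer can process all of them in parallel with a single set of primers.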
Historically, the RTs recruited for this work are of viral origin; these viruses require RT activity as part of their life cycle. “Because a majority of commercially used RTs are virally derived, they have lower fidelity in order to introduce mutations as they copy their unstructured RNAs so as to encourage population variance,” Upton says. “While advances have been made that help offset these drawbacks, industry can only do so much to improve the fidelity and processivity of viral enzymes.”
Viral RTs also have difficulty dealing with the presence of secondary and tertiary structure as well as the RNA modifications that plague many small RNAs. This contributes to an underrepresentation of highly structured or modified RNAs in sequencing data, like the tRNA fragments the Collins lab was initially pursuing.
Luckily, viruses aren’t the only natural sources of RT enzymes. There remains a wide breadth of understudied RTs that have yet to be utilized for biotechnological applications. One such RT that stood out to the Collins lab was from a eukaryotic retrotransposon. Retrotransposons, a class of selfish genetic element, are capable of autonomously moving to new sites throughout the genome via the reverse transcription of retroelement RNA and insertion into the genome. Importantly, this retrotransposon RT has evolved to copy its own highly structured and long RNA. Failure to copy the sequence in full means that the retrotransposon will be unable to continue propagating. Using this RT in lieu of a viral RT eliminates the loss of sample complexity due to RNA structure. In addition, the RNA modifications that stall viral RTs do not stall this enzyme.
The alleviation of these biases through the use of a retroelement RT was predicted from the evolutionary constraints placed on this RT in comparison to viral RTs. But this RT also has a second, rather unique characteristic that ended up being revolutionary for reworking the production of RNA sequencing libraries. It is able to jump from one molecule of RNA or DNA to another, pasting the copied DNA products together in the process. This activity would potentially allow the universal sequences that flank the cDNA sequence to be added by the same enzyme, removing multiple steps, and multiple sources of bias, from the library preparation protocol. “These RTs were just calling out to Heather to tackle them into biochemical submission,” Collins reflects.
Initial attempts at using the RT produced as much bias as other methods. The enzyme had a preference for jumping to RNAs that ended in certain letters over others, leading to unequal conversion of RNA to cDNA. “While screening different conditions, we found that certain buffers can toggle the enzyme from reverse transcriptase to non-specific terminal transferase,” Upton recalls. “This was our ‘eureka’ moment.”
This terminal transferase activity is responsible for adding a single letter to the end of the RNA sequence. The addition of the same letter to the end of all RNAs allowed Upton to steer the reaction equilibrium of subsequent RNA copying to cDNA. This single discovery allowed her to effectively eliminate the bias they originally observed. “Finding the terminal transferase activity was key,” Upton says.
You oughta be using OTTR
The recently developed technique takes advantage of the many characteristics of this new RT to add the universal left and right flanking sequences in the same step in which the RNA is copied into cDNA for the library. This has allowed the entire workflow to be collapsed into a single tube, so there is no need for purification between steps and consequently no sample loss from a gel extraction or column purification. “It’s early days, but the fact that this works and we can do this entire process in a single tube is remarkable,” Ferguson says excitedly.
Sequencing libraries prepared with this method have been termed ordered two-template relay (OTTR) cDNA libraries for RNA-seq. Collins has always appreciated the utility of an acronym that is easy to say and that also paints a telling picture. Collins says, “‘Otter’ is not only easy to say, but the animal, a river otter, has lithe grace swimming from one spot to the next.”
Tip of the RNA iceberg
The Collins lab is optimistic about the new questions that will be answered through the use of this new library prep method. “We started out hoping to make RNA-sequencing better, faster, and cheaper. In the process, we interacted with a moderate number of academics and industry scientists to get an idea of the problems the labs were facing,” Upton says. “Hopefully this methodology allows labs to get information from some tissue or sample that they’ve never been able to recover before. It never ceases to amaze me how creative scientists can be.”
The ultimate goal is to keep the cycle going. Advances in basic biology often inform advances in biotechnology in unforeseen ways. No one could predict that the study of bacterial immune systems and selfish genetic elements would lead to advancements in gene editing and RNA-sequencing, respectively. In return, biotechnological advances allow scientists to observe aspects of biology that had been previously overlooked or simply unseen.
Collins is excited to give scientists, and people in general, an appreciation of the diversity, abundance, and complexity of RNA. “Even at the current tip of the iceberg stage, we aren’t capturing the full information content of RNA modification and processing,” Collins says. “Likely even those of us who study RNA are unaware of the true complexity of the modern-day RNA world.”