Herein, we present a new and fully automated miRNA pipeline, mirMachine that 1) can identify known and novel miRNAs more accurately and 2) is fully automated and freely available. Users can now execute a short submission script to run the fully automated mirMachine pipeline.
Of different types of noncoding RNAs, microRNAs (miRNAs) have arguably been in the spotlight over the last decade. As post-transcriptional regulators of gene expression, miRNAs play key roles in various cellular pathways, including both development and response to a/biotic stress, such as drought and diseases. Having high-quality reference genome sequences enabled identification and annotation of miRNAs in several plant species, where miRNA sequences are highly conserved. As computational miRNA identification and annotation processes are mostly error-prone processes, homology-based predictions increase prediction accuracy. We developed and have improved the miRNA annotation pipeline, SUmir, in the last decade, which has been used for several plant genomes since then.
This study presents a fully automated, new miRNA pipeline, mirMachine (miRNA Machine), by (i) adding an additional filtering step on the secondary structure predictions, (ii) making it fully automated, and (iii) introducing new options to predict either known miRNA based on homology or novel miRNAs based on small RNA sequencing reads using the previous pipeline. The new miRNA pipeline, mirMachine, was tested using The Arabidopsis Information Resource, TAIR10, release of the Arabidopsis genome and the International Wheat Genome Sequencing Consortium (IWGSC) wheat reference genome v2.
Advances in next generation sequencing technologies have widened the understanding of RNA structures and regulatory elements, revealing functionally important non-coding RNAs (ncRNAs). Among different types of ncRNAs, microRNAs (miRNAs) constitute a fundamental regulatory class of small RNAs with a length between 19 and 24 nucleotides in plants1,2. Since the discovery of the first miRNA in the nematode Caenorhabditis elegans3, the presence and the functions of miRNAs have been studied extensively in animal and plant genomes as well4,5,6. miRNAs function by targeting mRNAs for cleavage or translational repression7. Accumulating evidence has also shown that miRNAs are involved in a wide range of biological processes in plants including growth and development8, self-biogenesis9, and several biotic and abiotic stress responses10.
In plants, miRNAs are initially processed from long primary transcripts called pri-miRNAs11. These pri-miRNAs generated by RNA polymerase II inside the nucleus are long transcripts forming an imperfect fold-back structure12. The pri-miRNAs later undergo a cleavage process to produce endogenous single-stranded (ss) hairpin precursors of miRNAs called pre-miRNAs11. The pre-miRNA forms a hairpin-like structure wherein a single strand folds into a double-stranded structure to excise an miRNA duplex (miRNA/miRNA*)13. Dicer-like protein cuts both strands of the miRNA/miRNA* duplex, leaving 2-nucleotide 3'-overhangs14,15. The miRNA duplex is methylated inside the nucleus, which protects the 3'-end of the miRNA from degradation and uridylation activity16,17. A helicase unwinds the methylated miRNA duplex after export and exposes the mature miRNA to the RNA-induced silencing complex (RISC) in the cytosol18. One strand of the duplex is mature miRNA incorporated into RISC , whereas the other strand, miRNA*, is degraded. The miRNA-RISC complex binds to the target sequence leading to either mRNA degradation in case of full complementarity or translational repression in case of partial complementarity13.
Based on the expression and biogenesis features, guidelines for miRNA annotation have been described15,19. With the defined guidelines, Lucas and Budak developed the SUmir pipeline to perform a homology-based in silico miRNA identification in plants9. The SUmir pipeline was composed of two scripts: SUmirFind and SUmirFold. SUmirFind performs similarity searches against known miRNA datasets through National Center for Biotechnology Information (NCBI) Basic Local Alignment Search tool (BLAST) screening with modified parameters to include hits with only 2 or fewer mismatches and to avoid bias towards shorter hits (blastn-short -ungapped -penalty -1 -reward 1). SUmirFold evaluates the secondary structure of the putative miRNA sequences from BLAST20 results using UNAfold21. SUmirFold differentiates miRNAs from small interfering RNAs by the identification of the characteristics of hairpin structure. Moreover, it differentiates miRNAs from other ssRNAs such as tRNA and rRNA by the parameters, minimum fold energy index > 0.67 and GC content of 24-71%. This pipeline has been recently updated by adding two additional steps to (i) increase sensitivity, (ii) increase annotation accuracy, and (iii) provide genomic distribution of the predicted miRNA genes22. Given the high conservation of plant miRNA sequences23, this pipeline was originally designed for homology-based miRNA prediction. Novel miRNAs, however, could not be accurately identified with this bioinformatics analysis as it heavily relied on sequence conservation of miRNAs between closely related species.
This paper presents a new and fully automated miRNA pipeline, mirMachine that 1) can identify known and novel miRNAs more accurately (for example, the pipeline now uses sRNA-seq-based novel miRNA predictions as well as homology-based miRNA identification) and 2) is fully automated and freely available. The outputs have also included the genomic distributions of the predicted miRNAs. mirMachine was tested for both homology-based and sRNA-seq-based predictions in wheat and Arabidopsis genomes. Although initially released as free software, UNAfold became a commercial software in the last decade. With this upgrade, the secondary structure prediction tool was switched from UNAfold to RNAfold so that mirMachine can be freely available. Users can now execute a short submission script to run the fully automated mirMachine pipeline (examples are provided at https://github.com/hbusra/mirMachine.git).
1. Software dependencies and installation
2. The mirMachine setup and testing
3. Homology-based miRNA identification
4. Novel miRNA identification
5. Advance parameters
NOTE: The defaults are defined for all the parameters except for the genome file and the input miRNA file.
The miRNA pipeline, mirMachine, described above was applied to the test data for the fast evaluation of the performance of the pipeline. Only the high-confidence plant miRNAs deposited at miRBase v22.1 were screened against the chromosome 5A of IWGSC wheat RefSeq genome v224. mirMachine_find returned 312 hits for the nonredundant list of 189 high-confidence miRNAs with a maximum of 1 mismatch allowed (Table 1). mirMachine_fold classified 49 of them as putative miRNAs depending on the secondary structure evaluation. The highest represented group of miRNAs was miR9666 with a total of 18 miRNAs identified (Figure 1). Some miRNAs shared the same mature miRNA, but processed from a different pre-miRNA sequence. These miRNAs were renamed by the miRNA family name followed by a unique number, e.g., miR156-5p-1 and miR156-5p-2. Among the 49 putative miRNAs, 20 non-redundant mature miRNA sequences were identified. Some miRNAs can be transcribed from more than one locus resulting in a higher number of miRNAs represented. In the test data, miR9666-3p-5 was represented twice: one on the sense strand (at 602887137) and the other on the antisense strand (at 542053079). All locations are provided in the GitHub under the TestData output file named mature_high_conf_v22_1.fa.filtered.fasta.results.tbl. hairpins.tbl.out.tbl.
Expression evidence in one plant genome is sufficient, given the conservation of miRNAs in plants; however, a high-confidence miRNA dataset only provides a limited amount of data. Therefore, it is the user's preference to use the high-confidence and/or experimentally validated miRNAs as the reference dataset and skip the expression validation step, or to use all plant miRNAs available as the reference dataset and look for the expression evidence afterwards. Here, as the high-confidence miRNAs were used as the reference set, which had been validated experimentally in one of the plant genomes, the expression validation step was skipped for the test data.
mirMachine was benchmarked using monocot and dicot plants including Arabidopsis thaliana (Arabidopsis, TAIR10 release) and Triticum aestivum (wheat, IWGSC RefSeq v2). The performance of the homology-based and the sRNA-seq-based predictions was evaluated, and the results were compared with the miRDP225, an NGS-based miRNA prediction tool. Homology-based predictions were executed using the non-redundant list of plant mature miRNA sequences deposited at the miRbase v2226. sRNA-seq-based predictions were executed using the publicly available datasets; GSM2094927 for Arabidopsis and GSM1294661 for the wheat. In addition to raw results, the homology-based predictions were filtered for the expression evidence of mature miRNA and miRNA star sequences using the same sRNA-seq datasets.
Figure 2 shows the performance of each tool and the mirMachine settings on the two species. Sensitivity was calculated as the total number of known miRNAs identified divided by the total number of miRNAs identified. The results showed that mirMachine outperformed miRDP2 in terms of sensitivity and the true positive predictions in the Arabidopsis data. For the wheat data, mirMachine homology-based prediction, supported by expression evidence, provided better sensitivity than miRDP2. For both the genomes, miRDP2 predicted higher number of true positives compared to mirMachine sRNA-seq and homology-based predictions with expression evidence. It should be noted that miRDP2 lowers the expression threshold (RPM, reads per million) from 10 to 1 for the prediction of known miRNAs, resulting in higher true positive predictions. In general, the mirMachine can be used for the identification of both novel and known miRNAs. One advantage of the mirMachine is its ability to predict genome-wide distribution of the putative miRNAs without a limitation of specific tissues and conditions. Finally, the mirMachine is user-friendly and provides flexibility to adjust parameters such as number of hits, mismatches, length of miRNAs, and RPMs for specific research purposes. Taken together, the mirMachine provides accurate predictions for the putative miRNAs in the transcriptomes and the genomes of the plants.
Figure 1: The distribution of miRNA families identified from the chromosome 5A of the IWGSC wheat reference genome v2. Data labels show the miRNA family and the number of miRNAs belonging to each miRNA family. Abbreviations: miRNA = microRNA; IWGSC = International Wheat Genome Sequencing Consortium. Please click here to view a larger version of this figure.
Figure 2: Performance assessment of the mirMachine. Comparisons of the sensitivity and the total number of known miRNAs predicted (true positives) are shown for the mirMachine with homology-based and sRNA-seq-based predictions and the miRDP2 software. Abbreviation: miRNA = microRNA. Please click here to view a larger version of this figure.
Genome | Genome Size | Reference miRNA dataset | mirMachine_find hits | mirMAchine_fold hits | # of miRNA families |
Test data | ~0.7 Gb | 189 | 312 | 49 | 9 |
Chr5A |
Table 1: Statistics of the mirMachine. Test data are from the chromosome 5A of the IWGSC wheat reference genome v2. Abbreviations: miRNA = microRNA; IWGSC = International Wheat Genome Sequencing Consortium.
Our miRNA pipeline, SUmir, has been used for the identification of many plant miRNAs for the last decade. Here, we developed a new, fully automated, and freely available miRNA identification and annotation pipeline, mirMachine. Furthermore, a number of miRNA identification pipelines including, but not limited to the previous pipeline, were dependent on UNAfold software21, which became a commercial software over time, although once being freely available. This new and fully automated mirMachine is no longer dependent on the UNAfold; instead, the freely available RNAfold from the ViennaRNA package27 is used for secondary structure prediction. Additionally, all scripts for the mirMachine were gathered in a bash script with adjustable parameters to make mirMachine a fully automated and freely available miRNA prediction and annotation tool.
The mirMachine benefited from the characteristics of plant miRNAs and their biogenesis. As opposed to animal pre-miRNAs, plant pre-miRNAs are variable in length and structural features15. Consequently, a criterion has been set for the identification of plant miRNAs depending on the characteristics of the miRNAs and their biogenesis15. No cut-off was set for the pre-miRNA length as the length of plant pre-miRNAs can vary remarkably and could be hundreds of nucleotides long. Instead, pri-miRNA structure folding, which was limited to ~700 bp in length, was first evaluated. Later, pre-miRNA sequence was predicted from the candidate pri-miRNA sequences and evaluated for proper folding statistics.
Many plant genomes, especially agronomically important cereals such as wheat and barley, possess highly repetitive genomes28,29,30. Other than the high-repeat content, polyploidy is observed in some of these plants24, introducing additional complexities to the in silico identification and characterization of the miRNA structures. The repeats are a major source for the production of siRNAs31, which resemble miRNAs in their mature forms; however, they differ in biogenesis and function32,33. It is extremely difficult to eliminate siRNAs from the candidate miRNA lists. In fact, the most widely used miRNA database, the miRBase26, has been reported to contain large numbers of siRNAs annotated falsely as miRNAs34,35. Based on the differences in their biogenesis, the mirMachine filters the small RNAs that form a perfect pair with the antisense strand as siRNAs and places those sequences into the suspect table. Additionally, the mirMachine has the -n option, which defines the maximum number of hits to filter the candidate RNAs as siRNAs.
Expression evidence is required to validate all the miRNAs predicted in silico. As miRNAs are highly conserved among plant genomes, expression evidence in one of the plant genomes should be sufficient to confirm the validity of the predicted miRNA. The use of high-confidence, mature miRNA sequences in the initial screening process has the advantage of providing expression evidence for all the predicted miRNAs; however, the short list of initial miRNA dataset limits the prediction of a comprehensive set of miRNAs in a genome. Alternatively, a full set of plant miRNAs deposited in the miRBase database can be used as an initial dataset instead of filtering for high-confidence miRNAs. Users are advised to look for expression evidence through expressed sequence tags, miRNA microarrays, or small RNA sequencing data for at least one of the plant genomes if any expression data are not available for the species of interest.
Homology-based miRNA predictions can help elucidate genome-wide distribution of the known family of miRNAs. These miRNAs are likely to be expressed in certain tissues and conditions. A drawback of homology-based predictions is the lack of ability to identify novel miRNA families. In contrast, sRNA-seq-based predictions could identify novel miRNAs with a cost of a high number of false positives. Therefore, the choice of the best approach is up to the users and the research of interest. The mirMachine presented here can help identification of the miRNAs based on either homology to known miRNAs or sRNA sequencing.
https://www.ncbi.nlm.nih.gov/books/NBK279671/ | Blast+ | ||
https://github.com/hbusra/mirMachine.git | mirMachine submission script | ||
https://www.perl.org/get.html | Perl | ||
https://www.tbi.univie.ac.at/RNA/ | RNAfold | ||
Arabidopsis TAIR10 | |||
Triticum aestivum (wheat, IWGSC RefSeq v2) |