Exact Palindrome

We derived the longest exact palindromes for the three sets of RNA sequences. The results are depicted in the following figure, where the x-axis and y-axis denote the length and the number of the longest palindromes, respectively.

The length distribution of the longest exact palindrome for each data set.

Approximate Palindrome

We computed the longest approximate palindromes with k=1 for the three sets of sequences. The results are depicted in the following figure, where the x-axis and y-axis denote the length and the frequency of the longest palindromes, respectively.

The distribution of the length of the longest approximate palindromes with k=1 for each data set.
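The search for the longest exact (k=0) and approximate (k=1) palindromes can be illustrated with a minimal sketch. It is not the procedure used to produce the figures. It assumes that a palindrome is a window whose mirrored positions form base pairs (A-U, C-G, or the G-U wobble), that k is the maximum number of mirrored positions allowed to violate pairing, and that only even-length windows are considered. The function name `longest_palindrome` and the constant `PAIRS` are hypothetical.

```python
# Assumed pairing rule: Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def longest_palindrome(seq, k=0, pairs=PAIRS):
    """Return (start, length) of the longest even-length window of `seq`
    in which at most `k` mirrored positions fail to pair.

    Ties are resolved in favour of the leftmost centre. Runs in O(n^2).
    """
    n = len(seq)
    best_start, best_len = 0, 0
    # Each centre lies between positions centre-1 and centre.
    for centre in range(1, n):
        left, right = centre - 1, centre
        mismatches = 0
        while left >= 0 and right < n:
            if (seq[left], seq[right]) not in pairs:
                mismatches += 1
                if mismatches > k:
                    break  # this pair would exceed the mismatch budget
            left -= 1
            right += 1
        # The accepted window is seq[left + 1 : right].
        length = right - left - 1
        if length > best_len:
            best_start, best_len = left + 1, length
    return best_start, best_len


if __name__ == "__main__":
    rna = "GGAAAUUACC"
    print(longest_palindrome(rna, k=0))  # exact
    print(longest_palindrome(rna, k=1))  # one mismatch allowed
```

Because windows sharing a centre are nested, the mismatch count can only grow as the window expands, so stopping at the (k+1)-th mismatch yields the largest admissible window for that centre.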

KS Test

Given the frequency and length distributions of exact and approximate palindromes for the three types of RNA sequences, we are interested in knowing whether the distributions are RNA type-specific. Therefore, the Kolmogorov-Smirnov (KS) test was employed to conduct a pairwise comparison study. The KS test examines the difference between two cumulative distributions. It rejects the null hypothesis of no difference between the two cumulative distributions if the p-value is less than 0.05. The KS test was performed using the *MATLAB* *‘kstest’* function, and the results are shown in the table. The *H* value represents the hypothesis test result: *H* = 1 indicates rejection of the null hypothesis at the significance level of 0.05, and *H* = 0 indicates a failure to reject the null hypothesis at the same significance level.
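A pairwise comparison of this kind can be sketched in a few lines of Python. The sketch is illustrative and does not reproduce the MATLAB computation. It assumes a two-sample KS test and uses the asymptotic Kolmogorov series for the p-value, which is only approximate for small samples, so its p-values may differ from those of other implementations. The names `ks_two_sample` and `h_value` are hypothetical.

```python
import math
from bisect import bisect_right


def ks_two_sample(x, y):
    """Return (D, p) for a two-sample KS comparison of samples x and y.

    D is the largest gap between the two empirical CDFs; p is the
    asymptotic approximation of P(D_observed >= D) under the null.
    """
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    # The ECDF gap can only change at observed values, so check each one.
    d = max(
        abs(bisect_right(xs, v) / n - bisect_right(ys, v) / m)
        for v in xs + ys
    )
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        # The tail probability is numerically indistinguishable from 1
        # here, and the alternating series converges poorly.
        return d, 1.0
    p = 2.0 * sum(
        (-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2)
        for j in range(1, 101)
    )
    return d, min(1.0, max(0.0, p))


def h_value(x, y, alpha=0.05):
    """Return 1 if the null hypothesis is rejected at level alpha, else 0."""
    _, p = ks_two_sample(x, y)
    return int(p < alpha)


if __name__ == "__main__":
    lengths_a = [1, 2, 3, 4, 5]    # made-up palindrome lengths
    lengths_b = [6, 7, 8, 9, 10]   # made-up palindrome lengths
    print(ks_two_sample(lengths_a, lengths_b), h_value(lengths_a, lengths_b))
```

The input lists in the usage example are made-up values, not data from the study.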

Test of homogeneity for the length and frequency distributions of exact and approximate palindromes for the three types of RNA sequences.

For the length distribution, *H* = 0 in all cases, which indicates that there are no significant differences among the three types of RNA sequences. On the right-hand side of the table, the results imply that the frequency distribution of the palindromes for miRNA is quite different from those for the fusion genes' mRNA and lncRNA. This can be explained by the length distribution graph of the longest approximate palindromes with k=1: the frequency distribution exhibits a decaying behavior for miRNAs, while the fusion genes' mRNAs and lncRNAs show an approximately bell-shaped distribution.

The A-U richness of RNA palindromes

We examined the assumption that the palindromes in RNA sequences are also A-U rich. The probabilities of occurrence for the three types of base pairings, i.e. (A, U), (C, G) and (U, G), in the palindromes of the three types of RNA sequences are listed in Table 8. We found that the proportion of the (A, U) pair is consistently higher than the average (33.33%) in all three types of RNA sequences. Furthermore, in order to validate the A-U richness hypothesis, we recorded the number of sequences whose longest palindrome has an A-U pair ratio higher than the average. It was found that 54.57%, 64.27% and 61.69% of the sequences are A-U rich for fusion gene mRNA, miRNA and lncRNA, respectively. These two results validate the A-U rich assumption, which is in line with the results of a previous study.

The average percentage of each base pair in the longest palindrome of the three types of RNA sequences.
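The base-pair composition of a single palindrome, and the A-U richness criterion built on it, can be illustrated with a short sketch. It assumes that pairs are read off mirrored positions of the palindrome (first with last, second with second-to-last, and so on) and that mirrored positions that do not form one of the three pair types, such as mismatches in an approximate palindrome, are ignored. The names `pair_fractions` and `is_au_rich` are hypothetical.

```python
from collections import Counter

# The three pair types considered, written with bases in alphabetical order.
PAIR_TYPES = ("AU", "CG", "GU")


def pair_fractions(palindrome):
    """Return the fraction of A-U, C-G and G-U pairs among the
    mirrored positions of `palindrome` that form one of those pairs."""
    counts = Counter()
    for t in range(len(palindrome) // 2):
        # Sorting makes (U, A) and (A, U) count as the same pair type.
        pair = "".join(sorted((palindrome[t], palindrome[-1 - t])))
        if pair in PAIR_TYPES:
            counts[pair] += 1
    total = sum(counts.values())
    return {p: (counts[p] / total if total else 0.0) for p in PAIR_TYPES}


def is_au_rich(palindrome, average=1 / 3):
    """True if the A-U fraction exceeds the average share of one pair type."""
    return pair_fractions(palindrome)["AU"] > average


if __name__ == "__main__":
    print(pair_fractions("GGGAAAUUUCCC"))
    print(is_au_rich("GGGAAAUUUCCC"))
```

Applying `is_au_rich` to the longest palindrome of every sequence in a data set and taking the share of `True` results would give a percentage of the kind reported above, under the stated assumptions.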