Species Identification on a Small Sample Size of RNA Sequences by Combined Method of Noise Filtering with L2-Norm

This paper proposes a method that combines noise filtering with L2-norm distance to classify RNA sequences for species identification, including cases where only a small sample of nucleic acid sequences is available. The method amends and extends the study by Hu et al. in 2011 [1]. We verified the method on biological samples of "slipper orchids" and their hybrids in a species identification test. The results show that, after applying our method, we can distinguish the parentage of a hybrid among a set of "slipper orchid" samples.


Introduction
This method is mainly based on the L2-norm distance: it classifies amino acid sequences, applies noise-filtering pre-processing to the non-A, U, C, G characters produced by electrophoresis analysis, and checks the parentage of hybrids. This study found that the L2-norm distance can easily and efficiently differentiate the species relationships of "slipper orchid" samples. It modifies the study of Hu et al. [1], which explored sequence analysis but did not mention a problem that may arise: with a small sample of RNA sequences, artificial intelligence methods may fail to classify the sequences by mathematical calculation [1,12]. Pre-processing and noise filtering can resolve garbled electrophoresis output and effectively address the problem of automated RNA sequencing analysis [14,16,17]. Consequently, the method can be extended to further species and truly applied to biological classification, as in Table 2-3.
In the past, the "morphological" observation method [3] was widely adopted for identifying animal and plant species. However, the conditions necessary for such identification are very strict: the complete appearance of the animal or plant, or the characteristic parts of that type of animal or plant, must be available [2]. RNA records the genetic characteristics of organisms, and different species have different genetic compositions. Moreover, different individuals of the same species can be distinguished through RNA analysis.
This study amends the classification of RNA sequences proposed by Hu et al. in 2011 [1], using mathematical analysis to solve the garbled-character problem that results from electrophoretic analysis of small samples of nucleic acid sequences. RNA electrophoresis relies on the fact that negatively charged nucleic acids migrate through the gel in an electric field toward the anode (the positive electrode). Because nucleic acids of different molecular weight move through the gel pores at different speeds, nucleic acids of different sizes are separated. RNA sequencing generally employs vertical electrophoresis [13]. Gel electrophoresis can analyze fragments ranging from a few nucleotides to chromosomal RNA of millions of nucleotides. However, each gel has resolving power only within a certain range; no single gel can analyze RNA fragments of every size. Therefore, to obtain good resolving power, the analytical range of the gel must be considered [14].
Two types of gel electrophoresis are commonly used to analyze RNA. One is agarose gel electrophoresis (AGE); the other is polyacrylamide gel electrophoresis (PAGE) [17]. Because the two gels differ in concentration, the pores they form are not the same size. Therefore, their ranges of analysis differ [3].
Today electrophoresis is convenient and reliable to use. However, it cannot analyze chromosomal RNA with larger molecules. For this reason, studies of gene localization on chromosomes have in recent years depended entirely on genetic analysis or on localization analysis with the microscope [14,15,16,17], which requires sophisticated manual experimental operation.
In electrophoresis, the migration rate of a nucleic acid is inversely proportional to the logarithm of its molecular weight and is not related to its base composition or sequence [14,16]. Nevertheless, various factors in the experimental operation affect electrophoresis: (i) gel concentration, (ii) nucleic acid structure, (iii) salt composition of the electrophoresis buffer, (iv) electric field strength, (v) electroosmosis, (vi) choice of support material, and (vii) temperature [14,16,17]. Accordingly, it is not easy to obtain a completely noise-free RNA sequence. The noise-filtering pre-processing proposed in this study resolves the garbled characters caused by these problems and enhances the accuracy of automated analysis machines.
Through classification by L2-norm distance, we made automated species identification possible for sequences with a small sample size [4]. Even when there are too few RNA sequences to serve as training samples, this study can carry out the calculations for species identification, and it shows how to handle RNA sequence classification with small samples. It thereby resolves this classification issue so that future research can take advantage of the principle. Such species identification is intended to lay the groundwork for biological sensors.
Therefore, this study proposes noise-filtering pre-processing and L2-norm distance for classification. The design targets the case in biological computing where only a small number of RNA sequences (or even a single sequence) is available for classification. We used "slipper orchids" as real biological samples for testing. The results show that, for a single RNA sequence of a hybrid slipper orchid, some garbled characters in the sequence can be removed by the noise-filtering pre-processing. Finally, we used L2-norm distance classification to classify the amino acid sequences. In this way, the calculation can check RNA sequence data from just a small sample without training. In this experiment, the species of the slipper orchids could be identified.
In this study, six native species of "slipper orchids" were tested first; we then expanded to the RNA sequences of fourteen native species of "slipper orchids" (source: Council of Agriculture, Executive Yuan, ROC, Taichung District Agricultural Improvement Station) [5]. We then ran the calculation on a set of hybrid offspring slipper orchid samples. The results show that classification by L2-norm distance correctly completed the species identification of the biological sequences; it further identified the native parent species of the bred hybrids, completing the biological calculation for genetic identification. Consequently, after testing, this method can be considered practical and effective, as shown in Tables 4-1 to 4-7.
Advanced Engineering Forum Vol. 1

Materials
Homologous RNA sequences are sequences that have high similarity, come from the same ancestor, have the same spatial structure, and have similar biochemical functions. The biological definition is as follows: if more than 25% of the amino acid sequence of two proteins is the same, or more than 75% of the nitrogenous base sequence of two RNAs is the same, we can conclude that the proteins or RNA sequences are homologous. This point serves as the reference for the mathematical calculation when we conduct genetic or species identification. Proteins are formed by a linear arrangement of amino acid molecules linked through peptide bonds. The amino acid sequence of a protein is encoded by the corresponding gene. There are 20 standard amino acids encoded by the genetic code, as shown in Table 2-1 [7,8].
Biologists discovered that phage RNA should be read in meaningful groups of three characters, that is, as codons. A codon is the unit that controls translation when RNA is converted into an amino acid sequence. Because there are 20 kinds of amino acids and RNA has 4 bases, RNA is read three characters at a time, producing 64 ($4^3 = 64$) different combinations, which correspond to the 20 amino acids through a many-to-one mapping [8].

Table 2-1: The genetic code table
As the genetic code (Table 2-1) shows, the codon for methionine (AUG) is the usual initiation codon. However, in a very few biological exceptions GUG is used as the initiation codon. UAA, UAG, and UGA are the stop codons. They do not correspond to any amino acid and act like the period at the end of a sentence: when translation encounters a stop codon, it stops. Because there are 64 ($4^3 = 64$) codons but only 20 kinds of amino acids, many amino acids must correspond to more than one codon. Arginine, for example, is among the amino acids with the most synonymous codons: it can be encoded by six different codons.
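The codon-to-amino-acid mapping described above can be sketched in a few lines of Python. This is only an illustration of the standard genetic code, not the authors' implementation: the names `BASES`, `AMINO_ACIDS`, `CODON_TABLE`, and `codons_for` are introduced here, amino acids are written as one-letter codes, and `*` is used as an assumed marker for a stop codon.

```python
from itertools import product

# Bases in the conventional U, C, A, G ordering of the genetic code table.
BASES = "UCAG"

# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 4**3 = 64 codons to its amino acid (or stop marker).
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(amino_acid):
    """Return all codons that encode the given one-letter amino acid code."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


print(len(CODON_TABLE))   # number of codons
print(codons_for("*"))    # stop codons
print(codons_for("R"))    # codons for arginine
```

The many-to-one character of the mapping is visible in `codons_for`: several amino acids are reached from more than one codon, while each codon maps to exactly one symbol.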

Emerging Engineering Approaches and Applications

Base sequence noise filtering methods
This study addresses the base sequences actually obtained by electrophoresis of biological samples, which often contain experimental data errors, in order to make automation of the computing system more feasible. For example, a sequence that should read AA'A'CCUGGG may instead appear as AA'X'CCUGGG, a garbled-character problem. Here we designed a new noise filter for base sequences to solve this garbled-character problem. The proposed noise filtering is based on electrophoretic analysis in biological experiments [14,15,17]. Following the principle of taking parts of the tissue structure when a tissue sample is collected, we divide the above example AAXCCUGGG into two sequences, AA + CCUGGG. Because AA is less than 3 characters long, we do not count it and preserve only CCUGGG for calculation. We then use the genetic code table to translate the RNA into a protein sequence for calculation. The 4 characters A, U, C, and G form 64 different 3-character strings, and from [8] we know which of them constitute the 20 amino acids of biological proteins. Hence, we adopted the method of Hu et al. [1] and set the codes of the 20 kinds of amino acids as parameters $x_1, \dots, x_{20}$. Parameter $x_{21}$ is the frequency of occurrence of UAA, UAG, UGA, and other "STOP" strings as a characteristic. Because sequence fragments that do not encode protein contain particularly many A and U, richness in A and U can serve as a feature. Therefore, parameter $x_{22}$ = "frequency of occurrence of A + U", as in Table 2-2.
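The splitting step in the AAXCCUGGG example can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the function name `filter_noise` and the parameter `min_length` are hypothetical, and converting the input to upper case is an added convenience that the paper does not mention.

```python
import re


def filter_noise(sequence, min_length=3):
    """Split a base sequence at every non-A/U/C/G character and
    keep only the fragments long enough to hold at least one codon."""
    # Any run of characters other than A, U, C, G is treated as noise.
    fragments = re.split(r"[^AUCG]+", sequence.upper())
    # Fragments shorter than one codon (3 characters) are discarded.
    return [fragment for fragment in fragments if len(fragment) >= min_length]


print(filter_noise("AAXCCUGGG"))
```

For the example from the text, the garbled X splits the sequence into AA and CCUGGG, and only CCUGGG survives the length check.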

Experimental procedures
1. Extracting RNA sequences from slipper orchids:
In this study, we mainly extract the ribosomal RNA sequences from ITS1 to ITS2 of slipper orchids. This region evolves rapidly and shows genetic variation among closely related species [5]. Therefore, biologists often use this RNA data in RNA analysis.

2. RNA copy (PCR, polymerase chain reaction):
In PCR thermal cycling, the RNA sequences are partly copied. In each cycle, the number of copies of the original RNA sequence doubles. After 30 cycles, we have 2 to the 30th power, that is, 1,073,741,824 times the original amount, which is about a billion times [7].
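The amplification figure can be written as a short worked equation. The symbols $N_0$ (initial number of copies) and $N_n$ (number of copies after $n$ cycles) are introduced here for illustration, and the relation assumes ideal doubling in every cycle:

```latex
N_n = N_0 \cdot 2^{n}, \qquad
\frac{N_{30}}{N_0} = 2^{30} = 1{,}073{,}741{,}824 \approx 1.07 \times 10^{9}
```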

3. RNA sequencing (electrophoresis analysis):
The RNA sequencing procedure analyzes the PCR product by electrophoresis. We put the product of the PCR reaction into the automated sequencer for sequence analysis, which also works on the principle of electrophoresis. Finally, a laser scanner scans the fluorescently labeled bases, and the computer determines the RNA sequence we need [7].
4. The RNA sequences were transformed into amino acid sequence data and quantified as 22 characteristic features. The research data set consists of slipper orchid sequences provided by the Taichung District Agricultural Improvement Station, Council of Agriculture, R.O.C. Therefore, the starting point of the original sequence is known. Some non-A, U, C, G characters were also generated. Therefore, the proposed noise filtering method was used to fix the garbled-character problem produced by the sampling error of the machine.

Sequence analysis of slipper orchids noise filtering
While conducting biological experiments, we found that the sequences contained some wrong characters. From the perspective of mathematical analysis, and after considering the experimental error, we do not count the codons containing those wrong characters toward the amino acid variables. When wrong characters appear, the algorithm handles them so that the calculations of this study can be automated. For example, in […,…,…,…,AUU,NAC,GCA,…,…,…,…], the character N is a wrong one, so NAC cannot be converted into an amino acid variable. Therefore we skip it and do not take NAC into consideration. The longer the whole sequence is, the smaller the error ratio is, as in this formula , the frequency for other characteristics to appear is

Classification of L 2 -norm distance
In order to achieve species identification effectively, we design our feature vector $X = \{x_1, x_2, \dots, x_{22}\}$ based on the amino acid variables in Table 2-1. We then build a set of feature vectors for the objects to be identified and compared, and classify this set by L2-norm distance computation so that the classification reflects the most essential features (i.e., the minimum L2-norm distance). This is our proposed classification process by L2-norm distance computation. The procedure can resolve the problem of determining and aligning a base sequence against the reference sequences when the sample size is small, as shown in Table 2-3.
First, we transformed the bases into the 20 groups of amino acids found in proteins. We added one group for the terminator (stop) codons and one group for the A and U bases, as shown in Table 2-2. These amino acid variables served as the 22 features of the feature vector in this study.
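A 22-component feature vector of this kind could be computed as sketched below. Several details are assumptions made only for illustration, because Table 2-2 is not reproduced in the text: the feature order (20 amino acids in alphabetical one-letter order, then the stop group, then A + U), the normalization (codon counts divided by the number of valid codons, A + U divided by the number of bases in those codons), and the names `FEATURE_ORDER` and `feature_vector`. The sequence is read in frame from its known starting point, and any codon containing a non-A, U, C, G character is skipped.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Assumed feature order: 20 amino acids (alphabetical), then the stop group.
FEATURE_ORDER = sorted(set(AMINO_ACIDS) - {"*"}) + ["*"]


def feature_vector(sequence):
    """Return 22 relative frequencies: 20 amino acids, stop codons, and A + U."""
    counts = dict.fromkeys(FEATURE_ORDER, 0)
    kept = []
    for i in range(0, len(sequence) - 2, 3):
        codon = sequence[i:i + 3]
        symbol = CODON_TABLE.get(codon)
        if symbol is None:
            continue  # codon contains a wrong character (e.g. NAC): skip it
        counts[symbol] += 1
        kept.append(codon)
    if not kept:
        raise ValueError("sequence contains no valid codon")
    x = [counts[symbol] / len(kept) for symbol in FEATURE_ORDER]
    bases = "".join(kept)
    x.append((bases.count("A") + bases.count("U")) / len(bases))
    return x


x = feature_vector("AUUNACGCA")
print(len(x))
```

In the usage line, NAC is skipped, so only AUU and GCA contribute to the frequencies.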
We calculated relative frequencies of occurrence for feature extraction and converted the original string data for analysis. After noise filtering, the frequencies of the 20 amino acids, together with the terminator codons and the A + U group, were converted by relative frequency into the vector representation (1). We then converted the values of (1) into 22 groups. Next, we took the smallest L2-norm distance, $\min \lVert \cdot \rVert$, over the indices …, 2, …, n, j = 1, 2, …, c, l = 1, 2, …, n, to find the best affinity among the 22 groups of parameters after filtering, including the terminator and the A and U groups, and to determine the best variables. We verified this classification by computer simulation. When the terminator and the A and U groups are used as additional labels, there are 22 parameters. We therefore let $X = \{x_1, x_2, \dots, x_k, \dots, x_{22}\}$, where $x_k$ represents the k-th characteristic frequency of occurrence in the classification. Then the number of variables was adjusted, and the dimension of the vector was set to represent the parameters of the whole sample.
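The minimum-distance rule can be illustrated with a small sketch. The functions `l2_distance`, `rank_species`, and `nearest_species` are hypothetical names, and the reference vectors in the usage example are toy two-dimensional values chosen only to show the mechanics; they are not measurements from the paper, whose vectors have 22 components.

```python
import math


def l2_distance(x, y):
    """Euclidean (L2-norm) distance between two equal-length feature vectors."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def rank_species(query, references):
    """Return (distance, name) pairs sorted from nearest to farthest."""
    return sorted((l2_distance(query, vector), name)
                  for name, vector in references.items())


def nearest_species(query, references):
    """Return the reference name with the minimum L2-norm distance to the query."""
    return rank_species(query, references)[0][1]


# Toy reference vectors with placeholder names (illustrative values only).
references = {"species_a": [0.0, 1.0], "species_b": [1.0, 0.0]}
print(nearest_species([0.1, 0.9], references))
```

Because only distances between individual vectors are compared, the rule needs no training set, which matches the small-sample setting described above. The full ranking from `rank_species` may be more informative than the single nearest entry when a hybrid has two parent species.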


Sequence alignment
The tests in this study were computed on biological RNA sequence data obtained from the laboratory of the Agricultural Improvement Station. The noise filtering method of this research was used to pre-process the sequence data. We then converted the RNA sequences into amino acid sequences following the principle in [8]. Finally, we used our proposed classification by L2-norm distance to measure the actual distance between the amino acid sequences.

Experimental results
In actual biological experiments, errors due to missing information are likely to occur. Therefore, we proposed the noise filter calculation to eliminate this blind spot. In this study, we tested real biological samples in two stages, as follows. In the first stage, six species of slipper orchids, "P. acmodontum", "P. charlesworthii", "P. concolor", "P. conco-bellatulum", "P. randsii", and "P. rothschildianum", served as study samples, and one hybrid, "Delr (P. rothschildianum X P. delenatii)", was classified, and the results are shown in In the second stage, we increased the number of study samples to 14, "P. armeniacum", "P. rothschildianum", "P. chamberlainianum", "P. concolor", "P. glaucophyllum", "P. haynaldianum", "P. lowii", "P. bellatulum", "P. sukhakulii", "P. urbanianum", "P. victoria-mariae", "P. villosum", "P. delenatii", and "Phragmipediummem", took Magi (P. micranthum X P. delenatii) as the hybrid, and applied the noise filtering algorithm directly to obtain the L2-norm distance. The classification result is shown in Table 4 4-2; we could clearly see the effectiveness and validity of applying this research to slipper orchids, and that the minimum L2-norm distance indicates the parent or parental relationship of the hybrid.

Conclusions
It has been common to use dimension reduction for classification and prediction. The advantage of the study of Hu et al. [1] is that all dimensions of the sample parameters can be included in the analysis, so that more sequences can be classified correctly. However, if the data for the native species (parent generation) and the hybrids (offspring) are each provided as a single base sequence, the above approach [1] may be unable to perform the calculation and analysis.

With noise filtering, we corrected the non-A, U, C, G errors produced by the machine during the electrophoresis analysis process. Furthermore, we applied the L2-norm distance of the proposed space theory to achieve species classification. Finally, we analyzed samples from biological experiments, using 14 native species of "slipper orchids" to classify hybrid slipper orchids, and thereby validated our method for genetic identification and species identification.
The numerical classification results also support the validity and reasonableness of this study. When all parameters in the classification dimensions are considered, the classification accuracy increases. Additionally, the noise filtering method proposed in this study solved the garbled-character problem that commonly occurs in electrophoresis of biological samples [14,17] and completed the error correction. Moreover, we used real biological samples of slipper orchids to verify the effectiveness of the method.
This method makes it possible to establish a simple biological testing model for species identification in the future, and makes the design of automatic detection more complete and effective.

≒ , and the RNA sequence has a certain length (… represents a three-character amino acid variable).

Table 2-1 was organized into 22 feature vectors for data analysis as Table 2-2.