Identification of exonic regions in DNA sequences: an approach using cross-correlation and noise suppression by discrete cosine transform

Filtering techniques are used to identify the exonic regions in the DNA sequence of Chromosome 23. The discrete cosine transform (DCT) is able to remove noise from signals, as shown in (Saraiva et al., 2018). However, noise suppression with the DCT is not sufficient on its own, so this work proposes a new method of identifying exonic regions that combines cross-correlation and the DCT with an FFT-based bandpass filter to decrease signal noise and locate the exonic regions.


Introduction
The identification of protein-coding regions (exons) in DNA sequences using signal processing techniques is an important component of bioinformatics and biological signal processing. This work presents a new identification method, in which the cross-correlation and the correlation coefficient were used to confirm the feasibility of the technique. The availability of the complete genome sequences of many eukaryotic organisms continues to contribute to a better understanding of their genome design and evolution. An average vertebrate gene consists of multiple small exons separated by introns that are 10 or 100 times longer. To understand the structure and evolution of eukaryotic genomes, it is important to know the general statistical characteristics of exons and introns. Furthermore, the identification of exonic regions assists in the analysis of eukaryotic genome sequences (Avery et al., 1944; Morgan, 1911).
When the DNA of a new eukaryotic organism is sequenced, the exonic (protein-coding) regions must be distinguished from the introns. The protein-coding regions of DNA have been observed to exhibit a period-3 property due to the non-uniform codon usage in the translation of codons into amino acids (Fickett, 1982). The aim of this paper is to use this property to identify exonic regions (Hershey & Chase, 1952). Some codons take part in protein synthesis more often than others, giving rise to repetitions of a particular type of codon in the genome. For instance, the presence of a large number of GCA codons in the exonic regions gives a greater repetition of the G, C and A nucleotides in the first, second and third codon positions, respectively (Shetty, 2018). As such, the G, C and A nucleotides show the period-3 property in the exonic regions (Akhtar et al., 2008).
Gene-finding techniques based on genetic features such as promoters, CpG islands, and start and stop codons tend to be of insufficient accuracy. The characterization of coding and noncoding regions based on nucleotide statistics within codons is described by Bernaola-Galván et al. (2000), who used a 12-symbol alphabet to identify the borders between coding and noncoding regions. Later, Nicorici and Astola segmented the DNA sequence into coding and noncoding regions using recursive entropic segmentation and stop-codon statistics (Datta and Asif, 2005).
Research, Society and Development, v. 9, n. 9, e883998173, 2020 (CC BY 4.0) | ISSN 2525-3409 | DOI: http://dx.doi.org/10.33448/rsd-v9i9.8173

Methodology
This section explains the materials and methods used to obtain the results of this work. To aid understanding, it is divided into three subsections that explain in detail the transforms and the statistics chosen. The step-by-step process is shown in Figure 1.

DNA numeric conversion
Before the technique can be applied to find the nucleotide regions in a denoised signal, the DNA sequence is first mapped onto a numerical sequence, organized as shown in Figure 2. In this binary mapping, the presence and absence of a given nucleotide at each position are represented by '1' and '0', respectively. For instance, given the DNA segment ATCCGATATTC, the binary sequence of nucleotide A, denoted IA[n], is [10000101000]. The binary sequences for the other three nucleotides T, C and G are found likewise. After mapping the DNA sequence onto its binary numerical sequences, each binary sequence is passed through a Hamming window-based FIR filter of order 8 with central frequency set to 2/3. The absence of distortion in FIR filters is one reason for their preferred use over IIR filters in medical applications.
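The binary mapping described above can be sketched in a few lines of Python. The function names (`indicator_sequences`, `dft_bin`, `period3_power`) are hypothetical and chosen only for this illustration. The DFT-based period-3 power is a common way to make the period-3 property visible and is shown here as an assumption, not as the measure used in this work.

```python
import cmath

def indicator_sequences(dna):
    """Map a DNA string to four binary indicator sequences (1 = base present)."""
    dna = dna.upper()
    return {base: [1 if ch == base else 0 for ch in dna] for base in "ATCG"}

def dft_bin(x, k):
    """Single DFT coefficient X[k] = sum_n x[n] * exp(-j*2*pi*k*n/N)."""
    n_samples = len(x)
    return sum(v * cmath.exp(-2j * cmath.pi * k * n / n_samples)
               for n, v in enumerate(x))

def period3_power(dna):
    """Sum over the four bases of |X_b[N/3]|^2.

    Assumes len(dna) is a multiple of 3 so that N/3 is an exact DFT bin.
    """
    k = len(dna) // 3
    return sum(abs(dft_bin(seq, k)) ** 2
               for seq in indicator_sequences(dna).values())

seqs = indicator_sequences("ATCCGATATTC")
print(seqs["A"])  # -> [1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0]

# A repeated codon concentrates power at the period-3 bin,
# while a sequence with no codon structure does not.
print(period3_power("GCA" * 10), period3_power("A" * 30))
```

The first printed list reproduces the IA[n] example from the text. The last line contrasts a sequence built from a repeated GCA codon with a homopolymer of the same length.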
Furthermore, the discrete cosine transform is applied to the signal to lower the noise in the acquired data. After this step the signal becomes easier to interpret, and the exonic regions become more visible.
In the final step, statistics are computed from the resulting signal to measure how accurate the technique is. The statistics chosen were the cross-correlation and the correlation coefficient.
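The two statistics named above can be sketched as follows. This is a minimal illustration under stated assumptions: both signals have the same length, the cross-correlation is the raw (unnormalized) form, and the function names `pearson` and `cross_correlation` are hypothetical. It is not the implementation used to produce the reported results.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences.

    Uses the sample standard deviation (N - 1 in the denominator).
    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sd_a = math.sqrt(sum((v - mean_a) ** 2 for v in a) / (n - 1))
    sd_b = math.sqrt(sum((v - mean_b) ** 2 for v in b) / (n - 1))
    return sum(((x - mean_a) / sd_a) * ((y - mean_b) / sd_b)
               for x, y in zip(a, b)) / (n - 1)

def cross_correlation(a, b):
    """Raw cross-correlation r[lag] = sum_i a[i + lag] * b[i].

    Returns (lags, values) for lag = -(N-1) .. (N-1).
    """
    n = len(a)
    lags = list(range(-(n - 1), n))
    values = []
    for lag in lags:
        total = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                total += a[j] * b[i]
        values.append(total)
    return lags, values

lags, r = cross_correlation([1, 1, 1, 1], [1, 1, 1, 1])
print(r)  # -> [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0]
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
```

The cross-correlation of two identical constant-level signals rises linearly to a peak at lag 0 and falls linearly afterwards, which is one simple case in which a triangular shape appears.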

Fast Fourier Transform Based FIR Filter
Filters are signal conditioners. Each functions by accepting an input signal, blocking prespecified frequency components, and passing the original signal minus those components to the output. For example, a typical phone line acts as a filter that limits frequencies to a range considerably smaller than the range of frequencies human beings can hear.
A digital filter takes a digital input, gives a digital output, and consists of digital components. In a typical digital filtering application, software running on a digital signal processor (DSP) reads input samples from an A/D converter, performs the mathematical manipulations dictated by theory for the required filter type, and outputs the result via a D/A converter. The FIR filter is designed using windowing: an ideal filter is specified in the frequency domain and then translated into the discrete time domain. However, this gives an infinite impulse response. To compensate for this, the ideal impulse response is multiplied by a window function.
To build the ideal filter in the frequency domain, we use the Fast Fourier Transform (FFT) and the Hamming window as the principal tools. The FFT is defined as in equation 1.
$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \dots, N-1 \tag{1}$$
In FIR filter design the order of the filter is denoted M, and it determines the length of the window. The approximated impulse response of the filter is denoted h[n] in discrete time, as shown in equation 2,
$$h[n] = h_d[n]\, w[n], \qquad 0 \le n \le M \tag{2}$$
where h_d[n] is the ideal impulse response and w[n] is the windowing function, shown in equation 3.
$$w[n] = 0.54 - 0.46 \cos\left(\frac{2\pi n}{M}\right), \qquad 0 \le n \le M \tag{3}$$
The product in equation 2 corresponds, in the frequency domain, to the convolution shown in equation 4.
$$H(\omega) = \frac{1}{2\pi} \int_{-\pi}^{\pi} H_d(\theta)\, W(\omega - \theta)\, d\theta \tag{4}$$
The FIR frequency response H(ω) is a finite-degree polynomial in e^{-jω}.
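The window method described in this subsection can be sketched as follows: an ideal band-pass impulse response is truncated to M + 1 taps and multiplied by a Hamming window. The order of 8 and the Hamming window come from the text. The band edges (π/2 and 5π/6 rad/sample, centered on 2π/3), the interpretation of the "2/3" central frequency as 2π/3 rad/sample, and all function names are assumptions made only for this illustration.

```python
import cmath
import math

def hamming(order):
    """Hamming window of length order + 1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / order)
            for n in range(order + 1)]

def ideal_bandpass(order, w_lo, w_hi):
    """Ideal band-pass impulse response, truncated and centred at order / 2."""
    centre = order / 2
    taps = []
    for n in range(order + 1):
        m = n - centre
        if m == 0:
            taps.append((w_hi - w_lo) / math.pi)
        else:
            taps.append((math.sin(w_hi * m) - math.sin(w_lo * m)) / (math.pi * m))
    return taps

def design_fir(order=8, w_lo=math.pi / 2, w_hi=5 * math.pi / 6):
    """Windowed design: ideal impulse response times the Hamming window."""
    return [hd * w for hd, w in zip(ideal_bandpass(order, w_lo, w_hi),
                                    hamming(order))]

def freq_response(h, omega):
    """H(omega) = sum_n h[n] * exp(-j*omega*n), a polynomial in exp(-j*omega)."""
    return sum(c * cmath.exp(-1j * omega * n) for n, c in enumerate(h))

def fir_filter(h, x):
    """Direct-form FIR filtering, y[n] = sum_k h[k] * x[n - k]."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

h = design_fir()
# Gain near the period-3 frequency versus gain at DC.
print(abs(freq_response(h, 2 * math.pi / 3)), abs(freq_response(h, 0.0)))
```

With only nine taps the transition bands are wide, so the chosen band edges mainly control how much of the spectrum around 2π/3 is retained. An indicator sequence can then be passed through `fir_filter(h, sequence)`.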

Discrete Cosine Transform
The discrete cosine transform (DCT) is closely related to the discrete Fourier transform (DFT). It can often reconstruct a sequence accurately from only a few DCT coefficients, a property that is very useful for applications that require data reduction, such as reducing the data used in electrocardiograms (Nguyen et al., 2017). The DCT has four standard variants. For a signal x of size N, and with δ denoting the Kronecker delta, the transforms are defined by equations 5, 6, 7 and 8, respectively.
$$y(k) = \sqrt{\frac{2}{N-1}} \sum_{n=1}^{N} x(n)\, \frac{1}{\sqrt{1 + \delta_{n1} + \delta_{nN}}}\, \frac{1}{\sqrt{1 + \delta_{k1} + \delta_{kN}}} \cos\left(\frac{\pi}{N-1}(n-1)(k-1)\right) \tag{5}$$
$$y(k) = \sqrt{\frac{2}{N}} \sum_{n=1}^{N} x(n)\, \frac{1}{\sqrt{1 + \delta_{k1}}} \cos\left(\frac{\pi}{2N}(2n-1)(k-1)\right) \tag{6}$$
$$y(k) = \sqrt{\frac{2}{N}} \sum_{n=1}^{N} x(n)\, \frac{1}{\sqrt{1 + \delta_{n1}}} \cos\left(\frac{\pi}{2N}(n-1)(2k-1)\right) \tag{7}$$
$$y(k) = \sqrt{\frac{2}{N}} \sum_{n=1}^{N} x(n) \cos\left(\frac{\pi}{4N}(2n-1)(2k-1)\right) \tag{8}$$
The series are indexed from n = 1 and k = 1 instead of the usual n = 0 and k = 0. In these equations, x is the input array, y is its DCT, and N is the length of the transform, a positive integer scalar; x and y are vectors (they can also be matrices) (Nguyen et al., 2017).
In their work, Swarnkar et al. achieved better results with the standlet transform than with the DCT and the wavelet transform, illustrating the results with metrics such as the signal-to-noise ratio (SNR), which is also used in this work, the compression ratio (CR) and the percentage root-mean-square difference (PRD) (Swarnkar et al., 2017). A DCT expresses a finite series of data points as a sum of cosine functions oscillating at different frequencies. Applications of the DCT include solving partial differential equations, Chebyshev approximation and audio compression (Raj & Ray, 2017).
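As an illustration of how a DCT can suppress noise, the sketch below implements an orthonormal DCT-II (the second of the four variants, written with 0-based indices), its inverse, and a simple denoising step. The selection rule, keeping only the largest-magnitude coefficients, is an assumption made for this example, since the text does not state how coefficients were chosen. The function names are hypothetical.

```python
import math

def dct2(x):
    """Orthonormal DCT-II of a sequence (0-based indices)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct2(y):
    """Inverse of dct2 (an orthonormal DCT-III)."""
    n = len(y)
    out = []
    for i in range(n):
        s = y[0] * math.sqrt(1.0 / n)
        s += sum(math.sqrt(2.0 / n) * y[k]
                 * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                 for k in range(1, n))
        out.append(s)
    return out

def dct_denoise(x, keep):
    """Keep the `keep` largest-magnitude DCT coefficients and invert.

    Assumed selection rule for illustration only.
    """
    y = dct2(x)
    ranked = sorted(range(len(y)), key=lambda k: abs(y[k]), reverse=True)
    kept = set(ranked[:keep])
    return idct2([v if k in kept else 0.0 for k, v in enumerate(y)])

# A smooth cosine plus a small alternating disturbance.
clean = [math.cos(math.pi * (2 * i + 1) * 2 / 32) for i in range(16)]
noisy = [c + 0.01 * (-1) ** i for i, c in enumerate(clean)]
restored = dct_denoise(noisy, keep=1)
print(max(abs(r - c) for r, c in zip(restored, clean)))
```

In this toy case the clean signal is a single DCT basis function, so one coefficient carries almost all of its energy and the disturbance falls into the discarded coefficients. Real signals need more retained coefficients, and the number kept trades noise suppression against distortion.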

Correlation coefficient
A correlation coefficient is a numerical measure of some type of correlation, meaning a statistical relationship between two variables. The variables may be two columns of a given data set of observations, often called a sample, or two components of a multivariate random variable with a known distribution. The correlation coefficient of two random variables is a measure of their linear dependence. If each variable has N scalar observations, then the Pearson correlation coefficient is defined as shown in equation 9,
$$\rho(A, B) = \frac{1}{N-1} \sum_{i=1}^{N} \left(\frac{A_i - \mu_A}{\sigma_A}\right) \left(\frac{B_i - \mu_B}{\sigma_B}\right) \tag{9}$$
where μ_A and σ_A are the mean and standard deviation of A, and μ_B and σ_B are the mean and standard deviation of B.
The cross-correlation is illustrated in Figure 5. The correlation coefficient was also used as a statistic, and its result is shown in Table 1.
In Figure 5, the points of the cross-correlation between the result and the DNA sequence are grouped in a shape that clearly resembles a triangle, which shows that the signal obtained is highly correlated with the original sequence. Table 1 shows a correlation coefficient of 1.000 and a mean error of -0.0052, giving an accuracy of 99.48%, which shows that the correlation with the original sequence is nearly optimal, as mentioned in the statistics section.
To conclude, this work showed that the DCT is very effective at reducing noise in the signal obtained from the DNA sequence. The correlation coefficient also showed an excellent result in terms of accuracy for the exons identified from the signal obtained from the DNA sequence, which is shown in Figure 3.