Shashi Bajaj Mukherjee1*, Pradip Kumar Sen2
[Vol. 03 (01), March, 2022, pp. 91-107]
Gene finding techniques are based on identifying coding portions in DNA sequences. In this paper, the difference between coding and non-coding sequences is studied and a criterion is developed to classify a sequence into one of the two types. The species considered is Drosophila melanogaster, and all six chromosomes are studied individually. The basic characteristic of a sequence that is considered is the distribution of the four bases A, C, G and T. A statistical test of homogeneity is used to study the difference between the two types in this respect. In this connection, the GC-richness property of coding sequences is also tested and confirmed. A distance-type parameter is then developed for each sequence: d², the squared Euclidean distance between the sequence's base distribution and a standard distribution. Various choices of the standard distribution are examined. The frequency distribution of d² for the coding and non-coding types shows clear differences for every choice of standard distribution. Finally, a critical value of d² is obtained to classify a sequence as coding or non-coding, which gives more than 90% accuracy for all six chromosomes.
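As a rough illustration of the distance parameter described above, the following Python sketch computes the base distribution of a sequence, its GC content, and d² relative to a standard distribution. It is a minimal sketch rather than the authors' implementation: the function names, the uniform standard distribution, and the direction of the comparison against the critical value are assumptions made for the example, since the abstract does not specify them.

```python
from collections import Counter

BASES = "ACGT"

# Assumption: a uniform standard distribution, used here only for illustration.
# The paper examines several choices of standard distribution.
UNIFORM = {b: 0.25 for b in BASES}


def base_distribution(seq):
    """Return the relative frequencies of A, C, G and T in seq.

    Symbols other than the four bases (for example N) are ignored.
    """
    counts = Counter(b for b in seq.upper() if b in BASES)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {b: counts[b] / total for b in BASES}


def gc_content(seq):
    """Return the combined relative frequency of G and C."""
    p = base_distribution(seq)
    return p["G"] + p["C"]


def d2(seq, standard=UNIFORM):
    """Squared Euclidean distance between seq's base distribution and standard."""
    p = base_distribution(seq)
    return sum((p[b] - standard[b]) ** 2 for b in BASES)


def exceeds_critical(seq, critical, standard=UNIFORM):
    """Report whether d2 of seq is above a given critical value.

    Assumption: which side of the critical value corresponds to "coding"
    is not stated in the abstract, so this returns only the comparison.
    """
    return d2(seq, standard) > critical


if __name__ == "__main__":
    example = "ATGGCGTGCAAGCTGACC"  # hypothetical sequence, not from the paper
    print(base_distribution(example))
    print(gc_content(example))
    print(d2(example))
```

Any critical value passed to `exceeds_critical` would have to be taken from the paper's results for the chosen standard distribution; no such value is assumed here.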