Researchers from the NUS Cancer Science Institute of Singapore (CSI) and the School of Biological Sciences at Nanyang Technological University (NTU) have conducted a study revealing a common deficiency in existing artificial intelligence (AI) methods for predicting enhancer-promoter interactions, one that may result in inflated performance measurements.
An enhancer is a short DNA sequence that boosts gene transcription, while a promoter is a stretch of DNA that initiates gene transcription.
Understanding how enhancers and promoters interact is critical for gene regulation studies, as there is great scientific interest in whether such interactions are dysfunctional in cancer cells and whether they present an opportunity for clinical intervention. AI methods for predicting these interactions are therefore important: they support researchers in their studies and make it possible to extend such data to new cell types.
The research was conducted by Cao Fan, a research fellow at CSI, and Dr Melissa J. Fullwood, Principal Investigator at CSI and Nanyang Assistant Professor at NTU. Led by these two researchers, the team set out to develop an enhancer-promoter interaction prediction method using existing datasets from TargetFinder.
TargetFinder is a machine learning method that predicts enhancer-promoter interactions based on transcription factor and histone modification profiles in the window regions between enhancers and promoters. Histones are highly alkaline proteins found in eukaryotic cell nuclei.
In the course of this work, the team observed that enhancer-promoter interactions could be predicted from random DNA sequence features in the window regions while still showing high measured performance.
On examining the TargetFinder datasets, the team found that this high performance could be attributed to extensive overlap between the window regions of positive samples, which distorts the performance measurements.
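To make the idea of overlapping window regions concrete, the following is a minimal Python sketch, not the team's actual analysis. The `Window` type, the half-open coordinate convention, and the choice to normalise by the shorter window are all assumptions made for illustration.

```python
from typing import NamedTuple


class Window(NamedTuple):
    """Hypothetical window region between an enhancer and a promoter."""
    chrom: str
    start: int  # inclusive coordinate (assumed convention)
    end: int    # exclusive coordinate (assumed convention)


def overlap_fraction(a: Window, b: Window) -> float:
    """Return the shared length divided by the length of the shorter window.

    Windows on different chromosomes, or windows that do not intersect,
    yield 0.0.
    """
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    shorter = min(a.end - a.start, b.end - b.start)
    return shared / shorter


# Illustrative coordinates only; these are not taken from any real dataset.
w1 = Window("chr1", 1000, 5000)
w2 = Window("chr1", 2000, 6000)
w3 = Window("chr2", 2000, 6000)

print(overlap_fraction(w1, w2))  # windows share 3000 of 4000 bases
print(overlap_fraction(w1, w3))  # different chromosomes, no overlap
```

When two heavily overlapping windows end up on opposite sides of a training/test split, the test sample carries nearly the same features as a training sample, so a high score may say little about how the method generalises.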
To mitigate the issue of overlapping samples, the team evaluated enhancer-promoter interaction methods using a chromosome-split strategy, in which training and test samples come from different chromosomes. TargetFinder achieved significantly lower performance under the chromosome-split strategy, indicating that the performance measurements in the earlier evaluation were indeed inflated.
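A chromosome-split evaluation can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the procedure used in the study: the function name `chromosome_split`, the sample labels, and the choice of held-out chromosome are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def chromosome_split(
    chroms: Sequence[str], test_chroms: Set[str]
) -> Tuple[List[int], List[int]]:
    """Split sample indices so that no chromosome appears in both sets.

    chroms[i] is the chromosome of sample i; samples on any chromosome
    listed in test_chroms go to the test set, all others to training.
    """
    train_idx: List[int] = []
    test_idx: List[int] = []
    for i, chrom in enumerate(chroms):
        if chrom in test_chroms:
            test_idx.append(i)
        else:
            train_idx.append(i)
    return train_idx, test_idx


# Hypothetical chromosome labels for six enhancer-promoter samples.
sample_chroms = ["chr1", "chr1", "chr2", "chr3", "chr2", "chr3"]
train_idx, test_idx = chromosome_split(sample_chroms, {"chr3"})

print(train_idx)  # indices of samples on chr1 and chr2
print(test_idx)   # indices of samples on chr3
```

Because window regions on different chromosomes cannot overlap, a split of this kind keeps near-duplicate samples from appearing in both training and test data. Group-aware cross-validation utilities, such as scikit-learn's `GroupKFold` with chromosomes passed as groups, offer a similar guarantee across several folds.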
The team also examined JEME, another machine learning method for predicting enhancer-promoter interactions, whose datasets show significant differences in distance distributions between positive and negative samples. The team found that JEME, too, yields inflated performance measurements owing to erroneous use of input data.
Cao Fan from CSI said this study highlights "the need for careful experimental design when applying machine learning to genomic research."
"It is key to properly evaluate an enhancer-promoter interaction method, and take into account the possibility of generating highly inflated performance measurement."