A data-mining based process to early identify breast cancer from metabolomic data

Abstract of our work presented at EURO 2018, the largest and most important conference for Operational Research, co-authored by Víctor M. Rivas Santos together with researchers from Complejo Hospitalario de Jaén and Fundación Medina.

This paper was presented on 9 July 2018 in Valencia, as part of the stream Data Mining and Statistics.

Abstract

We present the results obtained by our multidisciplinary group in the task of discriminating between blood samples from breast cancer patients and those from healthy people. The models used to classify the samples were built using data mining techniques; the data were collected by means of liquid chromatography-mass spectrometry, a technique that detects and quantifies the metabolites present in blood samples.

Different algorithms were tested under two evaluation scenarios: 10-fold cross-validation (10-CV) and a 75/25 training/test split. Our experiments showed that IBk, J48 and Logistic Model Trees yielded rates greater than 90% only for healthy people. Naive Bayes and Random Forest improved on these results under 10-CV, but did not yield more than 85% of true positives for patients in the 75/25 scenario. Finally, the Bayesian network proved to be the best algorithm, as it yielded rates greater than 90% for both patients and healthy people.
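As an illustration of the two evaluation scenarios mentioned above, the following is a minimal sketch of how sample indices could be partitioned for a 75/25 split and for 10-fold cross-validation. It is not the code used in the study: the function names, the stratification by class, the fixed seed and the toy labels are all assumptions made for this example.

```python
import random
from collections import defaultdict


def _indices_by_class(labels, rng):
    """Group sample indices by class label and shuffle each group."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    for idx in by_class.values():
        rng.shuffle(idx)
    return by_class


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), keeping class proportions in both parts."""
    rng = random.Random(seed)
    train, test = [], []
    for idx in _indices_by_class(labels, rng).values():
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


def stratified_folds(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) for each of k stratified folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for idx in _indices_by_class(labels, rng).values():
        # Deal the shuffled indices of each class round-robin into the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


# Toy labels for illustration only (not the study's data).
labels = ["patient"] * 20 + ["healthy"] * 20

train, test = stratified_split(labels)
print(len(train), len(test))  # 30 10

for train_idx, test_idx in stratified_folds(labels):
    # A classifier would be trained on train_idx and evaluated on test_idx here.
    pass
```

In the 75/25 scenario a model is evaluated once on the held-out quarter, whereas in 10-CV every sample is used for testing exactly once, which is one reason why the two scenarios can rank algorithms differently.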

Several statistics, as well as confusion matrices, have been computed, showing that the model built by the Bayesian network can be used effectively to solve this problem. The metabolites used to build the model are currently being identified by biochemists. This last step will be decisive in determining whether they can be considered valid biomarkers for breast cancer.
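To make the per-class rates discussed above concrete, the sketch below shows how they can be derived from a confusion matrix. The function name and the counts are hypothetical and chosen only for illustration; they are not results from the study.

```python
def per_class_rates(confusion, classes):
    """Return the fraction of correctly classified samples for each true class.

    confusion[i][j] is the number of samples whose true class is classes[i]
    and whose predicted class is classes[j].
    """
    rates = {}
    for i, name in enumerate(classes):
        total = sum(confusion[i])
        rates[name] = confusion[i][i] / total if total else float("nan")
    return rates


# Hypothetical counts for illustration only (not results from the study).
classes = ["patient", "healthy"]
confusion = [
    [18, 2],   # true patients: 18 classified as patient, 2 as healthy
    [1, 19],   # true healthy:   1 classified as patient, 19 as healthy
]

rates = per_class_rates(confusion, classes)
print(rates)  # {'patient': 0.9, 'healthy': 0.95}
```

Reporting one rate per class, rather than a single overall accuracy, is what makes it possible to see whether a model performs well for patients and for healthy people alike.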