A theoretical and practical study by the UPV/EHU has sought to solve two very common problems in Machine Learning. The first arises when few labelled data are available, and semi-supervised learning is used so that the computer can still learn to classify properly. The second is class imbalance: many labelled data of one type are available but very few of the others, which leads models to draw wrong conclusions.
Theoretical solutions for two very common problems in artificial intelligence
A study by the UPV/EHU-University of the Basque Country has made significant advances in two types of problems in data-based machine learning
First publication date: 27/08/2019
The research conducted by the computer engineer Jonathan Ortigosa in the Department of Computer Sciences and Artificial Intelligence of the UPV/EHU’s Faculty of Informatics in Donostia-San Sebastian falls within Machine Learning, a popular branch of Artificial Intelligence that centres on learning predictive models from data. “It is a field that explores the building of models that can learn and make predictions on the basis of the data they are provided with,” specified the researcher. Ortigosa has focussed his work on automatic classification tasks: “In this field an attempt is made to use a large quantity of data so that computers are capable of learning from them and automatically producing classifications without being explicitly programmed to do so,” he explained.
The research has focussed on two problematic situations that are very common in this field and “which are currently posing huge challenges for the scientific community, as they constantly emerge in the problems being tackled by machine learning”, he pointed out. It all began with a piece of work relating to sentiment analysis. “It was a piece of work to characterise articles from various blogs on certain products to find out whether the texts were objective or subjective, whether they were making positive or negative assessments, etc.,” he explained. But the researchers had very few properly labelled articles to enable the computer to learn robust models. That is why “we had to create new learning algorithms that would use large quantities of unlabelled data available over the Internet and a small proportion of labelled ones, and the result improved what already existed”, added Ortigosa.
That prompted the author of the work to ask himself: “What is the minimum number of labelled data needed to resolve problems similar to the above-mentioned one?” So he conducted a theoretical and mathematical study of the subject, and analysed “what the best semi-supervised algorithm that could be proposed for a certain small quantity of labelled data would be and what its error would be”. In this way the researchers calculated the lowest error that any algorithm proposed for this kind of problem could achieve. In other words, “we can know whether a certain number of data would be sufficient to achieve a percentage of accuracy. That way it is possible to estimate the worthiness of the solution proposed”, he specified.
The other problem he set out to tackle was class imbalance: “Teaching a computer is very similar to the way small children are taught to distinguish between dogs and cats. But if they are shown many dogs and only one cat, they may not understand the difference well or may draw incorrect conclusions,” explained Ortigosa. Yet in machine learning, as the author pointed out, “a wrong conclusion by the computer may have serious consequences in a company”. In this respect, the researchers proposed “a metric to measure what degree of imbalance, or difference in label types, is present in the data that are made available for the model to be learnt. This degree is related to the performance of the solution that can be proposed with these data, and therefore, it is essential to measure it”, he added.
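To give a rough idea of what measuring the degree of imbalance can look like, here is a minimal sketch in Python. It does not reproduce the metric proposed in the thesis, which the article does not specify. It uses two generic summaries chosen purely for illustration: the ratio between the largest and smallest class, and the normalised entropy of the class distribution. The function names and the dog/cat data are hypothetical.

```python
from collections import Counter
import math


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest one."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def normalized_entropy(labels):
    """Shannon entropy of the class distribution, scaled to [0, 1].

    1.0 means perfectly balanced classes; values near 0 mean that
    almost all examples belong to a single class.
    """
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    if k < 2:
        return 0.0  # a single class carries no balance information
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(k)


# Illustrative data only: a balanced set and a heavily imbalanced one.
balanced = ["dog"] * 500 + ["cat"] * 500
imbalanced = ["dog"] * 999 + ["cat"]

for name, labels in [("balanced", balanced), ("imbalanced", imbalanced)]:
    print(name, imbalance_ratio(labels), round(normalized_entropy(labels), 4))
```

Either number can be computed before any model is trained, which is the point made above: the degree of imbalance is a property of the data that constrains what any solution can achieve.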
The next step was to come up with metrics to evaluate whether a solution proposed for a problem of imbalance is good or not. “Imagine we have 1,000 animals, 999 dogs and 1 cat. If we create a solution that says all animals are dogs, we have a 99.9% degree of accuracy. The number is very good but the solution is not. This evaluation metric is called accuracy and is widely used in Machine Learning,” he pointed out. To penalise these cases of “silly” solutions, in this research they conducted a theoretical study “to be able to draw up a set of recommendations as to which evaluation metrics are suitable in these cases and thus to be able to make an honest, useful evaluation of the solutions”.
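The 999-dogs-and-1-cat example can be checked with a few lines of Python. The sketch below computes plain accuracy for the "everything is a dog" solution and, for contrast, balanced accuracy (the mean of the per-class recalls). Balanced accuracy is used here only as one well-known alternative; the article does not say which metrics the thesis ends up recommending.

```python
# 0 = dog (majority class), 1 = cat (minority class)
y_true = [0] * 999 + [1]
y_pred = [0] * 1000  # the "silly" solution: every animal is a dog


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each true class, the fraction of its examples predicted correctly."""
    recalls = {}
    for label in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(y_pred[i] == label for i in indices)
        recalls[label] = hits / len(indices)
    return recalls


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


print("accuracy:", accuracy(y_true, y_pred))                    # 0.999
print("recall per class:", per_class_recall(y_true, y_pred))    # {0: 1.0, 1: 0.0}
print("balanced accuracy:", balanced_accuracy(y_true, y_pred))  # 0.5
```

Accuracy rewards the trivial solution with 99.9%, whereas balanced accuracy drops to 0.5 because the only cat is never recognised.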
As Ortigosa pointed out, besides the research applied to each of the problems, in other words, besides seeking the practical resolution of the problems, he conducted a theoretical piece of research. “I mathematically modelled both problems to be able to control them, study them in depth and extract information that may be used in proposing solutions for real problems,” explained the researcher. “Real problems are complex, and even though much research is being done, great theoretical knowledge is needed so that later you know how to propose solutions that are better than the existing ones,” he concluded.
Jonathan Ortigosa (Donostia-San Sebastian, 1985) wrote up his PhD thesis (‘Theoretical and Methodological Advances in Semi-supervised Learning and the Class-Imbalance Problem’) in the Department of Computer Sciences and Artificial Intelligence of the Faculty of Informatics in Donostia-San Sebastian, under the supervision of José A. Lozano, professor of the Faculty of Informatics, leader of the Intelligent Systems Group of the UPV/EHU and scientific director of BCAM, and Iñaki Inza, lecturer in the Faculty of Informatics. Ortigosa currently leads the Advanced Analytics and Artificial Intelligence team of the Department of Advanced Manufacturing and Standardization at Gestamp.