
Data Mining (26218)

Centre: Faculty of Informatics
Degree: Bachelor's Degree in Informatics Engineering
Academic course: 2022/23
Academic year: X
No. of credits: 6
Languages: Spanish, Basque
Code: 26218

Teaching

Distribution of hours by type of teaching
Study type | Hours of face-to-face teaching | Hours of non classroom-based work by the student
Lecture-based | 40 | 60
Applied laboratory-based groups | 20 | 30

Teaching guide

Description and Contextualization of the Subject

This subject focuses on the field known as data mining or machine learning. It comprises a set of techniques, grounded in artificial intelligence and classical statistics, that have emerged strongly in the last decade for solving problems using large volumes of data. Its applications range from bioinformatics and finance to marketing and advertising, as well as natural language processing.



Although the technology giants have been at the forefront of this 'data science - big data - data mining' discipline for years, more and more small and medium-sized companies and institutions have recently become aware of the need to store data on their activities and to analyse it to draw useful conclusions for their day-to-day operations. In the case of Euskadi, the machine tool sector and the term 'Industry 4.0' have raised the profile of our discipline.



The subject is closely linked to other computing subjects such as Artificial Intelligence and Algorithm Design. Related optional subjects include Heuristic Searches, plus others from other specialities related to databases and computing systems.



Students will study the main data mining techniques and will become familiar with real programs.

Skills/Learning outcomes of the subject

Main results of the learning process:

- knowledge of the main supervised classification techniques

- knowledge of the main non-supervised classification (clustering) techniques

- knowledge of the main techniques for evaluating classification models

- skills in using the main software tools for learning and evaluating supervised and non-supervised classification models



The main data mining techniques will be studied, and the student will acquire skills in the use of free software that implements them. Real data mining applications will also be presented. The student will also acquire the basic international vocabulary of machine learning.





Theoretical and practical content

1. Introduction to data mining

Applications and success stories. An overview of data mining as a discipline within the field of artificial intelligence.



2. Distance-based classifiers: k-nearest neighbour

The intuitive nature of this classic data mining method makes it ideal as a first supervised classification technique. Its basic operation will be studied, together with its main variants and parameters.
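As an illustrative sketch only (the course material does not prescribe this code), the basic k-nearest neighbour rule can be written in a few lines of Python. The function name `knn_predict`, the Euclidean distance and the toy data are assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest
    training points. `train` is a list of (features, label) pairs and the
    distance used here is Euclidean (one possible choice among several)."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy data, made up for illustration: two well-separated groups.
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
         ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b")]

print(knn_predict(train, (1.1, 1.0), k=3))
print(knn_predict(train, (5.1, 5.0), k=3))
```

The two main parameters visible here are the number of neighbours `k` and the distance function; the variants studied in the course typically change one or the other.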



3. Techniques to evaluate and validate classifiers

Study of the main techniques for evaluating classifiers, with special emphasis on supervised classification methods and the estimation of success rates. Introduction to the main statistical tests for comparison between different classifiers.
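One common way to estimate a success rate is k-fold cross-validation. The following is a minimal sketch under stated assumptions: the helper names (`k_fold_indices`, `cross_val_accuracy`, `majority_class`) are hypothetical, the folds are built by simple interleaving without shuffling, and a majority-class baseline stands in for a real classifier.

```python
from collections import Counter

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k interleaved folds of nearly equal size."""
    return [list(range(i, n, k)) for i in range(k)]

def cross_val_accuracy(data, fit_predict, k=5):
    """Estimate accuracy by k-fold cross-validation.

    data: list of (features, label) pairs.
    fit_predict(train, x): trains on `train` and returns a label for x.
    """
    correct = 0
    for fold in k_fold_indices(len(data), k):
        held_out = set(fold)
        train = [data[i] for i in range(len(data)) if i not in held_out]
        correct += sum(fit_predict(train, data[i][0]) == data[i][1] for i in fold)
    return correct / len(data)

def majority_class(train, x):
    """Baseline classifier: always predict the most frequent training label."""
    return Counter(label for _, label in train).most_common(1)[0][0]

# Toy data, made up for illustration: 8 positive and 2 negative cases.
data = [((i,), "pos") for i in range(8)] + [((i,), "neg") for i in range(8, 10)]
accuracy = cross_val_accuracy(data, majority_class, k=5)
print(accuracy)
```

Every instance is used for testing exactly once and never appears in the training set of its own fold, which is what makes the estimate honest.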



4. Classification trees and decision rules

Study of these two families of algorithms, inspired by the 'divide and conquer' philosophy, with special emphasis on the transparency and simplicity of their final models. Different growth and pruning options will be explained.
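To illustrate how a tree can be grown, the sketch below chooses the attribute to split on by information gain. This criterion is an assumption for the example (it is one common option, used in ID3-style trees), and the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on a categorical attribute.
    Each row is a dict mapping attribute names to values."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

def best_split(rows, labels, attributes):
    """Attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))

# Toy data, made up for illustration: 'outlook' determines the class, 'windy' does not.
rows = [{"outlook": "sunny", "windy": "yes"}, {"outlook": "sunny", "windy": "no"},
        {"outlook": "rainy", "windy": "yes"}, {"outlook": "rainy", "windy": "no"}]
labels = ["play", "play", "stay", "stay"]

print(best_split(rows, labels, ["outlook", "windy"]))
```

Growing a full tree repeats this choice recursively on each partition; pruning then removes branches that do not improve the estimated accuracy.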



5. Classifiers based on Bayesian networks

Study of the basic theory underlying Bayes' theorem. Classification models of different complexity will be explained. We will examine the following applications of this type of classifier: models for diagnosis and prognosis in medicine (evidence-based medicine, computational medicine).
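As a small worked example of Bayes' theorem in a diagnostic setting, the sketch below computes the probability of disease given a positive test. The prevalence, sensitivity and specificity values are invented for illustration, and the function name `posterior` is hypothetical.

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive test) by Bayes' theorem.

    prior:       P(disease)
    sensitivity: P(positive | disease)
    specificity: P(negative | no disease)
    """
    true_pos = sensitivity * prior
    false_pos = (1.0 - specificity) * (1.0 - prior)
    return true_pos / (true_pos + false_pos)

# Invented figures: 1% prevalence, 90% sensitivity, 95% specificity.
p = posterior(prior=0.01, sensitivity=0.9, specificity=0.95)
print(round(p, 3))  # about 0.154
```

Even with a fairly accurate test, the low prior keeps the posterior probability modest, which is the kind of reasoning Bayesian classifiers automate.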



6. Combination of classifiers

Study of the different techniques used to combine classifiers. The virtues of the consensus reached by classifiers will be highlighted.
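The simplest combination scheme is a majority vote over the predictions of several classifiers. The sketch below assumes each classifier is a plain function from an instance to a label; the three threshold rules are hypothetical stand-ins for trained models.

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Return the label predicted by most classifiers for instance x."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Three hypothetical base classifiers, each a simple threshold rule.
def rule_a(x):
    return "high" if x[0] > 5 else "low"

def rule_b(x):
    return "high" if x[1] > 5 else "low"

def rule_c(x):
    return "high" if x[0] + x[1] > 10 else "low"

ensemble = [rule_a, rule_b, rule_c]
print(majority_vote(ensemble, (6, 3)))
print(majority_vote(ensemble, (4, 8)))
```

With an odd number of two-class voters there are no ties; other combination techniques, such as weighted voting, refine this basic idea.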



7. Techniques for selecting variables

Study of basic concepts and techniques, from both the univariate and the multivariate points of view. Applications of this type of technique: identifying the most important genes in an illness (a new area of bioinformatics).
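A univariate filter scores each variable on its own and keeps the best-ranked ones. The sketch below ranks variables by the absolute Pearson correlation with a binary class label; this score is an assumption for the example (many other scores exist), and the function names and data are hypothetical. Multivariate methods, which evaluate subsets of variables jointly, are not shown.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either sequence is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_features(rows, labels):
    """Feature indices sorted by |correlation with the label|, best first."""
    columns = list(zip(*rows))
    scores = [abs(pearson(col, labels)) for col in columns]
    return sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)

# Toy data, made up for illustration: only the second variable tracks the class.
rows = [(3.0, 0.1, 1.0), (1.0, 0.2, 1.0), (2.0, 0.9, 1.0), (2.0, 1.1, 1.0)]
labels = [0, 0, 1, 1]

print(rank_features(rows, labels))
```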



8. Non-supervised classification (clustering)

Main clustering techniques. Characteristics of this type of problem and how it differs from supervised classification. Practical examples: image segmentation, grouping foodstuffs by their nutritional characteristics, customer segmentation, and targeted marketing and advertising.
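As one example of a clustering technique, the sketch below implements the basic k-means iteration (Lloyd's algorithm). Choosing k-means is an assumption, since the text above does not name specific algorithms; the initial centroids are passed in explicitly and the data are invented.

```python
import math

def k_means(points, centroids, iterations=10):
    """Basic k-means: alternate between assigning points to the nearest
    centroid and moving each centroid to the mean of its cluster.
    Returns the final centroids and the clusters from the last assignment."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Toy data, made up for illustration: two compact groups, no labels.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])

print(centroids)
```

Unlike the supervised methods, no class labels are used at any point: the groups emerge from the distances between instances alone.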



9. Introduction to heuristic searches and genetic algorithms

Study of one of the best-known heuristic search techniques: genetic algorithms. Their usefulness in solving variable selection problems. Practical examples: design problems (aircraft, Meccano), composition of musical scores, travel agency problems.
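The sketch below shows a minimal genetic algorithm over bit strings, where each bit marks whether a variable is selected. All design choices (tournament selection, one-point crossover, bit-flip mutation, elitism, the parameter values and the toy fitness function) are assumptions made for illustration rather than the course's own formulation.

```python
import random

def genetic_algorithm(fitness, n_bits, pop_size=20, generations=30,
                      mutation_rate=0.05, seed=0):
    """Maximise `fitness` over bit lists of length n_bits.
    Returns the best individual and the best fitness seen per generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        next_gen = [population[0][:]]  # elitism: the best survives unchanged
        while len(next_gen) < pop_size:
            # Tournament selection: best of three random individuals.
            p1 = max(rng.sample(population, 3), key=fitness)
            p2 = max(rng.sample(population, 3), key=fitness)
            cut = rng.randrange(1, n_bits)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness), history

# Hypothetical variable selection problem: variables 0, 2 and 5 are useful,
# and selecting any other variable is penalised.
RELEVANT = {0, 2, 5}

def subset_fitness(mask):
    return sum(1 if i in RELEVANT else -1 for i, bit in enumerate(mask) if bit)

best, history = genetic_algorithm(subset_fitness, n_bits=8)
print(best, subset_fitness(best))
```

In a real variable selection problem the fitness would be, for instance, the cross-validated accuracy of a classifier trained on the selected variables.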



10. Introduction to neural networks

Basic mechanisms of a neural network classification structure. Main neural network architectures. This topic serves as motivation for a further course in the Faculty: "Machine Learning and Neural Networks".
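The simplest neural classification unit is a single perceptron. The sketch below trains one with the classic perceptron update rule on the logical AND function; the function names, learning rate and number of epochs are assumptions made for the example.

```python
def train_perceptron(samples, epochs=20, lr=1.0):
    """Train a single perceptron with a step activation.
    samples: list of (inputs, target) pairs with target in {0, 1}."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            activation = sum(w * x for w, x in zip(weights, inputs)) + bias
            output = 1 if activation > 0 else 0
            error = target - output  # 0 when the prediction is correct
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias

def predict(weights, bias, inputs):
    """Step-activation output of the trained perceptron."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0

# Logical AND: linearly separable, so the perceptron rule converges.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)

print([predict(weights, bias, x) for x, _ in and_gate])
```

A single perceptron can only separate classes with a straight line (or hyperplane); multi-layer architectures remove that limitation.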

Methodology

Three sessions per week: two theoretical lessons and one practical computer laboratory (using a personal laptop or a computer provided by the Faculty).



Assessment systems

  • Continuous Assessment System
  • Final Assessment System
  • Tools and qualification percentages:
    • Written test to be taken (%): 60
    • Individual work (%): 40

Ordinary Call: Orientations and Disclaimer

A midterm exam, covering 70% of the theoretical material, will be held by the beginning of November.

A final exam, covering the rest of the theoretical material and requiring a minimum mark, will be held in January.

At least two deadlines will be announced for submitting the practical laboratory work developed by the student.



To pass the subject, students must pass both parts: theory and practice.



In case of lockdown: tests, interviews and assignments will be carried out through the online systems of the UPV-EHU.

Extraordinary Call: Orientations and Disclaimer

A final exam in January covering 100% of the theoretical material.

If the student has not submitted the practical laboratory work during the course, it must be submitted to the teacher one week before the final theoretical exam.



To pass the subject, students must pass both parts: theory and practice.



In case of lockdown: tests, interviews and assignments will be carried out through the online systems of the UPV-EHU.





Compulsory materials

The "egela" system is used to guide the day-to-day running of the course: it hosts the material for the theoretical lessons as well as the instructions for the practical laboratory sessions.

Bibliography

Basic bibliography

- L. Gatto (2020). An Introduction to Machine Learning with R. https://github.com/lgatto/IntroMachineLearningWithR/

- H. Wickham, G. Grolemund (2017). R for Data Science. https://r4ds.had.co.nz/

- I. H. Witten, E. Frank (2005). Data Mining. Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann. (Second edition)

- B. Sierra (ed.) (2006). Aprendizaje Automático: conceptos básicos y avanzados. Prentice Hall.

- E. Alpaydin (2004). Introduction to Machine Learning. MIT Press.

- T. Mitchell (1997). Machine Learning. McGraw Hill.

- J. Han, M. Kamber (2006). Data Mining: Concepts and Techniques. Morgan Kaufmann. (Second edition)

In-depth bibliography

- O. Pourret, P. Naïm, B. Marcot (2008). Bayesian networks: a practical guide to applications. Wiley.
- L.I. Kuncheva (2004). Combining Pattern Classifiers. Wiley.
- H. Liu, H. Motoda (ed.) (2008). Computational Methods of Feature Selection. Chapman & Hall/CRC.
- C.M. Bishop (2006). Pattern Recognition and Machine Learning. Springer.
- S. Brunak, P. Baldi (2001). Bioinformatics: the machine learning approach. MIT Press. (Second edition).
- B. Liu (2006). Web Data Mining: exploring hyperlink, contents and usage data. Springer.

Journals

- Machine Learning Journal. Springer.
- Journal of Machine Learning Research. Electronic publication.
- Data Mining and Knowledge Discovery. Springer.
- Bioinformatics. Oxford University Press.

Web addresses

- WEKA software: http://www.cs.waikato.ac.nz/ml/weka/
- Benchmark dataset repository (University of California Irvine): http://archive.ics.uci.edu/ml/
- A list of intuitive data mining applications, described in an accessible, non-specialist style (updated by the teacher): http://www.sc.ehu.es/ccwbayes/members/inaki/DM-applications.htm
- LiO software for heuristic optimization: http://www.dsi.uclm.es/simd/SOFTWARE/LIO/



Examining board of the 5th, 6th and exceptional call

  • AZCUNE GALPARSORO, GORKA
  • INZA CANO, IÑAKI
  • SIERRA ARAUJO, BASILIO

Groups

16 Theory (Spanish - Afternoon)

Calendar
Weeks | Monday | Tuesday | Wednesday | Thursday | Friday
1-15

14:00-15:30 (1)

15:30-17:00 (2)

Teaching staff

16 Applied laboratory-based groups-1 (Spanish - Afternoon)

Calendar
Weeks | Monday | Tuesday | Wednesday | Thursday | Friday
1-15

17:00-18:30 (1)

Teaching staff

16 Applied laboratory-based groups-2 (Spanish - Afternoon)

Calendar
Weeks | Monday | Tuesday | Wednesday | Thursday | Friday
1-15

17:00-18:30 (1)

Teaching staff

16 Applied laboratory-based groups-3 (Spanish - Afternoon)

Calendar
Weeks | Monday | Tuesday | Wednesday | Thursday | Friday
1-15

12:00-13:30 (1)

Teaching staff

31 Theory (Basque - Morning)

Calendar
Weeks | Monday | Tuesday | Wednesday | Thursday | Friday
1-15

09:00-10:30 (1)

10:30-12:00 (2)

Teaching staff

31 Applied laboratory-based groups-1 (Basque - Morning)

Calendar
Weeks | Monday | Tuesday | Wednesday | Thursday | Friday
1-15

12:00-13:30 (1)

Teaching staff

31 Applied laboratory-based groups-2 (Basque - Morning)

Calendar
Weeks | Monday | Tuesday | Wednesday | Thursday | Friday
1-15

14:00-15:30 (1)

Teaching staff