Subject

Machine Learning (II)

General details of the subject

Mode: Face-to-face degree course
Language: English

Description and contextualization of the subject

Machine learning is based on a set of data-modeling techniques that originate in artificial intelligence and statistics. The models are learned from data and are commonly used for classification and/or description.

The field has grown rapidly in prominence across application areas such as bioinformatics, industry, finance, and natural language processing.

The course focuses on the principal tools of a classical “data analysis pipeline”: data preprocessing, feature selection, learning scenarios, and evaluation and comparison. The techniques are illustrated with powerful machine learning software and applied to different natural language processing problems.

Teaching staff

Name	Institution	Category	Doctor	Teaching profile	Area	E-mail
INZA CANO, IÑAKI	University of the Basque Country	Full Professor	Doctor	Bilingual	Computer Science and Artificial Intelligence	inaki.inza@ehu.eus

Competencies

Name	Weight
Learn skills to deal with strategies and tools for natural language processing.	30.0 %
Learn skills to deal with machine learning methods that analyze text corpora.	70.0 %

Study types

Type	Face-to-face hours	Non face-to-face hours	Total hours
Lecture-based	10	15	25
Applied computer-based groups	20	30	50

Training activities

Name	Hours	Percentage of classroom teaching
Computer work practice, laboratory, site visits, field trips, external visits	50.0	40 %
Lectures	25.0	40 %

Assessment systems

Name	Minimum weighting	Maximum weighting
Practical tasks	0.0 %	100.0 %

Learning outcomes of the subject

Identify the principal machine learning scenarios: differences and similarities.

Identify the appropriate machine learning technique to apply in a specific machine learning scenario.

Learn the basic, standard steps of a classic machine learning “pipeline”.

Acquire skills in the use of R-project libraries to create a “document-term” matrix from a corpus and apply machine learning techniques to it.

Ordinary call: guidelines and withdrawal

Continuous evaluation:

Students must attend at least 80% of the sessions. The evaluation consists of an individual project, summarized as follows:

Starting from raw text (e.g. tweets or comments on social networks, HTML text, a set of text files, etc.), the student must import it and create a corpus. The corpus must pose a supervised problem, that is, it must consist of text documents with different labels. The corpus will be preprocessed with basic text-mining filters (e.g. stop-word removal, stemming, removal of sparse terms, etc.) using R-project's “tm” (“text-mining”) package. The corpus will then be transformed into matrix format so that it can be processed by specialized machine learning software, in our case R's popular “caret” package. A classical supervised pipeline will be applied, consisting of at least the following steps: data loading and exploration, variable preprocessing, corpus partitioning for validation, feature extraction and selection, application of class-imbalance techniques, learning and tuning of classification models, and statistical comparison.

The output of the project will be a “notebook” that interleaves the implemented code with descriptions of what it does and of the design decisions taken.
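As a rough, language-neutral illustration of the preprocessing and matrix-construction steps described above, the following Python sketch builds a small document-term matrix using only the standard library. It is not the “tm” or “caret” API that the project requires: the toy documents, the stop-word list, and the names `tokenize`, `build_dtm` and `min_doc_freq` are assumptions made for this example, and stemming is omitted.

```python
import re
from collections import Counter

# Hypothetical toy corpus: (text, label) pairs standing in for labelled documents.
DOCS = [
    ("The movie was great and the cast was truly great", "pos"),
    ("A great plot and a great ending", "pos"),
    ("The movie was boring and the plot was weak", "neg"),
    ("A weak cast and a boring ending", "neg"),
]

# Tiny illustrative stop-word list (an assumption, not the list shipped with tm).
STOP_WORDS = {"the", "a", "and", "was"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_dtm(texts, min_doc_freq=2):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i.

    Terms appearing in fewer than min_doc_freq documents are dropped,
    a simple stand-in for sparse-term removal.
    """
    counts = [Counter(tokenize(t)) for t in texts]
    doc_freq = Counter(term for c in counts for term in c)
    vocab = sorted(t for t, df in doc_freq.items() if df >= min_doc_freq)
    matrix = [[c[t] for t in vocab] for c in counts]
    return vocab, matrix


texts = [text for text, _ in DOCS]
labels = [label for _, label in DOCS]
vocab, dtm = build_dtm(texts)

print(vocab)
for row, label in zip(dtm, labels):
    print(row, label)
```

Each row of the resulting matrix, paired with its label, is the kind of input a supervised learner expects. The term “truly” occurs in only one document and is therefore filtered out by the document-frequency threshold.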

Single-final evaluation:

Individual project: when a student cannot attend the lessons and requests a single final evaluation, the evaluation will consist of the individual project described above.

Extraordinary call: guidelines and withdrawal

Individual project: when a student cannot attend the lessons and requests a single final evaluation, the evaluation will consist of the individual project described above.

Syllabus

1- General terms in the "data science" world: the "data science" term, the relationship between AI and data science, the "big data" term, the Kaggle repository, kdnuggets.com, data science for a better world...

2- Principal classification scenarios: supervised classification, unsupervised classification (clustering), weakly supervised classification (alternative scenarios). For each learning scenario: structure of the data matrix, type of annotation, real world applications.

3- Semi-supervised classification: usefulness in NLP tasks. Software: the RSSL package in R.

4- One-class classification and outlier detection: usefulness in NLP tasks. Software: R packages.

5- Using statistical tests to compare the accuracy of different classifiers. Software: R, online statistical tests on the web.

6- Feature selection techniques. Techniques for selecting a "competitive" subset of original features.

7- General techniques and filters for data preprocessing. Preprocessing filters for any kind of data: missing data imputation, one-hot encoding, discretization, imbalanced class distributions...

8- "A short introduction to the tm (text mining) package in R: text processing". How to build a proper corpus with text-mining operators and transform it into a document-term matrix for further machine learning analysis, starting from raw text such as files, HTML pages, Twitter... A tutorial using R software.

9- "The machine learning approach: clustering words and classifying documents with R". A tutorial using R software, caret package.

10- "First steps on deep learning for NLP by R’s h2o package (+word2vec)". A tutorial using R software. Optional work.
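Topic 5 above concerns the statistical comparison of classifiers. As a minimal sketch of one such test, the exact McNemar test for two classifiers evaluated on the same test set can be written in a few lines of Python (standard library only). The choice of this particular test, the function name `mcnemar_exact`, and the toy predictions are assumptions made for illustration; the course itself relies on R and online tools.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test for two classifiers on the same test set.

    Returns (b, c, p_value), where b counts cases only classifier A gets right
    and c counts cases only classifier B gets right.
    """
    b = sum(1 for y, a, p in zip(y_true, pred_a, pred_b) if a == y and p != y)
    c = sum(1 for y, a, p in zip(y_true, pred_a, pred_b) if a != y and p == y)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the classifiers never disagree in correctness
    k = min(b, c)
    # Under the null hypothesis each discordant case favours A or B
    # with probability 1/2, so the count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical labels and predictions for ten test documents.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
pred_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
pred_b = [1, 0, 1, 0, 1, 0, 1, 1, 0, 1]

b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(f"A-only correct: {b}, B-only correct: {c}, p-value: {p_value:.3f}")
```

The test only looks at the documents on which exactly one of the two classifiers is correct, which is why it suits paired comparisons on a shared validation partition. With very few discordant cases the test has little power, so a large p-value should not be read as evidence that the classifiers are equivalent.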

Bibliography

Basic bibliography

• M. Kuhn, K. Johnson (2013). Applied Predictive Modeling. Springer.

• ParallelDots, online text analysis APIs for several tasks: sentiment analysis, tags' prediction, keyword generator, entity extraction, comparing similarity of texts, different emotions analysis, intent analysis, abusive text prediction, etc. https://www.paralleldots.com/text-analysis-apis

• sentiment140: an interesting project for automatic sentiment categorization of tweets: http://help.sentiment140.com/

• Stanford TreeBank project. "Recursive deep models for semantic compositionality over a semantic treebank". https://nlp.stanford.edu/sentiment/

• RDataMining website: Text mining with R: Twitter data analysis: http://www.rdatamining.com/docs/text-mining-with-r

• Awesome sentiment analysis: A curated list of Sentiment Analysis methods, implementations and misc. https://github.com/xiamx/awesome-sentiment-analysis

• "5 things you need to know about sentiment analysis and classification": https://www.kdnuggets.com/2018/03/5-things-sentiment-analysis-classification.html

• Bing Liu's website on "Opinion mining, sentiment analysis and opinion spam detection: the machine learning approach". https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

• 18 NLP key terms, explained for ML practitioners and NLP novices: https://www.kdnuggets.com/2017/02/natural-language-processing-key-terms-explained.html

Erasmus Mundus Master in Language and Communication Technologies (LCT)
