General details of the subject
- Face-to-face degree course
Description and contextualization of the subject
This course is an introduction to corpus linguistics. We will start with a brief introduction to textual corpora, including linguistic annotation and representation schemas. We will then address the extraction of relevant information from corpora, such as collocations and keywords, using statistical and distributional techniques. Finally, we will learn the XML markup language. During the module we will introduce several corpora in various languages (English, Spanish, Basque, etc.).
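As an informal illustration of the statistical techniques mentioned above, the following minimal sketch ranks adjacent word pairs by pointwise mutual information (PMI), one common association measure for finding collocation candidates. It is not taken from the course materials: the function name `pmi_bigrams`, the `min_count` threshold, and the toy corpus are all assumptions made for this example.

```python
import math
from collections import Counter


def pmi_bigrams(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    Hypothetical helper written for illustration only.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        # Rare pairs give unreliable PMI values, so skip them.
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    # Highest-scoring pairs first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy corpus, invented for this example.
text = (
    "corpus linguistics studies language through corpora "
    "corpus linguistics relies on annotated corpora "
    "the language of the corpus is basque"
)
tokens = text.split()
ranked = pmi_bigrams(tokens)
for pair, score in ranked:
    print(pair, round(score, 2))
```

On a real corpus the tokens would come from a proper tokenizer, and a higher `min_count` would usually be needed, since PMI tends to overrate rare pairs.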
|SOROA ECHAVE, AITOR||University of the Basque Country||Profesorado Agregado||Doctor||Bilingual||Science of Computation and Artificial Intelligence||email@example.com|
|Ability to design and manage big linguistic resources (textual and speech corpora, multilingual corpora and lexical-semantic databases).||40.0 %|
|Ability to develop heuristics and to modify classic algorithms to adapt them for specific tasks.||20.0 %|
|Ability to design and manage systems based on standard annotation languages based on XML such as TEI or NAF.||40.0 %|
|Type||Face-to-face hours||Non face-to-face hours||Total hours|
|Applied computer-based groups||20||30||50|
|Name||Hours||Percentage of classroom teaching|
|Computer work practice, laboratory, site visits, field trips, external visits||50.0||40 %|
|Name||Minimum weighting||Maximum weighting|
|Attendance and participation||20.0 %||20.0 %|
|Portfolio||20.0 %||20.0 %|
|Practical tasks||40.0 %||40.0 %|
|Presentations||20.0 %||20.0 %|
Learning outcomes of the subject
In this course the students will learn the principles of corpus linguistics and linguistic annotation, including markup languages such as XML. At the end of the course, the students will be able to extract relevant information from textual corpora using statistical analysis.
Syllabus
1. Introduction to Corpus Linguistics
2. Corpus characteristics and types
- Corpus examples
3. Corpus annotation
- Common annotation tags and levels of analysis
4. Linguistic representation
- The XML markup language
- Standards for linguistic representation (TEI, NAF, AWA)
- Unix tools
- Word frequencies and Zipf's law
- Keyword extraction
- XML and XPath
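As a rough illustration of how two of the topics listed above fit together (XPath queries and word frequencies), here is a minimal sketch using only the Python standard library. The XML fragment is a simplified, invented example and does not follow TEI or NAF exactly; the element name `w` and the attribute `pos` are assumptions. Note that `xml.etree.ElementTree` supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented, simplified annotation: <s> is a sentence, <w> a token with a POS tag.
SAMPLE = """<text>
  <s n="1"><w pos="DET">the</w><w pos="NOUN">corpus</w><w pos="VERB">contains</w><w pos="DET">the</w><w pos="NOUN">texts</w></s>
  <s n="2"><w pos="DET">the</w><w pos="NOUN">corpus</w><w pos="VERB">grows</w></s>
</text>"""

root = ET.fromstring(SAMPLE)

# XPath-style query: every <w> element anywhere below the root.
words = [w.text for w in root.findall(".//w")]

# XPath predicate on an attribute: only tokens tagged as nouns.
nouns = [w.text for w in root.findall(".//w[@pos='NOUN']")]

# Word frequencies, sorted from most to least frequent.
freqs = Counter(words).most_common()

# Zipf's law says frequency is roughly inversely proportional to rank,
# so rank * frequency stays roughly constant in a large corpus.
for rank, (word, freq) in enumerate(freqs, start=1):
    print(rank, word, freq, rank * freq)
```

A sample this small cannot show the Zipfian pattern; the loop only shows how the rank-frequency table would be built from an annotated corpus.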
Basic bibliography
Aarts, J. and Meijs, W. (eds.) (1986) Corpus Linguistics II. Amsterdam: Rodopi.
Aijmer, K. and Altenberg, B. (eds.) (1991) English Corpus Linguistics: Studies in Honour of Jan Svartvik. London: Longman.
Anthony, L. (2013) "A critical look at software tools in corpus linguistics". Linguistic Research, 30(2), pp. 141-161.
Baker, P. (2010) Sociolinguistics and Corpus Linguistics. Edinburgh University Press, Edinburgh.
Garside, R., Leech, G. and McEnery, T. (1997) Corpus Annotation. Longman, Harlow.
Jurafsky, D. and Martin, J.H. (2000) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall.
Lawler, J. and Aristar Dry, H. (1998) Using Computers in Linguistics: A Practical Guide. Routledge.
Leech, G. and Fallon, R. (1992) "Computer Corpora - What Do They Tell Us About Culture". ICAME Journal, 29-50.
McEnery, T. and Hardie, A. (2012) Corpus Linguistics: Method, Theory and Practice. Cambridge University Press, Cambridge.
Text Encoding and Interchange, TEI P5 (2016) Chicago and Oxford: Text Encoding Initiative.