LING 2050 Special Topics in Linguistics: Corpus Linguistics

Spring 2010, University of Pittsburgh

Instructor: Na-Rae Han
Meetings: MW 9:30am -- 10:45am, 340 Cathedral of Learning
This course is an introduction to the use of corpora in the study of language. In modern linguistics, the term "corpus" refers to a large collection of electronic texts that represents a sample of a particular variety or use of language(s). In a more general sense, the term refers to any collection of authentic, naturally occurring texts in electronic form. The use of corpora has become increasingly popular in both empirical linguistic research and language engineering. In this class, students will be introduced to the field of corpus linguistics; learn how to use existing corpora such as the British National Corpus and the Corpus of Contemporary American English; learn the basic computational skills and quantitative methods needed to carry out a corpus investigation; find out how corpora are influencing recent trends in linguistic research; and have opportunities to apply corpus-based methods in their own work.

Course Syllabus: [pdf]
Term Project Overview: [pdf]
Class Schedule (Revised)
| Week | Date | Topic | Readings |
|------|------|-------|----------|
| 1 | 1/06 (W) | Introduction [pdf] | [1]A1 |
| 2 | 1/11 (M) | Corpus basics [pdf] | [1]A1,A2 |
|  | 1/13 (W) | Corpus basics [pdf] | [1]A3 |
| 3 | 1/20 (W) | Corpus annotation [pdf] | [1]A4,A5 |
| 4 | 1/25 (M) | Survey of available corpora [pdf] | [1]A7 |
|  | 1/27 (W) | Survey of available corpora [pdf] / Lab 1: Navigating in terminals; displaying files |  |
| 5 | 2/01 (M) | Survey of available corpora [pdf] | [1]A7 |
|  | 2/03 (W) | Lab 2: Managing files; Searching text file contents [Exercise] |  |
| 6 | 2/08 (M) | Snow day: class canceled* |  |
|  | 2/10 (W) | Snow day: class canceled* |  |
| 7 | 2/15 (M) | Collocation, frequency, corpus statistics [pdf] | [1]A6,C1 [b], [c] |
|  | 2/17 (W) | Lab 3: Text transformation; Word lists and type frequency tables [HW] | [a] |
| 8 | 2/22 (M) | Lexical studies: collocation, phraseology, semantic prosody | [1]A10.2, (1) |
|  | 2/24 (W) | Lab 4: Compiling N-grams; Regular Expressions | [a] |
| 9 | 3/01 (M) | Grammatical studies | [1]A10.3, (2) |
|  | 3/03 (W) | Lab 5: Regular expressions |  |
|  |  | Spring break |  |
| 10 | 3/15 (M) | Language variation studies | [1]A10.4, A10.5, (3) |
|  | 3/17 (W) | Lab 6: CLAN, AntConc | [d], [e] |
| 11 | 3/22 (M) | Contrastive and diachronic studies | [1]A10.6, 10.7, (4) |
|  | 3/24 (W) | Lab 7: NLTK | [f], [g] |
| 12 | 3/29 (M) | Corpora in language education: Issues of language description; General and specific applications | [2]Ch.6,7,8; (5) |
|  | 3/31 (W) | Lab 8: NLTK | [f], [g] |
| 13 | 4/05 (M) | Corpora in language education: Studies | [1]A10.8, (5) |
|  | 4/07 (W) | Lab 9: R |  |
| 14 | 4/12 (M) | Stylistics, stylometry, translation studies | [1]A10.13, (6) |
|  | 4/14 (W) | Lab 10: R |  |
| 15 | 4/19 (M) | Term project presentation |  |
|  | 4/21 (W) | Term project presentation |  |
| 16 | 4/30 (F) | Project paper due |  |
*A make-up class will be scheduled.

Assignment Schedule:

  1. 1/20 -- 2/01 Class presentation: survey of two corpora of your own choice
  2. 2/17 Homework assignment: processing text files
  3. 3/03 Homework assignment: regular expression search, N-gram stats
  4. 3/31 Homework assignment: corpus processing using NLTK
  5. 3/01 -- 4/12 Class presentation: literature reviews and case studies
  6. Additionally, a few take-home exercises will be given for lab classes.

Main textbook:
[1] Corpus-Based Language Studies: An Advanced Resource Book. Tony McEnery et al. Routledge, 2006.
*See References section below for articles covered in the B and C units of this book.

Supplementary books (a couple of chapters will be used):
[2] Corpora in Applied Linguistics. Susan Hunston. Cambridge, 2002.
      Ch.6,7,8 Corpora and language teaching
[3] From Corpus to Classroom. O'Keeffe et al. Cambridge, 2007.
[4] Corpus Linguistics. McEnery & Wilson, Edinburgh Univ. Press, 2001.


References:

    All topics
  • Martin Wynne (editor). 2005. Developing Linguistic Corpora: a Guide to Good Practice. Oxford: Oxbow Books. Available online from this [link]
  • John Sinclair. 2005. "How to Build a Corpus" in Developing Linguistic Corpora: a Guide to Good Practice, ed. M. Wynne. Oxford: Oxbow Books: 1-16. Available online from this [link]
  • [b] Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press. [link] Ch.5 Collocations [pdf]
  • Bird, S., E. Klein and E. Loper. 2009. Natural Language Processing with Python. O'Reilly Media. [home][e-book][book]
  • Gries, S. 2009. Quantitative Corpus Linguistics with R: A Practical Introduction. Routledge. [link]
  • Baayen, R. H. 2008. Analyzing Linguistic Data: A Practical Introduction to Statistics using R. Cambridge University Press. [link]
  • [c] Zipf's law [link]
  • MacWhinney, B. 'The CHILDES System'. [pdf]

    (1) Lexical studies: collocation, phraseology, semantic prosody
  • [1]B3: Partington, A. 2004. '"Utterly content in each other's company": semantic prosody and semantic preference'. International Journal of Corpus Linguistics 9:1.
  • Hunston, S. 2007. 'Semantic prosody revisited'. In Moon, Rosamund (ed.), Words, grammar, text: revisiting the work of John Sinclair: Special issue of International Journal of Corpus Linguistics 12:2. [link]
  • Biber, D. 2009. 'A corpus-driven approach to formulaic language in English: Multi-word patterns in speech and writing'. International Journal of Corpus Linguistics 14:3. [link]
  • Forchini, P. and Amanda Murphy. 2008. 'N-grams in comparable specialized corpora: Perspectives on phraseology, translation, and pedagogy'. In Römer, Ute and Rainer Schulze (eds.), Patterns, meaningful units and specialized discourses: Special issue of International Journal of Corpus Linguistics 13:3. [link]

    (2) Grammatical studies
  • [1]B3: Carter, R. and McCarthy, M. 1999. 'The English get-passive in spoken discourse: description and implications for an interpersonal grammar'. English Language and Linguistics 3:1.
  • [1]B3: Kreyer, R. 2003. 'Genitive and of-construction in modern written English: processability and human involvement'. International Journal of Corpus Linguistics 8:1.
  • Nesselhauf, N. and Ute Römer. 2007. 'Lexical-grammatical patterns in spoken English: The case of the progressive with future time reference'. International Journal of Corpus Linguistics 12:3. [link]

    (3) Register and language variation
  • [1]B4: Biber, D. 1995a. 'On the role of computational, statistical, and interpretive techniques in multi-dimensional analysis of register variation'. Text 15:3.
  • [1]B4,C5: Biber, D. 1988. Variation Across Speech and Writing. Cambridge: Cambridge University Press.
  • Biber, D. 1993. 'Using register-diversified corpora for general language studies'. Computational Linguistics 19:2. [link]
  • [1]C4: McEnery, A. and Xiao, Z. 2004. 'Swearing in modern British English: the case of FUCK in the BNC'. Language and Literature 13:3.
  • [1]B4: Lehmann, H. 2002. 'Zero subject relative constructions in American and British English'. New Frontiers in Corpus Research, pp. 163-177. Amsterdam: Rodopi.
  • [1]B4: Kachru, Y. 2003. 'On definite reference in world Englishes'. World Englishes 22:4.
  • Peters, P. 'A study of backchannels in regional varieties of English, using corpus mark-up as the means of identification'. International Journal of Corpus Linguistics 12:4. [link]

    (4) Contrastive and diachronic studies
  • [1]B5: Altenberg, B. and Granger, S. 2002. 'Recent trends in cross-linguistic lexical studies'. In B. Altenberg and S. Granger (eds.), Lexis in Contrast, pp. 3-48. Amsterdam: John Benjamins.
  • [1]B5: McEnery, A., Xiao, Z. and Mo, L. 2003. 'Aspect marking in English and Chinese'. Literary and Linguistic Computing 18:4.
  • [1]B5: Kilpiö, M. 1997. 'On the forms and functions of the verb to be from Old to Modern English.' In M. Rissanen, M. Kytö and K. Heikkonen (eds.), English in Transition: Corpus-Based Studies in Linguistic Variation and Genre Styles. 87-120. Berlin: Mouton de Gruyter.
  • Millar, N. 2009. 'Modal verbs in TIME: Frequency changes 1923-2006'. International Journal of Corpus Linguistics 14:2. [link]

    (5) Acquisition, SLA
  • [1]B6: Gavioli, L. and Aston, G. 2001. 'Enriching reality: language corpora in language pedagogy'. ELT Journal 55:3.
  • [1]B6: Thurstun, J. and Candlin, C. 1998. 'Concordancing and teaching of the vocabulary of academic English'. English for Specific Purposes 17: 267-280.
  • Flowerdew, L. 2009. 'Applying corpus linguistics to pedagogy: A critical evaluation'. International Journal of Corpus Linguistics 14:3. [link]
  • Mahlberg, M. 2006. 'Lexical cohesion: Corpus linguistic theory and its application in English language teaching'. In Flowerdew, John and Michaela Mahlberg (eds.), Lexical Cohesion and Corpus Linguistics: Special issue of International Journal of Corpus Linguistics 11:3. [link]
  • Lu, X. 2009. 'Automatic measurement of syntactic complexity in child language acquisition'. International Journal of Corpus Linguistics 14:1. [link]

    (6) Stylistics, stylometry, translation studies
  • Fischer-Starcke, B. 'Keywords and frequent phrases of Jane Austen's Pride and Prejudice: A corpus-stylistic analysis'. International Journal of Corpus Linguistics 14:4. [link]
  • Grieve, J. 2007. 'Quantitative authorship attribution: an evaluation of techniques'. Literary and Linguistic Computing 22:3. [link]
  • Dayrell, C. 2007. 'A quantitative approach to compare collocational patterns in translated and non-translated texts'. International Journal of Corpus Linguistics 12:3. [link]

Corpus Resource Pages:

Corpora [Page]

  • Web-searchable corpora
  • Other easy-access corpora
  • For-fee/limited-access corpora
  • Corpus archives
  • Corpora in other languages

Tools and More [Page]

  • Organizations
  • Tools and software
  • Other corpus resource pages

Lab Pages:

  • Lab 0: Setting up your computing environment
  • Lab 1: Navigating in terminal environment; displaying file contents
  • Lab 2: Managing multiple files; basics of searching text file contents [Exercise]
  • Lab 3: Basics of text transformation; Extracting word lists and type frequency tables [Homework 1]
  • Lab 4: N-grams; Regular Expressions
  • Lab 5: More Regular Expressions; Perl
  • Lab 6: Using AntConc [Homework 2]
  • Lab 7: Installing Python and NLTK; using NLTK corpora

Lab Help Pages:

  • Configuring your terminal environment [OS-X][Cygwin]
  • Unix command reference sheet [Page]

Lab Forum is located in Pitt's CourseWeb class page.

Lab References: