Computer-readable corpora can consist of raw text only, i.e. plain text with no additional information. Many corpora, however, have been provided with some kind of linguistic information, here called mark-up or annotation.
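The difference between raw and annotated text can be illustrated with a minimal sketch. The sentence, the word_TAG notation, the tag names, and the helper functions below are all assumptions made for illustration; real corpora use a variety of annotation schemes and formats.

```python
# Hypothetical example sentence; not drawn from any real corpus.
raw_text = "The cat sat on the mat"

# The same sentence with one possible annotation layer: a part-of-speech
# tag attached to each word as word_TAG. The tag names are illustrative.
annotated_text = "The_DET cat_NOUN sat_VERB on_PREP the_DET mat_NOUN"


def parse_annotated(line):
    """Split a word_TAG line into a list of (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the last underscore, so a word that itself
        # contains an underscore keeps its full form.
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs


def strip_annotation(line):
    """Recover the raw text by discarding the tags."""
    return " ".join(word for word, _ in parse_annotated(line))


tokens = parse_annotated(annotated_text)

# Annotation allows queries that raw text alone cannot answer directly,
# such as listing every word tagged as a noun.
nouns = [word for word, tag in tokens if tag == "NOUN"]
```

Stripping the tags from the annotated line gives back the raw line, which shows that annotation adds information to the text rather than replacing it.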
Types of corpora
There are many different kinds of corpora. They can contain written or spoken (transcribed) language, modern or old texts, and texts from one language or several languages. The texts can be whole books, newspapers, journals, speeches, etc., or consist of extracts of varying length. The kinds of texts included and the way they are combined vary between corpora and corpus types.
'General corpora' consist of general texts, that is, texts that do not belong to a single text type, subject field, or register. An example of a general corpus is the British National Corpus. Some corpora contain texts that are sampled (chosen) from a particular variety of a language, for example, from a particular dialect or from a particular subject area. These corpora are sometimes called 'Sublanguage Corpora'.
Corpora can consist of texts in one language (or language variety) only or of texts in more than one language. If the texts are the same in all languages, i.e. translations, the corpus is called a ParallelCorpus. A ComparableCorpus is a collection of "similar" texts.
Corpora serve as the basis for a number of research tasks within the field of Corpus Linguistics.