What Are Language Corpora?

Introduction to Corpora

A corpus is a collection of texts or text extracts that have been put together to be used as a sample of a language or language variety.

2 Mar 2022

2010

LDC2010S01 Fisher Spanish Speech
LDC2010S02 WTIMIT 1.0
LDC2010S03 2003 NIST Speaker Recognition Evaluation
LDC2010S07 Asian Spoken Language Sampler
LDC2010T01 NIST Open Machine Translation 2008 Evaluation (MT08) Selected Reference and System Translations
LDC2010T02 Czech Broadcast News MDE Transcripts
LDC2010T03 GALE Phase 1 Chinese Newsgroup Parallel Text - Part 2
LDC2010T04 Fisher Spanish - Transcripts
LDC2010T05 NPS Internet Chatroom Conversations, Release 1.0
LDC2010T07 Chinese Treebank 7.0
LDC2010T08 Arabic Treebank: Part 3 v 3.2
LDC2010T09 ACE 2005 Mandarin SpatialML Annotations
LDC2010T10 NIST 2002 Open Machine Translation (OpenMT) Evaluation
LDC2010T11 NIST 2003 Open Machine Translation (OpenMT) Evaluation
LDC2010T12 NIST 2004 Open Machine Translation (OpenMT) Evaluation
LDC2010T13 Arabic Treebank: Part 1 v 4.1
LDC2010T14 NIST 2005 Open Machine Translation (OpenMT) Evaluation
LDC2010T15 Message Understanding Conference 7 Timed (MUC7_T)
LDC2010T17 NIST 2006 Open Machine Translation (OpenMT) Evaluation
LDC2010T18 ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0
LDC2010T19 Korean Newswire Second Edition
LDC2010T21 NIST 2008 Open Machine Translation (OpenMT) Evaluation
LDC2010T22 Manually Annotated Sub-Corpus First Release
LDC2010T23 NIST 2009 Open Machine Translation (OpenMT) Evaluation
LDC2010V01 TRECVID 2004 Keyframes & Transcripts
LDC2010V02 TRECVID 2006 Keyframes

2009

LDC2009E58 TAC 2009 KBP Evaluation Reference Knowledge Base
LDC2009L01 An English Dictionary of the Tamil Verb Second Edition
LDC2009S01 CSLU: Numbers Version 1.3
LDC2009S02 Czech Broadcast Conversation Speech
LDC2009S03 CSLU: S4X Release 1.2
LDC2009T01 English CTS Treebank with Structural Metadata
LDC2009T02 GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1
LDC2009T03 GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1
LDC2009T04 2007 NIST Language Recognition Evaluation Test Set
LDC2009T05 2007 NIST Language Recognition Evaluation Supplemental Training Set
LDC2009T06 GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2
LDC2009T07 Unified Linguistic Annotation Text Collection
LDC2009T08 Japanese Web N-gram Version 1
LDC2009T09 GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2
LDC2009T10 Language Understanding Annotation Corpus
LDC2009T11 REFLEX Entity Translation Training/DevTest
LDC2009T12 2008 CoNLL Shared Task Data
LDC2009T13 English Gigaword Fourth Edition
LDC2009T14 Tagged Chinese Gigaword Version 2.0
LDC2009T15 GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1
LDC2009T20 Czech Broadcast Conversation MDE Transcripts
LDC2009T21 Spanish Gigaword Second Edition
LDC2009T22 Arabic Newswire English Translation Collection
LDC2009T23 FactBank 1.0
LDC2009T24 OntoNotes Release 3.0
LDC2009T25 Web 1T 5-gram, 10 European Languages Version 1
LDC2009T26 NXT Switchboard Annotations
LDC2009T27 Chinese Gigaword Fourth Edition
LDC2009T28 French Gigaword Second Edition
LDC2009T29 ACL Anthology Reference Corpus
LDC2009T30 Arabic Gigaword Fourth Edition
LDC2009V01 Audiovisual Database of Spoken American English

2008

LDC2008L02 Hindi WordNet
LDC2008L03 Global Yoruba Lexical Database v. 1.0
LDC2008T02 GALE Phase 1 Arabic Blog Parallel Text
LDC2008T06 GALE Phase 1 Chinese Blog Parallel Text
LDC2008T07 Chinese Proposition Bank 2.0
LDC2008T08 GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2
LDC2008T09 GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2
LDC2008T13 BLLIP North American News Text, Complete
LDC2008T17 CALLHOME Mandarin Chinese Transcripts - XML Version
LDC2008T18 GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3
LDC2008T20 PennBioIE CYP 1.0

2007

LDC2007E37 CoNLL 2007 Shared Task English Test Set, Part 1
LDC2007E38 CoNLL 2007 Shared Task Arabic Test Set, Part 1
LDC2007E39 CoNLL 2007 Shared Task Czech Test Set, Part 1
LDC2007E40 CoNLL 2007 Shared Task English Test Set, Part 2
LDC2007E41 CoNLL 2007 Shared Task Arabic Test Set, Part 2
LDC2007E42 CoNLL 2007 Shared Task Czech Test Set, Part 2
LDC2007T22 2001 Topic Annotated Enron Email Data Set
LDC2007T36 Chinese Treebank 6.0

2006

LDC2006S33 Middle East Technical University Turkish Microphone Speech v. 1.0
LDC2006S42 Korean Broadcast News Speech
LDC2006T03 Korean Propbank
LDC2006T06 ACE 2005 Multilingual Training Corpus
LDC2006T09 Korean Treebank Annotations Version 2.0
LDC2006T13 Web 1T 5-gram Version 1
LDC2006T14 Korean Broadcast News Transcripts