Sample Corpora
Go to a corpus directory, download the binary corpus file (.txm), then load it in TXM with the ‘File > Load’ command.
For some corpora the sources are also provided (.zip), so you can import from the sources instead and tune the import to your preferences.
Written texts
French
- discours: corpus of various French presidents’ speeches, published by Damon Mayaffre.
- fleurs-du-mal: Les Fleurs du mal (The Flowers of Evil) by Charles Baudelaire, edition of Jean-Marie Viprey.
- mpt: corpus of French National Assembly debates on the “Mariage pour tous” law of 2013 from the mariagepourtousInXML project.
- quete-du-graal-tei: Queste del Saint Graal (Quest for the Holy Grail), edition of Christiane Marchello-Nizia and Alexei Lavrentiev, based on the Old French manuscripts ‘Lyon, Palais des Arts 77 (ms. K) (fol. 160a-224d)’ and ‘Paris, BNF n. acq. fr. 1119 (ms. Z)’, ca. 1225 or 1230.
- tdm80j: Le tour du monde en quatre-vingts jours (Around the World in Eighty Days), Jules Verne, 1873, edition of J. Hetzel et Cie. Synoptic edition with Wikisource facsimile images.
- txm-odt-manual: TXM User’s manual as a TXM corpus.
- voeux: corpus of 54 New Year’s Day speeches of French presidents (1959-2009), published by Jean-Marc Leblanc.
- voeux-fr: See voeux.
English
- brown: corpus of 500 texts written in American English in 1961, published by W. N. Francis and H. Kucera (this version is based on the XML TEI version from the NLTK project).
- leviathan: Leviathan by Thomas Hobbes (1588-1679). XML-TEI P5 text sample from the EEBO-TCP Phase 1 project.
German
- voeux-rfa: corpus of the Christmas and New Year’s addresses delivered by the Presidents and Chancellors of the Federal Republic of Germany since 1987, contributed by Sascha Diwersy, Universität zu Köln.
Record transcriptions (synchronized)
- p1s8-course-transcription: speech transcription and audio/video recording of a high school physics class (in French).
See Tiberghien Andrée et al., Partager un corpus vidéo dans la recherche en éducation : analyses et regards pluriels dans le cadre du projet ViSA, Éducation & didactique 3/2012 (vol. 6) [online at openedition.org].
Useful for practicing video playback from concordances (requires the Media Player extension).
Parallel corpora (multilingual)
- uno-tmx-sample: sample of the United Nations General Assembly Resolutions six-language parallel corpus (Arabic, Chinese, English, French, Russian and Spanish), http://www.uncorpora.org.
See Alexandre Rafalovitch, Robert Dale. 2009. United Nations General Assembly Resolutions: A Six-Language Parallel Corpus. In Proceedings of the MT Summit XII, pages 292-299, Ottawa, Canada, August.
Import it with the XML-TMX import module.
Annotated corpora
- CORPUS110CYL067: a single syntactically parsed text from the MASC corpus. Useful for practicing TIGER Search queries (see TIGER-XML import validation; requires the TIGER Search extension).
Some corpora are also downloadable from the TXM demo portal.