What is TXM?
The TXM platform combines powerful and original techniques for the analysis of structured and annotated textual corpora, using modular and open-source components (Heiden 2010; Heiden et al. 2010; Pincemin et al. 2010). It was initiated by the ANR Textometry project[1], which launched a new generation of textometric research in synergy with current corpus and statistical technologies (Unicode, XML, TEI, NLP, CQP and R).
It helps users build and analyze any type of digital textual corpus that may be tagged and structured in XML. It is distributed as desktop software called "TXM" for Windows, Mac and Linux (based on Eclipse RCP technology) and as web portal software (based on GWT technology) for online corpus access.
It is commonly used by research projects in various disciplines of the humanities and social sciences, such as history, literature, geography, linguistics, sociology and political science. Textometric scientific publications are presented at the Statistical Analysis of Textual Data international conference (JADT, Journées d'Analyse statistique des Données Textuelles); see also (Heiden and Pincemin, 2008).[2]
TXM Key Strengths
Usable progressively, from beginner to expert
Beginners benefit from an intuitive interface that includes all the modern elements found in the most common desktop applications (word processors, spreadsheets, email clients, etc.):
- multi-window interface
- actions triggered from the main menu, toolbars, contextual menus or hypertext links
- hierarchical organization of manipulated objects (corpora, sub-corpora, partitions) and results (concordances, indexes, factorial analyses, etc.)
- vector graphics display: whatever the zoom level, information is displayed at the best definition
- window layout saved between work sessions
Experts can use keyboard shortcuts and, above all, scripting capabilities in dynamic programming languages:
- in Groovy, to access the entire TXM platform (developed in Java)
- in R, to access all installed R packages and the objects created by TXM (corpora, word lists, etc.)
Scripts make it possible to automatically reproduce work sessions that are complex (when the corresponding interface manipulations are tedious) or repetitive (scripts pay off when a calculation must be repeated with only a few parameters changed, or when one calculation depends on the result of another). They can be passed on to colleagues, who can in turn execute them themselves or understand and verify how they work. Finally, they make it possible to extend the TXM platform itself (in the form of macros).
Combines simple and accessible text analysis tools with more advanced tools
TXM can be used to produce simple word lists (the vocabulary of a text) or simple KWIC concordances (the list of occurrences of a given word in context). But it also offers more advanced analytical tools, such as the vocabulary statistically most specific to one text of the corpus, or the words statistically most present in the vicinity of a given word. Users choose their own work strategy with TXM according to their preferences and experience, and can progressively adopt more advanced tools when relevant.
For example, if TXM is used only as a concordancer, users can progressively deepen their mastery of the query language of the full-text search engine as they run into the limits of the expressions they use for their work: for instance, by writing a query for a variable word sequence (say, a given word followed by an adjective) rather than searching for a single word.
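As a sketch of this progression, here are CQP-style queries of increasing expressiveness (illustrative only; the pos property name and the ADJ tag value are assumptions that depend on how the corpus was annotated at import):

```
"love"
[word="love" %c]
[word="love" %c] [pos="ADJ"]
```

The first query matches the exact word form "love"; the %c flag makes the match case-insensitive; the last query matches "love" followed by a token tagged as an adjective.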
Processes any text or text extract easily
TXM lets you work directly on text previously copied to the system clipboard by another application (word processor, PDF viewer, browser, email client, etc.) via select and copy. TXM also works on texts organized in directories of text files in various formats.
Scales progressively from plain text to richly structured XML-TEI encoded text
TXM offers a continuous range of import modules covering the most frequently used standard formats:
- TXT: for any plain text coming from word processors, PDFs, websites, etc.;
- XML: for lightly structured texts (only sentences or paragraphs, for example) or even enriched texts (with XML tags encoding lexical properties on certain words);
- TEI: for texts encoded according to the recommendations of the TEI consortium, intended to be capitalized on in long-term projects, shared with other initiatives, or made compatible with archiving systems.
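As an illustration of this progression, here is the same short sentence at the three levels (a sketch only: element choices are up to each project, and the lemma/pos attributes on the w element follow the TEI att.linguistic recommendation):

```
TXT:  She loves words.

XML:  <p><s>She loves words.</s></p>

TEI:  <p><s><w lemma="she" pos="PRON">She</w>
          <w lemma="love" pos="VERB">loves</w>
          <w lemma="word" pos="NOUN">words</w>.</s></p>
```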
A project can apply TXM to its encoded data progressively, from the simplest (and most limited) encoding to the most complex (and richest). TXM thus makes it possible to adapt the cost of corpus encoding to the real needs of the study, especially when those needs are discovered as the corpus is being analyzed. Under these conditions, TXM assists both the encoding activity and the exploitation of corpora.
Manages a great variety of formats and corpus configurations
TXM is not limited to written textual corpora. It also makes it possible to work on recording transcriptions (where the transcription can notably encode speakers and points of temporal synchronization with the original source video or audio) and on parallel corpora whose texts are related by translation (multilingual corpora) or by versioning. This diversity is a guarantee of the robustness of the platform's corpus model.
Manages the most standard formats
TXM imports plain text encoded in Unicode, the international standard for encoding the characters of all the world's writing systems. TXM imports texts encoded in XML, the international W3C standard for encoding textual data. TXM already imports more than a dozen applications of the recommendations of the TEI consortium. In doing so, TXM follows the evolution of these standards, which guarantees the stability of its corpus management capabilities over time.
Its search engine works on words, not just character strings
Its internal search engine, called Corpus Query Processor (CQP), makes it possible to express searches, for display or for counting, over sequences of words rather than characters. It is therefore particularly suitable for phraseological work and for the study of collocations. Words are accessible not only through their graphical (or surface) form, but also through all the information associated with them, such as their lemma or grammatical category. This offers great expressive power in constructing the textual observables to which all textometry tools are applied.
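For example, assuming a corpus whose words carry lemma and pos properties (actual property names and tag values depend on the import and the tagger), the same CQP machinery can address any of these levels:

```
[word="lov.*"]
[lemma="love"]
[lemma="love" & pos="VER.*"]
```

These match, respectively, any word form beginning with "lov", any inflected form of the lemma "love", and forms of "love" tagged as verbs (the VER.* pattern assumes a TreeTagger-like tagset).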
Its corpus model is rich because it is structured
All texts in TXM corpora can be structured according to a tree of structures, each carrying various properties. These structures can be used by the search engine (for example, to confine a search for variable-length word sequences within a given structure) or to build sub-corpus configurations internal to a given text.
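For instance, if texts are structured into sentences (an s structure; structure names depend on the corpus), a CQP-style query can be kept from crossing a sentence boundary:

```
[pos="ADJ"] []{0,3} [pos="NOUN"] within s
```

This finds an adjective followed by a noun with at most three intervening words, with the whole match confined to a single sentence.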
Exports all its results to other software
All tabular results can be exported as CSV for manipulation in other software (such as spreadsheets), and all graphical results can be exported as vector or bitmap images (notably to help with publication).
It is easy for a programmer to modify
The Java language and the Eclipse development environment used to develop the TXM platform are familiar to most programmers, which makes working in the source code very fast. The development team works hard to document its workflow and to produce developer manuals and code documentation.
Its software architecture is robust
TXM is developed in two of the most widely used programming languages in the software industry: Java and C. Within the Java ecosystem, TXM is built on the Eclipse Rich Client Platform (RCP) framework, which complies with the OSGi Alliance's industrial architecture standard.
It can be installed on all platforms
TXM is available as a desktop application for the three operating systems most used by researchers and the general public: Windows, Mac OS X and Linux.
It gives online access to corpora and their analysis
TXM is also available as a web portal server. A TXM portal makes it possible to put corpora prepared with the desktop version online and to give access to their analysis through a simple web browser. People accessing the portal do not need to install TXM or any corpus on their computer.
It’s transparent
By systematically giving access to its source code, its open-source license offers:
- complete transparency about the calculation methods used (a scientific guarantee of verifiability and reproducibility);
- the possibility for anyone to improve it for the benefit of its community of users.
It’s accessible
Distributed free of charge, it is easy to install and update anywhere, especially where means are limited, or simply to try it out.
TXM features
- build sub-corpora from various metadata (properties) of texts (e.g. date of publication, author, type of text, theme);
- build partitions from these properties in order to apply contrastive calculations (e.g. between texts or groups of texts);
- produce KWIC concordances from complex lexical pattern searches built from word properties (e.g. a word with lemma "love" followed, within at most 2 words, by a word starting with "may"). From each concordance line, you can access the corresponding page of the HTML edition for reading;
- calculate the overall vocabulary of a corpus, or the list of attested values of a given word property;
- build a basic HTML edition of each text of the corpus;
- build various contingency tables crossing words, texts and their structures;
- calculate the list of words appearing preferentially in the same contexts as a complex lexical pattern (statistical co-occurrences);
- calculate the words, or word properties, particularly present in one part of the corpus (statistical specificities);
- compute visualizations of the corpus as maps of words, properties or texts (factorial correspondence analysis with the FactoMineR R package);
- build a corpus from various textual sources. 16 import modules are available, including[3]: plain text combined with flat metadata (CSV, Excel or Calc), raw XML/w+metadata[4], XML-TEI Zero+metadata[5], XML-TEI BFM[6], XML-TXM[7], Transcriber+metadata[8], Hyperbase, Alceste, Cordial, and prototypes (TMX, Factiva, etc.);
- integrate the automatic application of natural language processing (NLP) tools to texts. TXM is delivered with a plugin for the TreeTagger part-of-speech tagger and lemmatizer for various languages (TreeTagger must be installed separately for license reasons). The results of this tool are accessible in the platform as word properties (e.g. for the word "likes": morphosyntactic tag "VER:pres", a verb in the present tense, and lemma "to like");
- export all results in CSV format for lists and SVG format for graphics;
- be scripted in Groovy or R.
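As an illustration, the concordance example above (a word with lemma "love" followed by at most 2 words by a word starting with "may") could be expressed as the following CQP-style query (property names such as lemma are assumptions that depend on the corpus import):

```
[lemma="love"] []{0,2} [word="may.*"]
```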
See also
- TXM description in the TEI wiki
- TXM description in TAPoR 2
- TXM description in Bamboo DIRT
- TXM description in PLUME
Notes:
1. Funded by grant ANR-06-CORP-029, 2007-2010; see the Home page of this site.
2. Other information about the Textometry project and the TXM platform: Project Publications, Reference Manual, Users' Wiki, Developers' Wiki.
3. See their respective descriptions in the online documentation.
4. Any XML where the w tag can encode words.
5. A TEI-compatible XML format consisting of a minimal tagset: cell, emph, graphic, hi, head, item, lb, list, note, p, pb, ref, row, table, text, w.
6. As defined by the TEI-compatible XML Text Encoding Guide of the BFM project.
7. An XML-TEI compatible internal pivot format oriented toward TXM.
8. As defined by the Transcriber software.