What is textometry?

Textometry developed mainly in France during the 1970s, building on the pioneering research in lexical statistics by Pierre Guiraud (1954, 1960) and Charles Muller (1968, 1977), which addressed questions such as evaluating the vocabulary richness of a text or identifying its characteristic vocabulary. Textometry also adopts and extends the data analysis methods (e.g. factor analysis, clustering) developed and applied to linguistic data by Jean-Paul Benzécri (1973): such techniques generate synthetic, visual maps of words and texts showing how they are related or opposed to each other within a corpus.

In addition, textometry develops new statistical models to uncover important features of textual data: the contextual attraction between words (phraseology, thematic fields …), linearity and internal text structure (for example, words distributed evenly over the text or, on the contrary, words appearing in “bursts”), intertextual contrasts (reliable statistical measurement of the overuse or underuse of a word in a text, and identification of the words and phrases characteristic of a text), and indicators of lexical evolution (the characteristic period of use of a term, detection of significant breaks in usage).

The results of the calculations are synthetic, selective and suggestive reorganizations of the texts submitted for analysis: ordered lists, map-like visualisations, groupings, and highlighting within the text. Interpretation relies on numerical indicators as well as on systematic examination of the contexts, now made easier by hypertext links. Textometry researchers have of course also discussed how textual data should be modelled: what exactly do we count? To what extent is it appropriate to subject the text to linguistic analysis beforehand, in order to delimit and recognize the words?
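To make the idea of measuring overuse or underuse of a word more concrete, here is a minimal sketch in Python. It assumes a hypergeometric model, which is one common way of formalizing this kind of contrast; the text above does not name a specific model, so this choice, the function names and the figures in the usage note are illustrative assumptions only.

```python
from math import comb

def hypergeom_pmf(k, T, t, F):
    """Probability of observing exactly k occurrences of a word in a part
    of t tokens, when the word occurs F times in a corpus of T tokens
    (the part is treated as a random sample drawn without replacement)."""
    return comb(F, k) * comb(T - F, t - k) / comb(T, t)

def overuse_probability(f, T, t, F):
    """P(X >= f): a very small value suggests the word is overused in the part."""
    return sum(hypergeom_pmf(k, T, t, F) for k in range(f, min(F, t) + 1))

def underuse_probability(f, T, t, F):
    """P(X <= f): a very small value suggests the word is underused in the part."""
    return sum(hypergeom_pmf(k, T, t, F) for k in range(0, f + 1))

# Hypothetical figures: a corpus of 1000 tokens, a part of 100 tokens,
# and a word occurring 20 times overall, 8 of them in the part.
p_over = overuse_probability(8, T=1000, t=100, F=20)
```

With these hypothetical figures, about 2 occurrences would be expected in the part, so 8 observed occurrences yield a very small overuse probability. In a full analysis such a score would be computed for every word and every part, and the contexts of the most marked words would then be read.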

Textometry thus applies a wide range of linguistically meaningful and mathematically grounded calculations to the methodical, renewed analysis of text collections: syntagmatic and paradigmatic associations, contrasts and characterizations, evolutions. A balanced approach, combining computed global synthetic views with targeted consultation of the contexts of use, opens up the new possibilities of interpretation offered by digital corpora.

How is textometry an original approach?

Of course, textometry is not the only discipline interested in applying calculations to textual data. Here are some other current approaches against which textometry should be situated.

During the same period, Information Retrieval was developed in the USA by Salton. In a context of library computerization and automated document search, it was necessary to develop measures for automatically selecting the characteristic words of a document (as in keyword indexing), and for matching an information need expressed in a “query” with the most relevant documents. This research has produced popular, simple and effective techniques (for example, tf.idf for weighting the words of a text, or mutual information, which measures the lexical attraction between words). It shares with textometry the desire to find robust methods capable of handling large volumes of text. On the other hand, it seems to us that textometry has gone further in trying to ground its models both mathematically (probabilities, statistics, data analysis) and linguistically (expression and mathematical translation of hypotheses about language and textuality). This theoretical grounding provides the perspective needed to evaluate, interpret, develop and enrich its models. Moreover, textometry is characterized by a rich palette of tools for textual analysis (for literary, stylistic, philological, hermeneutical purposes, etc.), while information retrieval naturally focuses on document-related issues (locating and linking information units).
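The two techniques named above can be sketched as follows. This is a minimal illustration on a toy corpus, assuming one common tf.idf variant (relative term frequency multiplied by the logarithm of the inverse document frequency) and a pointwise mutual information computed from document-level co-occurrence; many other variants exist, and the corpus and function names are invented for the example.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each document is a list of tokens.
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog sat on the log".split(),
    "d3": "the cat chased the dog".split(),
}

def tf_idf(docs):
    """Weight each word of each document by tf * log(N / df)."""
    n_docs = len(docs)
    df = Counter()                      # number of documents containing each word
    for tokens in docs.values():
        df.update(set(tokens))
    weights = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        weights[name] = {
            w: (c / len(tokens)) * math.log(n_docs / df[w])
            for w, c in tf.items()
        }
    return weights

def pmi(docs, w1, w2):
    """Pointwise mutual information (base 2), where two words co-occur
    when they appear in the same document."""
    n = len(docs)
    n1 = sum(w1 in tokens for tokens in docs.values())
    n2 = sum(w2 in tokens for tokens in docs.values())
    n12 = sum(w1 in tokens and w2 in tokens for tokens in docs.values())
    if n12 == 0:
        return float("-inf")            # the two words never co-occur
    return math.log2(n12 * n / (n1 * n2))

weights = tf_idf(docs)
```

In this toy corpus, "the" appears in every document and therefore receives a weight of zero, whereas "mat", which appears in only one document, receives a positive weight in d1.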

Latent Semantic Analysis (LSA) also applies mathematical calculation to textual data, producing a synthetic spatial representation from which interesting linguistic effects can be derived, including a certain ability to neutralize variations due to synonymy and paraphrase. Although the designers and users of LSA form an active and relatively autonomous research community, the fact remains that the geometric model they implement is closely related to the factor analyses invented by Benzécri. Textometry uses and further refines this kind of calculation of linguistic spaces; it puts them at the service of textual analysis and combines them with other complementary tools, in particular tools for returning to the contexts of use. The importance given to this return to the text characterizes the textometric approach, whereas the LSA community starts from the text to explore other areas beyond it, such as cognition or language.

More generally, Text Mining explores and promotes techniques of data analysis and statistical analysis by applying them to textual corpora. Here again, we find calculations known and practised in textometry (such as clustering). Conversely, however, many of the original calculations developed by textometry for modelling linguistic phenomena are still unknown to the text mining community. Moreover, the name “text mining” emphasizes the idea that valuable information will be extracted from texts, whereas the textometric approach remains constantly oriented towards the text itself, while providing the means to master the abundance of contexts.

Even in the field of computer-assisted text analysis (for example, for literary analysis), textometry software still occupies an original place because of the wealth of calculations it offers and, consequently, of the analysis paths it opens. There are indeed excellent software packages today, but they tend to specialize in one type of calculation: highly advanced search engines, or very comprehensive concordancers (or “KWIC”, keyword in context).
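For readers unfamiliar with concordancers, the following minimal Python sketch shows the principle of a KWIC display: every occurrence of a keyword is listed together with a few words of left and right context. The function name, the whitespace tokenization and the sample sentence are assumptions made for the illustration; real concordancers offer far richer queries and sorting options.

```python
def kwic(tokens, keyword, width=3):
    """Return one (left context, keyword, right context) triple per
    occurrence of `keyword`, with up to `width` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Hypothetical sample text, tokenized naively on whitespace.
text = "the cat sat on the mat and the dog sat on the log".split()

for left, key, right in kwic(text, "sat", width=2):
    # Align the keyword in a central column, as concordances usually do.
    print(f"{left:>10} | {key} | {right}")
```

Aligning the keyword in a central column makes recurring left and right contexts visible at a glance, which is what makes this display useful for returning to the text.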

Natural Language Processing (NLP) also uses statistics for the construction and recognition of linguistic units. These calculations are complementary to those of textometry, since they pursue other goals and are refined for them (such as terminology extraction or morphosyntactic tagging). The relation to the corpus is also quite different: NLP may use a corpus to calibrate its calculations, but the object of analysis is then at the level of the sentence, whereas textometry navigates between global views of texts and the consultation of local contexts.

And what about textometry compared to current tools that allow powerful searching of very large corpora, such as Internet search engines?

Functionally, search engines (such as the most popular search tools on the Internet: Google, eXalead, etc.) locate documents that contain occurrences of the searched word (or pattern). The tool focuses on identifying documents but does not support reading and analysing them. The functionality of textometry can be summarized by the SEMA model (Synthesis, Edition, Engine (“Moteur” in French), Annotation): there is not only a search engine (locating the occurrences of a given pattern), but particular attention is also paid to the edition (presentation) of the text (access to contexts), and especially to the statistical calculations that generate meaningful synthetic views (characterization of the singularities of a text, identification of themes, etc.). Finally, annotation facilities complete the tool by making it possible to personalize, refine and dynamically enrich the corpus in the course of the analysis. Google’s revolutionary algorithm, which ranks pages according to their popularity (expressed by the hypertext links pointing to each page and the global balance of the network), directly implements a “conforming” approach to intertextuality: the more a page is cited, the more it is highlighted by Google, and vice versa. Textometry, on the contrary, aims to highlight specificities and significant contrasts. These possibilities for characterizing and recognizing singularities are particularly welcome for research work in the human sciences; more generally, they are also a factor of openness and freedom of thought.
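The popularity-based ranking described above can be illustrated with a simplified, PageRank-style iteration. This is only a sketch of the general principle under stated assumptions (a damping factor of 0.85, a fixed number of iterations, a hypothetical three-page link graph); it does not claim to reproduce the actual algorithm of any commercial engine.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Simplified PageRank-style scores by repeated redistribution.
    `links` maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of the total score.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page passes its score on to the pages it links to.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page without outgoing links spreads its score evenly.
                for t in pages:
                    new[t] += damping * rank[p] / n
        rank = new
    return rank

# Hypothetical graph: A and B both link to C, and C links back to A.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
```

In this hypothetical graph the most cited page, C, obtains the highest score and the uncited page, B, the lowest, which illustrates the “conforming” logic: a page is promoted because others point to it, not because of what distinguishes its content.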

Internet search engines were developed to index and search web “pages”, and the techniques they use are suited to short texts. The expert analysis of corpora in the human sciences, however, must take their textual depth into account. Current directions in textometric research, which feed into the present project, explicitly address this aspect by developing new models in textual topology and suitable interfaces (such as the paragraph map in Lexico 3).

Textometry software can also better support the scientific work of closely interpreting the results of automated calculations. Commercial engines available to Internet users are presented as black boxes: the criteria by which they select a given page are partly obscure. Yet a good understanding of the processing involved is not a matter of mere technical curiosity, but a necessary condition for properly assessing and using the results. One of the challenges for open textometry software is to explain the available features at all levels: theoretical, methodological and computational. Being able to check how the tool operates in this way gives access to an accurate and effective understanding of the results of queries.

Even in its conception of relevance, textometry software renews the approach cultivated by search engines. For search engines, results are ranked in order of relevance: this is a “competitive” conception of a text’s contribution (quantitative, based on an open-ended score), which is questionable and in any case limited, especially for corpus-based research in the human sciences. The exploration and synthesis tools of textometry serve a qualitative, global vision that respects a plurality of answers without deciding among them a priori.

Why is textometry particularly interesting for the exploitation of corpora in human sciences?

Textometry is therefore particularly relevant for the exploitation of corpora in the human and social sciences. It is used across very different fields and materials: historical archives, surveys with open-ended questions, literary works, etc. Indeed, it enables both fine-grained and global observation of texts, and therefore a relatively complete exploitation of the data gathered in these corpora. In addition, textometry stays close to the texts: it respects their expressive choices and brings them out. This linguistic reality is often an important and very rich field of observation for the human and social sciences.

To cite this document

PINCEMIN Bénédicte, HEIDEN Serge (2008) - “What is textometry? Introduction”, Textometry project website, http://textometrie.ens-lyon.fr/spip.php?rubrique80&lang=en