TreeTagger installation into TXM tutorial 1
To be able to automatically lemmatize your corpora during the import process into TXM, follow one of the two tutorials A or B below.
A. TreeTagger installation into TXM 0.8.0
Starting with TXM 0.8.0, two extensions dedicated to TreeTagger automatically install the TreeTagger software and the French and English models:
- Call the “File > Add an Extension” command
- Select the “TreeTagger software” and the “TreeTagger models” lines to install the TreeTagger software and French and English models
- Validate the next steps
- After TXM has restarted, TreeTagger is ready to be used
- To install more language models, follow the 4th section of the TreeTagger manual installation below
- The End
B. TreeTagger “manual” installation into TXM 0.7.9 and previous versions
This tutorial will guide you to:
- Download the TreeTagger software and some language specific model files
- Tell the TXM platform where TreeTagger and its model files are installed on your machine
B.1 Download files from the web and prepare them
While connected to the Internet:
- Download the TreeTagger software archive from the TreeTagger web site:
Extract the content (bin, cmd, doc, FILES, LICENSE and README) to a folder named “treetagger” located in your applications folder 2. Depending on your system, in:
Mac OS X
Check: After extraction, the treetagger folder must contain the following files and directories: bin, cmd, doc, FILES and README.
Note: This way of installing TreeTagger is specific to TXM. You only need to extract the contents of the TreeTagger archive. You don’t need to follow any additional instructions from an INSTALL.txt file that may be present in the archive.
- Create a “treetagger-models” folder in your ‘TXM-0.8.1’ folder 3. It will contain all the language specific model files.
- Download from the TreeTagger website a language model file (compressed file: ‘*.gz’) for each language in which you may need to tag a text:
- English1 [en]: english.par.gz model built from Penn treebank (Penn treebank tagset)
- English2 [en]: english-bnc.par.gz model built from British National Corpus (BNC tagset)
- French [fr]: french.par.gz model (French tagset)
- Spoken French [frp]: perceo model, available from the CNRTL web site
- Old French [fro]: fro model (tagset)
- German [de]: german.par.gz model (tagset)
- Italian [it]: italian.par.gz model (tagset)
- Spanish1 [es]: spanish.par.gz model (tagset)
- Spanish2 [es]: spanish-ancora.par.gz model built from Ancora corpus (tagset)
- Russian [ru]: russian.par.gz model (tagset)
- Latin1 [la]: latin.par.gz model built by Gabriele Brandolini (tagset)
- Latin2 [la]: latinIT.par.gz model built from Index Thomisticus Treebank (tagset)
- Greek [el]: greek.par.gz model (tagset)
- Ancient Greek [grc]: ancient-greek.par.gz model (tagset)
- Other languages: see the list of all the available language models in the ‘Parameter files’ section of the TreeTagger website
- Extract the downloaded model archive(s) into the “treetagger-models” folder
Under Windows, if you don’t know how to extract ‘*.gz’ files, we recommend using the open-source 7-zip software.
- Rename each model file after its ISO language code (the code given in square brackets in the list above, usually two letters). For instance:
- ‘french.par’ to ‘fr.par’
- ‘english.par’ to ‘en.par’
On Windows and Mac OS X: by default, these systems hide the file extensions they recognize. This can be misleading when you rename a file: the name displayed is “fr.par” but the real file name is “fr.par.bin”.
In that case, you need to display and check the real file names in your Explorer/Finder:
- Under Windows:
- Follow the official tutorial: Show or hide file name extensions
- You can now choose the appropriate file name.
- Under Mac OS X:
- Right-click on the file icon (Ctrl-click with the mouse or two-finger tap on the trackpad)
- Select the ‘Get Info’ menu entry
- Edit the ‘Name and Extension’ field: delete the ‘.bin’ extension.
- Close the “Info” window.
Check: the ‘treetagger-models’ folder must contain some model files, such as the ‘fr.par’ file (about 18 MB) or the ‘en.par’ file (about 14.4 MB).
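For readers who prefer to script the extraction and renaming steps of this section, here is a minimal Python sketch. It is not part of TXM or TreeTagger: the `install_model` function and the `MODEL_NAMES` table are hypothetical names introduced for illustration, and the table only covers the two renaming examples given above.

```python
import gzip
import shutil
from pathlib import Path

# Hypothetical mapping from downloaded archive names to the renamed model files.
# Only the two examples from this tutorial are listed; extend it as needed.
MODEL_NAMES = {
    "french.par.gz": "fr.par",
    "english.par.gz": "en.par",
}


def install_model(archive: Path, models_dir: Path) -> Path:
    """Decompress a '*.par.gz' archive into models_dir under its language-code name."""
    target_name = MODEL_NAMES.get(archive.name)
    if target_name is None:
        raise ValueError(f"no language code known for {archive.name}")
    # Create the 'treetagger-models' folder if it does not exist yet.
    models_dir.mkdir(parents=True, exist_ok=True)
    target = models_dir / target_name
    # Stream the decompressed bytes so that large models are not loaded in memory.
    with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

Assuming the archive was downloaded to the current folder, a call such as `install_model(Path("french.par.gz"), Path.home() / "TXM-0.8.1" / "treetagger-models")` would write ‘fr.par’ directly, which also avoids the hidden-extension problem described above.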
B.2 In TXM
1. Set the TreeTagger preferences
- Select the ‘Edit / Preferences’ main menu entry
- Go to the ‘TXM / Advanced / NLP / TreeTagger’ page (see figure 1)
- Set the ‘Path to the install folder’ preference to the ‘treetagger’ folder path
- Set the ‘Path to the linguistic models folder’ preference to the ‘treetagger-models’ folder path
- Finish with the ‘OK’ button to save the preferences
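Before entering the two paths, it can help to verify that the folders look as this tutorial expects. The following Python sketch is only an illustration under that assumption: `missing_entries` and `model_codes` are hypothetical helper names, not TXM functions, and the expected entries are taken from the check given in section B.1.

```python
from pathlib import Path

# Entries this tutorial expects in the extracted 'treetagger' folder.
EXPECTED_ENTRIES = ("bin", "cmd", "doc", "FILES", "README")


def missing_entries(install_dir: Path) -> list:
    """Return the expected entries that are absent from the install folder."""
    return [name for name in EXPECTED_ENTRIES if not (install_dir / name).exists()]


def model_codes(models_dir: Path) -> list:
    """Return the language codes of the '*.par' files found in the models folder."""
    # A file still named 'fr.par.bin' does not match '*.par' and is not listed.
    return sorted(p.stem for p in models_dir.glob("*.par"))
```

An empty list from `missing_entries` and the expected codes (for instance `['en', 'fr']`) from `model_codes` suggest that both folders are ready to be declared in the preferences.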
2. Check the installation
Copy the following text:
Running SearchEngine in memory mode. Statistical Engine launched.connected. Reloading subcorpora and partitions...Done. No update available.
In TXM, launch the File > Import > Clipboard command
Check in the console that the last lines are:
pAttrs : [id, lbid, enpos, enlemma]
sAttrs : [text:+id+path+base+project, s:+n, p:+id, txmcorpus:+lang]
-- EDITION - Building edition .
Import done:3sec (3265 ms)
Running SearchEngine in memory mode.
Statistical Engine launched.connected.
Reloading subcorpora and partitions...Done.
TXM is ready.
(Note that the first line above should contain enpos and enlemma. The time reported after “Import done” will of course vary.)
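The check on the first console line can also be expressed as a small Python sketch. The function name `has_treetagger_attributes` is hypothetical, and the rule that the attributes are named after the language code followed by “pos” and “lemma” is an assumption generalized from the English example above (enpos, enlemma).

```python
def has_treetagger_attributes(console_line: str, lang: str = "en") -> bool:
    """Return True if a 'pAttrs : [...]' line lists <lang>pos and <lang>lemma."""
    start, end = console_line.find("["), console_line.find("]")
    if start == -1 or end == -1 or end < start:
        return False
    # Split the bracketed list into individual attribute names.
    attrs = [a.strip() for a in console_line[start + 1:end].split(",")]
    # Assumption: TreeTagger attributes are named '<lang>pos' and '<lang>lemma'.
    return f"{lang}pos" in attrs and f"{lang}lemma" in attrs
```

If the function returns False for the line copied from the console, the import ran without TreeTagger annotation and the preferences set in step 1 are worth rechecking.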
In case of difficulty you can find further help in the FAQ (in French).
If you can’t manage the installation process, please send your enquiries to the TXM users mailing list (txm-users AT groupes.renater.fr) after subscribing to the mailing list, or contact the TXM team by mail.
The TreeTagger licence prohibits distributing TreeTagger embedded in commercial software. As the TXM licence does not prevent anyone from doing business with TXM, we cannot include TreeTagger in the TXM distribution. See the TreeTagger web site. ↩
If you don’t have access rights to create the folder in the applications folder, you can create it in your home folder. ↩
The ‘TXM-0.8.1’ folder is in your home directory. ↩