My experience choosing statistical text analysis software for translation

I’ll tell you about my experience choosing a program for a task that is rather unconventional for a translator: statistical text analysis. Such programs aren’t very common in the field of translation. That’s too bad: with their help, you can quickly pick out the keywords and key expressions in a text and, as a result, assess its subject and level of complexity before taking an order, and then keep particular keywords in mind during the translation. Such programs are also useful when a translator isn’t using translation memory software but still needs to track how key terms are used and translated.

The first statistical text analysis program I came across online was Wordstat (distributed freely).

The program is really simple to use: you choose a file (although it only supports txt and html/htm files), press the button and, a second later, receive a file in txt format with the keywords.
As you can tell from the results, the program’s algorithm is also perfectly simple: it counts how often each word is used and builds its ranking list from those counts. As a result, prepositions and articles come first on the list, and they certainly don’t carry the truly important information. In addition, words are analysed one by one. This is a drawback because glossaries, of course, have to include expressions as well.
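
To show why a pure frequency count puts articles and prepositions at the top, here is a minimal Python sketch of that kind of counting. It is my own illustration, not Wordstat’s actual code; the function name word_frequencies and the sample text are made up for the example.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each word occurs, ignoring case."""
    # [^\W\d_]+ matches runs of letters in any alphabet (Latin, Cyrillic, ...)
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words)


# Made-up sample text, used only for illustration
sample = (
    "The pump moves the coolant to the heat exchanger. "
    "The heat exchanger cools the coolant."
)

# Show the three most frequent words with their counts
for word, count in word_frequencies(sample).most_common(3):
    print(word, count)
```

In this sample the article “the” occurs five times and ends up first, ahead of the terms a translator actually cares about, such as “coolant” or “heat exchanger”. And since the words are counted one by one, “heat exchanger” never appears as a single entry.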

So I continued my research and found a program called TextAnalyst (distributed freely). It has a better algorithm that takes into account, along with frequency, a number of linguistic parameters: a word’s position in a sentence, the sentence’s position in the text, how words are connected to each other, and semantic parameters.
So, although the results contain a lot of “noise”, the really important terms are selected and can be used to create a keyword glossary. Unfortunately, this miraculous program only supports Russian.
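
I don’t know how TextAnalyst actually computes its ratings, so the following Python sketch only illustrates the general idea of combining frequency with one positional parameter. The weighting rule (earlier sentences count more) and the function name position_weighted_scores are my assumptions, not the program’s algorithm.

```python
import re
from collections import defaultdict


def position_weighted_scores(text):
    """Score words by frequency, giving more weight to earlier sentences."""
    # Naive sentence split on ., ! and ?; real programs are more careful
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    total = len(sentences)
    scores = defaultdict(float)
    for index, sentence in enumerate(sentences):
        # Assumed weighting: 1.0 for the first sentence, 1/total for the last
        weight = (total - index) / total
        for word in re.findall(r"[^\W\d_]+", sentence.lower()):
            scores[word] += weight
    return dict(scores)


# Made-up sample text, used only for illustration
scores = position_weighted_scores("Pumps fail. Valves leak. Pumps rust.")
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(word, round(score, 2))
```

With a rule like this, a word that appears early in the text can outrank a word that is equally frequent but appears only near the end. A real program would combine several such parameters.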

If your source text is in English (or any other language based on the Cyrillic or Latin alphabet), you can use my next find: Textanz. Compared to the Russian TextAnalyst, Textanz uses “rougher” methods and is limited solely to frequency analysis. The only linguistic feature of this program is the ability to ignore prepositions, articles and other words included in a special list. Clearly, the very simplicity of the algorithm is what allows the program to work with several languages.
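
Frequency analysis with an exclusion list is easy to picture in code. The Python sketch below is my own illustration of the approach, not Textanz’s implementation; the stop list is a deliberately tiny, made-up one, and the name keyword_frequencies is hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list; a real one would be much longer
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is"}


def keyword_frequencies(text, stop_words=STOP_WORDS):
    """Count words, skipping everything on the stop list."""
    # [^\W\d_]+ matches runs of letters in any alphabet (Latin, Cyrillic, ...)
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(word for word in words if word not in stop_words)


# Made-up sample text, used only for illustration
sample = (
    "The pump moves the coolant to the heat exchanger. "
    "The heat exchanger cools the coolant."
)

for word, count in keyword_frequencies(sample).most_common(3):
    print(word, count)
```

Nothing here depends on the grammar of a particular language: to handle another language, you only need to swap in a different stop list.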

Of course, if you need to create a professional glossary for a large text, you’d better use a specialized program. The programs described above are better suited to a quick study of the text before translation (in order to better assess the subject), to selecting key terms, and to tracking their translation “for yourself”.