Add entries to your term base - with a little help from Excel


By Vito Smolej | Published 11/27/2005 | CAT Tools
Quicklink: http://rus.proz.com/doc/555
Author: Vito Smolej, Germany. English to Slovenian translator. ProZ.com member since Dec 10, 2004.

Introduction

One of the very nice features of a CAT product is the ability to pretranslate a text and find its most frequent segments. To avoid inconsistent translations, one can export these segments and translate them before translating the documents themselves. This speeds up the translation process and ensures that frequent segments are translated consistently.

How about something similar for term bases? Adding entries while working through the document is always possible, but it is rather distracting. To do a proper job, one has to concentrate on the word alone, maybe check a dictionary or two, or ask friends. In practice that usually means one enters a placeholder at those places and then forgets to come back to them later.

Here's a simple method to avoid this. It involves a text processing program – for example Microsoft Word – and some functionality from Microsoft Excel, specifically its pivot table.


Cutting up the source into single words

What we are ultimately after is a list (and, as a consequence, a dictionary) of the words present in the text to be processed. Attention: it always pays to work on a copy of your document.

The first step is relatively simple: to break the text into single words, replace all blanks and tabs with paragraph marks (carriage return/line feed). In Word this is done by replacing a blank with ^p. Do the same for other kinds of separators, such as tabs (^t), commas, colons etc. After the global replace, your original text should have turned into lines consisting of single words, separated only by carriage returns.

Let us take the first paragraph of A Tale of Two Cities:

It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness,
it was the epoch of belief, it was the epoch of incredulity,
it was the season of Light, it was the season of Darkness,
it was the spring of hope, it was the winter of despair,
we had everything before us, we had nothing before us,
we were all going direct to Heaven, we were all going direct
the other way--in short, the period was so far like the present
period, that some of its noisiest authorities insisted on
its being received, for good or for evil, in the superlative degree
of comparison only.

After the suggested change to one word per line, the text becomes (sparing you most of the 119 words):

It
was
the
best
……
of
comparison
only.

There may be a few exotic cases like the combination "way--in" above, which was missing the blanks in my Gutenberg version of Dickens' text. As the rule of this game is "Do less well", don't bother with them.

Now copy this text to the clipboard (Ctrl+A, then Ctrl+C) and start Excel.
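
For readers who would rather script this step than use Word's replace dialog, the split can be sketched in Python. This is a minimal sketch and not part of the original method: the function name one_word_per_line and the exact separator set (whitespace, commas, semicolons, colons) are assumptions chosen for illustration. As with the Word replacement, other punctuation such as a final period stays attached to its word.

```python
import re


def one_word_per_line(text: str) -> str:
    """Put each word of `text` on its own line.

    Assumed separators: any whitespace (blanks, tabs, line breaks)
    plus commas, semicolons and colons. Other punctuation is kept.
    """
    # A run of several separators in a row counts as a single break.
    words = [w for w in re.split(r"[\s,;:]+", text) if w]
    return "\n".join(words)


sample = "It was the best of times, it was the worst of times,"
print(one_word_per_line(sample))
```

The printed output can be pasted into a spreadsheet column in the same way as the text prepared in Word.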


Creating vocabulary and word frequencies

According to Vicipaedia, vocabulary est verba et translationem verborum in linguas alias docens, which means it tells you about words and their translations into other languages. We are not that far yet, as we need the words first, and this is where Excel comes in handy: it reduces all repeated words to single entries and, on top of that, shows how often each one occurs.

To get this list, you will need the services of a pivot table. I will assume you have some experience with them, so I hope the following description is sufficient, if not superfluous. With Excel open and the one-word-per-line text copied to the clipboard:

  • select one of the worksheets, make sure it is empty, and enter "words" into A1
  • activate cell A2
  • press ^V to paste in the text you copied to the clipboard earlier
  • select the complete A column
  • in the Data menu
    • select the pivot table report
    • press "Next" once in the first window
    • press "Next" again to confirm the A column as the selected data
    • press "Layout"

You should now see the layout of the pivot table and, somewhere at 2 o'clock, a rectangle labelled "words".

  • drag the "words" rectangle to "Row"
  • drag it again to "Data"; it changes to "Count of words"
  • press "OK", then "Finish" in the next window

A new worksheet appears, containing the distinct words from your text and their frequencies, i.e. how often each occurred in the original text. In the case of Charles Dickens' A Tale of Two Cities, the top of this list looks like this:

Count of words
words          result
so             1
age            2
all            2
authorities    1
before         2
being          1
belief         1
best           1
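
What the pivot table does here, collapsing repeats and counting them, can also be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the author's method: the helper name word_counts is hypothetical, and the words are lower-cased because the counts in the article (for example a single entry for "It") suggest that Excel groups them without regard to case.

```python
from collections import Counter


def word_counts(lines):
    """Count how often each non-empty line (one word per line) occurs.

    Words are lower-cased first; this case-insensitive grouping is an
    assumption meant to mimic the pivot table result.
    """
    return Counter(line.strip().casefold() for line in lines if line.strip())


# The first line of the sample text, already split into one word per item.
column_a = ["It", "was", "the", "best", "of", "times",
            "it", "was", "the", "worst", "of", "times"]
counts = word_counts(column_a)
print(len(counts))    # number of distinct words
print(counts["was"])  # how often "was" occurs
```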

The program found 57 different words in the text, so evidently some of them turn up more than once. Ordering the pivot table by "result" (by copying its contents and sorting the copy in decreasing order of "result") shows the following:

the    14
of     12
was    11
It     10
we     4

This is what one would expect, and none of it needs to be translated: typing "es" outright in German, for instance, is of course faster than using the term base to look up "it".
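
The sorting step (copy the pivot table, sort by "result" in decreasing order) has a direct counterpart in Python's Counter. The sketch below is only an illustration: the helper name top_words is hypothetical, and the counts are the ones quoted in this article, typed in by hand rather than computed.

```python
from collections import Counter


def top_words(counts: Counter, n: int):
    """Return the n most frequent (word, count) pairs, highest count first."""
    return counts.most_common(n)


# Frequencies as reported above for the Dickens paragraph.
dickens = Counter({"best": 1, "we": 4, "it": 10, "was": 11, "of": 12, "the": 14})
for word, count in top_words(dickens, 5):
    print(f"{word}\t{count}")
```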

Harvesting – a real case

Here's a real example: 1500+ words of an MSDS (material safety data sheet) text, with the usual suspects at the top:

and     47
to      46
the     39
in      38
of      33
be      33
with    31
or      20

… and then, here and there, some words that we will be pleased to add to our term base:

water           17
material        11
reaction        9
diisocyanate    8
respiratory     7
isocyanate      6
heat            6
carbon          6
avoid           5
polyol          5
dioxide         5
container       5
pressure        5


Conclusion

One can of course build a whole machinery around this simple solution, adding for instance:

i) exclusion tables – "ignore words and, it, the…."

ii) exclusion rules – "ignore words shorter than…"

iii) start automatic search for translations

However, just by taking care of the above table (with "water", "material" etc.) we have 95 occurrences pretranslated, since the counts in that table add up to 95.

Not bad for a 10-minute job.
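
As a rough idea of what the exclusion table and the exclusion rule mentioned above could look like in a script, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function name term_candidates, the contents of the stop-word set and the minimum length of 4 are not taken from the article. The counts are a handful of those reported for the MSDS text.

```python
from collections import Counter


def term_candidates(counts, stop_words, min_length=4):
    """Return (word, count) pairs, most frequent first.

    Skips words found in `stop_words` (exclusion table) and words
    shorter than `min_length` characters (exclusion rule).
    """
    excluded = {w.casefold() for w in stop_words}
    kept = {
        word: n
        for word, n in counts.items()
        if word.casefold() not in excluded and len(word) >= min_length
    }
    return Counter(kept).most_common()


# A few of the MSDS frequencies quoted above.
msds_counts = Counter({"and": 47, "to": 46, "the": 39, "with": 31,
                       "water": 17, "material": 11, "reaction": 9})
stop_words = {"and", "it", "the", "with"}  # hypothetical exclusion table
print(term_candidates(msds_counts, stop_words))
```

Keeping the stop words in a separate set means the same exclusion table can be reused from one document to the next.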




Articles are copyright © ProZ.com, 1999-2024, except where otherwise indicated. All rights reserved.
Content may not be republished without the consent of ProZ.com.