Corpus-based tools

In addition to the "manual" approach mentioned, various corpus-based techniques are also used in compiling the dictionary. The "corpus" in this case is a body of Spanish and English texts taken from a variety of sources on the Internet. Sources are generally chosen from sites that offer RSS feeds which are broadly categorised thematically[1], and the texts are collected via those feeds. In practice, this does bias the material: sites offering regular, high-quality feeds tend to be on-line newspapers and magazines.

The raw HTML files downloaded from the Internet first require quite an aggressive "cleanup" to remove unwanted tags and to separate useful text from irrelevant text such as copyright messages, menus etc. Some HTML tags that carry linguistic information (e.g. title and header tags) are actually used in parsing and categorising the texts. A fairly "shallow" parsing process is then applied to the cleaned-up text in order to reduce inflected forms (e.g. plurals, conjugated verbs) to their base form. Custom-written software is used for this process. From the resulting body of material, we can then perform various tasks.
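The cleanup and base-form steps described above can be sketched roughly as follows. This is only an illustration using Python's standard html.parser module, not the project's custom software: the lists of skipped and retained tags, the names TextExtractor, clean_html and base_form, and the three toy plural rules are all assumptions made for the example. A real reducer would need a lexicon and proper verb handling.

```python
from html.parser import HTMLParser

# Assumed tag lists, chosen for illustration only.
SKIP_TAGS = {"script", "style", "nav", "footer"}   # containers treated as non-useful text
KEEP_TAGS = {"title", "h1", "h2", "h3"}            # tags kept for categorising the text


class TextExtractor(HTMLParser):
    """Collects heading text and body text, ignoring skipped containers."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0        # > 0 while inside a skipped container
        self.current_keep = None   # name of the heading tag we are inside, if any
        self.headings = []         # list of (tag, text) pairs
        self.body = []             # list of body text chunks

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in KEEP_TAGS:
            self.current_keep = tag

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == self.current_keep:
            self.current_keep = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth:
            return
        if self.current_keep:
            self.headings.append((self.current_keep, text))
        else:
            self.body.append(text)


def clean_html(html):
    """Return (headings, body_chunks) for one downloaded page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.headings, parser.body


def base_form(word):
    """Toy reduction of Spanish plurals to a base form (three rules only)."""
    w = word.lower()
    if w.endswith("ces") and len(w) > 4:
        return w[:-3] + "z"                      # luces -> luz
    if w.endswith("es") and len(w) > 4 and w[-3] not in "aeiou":
        return w[:-2]                            # ciudades -> ciudad
    if w.endswith("s") and len(w) > 3:
        return w[:-1]                            # casas -> casa
    return w


page = (
    "<html><head><title>Noticias</title><script>var x=1;</script></head>"
    "<body><nav>Menu</nav><h1>Las casas</h1>"
    "<p>Las luces de las ciudades</p>"
    "<footer>Copyright 2010</footer></body></html>"
)
headings, body = clean_html(page)
lemmas = [base_form(w) for chunk in body for w in chunk.split()]
```

Keeping the heading text separate from the body text mirrors the idea that title and header tags are used for categorisation rather than simply discarded with the rest of the markup.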

Based on this project's sister French-English dictionary, a similar process to the last point is also performed, but taking the English translation of the French cognate of a given Spanish word[2]. This helps to find translations for words where the most appropriate English translation isn't always the cognate. In some cases where the relationship between a French and Spanish cognate pair is no longer very obvious from the spelling, such pairings have actually been marked up "by hand" in the Spanish dictionary data.
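As a rough illustration of the lookup via a French cognate, the sketch below pivots from a Spanish word to its French cognate and then to the English translations of that French word. The two small tables and the function name candidate_translations are hypothetical. The real data format is not described here, and the hand-marked pairings are represented simply as ordinary entries in the cognate table.

```python
# Hypothetical data: Spanish word -> French cognate (some pairs would be hand-marked).
ES_FR_COGNATES = {
    "librería": "librairie",
    "actual": "actuel",
}

# Hypothetical extract of a French-English dictionary.
FR_EN = {
    "librairie": ["bookshop", "bookstore"],
    "actuel": ["current", "present"],
}


def candidate_translations(es_word, es_fr=ES_FR_COGNATES, fr_en=FR_EN):
    """Suggest English translations for a Spanish word via its French cognate."""
    fr_word = es_fr.get(es_word)
    if fr_word is None:
        return []                      # no known cognate: nothing to suggest
    return list(fr_en.get(fr_word, []))


suggestions = candidate_translations("librería")
```

In this example the pivot suggests "bookshop" rather than the misleading English cognate "library", which is the kind of case the technique is meant to help with.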

Dictionary reversal

The dictionary is generally compiled in the direction Spanish-English. Most of the English-Spanish entries are actually generated by a computer program that "reverses" the Spanish-English entries. This more or less halves the amount of effort required to compile the dictionary, and also avoids certain inconsistencies that can creep in when both halves of a bilingual dictionary are hand-compiled (e.g. a particular expression being included in only one side of the dictionary). Special markers are added in some cases to the Spanish-English entries to help the reverser. For example, "index" markers are added to the English translations of Spanish examples where it would otherwise be difficult for the reverser to guess which entries should include the given examples.
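The basic idea of reversal can be sketched as below. This is a simplified illustration, not the actual reverser: the entry layout (a "translations" list plus optional "examples"), the representation of an "index" marker as a list of English headwords, and the function name reverse_entries are all assumptions made for the example.

```python
from collections import defaultdict


def reverse_entries(es_en):
    """Build English-Spanish entries from Spanish-English ones."""
    en_es = defaultdict(lambda: {"translations": [], "examples": []})
    for headword, entry in es_en.items():
        # Each English translation becomes a headword pointing back at the Spanish word.
        for translation in entry["translations"]:
            if headword not in en_es[translation]["translations"]:
                en_es[translation]["translations"].append(headword)
        # An example is filed only under the English headwords named by its index marker.
        for example in entry.get("examples", []):
            for key in example.get("index", []):
                en_es[key]["examples"].append((example["en"], example["es"]))
    return dict(en_es)


# Hypothetical Spanish-English data.
es_en = {
    "casa": {
        "translations": ["house", "home"],
        "examples": [{"es": "en casa", "en": "at home", "index": ["home"]}],
    },
    "hogar": {"translations": ["home"]},
}

en_es = reverse_entries(es_en)
```

Here the example "at home" ends up under "home" only, because the marker says so. Without the marker, the reverser would have to guess which word of the English phrase should serve as the headword.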

Of course, this automatic reversing process works better for some entries than for others. It doesn't work very well for certain key words, often function words such as articles, whose entries require more of a textual explanation than example phrases. So some entries on the English-Spanish side are actually hand-compiled.

