Improving speed of TMX imports

Heartsome FAQ

Three factors affect the speed of importing a TMX file into a database:
1. Source language settings
2. Duplicated entries
3. Presence or absence of the "tuid" attribute in translation unit elements

Source language settings
Every TMX file must declare a source language in its header. If the file can be used to translate from any language it contains into any other language it contains, the source language must be set to *all*.
Some CAT tools can translate in one direction only, so they must designate one of the languages in the file as the "source language". They usually also restrict TMX files to two languages.
Importing a file that declares a single translation direction requires extra work: the database must be checked for legacy entries that need updating, and that check takes time.
The following table shows the differences when storing a TMX file with 20,000 entries into a database:

Source Language    Database Engine       Time
FR-FR              MySQL 4.1.11          9 minutes
*all*              MySQL 4.1.11          3 minutes
FR-FR              PostgreSQL 8.0.3-1    13 minutes
*all*              PostgreSQL 8.0.3-1    4 minutes

Duplicated entries
If a file has many duplicated entries, remove them before storing the file in a database. The Remove Duplicates feature of the TMX Editor works on a file held in memory, so the removal is fast. When the task is left to import time, duplicates are checked against the data stored in the database, and that procedure is noticeably slower.
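
The idea of removing duplicates in memory can be sketched as follows. This is an illustration only and not the TMX Editor's actual algorithm: it assumes that two translation units are duplicates when they contain the same set of (language, segment text) pairs, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Attribute name ElementTree uses for xml:lang after parsing.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample: the first and third units are identical.
DUP_SAMPLE = """<tmx version="1.4">
  <header srclang="*all*"/>
  <body>
    <tu><tuv xml:lang="en"><seg>Hello</seg></tuv>
        <tuv xml:lang="fr"><seg>Bonjour</seg></tuv></tu>
    <tu><tuv xml:lang="en"><seg>Goodbye</seg></tuv>
        <tuv xml:lang="fr"><seg>Au revoir</seg></tuv></tu>
    <tu><tuv xml:lang="en"><seg>Hello</seg></tuv>
        <tuv xml:lang="fr"><seg>Bonjour</seg></tuv></tu>
  </body>
</tmx>"""


def tu_key(tu):
    """Build a comparison key from every (language, text) pair in a <tu>."""
    pairs = []
    for tuv in tu.findall("tuv"):
        lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
        seg = tuv.find("seg")
        # itertext() also collects text nested inside inline elements.
        text = "".join(seg.itertext()) if seg is not None else ""
        pairs.append((lang.lower(), text))
    return tuple(sorted(pairs))


def remove_duplicates(root):
    """Drop repeated <tu> elements in place; return how many were removed."""
    body = root.find("body")
    seen = set()
    removed = 0
    for tu in list(body.findall("tu")):
        key = tu_key(tu)
        if key in seen:
            body.remove(tu)
            removed += 1
        else:
            seen.add(key)
    return removed
```

Each lookup is a set membership test in memory, whereas a check at import time needs a database query per entry.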

Presence/absence of "tuid" attributes
When importing a file, translation units are processed one at a time.
Each translation unit is extracted from the TMX file, and the database is checked to see whether a record with the same "tuid" already exists. If one does, the existing entry with that id is removed from the database and the newly imported record is saved in its place. This keeps the database updated with the latest version of translated segments.
Please note: It is strongly recommended that "tuid" values be unique across all imported TMX files. If you import a TMX file whose segments use the same "tuid" values as segments already in your database, you risk losing the data previously entered in your database.
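
The replace-on-matching-"tuid" behaviour described above, and the data loss it can cause, can be illustrated with a small sketch. The table layout and the import_units function are assumptions made for this example and use an in-memory SQLite database; they do not reflect the schema or engine the product actually uses.

```python
import sqlite3


def import_units(conn, units):
    """Store (tuid, source, target) rows one at a time.

    Any existing row with the same tuid is deleted before the new one
    is inserted, mirroring the behaviour described in the text.
    """
    for tuid, source, target in units:
        conn.execute("DELETE FROM tu WHERE tuid = ?", (tuid,))
        conn.execute(
            "INSERT INTO tu (tuid, source, target) VALUES (?, ?, ?)",
            (tuid, source, target),
        )
    conn.commit()


# Hypothetical one-table schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tu (tuid TEXT PRIMARY KEY, source TEXT, target TEXT)")

# First file: an entry stored under tuid "1".
import_units(conn, [("1", "Hello", "Bonjour")])
# Second, unrelated file that happens to reuse tuid "1".
import_units(conn, [("1", "Goodbye", "Au revoir")])

rows = conn.execute("SELECT tuid, source, target FROM tu").fetchall()
print(rows)  # [('1', 'Goodbye', 'Au revoir')]
```

The "Hello" entry from the first import is gone, which is why unique "tuid" values matter.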
The TMX Editor includes an option called "Generate New TU IDs" in the Tasks menu that you can use to generate new, clean IDs for all TUs in a TMX file at once.
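
For files processed outside the TMX Editor, a comparable result might be obtained with a short script. The sketch below is not the editor's implementation; it assumes random UUIDs are acceptable as "tuid" values, and the function name is hypothetical.

```python
import uuid
import xml.etree.ElementTree as ET

# Hypothetical sample in which two units share the same tuid.
TUID_SAMPLE = """<tmx version="1.4">
  <header srclang="*all*"/>
  <body>
    <tu tuid="1"><tuv xml:lang="en"><seg>Hello</seg></tuv></tu>
    <tu tuid="1"><tuv xml:lang="en"><seg>Goodbye</seg></tuv></tu>
  </body>
</tmx>"""


def assign_new_tuids(root):
    """Overwrite the tuid of every <tu> with a fresh random identifier.

    Returns the number of translation units that were updated.
    """
    count = 0
    for tu in root.iter("tu"):
        tu.set("tuid", uuid.uuid4().hex)
        count += 1
    return count


root = ET.fromstring(TUID_SAMPLE)
print(assign_new_tuids(root))  # 2
```

Random identifiers make collisions with IDs already in a database very unlikely, at the cost of losing any meaning the old IDs carried.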