How do you handle word count extraction from designed files?



Accurate word count extraction from designed files is one of the most important — and most frequently underestimated — steps in translation project planning. Unlike extracting word counts from Word documents or plain text files, designed files embed text within complex layout structures that automated tools often cannot fully parse. An inaccurate word count leads to incorrect quotes, blown budgets, and timeline overruns.

Our word count extraction process uses the native application for each file format. For Adobe InDesign, we use the built-in word count feature across all stories (text threads), including overset text that is not visible on the page but is present in the file. We separately count text in anchored objects, grouped items, and items on the pasteboard that may be intended for inclusion. For Adobe Illustrator, we traverse all artboards and layers, including hidden and locked layers, since these often contain text variants or alternate-language content that needs translation.

For FrameMaker, we process the entire book file to capture text across all chapters, including header and footer text, table cell content, and cross-reference text. For QuarkXPress, we extract text from all text boxes, including those on master pages. For PowerPoint, we capture slide content, speaker notes, alt text, and embedded chart labels.
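The counts described above come from the native applications. As a rough cross-check for PowerPoint only, a .pptx file is a ZIP package of XML parts, so slide text, speaker notes, and alt text can be approximated with the Python standard library. The sketch below is an illustration under stated assumptions, not the tooling described here: the category names and the `pptx_word_counts` helper are hypothetical, words are split on whitespace, and chart labels are not covered.

```python
import re
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# DrawingML namespace that holds text runs (<a:t>) in OOXML presentations.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"

# Hypothetical category names mapped to the package paths they start with.
PART_PREFIXES = {
    "slides": "ppt/slides/slide",
    "speaker_notes": "ppt/notesSlides/notesSlide",
}

WORD_RE = re.compile(r"\S+")


def count_words(text):
    """Whitespace-delimited count; native tools may tokenize differently."""
    return len(WORD_RE.findall(text))


def pptx_word_counts(pptx_file):
    """Return word counts per category for a .pptx path or file-like object."""
    counts = Counter()
    with zipfile.ZipFile(pptx_file) as package:
        for name in package.namelist():
            if not name.endswith(".xml"):
                continue  # skips relationship (.rels) and media parts
            category = next(
                (cat for cat, prefix in PART_PREFIXES.items()
                 if name.startswith(prefix)),
                None,
            )
            if category is None:
                continue  # layouts, masters, and other parts are ignored
            root = ET.fromstring(package.read(name))
            # Visible text: every <a:t> run in the part.
            for node in root.iter(A_NS + "t"):
                counts[category] += count_words(node.text or "")
            # Alt text: the descr attribute on cNvPr elements.
            for node in root.iter():
                if node.tag.endswith("}cNvPr"):
                    counts["alt_text"] += count_words(node.get("descr", ""))
    return counts
```

Calling `pptx_word_counts("deck.pptx")` (a hypothetical file name) returns a `Counter` keyed by category. Field text such as slide numbers is counted along with everything else, so the totals are likely to differ somewhat from a native count and are best treated as a sanity check.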

We also identify and flag non-editable text — text that has been rasterized into images or flattened into vector paths. This text cannot be extracted and translated through normal DTP processes; it requires graphic recreation. We report these instances separately so project managers can decide whether to include them in scope or deliver them as-is.

The final word count report breaks down counts by file, by text type (body, headers, captions, UI elements), and flags any ambiguities. We provide this in a spreadsheet format that PMs can use directly for quoting. For ongoing clients, we maintain file-level translation memory leverage estimates that further refine cost projections.
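The report structure can be sketched as a small aggregation step. This is a minimal illustration, assuming extraction results arrive as (file, text type, words, flag) records. The record layout, the column names, and the sample values are all hypothetical and invented only to show the shape of the output.

```python
import csv
import io
from collections import defaultdict

# Made-up sample records: (file, text_type, words, flag). Not real project data.
SAMPLE_RECORDS = [
    ("brochure.indd", "body", 1200, ""),
    ("brochure.indd", "body", 300, "overset text"),
    ("brochure.indd", "captions", 85, ""),
    ("poster.ai", "headers", 40, "text on hidden layer"),
]

COLUMNS = ["file", "text_type", "words", "flags"]


def summarize(records):
    """Sum words per (file, text_type) and collect the flags raised for each pair."""
    totals = defaultdict(int)
    flags = defaultdict(list)
    for file_name, text_type, words, flag in records:
        key = (file_name, text_type)
        totals[key] += words
        if flag:
            flags[key].append(flag)
    return [
        {
            "file": file_name,
            "text_type": text_type,
            "words": totals[(file_name, text_type)],
            "flags": "; ".join(flags[(file_name, text_type)]),
        }
        for file_name, text_type in sorted(totals)
    ]


def write_report(rows, stream):
    """Write the summary rows as CSV, which any spreadsheet tool can open."""
    writer = csv.DictWriter(stream, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_report(summarize(SAMPLE_RECORDS), buffer)
    print(buffer.getvalue())
```

Keeping flags in their own column lets a PM filter for rows that need a scope decision, such as overset or hidden-layer text, without touching the counts used for quoting.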

