3 tips for easier translation of PDF files
Lukáš Slovák 24.02.2020 For Translators Reading time: 4 min.
Every translator has probably been faced with a non-editable document that needs translating, for example a scanned PDF, a JPG or PNG image, or even text on paper. Fortunately, there are now programs that can help with this and make our work easier.
They’re called OCR, which stands for optical character recognition. An OCR program is a good problem-solving tool, but only if you use it effectively. How can you prevent it from creating more problems than it solves? We’ll find out in this article.
What are OCR programs actually good for?
Optical character recognition software, or OCR for short, converts text and other content in scanned or image-based documents into an editable format.
Let’s take a practical example: a client provides you with scanned operating instructions as a PDF (and unfortunately, no editable version is available). These operating instructions contain text that cannot be copied, tables, and images. At first glance, there are a lot of numbers and repetitions. You’d prefer to use a CAT tool rather than retype everything from the PDF into MS Word and translate it there. That would take a long time, and you might make a mistake or even unintentionally omit part of the text. So you opt for an OCR program, because it automatically recognises text, tables, and images, and converts and formats the result into docx (MS Word). Great, isn’t it? Well, yes and no...
The output will never be perfect
When you open the file in MS Word, you will often find that the text is messy: the program has inserted extra section breaks and columns, the page is full of text boxes instead of simple paragraphs, images are cropped incorrectly, and the text contains strange characters. What next?
You should note that automatic OCR output will never be perfect. Preparation is needed before converting to MS Word. You have to help the program a little and define which parts of the document are tables (if the program defined them incorrectly), what is continuous text, where the images are, etc.
Document quality also plays a role. If the scan is blurred, the program might not recognise the text correctly (e.g. it can read “8ehind” instead of “Behind”), and punctuation, letters, or even whole words might be missing. That’s why many OCR programs include a spellcheck.
Basically, the more complicated the document, the more time must be spent setting up the OCR process itself. But at the end of the day, you’ll be rewarded for the time invested.
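Misreadings such as “8ehind” can also be hunted down after conversion. The following Python sketch is only an illustration, not a feature of any particular OCR program: it flags every word that mixes letters and digits so that you can check it against the scan. The function name `suspicious_tokens` and the sample sentence are our own assumptions.

```python
import re

TOKEN_RE = re.compile(r"\S+")

def suspicious_tokens(text):
    """Return tokens that mix letters and digits, a common sign of OCR misreads."""
    flagged = []
    for token in TOKEN_RE.findall(text):
        # Drop surrounding punctuation so "c0nnector." is reported as "c0nnector".
        core = token.strip(".,;:!?()[]\"'“”‘’")
        has_letter = any(ch.isalpha() for ch in core)
        has_digit = any(ch.isdigit() for ch in core)
        if has_letter and has_digit:
            flagged.append(core)
    return flagged

# Hypothetical OCR output containing two misread words.
sample = "8ehind the cover there are 4 screws and 1 c0nnector."
print(suspicious_tokens(sample))  # ['8ehind', 'c0nnector']
```

Keep in mind that legitimate codes such as “M8” or “230V” will be flagged too, so the list is a starting point for a manual check, not a verdict.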
Adjustments to be made after converting to MS Word
Even though you might be pretty happy with the output in Word, your job is not quite done yet. Although it might sound counterproductive, in most cases, we recommend starting by clearing the formatting. All the text, numbers, and tables will be kept, and only the formatting will be deleted. The format can then be edited according to your client’s requirements.
We recommend focusing on the following advice:
Start with the general settings
Before getting down to changing the font and bullet point size, define the basic elements, such as paragraphs, page size, and sections. Why? You might be familiar with this situation: after deleting a redundant section, the text moves four pages forward, a new column appears and, for an unknown reason, every third image is deleted.
Keep it simple
Reduce the number of sections and create automatic numbered lists (this applies to the table of contents and automatic heading styles, too). Don’t forget the document headers and footers.
Think like a CAT tool
Adapt the text to the translation process. If you don’t prepare the document thoroughly and then import an unedited file into your CAT tool, the whole document might simply fall apart when you export it – the text will display in the wrong places (if it displays at all) and you’ll spend much more time editing it than you planned.
Don’t forget to correctly separate text from numbers and use hidden tables instead of multiple tabs. This lets the CAT tool segment the content correctly and pre-translate the numbers (CAT tools can usually only do this if numerical data is correctly segmented, meaning separated from continuous text).
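To illustrate what “correctly segmented” means here, the Python sketch below splits a line aligned with several tabs into separate cells and labels each one as a number or as text. It is a simplified model of the idea, not how any CAT tool works internally; the function name `split_into_cells` and the sample row are assumptions made for this example.

```python
import re

# A cell counts as numeric if it is only digits with optional sign and
# decimal/thousands separators, e.g. "16,5" or "1.000,50".
NUMERIC_RE = re.compile(r"^[+-]?\d+(?:[.,]\d+)*$")

def split_into_cells(line):
    """Split a line on runs of tabs and label each cell as 'number' or 'text'."""
    cells = [cell.strip() for cell in re.split(r"\t+", line) if cell.strip()]
    return [(cell, "number" if NUMERIC_RE.match(cell) else "text") for cell in cells]

# Hypothetical line from OCR output, aligned with multiple tabs.
row = "Max. pressure\t\t\t16,5\tbar"
print(split_into_cells(row))
# [('Max. pressure', 'text'), ('16,5', 'number'), ('bar', 'text')]
```

In a hidden table, each of these cells becomes its own segment, so the number is no longer buried in a run of text and tabs.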
Also, avoid text boxes, if possible, as they are a double-edged sword. On the one hand, they can be placed in an exact spot anywhere on a page, but on the other hand, they aren’t flexible. So if the translation is longer than the source text, they must be manually resized in MS Word after exporting the translation from the CAT tool to display the whole translated text.
As you’ve guessed, formatting is a science that every translator should master. All it takes is some practice. After formatting a few dozen documents, you’ll do much better than an OCR program.
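If you want a quick impression of how much clean-up a converted file needs, you can count its sections and text boxes. A docx file is a ZIP archive whose main text is stored in word/document.xml, where section settings appear as `w:sectPr` elements and text box content as `w:txbxContent` elements. The Python sketch below relies only on that; the function name `count_layout_elements` is our own, and the demo uses a tiny made-up archive rather than a real document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def count_layout_elements(docx_file):
    """Count section definitions and text boxes in the main body of a .docx file."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    sections = sum(1 for _ in root.iter(f"{{{W_NS}}}sectPr"))
    # Rough indicator only: Word may store a fallback copy of a text box,
    # so this number can be higher than what you see on the page.
    text_boxes = sum(1 for _ in root.iter(f"{{{W_NS}}}txbxContent"))
    return {"sections": sections, "text_boxes": text_boxes}

# Demo on a tiny in-memory archive that mimics the relevant part of a .docx file.
demo_xml = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:p><w:pPr><w:sectPr/></w:pPr></w:p>'
    '<w:p><w:r><w:txbxContent><w:p/></w:txbxContent></w:r></w:p>'
    '<w:sectPr/>'
    '</w:body></w:document>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", demo_xml)
buffer.seek(0)
print(count_layout_elements(buffer))  # {'sections': 2, 'text_boxes': 1}
```

For a real file, you would pass its path instead of the buffer. High counts in a document that looks simple on screen suggest that clearing the formatting and rebuilding it is the faster route.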
OCR built into CAT tools
Most modern CAT tools let their users “translate” PDF files directly by performing OCR during project creation. Although this saves time compared to manual OCR and text formatting, it can result in badly segmented text or even an incoherently formatted target language file. We recommend testing the quality of the output for each file before you start translating. You can do so in most CAT tools by pseudo-translating the document and creating the final target language file to see if the output is what you and your client expect.
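If you haven’t come across pseudo-translation before, it replaces every segment with a dummy “translation” that is visibly different from the source and usually somewhat longer, so that layout problems show up before any real work is done. CAT tools have their own built-in functions for this; the Python sketch below only illustrates the principle. The function name `pseudo_translate`, the bracket markers, and the 30% expansion are assumptions chosen for this example.

```python
# Map plain vowels to accented ones so the text stays readable but clearly changed.
ACCENTED = str.maketrans("aeiouAEIOU", "áéíóúÁÉÍÓÚ")

def pseudo_translate(segment, expansion=0.3):
    """Return a pseudo-translated segment: accented vowels plus padding."""
    # Padding simulates a target text that is longer than the source.
    padding = "~" * max(1, round(len(segment) * expansion))
    # Brackets make it easy to spot segments that get cut off in the layout.
    return "[" + segment.translate(ACCENTED) + padding + "]"

print(pseudo_translate("Open the cover"))  # [Ópén thé cóvér~~~~]
```

If a closing bracket is missing somewhere in the exported file, the text in that spot has been truncated, which typically points to a text box or table cell that is too small.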
What program should you choose?
OCR programs differ in their processing options, recognition speeds, and licence types (lease/permanent licence), and they come in online and offline versions. When using online versions and free OCR services, be careful with personal data and sensitive documents. Among paid offline programs, we recommend ABBYY FineReader, for example.
Do you find the topic of OCR programs interesting and want to learn more? Let us know!