The TechSIG meeting of 3 June 2021 was very well attended, with 38 participants. The topic for this meeting was ‘Converting PDFs and OCR’. The presenters were Jenny Zonneveld and Hans van Bemmelen.
Jenny started off with well-known advice that bears repeating: tell the client that you charge extra for converting a file from PDF. Once they hear this, clients sometimes discover that they have the document available in an editable format after all!
Jenny then proceeded to explain various ways to use Microsoft Word itself to perform PDF conversion. In the latest versions of Word, one can load PDF files directly, either by using File > Open, or by dragging and dropping the PDF file into Word (if there is an existing file open, drag and drop the PDF file to the ribbon). Word then converts the PDF file into a Word document.
Acrobat Standard DC, the cheapest paid version of Adobe Acrobat, can also export PDF files to Word: go to Tools > Export PDF. Acrobat's conversion is often better than Word's.
Finally, Jenny showed a few ways of converting a PDF file in ABBYY FineReader, an OCR program. FineReader can convert images and PDF files to a number of other formats, but it is not a PDF editor. Hans then gave an extensive explanation of how to deal with a rather complex sample PDF file in FineReader. Hans and Jenny demonstrated several tips and tricks, in particular for dealing with tables in FineReader.
Jenny explained how to troubleshoot problems with scanned newspaper clippings. Samuel Murray gave the tip that one can improve the accuracy of OCR conversion by extracting the pages from the PDF file as images (eg, JPG) and improving the quality of those images in a free program like XnView before loading them into FineReader. In XnView, use the options under Image > Map or Image > Adjust. There are various free websites for extracting pages from PDFs to images. Martina Abagnale recommended PDFsam (the premium version can convert PDFs to images).
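For readers who like to see what such an image clean-up amounts to, here is a minimal sketch in Python. It is an illustration only, not what XnView or FineReader actually does internally. It assumes the scanned page is already available as rows of grayscale values (0 is black, 255 is white), and the function names `stretch_contrast` and `binarise` are invented for this example. In practice you would load the page with an image library or simply use the menu options mentioned above.

```python
def stretch_contrast(pixels):
    """Rescale grayscale values so the darkest becomes 0 and the lightest 255."""
    flat = [v for row in pixels for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A completely uniform image has no contrast to stretch.
        return [row[:] for row in pixels]
    return [[round((v - lo) * 255 / (hi - lo)) for v in row] for row in pixels]


def binarise(pixels, threshold=128):
    """Turn every pixel pure black or pure white (threshold is an assumed default)."""
    return [[0 if v < threshold else 255 for v in row] for row in pixels]


# A tiny made-up "scan": faded grey text (around 90) on a greyish page (around 180).
scan = [
    [180, 178, 92, 181],
    [179, 90, 95, 180],
    [182, 180, 91, 177],
]

cleaned = binarise(stretch_contrast(scan))
for row in cleaned:
    print(row)
```

After these two steps the faded text is pure black and the greyish background pure white, which is the kind of high-contrast input that OCR programs tend to handle best.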
It was also mentioned that some CAT tools do very good PDF-to-Word conversion. Trados in particular creates Word files that are very ‘translator-friendly’ in the sense that it does not insert line breaks in places where translators would not want them.
Hans explained that if a client sends a Word file that was poorly converted from PDF, it may be better to convert that file back to PDF and then to Word again. In Word, go to File > Export > Export as PDF, then use FineReader to convert the resulting PDF back to Word.
One hour was far too short to cover all of this very relevant topic, and we will certainly have another TechSIG meeting about it in future.