Is this the solution to formatting problems from OCR?
Thread poster: Dylan J Hartmann
Dylan J Hartmann
Dylan J Hartmann  Identity Verified
Australia
Member (2014)
Thai to English
+ ...

MODERATOR
TOPIC STARTER
Please, have a go! Jun 15, 2016

I have made an MS Word file containing just the heading, to show the problem. I have also included the .txt file exported from ABBYY.

In example one, you can try to bold/italicise the text yourself, and it works.

In example two, where I have corrected the spelling, the newly inserted text cannot be bolded/italicised.

If I then copy-paste the text into TextEdit and copy-paste it back into Word, it can be bolded/italicised.

https://drive.google.com/file/d/0B8HMbu6gZdA8VEN1UGgwT296TXc/view?usp=sharing

[Edited at 2016-06-15 07:36 GMT]


 
esperantisto
esperantisto  Identity Verified
Local time: 15:03
Member (2006)
English to Russian
+ ...
SITE LOCALIZER
LibreOffice Writer vs. MSO Word Jun 15, 2016

I have tried editing your file using both LibreOffice Writer 5.1.2.1 and Microsoft Office Word 2007 under openSUSE Linux 13.2. With LO Writer, I see absolutely no problem in editing (making bold/italic/bold italic) the Thai part of heading 2 (proof: a PDF file exported from LO, which looks fine to me). I can save the file to DOCX, DOC and, of course, ODT, then re-open it and see the text fine again. However, when I open the file in MS Word, I see only empty squares instead of Thai letters. Thus, the issue seems to lie with Word (well, no surprise to me, as Microsoft products are generally very ill-designed). So far, my advice would be to use LibreOffice.


 
esperantisto
esperantisto  Identity Verified
Local time: 15:03
Member (2006)
English to Russian
+ ...
SITE LOCALIZER
Fonts Jun 15, 2016

A small update: the issue seems to be about how MS Word handles (certain) fonts AND about the DOCX format. I have formatted the Thai letters with a font named Waree using LibreOffice Writer and exported the file to DOC and DOCX. Interestingly, the DOC file opens fine in Word 2007 and seems to be editable; however, the DOCX file still shows only empty squares. Thus, my recommendation is slightly updated:
- use LibreOffice;
- experiment with the fonts available on your system; some of them may work better than others;
- if you need compatibility with MS Word, use DOC, not DOCX.
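For anyone who wants to batch the DOCX-to-DOC step, here is a minimal sketch that drives LibreOffice from Python. It assumes the `soffice` executable is on your PATH (on some systems it is called `libreoffice` or lives inside the application folder), and the function names and the file name in the usage note are hypothetical. It only automates the export; it does not guarantee that Word will then display the Thai text correctly.

```python
import subprocess
from pathlib import Path


def build_convert_command(source: str, target_format: str = "doc", outdir: str = ".") -> list[str]:
    """Build the LibreOffice command line for a headless format conversion."""
    # Assumption: the LibreOffice launcher is reachable as "soffice".
    return ["soffice", "--headless", "--convert-to", target_format, "--outdir", outdir, source]


def convert(source: str, target_format: str = "doc", outdir: str = ".") -> Path:
    """Run the conversion and return the path where the output file is expected."""
    subprocess.run(build_convert_command(source, target_format, outdir), check=True)
    return Path(outdir) / (Path(source).stem + "." + target_format)
```

Calling `convert("heading.docx")` should leave a `heading.doc` in the current folder, which can then be opened in Word to see whether the squares are gone.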

[Edited at 2016-06-15 09:51 GMT]


 
José Henrique Lamensdorf
José Henrique Lamensdorf  Identity Verified
Brazil
Local time: 09:03
English to Portuguese
+ ...
In memoriam
Good point! Jun 15, 2016

Tom in London wrote:

None of what you describe has anything to do with translating.


Formatting OCR output is not a "problem", but separate work.

If I translate a Word file using Wordfast Classic (or any other decent CAT tool), the translated output comes out already formatted. So it's a mere matter of courtesy to offer the preserved formatting at no extra charge, since it costs the translator nothing.

Some translators "do" HTML, terminology management, Flash reauthoring, PowerPoint narration, etc. which I don't. I translate "live" PDF files with Infix, do complete video subtitling jobs... and do DTP! ALL these are separate jobs, peripheral to (and arising from) translation, and should be charged separately.

The real "problem" is when translators offer to do DTP for free, and using MS Word, which is - as its name implies - a word processor, and not a DTP tool.

By personal choice, I didn't take the plunge into InDesign when Adobe phased out PageMaker, to keep selling upgrades. When DTP is required on top of translation from scanned originals, I still do it using PageMaker, and deliver a finished PDF.

For the record, I only keep using PM because after some 25+ years of using successive versions of it, I'm really good and amazingly fast with it. A colleague said it would take me about 30 minutes to get used to InDesign, but I still think it's overkill by far for what I do.

My method should work on InDesign as well:

1. I do OCR on the scanned files to extract PLAIN TEXT, which I translate.
2. I assemble a new pub on PM, and put each (scanned) page of the original in its place, as a background template.
3. I crop the illustrations (if any) from the source file, edit them as necessary, and put them in place, precisely overlapping the original in the background.
4. I place translated text chunks in their places exactly where the background template shows me they should be, and format them accordingly, taking into account swelling/shrinkage in translation.
5. I add the graphic elements (lines, boxes, shades, whatever) as seen on the background template.
6. I check the pub for completeness.
7. I delete all those background templates, save, and distill into a PDF for delivery.

Most important, yes, I do charge for all this DTP work. Worth mentioning, when I translate PDFs using Infix, I also charge for DTP fixing, though about half of what it would cost to recreate from scratch using PageMaker.

Whenever I am requested to give an estimate for a translation that may require such DTP work, I quote separately:
a) translation only, delivery in plain TXT (so they may hire the DTP operator they choose); and
b) DTP if bundled together with that translation.


 
Huw Watkins
Huw Watkins  Identity Verified
United Kingdom
Local time: 13:03
Member (2005)
Italian to English
+ ...
Maybe change tool? Jun 16, 2016

I actually use Omnipage Pro 18 and have done for years now.

I find it does a pretty decent/very good job of the formatting, and the only things I usually need to fix are character spacing and the occasional document where hard and soft returns are inserted at the end of lines (cutting the sentence up into segments in the CAT tool). Omnipage also inserts text in boxes when dealing with heavily formatted brochures etc. - a very useful way of dealing with that - and then reproduces the images, leaving you with a nice-looking product. It also rarely creates the Excel-type lines some OCR software produces, with the resulting unbelievable headache with formatting. I rarely spend more than an hour fixing up a fairly large document, and it is easy to do when you activate "Show paragraph markers and other hidden formatting symbols" on the Home tab of Word.
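To illustrate the hard/soft return clean-up, here is a minimal sketch that works on a plain-text export rather than inside Word. It is not how Omnipage or any CAT tool does it. It assumes that a blank line marks a real paragraph break, that every other line break is unwanted, and that the language separates words with spaces (so it would need adjusting for Thai). The function name is made up for the example.

```python
import re


def unwrap_hard_returns(text: str) -> str:
    """Join lines broken mid-sentence; keep blank lines as paragraph breaks."""
    # Normalise Windows/old Mac line endings, and treat a vertical tab
    # (how a soft return often appears in plain text) as a line break too.
    text = text.replace("\r\n", "\n").replace("\r", "\n").replace("\v", "\n")

    # Assumption: one or more blank lines separate genuine paragraphs.
    paragraphs = re.split(r"\n\s*\n", text)

    joined = []
    for para in paragraphs:
        lines = [line.strip() for line in para.split("\n") if line.strip()]
        if lines:
            # Rejoin the fragments with single spaces so the CAT tool
            # sees one unbroken segment per sentence.
            joined.append(" ".join(lines))
    return "\n\n".join(joined)
```

Running it over the OCR text before importing into the CAT tool should stop sentences from being chopped into several segments, though headings and list items that rely on single line breaks would need to be checked by hand.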

While some may think this is time wasted, I do not. I have experimented with voice recognition software in the past due to a wrist problem I developed and I encountered these main issues:

1) It inserts strange words that don't get picked up in spell check, especially on more area-specific and/or technical texts.
2) It's an absolute nightmare when you have lots of codes and numbers, tags etc.
3) You still have to do the formatting, but from scratch!
4) You have very little to no automated terminology aids (glossaries and TBs etc etc).
5) No handling of repetitions.

I find the hour spent getting the source file in decent shape for SDL Trados Studio and all the plugins and productivity-boosting bits and pieces it affords - to the point where 4000 words/day seems perfectly doable in a good day's work nowadays - saves hours over the course of the project.

My advice is to experiment with Nuance Omnipage and see how you get on. You can get a free 30-day trial.



[Edited at 2016-06-16 08:21 GMT]


 
Phil Hand
Phil Hand  Identity Verified
China
Local time: 20:03
Chinese to English
I love stripping off... Jun 16, 2016

...all the formatting.

I don't know about your specific problems with Thai, but I certainly have Notepad open on my desktop permanently, and generally copy and paste words and texts around via Notepad to remove all formatting. Because yes, when you're working with any kind of formatted text and with a CAT, there's always a tag or an autolist or some bit of format that messes up your document.
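The same "round trip through a plain-text editor" idea can be sketched in code for DOCX files. This is only an illustration, assuming a standard DOCX (a zip archive with the body in `word/document.xml`); the function name is hypothetical, and text boxes, footnotes, headers and tabs are not handled specially.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_plain_text(path_or_file) -> str:
    """Return the text of a DOCX body, one line per paragraph, with no formatting."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Keep only the text runs; bold, italic, fonts, tags and
        # autolist settings live in other elements and are ignored.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

The returned string is roughly what you would get by pasting into Notepad: the words survive, and everything that could confuse the CAT tool is left behind.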


 