Trim internal fuzzies (AutoIt script) (CAT Tools Technical Help)

Thread poster: Samuel Murray

Samuel Murray

Netherlands
Local time: 16:11
Member (2006)
English to Afrikaans
+ ...

Aug 16, 2020

Hello everyone

I have written a very simple, work-in-progress, proof-of-concept set of scripts that searches a list of sentences for internal fuzzy matches and then groups internal fuzzy matches together. This would potentially allow one to share a job among multiple translators while preventing internal fuzzy matches from being split between translators (who would otherwise not have any benefit from such matches, since the "other" sentences may have been given to other translators).

These are AutoIt scripts, so you need AutoIt installed on your computer to use them. The input file format is plain text, UTF-8 with BOM, one sentence per line. The output file contains the same sentences in the same order, except that segments from later in the list that are internal fuzzy matches are moved up and grouped with the earliest of the matching segments.
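
For readers who want to experiment without AutoIt, the grouping behaviour described above can be sketched in Python. This is not the code from the zip file: the function names, the choice to normalise the Levenshtein distance by the longer sentence, and the rule that a sentence joins the earliest ungrouped sentence it matches are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # drop a character of a
                               current[j - 1] + 1,       # add a character of b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]. Dividing by the longer length is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def group_internal_fuzzies(sentences, threshold=0.75):
    """Reorder sentences so later fuzzy matches follow the earliest match."""
    used = [False] * len(sentences)
    grouped = []
    for i, first in enumerate(sentences):
        if used[i]:
            continue  # already moved up into an earlier group
        used[i] = True
        grouped.append(first)
        for j in range(i + 1, len(sentences)):
            if not used[j] and similarity(first, sentences[j]) >= threshold:
                used[j] = True
                grouped.append(sentences[j])
    return grouped


sample = [
    "This is the house that the Jack built for his friend the bear.",
    "In the Old West, cowboys and their wives ate only fish.",
    "This is the house that the Jack built for his friend the camel.",
    "In the Old West, cowboys and their wives ate only hippo.",
]
for line in group_internal_fuzzies(sample, threshold=0.75):
    print(line)
```

The sketch compares every ungrouped pair, so the work grows roughly with the square of the number of sentences. That makes it suitable only for short lists.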

http://www.leuce.com/autoit/trim%20internal%20fuzzies%20v01.zip

There are four scripts, using Levenshtein and Sift2 fuzzy matching, using two different methods of searching. The inner workings of both Levenshtein and Sift2 are Greek to me, so don't bother asking such technical questions. (-:

Samuel

PS. If you know of any CAT tool that can do this sort of thing (i.e. either remove/export or group internal fuzzies in a file), please please let me know. It should be a standard feature, but it isn't.

---------------------------------------------------------------------------
Added:

If a tool or CAT tool has a feature such as the one I'm talking about, one should be able to tweak its settings to reduce this list of 20 segments down to fewer than ten, or even down to a list of four:

This is the house that the Jack built for his friend the alligator.
This is the house that the Jack built for his friend the bear.
This is the house that the Jack built for his friend the camel.
This is the house that the Jack built for his friend the dolphin.
This is the house that the Jack built for his friend the elephant.
In the Old West, cowboys and their wives ate only fish.
In the Old West, cowboys and their wives ate only giraffe.
In the Old West, cowboys and their wives ate only hippo.
In the Old West, cowboys and their wives ate only insect.
In the Old West, cowboys and their wives ate only jellyfish.
The kangaroo went on a long holiday and never returned.
The lion went on a long holiday and never returned.
The monkey went on a long holiday and never returned.
The newt went on a long holiday and never returned.
The owl went on a long holiday and never returned.
Everyone agrees that the cutest animals are penguin and quail.
Everyone agrees that the cutest animals are raccoon and seal.
Everyone agrees that the cutest animals are tiger and unicorn.
Everyone agrees that the cutest animals are viper and whale.
Everyone agrees that the cutest animals are x-ray fish, yak and zebra.

If I run one of my Levenshtein scripts on this list, I get these results:

Fuzzy threshold: 75%
This is the house that the Jack built for his friend the alligator.
In the Old West, cowboys and their wives ate only fish.
The kangaroo went on a long holiday and never returned.
Everyone agrees that the cutest animals are penguin and quail.

Fuzzy threshold: 85%
This is the house that the Jack built for his friend the alligator.
In the Old West, cowboys and their wives ate only fish.
The kangaroo went on a long holiday and never returned.
The owl went on a long holiday and never returned.
Everyone agrees that the cutest animals are penguin and quail.
Everyone agrees that the cutest animals are raccoon and seal.
Everyone agrees that the cutest animals are tiger and unicorn.
Everyone agrees that the cutest animals are viper and whale.
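
Whether a pair counts as a match at a given threshold depends on how the edit distance is turned into a percentage, and the formula used by the scripts is not stated in this thread. The worked example below takes the kangaroo and owl sentences from the list above and shows two common normalisations, both of which are assumptions: dividing by the longer sentence and dividing by the shorter one.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


kangaroo = "The kangaroo went on a long holiday and never returned."
owl = "The owl went on a long holiday and never returned."

distance = levenshtein(kangaroo, owl)  # only "kangaroo" vs "owl" differs
by_longer = 1 - distance / max(len(kangaroo), len(owl))   # 55 characters
by_shorter = 1 - distance / min(len(kangaroo), len(owl))  # 50 characters

print(distance, round(by_longer, 3), round(by_shorter, 3))  # 8 0.855 0.84
```

With these two sentences, an 85% threshold groups the pair under the first normalisation but not under the second, so a small change in the formula can change the size of the trimmed list.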

[Edited at 2020-08-17 08:08 GMT]

James Plastow

United Kingdom
Local time: 15:11
Member (2020)
Japanese to English

Excel

Aug 16, 2020

Perhaps this fuzzy match add-in for Excel might also be able to do the job? (I have not tried it)

https://www.microsoft.com/en-us/download/details.aspx?id=15011

Samuel Murray

Netherlands
Local time: 16:11
Member (2006)
English to Afrikaans
+ ...

TOPIC STARTER

@James

Aug 16, 2020

James Plastow wrote:
Perhaps this fuzzy match add-in for Excel might also be able to do the job? (I have not tried it)

No, the Fuzzy Lookup add-on in Excel compares two tables and tries to match data from the one table to data in the other table. It does not (and can't) match cells from within a single table (i.e. a single list of sentences) to each other.

https://www.youtube.com/watch?v=3v-qxcjZbyo

James Plastow

United Kingdom
Local time: 15:11
Member (2020)
Japanese to English

ablebits

Aug 16, 2020

I see,

https://www.ablebits.com/docs/excel-find-fuzzy-duplicates/

is another one, but I guess you have already looked into the available options.

Samuel Murray

Netherlands
Local time: 16:11
Member (2006)
English to Afrikaans
+ ...

TOPIC STARTER

@James

Aug 16, 2020

James Plastow wrote:
https://www.ablebits.com/docs/excel-find-fuzzy-duplicates/
is another one, but I guess you have already looked into the available options.

No, I haven't looked at Excel tools at all.

The Ablebits one appears to be a potential solution, but unfortunately it went straight to end-of-demo on my computer, so I was unable to test it. It doesn't appear that one can set a fuzzy match percentage; rather, one sets a number of characters that differ. So it may work, but it may also not (since it was really designed to find fuzzy matches in short field data, e.g. people's names or addresses). It is a bit expensive, though: $99. Here's a video of it:
https://www.youtube.com/watch?v=2Tc5Ifl2bX4

James Plastow

United Kingdom
Local time: 15:11
Member (2020)
Japanese to English

google sheets

Aug 16, 2020

OK, if you haven't looked into it, it may also be worth searching for Google Sheets fuzzy match add-ins.
Seems like there are quite a few options, for example Flookup.

Samuel Murray

Netherlands
Local time: 16:11
Member (2006)
English to Afrikaans
+ ...

TOPIC STARTER

Okay, we should investigate various Google Sheets and Excel add-ons

Aug 16, 2020

James Plastow wrote:
It may also be worth searching for Google Sheets fuzzy match add-ins. For example Flookup.

Okay.

FWIW, Flookup itself doesn't help with this problem: it can only remove (or highlight) all fuzzy matches (not all except one), and in my tests it flagged widely divergent sentences of very different lengths as fuzzy matches. Also, the highest match threshold is 90% (it can also do 80%, 70%, 60%, etc., but not e.g. 93% or 95%). Price: $10 per month.

James Plastow

United Kingdom
Local time: 15:11
Member (2020)
Japanese to English

workaround

Aug 16, 2020

I have been playing around with it. One way is to say:
> Approximate fuzzy matches as segments with duplicate substrings

So, use TextSTAT (free) to find the most frequent substrings in the text. (paste the source into Notepad then open in TextSTAT)
* Actually this works nicely for Japanese but only matches individual words with English. There should be some software that will analyze phrase frequency in Western languages.

Export the list of frequent substrings and open it in Excel

Tidy up the list if necessary

Use the TEXTJOIN function with | as the delimiter to create a search term for all the frequent substrings

Paste this into the Trados filter box

No good for specific percentages of fuzzy matches but it does let you filter for segments with repeated phrases
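
The steps above can be approximated in a few lines of Python for anyone who prefers a script to TextSTAT and Excel. This is only a sketch: counting word 4-grams stands in for the substring list from TextSTAT, joining with `|` stands in for TEXTJOIN, and the function names and the values of `n` and `min_segments` are made up for illustration.

```python
import re
from collections import Counter


def frequent_phrases(segments, n=4, min_segments=2):
    """Return word n-grams that occur in at least `min_segments` segments."""
    counts = Counter()
    for segment in segments:
        words = segment.split()
        # A set, so that each segment counts a given phrase only once.
        ngrams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        counts.update(ngrams)
    return sorted(p for p, count in counts.items() if count >= min_segments)


def build_filter(phrases):
    """Join phrases with | into one search pattern, escaping regex characters."""
    return "|".join(re.escape(p) for p in phrases)


segments = [
    "The kangaroo went on a long holiday and never returned.",
    "The lion went on a long holiday and never returned.",
    "In the Old West, cowboys and their wives ate only fish.",
]
phrases = frequent_phrases(segments)
pattern = build_filter(phrases)
matching = [s for s in segments if re.search(pattern, s)]
print(len(phrases), len(matching))
```

The resulting pattern is a regular expression, so it would need a filter box that accepts regular expressions.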

[Edited at 2020-08-16 19:55 GMT]

[Edited at 2020-08-16 19:58 GMT]

Samuel Murray

Netherlands
Local time: 16:11
Member (2006)
English to Afrikaans
+ ...

TOPIC STARTER

@James

Aug 16, 2020

James Plastow wrote:
Find the most frequent sub-strings in the text [and filter segments by it in Trados].
It does let you filter for segments with repeated phrases.

That may be so, but that is not what I'm trying to achieve. Take a look at these two sentences:
- The rain in Spain falls mainly on the plains in October and November each year.
- In the Old West, cowboys sat on their horses mainly on the plains where the grasses grew.
Both contain the sub-string "mainly on the plains" but they are by no means fuzzy matches of each other.

I'm not convinced that one can say that segments are fuzzy matches of each other if they share frequent sub-strings... but even if one could, non-fuzzy matches may also contain those sub-strings, and we don't want to flag non-fuzzy matching segments.
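
The objection can be checked with a quick script. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure (it is not Levenshtein, and not what the AutoIt scripts use) to show that the two example sentences share the phrase yet score well below a typical fuzzy threshold.

```python
from difflib import SequenceMatcher

a = "The rain in Spain falls mainly on the plains in October and November each year."
b = "In the Old West, cowboys sat on their horses mainly on the plains where the grasses grew."
shared = "mainly on the plains"

# A substring filter would flag both sentences...
shares_phrase = shared in a and shared in b

# ...but an overall similarity score treats them as largely different.
ratio = SequenceMatcher(None, a, b).ratio()

print(shares_phrase, round(ratio, 2))
```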

[Edited at 2020-08-16 21:18 GMT]

Hans Lenting
Netherlands
Member (2006)
German to Dutch

Transit NXT

Aug 17, 2020

https://www.star-spain.com/en/blog/transittermstar-nxt-tooltips/creating-translation-extracts-and-reference-extracts

Samuel Murray

Netherlands
Local time: 16:11
Member (2006)
English to Afrikaans
+ ...

TOPIC STARTER

@Hans

Aug 17, 2020

Hans Lenting wrote:
https://www.star-spain.com/en/blog/transittermstar-nxt-tooltips/creating-translation-extracts-and-reference-extracts

Thanks. I don't have NXT but from what I can tell from that blog post, these two features are not what we're looking for.

According to the blog post (and from the screenshots), in NXT one can create two types of reduced sets of data, namely a "translation extract" (which extracts all untranslated segments, for re-import later) and a "reference extract" (which, and I'm guessing, extracts TUs from TMs and possibly also glossaries).

The reference extract does have the option of specifying a fuzzy threshold, so if NXT retains multiple instances of TUs in its TMs, then perhaps this feature can be used after all. Create a source=target TM from the source file, then run a "reference extract" with the source file against that TM only. Export the result to a format that one can process (e.g. TMX) and remove any TUs that occur only once (a TU that occurs more than once was a match for more than just its own segment). Finally, convert that to a new plaintext source file with one segment per line, and remove duplicate lines. This all hinges on the assumption that NXT writes (retains) multiple instances of identical translations into its own TM... or that its TM system contains a segment re-use counter.
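
The last two steps of that idea (drop TUs that occur only once, then remove duplicate lines) are simple to script once the extract has been exported. The sketch below assumes the exported source segments have already been read into a Python list; the function name is hypothetical, and whether NXT actually produces such repeated entries is untested.

```python
from collections import Counter


def segments_with_internal_matches(extracted_sources):
    """Keep segments that occur more than once, one copy each, in original order."""
    counts = Counter(extracted_sources)
    seen = set()
    result = []
    for segment in extracted_sources:
        if counts[segment] > 1 and segment not in seen:
            seen.add(segment)
            result.append(segment)
    return result


extract = ["A", "B", "A", "C", "B", "A"]
print(segments_with_internal_matches(extract))  # ['A', 'B']
```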

See also my updated first post with a test file.

[Edited at 2020-08-17 08:08 GMT]

Samuel Murray

Netherlands
Local time: 16:11
Member (2006)
English to Afrikaans
+ ...

TOPIC STARTER

New, updated version

Aug 18, 2020

http://www.leuce.com/autoit/trim%20internal%20fuzzies%20v03a.zip

New improved version, with non-working scripts removed, and with more settings to speed things up (by reducing the number of matches found). An input list of anything over 1000 sentences runs the risk of running for an hour or more. The "type 1" script is the fastest for long texts. Also expor...
