Trim internal fuzzies (AutoIt script) Thread poster: Samuel Murray
Samuel Murray Netherlands Local time: 16:11 Member (2006) English to Afrikaans + ...
Hello everyone

I have written a very simple, work-in-progress, proof-of-concept set of scripts that searches a list of sentences for internal fuzzy matches and then groups them together. This would potentially allow one to share a job among multiple translators while preventing internal fuzzy matches from being split between translators (who would otherwise not have any benefit from such matches, since the "other" sentences may have been given to other translators).

These are AutoIt scripts, so you need AutoIt installed on your computer to use them. The input file format is plain text, UTF-8 with BOM, one sentence per line. The output file contains the same sentences in the same order, except that internal fuzzy matching segments from later in the list are moved up and grouped with the earliest one of the matches.

http://www.leuce.com/autoit/trim%20internal%20fuzzies%20v01.zip

There are four scripts, using Levenshtein and Sift2 fuzzy matching and two different methods of searching. The inner workings of both Levenshtein and Sift2 are Greek to me, so don't bother asking such technical questions. (-:

Samuel

PS. If you know of any CAT tool that can do this sort of thing (i.e. either remove/export or group internal fuzzies in a file), please please let me know. It should be a standard feature, but it isn't.
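To illustrate the grouping described in this post, here is a minimal sketch in Python (not AutoIt, and not the code from the ZIP file). It assumes that similarity is 1 minus the Levenshtein distance divided by the length of the longer sentence, and that each later sentence is compared against the earliest sentence that has not yet been grouped. The names `levenshtein`, `similarity` and `group_internal_fuzzies`, and the default threshold of 0.75, are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed definition: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def group_internal_fuzzies(sentences, threshold=0.75):
    """Keep the original order, but move later fuzzy matches up so that
    they directly follow the earliest sentence they match."""
    used = [False] * len(sentences)
    grouped = []
    for i, anchor in enumerate(sentences):
        if used[i]:
            continue  # already moved up into an earlier group
        used[i] = True
        grouped.append(anchor)
        for j in range(i + 1, len(sentences)):
            if not used[j] and similarity(anchor, sentences[j]) >= threshold:
                used[j] = True
                grouped.append(sentences[j])
    return grouped


if __name__ == "__main__":
    lines = [
        "The lion went on a long holiday.",
        "Cowboys ate only fish.",
        "The newt went on a long holiday.",
        "Cowboys ate only hippo.",
    ]
    for line in group_internal_fuzzies(lines):
        print(line)
```

Every sentence is compared with every later one, so the number of comparisons grows with the square of the list length.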
---------------------------------------------------------------------------
Added: If a tool or CAT tool has a feature such as the one I'm talking about, one should be able to tweak its settings to reduce this list of 20 segments down to fewer than ten, or even down to a list of four:

This is the house that the Jack built for his friend the alligator.
This is the house that the Jack built for his friend the bear.
This is the house that the Jack built for his friend the camel.
This is the house that the Jack built for his friend the dolphin.
This is the house that the Jack built for his friend the elephant.
In the Old West, cowboys and their wives ate only fish.
In the Old West, cowboys and their wives ate only giraffe.
In the Old West, cowboys and their wives ate only hippo.
In the Old West, cowboys and their wives ate only insect.
In the Old West, cowboys and their wives ate only jellyfish.
The kangaroo went on a long holiday and never returned.
The lion went on a long holiday and never returned.
The monkey went on a long holiday and never returned.
The newt went on a long holiday and never returned.
The owl went on a long holiday and never returned.
Everyone agrees that the cutest animals are penguin and quail.
Everyone agrees that the cutest animals are raccoon and seal.
Everyone agrees that the cutest animals are tiger and unicorn.
Everyone agrees that the cutest animals are viper and whale.
Everyone agrees that the cutest animals are x-ray fish, yak and zebra.

If I run one of my Levenshtein scripts on this list, I get these results:

Fuzzy threshold: 75%
This is the house that the Jack built for his friend the alligator.
In the Old West, cowboys and their wives ate only fish.
The kangaroo went on a long holiday and never returned.
Everyone agrees that the cutest animals are penguin and quail.

Fuzzy threshold: 85%
This is the house that the Jack built for his friend the alligator.
In the Old West, cowboys and their wives ate only fish.
The kangaroo went on a long holiday and never returned.
The owl went on a long holiday and never returned.
Everyone agrees that the cutest animals are penguin and quail.
Everyone agrees that the cutest animals are raccoon and seal.
Everyone agrees that the cutest animals are tiger and unicorn.
Everyone agrees that the cutest animals are viper and whale.
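The reduction shown in these results (keep only the first segment of each group of internal fuzzies) can be sketched roughly as follows in Python. This is not the Levenshtein script itself: for brevity it swaps in the standard library's `difflib.SequenceMatcher.ratio()` as the similarity measure, so its percentages will not match the ones above. The function name `trim_internal_fuzzies` is hypothetical, and comparing each sentence only against the sentences already kept is an assumption.

```python
from difflib import SequenceMatcher


def trim_internal_fuzzies(sentences, threshold=0.75):
    """Return the sentences that are not a fuzzy match of an earlier kept one."""
    kept = []
    for sentence in sentences:
        # A sentence is dropped as soon as it resembles any kept sentence.
        is_fuzzy = any(
            SequenceMatcher(None, earlier, sentence).ratio() >= threshold
            for earlier in kept
        )
        if not is_fuzzy:
            kept.append(sentence)
    return kept


if __name__ == "__main__":
    sample = ["abc def ghi jkl", "abc def ghi jkm", "zzzz"]
    print(trim_internal_fuzzies(sample, threshold=0.80))
    print(trim_internal_fuzzies(sample, threshold=0.95))
```

As in the results above, raising the threshold makes fewer sentences count as matches, so more of them survive the trim.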
[Edited at 2020-08-17 08:08 GMT]

James Plastow United Kingdom Local time: 15:11 Member (2020) Japanese to English

Samuel Murray Netherlands Local time: 16:11 Member (2006) English to Afrikaans + ... TOPIC STARTER
James Plastow wrote: Perhaps this fuzzy match add-in for Excel might also be able to do the job? (I have not tried it)

No, the Fuzzy Lookup add-on in Excel compares two tables and tries to match data from the one table to data in the other table. It does not (and can't) match cells from within a single table (i.e. a single list of sentences) to each other.

https://www.youtube.com/watch?v=3v-qxcjZbyo

James Plastow United Kingdom Local time: 15:11 Member (2020) Japanese to English
Samuel Murray Netherlands Local time: 16:11 Member (2006) English to Afrikaans + ... TOPIC STARTER
No, I haven't looked at Excel tools at all. The Ablebits one appears to be a potential solution, but unfortunately it went straight to end-of-demo on my computer, so I was unable to test it. It doesn't appear that one can set a fuzzy match percentage; rather, one sets the number of characters that may differ. So it may work, but it may also not (since it was really designed to find fuzzy matches in short field data, e.g. people's names or addresses). It is a bit expensive, though: $99. Here's a video of it:

https://www.youtube.com/watch?v=2Tc5Ifl2bX4

James Plastow United Kingdom Local time: 15:11 Member (2020) Japanese to English
google sheets | Aug 16, 2020
OK, if you haven't looked into it, it may also be worth searching for Google Sheets fuzzy match add-ins. Seems like there are quite a few options, for example Flookup.

Samuel Murray Netherlands Local time: 16:11 Member (2006) English to Afrikaans + ... TOPIC STARTER
Okay, we should investigate various Google Sheets and Excel add-ons | Aug 16, 2020
James Plastow wrote: It may also be worth searching for Google Sheets fuzzy match add-ins. For example Flookup.

Okay. FWIW, Flookup itself doesn't help with this problem: it can only remove (or highlight) all fuzzy matches (rather than all except one), and in my tests it flagged widely divergent sentences of divergent lengths as being fuzzy matches. And the highest match threshold is 90% (it can also do 80%, 70%, 60% etc., but not e.g. 93% or 95%). Price: $10 per month.

James Plastow United Kingdom Local time: 15:11 Member (2020) Japanese to English
I have been playing around with it. One way is to approximate fuzzy matches as segments with duplicate substrings:

- Use TextSTAT (free) to find the most frequent substrings in the text (paste the source into Notepad, then open it in TextSTAT). *
- Export the list of frequent substrings and open it in Excel.
- Tidy up the list if necessary.
- Use the TEXTJOIN function with | as the delimiter to create a search term for all the frequent substrings.
- Paste this into the Trados filter box.

* Actually this works nicely for Japanese but only matches individual words with English. There should be some software that will analyze phrase frequency in Western languages.

No good for specific percentages of fuzzy matches, but it does let you filter for segments with repeated phrases.
[Edited at 2020-08-16 19:55 GMT]
[Edited at 2020-08-16 19:58 GMT]
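The "join with | and filter" step of this workflow can be sketched in Python as below. This is only an illustration of the idea, not of how TextSTAT, Excel or Trados work internally: it assumes that the filter treats `|` as regular-expression alternation, and the function names `build_filter_pattern` and `filter_segments` are hypothetical.

```python
import re


def build_filter_pattern(substrings):
    """Join the substrings with | into one alternation pattern."""
    # Escape each substring so punctuation is matched literally, and put
    # longer phrases first so they are tried before their own prefixes.
    ordered = sorted((s for s in substrings if s), key=len, reverse=True)
    return re.compile("|".join(re.escape(s) for s in ordered))


def filter_segments(segments, substrings):
    """Return the segments that contain at least one of the substrings."""
    if not any(substrings):
        return []  # an empty pattern would otherwise match every segment
    pattern = build_filter_pattern(substrings)
    return [segment for segment in segments if pattern.search(segment)]


if __name__ == "__main__":
    frequent = ["mainly on the plains", "long holiday"]
    segments = [
        "The rain falls mainly on the plains.",
        "Hello world.",
        "The owl went on a long holiday.",
    ]
    for segment in filter_segments(segments, frequent):
        print(segment)
```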
Samuel Murray Netherlands Local time: 16:11 Member (2006) English to Afrikaans + ... TOPIC STARTER
James Plastow wrote: Find the most frequent sub-strings in the text [and filter segments by it in Trados]. It does let you filter for segments with repeated phrases.

That may be so, but that is not what I'm trying to achieve. Take a look at these two sentences:

- The rain in Spain falls mainly on the plains in October and November each year.
- In the Old West, cowboys sat on their horses mainly on the plains where the grasses grew.

Both contain the sub-string "mainly on the plains" but they are by no means fuzzy matches of each other. I'm not convinced that one can say that segments are fuzzy matches of each other if they share frequent sub-strings... but even if one could, non-fuzzy matches may also contain those sub-strings, and we don't want to flag non-fuzzy matching segments.
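The point can be checked with a few lines of Python. This sketch uses the standard library's `difflib.SequenceMatcher.ratio()` as a stand-in similarity measure (it is not Levenshtein or Sift2, so the exact percentage is only indicative), and the 75% threshold is just an example value.

```python
from difflib import SequenceMatcher

first = ("The rain in Spain falls mainly on the plains "
         "in October and November each year.")
second = ("In the Old West, cowboys sat on their horses "
          "mainly on the plains where the grasses grew.")
shared = "mainly on the plains"

# Both sentences contain the frequent sub-string...
both_contain = shared in first and shared in second

# ...but as whole sentences they are not similar enough to be a fuzzy match.
ratio = SequenceMatcher(None, first, second).ratio()
is_fuzzy_match = ratio >= 0.75

if __name__ == "__main__":
    print("share the sub-string:", both_contain)
    print(f"overall similarity: {ratio:.0%}")
    print("fuzzy match at 75%:", is_fuzzy_match)
```

A sub-string filter would flag both sentences, while a whole-sentence comparison keeps them apart.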
[Edited at 2020-08-16 21:18 GMT]

Samuel Murray Netherlands Local time: 16:11 Member (2006) English to Afrikaans + ... TOPIC STARTER
Thanks. I don't have NXT, but from what I can tell from that blog post, these two features are not what we're looking for. According to the blog post (and the screenshots), in NXT one can create two types of reduced sets of data, namely a "translation extract" (which extracts all untranslated segments, for re-import later) and a "reference extract" (which, I'm guessing, extracts TUs from TMs and possibly also glossaries).

The reference extract does have the option of specifying a fuzzy threshold, so if NXT retains multiple instances of TUs in its TMs, then perhaps this feature can be used after all:

- Create a source=target TM from the source file.
- Run a "reference extract" with the source file against that TM only.
- Export it to a format that one can process (e.g. TMX).
- Remove any TUs that occur only once (TUs that occur more than once would be TUs that were a match for more than just their own segment).
- Convert that to a new plaintext source file with one segment per line, and remove duplicate lines.

This all hinges on the assumption that NXT writes (retains) multiple instances of identical translations into its own TM... or that its TM system contains a segment re-use counter.

See also my updated first post with a test file.
[Edited at 2020-08-17 08:08 GMT]

Samuel Murray Netherlands Local time: 16:11 Member (2006) English to Afrikaans + ... TOPIC STARTER
New, updated version | Aug 18, 2020
http://www.leuce.com/autoit/trim%20internal%20fuzzies%20v03a.zip

New improved version, with non-working scripts removed, and with more settings to speed things up (by reducing the number of matches found). An input list of anything over 1000 sentences runs the risk of running for an hour or more. The "type 1" script is the fastest for long texts. Also exports a second file with only the fuzzy segments. Real-world example included in the ZIP file.
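On the running time: comparing every sentence with every other one means n(n-1)/2 comparisons, which is 499,500 pairs for 1000 sentences. One common way to cut that down, sketched here in Python, is a length pre-filter that skips pairs that cannot possibly reach the threshold. Whether the v03a scripts do anything like this is not stated in the post. The sketch assumes similarity is defined as 1 minus the Levenshtein distance divided by the longer length, and the names `may_match` and `candidate_pairs` are hypothetical.

```python
def may_match(a: str, b: str, threshold: float) -> bool:
    """Cheap test: can a and b possibly reach the similarity threshold?"""
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    # The Levenshtein distance is never smaller than the difference in
    # length, so this is an upper bound on the assumed similarity.
    best_possible = 1.0 - abs(len(a) - len(b)) / longest
    return best_possible >= threshold


def candidate_pairs(sentences, threshold):
    """Index pairs (i, j), i < j, that still need a full comparison."""
    return [
        (i, j)
        for i in range(len(sentences))
        for j in range(i + 1, len(sentences))
        if may_match(sentences[i], sentences[j], threshold)
    ]


if __name__ == "__main__":
    n = 1000
    print("pairs without any filter:", n * (n - 1) // 2)
    print(candidate_pairs(["aaaa", "aaab", "a" * 20], threshold=0.75))
```

The filter only removes pairs that could never match, so it does not change which fuzzy matches are found; how much time it saves depends on how much the sentence lengths vary.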
[Edited at 2020-08-18 20:16 GMT]