Caveats of machine translation

Posted on September 27, 2022

by Oliver Czulo, Venema Victor, Jo Havemann, Jennifer Miller and Dasapta Irawan

Translation tools (often referred to as CAT tools for ‘computer-aided translation’) are a great means of streamlining some of the elements of a translation process, such as checking terminology or retrieving existing translations (so-called Translation Memories). Modern versions of these tools allow for web-based, collaborative translation, giving collaborators such possibilities as revising and/or commenting on proposed translations, evaluating existing translations or adding machine translation (henceforth MT) support. Modern MT systems are based on artificial neural networks, which have boosted quality considerably since roughly the mid-2010s.

CAT tools, with or without added machine translation support, have been studied from various angles. While they in general increase efficiency and often ease the task of translation as translators do not have to start from scratch, there are some caveats to be kept in mind when working with them. Here are some of the more important ones.

A major problem which has been described is lack of consistency. This extends not only to the terminological level, as shown, e.g., by Čulo & Nitzke (2016); a system may also suddenly change its output style, for instance by switching between different forms of addressing readers. Translators do not necessarily spot these inconsistencies, a problem which is sometimes attributed to how CAT tools display source and target text (mostly in segments of sentences, aligned left-to-right): as translators check sentence by sentence, they do not easily perceive the text as a whole in their revision, a sort of peephole effect. Sentence-by-sentence evaluation is also the reason why MT systems often used to score better in their evaluation than they deserved and sometimes still do (see, e.g., Castilho 2021; Krüger 2022): when only translations of single sentences are checked, inconsistencies are not spotted and thus not penalised.
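To counter the peephole effect, consistency can be checked across the whole document rather than segment by segment. The following is only a minimal sketch of that idea, not a feature of any particular CAT tool: the function name `find_term_variants`, the sample segments and the small glossary are all assumptions made for illustration, and simple substring matching stands in for proper terminology extraction.

```python
from collections import defaultdict


def find_term_variants(segments, glossary):
    """Collect, per source term, the target variants used across the whole text.

    segments: list of (source_sentence, target_sentence) pairs.
    glossary: dict mapping a source term to the target variants to look for.
    Returns only those terms for which more than one variant was found.
    """
    used = defaultdict(set)
    for source, target in segments:
        source_lower = source.lower()
        target_lower = target.lower()
        for term, variants in glossary.items():
            if term.lower() not in source_lower:
                continue
            for variant in variants:
                # Naive substring match; a real check would need tokenisation.
                if variant.lower() in target_lower:
                    used[term].add(variant)
    return {term: sorted(found) for term, found in used.items() if len(found) > 1}


# Hypothetical English-German segments, invented for this example.
segments = [
    ("Restart the computer.", "Starten Sie den Computer neu."),
    ("The computer is now ready.", "Der Rechner ist jetzt bereit."),
    ("Close the window.", "Schließen Sie das Fenster."),
]
glossary = {"computer": ["Computer", "Rechner"], "window": ["Fenster"]}

inconsistent = find_term_variants(segments, glossary)
print(inconsistent)
```

Each sentence pair on its own looks fine here; the switch between two renderings of the same term only becomes visible once the segments are compared with each other.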

A second very serious problem, as known from other fields of AI, is that neural MT systems reproduce biases that are implicitly or explicitly encoded in the training texts, a notable issue being gender bias. This shows when translating from a language that has little or no grammatical gender, such as English, into a language such as German, which differentiates between a grammatical ‘masculine’, ‘feminine’ and ‘neuter’ gender (which often, but not necessarily, coincides with (supposed) biological sex for nouns referring to humans): Try translating “sexy pianist” and “clever pianist” into German with MT systems like DeepL. At the time of writing the first version of these notes, the former translated into “sexy Pianistin” (feminine gender), the latter into “geschickter Pianist” (masculine gender). Also, gendering across a text can be wildly inconsistent. And, highlighting the non-deterministic and adaptive nature of such systems, the results can vary not only between systems, but even for one system over time.

Third, watch out for missing or even spurious additional text. Koehn & Knowles (2017) describe some of the challenges of early neural machine translation research, some of which have been addressed in the meantime, but an important one remains: MT hallucination, also called MT fiction. Neural MT systems basically operate by trying to predict the most likely next output based on previous input (which, in principle, is the same mechanism that allows for search completion in a web search bar). Take a moment to reflect on the options you are given in a search query completion: some of them may be very fitting, others quite nonsensical. Modern MT systems have become very good at picking out the fitting options, but when they cannot ‘make sense’ of the input, they may omit something, just try to ‘guess’ or even add material that is not there in the source text.

Last but not least, data ethics should be raised as an issue here. Note that for web-based CAT tools and/or machine translation systems (also those that you can plug into your locally installed CAT tool), the source text will be copied over to and processed by multiple other machines. Even if you have the permission to produce a translation that is accessible under more liberal terms, this can technically be a violation of copyright for the source text if it falls under stricter copyright terms. Anonymization of people, which may not have been much of an issue for printed, narrowly distributed material, can also pose an issue in such settings, even if you choose to perform anonymization for the target text. Ecological matters may apply as well, giving rise to the question of how often and at which stage(s) MT should be used: it requires, after all, quite a bit of computing power. For a more in-depth discussion of ethics and the use of machine translation, see Moorkens (2022).

This is the second post to help people make and publish translations of scientific works. The first post gave some “theoretical background to translation”.

References

Castilho, Sheila. 2021. ‘Towards document-level human MT evaluation: On the issues of annotator agreement, effort and misevaluation’. In Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval), 34–45. Online: Association for Computational Linguistics. https://aclanthology.org/2021.humeval-1.4.

Čulo, Oliver, and Jean Nitzke. 2016. ‘Patterns of Terminological Variation in Post-Editing and of Cognate Use in Machine Translation in Contrast to Human Translation’. Baltic Journal of Modern Computing 4 (2): 106–14. https://aclanthology.org/W16-3401.pdf

Koehn, Philipp, and Rebecca Knowles. 2017. ‘Six Challenges for Neural Machine Translation’. In Proceedings of the First Workshop on Neural Machine Translation, 28–39. Vancouver: Association for Computational Linguistics. https://doi.org/10.18653/v1/W17-3204.

Krüger, Ralph. 2022. ‘Some Translation Studies Informed Suggestions for Further Balancing Methodologies for Machine Translation Quality Evaluation’. Translation Spaces, March. https://doi.org/10.1075/ts.21026.kru.

Moorkens, Joss. 2022. ‘Ethics and Machine Translation’. In Machine Translation for Everyone: Empowering Users in the Age of Artificial Intelligence, edited by Dorothy Kenny, 121–40. Translation and Multilingual Natural Language Processing 18. Language Science Press. https://zenodo.org/record/6653406.

Top photo of robot by Hello Robotics. This file is licensed under the Creative Commons Attribution-Share Alike 4.0 International license.

Theoretical background to translation

Posted on September 19, 2022

by Oliver Czulo, Jennifer Miller, Jo Havemann, Venema Victor and Dasapta Irawan

Photo of a backlit keyboard with a person typing

This is the first of two blog posts with general notes on how to approach the task of translating science, touching upon the most prevalent basic notions and advice relevant to the task. The two main points presented in this and the upcoming post are (a) an introduction to a present-day functionalist view of translation, which provides for a wide range of purpose-driven strategic translation options, and (b) key caveats when making use of digital support tools for translation, including machine translation. These general notes are meant for people who read academic texts at a postgraduate level and have experience with scholarly publishing, but may have little to no experience in translation. As technological tools are nowadays omnipresent in translation processes, they are included here under basic background to translation.

Translation

Translation is a cluster concept (Tymoczko 2005) that is constituted by various cultural practices with complex overlapping similarities. This includes what is sometimes referred to as ‘translation proper’, i.e. ‘transferring’ a (mostly) written source text from one language to a target text in another language. Interpreting, i.e. the ‘transfer’ of (mostly) spoken language, is part of the cluster concept, as are localization – of software, video games and the like – sur-/subtitling, transcreating etc. In the following, the terms translation and translate shall include all these practices.

On a side note: It is exactly this understanding of translation as a cluster of cultural practices which opens up the possibility of studying not just the linguistic differences between two texts, but the whole range of patterns of practices and power concerning translation, including, but not limited to, such questions as what is translated and who commissions translations, what conscious and subconscious translation strategies are being taught and applied, how censorship and translation interact, etc. This wiki page introduces key concepts and issues that inform the pragmatics of translating a specific text.

Functionalism and translation strategies

Functionalist theories of translation (see, e.g., Vermeer 1989) have highlighted that translation is a purposeful activity, i.e. it is text production with a goal and an audience, with a precursor, the source text, which may require different levels of adaptation to the target culture. Nord (1989) introduces the spectrum between documentary and instrumental translation: The former is meant to highlight the original make-up of the source text, with interlinear glosses being an extreme form; the latter aims at producing a text which is meant to act as a target culture text and should not be discernible from original texts. All in all, a functionalist approach to translation offers us a wide array of translation strategies, keeping in mind that, following Nord, we should remain loyal to both the creators of the source text and the intended audience of the target text.

Applied to the purpose of translating science, you might ask yourself, for instance, how to go about the subtitling of a video which introduces a scientific topic. While, of course, you will want to get the terminology and the science right, do think about what the idea of the source text is: Is it purely informative or does the video at hand also aim to entertain? Assuming it does, what is your goal in translation: Do you mostly care about the science or do you want to entertain as well? What you probably will not want is a ‘close’ translation in a structural sense, i.e. trying to mimic the syntactic or lexical structures of the source text – unless you are aiming, e.g., at documenting which linguistic strategies can be used in a certain language for edutainment videos. Another example is that of the translation of Bron Taylor’s book “Dark green religion” into German, where the author explicitly encouraged the translator Kocku von Stuckrad to add comments explaining how historical US-related circumstances compare to those in Germany, making the text more accessible to a German audience (von Stuckrad in Taylor 2020: 303).

Technical translation and cultural influences

A common misconception is that terminology (or language on the whole) in the natural and engineering sciences is near-‘objective’ in the sense that it fosters a ‘simple’ one-to-one transfer between languages. However, cultural influences abound in technical language too, influencing terminology, phraseology, style, text structures, argumentation patterns etc. Cultural influence here does not solely refer to the larger setting of regional, national, areal or global cultures, but also to cultures of specific scientific fields and subfields (i.e., shared assumptions, traditions, practices, etc.). Even within a language, creating, e.g., something like a common terminology may be quite an undertaking, especially in younger fields of research (see, e.g., Avizienis et al. 2004 for the field of dependable and fault tolerant computing). Between languages, even the slightest differences in conceptualizations and uses can pose a challenge. On top of this, the influence of larger cultural contexts is omnipresent not just in the humanities or social sciences, with the discussion about the English master/slave terminology in computing and electrical engineering as a very prominent and illustrative example (Charboneau 2020). As pointed out above, these differences may extend to other linguistic levels such as phraseology, argumentation patterns or text structures, in some cases giving rise to strategies of translation which are often subsumed under adaptation, i.e. making deep(er) changes to the make-up of a (stretch of) text in order to make it more adequate to the target culture and fitting to the purpose, which can be quite in line with Nord’s principle of loyalty. Whichever strategy you choose, be aware of these cultural factors even in technical language.

Translating science

Who can translate science?

Translation is, very likely more often than not, co-creation. Professional translators will have learned how to quickly adapt to the terminology, phraseology and style of a field, how to invent new terminology, how to perform effective research in cases of doubt and – actually one of the most challenging and frequent problems in translation – how to deal with faulty, ambiguous or badly formulated (stretches of) source texts; but technical expertise is still often required for translation, inside as much as outside of translating science. Many works are translated by (groups of) people with domain knowledge and the necessary linguistic competences, and it is not unusual to have MA or PhD translation students as well as career changers from fields completely different from linguistics, given a certain background in their languages and cultures of interest.

This provides us with a number of options when it comes to the question of who could translate science: It could be scientists, alone or in groups with complementary domain or language skills; some institutions might even have translation services that can spare at least some time to (help) translate science; or some stakeholders might have money on the side to commission translation. In all cases, however, the domain knowledge of scientists will be crucial, and should you be in the position to commission a translation, be prepared to answer questions on linguistic and other aspects of the field in question.

Aiding (commissioned) translation / translators

In any case, for a commissioned translation, be prepared as a scientist to act as the domain expert. You can aid the linguistic side of a (commissioned) translation if you can put some sort of terminology (e.g. any dictionary for your field that you have at hand) at the disposal of those who translate, or if you have a collection of texts (ideally in all languages involved) which you can make available so that term candidates and collocation patterns can be extracted quickly by means of the appropriate tools (see, e.g., on this wiki; professional translators should have acquired access to such tools). If you have commissioned a translation, the use of tools which allow for collaborative work can be a great help, e.g. in order to quickly comment on questions translators might have.

Literature

Avizienis, Algirdas, J-C Laprie, Brian Randell, and Carl Landwehr. 2004. ‘Basic Concepts and Taxonomy of Dependable and Secure Computing’. IEEE Transactions on Dependable and Secure Computing 1 (1): 11–33.

Charboneau, Tyler. 2020. ‘How “Master” and “Slave” Terminology Is Being Reexamined in Electrical Engineering – News’. Accessed 1 August 2022. https://www.allaboutcircuits.com/news/how-master-slave-terminology-reexamined-in-electrical-engineering/

Nord, Christiane. 1989. ‘Loyalität Statt Treue. Vorschläge zu einer funktionalen Übersetzungstypologie’. Lebende Sprachen, no. 3: 100–105.

Taylor, Bron. 2020. Dunkelgrüne Religion: Naturspiritualität und die Zukunft des Planeten. Translated by Kocku von Stuckrad. Leiden, Netherlands: Brill, Wilhelm Fink.

Tymoczko, Maria. 2005. ‘Trajectories of Research in Translation Studies’. Meta 50 (4): 1082–97.

Vermeer, Hans J. 1989. ‘Skopos and commission in translational action’. In Readings in Translation Theory, edited by Andrew Chesterman, 173–87. Helsinki: Oy Finn Lectura AB.

Top photo of keyboard made by Colin / Wikimedia Commons / CC BY-SA 4.0

Building a tool to find translated scientific articles

Posted on November 2, 2021

by Venema Victor

*Screenshot of a mock-up of the Translation Science Switchboard.*

You know an article exists, but cannot read its language. So you go to our tool, paste the Digital Object Identifier of the article and get a list of translated versions. You manage your articles in a reference manager and notice that an article on your reading list is now also available in your mother tongue. You are really enthusiastic about a new article that was just published, which has policy implications for your country, and you want to translate it so that more people can read it. On our tool you find a partial translation made by a colleague from another university department; you jointly finish the translation, publish it on a repository and upload the link to our database.

These scenarios demonstrate that a translation finding tool would be really useful and could also stimulate the production of translations.

One of us started dreaming of such a tool while attending a climate conference in Peru. Colleagues from the local weather service were doing interesting work, but many did not speak much English. An important way they kept up to date was through the guidance reports written by the World Meteorological Organization (WMO), one of the oldest open science organizations. The WMO translates all its guidance reports into many languages because the weather services who control the WMO see this as a priority. A colleague at the conference told me that she sometimes translates important English articles into Spanish and emails them to her colleagues, just as Albert Einstein translated important studies into English. That made me wonder whether we could spread translations in a better way and thus also stimulate their production.

Lingua Franca

Translations are part of the open science movement. Translated scientific articles make science more accessible to regular people, science enthusiasts, activists, advisors, trainers, consultants, architects, doctors, journalists, planners, administrators, technicians and many scientists.

English as a common language has mostly made global communication within science easier. However, this has made communication with non-English communities harder. This goes both ways: it affects people who could benefit from scientific knowledge and people who have knowledge scientists should know.

For English-speakers it is easy to overestimate how many people speak English because we mostly deal with foreigners who do speak English. However, it is thought that only about one billion people speak English. That means that seven billion people do not.

Translated scientific articles speed up scientific progress by tapping into more knowledge and avoiding double work. They thus improve the quality of science. The additional two-way knowledge transfer aids innovation and tackling the big global challenges in the fields of climate change, agriculture and health. Translations can improve public disclosure, scientific engagement and science literacy.

*Phone screenshot of the Translate Science Switchboard.*

Open Source tool

We want to develop and deploy an open source tool to make it easier to find translations and thus make them more worthwhile to make. In its simplest form, people should be able to search using a Digital Object Identifier, a title or the names of the authors and be presented with a list of links to translations. People or organizations who made or have translations should be able to upload lists of links. Users and volunteers should be given moderation tools.
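The simplest form of such a lookup can be sketched with an in-memory index. This is only an illustration under stated assumptions, not the design of the actual tool: the class and function names (`Translation`, `Article`, `TranslationIndex`, `normalize_doi`), the example DOI and the example URL are all hypothetical, and a real system would sit on top of a database.

```python
from dataclasses import dataclass, field


@dataclass
class Translation:
    language: str  # e.g. a language code such as "es"
    url: str


@dataclass
class Article:
    doi: str
    title: str
    authors: list
    translations: list = field(default_factory=list)


def normalize_doi(raw):
    """Reduce common ways of writing a DOI to a bare, lowercase identifier."""
    doi = raw.strip().lower()
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if doi.startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    return doi


class TranslationIndex:
    """Hypothetical in-memory stand-in for the translations database."""

    def __init__(self):
        self._by_doi = {}

    def add(self, article):
        self._by_doi[normalize_doi(article.doi)] = article

    def find_by_doi(self, raw_doi):
        article = self._by_doi.get(normalize_doi(raw_doi))
        return article.translations if article else []

    def find_by_title(self, query):
        needle = query.lower()
        return [a for a in self._by_doi.values() if needle in a.title.lower()]


index = TranslationIndex()
index.add(
    Article(
        doi="10.1234/example.5678",  # made-up DOI
        title="An example article",
        authors=["A. Author"],
        translations=[Translation("es", "https://example.org/es/example-5678")],
    )
)

hits = index.find_by_doi("https://doi.org/10.1234/EXAMPLE.5678")
print([(t.language, t.url) for t in hits])
```

Normalizing the identifier before lookup means that a pasted link, a `doi:` prefix and a bare DOI all find the same record.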

Searching by topic and a topic directory would also be useful, as translated articles tend to be the more important ones in a field. The database should also be accessible via an Application Programming Interface (API) so that other tools and webpages can automatically display information on any translations and inform us about new translations.

There were similar databases during the Cold War to keep up with Soviet research, and we want to try to rescue their datasets and upload them to our database. Many research libraries, international organizations and research institutes (World Meteorological Organization, UK Met Office, …) have translated articles and reports, which should be included, as should translations of articles written before English was the Lingua Franca.

The expensive organizations maintaining these databases and making translations collapsed after the Cold War. In the internet age, we can maintain large knowledge bases cost effectively with global volunteers, as Wikipedia has demonstrated, and include many more languages. Translating has also become much easier, as a reasonable first draft can often be provided by machine learning. And we can now network people who only occasionally make translations (of their own articles).

Not every contribution will be perfect. Users and editors of such a database should be able to indicate how good a translation is and need moderation tools. With versioning it should be easy to revert vandalism or spamming. We could green-list known scientific repositories and red-list known spammers.
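A green list and red list could work as a first filter on submitted links, as in the sketch below. The domain names are placeholders chosen for illustration (which repositories would actually be trusted is a decision for the editors), and `classify_link` is a hypothetical helper.

```python
from urllib.parse import urlparse

# Assumed example lists; the real lists would be maintained by editors.
GREEN_LIST = {"zenodo.org", "osf.io"}
RED_LIST = {"spam.example"}


def classify_link(url):
    """Return "accept", "reject" or "review" for a submitted link."""
    host = (urlparse(url).hostname or "").lower()

    def listed(domains):
        # Match the domain itself and any of its subdomains.
        return any(host == d or host.endswith("." + d) for d in domains)

    if listed(RED_LIST):
        return "reject"
    if listed(GREEN_LIST):
        return "accept"
    return "review"  # unknown hosts go to a human editor


for link in (
    "https://zenodo.org/record/6653406",
    "https://downloads.spam.example/paper.pdf",
    "https://unknown.example/paper.pdf",
):
    print(link, classify_link(link))
```

Anything that is on neither list is left for human review, so the lists reduce the moderation workload without replacing it.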

If there are multiple translations for a language, editors or users should be able to rank them and indicate which one is best, if only because external systems using our information may be designed to accept only one translation per language, as that will be the most typical case.
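One possible way to reduce several rated translations to a single one per language is sketched below. The record layout (dictionaries with `language`, `url` and a list of numeric `ratings`) and the use of the mean rating are assumptions for illustration; editors might just as well mark the best version by hand.

```python
def best_per_language(translations):
    """Pick, for each language, the translation with the highest mean rating.

    Unrated translations count as 0.0; on a tie the first one seen is kept.
    """
    best = {}
    for item in translations:
        ratings = item["ratings"]
        score = sum(ratings) / len(ratings) if ratings else 0.0
        current = best.get(item["language"])
        if current is None or score > current[0]:
            best[item["language"]] = (score, item["url"])
    return {language: url for language, (score, url) in best.items()}


# Hypothetical records with made-up URLs and ratings.
candidates = [
    {"language": "es", "url": "https://example.org/es/v1", "ratings": [3, 4]},
    {"language": "es", "url": "https://example.org/es/v2", "ratings": [5, 4]},
    {"language": "de", "url": "https://example.org/de/v1", "ratings": []},
]

chosen = best_per_language(candidates)
print(chosen)
```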

A “talk page”, similar to Wikipedia’s, could be useful to allow users to point to problems and to discuss which translations are best and which quality flags need to be set, possibly even to organize jointly making a better translation. This could be implemented with a commenting or forum system in a background tab.

Copying Wikipedia’s idea of a page with recent changes can help with quality control. Such a page can be filtered in several ways, e.g., for contributions by new people. If someone finds that a participant made a problematic contribution, a look at that participant’s user page may reveal more problems.

Many more technical details of how such a system could look can be found on our Wiki.

*Mockup of the search page showing all articles on the search term for which the database has translations.*

Points to ponder

How hard would it be to make the system distributed, to have multiple servers that talk to each other and exchange data if they trust each other? We are doing this for science, but there are groups outside of science who could use a similar system; the nearest to us would be education and science communication. (Disciplinary) groups within science may be able to use their networks to promote the production of translations. That would make bulk download of our data a good idea to get a new server started.

It could be worthwhile to make a (private) backup of the known translations and regularly check for broken links. The backup can help the editors find the new location of the translation or to upload it elsewhere if the license allows for this.
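A regular check for broken links could look roughly like the following sketch. It uses a HEAD request from Python's standard library; the function names `http_status` and `find_broken_links` are hypothetical, and the example run below replaces the network call with a stub so that it works offline. Note that some servers do not answer HEAD requests properly, so a real checker may need to fall back to GET.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code of a HEAD request, or None if it fails."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError, ValueError):
        return None  # unreachable host, timeout or malformed URL


def find_broken_links(urls, get_status=http_status):
    """Return (url, status) pairs for links that look broken."""
    broken = []
    for url in urls:
        status = get_status(url)
        if status is None or status >= 400:
            broken.append((url, status))
    return broken


# Offline demonstration with a stub instead of real requests.
fake_statuses = {"https://example.org/a": 200, "https://example.org/b": 404}
report = find_broken_links(
    ["https://example.org/a", "https://example.org/b", "https://example.org/c"],
    get_status=fake_statuses.get,
)
print(report)
```

The resulting list could be shown to editors together with the private backup copy, so that they can look for the new location of a translation.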

It may be a good idea to have multiple types of links to translations: literal translations, but also related works in another language, for example a PhD thesis in language X and a corresponding article in language Y. Sometimes people may write a summary of an article in another language, which could be valuable if there is no full translation. Links to partial translations can also be valuable, and showing them could promote their completion.

The road ahead

The above mainly describes the technical aspects of such a Translations Switchboard, but there is also a human aspect. We will need a community of editors for every language to check submitted URLs to avoid spam and select the best version in case multiple ones are available. So we need tools to build and organize this community. We will also need publicity so that people know about the service. Part of the advertising could function via integration of our system in others. We will need volunteers who contact possible sources of translations which could be integrated into the database and who promote the production of translations in their circles.

Designing and coding the full system described above would be a considerable task. If someone has experience with similar projects and would like to apply for funding: feel free to make the idea your own; we are also happy to be the science advisory team. For now we decided to start small: create a minimal system and add the data we know of to it. That way the idea becomes more concrete, which will hopefully help to find resources to build it and to fill the database. This first version will be coded using PHP, HTML, CSS and MariaDB.

You can already help us a lot by spreading this idea to increase the chance that people interested in contributing learn about it. Feedback on the idea in the comments below is also very valuable. If any of the above appeals to you, please get in touch on Mastodon or by email.

Translate Science Blog

Author Archives: Venema Victor