Table of Contents
Paper Overview
Creating a 1,000-language web text dataset
Building machine translation models for long-tail languages
Assessment

Google has created a machine translation system for 1,000+ 'long tail' languages and already supports some niche languages.

Apr 08, 2023, 03:21 PM
The quality of academic and commercial machine translation (MT) systems has improved dramatically over the past decade, largely thanks to advances in machine learning and the availability of large-scale web-mined datasets. Deep learning (DL) and end-to-end (E2E) models, large-scale parallel and monolingual datasets obtained from web mining, data augmentation methods such as back-translation and self-training, and large-scale multilingual modeling have together made possible high-quality machine translation systems that support more than 100 languages.

However, despite huge progress in low-resource machine translation, widely available, general-purpose machine translation systems have been built for only about 100 languages, a small fraction of the more than 7,000 languages spoken in the world today. Beyond the limited number of languages, the distribution of languages supported by current machine translation systems is also heavily skewed toward European languages.

Despite their large speaker populations, languages spoken in Africa, South and Southeast Asia, and the indigenous languages of the Americas are served far less well. For example, Google Translate supports Frisian, Maltese, Icelandic, and Corsican, each of which has fewer than 1 million native speakers. By comparison, languages that Google Translate did not serve at the time include Bhojpuri, with about 51 million speakers, Oromo with about 24 million, Quechua with about 9 million, and Tigrinya with about 9 million (2022 figures). These are known as "long-tail" languages, and the lack of data for them calls for machine learning techniques that can generalize beyond languages with abundant training data.

Building machine translation systems for these long-tail languages is largely limited by the lack of available digitized datasets and of NLP tools such as language identification (LangID) models, both of which are ubiquitous for high-resource languages.

In the recent Google paper "Building Machine Translation Systems for the Next Thousand Languages", more than two dozen researchers present the results of their effort to build a practical machine translation system that supports more than 1,000 languages.

Paper address: https://arxiv.org/pdf/2205.03983.pdf

Specifically, the researchers describe their results in the following three research areas.

First, they create clean, web-mined datasets for 1,500 languages using semi-supervised pre-training for language identification and data-driven filtering techniques.

Second, they build machine translation models that actually work for underserved languages, using large-scale multilingual models trained with supervised parallel data for more than 100 high-resource languages together with monolingual datasets for 1,000 additional languages.

Third, they study the limitations of evaluation metrics for these languages, conduct a qualitative analysis of the output of the machine translation models, and highlight several common error patterns of such models.

The researchers hope this work will provide useful insights to practitioners building machine translation systems for currently under-researched languages, and that it will lead to research directions that address the weaknesses of large-scale multilingual models in data-sparse settings.

At its I/O conference on May 12, Google announced that its translation system had added 24 new languages, including some niche languages such as indigenous languages of the Americas. The additions include Bhojpuri, Oromo, Quechua, and Tigrinya.

Paper Overview

The paper is divided into four main chapters. Here we only briefly introduce the contents of each chapter.

Creating a 1,000-language web text dataset

This chapter details the methods the researchers used to crawl monolingual text data for 1,500 languages. These methods focus on recovering high-precision data (that is, a high proportion of clean, in-language text), so a large part of them are filtering methods.

In general, the methods used by the researchers include the following:

  • Remove languages with poor training data quality and poor LangID performance from the LangID model, then train a 1629-language CLD3 LangID model and a semi-supervised LangID (SSLID) model;
  • Cluster languages based on their error rates in the CLD3 model;
  • Use the CLD3 model to perform the first round of web crawling;
  • Filter sentences using document consistency;
  • Filter all corpora using percent-threshold wordlists;
  • Filter all corpora using semi-supervised LangID (SSLID);
  • Use relative recall to detect outlier languages and filter them using Term-Frequency-Inverse-Internet-Frequency (TF-IIF);
  • Use Token-Frequency Anomalousness scores to detect outlier languages and manually design filters for them;
  • Deduplicate all corpora at the sentence level.
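
Three of the steps above (document consistency filtering, percent-threshold wordlist filtering, and sentence-level deduplication) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names, the `lang_id` callable, the 0.2 threshold, and the definition of document consistency as the share of sentences whose predicted language matches the document-level label are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Set


def document_consistency(sentences: List[str], doc_lang: str,
                         lang_id: Callable[[str], str]) -> float:
    """Share of sentences whose predicted language matches the document label.

    Assumed definition, used here only for illustration.
    """
    if not sentences:
        return 0.0
    matches = sum(1 for s in sentences if lang_id(s) == doc_lang)
    return matches / len(sentences)


def passes_wordlist_filter(sentence: str, wordlist: Set[str],
                           min_fraction: float = 0.2) -> bool:
    """Keep a sentence only if enough of its tokens appear in the word list.

    The 0.2 threshold is a placeholder, not a value taken from the article.
    """
    tokens = sentence.lower().split()
    if not tokens:
        return False
    in_list = sum(1 for t in tokens if t in wordlist)
    return in_list / len(tokens) >= min_fraction


def deduplicate(sentences: Iterable[str]) -> List[str]:
    """Sentence-level deduplication that keeps the first occurrence."""
    seen: Set[str] = set()
    unique: List[str] = []
    for s in sentences:
        key = " ".join(s.split())  # normalize whitespace before comparing
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique
```

In a pipeline of this shape, `lang_id` would be a trained model such as the CLD3 LangID model described above; any callable that maps a sentence to a language code fits the sketch.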

The paper includes a histogram of document consistency scores on web text, computed with the 1745-language CLD3 LangID model.

Table 2 of the paper reports statistics on the monolingual data: the complete low-resource language (LRL) dataset, the portion of that data used to train the model, and the complete training set including high-resource languages.

Building machine translation models for long-tail languages

With monolingual data mined from the web in hand, the next challenge is to build a high-quality, general-purpose machine translation model from a limited amount of monolingual training data. To this end, the researchers took the pragmatic approach of leveraging all parallel data available for higher-resource languages to improve quality for long-tail languages where only monolingual data is available. They call this setup "zero-resource" because there is no direct supervision for the long-tail languages.

The researchers used several techniques developed for machine translation over the past few years to improve the quality of zero-resource translation for long-tail languages. These techniques include self-supervised learning from monolingual data, large-scale multilingual supervised learning, large-scale back-translation, and self-training of high-capacity models. With these tools they built a machine translation model capable of translating 1,000 languages, leveraging existing parallel corpora covering approximately 100 languages and a 1,000-language monolingual dataset built from the web.

Specifically, the researchers first demonstrate the importance of model capacity in highly multilingual models by comparing the zero-resource translation performance of 1.5-billion- and 6-billion-parameter Transformers (3.2). They then increase the number of self-supervised languages to 1,000, verifying that performance improves for most long-tail languages as more monolingual data from similar languages becomes available (3.3). Although the 1,000-language model showed reasonable performance, the researchers also incorporated large-scale data augmentation to understand the strengths and limitations of their approach.

In addition, the researchers fine-tuned the generative model on a subset of 30 languages using large amounts of synthetic data produced through self-training and back-translation (3.4). They further describe practical methods for filtering synthetic data that make these fine-tuned models more robust to hallucinations and wrong-language translations (3.5).
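
The two kinds of synthetic data mentioned here can be sketched as follows. Back-translation pairs real target-language text with machine-generated source text, while self-training pairs real source text with machine-generated target text. The model callables, the `lang_id` function, and the length-ratio heuristic used for filtering are hypothetical stand-ins, not the filters described in the paper.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(target_mono: List[str],
                   reverse_model: Callable[[str], str]) -> List[Pair]:
    """Pair each real target sentence with a synthetic source sentence."""
    return [(reverse_model(t), t) for t in target_mono]


def self_train(source_mono: List[str],
               forward_model: Callable[[str], str]) -> List[Pair]:
    """Pair each real source sentence with a synthetic target sentence."""
    return [(s, forward_model(s)) for s in source_mono]


def filter_synthetic(pairs: List[Pair], lang_id: Callable[[str], str],
                     src_lang: str, tgt_lang: str,
                     max_len_ratio: float = 2.0) -> List[Pair]:
    """Drop pairs that look like wrong-language or degenerate output.

    Both checks are illustrative heuristics chosen for this sketch.
    """
    kept: List[Pair] = []
    for src, tgt in pairs:
        if not src.strip() or not tgt.strip():
            continue  # empty output
        if lang_id(src) != src_lang or lang_id(tgt) != tgt_lang:
            continue  # one side is in the wrong language
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_len_ratio:
            continue  # crude proxy for hallucinated or truncated output
        kept.append((src, tgt))
    return kept
```

The filtered pairs would then be mixed into the fine-tuning data; how the mixture is weighted is outside the scope of this sketch.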

They also used sequence-level distillation to distill these models into smaller architectures that are easier to run inference on, and highlighted the performance gap between teacher and student models (3.6).

Assessment

To evaluate their machine translation model, the researchers first built an evaluation set for 38 selected long-tail languages by translating English sentences into those languages (4.1). They highlight the limitations of BLEU in long-tail settings and evaluate these languages using CHRF (4.2).
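
To make the contrast with word-level metrics concrete, below is a simplified character n-gram F-score in the spirit of CHRF. It is a sketch rather than a reference implementation: it ignores whitespace, averages precision and recall over n-gram orders 1 to 6, and uses beta = 2, and standard tools may differ in these details.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring whitespace (a simplifying assumption)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified character n-gram F-beta score between two strings."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Because it works on characters, a score like this gives partial credit when a hypothesis differs from the reference only in word endings (for example, `simple_chrf("walking", "walked")` is strictly between 0 and 1), which an exact word match would not.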

The researchers also propose an approximate reference-free metric based on round-trip translation to gauge model quality in languages for which no reference set is available, and report the quality of the model as measured by this metric (4.3). They performed human evaluation of the model on a subset of 28 languages and report the results, confirming that it is possible to build useful machine translation systems by following the approach described in the paper (4.4).
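
The idea of a round-trip metric can be sketched as below: translate a source sentence into the target language, translate the result back, and compare it with the original, so that no human reference in the target language is needed. The translator callables and the `difflib`-based similarity are assumptions for illustration; the paper's exact scoring function is not reproduced here.

```python
import difflib
from typing import Callable, List


def char_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1] based on matching character blocks."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def round_trip_score(
    sources: List[str],
    to_target: Callable[[str], str],
    to_source: Callable[[str], str],
    similarity: Callable[[str, str], float] = char_similarity,
) -> float:
    """Average similarity between each source and its round-trip translation."""
    if not sources:
        return 0.0
    total = 0.0
    for src in sources:
        back = to_source(to_target(src))  # source -> target -> source
        total += similarity(back, src)
    return total / len(sources)


# Toy "translators" so the sketch runs without a real model.
lossless = round_trip_score(["hello world"], str.upper, str.lower)
lossy = round_trip_score(["hello world"], lambda s: s[:5].upper(), str.lower)
```

With these toy functions the lossless pair recovers the input exactly, while the lossy pair drops part of the sentence and therefore scores lower.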

To understand the weaknesses of large-scale multilingual zero-resource models, the researchers conducted a qualitative error analysis on several languages. They found that the model often confuses words and concepts that are distributionally similar, for example translating "tiger" as "small crocodile" (4.5). In lower-resource settings, the model's ability to translate tokens degrades for tokens that appear less frequently (4.6).

The researchers also found that these models often cannot accurately translate short or single-word inputs (4.7). A study of the distilled models shows that all of them tend to amplify bias or noise present in the training data (4.8).

Additional experiments and notes

The researchers ran some additional experiments on the above models, showing that they generally perform better when translating directly between similar languages rather than using English as a pivot (5.1), and that they can perform zero-shot transliteration between different scripts (5.2).

They describe a practical technique, called the "period trick", of appending terminal punctuation to any input, which can be used to improve translation quality (5.3).
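
As a rough illustration, the period trick amounts to a small preprocessing step applied before the text is sent to the model. The function name, the set of terminal punctuation marks, and the choice to leave inputs that already end in punctuation unchanged are assumptions of this sketch, not details taken from the paper.

```python
# Illustrative, non-exhaustive set of sentence-final punctuation marks.
TERMINAL_PUNCTUATION = (".", "!", "?", "。", "।", "؟", "…")


def apply_period_trick(text: str, period: str = ".") -> str:
    """Append terminal punctuation to an input that does not already end with it."""
    stripped = text.rstrip()
    if not stripped or stripped.endswith(TERMINAL_PUNCTUATION):
        return stripped
    return stripped + period
```

A caller would pass `apply_period_trick(user_input)` to the translation model instead of the raw input; whether the added punctuation is removed from the output afterwards is a separate design decision.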

Additionally, they show that these models are robust to the use of non-standard Unicode glyphs in some but not all languages (5.4), and explore several non-Unicode fonts (5.5).

For more research details, please refer to the original paper.
