Table of Contents
System Architecture
Speaker Switch Detection
Extract voiceprint features
Multi-stage clustering
Real-time correction and user annotation
Future Work
Google Recorder adds automatic speaker annotation, once again widening its feature gap over iOS Voice Memos

Apr 10, 2023 pm 07:31 PM

In 2019, Google launched Recorder, a recording app for Android on its Pixel phones. Comparable to Voice Memos on iOS, it supports recording, managing, and editing audio files. Since then, Google has added a number of machine-learning features to Recorder, including speech recognition, audio event detection, automatic title generation, and smart browsing.

However, when a recording is long and contains multiple speakers, some Recorder users find it inconvenient, because the text obtained through speech recognition alone does not show who said each sentence. At this year's Made By Google conference, Google announced an automatic speaker annotation feature for the Recorder app. The feature adds anonymous speaker tags (such as "Speaker 1" or "Speaker 2") to the speech-recognized text in real time, which greatly improves the readability and usefulness of the transcript. The technology behind it is called speaker diarization, referred to below as voiceprint segmentation and clustering. Google first presented its system, named Turn-to-Diarize, at the ICASSP 2022 conference.

Left: the recording transcript with speaker annotation turned off. Right: the transcript with speaker annotation turned on.

System Architecture

Google's Turn-to-Diarize system combines several highly optimized models and algorithms so that real-time voiceprint segmentation and clustering of hours-long audio can be completed on a mobile device with very few computing resources. The system consists of three main components: a speaker switch detection model that detects changes of speaker, a voiceprint encoder model that extracts the voice characteristics of each speaker, and a multi-stage clustering algorithm that completes the speaker annotation efficiently. All components run entirely on the user's device and do not rely on any server connection.

Architecture diagram of the Turn-to-Diarize system.

Speaker Switch Detection

The first component of the system is a speaker switch detection model based on the Transformer Transducer (T-T). This model converts the acoustic feature sequence into a text sequence that contains the special token <st>, which marks a speaker switching event. Previous papers published by Google used special tokens that represented the identity of a specific speaker. Because the <st> token in the latest system is not tied to specific identities, it can be applied more widely.

In most applications, the output of the voiceprint segmentation and clustering system is not presented to the user directly but is combined with the output of the speech recognition model. Because the speech recognition model is already optimized for word error rate during training, the speaker switch detection model can tolerate word errors and instead concentrates on the accuracy of the special token <st>. On this basis, Google proposed a new token-level loss function, which makes it possible to detect speaker switching events accurately with only a small model.
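
As a rough illustration of how such an output could be consumed downstream, the sketch below splits a token sequence into speaker turns at each turn token. It assumes the token is spelled `<st>`; the function name `split_on_turns` and the sample sentence are hypothetical and are not taken from Google's implementation.

```python
TURN_TOKEN = "<st>"  # assumed spelling of the speaker-turn token


def split_on_turns(tokens):
    """Group a token sequence into speaker turns, splitting at TURN_TOKEN."""
    turns, current = [], []
    for tok in tokens:
        if tok == TURN_TOKEN:
            # A turn token closes the current turn (empty turns are skipped).
            if current:
                turns.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        turns.append(current)
    return turns


# Hypothetical model output: words interleaved with turn tokens.
tokens = "hello how are you <st> fine thanks <st> good to hear".split()
turns = split_on_turns(tokens)
for index, turn in enumerate(turns):
    print(index, " ".join(turn))
```

Note that the turns carry no speaker identity at this point. They only mark where one speaker stops and another starts; assigning identities is left to the later clustering step.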

Extract voiceprint features

After the audio signal has been segmented at the speaker switching events, the system uses the voiceprint encoder model to extract a voiceprint embedding, known as a d-vector, from each speaker segment. In previous papers published by Google, voiceprint embeddings were generally extracted from fixed-length audio. The new approach brings several improvements. First, it avoids extracting voiceprint embeddings from segments that contain more than one speaker, which improves the overall quality of the embeddings. Second, the speech segment behind each embedding is relatively long, so it contains more voiceprint information about the speaker. Finally, the resulting sequence of voiceprint embeddings is shorter, which makes the subsequent clustering algorithm less computationally expensive.
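
The effect on sequence length can be shown with simple arithmetic. The sketch below compares the number of embeddings produced by fixed-length windows with the number produced by one embedding per speaker turn. The 1.5-second window, the 60-second duration, and the turn times are all made-up values chosen for illustration, and the helper names are hypothetical.

```python
import math


def fixed_window_count(duration_s, window_s):
    """Number of embeddings when one is extracted per fixed-length window."""
    return math.ceil(duration_s / window_s)


def turn_segments(duration_s, turn_times_s):
    """Split [0, duration_s] into segments at the detected speaker-turn times."""
    bounds = [0.0] + sorted(turn_times_s) + [duration_s]
    return [(start, end) for start, end in zip(bounds, bounds[1:]) if end > start]


duration = 60.0                                          # hypothetical recording length
fixed = fixed_window_count(duration, window_s=1.5)       # hypothetical window size
segments = turn_segments(duration, [12.5, 31.0, 47.2])   # hypothetical turn times

print("fixed-window embeddings:", fixed)
print("turn-based embeddings:", len(segments))
# A pairwise similarity matrix grows with the square of the sequence length.
print("pairwise entries:", fixed ** 2, "vs", len(segments) ** 2)
```

Because clustering typically works on pairwise similarities between embeddings, shortening the sequence reduces the cost of that step much faster than linearly.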

Multi-stage clustering

The last step of voiceprint segmentation and clustering is to cluster the voiceprint embedding code sequences obtained in the previous steps. Since the recordings users generate using the Recorder app can range from just a few seconds to as long as 18 hours, a key challenge for clustering algorithms is being able to handle voiceprint embedding sequences of varying lengths.

To this end, Google's multi-stage clustering strategy combines the advantages of several different clustering algorithms. For short sequences, it uses agglomerative hierarchical clustering (AHC). For sequences of medium length, it uses spectral clustering and applies the maximum eigengap method to the eigenvalues to estimate the number of speakers accurately. For long sequences, it first uses agglomerative hierarchical clustering to pre-cluster the sequence and then calls spectral clustering, which reduces the computational cost of the clustering step. Throughout streaming processing, previous clustering results are dynamically cached and reused, so the time and space complexity of each clustering call can be bounded by a constant.
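
To make the first stage concrete, here is a minimal sketch of agglomerative hierarchical clustering over a handful of embeddings. It is not Google's implementation: the average-linkage rule, the cosine similarity measure, the `stop_similarity` threshold of 0.5, and the toy two-dimensional embeddings are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ahc(embeddings, stop_similarity=0.5):
    """Average-linkage agglomerative clustering; returns one label per embedding."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Average pairwise similarity between the members of two clusters.
        sims = [cosine(embeddings[i], embeddings[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best, pair = -2.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = linkage(clusters[a], clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        if best < stop_similarity:
            break  # the closest clusters are too dissimilar to merge
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy embeddings: items 0, 1, 4 point one way, items 2, 3 another.
embeddings = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9), (1.0, 0.0)]
labels = ahc(embeddings)
print(labels)
```

This naive version recomputes every linkage on each pass, which is acceptable only for short sequences. That cost is consistent with the article's point that longer sequences call for a different stage.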

The multi-stage clustering strategy is a key optimization for on-device applications, because resources such as CPU, memory, and battery are usually scarce on the device. With this strategy the system can keep running in a low-power state even after processing several hours of audio. The constant upper bound on complexity can usually be adjusted for the specific device model to balance accuracy against performance.

Schematic diagram of multi-stage clustering strategy.
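
The stage selection described in this section can be pictured as a simple dispatch on sequence length. The sketch below is an assumption-laden illustration: the `ClusteringLimits` type, the `choose_stage` function, the stage names, and the threshold values 50 and 300 are hypothetical, standing in for bounds that would be tuned per device model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusteringLimits:
    """Hypothetical per-device bounds on the embedding sequence length."""
    ahc_max: int       # sequences up to this length use AHC only
    spectral_max: int  # sequences up to this length use spectral clustering directly


def choose_stage(num_embeddings, limits):
    """Pick a clustering stage from the length of the embedding sequence."""
    if num_embeddings <= limits.ahc_max:
        return "ahc"
    if num_embeddings <= limits.spectral_max:
        return "spectral"
    # Long sequences are first compressed with AHC, then passed to spectral clustering.
    return "ahc_precluster_then_spectral"


limits = ClusteringLimits(ahc_max=50, spectral_max=300)  # made-up values
for n in (10, 120, 5000):
    print(n, choose_stage(n, limits))
```

Keeping the bounds in one small configuration object makes it straightforward to trade accuracy for performance on a less capable device by lowering the limits.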

Real-time correction and user annotation

Because Turn-to-Diarize is a real-time streaming system, its predicted speaker labels become more accurate as the model processes more audio. For this reason, the Recorder app continuously corrects previously predicted speaker labels while the user is recording, so that the labels shown on screen always reflect the most accurate prediction available.

The Recorder user interface also lets users rename the speaker tags in each recording, for example renaming "Speaker 2" to "Car Dealership", which makes the transcript easier to read and remember.

Recorder allows users to rename speaker tags to improve readability.
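
One way to picture label correction and renaming together is the sketch below. It is only an illustration under stated assumptions: the function `display_labels`, the convention of numbering speakers in order of first appearance, and the sample cluster ids are hypothetical and do not describe how Recorder stores its labels.

```python
def display_labels(cluster_ids, custom_names=None):
    """Map raw cluster ids to 'Speaker N' in order of first appearance,
    then apply any user-chosen names."""
    custom_names = custom_names or {}
    order = {}
    labels = []
    for cid in cluster_ids:
        if cid not in order:
            order[cid] = len(order) + 1
        default = f"Speaker {order[cid]}"
        labels.append(custom_names.get(default, default))
    return labels


# Early in the recording, all three segments look like one speaker.
early = display_labels([0, 0, 0])
# After more audio, re-clustering reassigns the third segment to a second speaker.
later = display_labels([1, 1, 0, 0])
# The user renames the second speaker tag.
renamed = display_labels([1, 1, 0, 0], {"Speaker 2": "Car Dealership"})

print(early)
print(later)
print(renamed)
```

Recomputing the displayed labels from the latest clustering result on every update means a correction to an earlier segment shows up without any special handling, and the user's chosen names are applied last.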

Future Work

Google's latest Pixel phones ship with its self-developed Google Tensor chip. The current voiceprint segmentation and clustering system runs mainly on the CPU block of Google Tensor. In the future, Google plans to run the system on the TPU block of Google Tensor to further reduce energy consumption. Google also hopes to extend the feature to languages other than English with the help of multilingual voiceprint encoders and speech recognition models.
