

Breaking down data silos using a unified data warehouse: CDP based on Apache Doris
As enterprise data sources grow more diverse, data silos have become a common problem. When the insurance company in this article built its customer data platform (CDP), data silos left it with a component-heavy computing layer and scattered data storage. To solve these problems, the company built CDP 2.0 on Apache Doris, using Doris' unified data warehouse capabilities to break down data silos, simplify data processing pipelines, and improve data processing efficiency.
The data silo problem is like arthritis for online businesses: almost everyone develops it as they age. Businesses interact with customers through websites, mobile apps, HTML5 pages, and end devices. For various reasons, integrating data from all these sources is tricky. The data stays where it was collected and cannot be correlated across sources for further analysis. This is how data silos form. The larger your business becomes, the more diverse your sources of customer data are, and the more likely you are to become trapped in data silos.
That’s exactly what happened to the insurance company I’m going to discuss in this article. By 2023, they had served more than 500 million customers and signed 57 billion insurance contracts. When they began building their customer data platform (CDP) to handle data at this scale, they used multiple components.
Data silos in CDP
Like most data platforms, their CDP 1.0 had both batch pipelines and real-time streaming pipelines. Offline data was loaded into Impala via a Spark job, where it was tagged and divided into groups. At the same time, Spark also sent it to NebulaGraph for OneID calculation (more on this later in this article). Real-time data, on the other hand, was tagged by Flink and then stored in HBase for querying.
This resulted in a component-heavy computing layer in the CDP: Impala, Spark, NebulaGraph, and HBase.
As a result, offline tags, real-time tags, and graph data were scattered across multiple components. Integrating them to provide further data services was costly due to redundant storage and large data transfers. More importantly, because of the storage differences, they had to scale out both the CDH cluster and the NebulaGraph cluster, which increased resource and maintenance costs.
CDP based on Apache Doris
For CDP 2.0, they decided to introduce a unified solution to clean up the mess. In the computing layer of CDP 2.0, Apache Doris is responsible for real-time and offline data storage and calculation.
To ingest offline data, they use the Stream Load method. Their 30-thread ingestion test showed that it can perform over 300,000 upserts per second. To load real-time data, they use a combination of Flink-Doris-Connector and Stream Load. Additionally, for real-time reporting that needs to pull data from multiple external data sources, they use the Multi-Catalog feature for federated queries.
The customer analysis workflow on this CDP is as follows. First, they organize customer information and then label each customer. They group customers according to tags for more targeted analysis and actions.
Next, I'll dig into these workloads and show you how Apache Doris accelerates them.
One ID
Have you ever run into this situation, where your products and services have different user registration systems? You might collect UserID A's email from one product page, and then collect UserID B's Social Security number from another product page. Later you discover that UserID A and UserID B actually belong to the same person because they use the same phone number.
This is why the idea of OneID emerged. The idea is to pool the user registration information from all business lines into one large table in Apache Doris, sort it out, and ensure that each user has a unique OneID.
They rely on functionality in Apache Doris to determine which registrations belong to the same user.
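The article does not show the matching logic itself. As a rough illustration of the idea only, here is a minimal Python sketch (not Doris code, and not necessarily the company's method) that links registrations sharing any identifier value, using union-find. The field names (`user_id`, `phone`, `email`, `ssn`), the function name, and the sample records are all hypothetical.

```python
def assign_one_ids(registrations, keys=("phone", "email", "ssn")):
    """Map each registration's user_id to a OneID shared by all linked registrations."""
    parent = {r["user_id"]: r["user_id"] for r in registrations}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller id as the root so the result is deterministic.
            parent[max(ra, rb)] = min(ra, rb)

    # Registrations that share any identifier value are treated as one person.
    first_owner = {}
    for r in registrations:
        for k in keys:
            value = r.get(k)
            if value is None:
                continue
            if (k, value) in first_owner:
                union(first_owner[(k, value)], r["user_id"])
            else:
                first_owner[(k, value)] = r["user_id"]

    return {uid: find(uid) for uid in parent}


# Hypothetical sample: A and B share a phone number, C is unrelated.
registrations = [
    {"user_id": "A", "phone": "555-0100", "email": "a@example.com"},
    {"user_id": "B", "phone": "555-0100", "ssn": "000-00-0001"},
    {"user_id": "C", "phone": "555-0199", "email": "c@example.com"},
]
one_ids = assign_one_ids(registrations)
```

In this sample, UserID A and UserID B end up with the same OneID because of the shared phone number, while UserID C keeps its own.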
Tag Service
This CDP holds information on 500 million customers, drawn from more than 500 source tables, with more than 2,000 tags attached in total.
By timeliness, tags can be divided into real-time tags and offline tags. Real-time tags are computed by Apache Flink and written to flat tables in Apache Doris, while offline tags are computed by Apache Doris itself, since they are derived from user attribute tables, business tables, and user behavior tables in Doris. The following are the company’s best practices in data tagging:
1. Offline tags
During peak write periods, full updates are difficult at this data scale and can easily cause OOM errors. To avoid this, they use Apache Doris' INSERT INTO SELECT and enable partial column updates. This significantly reduces memory consumption and keeps the system stable during data loading.
set enable_unique_key_partial_update=true;
insert into tb_label_result(one_id, labelxx)
select one_id, label_value as labelxx from .....
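To make the semantics of a partial column update concrete, here is a small plain-Python model. It is only a sketch of the behavior as described above, not Doris internals; the table contents, the `label_a`/`label_b` columns, and the `partial_update` helper are hypothetical.

```python
def partial_update(table, key, rows, columns):
    """Upsert `rows` into `table`, writing only the listed `columns`."""
    for row in rows:
        # A new key gets a fresh row; columns not listed stay absent in this toy model.
        existing = table.setdefault(row[key], {key: row[key]})
        for col in columns:
            existing[col] = row[col]
    return table


# Hypothetical wide tag table keyed by one_id.
tb_label_result = {
    1: {"one_id": 1, "label_a": "gold", "label_b": "urban"},
    2: {"one_id": 2, "label_a": "silver", "label_b": "rural"},
}

# Only label_b was recomputed, so only label_b is written.
new_values = [
    {"one_id": 1, "label_b": "suburban"},
    {"one_id": 3, "label_b": "urban"},
]
partial_update(tb_label_result, "one_id", new_values, ["label_b"])
```

The point of the model is that the job only has to carry the key and the changed column, instead of rewriting every column of every row.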
2. Real-time tags
Partial column updates can also be used for real-time tags, because different real-time tags are updated at different speeds. All that is required is to set partial_columns to true.
curl --location-trusted -u root: -H "partial_columns:true" -H "column_separator:," -H "columns:id,balance,last_access_time" -T /tmp/test.csv http://127.0.0.1:48037/api/db1/user_profile/_stream_load
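If the load is driven from a script, the same request parameters can be assembled programmatically. The sketch below only builds the URL and headers that mirror the curl command above; it does not send anything, and authentication, redirects, and the upload itself are left to whatever HTTP client is used. The helper name `stream_load_params` is hypothetical.

```python
def stream_load_params(host, port, db, table, columns, partial=True, separator=","):
    """Return the (url, headers) pair mirroring the Stream Load curl command."""
    url = f"http://{host}:{port}/api/{db}/{table}/_stream_load"
    headers = {
        # Only the listed columns are written when partial_columns is true.
        "partial_columns": "true" if partial else "false",
        "column_separator": separator,
        "columns": ",".join(columns),
    }
    return url, headers


url, headers = stream_load_params(
    "127.0.0.1", 48037, "db1", "user_profile",
    ["id", "balance", "last_access_time"],
)
```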
3. High concurrency point query
At the current business scale, the company receives tag query requests at a concurrency of over 5,000 QPS. They use a combination of strategies to ensure high performance. First, they use prepared statements so that SQL is pre-compiled and pre-executed. Second, they fine-tune the parameters of the Doris backend and tables to optimize storage and execution. Finally, they enable the row cache as a complement to the column-oriented storage of Apache Doris.
Fine-tune Doris backend parameters in be.conf:
disable_storage_row_cache = false
storage_page_cache_limit=40%
Fine-tune table parameters when creating a table:
enable_unique_key_merge_on_write = true
store_row_column = true
light_schema_change = true
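Why would row storage help point queries on a column-oriented engine? The toy Python model below counts accesses for fetching one full row from a column-wise layout versus a row-wise layout. It is an illustration of the general trade-off only, not of how Doris implements its row store or cache; the class names and sample tags are hypothetical.

```python
class ColumnStore:
    """Each column lives in its own structure, so one row costs one access per column."""

    def __init__(self, rows, columns):
        self.columns = {c: {r["one_id"]: r[c] for r in rows} for c in columns}
        self.reads = 0

    def get(self, one_id):
        row = {}
        for name, values in self.columns.items():
            self.reads += 1  # one access per column
            row[name] = values[one_id]
        return row


class RowStore:
    """The whole row is stored together, so one row costs a single access."""

    def __init__(self, rows):
        self.rows = {r["one_id"]: dict(r) for r in rows}
        self.reads = 0

    def get(self, one_id):
        self.reads += 1  # one access for the entire row
        return self.rows[one_id]


# Hypothetical tag rows.
rows = [
    {"one_id": 1, "tag_a": "gold", "tag_b": "urban", "tag_c": "app"},
    {"one_id": 2, "tag_a": "silver", "tag_b": "rural", "tag_c": "web"},
]
column_store = ColumnStore(rows, ["one_id", "tag_a", "tag_b", "tag_c"])
row_store = RowStore(rows)

col_result = column_store.get(1)
row_result = row_store.get(1)
```

Both lookups return the same row, but the column-wise layout touches every column to assemble it. With thousands of tags per customer, that difference is what makes a row-oriented copy attractive for point queries.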
4. Tag calculation (Join)
In practice, many tag services are implemented through multi-table joins in the database, often involving more than 10 tables. To get the best computing performance, they adopted the Colocation Group strategy in Doris.
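The general idea behind colocated joins is that tables bucketed the same way on the join key can be joined bucket by bucket, with no rows moving between buckets. The Python sketch below models that idea under simplifying assumptions: integer keys, a plain modulo in place of a real hash function, and hypothetical table contents and helper names. It is not a description of Doris internals.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # both tables must use the same bucket count and bucketing rule


def bucketize(rows, key):
    """Distribute rows into buckets by key (modulo stands in for a hash)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[key] % NUM_BUCKETS].append(row)
    return buckets


def colocated_join(left, right, key):
    """Join bucket i of `left` only with bucket i of `right`."""
    left_b, right_b = bucketize(left, key), bucketize(right, key)
    joined = []
    for b in range(NUM_BUCKETS):
        # Build a small hash index over the right-side rows of this bucket.
        index = defaultdict(list)
        for r in right_b.get(b, []):
            index[r[key]].append(r)
        for l in left_b.get(b, []):
            for r in index[l[key]]:
                joined.append({**l, **r})
    return joined


# Hypothetical user attribute table and business table.
attrs = [{"one_id": i, "age": 20 + i} for i in range(8)]
orders = [{"one_id": i, "policies": i % 3} for i in range(0, 8, 2)]
result = colocated_join(attrs, orders, "one_id")
```

Because matching keys always land in the same bucket, each bucket can be joined independently, which is what removes the data shuffle from a multi-table join.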
Customer Grouping
The customer grouping pipeline in CDP 2.0 works like this: Apache Doris receives SQL from the customer service, performs the calculation, and sends the result set to S3 object storage via SELECT INTO OUTFILE. The company has divided its customers into 1 million groups. A customer grouping task that used to take 50 seconds in Impala now takes only 10 seconds in Doris.
In addition to grouping customers for more fine-grained analysis, sometimes they also perform reverse analysis. That is, for a certain customer, find out which groups he/she belongs to. This helps analysts understand the characteristics of customers and how different customer groups overlap.
In Apache Doris, this is achieved through the bitmap functions: BITMAP_CONTAINS is a quick way to check whether a customer belongs to a certain group, while BITMAP_OR, BITMAP_INTERSECT, and BITMAP_XOR are the choices for cross analysis.
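To show what these bitmap operations compute, here is a minimal Python model that uses an integer as a bitmap, with bit i set when customer i is in the group. It illustrates the logic only and is not Doris code; the group names, customer ids, and helper functions are hypothetical.

```python
def to_bitmap(customer_ids):
    """Encode a collection of non-negative integer ids as bits of a Python int."""
    bitmap = 0
    for cid in customer_ids:
        bitmap |= 1 << cid
    return bitmap


def bitmap_contains(bitmap, cid):
    """Return True if customer `cid` is in the group."""
    return ((bitmap >> cid) & 1) == 1


def bitmap_to_ids(bitmap):
    """Decode a bitmap back into a sorted list of ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical customer groups.
groups = {
    "high_value": to_bitmap([1, 2, 5]),
    "recently_active": to_bitmap([2, 3, 5]),
    "churn_risk": to_bitmap([3, 4]),
}

# Reverse analysis: which groups does customer 5 belong to?
groups_of_5 = [name for name, bm in groups.items() if bitmap_contains(bm, 5)]

# Cross analysis between two groups: union, intersection, symmetric difference.
either = bitmap_to_ids(groups["high_value"] | groups["recently_active"])
both = bitmap_to_ids(groups["high_value"] & groups["recently_active"])
only_one = bitmap_to_ids(groups["high_value"] ^ groups["recently_active"])
```

Each check or combination is a handful of bitwise operations regardless of how the groups were originally computed, which is why bitmaps suit overlap analysis across a very large number of groups.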
Conclusion
From CDP 1.0 to CDP 2.0, the insurance company replaced Spark, Impala, HBase, and NebulaGraph with Apache Doris, a unified data warehouse. This improved data processing efficiency by breaking down data silos and simplifying data processing pipelines. In CDP 3.0, they hope to group customers by combining real-time tags and offline tags for more diverse and flexible analysis. The Apache Doris community and the VeloDB team will continue to support them during this upgrade.