How WeChat’s Lakehouse Design Efficiently Handles Trillions of Records

Publish date: Jan 29, 2024 4:02:36 PM

About WeChat

WeChat is the world's largest standalone mobile app, serving over 1.3 billion monthly active users as a platform for instant messaging, social media, and mobile payments. To support its unprecedented and rapidly expanding user base, WeChat's technological backend has had to evolve quickly as well, transitioning from a Hadoop + data warehouse architecture to a modern open data lakehouse architecture.

Challenges

With over a billion users, it comes as no surprise that WeChat manages extremely large data volumes. In some cases, single tables are growing by trillions of records daily and queries regularly scan over 1 billion records.

WeChat's business scenarios demand rapid end-to-end response times, with a query latency P90 target of under 5 seconds, and data freshness requirements that vary from seconds to minutes. This complexity is elevated by the need to process often more than 50 dimensions and 100 metrics at a time.

WeChat's legacy data architecture involved a Hadoop-based data lake system along with a variety of data warehouses. This resulted in significant operational overhead and data governance challenges including:

Juggling multiple systems from separated real-time and batch analytics pipelines
Maintaining data ingestion pipelines for data warehouses
Governance challenges from managing multiple copies of the same data
Managing incompatible APIs of different systems
Challenges in standardizing data analysis processes

To address these issues, WeChat pursued a unified approach to their analytics which necessitated a redesign of their data architecture.

Solution

WeChat's revamped Data Lakehouse architecture now features StarRocks as the low latency query engine with Apache Spark as its batch processing engine. Data is stored as Parquet files on Tencent Cloud's COS (cloud object storage) with Apache Iceberg as the data lake table format.

WeChat Architecture

WeChat's new StarRocks-based architecture

This new architecture supports both real-time and near-real-time data ingestion:

For real-time ingestion: data is first ingested into a warehouse in real-time, then cold data is sunk into the data lake and queries can automatically union cold and hot data.
For near-real-time ingestion: raw data is directly ingested into Apache Iceberg, and then cleaned and transformed using StarRocks' materialized view.

Result

WeChat's StarRocks-based data lakehouse solution is now in production across multiple business scenarios within the company including livestreaming, WeChat Keyboard, WeChat Reading, and Public Accounts.

By unifying all workloads with one system, WeChat has experienced significant operational benefits from this improved efficiency. Their live streaming business is one example: the new lakehouse architecture halved the number of tasks data engineers are required to manage, reduced storage costs by over 65%, and shortened the development cycle of offline tasks by two hours.

Not only is their architecture now simplified, data freshness and query latency improved as well, with batch ingestion being eliminated and their near-real-time data pipeline and query latencies being brought down to the mostly sub-second level.

	Near-Real-Time Pipeline	Real-Time Ingestion Pipeline
Data Freshness	5 mins - 10 mins	Seconds - 2 mins
Query Latency: Detailed Data	Seconds	Sub-second
Query Latency: Pre-Processed Data	Sub-second	Sub-second

Table 1: WeChat's data lakehouse performance numbers

What's Next For WeChat

Looking ahead, WeChat's goal is to continually explore and refine their existing data lakehouse architecture to further integrate it across more critical operations including:

Enabling users to interact with the system using SQL without needing to understand the underlying architecture.
Unifying data access, querying, and the storage system.
Unifying SQL interaction standards across the platform.

Download a PDF of This Use Case