DiDi Chuxing Technology Co., Ltd., commonly known as DiDi, is a Chinese multinational ride-sharing company headquartered in Beijing. It is one of the world's largest ride-hailing companies, serving 550 million users across over 400 cities.

 

Architectural Challenges for Real-Time Risk Features

Financial risk control features, such as a user's "last credit application submission time," help institutions identify and mitigate risk. These features often require large-scale real-time processing, typically handled with Lambda and Kappa architectures: Kappa for recent (≤7 days) features, Lambda for longer-term historical data.

 
 
Lambda Architecture

(Lambda Architecture for Real-Time Features)

 

In Lambda, raw data flows through both real-time and batch pipelines, with results merged at query time—adding complexity due to duplicated logic and storage.

Kappa Architecture

(Kappa Architecture for Real-time Features)

 

Kappa simplifies this by removing the batch layer and using a single streaming pipeline, reducing development costs. However, its limited storage and slow reprocessing make it unsuitable for historical data.

In practice, real-time and offline features are often developed separately, compounding inefficiency and coordination overhead. To solve this, DiDi needed a unified, low-cost stream-batch architecture that improves development efficiency across both real-time and offline scenarios.

 

Evaluation

DiDi explored unified stream-batch processing via two main paths: unified computing and unified storage.

 

Unified Computing


With unified computing, the plan was for DiDi's platform to use Flink to unify stream and batch logic under a single SQL definition, enabling both periodic and real-time jobs. However, stream and batch would still run on separate engines: streaming on Flink with a custom windowing system, and batch on Spark. Given the high migration cost, this approach could not fully address the company's challenges.
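As an illustration of the idea, a single Flink SQL definition can run as either a streaming job or a batch backfill by switching only the runtime mode; the table and column names here are hypothetical, not DiDi's actual pipeline:

```sql
-- Flink SQL: one definition intended to serve both stream and batch execution.
SET 'execution.runtime-mode' = 'streaming';  -- or 'batch' for a historical backfill

-- Hypothetical feature: a user's last credit application submission time.
-- In streaming mode the aggregate emits updates, so the sink must support upserts.
INSERT INTO last_apply_time_feature
SELECT uid,
       MAX(apply_time) AS last_apply_time
FROM credit_apply_events
GROUP BY uid;
```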

 

Unified Storage

This model stores all data in a unified data lake, enabling flexible access via Hive, Spark, or Flink. While useful for offline workloads, it lacks true real-time performance. Writes are micro-batched, latency is too high, and concurrency is low—making it unsuitable for real-time risk control features requiring sub-second latency and high QPS.

 

To support real-time ingestion, updates, analytics, and high concurrency with low latency, DiDi needed a high-performance OLAP engine.

 

StarRocks vs. ClickHouse

After evaluating both StarRocks and ClickHouse, DiDi chose StarRocks.

StarRocks offers vectorized execution, MPP architecture, cost-based optimization, intelligent materialized views, and real-time updatable columnar storage. It supports both batch and streaming ingestion, queries over data lake formats, and integrates well with MySQL protocols and BI tools—making it a strong fit for DiDi's real-time risk control needs.

 

The table below summarizes DiDi's feature comparison for the two engines.

| Evaluation Criteria | StarRocks | ClickHouse |
| --- | --- | --- |
| Learning Cost | Supports the MySQL protocol | SQL syntax covers over 80% of common use cases |
| Join Capabilities | Good (flexible star schema model, adaptable to dimension changes) | Weaker (relies on wide tables to avoid join operations) |
| High-Concurrency QPS | Tens of thousands | Hundreds |
| Data Updates | Supported (simpler; multiple table models, with the primary key model fitting DiDi's business scenarios) | Supported (via multiple table engines) |
| SSB Benchmark | Faster across the 13 SSB queries | Response time 2.2x that of StarRocks, with worse multi-table performance |
| Data Ingestion | Second-level ingestion with real-time updates; reliable service capabilities | Slower ingestion and updates; better suited to static data analysis |
| Cluster Scalability | Online elastic scaling | Requires manual intervention; higher operational cost |
 

Validation

To validate StarRocks for real-time features, DiDi followed a minimal-risk trial-and-error approach, verifying feasibility across process, query performance, and data ingestion. They used their largest credit table, which carries both real-time and offline features and holds over 1 billion rows at 20 GB+ in size.

 

Data Import

  • Batch import (11 minutes)
  • Real-time import (1 s latency)
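As a sketch of what these two paths can look like in StarRocks (catalog, table, topic, and job names are assumptions, not DiDi's actual configuration), the batch backfill can be an INSERT from an external catalog, while the streaming path can be a Routine Load job consuming a Kafka topic:

```sql
-- Batch path (one option): backfill historical feature data from a Hive table
-- exposed through a StarRocks external catalog. All names are illustrative.
INSERT INTO credit_feature_mirror
SELECT uid, biz_id, last_apply_time, update_time
FROM hive_catalog.risk_dw.credit_feature_history;

-- Streaming path: continuously consume business-table change events from Kafka
-- and upsert them into the same table.
CREATE ROUTINE LOAD feature_db.credit_feature_stream ON credit_feature_mirror
COLUMNS (uid, biz_id, last_apply_time, update_time)
PROPERTIES
(
    "format" = "json",
    "desired_concurrent_number" = "3"
)
FROM KAFKA
(
    "kafka_broker_list" = "broker1:9092,broker2:9092",
    "kafka_topic" = "credit_feature_binlog",
    "property.kafka_default_offsets" = "OFFSET_END"
);
```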

Query Efficiency

StarRocks supports the MySQL protocol, so DiDi used standard MySQL clients for querying.
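For example, a prefix-index point lookup through a standard MySQL client might look like this (table and column names are illustrative):

```sql
-- Connect with any MySQL client, e.g. mysql -h <fe_host> -P 9030 -u <user>,
-- then issue a point lookup on the leading key column.
SELECT last_apply_time
FROM credit_feature_mirror
WHERE uid = 1234567;
```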

Benchmark results for different query scenarios are shown in the table below:

| Scenario | Time (ms) |
| --- | --- |
| Prefix index hit | 20-30 |
| Secondary index hit | 70-90 |
| No index hit | 300-400 |
 

Concurrency Performance

DiDi used their internal platform’s StarRocks query service to benchmark without additional server-side development.

 

Configuration: Domestic test cluster (3 FEs + 5 BEs), prefix index query.

| QPS | 100 | 500 | 800 | 1100 | 1500 |
| --- | --- | --- | --- | --- | --- |
| CPU Utilization | 5% | 20% | 30% | 40% | 60% |

At 1500 QPS, the service load became heavy, forcing them to stop the test before capturing results for secondary and non-indexed queries. Their observed peak QPS over 13 days was 1109. These results confirmed that StarRocks met their performance standards for real-time feature services.

 

Production Implementation

Architecture Design


(New architecture based on StarRocks)

 

DiDi split the data layer into batch (historical data) and stream (real-time updates), both feeding into the same StarRocks table.

They enhanced their service layer to support StarRocks SQL queries, enabling feature queries through unified logic that is transparent to upstream applications.

After the initial validation, they provisioned a dedicated StarRocks cluster of 3 FE nodes and 5 BE nodes.

 

Table Design

Mirror Tables

Data is classified as either log-based or business-data-based.

| Characteristic | Log Data | Business Data |
| --- | --- | --- |
| Data structure | Unstructured or semi-structured | Structured |
| Storage method | Time-based incremental storage | Full storage based on business entities |
| Updates and deletes | None | Yes |
| Source | Logs or event logs | Business database |

StarRocks, as a structured database with primary key support, is ideal for business table mirroring, allowing unified stream-batch processing with minimal logic transformation. However, this approach isn’t optimal for unstructured log data.

Despite this, since 86% of features are sourced from business data, mirror tables still offer significant value.

 

Bucket Field Design

To optimize query performance, StarRocks employs techniques like partitioning, bucketing, vectorized execution, and a primary key model that updates via delete-and-insert. For risk control features, DiDi designed tailored tables to meet high-concurrency point lookup needs, where queries typically use consistent parameters such as uid or bizId.

Unlike traditional indexes, which aim for fast lookups, bucketing in big-data systems helps filter out irrelevant data early. DiDi uses high-cardinality fields such as uid or bizId as hash bucket keys to avoid data skew. With N buckets, roughly (N−1)/N of the data can be skipped per query, significantly boosting efficiency.

However, too many buckets increase metadata overhead and impact data I/O. Following version 2.3 guidance (100MB–1GB per bucket), DiDi conservatively set the bucket count to allow for growth. They also use only a single field as the bucket key to ensure flexibility and maintain fast point lookups.
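A minimal sketch of such a table, assuming the primary key model with a single high-cardinality bucket field (names and bucket count are illustrative, not DiDi's production DDL):

```sql
CREATE TABLE credit_feature_mirror
(
    uid             BIGINT NOT NULL COMMENT "user id; high-cardinality bucket key",
    biz_id          BIGINT COMMENT "business id",
    last_apply_time DATETIME COMMENT "last credit application submission time",
    update_time     DATETIME COMMENT "source-side update timestamp"
)
PRIMARY KEY (uid)
DISTRIBUTED BY HASH(uid) BUCKETS 32  -- single bucket field, sized with room to grow
PROPERTIES ("replication_num" = "3");
```

With 32 buckets, for example, a point lookup on uid scans roughly 1/32 of the data, which is exactly the (N−1)/N pruning described above.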

 

Service Performance

Functional Testing

DiDi configured 34 features with matching logic to validate performance.

| Query count | P99 latency | P95 latency | Average latency |
| --- | --- | --- | --- |
| 779178 | 75 ms | 55 ms | 39 ms |
 

Load Testing

Results showed excellent performance for bucketed-index and index-based join queries, but much weaker performance for secondary-index and non-indexed queries.

To ensure consistent performance, DiDi bound bucket keys to feature tables in the feature management UI, making selection mandatory.

| Scenario | Peak QPS |
| --- | --- |
| Bucketed index | 2300 |
| Secondary index | 18 |
| No index | 10 |
| Index-based join | 1300 |
 

Update Timeliness

DiDi created a feature using SELECT NOW() - MAX(update_time) to measure StarRocks update freshness.
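A sketch of this probe, assuming the illustrative mirror table from earlier:

```sql
-- Approximate end-to-end freshness: the gap between "now" and the most recent
-- source-side update that has already landed in StarRocks.
SELECT NOW() - MAX(update_time) AS update_lag
FROM credit_feature_mirror;
```

The measured latency distribution is shown below: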

| Query count | ≤1 s latency | ≤2 s latency | ≤5 s latency | ≤10 s latency | Maximum latency (s) |
| --- | --- | --- | --- | --- | --- |
| 52744 | 36.79% | 92.53% | 99.96% | 100% | 8 |
 

Benefits

Efficiency Gains

| Requirement | Real-time + offline readiness latency | StarRocks readiness latency | Feature data volume |
| --- | --- | --- | --- |
| Requirement 1 | 7 days (offline) + 1 day (real-time) + 1 hr (test) | 1 day (development) + 1 hr (test) | 10 |
| Requirement 2 | 7 days (offline) + 1 day (real-time) + 1 hr (test) | 1 day (development) + 1 hr (test) | 10 |

The table above shows the actual scheduling latency for two requirements that involve both real-time and offline features. Configuring features on StarRocks improves delivery efficiency by over 80%.

Under the traditional development model, the offline part of real-time and offline features usually has to wait for resource scheduling, with waiting times ranging from 2 to 7 days. As a result, the average delivery cycle for requirements is about 5 days. By adopting StarRocks development, which eliminates the dependency on offline scheduling, the development cycle was significantly shortened.

 

Capability Improvements

Higher Data Accuracy

  • Unified logic eliminates discrepancies between real-time and offline features.

Stronger Real-Time Support

  • Joins, deletes, and complex functions (math, string, aggregation, conditionals) are now supported.

Lower Development Complexity

  • MySQL protocol support reduces the learning curve.
  • Business mirror tables reduce cognitive load.
  • Ad-hoc query support enables faster iteration and error handling.

Higher Resource Efficiency

  • Resources are used only during queries, avoiding waste from unused or duplicate features.

 

Future Plans

StarRocks’ MySQL compatibility and ad-hoc query support allow business teams to independently configure and validate features, boosting efficiency and responsiveness.

Currently, DiDi only operates one StarRocks cluster, so fault tolerance is limited. They plan to add backup clusters or implement tiered feature management to improve stability.

Their current implementation mainly addresses stream-batch unification for business-table-sourced data. For log-based sources (which are often unstructured and high-volume), they still rely on the Lambda architecture, and will continue to explore better alternatives.