Haezoom and CloudShift Overcome Apache Druid’s Limits with StarRocks
About the Author (Kim Byung Ju)
The author works at CloudShift and has a strong passion for open-source and cloud-native data technologies. Having supported numerous clients with data consulting and implementation projects across various industries, the author also actively helps run the StarRocks community.
Haezoom's data engineering team leads Korea's renewable energy shift through Virtual Power Plant operations, specializing in solar energy. Haezoom supports over 6,000 power stations and 2.3 million users, managing a vast ecosystem of solar data for accurate generation forecasting and energy market optimization.
As Haezoom’s workloads grew, Apache Druid became costly and inefficient, limiting query flexibility and scalability. Partnering with CloudShift, Haezoom migrated to StarRocks—achieving 1.74x higher throughput, 44% faster queries, 4x gains for complex analytics, and 30% lower infrastructure costs.
About CloudShift
CloudShift was founded in 2020 and has been providing expert consulting services in Korea across various industries for the past five years. With deep expertise in modern data platform technologies, CloudShift supports comprehensive cloud and on-premises implementations.
Through its strategic partnership with CelerData, CloudShift delivers cutting-edge data solutions to its clients.
Original Architecture With Apache Druid
To handle large-scale IoT data and deliver real-time analytics, Haezoom had been leveraging Apache Druid as its core time-series database. While Druid enabled fast queries and real-time bidding, the platform's rapid growth revealed architectural limitations that required a transformation.
Resource Inefficiency
While consuming various data types through its Kafka (Confluent) + Druid stack, Haezoom discovered that not all data fit well into streaming pipelines. Some data sources delivered updates only hourly, yet Druid's peon processes consumed computing resources continuously throughout the day, regardless of actual data ingestion frequency. This mismatch between data arrival patterns and resource allocation led to significant waste.
High Infrastructure Costs
Running Apache Druid on AWS EKS with high availability requirements—including standby master pods, Zookeeper ensemble, and deep storage—resulted in considerable infrastructure overhead. The complexity of maintaining these components for reliability came at a steep financial cost that increasingly strained Haezoom’s operational budget.
Query Performance Bottlenecks
Certain analytical queries placed excessive load on Haezoom’s Druid cluster, requiring far more resources than typical operations. This forced them to scale up the entire cluster size to handle peak loads, even though most queries consumed minimal resources. This all-or-nothing scaling approach proved both costly and inefficient.
Limited Data Source Integration
Building a unified data platform became increasingly challenging as Haezoom needed to integrate multiple data sources such as S3, PostgreSQL, and Kafka. Druid's architecture constraints made it difficult to create a seamless data ecosystem that could efficiently handle diverse data sources and formats.
Complex Query Requirements
As the requirements of Haezoom's AI team grew more sophisticated, Druid's need to minimize JOIN and WITH clauses became a clear limitation. Fully preprocessing all data before ingestion to avoid these operations incurred significant computational costs and development overhead, making it an unsustainable approach for their evolving analytical needs.
Why Haezoom Chose StarRocks Over Druid: A Feature-Based Evaluation
These limitations across cost, scalability, and query flexibility prompted Haezoom to seek an alternative—and that search ultimately led them to StarRocks. As Haezoom explored StarRocks, they found that its architecture directly addressed these pain points—offering higher efficiency, lower costs, and more flexible query capabilities out of the box.
| Apache Druid Feature Requirements | Feasibility for Migration | Expected Benefits with StarRocks | Implementation Strategy |
|---|---|---|---|
| Real-time Ingestion (Kafka) | Low feasibility if only performing real-time ingestion | Increased efficiency for cost-effective real-time ingestion | Resource-efficient periodic ingestion using Routine Load |
| Real-time Ingestion (Kinesis) | Low (StarRocks doesn't support native Kinesis ingestion) | Requires transition to Flink or StarRocks Pipe | |
| Batch Ingestion | High: supports various batch ingestion methods | Various ingestion strategies available with INSERT statement support | |
| Real-time Query (Denormalized) | Low feasibility for high real-time requirements, but StarRocks supports all necessary functions | SQL API with additional complex ANSI SQL support | |
| Join Queries | High: Druid has MSQ as a new feature, but it's inefficient due to MiddleManager usage | Stable execution of join queries | |
| Long-term Report Queries | High: Druid has MSQ as a new feature, but it's inefficient due to MiddleManager usage | Stable execution of join queries | |
| Data Lake Export | High: Druid only supports CSV export via the MSQ feature | Support for various export formats | Supports various features, including External Catalog |
| Data Lake Federation | High: Druid has released an early version in this area, but has architectural inefficiencies | Support for various federation environments | Supports various features, including External Catalog |
| BI Connection | High: Druid supports the Avatica JDBC Driver | StarRocks can use the MySQL connector driver | |
| (Imply) Pivot: Real-time BI | Requires in-depth analysis: if business importance is high, alternatives should be considered | Metabase can be used, but requires a paid license for certain features | |
Haezoom partnered with CloudShift to migrate its data infrastructure to StarRocks, leveraging CloudShift's expertise in modern data platforms to ensure a seamless transition.
Migrating From Apache Druid to StarRocks
For the data migration, much of Haezoom's data already resides in Confluent (Kafka) and S3, so Haezoom decided to migrate directly from those sources. If data needs to be moved out of Druid itself, the plan is to export it via MSQ to S3 and then ingest it into StarRocks from there. The migration was planned in five phases:
1. Deploy StarRocks on EKS
   - Determine optimal cluster size and recommended parameters based on current workload patterns and expected growth
   - Configure appropriate node types, storage classes, and resource allocation for FE (Frontend) and BE (Backend) components
   - Provide current status, including Druid cluster specifications, data volume metrics, and query patterns, for accurate resource provisioning
2. Ingest Data from Kafka via Routine Load and Stream Load
   - Implement efficient data ingestion pipelines using StarRocks' native Kafka connector to minimize resource usage and latency
   - Configure appropriate parallelism, batch sizes, and commit intervals for optimal throughput
   - Share comprehensive topic and table name mappings, including partition strategies and data retention policies
3. Test and Transition SQL API Compatibility
   - Validate existing SQL syntax compatibility by running current Druid SQL queries against StarRocks
   - Conduct comprehensive performance benchmarking, comparing query execution times, resource utilization, and concurrency handling
   - Document any SQL syntax modifications required and create migration scripts for a seamless transition (a representative example follows this list)
4. Connect BI Tools
   - Integrate with Metabase using StarRocks' MySQL protocol compatibility for dashboard migration
   - Enable Python client connectivity through SQLAlchemy and PyMySQL for Haezoom's data science workflows
   - Test and validate all existing visualizations and ensure consistent data accuracy
5. Parallel Operation and Druid Decommissioning
   - Run both systems in parallel for up to two weeks to ensure data consistency and system stability
   - Implement data validation checks and monitoring to compare outputs between systems
   - Execute a phased cutover plan before safely shutting down the Druid cluster and reclaiming resources
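To give a sense of the syntax changes documented in step 3, the sketch below rewrites a typical Druid-style hourly rollup into StarRocks SQL. This is only an illustrative pattern, not one of Haezoom's actual queries; the table and column names are hypothetical.

```sql
-- Druid SQL (before): hourly rollup using Druid's TIME_FLOOR and the implicit __time column
--   SELECT TIME_FLOOR(__time, 'PT1H') AS ts_hour, station_id, SUM(generated_kwh) AS total_kwh
--   FROM generation_metrics
--   GROUP BY 1, 2;

-- StarRocks SQL (after): the same rollup expressed with ANSI-style date_trunc
SELECT date_trunc('hour', event_time) AS ts_hour,
       station_id,
       SUM(generated_kwh) AS total_kwh
FROM generation_metrics
GROUP BY date_trunc('hour', event_time), station_id;
```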
Results from the Migration (vs Apache Druid)
Resource Efficiency
By leveraging StarRocks' Routine Load feature, Haezoom now consumes data on demand, based on actual arrival patterns, eliminating the continuous resource consumption that plagued Druid's peon processes regardless of data ingestion frequency.
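As a rough illustration (not Haezoom's actual job definition), a Routine Load job of this kind is declared once and then scheduled by StarRocks itself, with a few properties controlling parallelism and batching. All database, table, topic, and broker names below are placeholders:

```sql
-- Illustrative Routine Load job: StarRocks pulls from Kafka on a schedule,
-- so no dedicated ingestion process sits idle between data arrivals.
CREATE ROUTINE LOAD solar_db.load_generation_metrics ON generation_metrics
PROPERTIES
(
    "format" = "json",
    "desired_concurrent_number" = "3",  -- ingestion parallelism
    "max_batch_interval" = "30"         -- seconds between scheduled load tasks
)
FROM KAFKA
(
    "kafka_broker_list" = "broker1.internal:9092",
    "kafka_topic" = "generation-metrics",
    "property.kafka_default_offsets" = "OFFSET_END"
);
```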
1.74x Higher Throughput
Haezoom’s new architecture now handles 1.74 times more requests than the previous Druid setup, enabling Haezoom to support its rapidly growing user base of 2.3 million users.
44.3% Faster Average Response Time
Query performance improved dramatically with average response times decreasing by 44.3%, providing near-instantaneous insights for Haezoom’s solar energy forecasting models.
4x Performance Gain for Complex Queries
For specific complex analytical queries that previously bottlenecked Haezoom’s system, they achieved up to 4x performance improvements, enabling their AI team to run sophisticated models without preprocessing overhead.
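As an illustration of the kind of query that is now practical, the hedged sketch below aggregates raw measurements in a CTE and then joins station metadata at query time, the JOIN/WITH pattern Haezoom previously avoided by pre-denormalizing data before ingestion. All table and column names are hypothetical:

```sql
-- Hypothetical analytical query: aggregate per-station output, then join
-- station metadata at query time instead of denormalizing before ingestion.
WITH hourly AS (
    SELECT station_id,
           date_trunc('hour', event_time) AS ts_hour,
           SUM(generated_kwh)             AS total_kwh
    FROM generation_metrics
    GROUP BY station_id, date_trunc('hour', event_time)
)
SELECT s.region,
       h.ts_hour,
       SUM(h.total_kwh) AS region_kwh
FROM hourly AS h
JOIN stations AS s ON s.station_id = h.station_id
GROUP BY s.region, h.ts_hour
ORDER BY h.ts_hour, s.region;
```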
30% Infrastructure Cost Reduction
Through optimized resource utilization and elimination of unnecessary standby components, Haezoom reduced its overall data infrastructure costs by 30%.
Future Plans
- Performance Optimization: Fine-tuning query execution plans and resource allocation strategies to maximize throughput and minimize latency for Haezoom's time-series solar energy data.
- Data Lake Integration: Building a comprehensive Data Lake architecture using Apache Iceberg and implementing External Catalogs to seamlessly integrate diverse data sources, including S3, PostgreSQL, and streaming platforms (a hedged sketch follows this list).
- Self-Service Analytics: Establishing a self-service analytics environment that empowers Haezoom's data analysts and scientists to independently explore, query, and derive insights without engineering bottlenecks.
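For context on the External Catalog approach mentioned above, a catalog over an Iceberg lake can be declared and queried roughly as in the sketch below. The catalog name, metastore endpoint, region, and table are placeholders, and the exact properties depend on how the lake and its catalog service are deployed:

```sql
-- Hypothetical External Catalog over an Iceberg data lake on S3
CREATE EXTERNAL CATALOG iceberg_lake
PROPERTIES
(
    "type" = "iceberg",
    "iceberg.catalog.type" = "hive",                        -- Hive metastore-backed Iceberg catalog
    "hive.metastore.uris" = "thrift://metastore-host:9083",
    "aws.s3.region" = "ap-northeast-2",
    "aws.s3.use_instance_profile" = "true"
);

-- Lake tables can then be queried alongside native StarRocks tables
SELECT * FROM iceberg_lake.analytics.daily_generation LIMIT 10;
```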
Curious to learn more about how StarRocks handles complex JOINs and other analytics challenges? Join the StarRocks Slack community to connect with us and explore further!