
 

✍🏼 About The Author:

Kevin Chen, Product Manager at CelerData 

 

StarRocks has consistently provided sub-second analytics for high-concurrency, multi-tenant workloads. However, as systems scale to hundreds of tenants with shifting data distributions and unpredictable growth, one problem repeatedly surfaces: the table layout designed at creation time no longer matches the workload it serves.

 

StarRocks 4.1 addresses this directly. Instead of requiring users to predict tenant growth and data skew in advance, the system now observes distributional changes and adjusts its physical layout at runtime. This is the most significant operability improvement in the release, and it changes who owns layout correctness: the system, not the operator.

 

Note: The adaptive layout capabilities described in this article apply to shared-data (cloud-native) clusters in StarRocks 4.1.

 

Multi-Tenant Systems Fail Gradually, Not All at Once

Most customer-facing analytics platforms do not fail abruptly; instead, their performance gradually declines.

 

Initially, queries are fast, data volumes are manageable, and operational complexity is limited. Early issues rarely present as compute limitations. Instead, they manifest as increasing unpredictability: a few tenants begin to dominate resource usage, query latency becomes less predictable, and operators must intervene more frequently to maintain stability.

 

It is common to attribute these issues to scale, such as increased data, tenants, or concurrency. However, compute resources are often not the first to reach their limit; physical data organization typically becomes the primary constraint.

 

Traditional analytical databases require key decisions at table creation, including partitioning, data distribution, bucket count, and tablet size. These choices are often made when data volumes are small, and tenant behavior is uncertain. Once in production, correcting these decisions is costly.

 

A Real-World Lesson: When One Tenant Outgrows the Table

This failure mode becomes most visible in real production environments.

 

A multi-tenant analytical workload we observed served dashboards through a single shared table. The system supported roughly 400 tenants, with all data stored in a shared primary-key table. Queries were almost always tenant-scoped, highly concurrent, and sensitive to tail latency.

 

At the beginning, tenant sizes were relatively balanced. The largest tenant accounted for less than 5% of the total data volume, and the table behaved predictably.

 

Over time, that balance changed. Within nine months, one tenant grew rapidly, eventually accounting for a disproportionate share of the table's data and query load.

 


What initially appeared as intermittent latency spikes gradually became a persistent operational issue. Query performance degraded unevenly, compaction pressure concentrated on a small number of tablets, and tail latency became unpredictable.

 

The team tried to mitigate the problem by adjusting bucket counts and repartitioning the table. But each change required careful planning, migration windows, and weeks of validation. Performance would stabilize temporarily, only to degrade again when the next large tenant emerged.

 

The underlying issue was not query execution. It was that the table's physical layout assumed a stable tenant distribution that no longer existed.

 

This pattern — static layout assumptions quietly degrading under shifting tenant distributions — is not specific to one deployment. It is the normal operating reality of multi-tenant analytics at scale.

 

StarRocks 4.1: Correcting Assumptions at Runtime

StarRocks 4.1 introduces an adaptive physical data model for multi-tenant analytics. Rather than treating table layout as a static result of creation-time decisions, the system observes distribution changes, absorbs growth, and dynamically adjusts the layout as needed.

 

From Static Distribution Rules to Adaptive Semantics

One of the most significant changes in 4.1 is how table distribution is interpreted.

 

In previous versions, DISTRIBUTED BY effectively defined a permanent partitioning contract. The system assumed that the chosen key and bucket configuration would remain a reasonable proxy for data distribution over the table’s lifetime. In multi-tenant workloads, that assumption frequently breaks as tenant sizes diverge.

 

Before 4.1 (Hash distribution), this statement locked the table into a fixed physical layout:

CREATE TABLE events (
    tenant_id BIGINT,
    event_time DATETIME,
    -- remaining columns omitted
)
PRIMARY KEY (tenant_id, event_time)
DISTRIBUTED BY HASH(tenant_id) BUCKETS 64; -- fixed bucket count commits to a permanent physical shape

 

In StarRocks 4.1 (Range distribution), the syntax looks similar, but the semantics are different:

CREATE TABLE events (
    tenant_id BIGINT,
    event_time DATETIME,
    -- remaining columns omitted
)
PRIMARY KEY (tenant_id, event_time); -- no DISTRIBUTED BY or BUCKETS clause required

In StarRocks 4.1, when Range-based distribution is enabled with enable_range_distribution = true, the explicit DISTRIBUTED BY clause is no longer required. Instead, the system uses the primary key, or an explicit ORDER BY, as the initial organization hint, then allows the physical layout to evolve as tenant behavior and data distribution change.

 

This behavior is opt-in in 4.1.0 and disabled by default to preserve backward compatibility.
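
For reference, opting in might look like the following sketch. Whether enable_range_distribution is exposed as a session variable or as an FE configuration item is an assumption here and should be confirmed against the 4.1 documentation.

-- Opt in to Range-based distribution (disabled by default in 4.1.0).
SET enable_range_distribution = true;
-- If the flag is instead a frontend configuration item, the equivalent would be:
-- ADMIN SET FRONTEND CONFIG ("enable_range_distribution" = "true");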

 

The most important change in 4.1 is not the simpler syntax. It is the underlying shift in responsibility: table creation no longer permanently freezes the physical layout of data.

 

Large Tablets: The Buffer That Makes Adaptive Layout Possible

An adaptive model only works if the system can tolerate growth before it needs to react.

 

This is where large tablet support becomes a foundational capability, not just a performance optimization. In traditional designs, tablets are deliberately kept small, forcing users to over-partition and over-bucket early to anticipate future skew. In multi-tenant environments, that approach introduces metadata overhead, compaction pressure, and scheduling complexity long before skew actually appears.

 

StarRocks 4.1 takes a different approach. Tablets are allowed to grow significantly before the system takes corrective action. Large tablets act as a buffer against uncertainty: they absorb short-term imbalance, reduce premature fragmentation, and give the system time to observe real workload behavior.

 

The Lifecycle: Grow, Split, Merge

Grow: observe before intervening.

As data is ingested and tenant behavior evolves, tablets naturally expand. As long as ingestion, compaction, and query execution remain stable, no structural change is forced.

 

This phase is critical. It allows the runtime to distinguish temporary imbalance from persistent skew. Many real workloads exhibit short-lived spikes that do not justify long-term structural change.

 

In our internal validation, this phase is observable in the lifecycle of a single tablet. A newly created Range-distributed table starts with one tablet. After a 64.8 million-row INSERT, that single tablet grew to 1.5 GB — and the system, by design, did not intervene immediately. It observed first, allowing the workload to declare its actual distribution before any layout decision was committed.
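
For readers who want to observe this phase on their own tables, per-tablet sizes can be inspected from SQL. A minimal sketch, assuming SHOW TABLET reports data size for shared-data tables in your deployment; the events table is the one from the earlier examples.

-- Inspect how large each tablet has grown before any split is triggered.
SHOW TABLET FROM events;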

 

Split: correct the imbalance precisely.

When a tablet exceeds the configured target size (tablet_reshard_target_size, default 10 GB), the split mechanism activates automatically. Instead of reshaping the entire table, StarRocks performs a localized split: only the affected Range is divided into smaller contiguous ranges, and those new tablets are scheduled to distribute load more evenly.
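
For experimentation, the threshold can be lowered so that split behavior is observable on smaller datasets. A sketch assuming tablet_reshard_target_size is an FE configuration item that takes a byte value; verify the exact name, unit, and scope in the 4.1 configuration reference.

-- Lower the split threshold to roughly 1 GB (value assumed to be in bytes).
ADMIN SET FRONTEND CONFIG ("tablet_reshard_target_size" = "1073741824");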

 

The intervention is fast and bounded. In the same validation, when the 1.5 GB tablet crossed the configured threshold, the FE scheduler enqueued a SPLIT_TABLET job and reported it as FINISHED within the same second the threshold was crossed. The single hot tablet was divided into three balanced tablets—228 MB / 656 MB / 656 MB—chosen along the actual sort-key distribution rather than a uniform Hash. Row count and tenant count were identical pre and post split: 64,795,589 rows across 10,000 tenants. A Range query across 100 tenants returned 10.3 million rows in 102 ms.

 


Crucially, this correction is transparent: table schema does not change, SQL does not change, and applications remain unaware that any restructuring occurred. Split is not a redesign; it is a surgical correction.

 

Merge: reclaim fragmentation when it is no longer needed.

Adaptation does not stop at splitting.

 

In multi-tenant systems, change is not always one-directional. Tenants shrink, data is deleted, and access patterns shift. Over time, layouts that were once necessary can become unnecessarily complex.

 

StarRocks 4.1 allows adjacent Ranges to merge back into larger tablets when fragmentation no longer serves a purpose, via ALTER TABLE ... MERGE TABLETS. Merge reclaims metadata, reduces scheduling overhead, and restores a simpler physical layout.
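
A sketch of what invoking that statement might look like; the exact argument form, such as whether specific tablets or ranges can be named, is an assumption to check against the 4.1 docs.

-- Ask the system to merge adjacent small tablets of a table back into larger ones.
ALTER TABLE events MERGE TABLETS;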

 

This ability to converge is what prevents adaptation from degenerating into permanent fragmentation.

 

Note on availability: Tablet merge ships in 4.1, gated behind the tablet_reshard_enable_tablet_merge configuration flag (default false). The default-off setting reflects a measured rollout — broader default-on validation is targeted for a subsequent release. Workloads that primarily grow are well-served today; workloads that meaningfully shrink can opt into merge through the config flag.
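
Opting in might look like the following, assuming the flag is an FE configuration item like the other reshard settings discussed above.

-- Enable tablet merge (ships in 4.1, disabled by default).
ADMIN SET FRONTEND CONFIG ("tablet_reshard_enable_tablet_merge" = "true");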

 

What This Means for Users

Taken together, these capabilities change the role of table design. Physical layout is no longer a one-time decision that users must get right at creation time and keep correct forever; it becomes a system-controlled state that StarRocks can continuously adjust as data volume, tenant size, and workload patterns change.

 

Large tablets make growth tolerable, adaptive splits make correction precise, and merges make recovery possible. For users, this means table creation becomes simpler and more focused on business semantics rather than predicting future skew. At runtime, the layout can adapt without service disruption, while SQL and application logic remain unchanged. Operationally, correction becomes incremental system behavior instead of a high-risk manual migration event. Manual intervention is not eliminated, but it is no longer the baseline condition for stability.

 

Validation: From Theory to Measured Behavior

The test environment consisted of two identical AWS shared-data clusters running StarRocks 4.1.0: the same hardware, the same dataset, and the same queries. Each cluster had one FE node on m6i.xlarge and three CN nodes on m6i.4xlarge, with 500 GB of gp3 storage per CN. The dataset was a 200 GB schema modeled on a typical HR analytics workload, including a 1-billion-row event log table partitioned by month and a 200-million-row primary-key table.

 

We tested three representative query patterns: a two-table join, a COUNT DISTINCT aggregation over the event log, and a primary-key-based aggregation. The only configuration difference between the two clusters was whether enable_range_distribution was enabled.
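
To make the three patterns concrete, they correspond roughly to the query shapes sketched below. These are illustrative examples with hypothetical table and column names (event_log, employee_profile, employee_id, department, event_type), not the exact benchmark queries.

-- 1. Two-table join: event log joined to the primary-key profile table.
SELECT p.department, COUNT(*) AS events
FROM event_log e
JOIN employee_profile p ON e.employee_id = p.employee_id
WHERE e.event_time >= '2026-01-01'
GROUP BY p.department;

-- 2. COUNT DISTINCT aggregation over the 1-billion-row event log.
SELECT event_type, COUNT(DISTINCT employee_id) AS active_employees
FROM event_log
WHERE event_time >= '2026-01-01'
GROUP BY event_type;

-- 3. Primary-key-based aggregation on the 200-million-row primary-key table.
SELECT department, COUNT(*) AS headcount
FROM employee_profile
GROUP BY department;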

The results trace an inflection point that is essential to understanding the trade-offs.

 

With a single user and a warm cache, static distribution performs better:

The COUNT DISTINCT query runs in 1.5 seconds with Hash distribution and 3.2 seconds with Range distribution. In this case, Colocate Join allows local aggregation, while Range distribution requires a shuffle. For workloads dominated by a single analyst querying warm data, the existing model still has an advantage.

 

At eight concurrent threads, performance is effectively equivalent:

Hash distribution reaches 3.60 QPS, while Range distribution reaches 3.53 QPS, a difference of less than 2%. At or below this concurrency level, the choice of distribution model has limited impact on throughput.

 

At 32 concurrent threads, however, the relationship reverses:

Range distribution delivers 1.86× higher throughput: 7.05 QPS versus 3.79 QPS. The difference in tail latency is even more pronounced. Hash distribution’s P99 latency rises to 36.6 seconds, while Range distribution keeps P99 bounded at 11.5 seconds.

 

This is the critical inflection point. With the same data, hardware, and queries, the statically distributed cluster does not simply run slower as concurrency increases. Its tail latency degrades sharply.

 


The mechanism is visible in the monitoring data. Under 32-thread pressure, BE CPU idle time on the Hash-distributed cluster remains above 90% on every node. The CPUs are not saturated; they are waiting. Colocate-join plans contend on the same hot buckets. By contrast, the Range-distributed cluster spreads shuffle work across more independent tablets and makes better use of the available compute resources.


Mixed real-time load:

The same pattern holds under mixed real-time load. When eight query threads run alongside four ingest workers — closer to the shape of a multi-tenant analytical workload backed by streaming data — Range distribution continues to improve query performance.

 

Range distribution delivers 1.86× higher QPS: 5.31 versus 2.86. Tail latency also improves, with P99 dropping from 9.7 seconds to 4.8 seconds.

 

These query gains do not come at the cost of ingest performance. Ingest throughput is essentially unchanged: 97 batches committed on Hash distribution versus 98 on Range distribution. Ingest P99 is also comparable, at 14.1 seconds on Hash and 12.5 seconds on Range.

 

Honest trade-offs:

The validation also clarifies the current trade-offs.

  • Bulk load performance is not uniformly better. On the duplicate-key event log table, Range distribution ingests slightly faster than Hash distribution: 591 seconds versus 625 seconds for 1 billion rows. On the 215 GB primary-key table, however, Range distribution is 1.7× slower: 1,075 seconds versus 624 seconds, largely due to post-commit compaction overhead. For one-time bulk reloads of large primary-key tables, this difference matters.
  • Sequential, low-concurrency queries can also be slower on Range distribution, typically by 1.3× to 2×, because Colocate Joins are not available. The same COUNT DISTINCT query that is 2.08× slower with a single user becomes the faster option under concurrency, delivering 1.86× higher QPS at 32 threads.

For a system designed to remain stable under workloads that cannot be predicted at table-creation time, these are the right trade-offs.

 

Conclusion: From Static Design to Continuous Correction

An evolvable physical data model is a deliberate system choice, not a one-time fix.

StarRocks 4.1 favors delayed intervention over aggressive early reshaping, tolerating short-term imbalance to avoid constant reorganization. The mechanisms that enable split and merge introduce complexity that must mature with safety and observability as first-class constraints.

 

Range distribution is opt-in in 4.1.0 (enable_range_distribution = true, default false). Existing Hash-distributed tables do not auto-upgrade — opting in requires creating new Range-distributed tables. Tablet merge ships in 4.1 behind a config flag (tablet_reshard_enable_tablet_merge, default false); broader default-on validation follows in subsequent releases.

 

This model does not eliminate manual intervention entirely, but it removes it as a prerequisite for stable operation.

 

Tenant sizes will shift, distributions will drift, and original table-design assumptions will eventually become wrong. StarRocks 4.1 does not attempt to prevent that reality—it changes what happens when it arrives, turning physical layout from a one-time decision into a continuously corrected system state.

 

Getting Started with StarRocks 4.1

To get started with adaptive multi-tenant data management, explore the StarRocks 4.1 documentation or join the StarRocks Community Slack to discuss your deployment.

 

Download StarRocks 4.1 from the StarRocks Community download page or the official StarRocks Docker Hub repository.