StarRocks Best Practices: Data Modeling

Publish date: Jun 5, 2024 1:47:07 PM

Over my years as a DBA and StarRocks contributor, I've gained a lot of experience working alongside a diverse array of community members and picked up plenty of best practices. In this time, I've found five specific models that stand out as absolutely critical: deployment, data modeling, data ingestion, querying, and monitoring.

In my previous article I shared some tips on StarRocks deployment, in this one, I'll be explaining the finer points of data modeling.

Data Modeling

When it comes to data modeling, you'll find there are a lot of decisions to make as you move through the process. I've tried to cover each of those key decisions below, but if you still have questions, I encourage you to reach out on the StarRocks Slack.

Table Creation Standards to Know

Only UTF8 encoding is supported.
Column renaming is not supported (support coming soon).
Maximum VARCHAR length: 1048576
KEY columns cannot use FLOAT, DOUBLE types.
Data directory names, database names, table names, view names, usernames, role names are case-sensitive; column names and partition names are not case-sensitive.
In the primary key table, the total size of primary key columns should not exceed 128 bytes.

What Table Type Should You Use?

Use the duplicated key table to retain detailed records (fact table).
use the primary key table for tables that have a unique column that is non-null, have frequent write and read, and non-primary key columns utilize indexes.
Use the unique key table for tables with a unique column that may be null, have frequent write, and infrequent read.
use the aggregate table to retain only aggregated data.
See the doc for a detailed comparison between different table types.

Choosing Sorting Columns and Prefix Indexes

For versions before StarRocks 3.0, the primary key table specifies sorting columns through PRIMARY KEY. For StarRocks 3.0 and later, sorting columns in primary key models are specified through ORDER BY.
Prefix index is a sparse index introduced on top of sorting columns to further improve query efficiency. All prefix indexes are loaded into memory for optimal performance. When using prefix indexes, consider:
- Columns frequently used as query conditions should be sorting columns. For instance, placing user_id first if queries often filter by user_id.
- Sorting columns should ideally be 3-5 columns. Too many can increase sorting overhead and slow down data ingestion.
- Prefix indexes should not exceed 36 bytes and must not include more than 3 columns. VARCHAR columns will be truncated in prefix indexes, and FLOAT or DOUBLE types are not allowed.

In combination with actual business query scenarios, when determining the Key columns and the order of fields, the advantages of prefix indexes should be fully considered. Order the Key columns that are frequently queried at the front as much as possible, and try to choose data types like date or integer types such as int for these fields

Here's an example:

CREATE TABLE site_access (
site_id BIGINT DEFAULT '10',
city_code INT,
site_name VARCHAR(50),
pv BIGINT DEFAULT '0'
)
DUPLICATE KEY(site_id, city_code, site_name)
DISTRIBUTED BY HASH(site_id);

In the 'site_access' table, the prefix index consists of site_id (8 Bytes) + city_code (4 Bytes) + site_name (first 24 Bytes).

If the query condition only includes the columns site_id and city_code, as shown below, it can significantly reduce the number of data rows that need to be scanned:

select sum(pv) from site_access where site_id = 123 and city_code = 2;

If the query condition only includes the column site_id, as shown below, it can specifically target rows containing only site_id:

select sum(pv) from site_access where site_id = 123;

If the query condition only includes the column city_code, as shown below, all data rows need to be scanned, and the sorting effect is greatly reduced:

select sum(pv) from site_access where city_code = 2;

If the ratio of combined queries of site_id and city_code to queries of city_code alone is about the same, consider creating a synchronized materialized view to adjust the column order to enhance query performance. In this case, we place the city_code column first in the materialized view.

create materialized view site_access_city_code_mv asselect
city_code,
site_id,
site_name,
pv
from
site_access;

Here's a bad case:

CREATE TABLE site_access_bad
(
site_name VARCHAR(20),
site_id BIGINT DEFAULT '10',
city_code INT,
pv BIGINT DEFAULT '0'
)
PRIMARY KEY(site_id)
DISTRIBUTED BY HASH(site_id)
ORDER BY(site_id,city_code);

In the 'site_access_bad' table, the prefix index only includes site_name.

Partitioning Choices

Recommended: Use a non-changing time column for WHERE filtering for partition creation.
Recommended: Choose dynamic partitions for scenarios with data expiration needs.
Required: Create partitions for data with distinct hot and cold characteristics, e.g., frequently updated data from the past week could be partitioned daily.
Required: Each partition must not exceed 100GB.
Required: Tables over 50GB or 5 million rows should have partitions.
Recommended: Create partitions as needed; avoid creating a large number of empty partitions to prevent excessive FE memory usage from metadata.
Currently supports time (Range, Expression partitions), string (List partitions), and number (Range, List partitions).
By default, up to 1024 partitions are supported, and adjustable via settings, though typically unnecessary.

Bucketing Choices

Required: Production must use 3 replicas.
Bucket Count Decision:
- Required: Estimate each bucket as 1GB, with raw data estimated at 10GB (compression ratio of 7:1 to 10:1). If the calculated bucket count is less than the number of BE nodes, the final bucket count should match the number of BE nodes, e.g., 6 BE nodes would result in 6 buckets.
- Required: Do not use dynamic bucketing for non-partitioned tables; estimate bucket count based on actual data volume.
- Required: Avoid dynamic bucketing if partitions within a partitioned table vary significantly in data size.
Bucketing and how to avoid data skew :
- Recommended: If a bucketing column is frequently used in WHERE conditions and has low duplication (e.g., user IDs, transaction IDs), then it can be used as a bucketing column.
- Recommended: If queries often include city_id and site_id together, and city_id has low cardinality, simply using city_id as a bucketing column could lead to significant data skew. In this case, consider combining city_id and site_id as bucketing fields. However, this approach may compromise on bucketing benefits when queries only include city_id.
- Recommended: If no suitable fields are available for dispersing data as bucketing fields, consider using random bucketing, although this eliminates the benefits of bucket trimming.
- Required: For joins involving two or more tables each with over a thousand rows, consider using Colocate join.

Field Types

Recommended: Avoid using null properties.
Required: Ensure the correct data types for time and numeric columns are used. Using inappropriate data types, such as storing time data like "2024-01-01 00:00:00" in VARCHAR, can significantly increase computational costs and hinder internal optimizations like Zonemap indexing.

Indexing Choices

Bitmap Index:
- Suitable for columns with cardinality between 10,000 and 100,000.
- Appropriate for columns queried with equality conditions (=) or within a specific range ([NOT] IN).
- Not supported for FLOAT, DOUBLE, BOOLEAN, and DECIMAL types.
Bloom Filter Index:
- Suitable for columns with a cardinality over 100,000 with low duplication.
- Appropriate for queries with in and = conditions.
- Not supported for TINYINT, FLOAT, DOUBLE, and DECIMAL types.
- In detail and primary key models, all columns can have Bitmap indexes; in aggregate and update models, only Key columns support Bitmap indexing.

Like I mentioned at the start, data modeling comes with a lot of decisions. Fortunately, you'll no longer have to think about many of this thanks to this guide.

This sums up my advice for data modeling, but there's a lot more to share. Head on over to my third article in this series that will take a look at data ingestion with StarRocks. If you have additional questions, reach out to me on the StarRocks' Slack channel.

Read Part 3