Some additional information in one line

Over my years as a DBA and StarRocks contributor, I've gained a lot of experience working alongside a diverse array of community members and picked up plenty of best practices. In this time, I've found five specific models that stand out as absolutely critical: deployment, data modeling, data ingestion, querying, and monitoring.

In my previous article I shared some tips on StarRocks data ingestion. In this one, I'll explore querying

 

Querying with StarRocks

Knowing the ins and outs of querying when it comes to StarRocks is one of the more valuable areas you can devote your time to. Not all queries are created equal, however, and there are specific things to keep in mind depending on your scenario. Below I'll discuss what I consider critical, or at least recommended, when working with queries in StarRocks.

 

High-Concurrency Scenarios

  • Recommended: Utilize partition and bucket trimming as detailed in the partition and bucketing sections.

  • Required: Increase the user's concurrency limit; Default value is 100. To set to 1000, SET PROPERTY FOR 'jack' 'max_user_connections' = '1000';.

  • Required: Enable Page Cache and Query Cache.

 

Data Precision

  • Required: For precise results, enforce the use of the DECIMAL type and avoid FLOAT, DOUBLE types.

 

SQL Querying

  • Required: Avoid SELECT *; specify the columns needed, e.g., SELECT col0, col1 FROM tb1.

  • Required: Avoid full table scans; include filtering predicates, e.g., SELECT col0, col1 FROM tb1 WHERE id=123, SELECT col0, col1 FROM tb1 WHERE dt>'2024-01-01'.

  • Required: To prevent the transferring of large amounts of data at once, enforce paginated queries. For example, use SELECT col0, col1, col2, ..., col50 FROM tb ORDER BY id LIMIT 0, 50000 to effectively manage and reduce the volume of data returned in a single query.

  • Required: Pagination operations must include an order by; otherwise, the results are unordered.

  • Recommended: Avoid using unnecessary functions or expressions in predicates.

 

Predicates containing 'cast' can be removed.

-- Q1
select l_tax
from lineitem
where cast(l_shipdate as varchar) > substr('1990-01-02 12:30:31',1,10);

-- Q2
select l_tax
from lineitem
where l_shipdate > '1990-01-02';

 

Excessive Use of Functions in Expressions
-- Q1 bad case
select count(1)
from lineitem
where l_shipdate >= regexp_extract("TIME:1996-01-02 20:00:00", "(\\d{4}-\\d{2}-\\d{2})", 1);

-- Q2
select count(1)
from lineitem
where l_shipdate >= "1996-01-02"

 

-- Q1 bad case
select count(1)
from lineitem
where DATE_FORMAT(l_shipdate,'%Y-%m-%d') >= "1996-01-02"

-- Q2 good case
select count(1)
from lineitem
where l_shipdate >= "1996-01-02"

 

Joins

  • Required: Ensure matching field types for joins; Although StarRocks performs implicit conversions internally for optimal performance, it's better to use matching types to avoid potential inaccuracies, especially avoid joining on FLOAT or DOUBLE types.

  • Required: Avoid using functions or expressions in join conditions, e.g., avoid JOIN ON DATE_FORMAT(tb1.col1,'%Y-%m-%d')=DATE_FORMAT(tb2.col1,'%Y-%m-%d').

  • Required: Consider using Colocate Join to reduce data shuffling for joins involving two or more tables each with over 10 million rows.

  • Recommended: Avoid Cartesian products;

    • Queries that involve multiple tables should explicitly specify join conditions.

-- bad case
SELECT *
FROM table1, table2;

-- good case
SELECT *
FROM table1, table2 ON table1.column1 = table2.column1;

 

Correct Association in Subqueries

  • Ensure a clear association between columns in the outer query and the subquery.

-- bad case
SELECT *
FROM table1
WHERE column1 IN (SELECT column2 FROM table2);

-- good case
SELECT *
FROM table1
WHERE column1 IN (SELECT column2 FROM table2 WHERE

 

Using Materialized Views to Accelerate Queries

Accurate Deduplication with Synchronous Materialized View:

    • This example is based on an advertising business detail table advertiser_view_record, recording click date click_time, advertiser code advertiser, click channel channel, and user ID user_id.

CREATE TABLE advertiser_view_record(
click_time DATE,
advertiser VARCHAR(10),
channel VARCHAR(10),
user_id INT) DISTRIBUTED BY hash(click_time);

 

This scenario frequently uses the following query to check the Unique Visitor (UV) count for clicked ads.

SELECT advertiser, channel, count(distinct user_id) FROM advertiser_view_record GROUP BY advertiser, channel;

 

To accelerate accurate deduplication queries, a materialized view can be created using the bitmap_union() function to pre-aggregate data. 
CREATE MATERIALIZED VIEW advertiser_uv AS SELECT advertiser, channel, bitmap_union(to_bitmap(user_id)) FROM advertiser_view_record GROUP BY advertiser, channel;

 

Once the materialized view is created, subsequent queries involving subqueries like count(distinct user_id) will be automatically rewritten to bitmap_union_count(to_bitmap(user_id)) to hit the materialized view.

 

Utilizing Cache to Accelerate Queries

  • Recommended: Page Cache - Advised to enable to accelerate data scanning scenarios; if memory is abundant, consider increasing the limit, default is 20% of mem_limit.

  • Recommended: Query Cache - Advised to enable to speed up aggregation scenarios involving single or multiple table joins.

  • Recommended: Data Cache - Advised to enable by default in compute-storage separation and lake analytics scenarios.

 

This sums up my advice for querying, but there's a lot more to share. Head on over to my final article in this series that will take a look at monitoring with StarRocks. Be sure to join me on StarRocks' Slack if you have questions or would like to learn more.