Announcing StarRocks Version 2.5
StarRocks version 2.5 comes with many new features and enhancements. The core features include:
- Delta Lake Catalog, Local Cache, support for querying Apache Hudi merge-on-read (MOR) tables, support for querying MAP and STRUCT data in data lakes.
- Support for creating multi-table async materialized views (MVs) based on external catalogs and existing MVs, and support for query rewrites.
- Query Cache.
- Lambda expressions and higher-order functions.
- Primary Key table support for conditional updates.
Version 2.5 is StarRocks' Long-term Support (LTS) version. You're invited to try the new features and give us your feedback.
An Introduction to StarRocks 2.5
Version 2.5 is the last 2.x version before version 3.0 is released. It offers many important features and enhancements:
- Data Lake Analytics (DLA) is further enhanced with closer integration with other data lake ecosystems, higher query performance, and a better user experience, making data lake analytics easier and more efficient.
- Materialized View: The building and refresh mechanisms of asynchronous materialized views are optimized, and query rewrites are supported. These optimizations simplify data modeling.
- Query Cache accelerates queries that are "semantically equivalent".
- Lambda expressions and higher-order functions offer more flexible data queries.
- Conditional updates for the Primary Key model ensure that new data is not overwritten by old data, even when upstream data arrives out of order.
This article details only a handful of core features. For a full overview of features, see the official [Release Notes](https://docs.starrocks.io/en-us/main/release_notes/release-2.5).
Core Features and Enhancements
Data Lake Analytics
StarRocks 2.5 makes major enhancements to DLA, including:
Integration with more data lake ecosystems
- Delta Lake Catalog: allows you to query Delta Lake tables with zero data migration.
- Support for querying Apache Hudi's MOR tables: provides better support for real-time DLA scenarios, such as analyzing real-time data ingested from Apache Flink CDC to Apache Hudi.
- File External Table: allows you to directly query Parquet and ORC files stored in a distributed file system (DFS) or object storage without using a data lake metastore.
- Integration with AWS Glue: AWS Glue can be used as a lake analytics metastore for Apache Hive, Apache Hudi, Apache Iceberg, and Delta Lake, which brings a ready-to-use lake analytics experience to AWS public cloud users.
- Support for Local Cache: This feature splits files in external storage systems into blocks and caches these blocks on StarRocks' local disks, in memory, or in a combination of both. This significantly accelerates queries and improves the performance of DLA. With Local Cache, query performance on external data is comparable to that of querying StarRocks native tables.
- Optimized the efficiency of accessing Apache Hive, Apache Hudi, and Apache Iceberg metadata in the query planning phase.
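The block-level caching idea behind Local Cache can be illustrated with a small Python sketch. This is not StarRocks code: the class name, the tiny block size, the LRU eviction policy, and the `s3://bucket/part-0.parquet` path are all assumptions made for illustration. It only shows how a byte-range read can be served from cached fixed-size blocks so that repeated reads avoid remote storage.

```python
from collections import OrderedDict

BLOCK_SIZE = 4  # bytes per block; tiny on purpose so the example is easy to follow


class BlockCache:
    """Toy LRU cache of fixed-size file blocks, keyed by (path, block index)."""

    def __init__(self, capacity_blocks, fetch_remote):
        self.capacity = capacity_blocks
        self.fetch_remote = fetch_remote  # callable(path, offset, length) -> bytes
        self.blocks = OrderedDict()
        self.remote_reads = 0

    def _get_block(self, path, index):
        key = (path, index)
        if key in self.blocks:
            self.blocks.move_to_end(key)  # mark as most recently used
            return self.blocks[key]
        data = self.fetch_remote(path, index * BLOCK_SIZE, BLOCK_SIZE)
        self.remote_reads += 1
        self.blocks[key] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        return data

    def read(self, path, offset, length):
        """Read a byte range by stitching together the blocks that cover it."""
        if length <= 0:
            return b""
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        data = b"".join(self._get_block(path, i) for i in range(first, last + 1))
        start = offset - first * BLOCK_SIZE
        return data[start:start + length]


# Stand-in for object storage: a dict of hypothetical path -> file contents.
REMOTE = {"s3://bucket/part-0.parquet": b"abcdefghijklmnop"}


def fetch(path, offset, length):
    return REMOTE[path][offset:offset + length]


cache = BlockCache(capacity_blocks=8, fetch_remote=fetch)
first_read = cache.read("s3://bucket/part-0.parquet", 2, 7)   # needs blocks 0, 1, 2
second_read = cache.read("s3://bucket/part-0.parquet", 4, 4)  # block 1 is already cached
```

After both reads, `cache.remote_reads` is 3: the second read is served entirely from the cache.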
Enhanced query experience
- Out-of-the-box metadata retrieval policy: In ad-hoc scenarios, this feature enables business teams to query data without having to be aware of data updates in the underlying partitions.
- Support for querying MAP and STRUCT data in data lakes: Hive Catalog fully supports querying and analyzing MAP and STRUCT data in Parquet and ORC files, making data analytics smoother in complex DLA scenarios.
Materialized View
StarRocks released version 2.4 in October 2022, which supports asynchronously refreshed multi-table MVs. In version 2.5, StarRocks further improved this feature:
Flexible building of async MVs and enhanced modeling capabilities
- StarRocks supports creating async MVs based on external catalogs or based on tables across internal and external catalogs. This allows you to easily perform data modeling based on data in the lake.
- StarRocks supports creating async MVs from existing MVs, simplifying multi-layered data modeling.
- MVs can now have a different lifecycle than the base table, simplifying data governance.
Various async refresh policies to lower refresh overhead
- You can set the maximum number of partitions for a single refresh to split large refresh tasks. This way, large refresh tasks can be completed stably in batches.
- You can set excluded tables to avoid unnecessary refreshes of historical data.
- You can define an automatic refresh scope to refresh only recent data.
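The three refresh policies above can be sketched together as a small planning function. This is a simplified model, not StarRocks' scheduler: the function name `plan_refresh`, its parameters, and the partition and table names are hypothetical, and the real feature is configured through materialized view properties rather than Python.

```python
def plan_refresh(partitions, changes_by_table, max_per_task,
                 excluded_tables=(), recent_limit=None):
    """Return batches of MV partitions to refresh.

    partitions       -- all MV partition names, oldest first
    changes_by_table -- base table name -> set of partitions whose data changed
    max_per_task     -- maximum number of partitions refreshed by one task
    excluded_tables  -- base tables whose changes must not trigger a refresh
    recent_limit     -- if set, only the most recent N partitions are eligible
    """
    if max_per_task < 1:
        raise ValueError("max_per_task must be at least 1")

    # Ignore changes that come from excluded base tables.
    changed = set()
    for table, parts in changes_by_table.items():
        if table not in excluded_tables:
            changed.update(parts)

    # Restrict the refresh scope to recent partitions when a limit is given.
    if recent_limit is None:
        eligible = partitions
    else:
        eligible = partitions[max(len(partitions) - recent_limit, 0):]

    to_refresh = [p for p in eligible if p in changed]

    # Split one large refresh into several bounded tasks.
    return [to_refresh[i:i + max_per_task]
            for i in range(0, len(to_refresh), max_per_task)]


partitions = ["p20230101", "p20230102", "p20230103",
              "p20230104", "p20230105", "p20230106"]
changes = {
    "orders": {"p20230103", "p20230104", "p20230105", "p20230106"},
    "dim_region": {"p20230101", "p20230102"},  # historical data rewritten
}
batches = plan_refresh(partitions, changes, max_per_task=2,
                       excluded_tables={"dim_region"}, recent_limit=4)
```

Here `batches` holds two tasks of two partitions each, and the rewritten historical partitions of the excluded table are never refreshed.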
Transparent query rewrites for seamless query acceleration
- Supports query rewrites for Select, Projection, Join, and Group By (SPJG).
- Supports Union Rewrites for partitions and predicates.
- Supports query rewrites based on nested MVs.
The above enhancements equip materialized views with basic data modeling capabilities. In the future, StarRocks will continue to enhance this feature by supporting incremental updates and MVs for primary key tables.
Query Cache
Query Cache stores the intermediate computation results of queries in the BEs' memory. As such, new queries that are "semantically equivalent" to previous ones can reuse the cached computation results to reduce latency and increase QPS. Query Cache also supports reusing partial query results. Query Cache is more effective in the following scenarios:
- You frequently run aggregate queries on a denormalized table.
- Most of your aggregate queries are non-GROUP BY queries and low-cardinality GROUP BY queries.
- The data to query is appended by time partition. Different partitions have different access frequencies (hot and cold data).
In the future, StarRocks will continue to optimize Query Cache, including supporting the reuse of multi-table join query results.
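To make the idea of reusing partial results concrete, here is a toy Python model. It is only a sketch under stated assumptions: the real Query Cache works on intermediate results inside the BEs, whereas this example caches one partial SUM per partition, uses a crude text normalization as a stand-in for semantic equivalence, and invents the `PartialAggCache` class, the per-partition version numbers, and the sample data.

```python
def normalize(sql):
    """Crude stand-in for semantic equivalence: ignore case and extra spaces."""
    return " ".join(sql.lower().split())


class PartialAggCache:
    """Caches one partial SUM per (query, partition, partition version)."""

    def __init__(self):
        self.store = {}
        self.scans = 0  # number of partitions that actually had to be scanned

    def sum_over(self, sql, table, partitions):
        query_key = normalize(sql)
        total = 0
        for name in partitions:
            rows, version = table[name]
            key = (query_key, name, version)
            if key not in self.store:
                self.store[key] = sum(rows)  # compute the partial result once
                self.scans += 1
            total += self.store[key]
        return total


# Hypothetical table: partition name -> (values, data version).
table = {
    "p1": ([1, 2, 3], 1),
    "p2": ([10, 20], 1),
    "p3": ([100], 1),
}

cache = PartialAggCache()
r1 = cache.sum_over("SELECT SUM(v) FROM t", table, ["p1", "p2"])
# Differently formatted query over an overlapping range: p2 is reused.
r2 = cache.sum_over("select  sum(v)  from t", table, ["p2", "p3"])
scans_before_append = cache.scans

# New data is appended to the hot partition, which bumps its version.
table["p3"] = ([100, 5], 2)
r3 = cache.sum_over("SELECT SUM(v) FROM t", table, ["p2", "p3"])
```

Only the partition whose version changed is rescanned for the third query; the cold partition keeps serving its cached partial result, which is why append-by-time-partition workloads benefit most.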
Lambda Expressions and Higher-Order Functions
StarRocks is keen to deliver a flexible and powerful analytics engine for users. In 2.5, StarRocks released Lambda expressions and higher-order functions. Currently, four higher-order functions are supported: array_map(), array_filter(), array_sum(), and array_sortby(). These functions simplify calculations, filtering, and sorting on array elements. They are very efficient in scenarios such as user behavior analysis and variable capturing. In the future, StarRocks will offer more higher-order functions to deliver a more flexible and convenient data analytics experience.
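For readers unfamiliar with higher-order functions, the following Python sketch mirrors the general behavior of the four functions named above. These are Python stand-ins, not the StarRocks implementations: the argument order is the sketch's own choice, NULL handling is ignored, and the sample data is made up. The `threshold` variable shows what capturing a variable inside a lambda expression means.

```python
def array_map(fn, *arrays):
    """Apply fn element-wise across one or more equal-length arrays."""
    return [fn(*items) for items in zip(*arrays)]


def array_filter(fn, array):
    """Keep the elements for which fn returns a true value."""
    return [x for x in array if fn(x)]


def array_sum(fn, array):
    """Sum the results of applying fn to each element."""
    return sum(array_map(fn, array))


def array_sortby(fn, array):
    """Sort the array in ascending order of the keys produced by fn."""
    keys = array_map(fn, array)
    return [x for _, x in sorted(zip(keys, array), key=lambda pair: pair[0])]


# Hypothetical user-behavior data: seconds spent on each visited page.
durations = [30, 5, 120, 45]
threshold = 20  # captured by the lambda below

long_visits = array_filter(lambda d: d > threshold, durations)
doubled = array_map(lambda d: d * 2, durations)
pairwise = array_map(lambda a, b: a + b, [1, 2, 3], [10, 20, 30])
total = array_sum(lambda d: d * 2, durations)
ranked = array_sortby(lambda d: -d, durations)  # longest visit first
```

In this sketch `long_visits` is `[30, 120, 45]` and `ranked` is `[120, 45, 30, 5]`.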
Primary Key Model
Version 2.5 also makes the following enhancements to the Primary Key model:
- The Primary Key model supports conditional updates. You can specify a non-primary key column as the update condition. An update takes place only when the value of the loaded data is greater than the current value in that column. Conditional updates better support data that arrives out of order and help ensure data quality.
- The peak memory usage during data ingestion is reduced by 50%.
In the future, StarRocks will continue to improve data UPSERT of the Primary Key model, including more complete update capabilities and higher partial update efficiency.
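The effect of a conditional update on out-of-order data can be sketched in a few lines of Python. This is an illustrative model only: the function name, the dict-based table, and the `version` column are hypothetical, and treating an equal value as "not newer" is an assumption of this sketch.

```python
def conditional_upsert(table, rows, key, condition=None):
    """Upsert rows into a dict keyed by primary key.

    When a condition column is given, an existing row is replaced only if the
    incoming row has a greater value in that column.
    """
    for row in rows:
        current = table.get(row[key])
        if (current is not None and condition is not None
                and not row[condition] > current[condition]):
            continue  # the incoming row is older (or equal): keep what we have
        table[row[key]] = row
    return table


# Upstream events arrive out of order: version 2 lands before version 1.
events = [
    {"id": 1, "version": 2, "status": "paid"},
    {"id": 1, "version": 1, "status": "created"},
]

with_condition = conditional_upsert({}, events, key="id", condition="version")
without_condition = conditional_upsert({}, events, key="id")
```

With the condition column, the final status is `paid`; without it, the late-arriving older event overwrites the newer one and the status ends up as `created`.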
Data Ingestion
Version 2.5 has also made many optimizations to data ingestion, including:
- Support for using Resource Group to isolate resources for data ingestion, which keeps the resource consumption of data ingestion under control.
- Support for copying replicas between BEs during data ingestion, which reduces the overhead of generating data files for multiple replicas and doubles data ingestion performance.
- Support for configuring the priorities of Broker Load jobs.
- No need to deploy brokers when loading data from HDFS or object storage such as AWS S3.
- Better Broker Load performance when a large number of small ORC files are loaded.
Backup and Restore
Previous versions of StarRocks supported data backup and restore only at the table level and only on certain table models. In version 2.5, StarRocks supports data backup and restore at the database level, lowering management costs. Additionally, StarRocks supports data backup and restore on all table models, including the Primary Key model.
Other Enhancements
- Automatic setting of an appropriate number of tablets when you create a table, eliminating the need for manual operations.
- Support for user-defined variables.
- Support for zstd, Snappy, and zlib data compression algorithms. You can specify a compression algorithm when creating a table.
- Support for using the results of the uuid() or uuid_numeric() functions as the default values of columns when creating a table.
- Optimized the data of the tables and columns tables in the information_schema database and added a new table called table_config.
- New/optimized SQL functions
- Provides the QUALIFY clause to filter results of window functions.
- Added the following functions: map_size, map_keys, map_values, max_by, sub_bitmap, bitmap_to_base64, host_name, and date_slice.
- Supports specifying multiple arguments in the unnest function.
- Added a new mode INCREASE for the window_funnel function to avoid computing duplicate timestamps.
- The following ARRAY functions support querying JSON data: array_agg, array_sort, array_concat, array_slice, and reverse.
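The QUALIFY clause mentioned in the list above filters rows by the result of a window function. The following Python sketch models that behavior for a row-number window; it is not SQL and not StarRocks code, and the helper names and sample rows are hypothetical.

```python
def row_number(rows, partition_by, order_by):
    """Pair each row with its 1-based position inside its partition."""
    ordered = sorted(rows, key=lambda r: (r[partition_by], r[order_by]))
    counters = {}
    numbered = []
    for row in ordered:
        group = row[partition_by]
        counters[group] = counters.get(group, 0) + 1
        numbered.append((row, counters[group]))
    return numbered


def qualify(numbered, predicate):
    """Keep only the rows whose window function result satisfies the predicate."""
    return [row for row, value in numbered if predicate(value)]


sales = [
    {"city": "A", "amount": 3},
    {"city": "A", "amount": 1},
    {"city": "B", "amount": 7},
    {"city": "A", "amount": 2},
    {"city": "B", "amount": 5},
]

# Keep the two smallest amounts per city, as a filter on the row number.
smallest_two = qualify(row_number(sales, "city", "amount"), lambda rn: rn <= 2)
```

Without a clause like QUALIFY, the same filter typically requires wrapping the window function in a subquery and filtering in the outer query.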
The release of StarRocks version 2.5 would have been impossible without the efforts of StarRocks' community of contributors. In this version, 165 contributors submitted a total of 2418 commits. If you are interested in learning more about StarRocks, please star/follow us on GitHub and join our Slack community!
Last but not least, thanks to all our contributors for making StarRocks better! ❤️
packy92, leoyy0316, Linkerist, DorianZheng, luohaha, meegoo, evelynzhaojie, amber-create, mchades, Astralidea, satanson, mofeiatwork, ZiheLiu, Seaven, trueeyu, sevev, stdpain, banmoy, Smith-Cruise, ABingHuang, fzhedu, xiaoyong-z, decster, gengjun-git, HangyuanLiu, wuyunfeng, stephen-shelby, alvin-coding, chaoyli, shshenhua, titianqx, Knight0xffff, hellolilyliuyi, Youngwb, EsoragotoSpirit, sduzh, rickif, dirtysalt, shileifu, waittttting, xlfjcg, zombee0, kevincai, wangsimo0, ss892714028, starrocks-xupeng, hffariel, nshangyiming, zaorangyang, liuyehcf, tomscut, femiiii, srlch, silverbullet233, choury, GavinMar, caneGuy, Pslydhh, JackeyLee007, jiacheng-celonis, LiShuMing, kangkaisen, dulong41, imay, guangxuCheng, zhenxiao, miomiocat, padmejin, motto1314, wanpengfei-git, wangruin, huangfeng1993, kateshaowanjou, wyb, wanweiqiangintel, Johnsonginati, QingdongZeng3, smartlxh, wangshisan, cbcbq, hiliuxg, blackstar-baba, abc982627271, zuyu, mxdzs0612, lixiaoer666, southernriver, dufeng1010, badboynt1, mateng0915, kingpluspk, amorynan, chen9t, xuzifu666, melt-code, selectbook, mapleFU, Alittleben, TszKitLo40, adzfolc, Toms1999, ucasfl, zddr, mikedias, srikker, hongli-my, feihengye, liukun4515, ryanyuan, predator4ann, RowenWoo, harveyyue, wanghuan2054, xlwh, wuleistarrocks, zhuxt2015, TBCCC, jaogoy, wuxueyang96, Crystal-LiuJing, samredai, aaawuanjun, rubiesvelt, harui7890, only2yangcao, even986025158, goodqiang, minchowang, changli6, screnwei, zhongyuankai, dreamay, sym-liuyang, MonsterChenzhuo, wangxiaobaidu11, happut, Zhangruichao, shyamrox, itweixiang, zdsg1024, ylcq, laotan332, zbtzbtzbt, long2ice, etr2460, liuqian1990, mklzl, chenyjsr, DeepThinker666, johndinh391, karan-kap00r, creatstar, Gabriel39, Ielihs, SaintBacchus, bigdata-kuxingseng, staman96, Gri-ffin, jsinwell, DebayanSen96, RishiKumarRay, Ccuurryy, wuqiao