Comparison: StarRocks vs Trino
StarRocks is an OLAP database and Trino is a query engine
Exploring the Differences
What makes a database? A database is a place where you can load and store data in the most optimal way for my queries. You move data into a database because of the optimizations and because someone is holding your team to a service level agreement on the amount of data stored, how fast questions are answered and how many concurrent users are asking their questions about the data. In a database, you might need ACID compliance; if you ask for a change, you want to be reasonable sure the change happens even if a hardware error or other random error occurs. StarRocks is a database.
On the other hand, a query engine is a piece of software that you can bring to the data to query it. In this case, you don’t really want to move the data. Maybe the data is too big and will take too long to move or can't upgrade/migrate because it's too expensive / too much work / lack of investment. Instead, you’d like to bring a query engine directly to the nodes or cloud that’s holding it and use the resources in place. In querying the data with a query engine, you are less concerned about optimizations and more interested in the ability to integrate disparate data sources or you may want to explore the data that is outside the constraints of service level agreements. Since some other process is probably writing the data, things like ACID compliance are less important. Trino is a query engine as are solutions like Apache Drill, Cloudera Impala and even Apache Spark.
Use Case Differences
You need a query engine when you have a lot of data stored and need to bring analytics to it. Companies will frequently store data in Amazon S3 or Hadoop without knowing the value of much of it. They may peel off portions of the data to their database, usually a data warehouse, to perform analytics. You can also point multiple query engines at the data, but the engines won’t necessarily run in isolation, nor will you automatically have workload management that can handle fast running queries and long running queries without bumping into each other.
You need a database when your database needs a new home that can deliver performance, ACID compliance and where backup and restore are part of the system. A database provides advanced methods for optimization through query and storage optimization techniques. Most importantly, you store data in a database when you’re expecting it to meet service level agreements on analytics. In other words, if you have to run x number of reports in x number of minutes, it’s a database that will get you there.
Performance
To calculate Trino's performance, you would add up the following components:
Trino query optimization
+
longest execution time [# of targets x (network time + target query execution)]
+
Trino query results merging
=
Trino query execution time
Basically your slowest source response will determine the speed of your query. Many times, the source data system is another SQL database so you're calling Trino to execute a SQL query on another database (which will do their own checking, their own optimizations, etc etc, a lot of duplicated work) and you also have the penalty of the network overhead in the call.
To calculate StarRock's performance, you would add up the following components:
StarRocks query optimization
+
disk or object store query time
+
StarRocks query results merging
=
StarRocks query execution time
Basically it's all local calls and thus the speed of your query is fast. Everything is internal and there is no outside system or dependency.
Benchmark
We performed a test on 99 queries against a TPC-DS 1TB dataset. We used StarRocks and Trino to query the same copy of data that is stored in Apache Iceberg table format with Parquet files
Trino's overall query response time is 5.54x slower than that of StarRocks. See https://www.starrocks.io/blog/benchmark-test for more details.
When is StarRocks a fit?
- Performance: Since it is a database with it's own storage, it's highly optimized to store and retrieve data. It also has the benefit to implement the latest in database technology like vectorized engine, columnar store, cost based optimization query planner and others. What is key is that it's not a "bolt" on but designed to full take advantage of those features.
- Large or costly infrastructure: Could be due to Trino being in-memory or due to the type of queries. More research has to be done but StarRocks could be better.
- SQL wire compatible protocol: Connect to anything and everything that supports the mySQL driver.
- Migration: Limited list of data targets and that list matches up with StarRocks; StarRocks will perform much better. That is due to new architecture, optimization techniques, etc etc that has come out since Trino was initially created.
When is Trino a fit?
- I need to integrate with many different data targets.
Resources