How to Manage Schema Migrations in StarRocks with SQLAlchemy and Alembic

StarRocks delivers high-performance analytics, but those gains depend on well-managed schemas. As systems grow, manual DDL quickly becomes fragile: ALTER TABLE statements get scattered, changes go untracked, and environments drift out of sync.
The problem isn't tooling; it's the absence of a versioned workflow. Without one, teams have no reliable history of schema changes, risk partial failures from non-transactional DDL, and struggle to coordinate changes across environments.
Two tools from the Python ecosystem solve this directly. SQLAlchemy is a database toolkit that lets you define your schema as Python classes instead of raw SQL; tables, columns, and relationships become code that any engineer can read, review, and version. Alembic is a migration framework built on top of SQLAlchemy: it compares your SQLAlchemy models with the current database schema and generates migration scripts for supported differences. Together, they bring the same discipline to database schemas that Git brings to application code.
Why Use SQLAlchemy + Alembic with StarRocks
The StarRocks SQLAlchemy dialect, combined with Alembic, replaces ad-hoc, hand-written DDL with a declarative, version-controlled approach. Instead of managing DDL manually, you define the schema in code and let Alembic generate migrations by diffing your models against the current database state.
This turns schema changes into a repeatable workflow: changes are tracked, reviewable, and applied consistently across environments.
Setting Up SQLAlchemy and Alembic for StarRocks
Prerequisites
Before starting, confirm you have the following:
- StarRocks Python client ≥ 1.3.2 — the minimum version required by the dialect
- SQLAlchemy ≥ 1.4 — 2.0 recommended, and required for sqlacodegen
- Alembic ≥ 1.16
Install the StarRocks client, which bundles the SQLAlchemy dialect:
```shell
pip install starrocks
```
Then install Alembic separately if not already present:
```shell
pip install alembic
```
Step 1: Connect to StarRocks
The connection URL supports an optional catalog prefix—useful for multi-catalog deployments. The default query port is 9030:
```
starrocks://<user>:<password>@<FE_host>:<query_port>/[<catalog>.]<database>
```
A minimal connection with a quick sanity check:
```python
from sqlalchemy import create_engine, text

engine = create_engine("starrocks://root@localhost:9030/mydatabase")

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))
```
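To make the URL grammar concrete, here is a small helper (our own sketch, not part of the dialect) that assembles a connection URL with the optional catalog prefix. Note that special characters in passwords would still need percent-encoding:

```python
def starrocks_url(user, host, database, password="", port=9030, catalog=None):
    """Build a StarRocks SQLAlchemy URL; the catalog segment is optional."""
    auth = f"{user}:{password}" if password else user
    db = f"{catalog}.{database}" if catalog else database
    return f"starrocks://{auth}@{host}:{port}/{db}"

print(starrocks_url("root", "localhost", "mydatabase"))
# starrocks://root@localhost:9030/mydatabase
```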
Step 2: Define Schema Models
StarRocks extends SQLAlchemy through starrocks_-prefixed table arguments. The starrocks_ prefix must be lowercase; the suffix after it is accepted in either case.
Supported StarRocks features:
- Key models: PRIMARY, DUPLICATE, UNIQUE, AGGREGATE
- Partitioning: RANGE, LIST, expression
- Distribution: HASH, RANDOM
- ENGINE: OLAP
- Table properties: replication_num, storage_medium, and others
ORM Example
Any column used in a key (e.g., starrocks_primary_key) must also set primary_key=True:
```python
from datetime import date

from sqlalchemy.orm import Mapped, mapped_column, declarative_base
from starrocks import BIGINT, VARCHAR, DATE

Base = declarative_base()

class Event(Base):
    __tablename__ = "events"

    user_id: Mapped[int] = mapped_column(BIGINT, primary_key=True)
    # The partition column must be part of the primary key in a
    # PRIMARY KEY table, so it is declared with primary_key=True too.
    event_date: Mapped[date] = mapped_column(DATE, primary_key=True)
    event_type: Mapped[str] = mapped_column(VARCHAR(50))

    __table_args__ = {
        "starrocks_primary_key": "user_id, event_date",
        "starrocks_partition_by": """
            RANGE (event_date) (
                START ("2026-01-01") END ("2027-01-01") EVERY (INTERVAL 1 DAY)
            )
        """,
        "starrocks_distributed_by": "HASH(user_id)",
        "starrocks_properties": {"replication_num": "3"},
    }
```
Core Style Example
```python
from sqlalchemy import Table, Column, MetaData
from starrocks import INTEGER

metadata = MetaData()

events = Table(
    "events",
    metadata,
    Column("id", INTEGER, primary_key=True),
    starrocks_primary_key="id",
    starrocks_distributed_by="HASH(id)",
)
```
Step 3: Initialize Alembic
Running alembic init migrations scaffolds the project structure—alembic.ini, migrations/env.py, and the migrations/versions/ directory where migration scripts live.
Step 4: Configure Alembic for StarRocks
alembic.ini
```ini
sqlalchemy.url = starrocks://root@localhost:9030/mydatabase
```
env.py
Import the StarRocks-specific helpers. Without these, autogenerate silently mishandles StarRocks-native column types and skips views and materialized views entirely:
```python
from alembic import context
from starrocks.alembic import render_column_type, include_object_for_view
from starrocks.alembic.starrocks import StarRocksImpl  # noqa: F401 -- imported for its side effect
from myapp.models import Base

target_metadata = Base.metadata

# Optional: set to match your cluster's replication factor
version_table_kwargs = {"starrocks_properties": {"replication_num": "1"}}
```
Offline mode
Use offline mode to generate SQL scripts without a live database connection:
```python
def run_migrations_offline():
    url = context.config.get_main_option("sqlalchemy.url")
    context.configure(
        url=url,
        target_metadata=target_metadata,
        render_item=render_column_type,
        include_object=include_object_for_view,
        version_table_kwargs=version_table_kwargs,
    )
    with context.begin_transaction():
        context.run_migrations()
```
Online mode
Use online mode when applying migrations directly against a live database:
```python
from sqlalchemy import engine_from_config, pool

def run_migrations_online():
    # Build the engine from the [alembic] section of alembic.ini
    connectable = engine_from_config(
        context.config.get_section(context.config.config_ini_section, {}),
        prefix="sqlalchemy.",
        poolclass=pool.NullPool,
    )
    with connectable.connect() as connection:
        context.configure(
            connection=connection,
            target_metadata=target_metadata,
            render_item=render_column_type,
            include_object=include_object_for_view,
            version_table_kwargs=version_table_kwargs,
        )
        with context.begin_transaction():
            context.run_migrations()
```
Step 5: Generate and Apply Migrations
With the model declared and Alembic configured, generate the first migration:
```shell
alembic revision --autogenerate -m "create events table"
```
Before applying, always preview the generated SQL—especially in production, where a partial failure requires manual remediation:
```shell
alembic upgrade head --sql
```
Once reviewed, apply the migration:
```shell
alembic upgrade head
```
Check current migration status at any time with alembic current.
Evolving an Analytics Events Table: A Production Walkthrough
Schema evolution is where the value of this approach becomes concrete. The following walkthrough traces an events table from initial design through column additions, views, and materialized views—all managed through migrations.
Phase 1: Adding Columns Safely (Without Manual DDL)
When marketing requires attribution tracking, the schema needs utm_source and utm_campaign. Update the model, then generate and apply the migration:
```shell
alembic revision --autogenerate -m "add marketing attribution columns"
alembic upgrade head
```
Alembic compares your SQLAlchemy models with the current database schema and generates migration scripts for supported differences. Because StarRocks allows only one schema change job per table at a time, each column addition should be its own migration—not batched together:
Migration 1 — add utm_source:
```python
def upgrade():
    op.add_column('events', sa.Column('utm_source', sa.VARCHAR(length=100)))
```
Migration 2 — add utm_campaign (separate migration run):
```python
def upgrade():
    op.add_column('events', sa.Column('utm_campaign', sa.VARCHAR(length=100)))
```
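Because batched additions are easy to miss in code review, a lightweight script check (a hypothetical helper, not part of Alembic or the StarRocks dialect) can flag migration scripts that issue multiple op.add_column calls against one table:

```python
import re

def add_column_targets(migration_source: str) -> list[str]:
    """Extract the table names passed to op.add_column in a migration script."""
    return re.findall(r"op\.add_column\(\s*['\"](\w+)['\"]", migration_source)

def batches_columns(migration_source: str) -> bool:
    """True if any single table receives more than one added column."""
    targets = add_column_targets(migration_source)
    return len(targets) != len(set(targets))

script = """
def upgrade():
    op.add_column('events', sa.Column('utm_source', sa.VARCHAR(100)))
    op.add_column('events', sa.Column('utm_campaign', sa.VARCHAR(100)))
"""
print(batches_columns(script))  # True: split these into two migrations
```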
StarRocks DDL is non-transactional: if an upgrade fails midway, there is no automatic rollback. Use alembic upgrade head --sql to inspect generated statements before each run, and check SHOW ALTER TABLE COLUMN to verify no schema change job is already in progress before applying.
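To make the pre-flight check on SHOW ALTER TABLE COLUMN repeatable, a polling helper along these lines can gate each upgrade. This is our own sketch: the callable that actually runs the query against StarRocks is left to you, since its result columns vary by version:

```python
import time

# Job states that mean a schema change is still in flight.
ACTIVE_STATES = {"PENDING", "WAITING_TXN", "RUNNING"}

def wait_for_schema_change(fetch_states, timeout=600, poll=5, _sleep=time.sleep):
    """Block until fetch_states() reports no active schema-change job.

    fetch_states should return the State values from SHOW ALTER TABLE COLUMN
    for the table being altered. Returns True once the table is clear,
    False if the timeout elapses first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if not ACTIVE_STATES.intersection(fetch_states()):
            return True
        if time.monotonic() >= deadline:
            return False
        _sleep(poll)
```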
Phase 2: Views as First-Class Schema Objects
The StarRocks dialect provides a View class in starrocks.schema that integrates with SQLAlchemy metadata and Alembic autogenerate:
```python
from starrocks.schema import View
from myapp.models import metadata

active_users_view = View(
    "active_users_view",
    metadata,
    definition="SELECT user_id, event_type FROM events WHERE event_date >= '2026-01-01'",
    columns=[
        {"name": "user_id", "comment": "User identifier"},
        {"name": "event_type", "comment": "Event type"},
    ],
    comment="Active users since Jan 2026",
)
```
Because include_object=include_object_for_view is set in env.py, Alembic detects and generates migrations for views automatically—no op.execute() required.
Phase 3: Materialized Views as Versioned Schema
As query volume grows, materialized views can precompute aggregations and reduce scan cost. The StarRocks dialect exposes a MaterializedView class in starrocks.schema that handles creation, refresh strategy, and partitioning as code:
```python
from starrocks.schema import MaterializedView
from myapp.models import metadata

daily_event_counts_mv = MaterializedView(
    "mv_daily_event_counts",
    metadata,
    definition="""
        SELECT event_date, event_type, COUNT(*) AS cnt
        FROM events
        GROUP BY event_date, event_type
    """,
    starrocks_refresh="ASYNC",
    starrocks_partition_by="event_date",
)
```
Alembic supports creation and deletion of materialized views through autogenerate, but not all ALTER operations. Changes to an existing MV’s definition or refresh strategy may require a manual drop-and-recreate migration. Partitioning on the MV must be compatible with the base table design—this is not validated automatically.
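For those unsupported ALTERs, a hand-written migration can issue the drop and recreate itself via op.execute(). A small helper that builds the two statements (hypothetical; check the CREATE MATERIALIZED VIEW syntax against the StarRocks docs for your version):

```python
def mv_recreate_statements(name: str, definition: str, refresh: str = "ASYNC") -> list[str]:
    """Return DROP + CREATE statements for recreating a materialized view,
    intended for use with op.execute() in a manual Alembic migration."""
    return [
        f"DROP MATERIALIZED VIEW IF EXISTS {name}",
        f"CREATE MATERIALIZED VIEW {name} REFRESH {refresh} AS {definition}",
    ]

for stmt in mv_recreate_statements(
    "mv_daily_event_counts",
    "SELECT event_date, COUNT(*) AS cnt FROM events GROUP BY event_date",
):
    print(stmt)
```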
Phase 4: Extending Partition Windows Carefully
When your data grows beyond the original partition range, update starrocks_partition_by in your model, then generate a migration:
```shell
alembic revision --autogenerate -m "extend partition range"
alembic upgrade head --sql   # always review before applying
```
Partition changes can be complex and are not always fully automated—Alembic detects supported differences, but always review the generated SQL carefully before applying to production.
Best Practices Grounded in StarRocks’ DDL Behavior
These practices address the specific characteristics of StarRocks DDL, not general migration hygiene.
- The One-at-a-Time Rule: StarRocks generally allows only one schema change job per table at a time. If your migration script issues multiple op.add_column calls against the same table, the second job may be rejected. Split each column addition into its own migration, and use SHOW ALTER TABLE COLUMN to confirm the previous job completed before applying the next.
- Omit BUCKETS for Auto-Scaling: Do not hardcode a bucket count in starrocks_distributed_by. Omitting it lets StarRocks use Automatic Bucket Determination; hardcoding it makes Alembic autogenerate report a diff whenever the model's bucket count and the cluster's actual bucket count disagree, even though nothing has intentionally changed.
- The Replication Trap: If you're developing on a single-BE cluster and your model defaults to "replication_num": "3", migrations, including Alembic's internal version table creation, will hang indefinitely. Set version_table_kwargs in env.py to match your cluster topology, and keep dev and production configs separate.
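One way to keep those configs separate without maintaining two env.py files is to read the replication factor from an environment variable. This is our own convention; the variable name STARROCKS_REPLICATION_NUM is not part of the dialect:

```python
import os

def replication_kwargs(env=os.environ):
    """Pick replication_num from the environment; default to 1 so a
    single-BE dev cluster never hangs waiting for unreachable replicas."""
    num = env.get("STARROCKS_REPLICATION_NUM", "1")
    return {"starrocks_properties": {"replication_num": num}}

# In env.py; production sets STARROCKS_REPLICATION_NUM=3 in its deployment.
version_table_kwargs = replication_kwargs()
```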
- Safe Reverse Engineering: If you have an existing StarRocks schema, use sqlacodegen with the dialect-specific flags to ensure distribution keys and partition definitions are preserved in the generated models:
```shell
sqlacodegen --options include_dialect_options,keep_dialect_types \
    --generator tables \
    starrocks://user:pass@host:9030/db > models.py
```
Summary of the Workflow
- Modify Model: Update your SQLAlchemy class (e.g., add a column or change a property).
- Generate: alembic revision --autogenerate -m "description"
- Inspect: Use alembic upgrade head --sql to verify the StarRocks-native DDL before it touches a live system.
- Apply: alembic upgrade head
Schema drift is a slow-moving problem: teams don't notice it until they're debugging a production query that works in dev and breaks in prod. Getting the env.py configuration right, particularly include_object_for_view and version_table_kwargs, is what makes autogenerate reliable enough to trust in production. Everything after that is workflow.
📚 Recommended reading:
To get started and explore the full dialect reference, visit the StarRocks SQLAlchemy documentation.
💬 Have questions about StarRocks? Join the StarRocks community on Slack — it’s where the team and thousands of users discuss real-world deployments, troubleshooting, and best practices.