
StarRocks delivers high-performance analytics, but those gains depend on well-managed schemas. As systems grow, manual DDL quickly becomes fragile: ALTER TABLE statements get scattered, changes go untracked, and environments drift out of sync.

The problem isn’t tooling; it’s the lack of a versioned workflow. Without one, teams have no reliable history of schema changes, risk partial failures from non-transactional DDL, and struggle to keep environments in sync.

Two tools from the Python ecosystem solve this directly. SQLAlchemy is a database toolkit that lets you define your schema as Python classes instead of raw SQL; tables, columns, and relationships become code that any engineer can read, review, and version. Alembic is a migration framework built on top of SQLAlchemy: it compares your SQLAlchemy models with the current database schema and generates migration scripts for supported differences. Together, they bring the same discipline to database schemas that Git brings to application code.

 

Why Use SQLAlchemy + Alembic with StarRocks

The StarRocks SQLAlchemy dialect, combined with Alembic, replaces manual DDL management with a declarative, version-controlled approach. Instead of hand-writing ALTER statements, you define the schema in code and let Alembic generate migrations by diffing models against the current database state.

This turns schema changes into a repeatable workflow: changes are tracked, reviewable, and applied consistently across environments.

 

Setting Up SQLAlchemy and Alembic for StarRocks

Prerequisites

Before starting, confirm you have the following:

  1. StarRocks Python client ≥ 1.3.2 — the minimum version required by the dialect
  2. SQLAlchemy ≥ 1.4 — 2.0 recommended, and required for sqlacodegen
  3. Alembic ≥ 1.16

 

Install the StarRocks client, which bundles the SQLAlchemy dialect:

```shell
pip install starrocks
```

 

Then install Alembic separately if not already present:

```shell
pip install alembic
```

 

Step 1: Connect to StarRocks

The connection URL supports an optional catalog prefix—useful for multi-catalog deployments. The default query port is 9030:

```
starrocks://<user>:<password>@<FE_host>:<query_port>/[<catalog>.]<database>
```

 

A minimal connection with a quick sanity check:

```python
from sqlalchemy import create_engine, text

engine = create_engine("starrocks://root@localhost:9030/mydatabase")

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))
```

 

Step 2: Define Schema Models

The StarRocks dialect extends SQLAlchemy via starrocks_-prefixed table arguments. The starrocks_ prefix must be lowercase; the suffix that follows it is accepted in either case.

 

Supported StarRocks features:

 

  1. Key models: PRIMARY, DUPLICATE, UNIQUE, AGGREGATE
  2. Partitioning: RANGE, LIST, expression
  3. Distribution: HASH, RANDOM
  4. ENGINE: OLAP
  5. Table properties: replication_num, storage_medium, and others

 

ORM Example

Any column used in a key (e.g., starrocks_primary_key) must also set primary_key=True:

```python
from datetime import date

from sqlalchemy.orm import Mapped, mapped_column, declarative_base
from starrocks import BIGINT, DATE

Base = declarative_base()

class Event(Base):
    __tablename__ = "events"

    user_id: Mapped[int] = mapped_column(BIGINT, primary_key=True)
    # In a PRIMARY KEY table, the partition column must be part of the key
    event_date: Mapped[date] = mapped_column(DATE, primary_key=True)

    __table_args__ = {
        "starrocks_primary_key": "user_id, event_date",
        "starrocks_partition_by": """
            RANGE (event_date) (
                START ("2026-01-01") END ("2027-01-01") EVERY (INTERVAL 1 DAY)
            )
        """,
        "starrocks_distributed_by": "HASH(user_id)",
        "starrocks_properties": {"replication_num": "3"},
    }
```

 

Core Style Example

```python
from sqlalchemy import Table, Column, MetaData
from starrocks import INTEGER

metadata = MetaData()

events = Table(
    "events",
    metadata,
    Column("id", INTEGER, primary_key=True),
    starrocks_primary_key="id",
    starrocks_distributed_by="HASH(id)",
)
```

 

Step 3: Initialize Alembic

Running alembic init migrations scaffolds the project structure—alembic.ini, migrations/env.py, and the migrations/versions/ directory where migration scripts live.

 

Step 4: Configure Alembic for StarRocks

alembic.ini

```ini
sqlalchemy.url = starrocks://root@localhost:9030/mydatabase
```

 

env.py

Import the StarRocks-specific helpers. Without these, autogenerate silently mishandles StarRocks-native column types and skips views and materialized views entirely:

```python
from alembic import context
from starrocks.alembic import render_column_type, include_object_for_view
from starrocks.alembic.starrocks import StarRocksImpl  # imported for its side effect: registers the StarRocks Alembic impl
from myapp.models import Base

target_metadata = Base.metadata

# Optional: set to match your cluster's replication factor
version_table_kwargs = {"starrocks_properties": {"replication_num": "1"}}
```

 

Offline mode

Use offline mode to generate SQL scripts without a live database connection:

```python
def run_migrations_offline():
    url = context.config.get_main_option("sqlalchemy.url")
    context.configure(
        url=url,
        target_metadata=target_metadata,
        render_item=render_column_type,
        include_object=include_object_for_view,
        version_table_kwargs=version_table_kwargs,
    )
    with context.begin_transaction():
        context.run_migrations()
```

 

Online mode

Use online mode when applying migrations directly against a live database:

```python
from sqlalchemy import engine_from_config, pool

def run_migrations_online():
    # Build the engine from the [alembic] section of alembic.ini
    connectable = engine_from_config(
        context.config.get_section(context.config.config_ini_section),
        prefix="sqlalchemy.",
        poolclass=pool.NullPool,
    )
    with connectable.connect() as connection:
        context.configure(
            connection=connection,
            target_metadata=target_metadata,
            render_item=render_column_type,
            include_object=include_object_for_view,
            version_table_kwargs=version_table_kwargs,
        )
        with context.begin_transaction():
            context.run_migrations()
```

 

Step 5: Generate and Apply Migrations

With the model declared and Alembic configured, generate the first migration:

```shell
alembic revision --autogenerate -m "create events table"
```

 

Before applying, always preview the generated SQL—especially in production, where a partial failure requires manual remediation:

```shell
alembic upgrade head --sql
```

 

Once reviewed, apply the migration:

```shell
alembic upgrade head
```

 

Check current migration status at any time with alembic current.

 

Evolving an Analytics Events Table: A Production Walkthrough

Schema evolution is where the value of this approach becomes concrete. The following walkthrough traces an events table from initial design through column additions, views, and materialized views—all managed through migrations.

 

Phase 1: Adding Columns Safely (Without Manual DDL)

When marketing requires attribution tracking, the schema needs utm_source and utm_campaign. Update the model, then generate and apply the migration:

```shell
alembic revision --autogenerate -m "add marketing attribution columns"
alembic upgrade head
```

 

Because StarRocks allows only one schema change job per table at a time, each column addition should be its own migration rather than batched together:

 

Migration 1 — add utm_source:

```python
def upgrade():
    op.add_column('events', sa.Column('utm_source', sa.VARCHAR(length=100)))
```

 

Migration 2 — add utm_campaign (separate migration run):

```python
def upgrade():
    op.add_column('events', sa.Column('utm_campaign', sa.VARCHAR(length=100)))
```

 

StarRocks DDL is non-transactional: if an upgrade fails midway, there is no automatic rollback. Use alembic upgrade head --sql to inspect generated statements before each run, and check SHOW ALTER TABLE COLUMN to verify no schema change job is already in progress before applying.
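
A small pre-flight check makes that advice mechanical. The helper below is a sketch, not part of the dialect; `has_active_schema_change` and `safe_to_alter` are hypothetical names, and the `TableName`/`State` keys follow the documented output columns of SHOW ALTER TABLE COLUMN:

```python
# Schema change job states that are still in flight; FINISHED and
# CANCELLED jobs can be ignored.
ACTIVE_STATES = {"PENDING", "WAITING_TXN", "RUNNING"}

def has_active_schema_change(rows, table_name):
    """Return True if any SHOW ALTER TABLE COLUMN row for table_name is
    still active. Each row is a mapping with 'TableName' and 'State' keys."""
    return any(
        row["TableName"] == table_name and row["State"] in ACTIVE_STATES
        for row in rows
    )

def safe_to_alter(engine, table_name):
    """Query the cluster and report whether the next migration can
    safely target table_name."""
    from sqlalchemy import text  # local import keeps the check above pure
    with engine.connect() as conn:
        rows = [dict(r._mapping) for r in conn.execute(text("SHOW ALTER TABLE COLUMN"))]
    return not has_active_schema_change(rows, table_name)
```

Calling `safe_to_alter(engine, "events")` before each `alembic upgrade head` turns the manual check into a gate you can script into CI.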

 

Phase 2: Views as First-Class Schema Objects

The StarRocks dialect provides a View class in starrocks.schema that integrates with SQLAlchemy metadata and Alembic autogenerate:

```python
from starrocks.schema import View
from myapp.models import metadata

active_users_view = View(
    "active_users_view",
    metadata,
    definition="SELECT user_id, event_type FROM events WHERE event_date >= '2026-01-01'",
    columns=[
        {"name": "user_id", "comment": "User identifier"},
        {"name": "event_type", "comment": "Event type"},
    ],
    comment="Active users since Jan 2026",
)
```

 

Because include_object=include_object_for_view is set in env.py, Alembic detects and generates migrations for views automatically—no op.execute() required.

 

Phase 3: Materialized Views as Versioned Schema

As query volume grows, materialized views can precompute aggregations and reduce scan cost. The StarRocks dialect exposes a MaterializedView class in starrocks.schema that handles creation, refresh strategy, and partitioning as code:

```python
from starrocks.schema import MaterializedView
from myapp.models import metadata

daily_event_counts_mv = MaterializedView(
    "mv_daily_event_counts",
    metadata,
    definition="""
        SELECT event_date, event_type, COUNT(*) AS cnt
        FROM events
        GROUP BY event_date, event_type
    """,
    starrocks_refresh="ASYNC",
    starrocks_partition_by="event_date",
)
```

 

Alembic supports creation and deletion of materialized views through autogenerate, but not all ALTER operations. Changes to an existing MV’s definition or refresh strategy may require a manual drop-and-recreate migration. Partitioning on the MV must be compatible with the base table design—this is not validated automatically.
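
When a definition or refresh change does require drop-and-recreate, a manual migration with raw DDL is a workable fallback. The sketch below mirrors the MV above; it is an illustration, not autogenerate output, and the exact DDL should be verified against your StarRocks version:

```python
# In a real migration script, Alembic's template provides:
#   from alembic import op

# Raw DDL for the recreate path; name and query mirror the example above.
DROP_MV = "DROP MATERIALIZED VIEW IF EXISTS mv_daily_event_counts"

CREATE_MV = """
CREATE MATERIALIZED VIEW mv_daily_event_counts
REFRESH ASYNC
AS
SELECT event_date, event_type, COUNT(*) AS cnt
FROM events
GROUP BY event_date, event_type
"""

def upgrade():
    # Non-transactional: if CREATE fails after DROP succeeds, the MV is
    # gone until the migration is re-run. Review with --sql first.
    op.execute(DROP_MV)
    op.execute(CREATE_MV)

def downgrade():
    op.execute(DROP_MV)
```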

 

Phase 4: Extending Partition Windows Carefully

When your data grows beyond the original partition range, update starrocks_partition_by in your model, then generate a migration:

```shell
alembic revision --autogenerate -m "extend partition range"
alembic upgrade head --sql   # always review before applying
```

 

Partition changes can be complex and are not always fully automated. Alembic detects supported differences, but always review the generated SQL carefully before applying it to production.
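
When autogenerate cannot express the change, the window can also be extended manually using StarRocks' batch partition syntax. This is a sketch with illustrative date ranges, not generated output; as elsewhere, `op` comes from Alembic's migration template:

```python
# In a real migration script, Alembic's template provides:
#   from alembic import op

# Batch-add one more year of daily partitions past the original window.
EXTEND_PARTITIONS = """
ALTER TABLE events ADD PARTITIONS
START ("2027-01-01") END ("2028-01-01") EVERY (INTERVAL 1 DAY)
"""

def upgrade():
    op.execute(EXTEND_PARTITIONS)

def downgrade():
    # Dropping partitions discards their data; left as a no-op on purpose.
    pass
```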



Best Practices Grounded in StarRocks’ DDL Behavior

These practices address the specific characteristics of StarRocks DDL, not general migration hygiene.

  1. The One-at-a-Time Rule

    StarRocks generally allows only one schema change job per table at a time. If your migration script issues multiple op.add_column calls against the same table, the second job may be rejected. Split each column addition into its own migration, and use SHOW ALTER TABLE COLUMN to confirm the previous job completed before applying the next.

  2. Omit BUCKETS for Auto-Scaling

    Do not hardcode a bucket count in starrocks_distributed_by. Omitting it lets StarRocks use Automatic Bucket Determination, while hardcoding it creates constant diff noise in Alembic autogenerate whenever the cluster's chosen bucket count differs from the model.

  3. The Replication Trap

    If you’re developing on a single-BE cluster and your model defaults to "replication_num": "3", migrations—including Alembic’s internal version table creation—will hang indefinitely. Set version_table_kwargs in env.py to match your cluster topology, and keep dev and production configs separate.
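
One way to keep dev and production replication settings separate without maintaining two copies of env.py is to read the replication factor from an environment variable. A minimal sketch; the SR_REPLICATION_NUM variable name is an assumption, not a convention of the dialect:

```python
import os

# Default to 1 for a single-BE dev cluster; export SR_REPLICATION_NUM=3
# in environments with three or more BE nodes.
replication_num = os.getenv("SR_REPLICATION_NUM", "1")

version_table_kwargs = {
    "starrocks_properties": {"replication_num": replication_num}
}
```

Passed to context.configure() as shown in Step 4, this keeps a single env.py valid across clusters.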

  4. Safe Reverse Engineering

    If you have an existing StarRocks schema, use sqlacodegen with the dialect-specific flags to ensure distribution keys and partition definitions are preserved in the generated models:

```shell
sqlacodegen --options include_dialect_options,keep_dialect_types \
    --generator tables \
    starrocks://user:pass@host:9030/db > models.py
```

 

Summary of the Workflow

  1. Modify Model: Update your SQLAlchemy class (e.g., add a column or change a property).
  2. Generate: alembic revision --autogenerate -m "description"
  3. Inspect: Use alembic upgrade head --sql to verify the StarRocks-native DDL before it touches a live system.
  4. Apply: alembic upgrade head

 

Schema drift is a slow-moving problem: teams don’t notice it until they’re debugging a production query that works in dev and breaks in prod. Getting the env.py configuration right, particularly include_object_for_view and version_table_kwargs, is what makes autogenerate reliable enough to trust in production. Everything after that is workflow.

 

📚 Recommended reading:

To get started and explore the full dialect reference, visit the StarRocks SQLAlchemy documentation.

 

💬 Have questions about StarRocks? Join the StarRocks community on Slack — it’s where the team and thousands of users discuss real-world deployments, troubleshooting, and best practices.