Pre-populated sample databases as Docker images — ready-to-run PostgreSQL, MySQL, CockroachDB, and SQLite containers loaded with real, valid sample data (Chinook, Northwind, Sakila/Pagila, World, AdventureWorks, Stack Exchange, and more).
Ever needed a database already populated with valid data — to practice SQL, run tests, demo an app, or benchmark — without hand-crafting rows or hunting for a usable dump? Every image ships exactly one dataset in its own database, so you just docker run and connect.
- Dataset support matrix
- Databases
- PostgreSQL images
- MySQL images
- CockroachDB images
- SQLite images
- Usage
- Custom images
- Testing
- Build caching
- Future Work
Each cell is the image tag to pull for that dataset on that engine; — means it isn't shipped there (yet). The dataset name links to its upstream source when every engine pulls from the same one; where engines use different upstreams, the source link is on the individual tag instead. All images are published for linux/amd64 and linux/arm64.
| Dataset | PostgreSQL | MySQL | CockroachDB | SQLite |
|---|---|---|---|---|
| AdventureWorks | adventureworks |
— | — | — |
| Airlines | airlines |
— | — | — |
| Chinook | yugabyte-chinook |
chinook |
chinook |
chinook |
| Dell DVD Store | dellstore |
dellstore |
— | dellstore |
| French Towns | frenchtowns |
frenchtowns |
— | frenchtowns |
| ISO 3166 | iso3166 |
iso3166 |
— | iso3166 |
| MoMA | moma |
moma |
— | moma |
| Northwind | yugabyte-northwind |
northwind |
northwind |
northwind |
| OMDb | omdb |
— | — | — |
| PGExercises | yugabyte-pgexercises |
pgexercises |
— | pgexercises |
| Sakila / Pagila | pagila |
sakila |
— | sakila |
| SportsDB | sportsdb, yugabyte-sportsdb |
sportsdb |
— | sportsdb |
| Stack Exchange¹ | stackexchange-<site> |
stackexchange-<site> |
— | stackexchange-<site> |
| USDA | usda |
usda |
— | usda |
| World | world |
world |
— | world |
¹ <site> is one of beer, coffee, poker, woodworking, chess, cooking (e.g. stackexchange-chess).
Every engine also publishes a latest tag: it tracks world on PostgreSQL and MySQL, and chinook on CockroachDB and SQLite.
Four database engines are supported, each published as its own image repository:
- PostgreSQL as
aa8y/postgres-dataset. We use thealpineversion of the official image as the base image to keep our image slim. - MySQL as
aa8y/mysql-dataset. There is no official Alpine image for Oracle MySQL (the officialmysqlimage is Oracle Linux / Debian based) and Alpine's own package repositories ship MariaDB in place of MySQL, so to keep the "thin, Alpine-based" goal we build on the communityyobasystems/alpine-mariadbimage. MariaDB is the MySQL drop-in Alpine substitutes, and its entrypoint honours the sameMYSQL_*env vars and/docker-entrypoint-initdb.d/*.sqlconvention as the official postgres image, so the dataset pattern carries over unchanged. - CockroachDB as
aa8y/cockroach-dataset. There is no official Alpine image (the officialcockroachdb/cockroachimage is UBI-minimal), but it is slim (~170 MB) and multi-arch, and its entrypoint honours the same/docker-entrypoint-initdb.d/*.sqlconvention as the official postgres image (plus aCOCKROACH_DATABASEenv var) when the container is started withstart-single-node. CockroachDB is PostgreSQL wire- and SQL-compatible, so the dataset pattern carries over and these reuse the same PostgreSQL-dialect sample dumps. - SQLite as
aa8y/sqlite-dataset. SQLite is serverless — a database is just a file — so there is no server to boot and no init scripts; the build assembles the database file and the image ships it. We use the Alpine, statically-linkedkeinos/sqlite3image (multi-arch) as the base, keeping the image genuinely thin and Alpine-based.
The original images: each aa8y/postgres-dataset image carries exactly one dataset, loaded into a database named after the dataset, and is built through an Extract -> Transform -> Load Dockerfile driven by manifest.yml. (Sources are in the matrix; the notes below are PostgreSQL-specific.)
yugabyte-chinook(dbchinook): 11 tables in thepublicschema, quoted CamelCase identifiers (e.g."Track","InvoiceLine").yugabyte-pgexercises(dbpgexercises): 3 tables in a dedicatedcdschema (notpublic).sportsdb/yugabyte-sportsdb: all 107 tables are created, but only the generic infrastructure tables plus American football, baseball, basketball, and ice hockey carry data — motor racing, soccer, tennis, wagering, and weather are schema-only.pagila: thepaymenttable is range-partitioned by month (payment_p2022_NN), so row counts split across the parent and its partitions; upstream periodically shifts the sample dates to the current year, so absolute dates change between rebuilds.omdb: CSVs are fetched at build time and shipped in the image so\copyresolves at start; the init script creates thetsm_system_rowsextension the upstream views rely on. Heavy (~150 MB of CSV + indexes).adventureworks: the upstream port pulls Microsoft's CSV bundle and runs a Python reformat before loading (68 tables across 5 schemas). Heavy (~90 MB of CSV).airlines: 9 tables in abookingsschema (search_pathdefaults to it). Upstream ships a single gzippedpg_dumpof its owndemodatabase, so the build decompresses it and strips theDROP/CREATE DATABASE/\connectdirectives so it loads into theairlinesdatabase. The heaviest dataset (several million inlined rows); the snapshot URL is date-stamped and may need bumping if postgrespro retires the pinned file.moma: MoMA ships only CSV/JSON, so the schema is authored in-repo (postgres/scripts/moma/schema.sql, every columntext) and the CSVs ship alongside the init script; counts drift as MoMA refreshes its exports (recorded as floors).stackexchange-<site>(db = bare site name): the dump ships only per-table XML, so a shared build hook (postgres/scripts/stackexchange/transform) converts it toCREATE TABLE+ inlineCOPY+ indexes at build time (8 tables inpublic). Every site shares one schema and builds through that one hook, so adding a site is just another tag; counts are recorded as floors.cookingis the largest (~500k votes, ~230k post-history rows).
The database inside each image is the bare dataset name — the tag minus any yugabyte-/stackexchange- prefix. Those source prefixes exist so a dataset could ship from a second mirror later; sportsdb and yugabyte-sportsdb are the same image today, with the unprefixed sportsdb kept as a backwards-compatible alias.
There is no multi-dataset all image anymore — each image is one dataset; for several at once, run one container per dataset (e.g. via docker-compose). pagila was removed in 2019 over an upstream breakage and is back as a regular tag, since the fork loads cleanly on modern Postgres and one dataset can no longer break the others.
The MySQL images mirror the PostgreSQL ones — one dataset per image, same ETL Dockerfile driven by manifest.yml. The engine is MariaDB (see Databases for why); it is wire- and SQL-compatible with MySQL for these samples. Because each image is a single dataset, the build strips any database-level DDL the upstream dump ships (CREATE/DROP DATABASE/SCHEMA, USE) and loads everything into one database named after the dataset.
Start a container and connect with the mariadb (MySQL-compatible) client:
docker run -d --name my-ds-<tag> aa8y/mysql-dataset:<tag>
docker exec -it my-ds-<tag> mariadb -uroot -pmysql <db_name>
where <tag> is one of the tags in the MySQL column of the matrix and <db_name> is the matching dataset name (the tag itself, minus any stackexchange- prefix). The root password is mysql.
MySQL-native sources, used directly (sources in the matrix; notes below are MySQL-specific):
sakila: stands in for PostgreSQL'spagila(which is itself a Sakila port); 16 base tables.film_textis populated by anAFTER INSERTtrigger onfilmrather than by bulk data, and unlikepagilathepaymenttable is not partitioned.world: MySQL's nativeworld—city,country,countrylanguage(3 tables), identical row counts to the PostgreSQLworld.chinook: the vendor's MySQL-specificChinook_MySql.sql(releasev1.4.5); CamelCase identifiers (e.g.`Track`), with the script'sCREATE DATABASE Chinookstripped so it loads into the lowercasechinookdatabase.northwind: the dalers/mywind port of Microsoft's Access sample (snake_case, 20 tables) — a larger conversion than the 14-table PostgreSQLyugabyte-northwind.moma: schema authored in-repo (mysql/scripts/moma/schema.sql, every columntext); CSVs bulk-loaded at start with server-sideLOAD DATA INFILE. Counts drift as MoMA refreshes its exports (recorded as floors).stackexchange-<site>: per-table XML converted at build time by a shared hook (mysql/scripts/stackexchange) toCREATE TABLE+ batchedINSERTs + indexes (CamelCase identifiers). It maps PostgreSQLint/timestamp/texttoINT/DATETIME(6)/MEDIUMTEXTand adds a key-prefix length to text-column indexes (MySQL cannot index a fullTEXT).cookingis the largest; counts are recorded as floors.
The remaining datasets have no MySQL-native source, but their PostgreSQL dumps are plain DDL + data, so they are hand-translated at build time through a shared mysql/scripts/pgsql transform hook. It converts COPY blocks to batched INSERTs, rewrites PostgreSQL types to their MySQL equivalents (character varying→varchar, timestamp→datetime, double precision→double, bare numeric→decimal, and text→varchar(255) so a text column can serve as a key, which MySQL forbids for TEXT), drops PostgreSQL-only noise (sequences, OWNER TO, GRANT/REVOKE, USING btree/hash/lsm, schema qualifiers), lower-cases table identifiers, transcodes Latin-1 dumps to UTF-8, and drops any PL/pgSQL stored functions (no mechanical MySQL translation — the schema and all data still load). Row counts match the PostgreSQL datasets exactly.
iso3166: thetwo_lettercountry primary key (referenced bysubcountry) becomesvarcharso it can be a key.frenchtowns: the dump declares CamelCase tables but loads lower-case, which the hook reconciles by lower-casing table names; accented commune names survive (the source is UTF-8).usda: the pgFoundry tarball is Latin-1, so the hook transcodes it to UTF-8 before loading.pgexercises: upstream lives in a dedicatedcdschema; the hook strips thecd.qualifier so it loads into the singlepgexercisesdatabase.sportsdb: beyond the usual fixes the hook drops Yugabyte'sUSING lsmindex access method and an unusedCREATE DOMAIN; the 96 unique constraints and 137 foreign keys survive the translation.dellstore: the dump ships anew_customerPL/pgSQL function (an unused app helper); the hook drops it — the schema, keys, and all data still load.
The remaining PostgreSQL datasets are either sourced from PostgreSQL-only upstreams or rely on PostgreSQL-specific features (PL/pgSQL, extensions, pg_dump internals) that can't be hand-translated without diverging from the upstream dataset. Plain DDL + data dumps are instead hand-translated (see the group above); these are the ones that remain PostgreSQL-only:
pagila: not omitted but replaced —pagilais a port of Sakila to PostgreSQL, and MySQL uses the original Sakila directly (tagsakila, above).adventureworks: the only maintained open port (lorint/AdventureWorks-for-Postgres) targets PostgreSQL. AdventureWorks is a Microsoft SQL Server sample with no comparable, maintained MySQL port, and its build relies on a Python reformat plus multiple schemas and materialized views — too much PostgreSQL-specific machinery to hand-translate faithfully.airlines: the postgrespro demo is distributed as a binary-ish PostgreSQLpg_dumpand leans on PostgreSQL features (jsonb, several million inlined rows); it is PostgreSQL-only.omdb: df7cb/omdb-postgresql is PostgreSQL-specific — its views rely on thetsm_system_rowsextension (no MySQL equivalent), so a port would have to drop them and would no longer be the upstream dataset.yugabyte-chinook,yugabyte-northwind,yugabyte-sportsdb: superseded on MySQL by the native/portedchinook,northwind, andsportsdbtags above (the Yugabyte SQL is PostgreSQL dialect;sportsdbis hand-translated from the same dump, so the prefixed tag is not duplicated here).
The CockroachDB images mirror the PostgreSQL ones — one dataset per image, same ETL Dockerfile driven by manifest.yml. The engine is CockroachDB (see Databases for the base-image choice); it is PostgreSQL wire- and SQL-compatible, so these reuse the same PostgreSQL-dialect dumps the Yugabyte PostgreSQL tags do. The official cockroachdb/cockroach entrypoint creates the database named by the COCKROACH_DATABASE env var and runs every /docker-entrypoint-initdb.d/*.sql script against it (under start-single-node), so — unlike the postgres images — the build emits no CREATE DATABASE header; the database is the bare dataset name.
The images run a single-node cluster in insecure mode (these are throwaway practice/test images, mirroring the trivial credentials the postgres/mysql images use), which keeps connecting simple. Start a container and connect with the built-in cockroach sql client:
docker run -d --name cr-ds-<tag> aa8y/cockroach-dataset:<tag>
docker exec -it cr-ds-<tag> cockroach sql --insecure --database <db_name>
where <tag> is one of the tags in the CockroachDB column of the matrix and <db_name> is the matching dataset name. Schema and identifier details match the PostgreSQL chinook/northwind notes above (same dumps).
The SQLite images follow the same one-dataset-per-image model, but since SQLite is serverless the build inverts: rather than shipping init scripts that run at container start, the build assembles the database file and the final image carries it. Each aa8y/sqlite-dataset image carries exactly one dataset as /data/<dataset>.db, built through the Dockerfile driven by manifest.yml: a dataset is described either by a native SQLite SQL script (fed to the sqlite3 CLI to build the database) or by a prebuilt SQLite database file (shipped as-is).
Start a container and open the database with the bundled sqlite3 shell:
docker run -it --rm aa8y/sqlite-dataset:<tag>
which opens /data/<db_name>.db directly. You can also run a one-off query:
docker run --rm aa8y/sqlite-dataset:<tag> /usr/bin/sqlite3 /data/<db_name>.db "SELECT count(*) FROM ..."
where <tag> is one of the tags in the SQLite column of the matrix and <db_name> is the matching dataset name.
Sources are in the matrix; the notes below are SQLite-specific:
chinook: built at image-build time from the vendor's nativeChinook_Sqlite.sqlscript (releasev1.4.5); CamelCase identifiers (Track,InvoiceLine), with row counts matching the otherchinooktags exactly.northwind: the prebuilt jpwhite3/northwind-SQLite3 database shipped as-is — the port's expanded edition, whoseOrdersand especially"Order Details"tables carry far more rows than the classic sample, so this image is heavier than the others.world: the pgFoundry PostgreSQLworlddump hand-translated at build time through the sharedsqlite/scripts/pgsqltransform hook (COPY→ batchedINSERTs, Postgres-only noise stripped); three tables (city,country,countrylanguage) with row counts matching the otherworldtags exactly.iso3166,frenchtowns,usda,pgexercises,dellstore,sportsdb: same sharedsqlite/scripts/pgsqlhook asworld— plain PostgreSQL DDL + data dumps rewritten for SQLite at build time. SQLite cannot add constraints viaALTER TABLE, so PK/FK/unique constraints from the dump are dropped; tables and row counts still load faithfully (matching the MySQL tags for these datasets).sakila: the bradleygrant/sakila-sqlite3 port's prebuiltsakila_master.dbshipped as-is — stands in for PostgreSQL'spagila(16 base tables, MySQL-compatible row counts).moma: schema authored in-repo (sqlite/scripts/moma/schema.sql, every columntext); CSVs bulk-loaded at build time with the sqlite3 CLI's.importdot-command. Counts drift as MoMA refreshes its exports (recorded as floors).stackexchange-<site>: per-table XML converted at build time by a shared hook (sqlite/scripts/stackexchange) toCREATE TABLE+ batchedINSERTs + indexes (double-quoted CamelCase identifiers).cookingis the largest; counts are recorded as floors.
The remaining datasets are either sourced from PostgreSQL-only upstreams or rely on PostgreSQL-specific features that can't be hand-translated without diverging from the upstream dataset. Plain DDL + data dumps are instead hand-translated (see the group above); these are the ones that remain PostgreSQL-only:
pagila: not omitted but replaced —pagilais a port of Sakila to PostgreSQL, and SQLite uses a native Sakila port directly (tagsakila, above).adventureworks: the only maintained open port targets PostgreSQL; AdventureWorks is a Microsoft SQL Server sample with no comparable, maintained SQLite port, and its build relies on a Python reformat plus multiple schemas and materialized views — too much PostgreSQL-specific machinery to hand-translate faithfully.airlines: the postgrespro demo is distributed as a binary-ish PostgreSQLpg_dumpand leans on PostgreSQL features (jsonb, several million inlined rows); it is PostgreSQL-only.omdb: df7cb/omdb-postgresql is PostgreSQL-specific — its views rely on thetsm_system_rowsextension (no SQLite equivalent), so a port would have to drop them and would no longer be the upstream dataset.
You can start the container by running:
docker run -d --name pg-ds-<tag> aa8y/postgres-dataset:<tag>
and access it by:
docker exec -it pg-ds-<tag> psql -d <db_name>
where <tag> is one of the tags in the matrix and <db_name> is the dataset baked into it — the tag itself, minus any yugabyte-/stackexchange- prefix (e.g. yugabyte-chinook → chinook, stackexchange-beer → beer). You can also use them with docker-compose. See this example for information on how to use them.
Each image carries one dataset, selected with the DATASET build arg along with that dataset's sources (declared per tag in manifest.yml). The simplest way to build a tag is through dave:
dave build -c postgres -t dellstore
The recommended way to add or change a dataset is to declare its extractUrl, sqlFiles and any extras (extraPrereqs, dbExtension, cdDir) under a new tag in manifest.yml — the ETL Dockerfile reads them as build args. You can also invoke docker build directly by passing those same args, e.g.:
docker build -t aa8y/postgres-dataset:world postgres \
--build-arg DATASET=world \
--build-arg EXTRACT_URL=https://ftp.postgresql.org/pub/projects/pgFoundry/dbsamples/world/world-1.0/world-1.0.tar.gz \
--build-arg SQL_FILES=dbsamples-0.1/world/world.sql
and then following the same aforementioned steps for using your custom image.
There are two layers of tests.
Structure tests are static container-structure-test configs under
test/config/ — a shared common.yaml plus one file per dataset. They assert
on the image filesystem and the shipped init scripts without booting Postgres.
The configs to apply per tag are declared in manifest.yml under
structureTest: and run natively by dave structure-test.
Integration (smoke) tests actually boot each image and query the live
database. For every dataset shipped in a tag, test/integration/run.sh:
- waits for Postgres to finish initializing (TCP readiness, so all init scripts have completed),
- lists the base tables and asserts the set exactly matches the expected set (no missing tables, no unexpected extras), and
- runs
SELECT count(*)on every table and asserts the row counts match.
Expected tables and counts live per-dataset as JSON under test/expected/,
e.g. test/expected/iso3166.json:
{
"public.country": 242,
"public.subcountry": 3995
}keyed by schema-qualified table name, with authoritative count(*) values.
A value is normally an exact count. For datasets whose data is fetched from a
live upstream at build time and so drifts between builds (omdb from
www.omdb.org, moma from MoMA's CSV exports, and beer from the
Stack Exchange data dump), the value is instead a floor like
">=59274" and the test asserts count(*) >= that number. --update writes
floors automatically for such datasets.
These are wired into dave test via the test: template in manifest.yml,
which renders per tag and passes the image tag plus the one dataset baked into
it (run.sh still accepts a comma-separated list, so it keeps working if a
multi-dataset image is ever reintroduced). The script needs docker and jq
on the host; psql runs inside the container.
brew install container-structure-test jq # one-time
dave build
dave structure-test # static checks
dave test # live smoke tests (boots images)
# scope to specific tags locally (note: -c postgres is required with -t):
dave test -c postgres -t iso3166 -t dellstoreTo (re)generate an expected file after an intentional dataset change, run the script in update mode against a freshly built image:
test/integration/run.sh --update iso3166 iso3166 # <tag> <datasets-csv>CI runs all three commands; see .github/workflows/ci.yml.
Every tag uses a registry build cache
so the expensive EXTRACT layer (the upstream download) and the
TRANSFORM/LOAD layers after it are reused across builds instead of being
redone from scratch on every CI run. The cache is read on dave build
(--cache-from) and written on dave push (--cache-to ... mode=max),
stored per tag as <repository>:buildcache-<tag>. A missing cache ref is a
cache miss, not an error, so the first build of a new tag simply populates it.
Correctness is gated on the dataset's actual upstream content. Before each
build, bin/dataset-checksum computes a cheap, stable fingerprint of the
tag's source(s) — the git ls-remote HEAD SHA for *.git sources, the real
md5/sha1 from archive.org's JSON metadata API
(via jq) for the Stack Exchange dumps, or the
ETag/Last-Modified/Content-Length from an HTTP HEAD (with a ranged-GET
fallback) for other file URLs — without downloading the data. The fingerprint is
passed to the build as the DATASET_CHECKSUM build arg, which the builder
references just before EXTRACT:
- when the upstream is unchanged, the fingerprint is identical and the cached layers are reused (fast);
- when the upstream changes, the fingerprint changes, busting
EXTRACTand cascading a rebuild throughTRANSFORMandLOAD(fresh).
The fingerprint is recorded on the final image as the
org.opencontainers.image.revision label (docker inspect). The computation
is fail-open: a source that exposes no usable metadata (or is momentarily
unreachable) collapses to a stable marker and caches as before rather than
forcing a spurious full rebuild. The script needs curl, jq, and git on
the host (all present on the CI runners; jq is already required by the
integration tests).
- More MySQL datasets: port additional PostgreSQL datasets where a MySQL-native source exists or the upstream is format-neutral enough to hand-translate faithfully (see Datasets not ported to MySQL).
- More CockroachDB datasets: most plain DDL + data PostgreSQL dumps should load with little or no change.
- Images for other popular databases.
- Find and add more free data sources.