The Data Definition Engine (DDE) is an open-source tool that automatically generates CDISC regulatory submission artifacts from a structured clinical trial protocol expressed in the Unified Study Definitions Model (USDM). It is developed as part of the CDISC 360i Program by the Define-XML Generation Project Team.
USDM stands for Unified Study Definitions Model, a TransCelerate / CDISC standard that represents a clinical trial's complete protocol in a machine-readable JSON format. It captures study objectives, arms, visits, eligibility criteria, and assessments in a structured, vendor-neutral way. It is the input to this tool and the "source of truth" for the study. It was co-developed through a formal partnership between CDISC and TransCelerate BioPharma as part of TransCelerate's Digital Data Flow (DDF) initiative.
- Background
- How It Works
- Study Artifacts
- Prerequisites
- Installation
- Usage
- Project Structure
- Key Concepts
- Current Status & Roadmap
- Contributing
- License
- References
Clinical trials submitted to regulatory agencies (such as the FDA) must include standardized metadata files that describe the structure, content, and meaning of all datasets. Producing these files, most notably Define-XML, has traditionally been a manual, error-prone, and time-consuming process.
The CDISC 360i program aims to automate this process end-to-end: starting from a machine-readable study protocol (USDM), the DDE derives all the metadata needed to generate regulatory submission artifacts. It eliminates the gap between protocol design and data submission by using the same source of truth throughout the study lifecycle.
The DDE implements a pipeline with two processing steps, a loader and a generator, linked by a central intermediate model:
USDM Study Design JSON
         │
         ▼
┌───────────────────┐
│      LOADER       │  create_define_json.py
│                   │  • Reads the USDM protocol
│                   │  • Fetches Biomedical Concepts
│                   │  • Retrieves Dataset Specializations
│                   │  • Calls the CDISC Library API
└─────────┬─────────┘
          │
          ▼  DDS JSON (define.json)  ◄── central intermediate model
          │
┌─────────┴─────────┐
│     GENERATOR     │  define_generator.py
│                   │  • Reads the DDS JSON
│                   │  • Builds Define-XML v2.1 elements
│                   │  • Writes the output XML
└─────────┬─────────┘
          │
          ▼
Define-XML v2.1 (.xml)
HTML rendering (.html)
Loaders extract metadata from various sources and populate the central Data Definition Specification (DDS) model (a JSON file). Generators consume that DDS JSON and produce the final submission artifacts.
Why a central JSON model? No single source has all the metadata needed for a full submission. The DDS acts as an aggregation layer, combining protocol content, biomedical concept definitions, controlled terminology, and any manually filled gaps into one validated model.
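As a rough illustration of the aggregation idea, the sketch below merges metadata fragments from several sources into one dictionary, letting each later source fill only the fields that earlier ones left empty. The function name `aggregate`, the source names, and the field names (`label`, `codelist`, `origin`) are hypothetical and are not taken from the DDS schema.

```python
from typing import Any, Dict, List, Optional

Metadata = Dict[str, Optional[Any]]


def aggregate(sources: List[Metadata]) -> Metadata:
    """Merge sources in priority order; a later source only fills gaps."""
    merged: Metadata = {}
    for source in sources:
        for key, value in source.items():
            # Keep the first non-null value seen for each field.
            if merged.get(key) is None:
                merged[key] = value
    return merged


# Hypothetical fragments for one variable, one per metadata source.
from_protocol = {"name": "VSORRES", "label": None, "codelist": None}
from_biomedical_concept = {"label": "Result or Finding in Original Units"}
from_manual_patch = {"codelist": None, "origin": "Collected"}

variable = aggregate([from_protocol, from_biomedical_concept, from_manual_patch])
```

Fields that no source can supply (here `codelist`) stay `None`, which is the kind of gap a later manual step would need to fill.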
| Artifact | Format | Description | Status |
|---|---|---|---|
| SDTM Define-XML | .xml | Metadata file describing SDTM dataset structure, variables, and controlled terminology for FDA submission | Implemented |
| ADaM Define-XML | .xml | Same for Analysis Datasets (ADaM) | Planned |
| ODM CRFs | .xml | Case Report Form definitions in ODM format for data collection | Planned |
| Trial Design Datasets | .json | TA, TD, TE, TI, TM, TS, TV datasets describing study design | Planned |
| Dataset-JSON shells | .json | Empty dataset templates in Dataset-JSON format | Planned |
- Python 3.8+
- A CDISC Library API key, required to fetch Biomedical Concepts and Dataset Specializations during loading. Request one at CDISC Library.
- Git (to clone the repository)
# Clone the repository
git clone https://github.com/cdisc-org/data-definition-engine.git
cd data-definition-engine
# Install loader dependencies
pip install -r src/define-xml/requirements.txt
# Install generator dependencies
pip install -r src/generators/define/requirements.txt

Set your CDISC Library API key by creating a .env file in src/define-xml/:

# src/define-xml/.env
CDISC_API_KEY=your_api_key_here

Or pass it directly on the command line with --cdisc_api_key.
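The precedence between the two options can be sketched as follows: a value given with --cdisc_api_key wins, otherwise the CDISC_API_KEY environment variable is used. This is a minimal sketch, not the loader's actual code. The helper name `resolve_api_key` is hypothetical, and the sketch assumes the variables from the .env file have already been loaded into the environment (it does not read the file itself).

```python
import argparse
from typing import List, Mapping, Optional


def resolve_api_key(argv: List[str], env: Mapping[str, str]) -> Optional[str]:
    """Return the key from --cdisc_api_key, else from CDISC_API_KEY, else None."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--cdisc_api_key", default=None)
    # parse_known_args lets this sketch ignore the other loader arguments.
    args, _unknown = parser.parse_known_args(argv)
    return args.cdisc_api_key or env.get("CDISC_API_KEY")


# The command-line value takes precedence over the environment.
cli_key = resolve_api_key(["--cdisc_api_key", "from-cli"], {"CDISC_API_KEY": "from-env"})
# Without the flag, the environment variable is used.
env_key = resolve_api_key(["--sdtmct", "2024-09-27"], {"CDISC_API_KEY": "from-env"})
```

In a real script the two arguments would typically be `sys.argv[1:]` and `os.environ`.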
The loader reads a USDM protocol JSON file, enriches it with metadata from the CDISC Library, and writes a DDS JSON file.
Windows (PowerShell): use a backtick (`) for line continuation instead of \.
# bash / macOS / Linux
cd src/define-xml
python create_define_json.py \
--usdm_file ../../data/protocol/LZZT/usdm/pilot_LLZT_protocol.json \
--output_template ./output/define.json \
--sdtmct 2024-09-27

# Windows PowerShell
cd src/define-xml
python create_define_json.py `
--usdm_file ..\..\data\protocol\LZZT\usdm\pilot_LLZT_protocol.json `
--output_template .\output\define.json `
--sdtmct 2024-09-27

All arguments:
| Argument | Required | Default | Description |
|---|---|---|---|
| --usdm_file | Yes | - | Path to the USDM input JSON file |
| --output_template | Yes | - | Path for the output DDS JSON file |
| --sdtmct | Yes | - | SDTM Controlled Terminology date (yyyy-mm-dd) |
| --sdtmig | No | 3.4 | SDTM Implementation Guide version |
| --studyversion | No | 0 | Study version index in the USDM file (0-based) |
| --studydesign | No | 0 | Study design index (0-based) |
| --docversion | No | 0 | Document version index (0-based) |
| --cdisc_api_key | No | env var | CDISC Library API key (falls back to CDISC_API_KEY) |
| --cosmosversion | No | v2 | CDISC Cosmos API version |
| --validate | No | - | Validate output against a LinkML YAML schema (uses define.yaml if no path given) |
| --validation_report | No | - | Path to write an Excel validation report (required with --validate) |
| --patch_file | No | - | Generate a YAML patch file listing all placeholder/null fields |
| --apply_patch | No | - | Apply a completed patch file to fill in placeholder values |
| --debug | No | False | Save intermediate dictionaries as JSON files for inspection |
| --cacert | No | - | Path to a CA bundle (.pem) for SSL verification; use when behind a corporate proxy |
| --no_ssl_verify | No | False | Disable SSL certificate verification (use only in trusted environments) |
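
To make the 0-based index arguments concrete, here is a minimal sketch of how --studyversion and --studydesign could select one study design from a parsed USDM file. It assumes the JSON nests as study, then versions, then studyDesigns. Those key names are an assumption to check against the USDM version in use, and `select_study_design` is a hypothetical helper rather than a function of the loader. The --docversion index is not covered.

```python
from typing import Any, Dict


def select_study_design(usdm: Dict[str, Any], studyversion: int = 0,
                        studydesign: int = 0) -> Dict[str, Any]:
    """Pick one study design using 0-based indices, with readable errors."""
    versions = usdm["study"]["versions"]  # assumed key names
    if not 0 <= studyversion < len(versions):
        raise IndexError(
            f"--studyversion {studyversion} out of range (0..{len(versions) - 1})")
    designs = versions[studyversion]["studyDesigns"]  # assumed key name
    if not 0 <= studydesign < len(designs):
        raise IndexError(
            f"--studydesign {studydesign} out of range (0..{len(designs) - 1})")
    return designs[studydesign]


# Tiny stand-in for a USDM file, not real study content.
example = {"study": {"versions": [
    {"studyDesigns": [{"name": "Design A"}, {"name": "Design B"}]}
]}}
chosen = select_study_design(example, studyversion=0, studydesign=1)
```

Checking the bounds explicitly gives a message that names the offending argument instead of a bare list-index error.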
Tip: On the first run, use --patch_file gaps.yaml to generate a list of all fields that could not be derived automatically. Fill in the values, then re-run with --apply_patch gaps.yaml.
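The gap-filling workflow in the tip can be pictured as two small operations: collect the location of every null or placeholder value, then write the supplied values back. The sketch below works on plain nested dictionaries and lists. The `"TODO"` placeholder marker, the tuple-based path format, the `itemGroups` layout, and both function names are hypothetical; the actual patch file is YAML and its layout is not described here.

```python
from typing import Any, Dict, List, Tuple, Union

Path = Tuple[Union[str, int], ...]
PLACEHOLDER = "TODO"  # hypothetical marker for a value that could not be derived


def find_gaps(node: Any, path: Path = ()) -> List[Path]:
    """Return the path of every None or placeholder value in nested dicts/lists."""
    if isinstance(node, dict):
        items = list(node.items())
    elif isinstance(node, list):
        items = list(enumerate(node))
    else:
        return [path] if node is None or node == PLACEHOLDER else []
    gaps: List[Path] = []
    for key, child in items:
        gaps.extend(find_gaps(child, path + (key,)))
    return gaps


def apply_patch(doc: Any, patch: Dict[Path, Any]) -> None:
    """Write each patched value back into doc, in place."""
    for path, value in patch.items():
        target = doc
        for key in path[:-1]:
            target = target[key]
        target[path[-1]] = value


# Hypothetical DDS fragment with one null label and one placeholder origin.
dds = {"itemGroups": [{"name": "VS", "label": None,
                       "items": [{"name": "VSORRES", "origin": "TODO"}]}]}
gaps = find_gaps(dds)
apply_patch(dds, {gaps[0]: "Vital Signs", gaps[1]: "Collected"})
```

Running `find_gaps` again after patching is a simple way to confirm that nothing was missed.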
The generator reads the DDS JSON file and produces a Define-XML v2.1 file.
# bash / macOS / Linux
cd src/generators/define
python define_generator.py \
--template ../../define-xml/output/define.json \
--define ./output/define.xml

# Windows PowerShell
cd src\generators\define
python define_generator.py `
--template ..\..\define-xml\output\define.json `
--define .\output\define.xml

All arguments:
| Argument | Short | Required | Default | Description |
|---|---|---|---|---|
| --template | -t | Yes | - | Path to the DDS JSON input file |
| --define | -d | No | (built-in default) | Path for the output Define-XML .xml file |
| --validate | -s | No | False | Schema-validate the generated XML after writing |
| --log-level | -l | No | INFO | Logging level: DEBUG, INFO, WARNING, ERROR, CRITICAL |
Processing is logged to define_generator.log.
# bash / macOS / Linux, from the repository root
# 1. Load: USDM → DDS JSON
cd src/define-xml
python create_define_json.py \
--usdm_file ../../data/protocol/LZZT/usdm/pilot_LLZT_protocol.json \
--output_template ../../output/define.json \
--sdtmct 2024-09-27 \
--validate \
--validation_report ../../output/validation_report.xlsx
# 2. Generate: DDS JSON → Define-XML
# (src/generators/define is three levels below the repository root)
cd ../generators/define
python define_generator.py \
--template ../../../output/define.json \
--define ../../../output/define.xml \
--validate

# Windows PowerShell, from the repository root
# 1. Load: USDM → DDS JSON
cd src\define-xml
python create_define_json.py `
--usdm_file ..\..\data\protocol\LZZT\usdm\pilot_LLZT_protocol.json `
--output_template ..\..\output\define.json `
--sdtmct 2024-09-27 `
--validate `
--validation_report ..\..\output\validation_report.xlsx
# 2. Generate: DDS JSON → Define-XML
# (src\generators\define is three levels below the repository root)
cd ..\generators\define
python define_generator.py `
--template ..\..\..\output\define.json `
--define ..\..\..\output\define.xml `
--validate

The resulting output/define.xml is your SDTM Define-XML v2.1 submission file.
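Before rendering or submitting, a quick sanity check on the generated file can be useful. The sketch below, which is not part of the DDE, parses a Define-XML document with the Python standard library and lists the dataset names found in its ItemGroupDef elements. It uses the ODM v1.3 namespace that Define-XML v2.1 builds on; the helper name `list_datasets` and the inline sample document are illustrative only.

```python
import xml.etree.ElementTree as ET
from typing import List

ODM_NS = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}


def list_datasets(define_xml: str) -> List[str]:
    """Return the Name attribute of every ItemGroupDef in a Define-XML string."""
    root = ET.fromstring(define_xml)  # raises ParseError if not well-formed
    return [ig.get("Name", "") for ig in root.iterfind(".//odm:ItemGroupDef", ODM_NS)]


# Minimal stand-in document; a real define.xml carries far more metadata.
sample = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3">
  <Study OID="S1"><MetaDataVersion OID="MDV1" Name="Demo">
    <ItemGroupDef OID="IG.DM" Name="DM" Repeating="No"/>
    <ItemGroupDef OID="IG.VS" Name="VS" Repeating="Yes"/>
  </MetaDataVersion></Study>
</ODM>"""
datasets = list_datasets(sample)
```

To check a file on disk rather than a string, `ET.parse(path).getroot()` can replace the `ET.fromstring` call.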
To render it as HTML for human review, apply the bundled XSL stylesheet:
# Using xsltproc (Linux/macOS) or Saxon (Windows)
xsltproc src/generators/define/define2-1.xsl output/define.xml > output/define.html

data-definition-engine/
│
├── data/                          # Sample study data for development and testing
│   ├── protocol/LZZT/usdm/        # CDISC pilot study LZZT in USDM format
│   └── metadata_xlsx/LZZT/        # SDTM and ADaM metadata spreadsheets (LZZT)
│
├── documents/
│   ├── Solution_Overview.md       # Architecture design document
│   └── glossary.md                # Definitions of key terms
│
├── HowTos/                        # Guides and GIF walkthroughs
│
└── src/
    ├── define-xml/                # LOADER: USDM → DDS JSON
    │   ├── create_define_json.py  # Main loader script
    │   ├── define.yaml            # LinkML schema for the DDS model
    │   └── requirements.txt
    │
    └── generators/
        └── define/                # GENERATOR: DDS JSON → Define-XML
            ├── define_generator.py
            ├── define2-1.xsl      # XSL stylesheet for HTML rendering
            ├── requirements.txt
            └── tests/
                └── fixtures/      # Sample DDS JSON and expected XML/HTML outputs
| Term | Definition |
|---|---|
| USDM (Unified Study Definitions Model) | A TransCelerate / CDISC standard that represents a complete clinical trial protocol as structured, machine-readable JSON. It is the primary input to the DDE. |
| CDISC 360i | A CDISC initiative to make the full clinical trial lifecycle, from protocol to submission, machine-readable and interoperable. |
| DDS (Data Definition Specification) | The central intermediate JSON model in the DDE pipeline. It aggregates metadata from all sources and acts as the single input for all generators. |
| Define-XML | An XML file submitted alongside clinical trial datasets that describes their structure, variables, permitted values, and controlled terminology. Required by the FDA. It is based on ODM v1.3.2. |
| Biomedical Concepts (BCs) | Standardized, reusable definitions of clinical observations (e.g., "Heart Rate") maintained in the CDISC Library. |
| Dataset Specializations (DSSs) | CDISC Library mappings that describe how a Biomedical Concept is represented in a specific SDTM domain. |
| CDISC Library | CDISC's REST API providing access to controlled terminology, SDTM variables, Biomedical Concepts, and Dataset Specializations. |
| odmlib | A Python library for creating and parsing CDISC ODM and Define-XML documents, used internally by the generator. |
| LinkML | A modeling language used to define and validate the DDS JSON schema (define.yaml). |
| VLM (Variable Level Metadata) | Metadata that applies to specific values within a variable (e.g., rules that only apply when VSTESTCD = "SYSBP"). |
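
To illustrate the VLM entry, the sketch below models value-level metadata as a list of entries, each guarded by a where-clause, and selects the entries that apply to a given record. The entry layout, field names, and values are hypothetical and do not reflect the DDS schema or the Define-XML encoding.

```python
from typing import Any, Dict, List

# Hypothetical value-level entries for VSORRES; field names are illustrative only.
VLM_ENTRIES: List[Dict[str, Any]] = [
    {"where": {"VSTESTCD": "SYSBP"}, "data_type": "integer", "unit": "mmHg"},
    {"where": {"VSTESTCD": "TEMP"}, "data_type": "float", "unit": "C"},
]


def applicable_metadata(record: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the entries whose where-clause matches every listed variable."""
    return [
        entry for entry in VLM_ENTRIES
        if all(record.get(var) == val for var, val in entry["where"].items())
    ]


# Made-up record, not taken from any study.
row = {"USUBJID": "SUBJ-001", "VSTESTCD": "SYSBP", "VSORRES": "120"}
matches = applicable_metadata(row)
```

Only the systolic blood pressure entry applies to this record, so the same variable can carry different data types and units depending on the test code.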
The project is in active development, currently completing Phase 2 of the CDISC 360i Program.
Phase 1 (complete):
- USDM loader (create_define_json.py)
- SDTM Define-XML generator (define_generator.py)
- DDS JSON schema (define.yaml)
Phase 2 (in progress):
- ADaM Define-XML generator
- ODM CRF generator
- Dataset-JSON shell generator
- Incremental loading with metadata provenance tracking
- Quality and conformance checks
This project is provided "as is" without warranty or guarantee of suitability for any particular purpose. Expect breaking changes as the new ADaM models and additional generators are developed.
Contributions are welcome. Please read CONTRIBUTING.md before submitting pull requests. All contributions must follow the Code of Conduct and will fall under the project licenses below.
Code is licensed under the MIT License.
Content is licensed under CC-BY-4.0.
When re-using content, please cite as:
Content based on Data Definition Engine (GitHub) used under the CC-BY-4.0 license.