Working with Galaxy Workflow Models#

This guide describes the typed model layers available for programmatically working with Galaxy workflows. The layers are organized from highest convenience (top) to lowest level (bottom):

  ┌─────────────────────────────────────────────────────┐
  │  ensure_format2() / ensure_native()                 │
  │  "Give me any workflow, I'll give you typed models"  │
  │  Auto-detects format, converts, normalizes, expands │
  │  ↳ from gxformat2.normalized                        │
  ├─────────────────────────────────────────────────────┤
  │  Expanded Models                                    │
  │  ExpandedFormat2 / ExpandedNativeWorkflow            │
  │  All subworkflow refs resolved to inline models     │
  │  ↳ from gxformat2.normalized                        │
  ├─────────────────────────────────────────────────────┤
  │  Normalized Models                                  │
  │  NormalizedFormat2 / NormalizedNativeWorkflow         │
  │  Structural guarantees: lists, ids, parsed state    │
  │  ↳ from gxformat2.normalized                        │
  ├─────────────────────────────────────────────────────┤
  │  Schema Models (lax / strict)                       │
  │  GalaxyWorkflow / NativeGalaxyWorkflow               │
  │  Direct pydantic parse — no normalization           │
  │  ↳ from gxformat2.schema.gxformat2                  │
  │  ↳ from gxformat2.schema.native                     │
  ├─────────────────────────────────────────────────────┤
  │  Raw dicts                                          │
  │  yaml.ordered_load() / json.load()                  │
  │  No validation, no types                            │
  └─────────────────────────────────────────────────────┘

The ensure_ Layer — Recommended Entry Point#

The ensure_format2() and ensure_native() functions are the primary way to get typed workflow models. They accept any workflow representation and return normalized (or expanded) models in the target format, handling format detection and conversion automatically.

ensure_format2#

from gxformat2.normalized import ensure_format2

# From a file path (native .ga or Format2 .gxwf.yml)
nf2 = ensure_format2("workflow.ga")
nf2 = ensure_format2("workflow.gxwf.yml")

# From a raw dict (auto-detects native vs Format2)
nf2 = ensure_format2(workflow_dict)

# From any typed model — passes through or converts
nf2 = ensure_format2(native_galaxy_workflow)
nf2 = ensure_format2(normalized_native_workflow)
nf2 = ensure_format2(galaxy_workflow_schema_model)

# With expansion (resolves @import, URLs, subworkflow refs)
from gxformat2.options import ConversionOptions
expanded = ensure_format2(workflow, options=ConversionOptions(), expand=True)

Returns: NormalizedFormat2 by default, ExpandedFormat2 when expand=True. Overloads provide correct static types based on the expand literal.
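
The overload pattern described above can be sketched with typing.overload and Literal. This is an illustration only: NormalizedSketch, ExpandedSketch, and ensure_sketch are hypothetical stand-ins, not the library's actual definitions.

```python
from typing import Any, Literal, overload


class NormalizedSketch:
    """Hypothetical stand-in for a normalized model."""


class ExpandedSketch(NormalizedSketch):
    """Hypothetical stand-in for an expanded model (subclass of the normalized one)."""


@overload
def ensure_sketch(workflow: Any, expand: Literal[False] = ...) -> NormalizedSketch: ...
@overload
def ensure_sketch(workflow: Any, expand: Literal[True]) -> ExpandedSketch: ...


def ensure_sketch(workflow: Any, expand: bool = False) -> NormalizedSketch:
    # A static type checker selects the return type from the literal value of
    # `expand`; at runtime only this single implementation runs.
    return ExpandedSketch() if expand else NormalizedSketch()
```

With this shape, a type checker infers ExpandedSketch for ensure_sketch(wf, expand=True) and NormalizedSketch otherwise, so callers need no casts.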

ensure_native#

from gxformat2.normalized import ensure_native
from gxformat2.options import ConversionOptions

# Same flexibility — any input format
nnw = ensure_native("workflow.gxwf.yml")
nnw = ensure_native(native_dict)
nnw = ensure_native(normalized_format2_model)

# With expansion
expanded = ensure_native(workflow, options=ConversionOptions(), expand=True)

Returns: NormalizedNativeWorkflow by default, ExpandedNativeWorkflow when expand=True.

Format Detection#

For raw dicts, the functions detect the format by inspecting keys:

  • a_galaxy_workflow == "true" → native Galaxy format

  • class == "GalaxyWorkflow" → Format2

For file paths (str or Path), the file is loaded and then inspected. Typed model inputs (NormalizedFormat2, NativeGalaxyWorkflow, etc.) are routed by isinstance checks — already-normalized models pass through without re-processing.
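
The key inspection for raw dicts can be sketched as follows. The function name detect_format_sketch and the ValueError raised for unrecognized input are assumptions made for illustration; the library's own behavior for unknown dicts is not described here.

```python
from typing import Any, Mapping


def detect_format_sketch(workflow: Mapping[str, Any]) -> str:
    """Classify a raw workflow dict as "native" or "format2"."""
    # Native Galaxy workflows carry the string marker a_galaxy_workflow == "true".
    if workflow.get("a_galaxy_workflow") == "true":
        return "native"
    # Format2 workflows declare class: GalaxyWorkflow.
    if workflow.get("class") == "GalaxyWorkflow":
        return "format2"
    # Assumption: reject anything else rather than guessing.
    raise ValueError("unrecognized workflow format")
```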

Access Patterns for Applications#

Once you have a NormalizedFormat2, the model provides structured typed access to all workflow components:

nf2 = ensure_format2(workflow)

# Workflow metadata
nf2.label          # str | None
nf2.doc            # str | None (joined if originally a list)
nf2.tags           # list[str] | None
nf2.creator        # list[CreatorPerson | CreatorOrganization] | None
nf2.license        # str | None

# Inputs — always a list, shorthands expanded
for inp in nf2.inputs:
    inp.id           # str — always populated
    inp.type_        # GalaxyType | list[GalaxyType] | None
    inp.optional     # bool | None
    inp.default      # Any
    inp.doc          # str | None
    inp.position     # StepPosition | None

# Steps — always a list, ids populated
for step in nf2.steps:
    step.id          # str — always populated
    step.label       # str | None
    step.tool_id     # str | None
    step.type_       # WorkflowStepType | None
    step.doc         # str | None (joined if originally a list)
    step.position    # StepPosition | None
    step.run         # NormalizedFormat2 | str | dict | None
    step.in_         # list[WorkflowStepInput] — always a list
    step.out         # list[WorkflowStepOutput] — always a list
    step.connected_paths  # frozenset[str] — input ids with a source connection

# Workflow-level tool inventory (recurses into inline subworkflows)
nf2.unique_tools  # frozenset[ToolReference] — (tool_id, tool_version) pairs

# Outputs — always a list
for out in nf2.outputs:
    out.id           # str | None
    out.outputSource # str | None
    out.type_        # GalaxyType | list[GalaxyType] | None
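
To make the recursion behind unique_tools concrete, here is a minimal sketch over plain dicts rather than typed models. The function name unique_tools_sketch and the dict layout (a steps list with an optional inline run dict) are assumptions for illustration; string run values stand for unresolved references and are skipped.

```python
from typing import Any, Dict, FrozenSet, List, Optional, Tuple

ToolRef = Tuple[str, Optional[str]]  # (tool_id, tool_version)


def unique_tools_sketch(steps: List[Dict[str, Any]]) -> FrozenSet[ToolRef]:
    """Collect (tool_id, tool_version) pairs, descending into inline subworkflows."""
    found = set()
    for step in steps:
        if step.get("tool_id"):
            found.add((step["tool_id"], step.get("tool_version")))
        run = step.get("run")
        # Only inline subworkflows (dicts) are traversed; URL or import
        # references (strings) are left unresolved and therefore skipped.
        if isinstance(run, dict) and isinstance(run.get("steps"), list):
            found |= unique_tools_sketch(run["steps"])
    return frozenset(found)
```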

Source Reference Resolution#

Format2 workflows reference step outputs using strings like "step_label/output_name". When step labels themselves contain /, parsing is ambiguous. The normalized layer provides helpers:

# Resolve a source reference against this workflow's steps
ref = nf2.resolve_source("filter_step/output1")
ref.step_label   # "filter_step"
ref.output_name  # "output1"

# known_labels is a cached set of all step labels/ids + input ids
nf2.known_labels  # {"filter_step", "input1", "map_step", ...}

# Works with ambiguous labels containing "/"
# e.g. step labeled "Mass Spec Runs (MS/MS)"
ref = nf2.resolve_source("Mass Spec Runs (MS/MS)/spectra")
ref.step_label   # "Mass Spec Runs (MS/MS)"
ref.output_name  # "spectra"

The resolver tries known labels longest-first, then falls back to splitting on the first /. Bare references (no /) default to output_name="output".
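
The resolution order described above can be sketched in plain Python. This is not the library's implementation: SourceRefSketch and resolve_source_sketch are hypothetical names, and only the rules stated in this section are modeled.

```python
from typing import Iterable, NamedTuple


class SourceRefSketch(NamedTuple):
    step_label: str
    output_name: str


def resolve_source_sketch(source: str, known_labels: Iterable[str]) -> SourceRefSketch:
    """Split "label/output" using known labels, longest label first."""
    for label in sorted(known_labels, key=len, reverse=True):
        if source == label:
            # Bare reference to a known label: default output name.
            return SourceRefSketch(label, "output")
        if source.startswith(label + "/"):
            return SourceRefSketch(label, source[len(label) + 1:])
    if "/" in source:
        # Fallback: split on the first "/".
        step_label, _, output_name = source.partition("/")
        return SourceRefSketch(step_label, output_name)
    # Bare reference with no "/" at all.
    return SourceRefSketch(source, "output")
```

Trying longer labels first matters when one label is a prefix of another, or when a label contains "/" itself: the most specific match wins before the naive split is attempted.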

The standalone function resolve_source_reference() is also available:

from gxformat2.normalized import resolve_source_reference, SourceReference

ref: SourceReference = resolve_source_reference("step/out", known_labels)

ConversionOptions#

Both ensure_ functions and the underlying to_format2() / to_native() converters accept an optional ConversionOptions:

from gxformat2.options import ConversionOptions

options = ConversionOptions(
    workflow_directory="/path/to/dir",  # For resolving @import paths
    url_resolver=my_resolver,          # Custom fetcher for URL refs
)
expanded = ensure_format2(workflow, options=options, expand=True)

Key options:

  • workflow_directory — base path for @import resolution

  • url_resolver — callable for fetching URL / TRS references

  • encode_tool_state_json — JSON-encode tool_state in native output (default True)

  • state_encode_to_native — Galaxy-provided callback to encode Format2 state back to native tool_state (accepts (step_dict, state_dict), returns clean dict or None for default dict passthrough — no JSON encoding)

  • state_encode_to_format2 — Galaxy-provided callback to decode native tool_state to Format2 state (accepts step dict, returns state dict or None for default)

  • strict_structure — validate input against the strict schema (extra="forbid") before normalization and sanity-check conversion output against the target strict schema. Raises ValidationError on any unrecognised keys. Threaded into nested sub-workflow normalization ($graph dedup entries and inline run: dicts). Default False.

See ConversionOptions for the full parameter list.
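
As a rough illustration of the callback shape described for state_encode_to_native, the sketch below takes (step_dict, state_dict) and returns either a plain dict or None. Everything inside it is hypothetical: the function name, the "example_tool" id, and the transformation (dropping None values) are placeholders, not Galaxy's actual encoding logic.

```python
from typing import Any, Dict, Optional


def encode_state_sketch(
    step: Dict[str, Any], state: Dict[str, Any]
) -> Optional[Dict[str, Any]]:
    """Hypothetical callback with the (step_dict, state_dict) shape."""
    if step.get("tool_id") != "example_tool":
        # Returning None defers to the default behavior.
        return None
    # Return a plain dict; the callback itself does no JSON encoding.
    # Dropping None values is an arbitrary placeholder transformation.
    return {key: value for key, value in state.items() if value is not None}
```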

Real-World Usage#

The application modules in this project all use the ensure_ layer:

# abstract.py — CWL export
nf2 = ensure_format2(workflow_dict)
for step in nf2.steps:
    if isinstance(step.run, NormalizedFormat2):
        # Recurse into subworkflow
        ...

# cytoscape.py — visualization
nf2 = ensure_format2(workflow_path)
for step in nf2.steps:
    for step_input in step.in_:
        ref = nf2.resolve_source(step_input.source)
        # Build edge from ref.step_label → current step

# lint.py — validation
nf2 = ensure_format2(workflow_dict, expand=True)  # ExpandedFormat2
nnw = ensure_native(workflow_dict)                 # For native-specific checks

Normalized and Expanded Models#

For cases where you know the input format and don’t need auto-detection, the gxformat2.normalized package provides direct constructors. For cross-format conversion, to_format2() and to_native() convert between formats and return normalized (or expanded) models.

Cross-Format Conversion#

to_format2() converts native workflows to Format2, and to_native() converts Format2 workflows to native. Both accept the source format as a raw dict, file path, or typed model, and return normalized models in the target format:

from gxformat2.normalized import to_format2
from gxformat2.normalized import to_native

# Native → Format2
nf2 = to_format2(native_dict)
nf2 = to_format2(normalized_native_workflow)

# Format2 → Native
nnw = to_native(format2_dict)
nnw = to_native(normalized_format2_model)

# With expansion (resolves subworkflow refs in the result)
ef2 = to_format2(native_dict, options=options, expand=True)
enw = to_native(format2_dict, options=options, expand=True)

Unlike the ensure_ functions, these only accept the other format as input — to_format2() expects native input, to_native() expects Format2 input. Use ensure_ when you don’t know (or don’t care about) the input format.

Normalized Models#

NormalizedFormat2 guarantees:

  • steps, inputs, outputs are always list (never dict)

  • Step and input id fields are always populated

  • Input type shorthands expanded ("File" → "data", etc.)

  • doc fields joined from list to string

  • $graph multi-workflow documents resolved

  • $link entries in step state resolved — replaced with {"__class__": "ConnectedValue"} in state, connection source added to step.in_. After normalization, step.in_ contains ALL connections and step.state is clean

  • Step type_ always populated — inferred from context when not explicit in the YAML (run present → subworkflow, otherwise → tool). Convenience properties is_tool_step, is_subworkflow_step, is_pause_step, is_pick_value_step, and connected_paths (frozenset of input ids with a source) are also available

  • unique_tools property returns frozenset[ToolReference] of all (tool_id, tool_version) pairs, recursing into inline subworkflows (unresolved @import/URL refs are skipped; use expanded models for full coverage)
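
A few of these guarantees (list-shaped steps, populated ids, inferred step type) can be illustrated on plain dicts. This is a simplified sketch, not the library's normalizer: normalize_steps_sketch is a hypothetical name, and the positional fallback id for list-form steps is an assumption.

```python
from typing import Any, Dict, List, Union


def normalize_steps_sketch(
    steps: Union[Dict[str, Dict[str, Any]], List[Dict[str, Any]]]
) -> List[Dict[str, Any]]:
    """Return steps as a list with "id" and "type" always present."""
    if isinstance(steps, dict):
        # Dict form: the mapping key becomes the step id.
        items = [{"id": key, **value} for key, value in steps.items()]
    else:
        items = [dict(step) for step in steps]
    for index, step in enumerate(items):
        # Assumption: fall back to the position when no id is given.
        step.setdefault("id", str(index))
        if "type" not in step:
            # A step with a run block is a subworkflow; otherwise it is a tool.
            step["type"] = "subworkflow" if "run" in step else "tool"
    return items
```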

NormalizedNativeWorkflow guarantees:

  • tool_state always a parsed dict (JSON strings auto-decoded)

  • Optional containers default to empty (never None)

  • input_connections values always list[NativeInputConnection] (single connections wrapped during construction — no isinstance checks needed)

  • connected_paths property returns frozenset[str] of input_connections keys — O(1) membership test for checking whether a state path has an incoming connection

  • unique_tools property returns frozenset[ToolReference] of all (tool_id, tool_version) pairs, recursing into inline subworkflows

  • Tags normalized (empty string → empty list)

  • Subworkflows recursively normalized
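
The tool_state and input_connections guarantees can be sketched on a raw native step dict. The helper names below are hypothetical and the sketch covers only those two guarantees plus the connected_paths lookup.

```python
import json
from typing import Any, Dict, FrozenSet


def normalize_native_step_sketch(step: Dict[str, Any]) -> Dict[str, Any]:
    """Decode JSON tool_state and wrap single connections in a list."""
    result = dict(step)
    tool_state = result.get("tool_state")
    if isinstance(tool_state, str):
        # Native files often store tool_state as a JSON-encoded string.
        result["tool_state"] = json.loads(tool_state)
    connections = result.get("input_connections") or {}
    result["input_connections"] = {
        name: value if isinstance(value, list) else [value]
        for name, value in connections.items()
    }
    return result


def connected_paths_sketch(step: Dict[str, Any]) -> FrozenSet[str]:
    """Keys of input_connections, for constant-time membership tests."""
    return frozenset(step["input_connections"])
```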

All normalized models provide a to_dict() method that serializes to a JSON/YAML-compatible dict with aliases resolved and None values stripped:

nf2 = ensure_format2(workflow)
workflow_dict = nf2.to_dict()  # ready for json.dump() or yaml.dump()

nnw = ensure_native(workflow)
native_dict = nnw.to_dict()   # includes "a_galaxy_workflow", "format-version", etc.
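
When the input format is already known, the direct constructors normalized_format2() and normalized_native() accept a raw dict, a file path, or a schema model of the same format:

from gxformat2.normalized import normalized_format2, normalized_native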

# From same-format input
nf2 = normalized_format2(format2_dict)
nf2 = normalized_format2("/path/to/workflow.gxwf.yml")
nf2 = normalized_format2(galaxy_workflow_model)

nnw = normalized_native(native_dict)
nnw = normalized_native("/path/to/workflow.ga")
nnw = normalized_native(native_galaxy_workflow_model)

# For native dicts, use ensure_format2 or to_format2 instead:
from gxformat2.normalized import ensure_format2
nf2 = ensure_format2(native_dict)

Normalized models use extra="ignore": unknown keys in the input dict are silently dropped and never appear in to_dict() output. To reject them instead, pass strict_structure=True — the raw dict is validated against the strict schema (extra="forbid") before normalization and a pydantic.ValidationError is raised on any unrecognised keys:

from pydantic import ValidationError

try:
    nf2 = normalized_format2(format2_dict, strict_structure=True)
    nnw = normalized_native(native_dict, strict_structure=True)
except ValidationError as e:
    ...  # extra/unknown keys present

GalaxyUserToolStub and ImportReference (opaque passthroughs) keep extra="allow" regardless of this flag. The flag is also available on ConversionOptions for the to_* / ensure_* / expanded_* functions.

Expanded Models#

ExpandedFormat2 and ExpandedNativeWorkflow inherit from their normalized counterparts and additionally guarantee all external references are resolved to inline model instances:

  • ExpandedFormat2: every step.run is ExpandedFormat2 | None (no URL strings, no @import dicts)

  • ExpandedNativeWorkflow: every step.subworkflow is ExpandedNativeWorkflow | None (no content_id URL refs)

from gxformat2.normalized import expanded_format2, expanded_native
from gxformat2.options import ConversionOptions

options = ConversionOptions(workflow_directory="/path/to/dir")

ef2 = expanded_format2(format2_dict, options)
enw = expanded_native(native_dict, options)

# Also accepts already-normalized models
ef2 = expanded_format2(nf2, options)

Expansion fetches @import paths, HTTP/TRS URLs, and $graph references. Circular references raise ValueError. The maximum expansion depth is 10.
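
The cycle check and depth limit can be sketched with a recursive walk over plain dicts. This is a simplified model, not the library's expander: expand_sketch is a hypothetical name, references are looked up in an in-memory registry instead of being fetched, and raising ValueError when the depth limit is exceeded is an assumption.

```python
from typing import Any, Dict, Tuple

MAX_DEPTH = 10


def expand_sketch(
    workflow: Dict[str, Any],
    registry: Dict[str, Dict[str, Any]],
    _seen: Tuple[str, ...] = (),
    _depth: int = 0,
) -> Dict[str, Any]:
    """Replace string run references with inline workflow dicts."""
    if _depth > MAX_DEPTH:
        # Assumption: exceeding the limit is reported as ValueError.
        raise ValueError("maximum expansion depth exceeded")
    expanded_steps = []
    for step in workflow.get("steps", []):
        step = dict(step)
        run = step.get("run")
        if isinstance(run, str):
            if run in _seen:
                raise ValueError(f"circular reference: {run}")
            # Track the chain of references on the current path only.
            step["run"] = expand_sketch(registry[run], registry, _seen + (run,), _depth + 1)
        elif isinstance(run, dict):
            step["run"] = expand_sketch(run, registry, _seen, _depth + 1)
        expanded_steps.append(step)
    return {**workflow, "steps": expanded_steps}
```

Tracking references per path, rather than globally, lets the same subworkflow appear in two sibling steps without being mistaken for a cycle.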

Use expanded models when you need to traverse into subworkflows (linting, full-tree analysis). Use normalized models when you only need the top-level structure (visualization, simple metadata extraction).

Schema Models — Direct Pydantic Validation#

The schema models are auto-generated from schema-salad YAML definitions using schema-salad-plus-pydantic. They provide direct pydantic parsing with no normalization — fields appear exactly as in the source document.

Lax Models (extra="allow")#

Accept unknown fields. Use for real-world workflows that may contain Galaxy-version-specific or undocumented fields:

from gxformat2.schema.gxformat2 import GalaxyWorkflow
from gxformat2.schema.native import NativeGalaxyWorkflow

gw = GalaxyWorkflow.model_validate(format2_dict)
ngw = NativeGalaxyWorkflow.model_validate(native_dict)

Strict Models (extra="forbid")#

Reject unknown fields. Use for schema compliance checking:

from gxformat2.schema.gxformat2_strict import GalaxyWorkflow as StrictFormat2
from gxformat2.schema.native_strict import NativeGalaxyWorkflow as StrictNative

StrictFormat2.model_validate(format2_dict)   # raises on extra fields
StrictNative.model_validate(native_dict)     # raises on extra fields

The linter uses both: strict failure with lax success means only extra fields are present (warning), while lax failure indicates fundamental schema errors (error).
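
The two-pass classification can be sketched independently of pydantic. In this illustration the validators are toy functions that raise ValueError; classify_sketch, lax_toy, and strict_toy are hypothetical names, and the real linter's reporting is not reproduced.

```python
from typing import Any, Callable, Dict

Validator = Callable[[Dict[str, Any]], None]

ALLOWED_KEYS = {"class", "steps"}  # toy schema for the example


def lax_toy(document: Dict[str, Any]) -> None:
    """Toy lax check: required keys only, extras tolerated."""
    if "steps" not in document:
        raise ValueError("missing required key: steps")


def strict_toy(document: Dict[str, Any]) -> None:
    """Toy strict check: lax rules plus no unknown keys."""
    lax_toy(document)
    extra = set(document) - ALLOWED_KEYS
    if extra:
        raise ValueError(f"unknown keys: {sorted(extra)}")


def classify_sketch(document: Dict[str, Any], lax: Validator, strict: Validator) -> str:
    """Return "error", "warning", or "ok" from lax and strict validation."""
    try:
        lax(document)
    except ValueError:
        return "error"  # fails even the permissive schema
    try:
        strict(document)
    except ValueError:
        return "warning"  # valid apart from extra fields
    return "ok"
```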

Key Types from the Schema#

These types are used throughout the model hierarchy:

from gxformat2.schema.gxformat2 import (
    GalaxyType,              # Enum: data, collection, string, int, float, ...
    WorkflowStepType,        # Enum: tool, subworkflow, pause, pick_value
    WorkflowInputParameter,  # Input definition
    WorkflowOutputParameter, # Output definition
    WorkflowStepInput,       # Step input (source, default)
    WorkflowStepOutput,      # Step output (id)
    CreatorPerson,           # Creator with name, identifier, ...
    CreatorOrganization,     # Organization creator
    Report,                  # Invocation report template
)

from gxformat2.schema.native import (
    NativeStepType,          # Enum: data_input, data_collection_input, ...
    NativeInputConnection,   # Step input connection (id, output_name)
    NativePostJobAction,     # Post-job action definition
)

Raw Dict Access#

If you need to work with workflow dicts without typed models (e.g., for forward-compatibility with fields not yet in the schema), load them directly:

from gxformat2.yaml import ordered_load, ordered_load_path

# From file
workflow_dict = ordered_load_path("workflow.ga")

# From stream
with open("workflow.gxwf.yml") as f:
    workflow_dict = ordered_load(f)

# Format detection
is_native = workflow_dict.get("a_galaxy_workflow") == "true"
is_format2 = workflow_dict.get("class") == "GalaxyWorkflow"

The schema models document what fields to expect — refer to gxformat2.schema.gxformat2 and gxformat2.schema.native for field names, types, and aliases.

Slicing into Workflow Components#

Sometimes you don’t need the full workflow model — you just want the inputs, the outputs, or the steps. The gxformat2.normalize module provides focused accessors that extract individual pieces from any workflow representation, handling format detection, conversion, and normalization behind the scenes.

from gxformat2.normalize import inputs, outputs, steps

# From a file path, a raw dict, or any typed model
workflow_inputs = inputs("workflow.ga")
workflow_outputs = outputs(format2_dict)
all_steps = steps(normalized_native_workflow)

inputs() returns a list of WorkflowInputParameter models, outputs() returns WorkflowOutputParameter models, and steps() returns input parameters followed by NormalizedWorkflowStep models — the same objects you’d get from the full NormalizedFormat2 model, just without needing to build one yourself.

All three accept the same arguments:

from gxformat2.options import ConversionOptions

# With conversion options and expansion
opts = ConversionOptions(workflow_directory="/path/to/dir")
expanded_inputs = inputs(workflow_dict=wf, options=opts, expand=True)

Dict-returning variants (deprecated)#

The older inputs_normalized(), outputs_normalized(), and steps_normalized() functions return the same data as plain dicts instead of typed models. These are retained for backward compatibility with tools like Planemo:

from gxformat2.normalize import inputs_normalized

# Returns list[dict] — each dict has "id", "type", "default", etc.
input_dicts = inputs_normalized(workflow_path="workflow.gxwf.yml")

New code should prefer the typed accessors above.