# Working with Galaxy Workflow Models

This guide describes the typed model layers available for programmatically working with Galaxy workflows. The layers are organized from highest convenience (top) to lowest level (bottom):

```
┌─────────────────────────────────────────────────────┐
│  ensure_format2() / ensure_native()                 │
│  "Give me any workflow, I'll give you typed models" │
│  Auto-detects format, converts, normalizes, expands │
│  ↳ from gxformat2.normalized                        │
├─────────────────────────────────────────────────────┤
│  Expanded Models                                    │
│  ExpandedFormat2 / ExpandedNativeWorkflow           │
│  All subworkflow refs resolved to inline models     │
│  ↳ from gxformat2.normalized                        │
├─────────────────────────────────────────────────────┤
│  Normalized Models                                  │
│  NormalizedFormat2 / NormalizedNativeWorkflow       │
│  Structural guarantees: lists, ids, parsed state    │
│  ↳ from gxformat2.normalized                        │
├─────────────────────────────────────────────────────┤
│  Schema Models (lax / strict)                       │
│  GalaxyWorkflow / NativeGalaxyWorkflow              │
│  Direct pydantic parse, no normalization            │
│  ↳ from gxformat2.schema.gxformat2                  │
│  ↳ from gxformat2.schema.native                     │
├─────────────────────────────────────────────────────┤
│  Raw dicts                                          │
│  yaml.ordered_load() / json.load()                  │
│  No validation, no types                            │
└─────────────────────────────────────────────────────┘
```

## The `ensure_` Layer: Recommended Entry Point

The {py:func}`~gxformat2.normalized.ensure_format2` and {py:func}`~gxformat2.normalized.ensure_native` functions are the primary way to get typed workflow models. They accept **any** workflow representation and return normalized (or expanded) models in the target format, handling format detection and conversion automatically.
### ensure_format2

```python
from gxformat2.normalized import ensure_format2

# From a file path (native .ga or Format2 .gxwf.yml)
nf2 = ensure_format2("workflow.ga")
nf2 = ensure_format2("workflow.gxwf.yml")

# From a raw dict (auto-detects native vs Format2)
nf2 = ensure_format2(workflow_dict)

# From any typed model (passes through or converts)
nf2 = ensure_format2(native_galaxy_workflow)
nf2 = ensure_format2(normalized_native_workflow)
nf2 = ensure_format2(galaxy_workflow_schema_model)

# With expansion (resolves @import, URLs, subworkflow refs)
from gxformat2.options import ConversionOptions

expanded = ensure_format2(workflow, options=ConversionOptions(), expand=True)
```

**Returns:** {py:class}`~gxformat2.normalized.NormalizedFormat2` by default, {py:class}`~gxformat2.normalized.ExpandedFormat2` when `expand=True`. Overloads provide correct static types based on the `expand` literal.

### ensure_native

```python
from gxformat2.normalized import ensure_native
from gxformat2.options import ConversionOptions

# Same flexibility: any input format
nnw = ensure_native("workflow.gxwf.yml")
nnw = ensure_native(native_dict)
nnw = ensure_native(normalized_format2_model)

# With expansion
expanded = ensure_native(workflow, options=ConversionOptions(), expand=True)
```

**Returns:** {py:class}`~gxformat2.normalized.NormalizedNativeWorkflow` by default, {py:class}`~gxformat2.normalized.ExpandedNativeWorkflow` when `expand=True`.

### Format Detection

For raw dicts, the functions detect the format by inspecting keys:

- `a_galaxy_workflow == "true"` → native Galaxy format
- `class == "GalaxyWorkflow"` → Format2

For file paths (`str` or `Path`), the file is loaded and then inspected.

Typed model inputs ({py:class}`~gxformat2.normalized.NormalizedFormat2`, {py:class}`~gxformat2.schema.native.NativeGalaxyWorkflow`, etc.) are routed by `isinstance` checks, so already-normalized models pass through without re-processing.
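
As an illustration of the key-based detection described above, the following is a minimal standalone sketch in plain Python. The function name `detect_format` and its return strings are hypothetical and not part of gxformat2; the library's own routine may check more than these two markers.

```python
from typing import Any, Mapping


def detect_format(workflow: Mapping[str, Any]) -> str:
    """Classify a raw workflow dict by the marker keys named above.

    Hypothetical helper for illustration only, not a gxformat2 API.
    """
    if workflow.get("a_galaxy_workflow") == "true":
        return "native"
    if workflow.get("class") == "GalaxyWorkflow":
        return "format2"
    raise ValueError("unrecognized workflow dict: no known format marker")


native_like = {"a_galaxy_workflow": "true", "steps": {}}
format2_like = {"class": "GalaxyWorkflow", "inputs": {}, "steps": {}}

print(detect_format(native_like))   # native
print(detect_format(format2_like))  # format2
```

Raising on an unrecognized dict is a design choice of this sketch; it keeps a malformed document from being silently treated as either format.
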
### Access Patterns for Applications

Once you have a {py:class}`~gxformat2.normalized.NormalizedFormat2`, the model provides structured typed access to all workflow components:

```python
nf2 = ensure_format2(workflow)

# Workflow metadata
nf2.label    # str | None
nf2.doc      # str | None (joined if originally a list)
nf2.tags     # list[str] | None
nf2.creator  # list[CreatorPerson | CreatorOrganization] | None
nf2.license  # str | None

# Inputs: always a list, shorthands expanded
for inp in nf2.inputs:
    inp.id        # str, always populated
    inp.type_     # GalaxyType | list[GalaxyType] | None
    inp.optional  # bool | None
    inp.default   # Any
    inp.doc       # str | None
    inp.position  # StepPosition | None

# Steps: always a list, ids populated
for step in nf2.steps:
    step.id               # str, always populated
    step.label            # str | None
    step.tool_id          # str | None
    step.type_            # WorkflowStepType | None
    step.doc              # str | None (joined if originally a list)
    step.position         # StepPosition | None
    step.run              # NormalizedFormat2 | str | dict | None
    step.in_              # list[WorkflowStepInput], always a list
    step.out              # list[WorkflowStepOutput], always a list
    step.connected_paths  # frozenset[str]: input ids with a source connection

# Workflow-level tool inventory (recurses into inline subworkflows)
nf2.unique_tools  # frozenset[ToolReference] of (tool_id, tool_version) pairs

# Outputs: always a list
for out in nf2.outputs:
    out.id            # str | None
    out.outputSource  # str | None
    out.type_         # GalaxyType | list[GalaxyType] | None
```

### Source Reference Resolution

Format2 workflows reference step outputs using strings like `"step_label/output_name"`. When step labels themselves contain `/`, parsing is ambiguous.
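
To make the ambiguity concrete, the standalone sketch below (plain Python, not the gxformat2 implementation) compares a naive split on the first `/` with a lookup that prefers the longest known label. The helper names `naive_split` and `label_aware_split` are hypothetical, and treating a reference that exactly equals a label as a bare reference is an assumption of this sketch.

```python
from typing import Iterable, Tuple


def naive_split(reference: str) -> Tuple[str, str]:
    """Split on the first "/"; bare references default to "output"."""
    label, sep, output = reference.partition("/")
    return (label, output) if sep else (reference, "output")


def label_aware_split(reference: str, known_labels: Iterable[str]) -> Tuple[str, str]:
    """Prefer the longest known label that prefixes the reference."""
    for label in sorted(known_labels, key=len, reverse=True):
        if reference == label:
            # Assumption: an exact label match is a bare reference.
            return label, "output"
        if reference.startswith(label + "/"):
            return label, reference[len(label) + 1:]
    # No known label matched, so fall back to the naive rule.
    return naive_split(reference)


labels = {"filter_step", "Mass Spec Runs (MS/MS)"}
ref = "Mass Spec Runs (MS/MS)/spectra"

print(naive_split(ref))                # ('Mass Spec Runs (MS', 'MS)/spectra')
print(label_aware_split(ref, labels))  # ('Mass Spec Runs (MS/MS)', 'spectra')
```
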
The normalized layer provides helpers:

```python
# Resolve a source reference against this workflow's steps
ref = nf2.resolve_source("filter_step/output1")
ref.step_label   # "filter_step"
ref.output_name  # "output1"

# known_labels is a cached set of all step labels/ids + input ids
nf2.known_labels  # {"filter_step", "input1", "map_step", ...}

# Works with ambiguous labels containing "/"
# e.g. step labeled "Mass Spec Runs (MS/MS)"
ref = nf2.resolve_source("Mass Spec Runs (MS/MS)/spectra")
ref.step_label   # "Mass Spec Runs (MS/MS)"
ref.output_name  # "spectra"
```

The resolver tries known labels longest-first, then falls back to splitting on the first `/`. Bare references (no `/`) default to `output_name="output"`.

The standalone function {py:func}`~gxformat2.normalized.resolve_source_reference` is also available:

```python
from gxformat2.normalized import resolve_source_reference, SourceReference

ref: SourceReference = resolve_source_reference("step/out", known_labels)
```

### ConversionOptions

Both `ensure_` functions and the underlying {py:func}`~gxformat2.normalized.to_format2` / {py:func}`~gxformat2.normalized.to_native` converters accept an optional {py:class}`~gxformat2.options.ConversionOptions`:

```python
from gxformat2.options import ConversionOptions

options = ConversionOptions(
    workflow_directory="/path/to/dir",  # For resolving @import paths
    url_resolver=my_resolver,           # Custom fetcher for URL refs
)
expanded = ensure_format2(workflow, options=options, expand=True)
```

Key options:

- `workflow_directory`: base path for `@import` resolution
- `url_resolver`: callable for fetching URL / TRS references
- `encode_tool_state_json`: JSON-encode tool_state in native output (default True)
- `state_encode_to_native`: Galaxy-provided callback to encode Format2 state back to native `tool_state` (accepts `(step_dict, state_dict)`, returns clean dict or `None` for default dict passthrough; no JSON encoding)
- `state_encode_to_format2`: Galaxy-provided callback to
  decode native `tool_state` to Format2 `state` (accepts step dict, returns state dict or `None` for default)
- `strict_structure`: validate input against the strict schema (`extra="forbid"`) before normalization and sanity-check conversion output against the target strict schema. Raises `ValidationError` on any unrecognised keys. Threaded into nested sub-workflow normalization (`$graph` dedup entries and inline `run:` dicts). Default False.

See {py:class}`~gxformat2.options.ConversionOptions` for the full parameter list.

### Real-World Usage

The application modules in this project all use the `ensure_` layer:

```python
# abstract.py: CWL export
nf2 = ensure_format2(workflow_dict)
for step in nf2.steps:
    if isinstance(step.run, NormalizedFormat2):
        # Recurse into subworkflow
        ...

# cytoscape.py: visualization
nf2 = ensure_format2(workflow_path)
for step in nf2.steps:
    for step_input in step.in_:
        ref = nf2.resolve_source(step_input.source)
        # Build edge from ref.step_label → current step

# lint.py: validation
nf2 = ensure_format2(workflow_dict, expand=True)  # ExpandedFormat2
nnw = ensure_native(workflow_dict)  # For native-specific checks
```

## Normalized and Expanded Models

For cases where you know the input format and don't need auto-detection, the {py:mod}`gxformat2.normalized` package provides direct constructors. For cross-format conversion, {py:func}`~gxformat2.normalized.to_format2` and {py:func}`~gxformat2.normalized.to_native` convert between formats and return normalized (or expanded) models.

### Cross-Format Conversion

{py:func}`~gxformat2.normalized.to_format2` converts native workflows to Format2, and {py:func}`~gxformat2.normalized.to_native` converts Format2 workflows to native.
Both accept the source format as a raw dict, file path, or typed model, and return normalized models in the target format:

```python
from gxformat2.normalized import to_format2
from gxformat2.normalized import to_native

# Native → Format2
nf2 = to_format2(native_dict)
nf2 = to_format2(normalized_native_workflow)

# Format2 → Native
nnw = to_native(format2_dict)
nnw = to_native(normalized_format2_model)

# With expansion (resolves subworkflow refs in the result)
ef2 = to_format2(native_dict, options=options, expand=True)
enw = to_native(format2_dict, options=options, expand=True)
```

Unlike the `ensure_` functions, these only accept the *other* format as input: {py:func}`~gxformat2.normalized.to_format2` expects native input, {py:func}`~gxformat2.normalized.to_native` expects Format2 input. Use `ensure_` when you don't know (or don't care about) the input format.

### Normalized Models

{py:class}`~gxformat2.normalized.NormalizedFormat2` guarantees:

- `steps`, `inputs`, `outputs` are always `list` (never dict)
- Step and input `id` fields are always populated
- Input type shorthands expanded (`"File"` → `"data"`, etc.)
- `doc` fields joined from list to string
- `$graph` multi-workflow documents resolved
- `$link` entries in step `state` resolved: replaced with `{"__class__": "ConnectedValue"}` in state, connection source added to `step.in_`. After normalization, `step.in_` contains ALL connections and `step.state` is clean
- Step `type_` always populated, inferred from context when not explicit in the YAML (`run` present → `subworkflow`, otherwise → `tool`).
  Convenience properties `is_tool_step`, `is_subworkflow_step`, `is_pause_step`, `is_pick_value_step`, and `connected_paths` (frozenset of input ids with a source) are also available
- `unique_tools` property returns `frozenset[ToolReference]` of all `(tool_id, tool_version)` pairs, recursing into inline subworkflows (unresolved `@import`/URL refs are skipped; use expanded models for full coverage)

{py:class}`~gxformat2.normalized.NormalizedNativeWorkflow` guarantees:

- `tool_state` always a parsed `dict` (JSON strings auto-decoded)
- Optional containers default to empty (never `None`)
- `input_connections` values always `list[NativeInputConnection]` (single connections wrapped during construction; no `isinstance` checks needed)
- `connected_paths` property returns `frozenset[str]` of `input_connections` keys, an O(1) membership test for checking whether a state path has an incoming connection
- `unique_tools` property returns `frozenset[ToolReference]` of all `(tool_id, tool_version)` pairs, recursing into inline subworkflows
- Tags normalized (empty string → empty list)
- Subworkflows recursively normalized

All normalized models provide a `to_dict()` method that serializes to a JSON/YAML-compatible dict with aliases resolved and `None` values stripped:

```python
nf2 = ensure_format2(workflow)
workflow_dict = nf2.to_dict()  # ready for json.dump() or yaml.dump()

nnw = ensure_native(workflow)
native_dict = nnw.to_dict()  # includes "a_galaxy_workflow", "format-version", etc.
```

```python
from gxformat2.normalized import normalized_format2, normalized_native

# From same-format input
nf2 = normalized_format2(format2_dict)
nf2 = normalized_format2("/path/to/workflow.gxwf.yml")
nf2 = normalized_format2(galaxy_workflow_model)

nnw = normalized_native(native_dict)
nnw = normalized_native("/path/to/workflow.ga")
nnw = normalized_native(native_galaxy_workflow_model)

# For native dicts, use ensure_format2 or to_format2 instead:
from gxformat2.normalized import ensure_format2

nf2 = ensure_format2(native_dict)
```

Normalized models use `extra="ignore"`: unknown keys in the input dict are silently dropped and never appear in `to_dict()` output. To reject them instead, pass `strict_structure=True`: the raw dict is validated against the strict schema (`extra="forbid"`) before normalization and a `pydantic.ValidationError` is raised on any unrecognised keys:

```python
from pydantic import ValidationError

try:
    nf2 = normalized_format2(format2_dict, strict_structure=True)
    nnw = normalized_native(native_dict, strict_structure=True)
except ValidationError as e:
    ...  # extra/unknown keys present
```

`GalaxyUserToolStub` and `ImportReference` (opaque passthroughs) keep `extra="allow"` regardless of this flag. The flag is also available on `ConversionOptions` for the `to_*` / `ensure_*` / `expanded_*` functions.
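
The difference between `extra="ignore"` and `extra="forbid"` is ordinary pydantic behaviour and can be seen on a toy model. The sketch below assumes pydantic v2; `LaxStep` and `StrictStep` are hypothetical classes written for this illustration and are unrelated to the gxformat2 schema models.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class LaxStep(BaseModel):
    """Toy model: unknown keys are silently dropped."""

    model_config = ConfigDict(extra="ignore")
    label: str


class StrictStep(BaseModel):
    """Toy model: unknown keys are rejected."""

    model_config = ConfigDict(extra="forbid")
    label: str


raw = {"label": "filter", "not_in_schema": 1}

# The unknown key disappears from the validated model and its dump.
lax = LaxStep.model_validate(raw)
print(lax.model_dump())  # {'label': 'filter'}

# The same input fails validation under extra="forbid".
try:
    StrictStep.model_validate(raw)
except ValidationError as exc:
    print(f"rejected with {exc.error_count()} error(s)")
```
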
### Expanded Models

{py:class}`~gxformat2.normalized.ExpandedFormat2` and {py:class}`~gxformat2.normalized.ExpandedNativeWorkflow` inherit from their normalized counterparts and additionally guarantee all external references are resolved to inline model instances:

- {py:class}`~gxformat2.normalized.ExpandedFormat2`: every `step.run` is `ExpandedFormat2 | None` (no URL strings, no `@import` dicts)
- {py:class}`~gxformat2.normalized.ExpandedNativeWorkflow`: every `step.subworkflow` is `ExpandedNativeWorkflow | None` (no `content_id` URL refs)

```python
from gxformat2.normalized import expanded_format2, expanded_native
from gxformat2.options import ConversionOptions

options = ConversionOptions(workflow_directory="/path/to/dir")
ef2 = expanded_format2(format2_dict, options)
enw = expanded_native(native_dict, options)

# Also accepts already-normalized models
ef2 = expanded_format2(nf2, options)
```

Expansion fetches `@import` paths, HTTP/TRS URLs, and `$graph` references. Circular references raise `ValueError`. Max depth is 10.

Use expanded models when you need to traverse into subworkflows (linting, full-tree analysis). Use normalized models when you only need the top-level structure (visualization, simple metadata extraction).

## Schema Models: Direct Pydantic Validation

The schema models are auto-generated from schema-salad YAML definitions using `schema-salad-plus-pydantic`. They provide direct pydantic parsing with no normalization; fields appear exactly as in the source document.

### Lax Models (extra="allow")

Accept unknown fields. Use for real-world workflows that may contain Galaxy-version-specific or undocumented fields:

```python
from gxformat2.schema.gxformat2 import GalaxyWorkflow
from gxformat2.schema.native import NativeGalaxyWorkflow

gw = GalaxyWorkflow.model_validate(format2_dict)
ngw = NativeGalaxyWorkflow.model_validate(native_dict)
```

### Strict Models (extra="forbid")

Reject unknown fields.
Use for schema compliance checking:

```python
from gxformat2.schema.gxformat2_strict import GalaxyWorkflow as StrictFormat2
from gxformat2.schema.native_strict import NativeGalaxyWorkflow as StrictNative

StrictFormat2.model_validate(format2_dict)  # raises on extra fields
StrictNative.model_validate(native_dict)    # raises on extra fields
```

The linter uses both: strict failure with lax success means only extra fields are present (warning), while lax failure indicates fundamental schema errors (error).

### Key Types from the Schema

These types are used throughout the model hierarchy:

```python
from gxformat2.schema.gxformat2 import (
    GalaxyType,               # Enum: data, collection, string, int, float, ...
    WorkflowStepType,         # Enum: tool, subworkflow, pause, pick_value
    WorkflowInputParameter,   # Input definition
    WorkflowOutputParameter,  # Output definition
    WorkflowStepInput,        # Step input (source, default)
    WorkflowStepOutput,       # Step output (id)
    CreatorPerson,            # Creator with name, identifier, ...
    CreatorOrganization,      # Organization creator
    Report,                   # Invocation report template
)
from gxformat2.schema.native import (
    NativeStepType,         # Enum: data_input, data_collection_input, ...
    NativeInputConnection,  # Step input connection (id, output_name)
    NativePostJobAction,    # Post-job action definition
)
```

## Raw Dict Access

If you need to work with workflow dicts without typed models (e.g., for forward-compatibility with fields not yet in the schema), load them directly:

```python
from gxformat2.yaml import ordered_load, ordered_load_path

# From file
workflow_dict = ordered_load_path("workflow.ga")

# From stream
with open("workflow.gxwf.yml") as f:
    workflow_dict = ordered_load(f)

# Format detection
is_native = workflow_dict.get("a_galaxy_workflow") == "true"
is_format2 = workflow_dict.get("class") == "GalaxyWorkflow"
```

The schema models document what fields to expect; refer to {py:mod}`gxformat2.schema.gxformat2` and {py:mod}`gxformat2.schema.native` for field names, types, and aliases.
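
For a sense of what raw dict access involves, here is a small sketch that collects tool ids directly from a loaded dict. It assumes the usual document shapes (native `steps` as a dict keyed by step index, Format2 `steps` as either a list or a dict keyed by label) and does not descend into subworkflows; the helper name `raw_tool_ids` is hypothetical. The `unique_tools` property on the normalized models covers the same ground with the shape handling done for you.

```python
from typing import Any, Dict, Iterable, Set


def raw_tool_ids(workflow: Dict[str, Any]) -> Set[str]:
    """Collect top-level tool ids from an untyped workflow dict.

    Hypothetical helper: handles `steps` given as a list or as a dict
    and does not descend into subworkflows.
    """
    steps = workflow.get("steps") or []
    step_dicts: Iterable[Any] = steps.values() if isinstance(steps, dict) else steps
    return {
        step["tool_id"]
        for step in step_dicts
        if isinstance(step, dict) and step.get("tool_id")
    }


native_like = {
    "a_galaxy_workflow": "true",
    "steps": {
        "0": {"type": "data_input", "tool_id": None},
        "1": {"type": "tool", "tool_id": "cat1"},
    },
}
format2_like = {
    "class": "GalaxyWorkflow",
    "steps": [{"tool_id": "cat1"}, {"tool_id": "sort1"}],
}

print(raw_tool_ids(native_like))           # {'cat1'}
print(sorted(raw_tool_ids(format2_like)))  # ['cat1', 'sort1']
```
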
## Slicing into Workflow Components

Sometimes you don't need the full workflow model; you just want the inputs, the outputs, or the steps. The {py:mod}`gxformat2.normalize` module provides focused accessors that extract individual pieces from any workflow representation, handling format detection, conversion, and normalization behind the scenes.

```python
from gxformat2.normalize import inputs, outputs, steps

# From a file path, a raw dict, or any typed model
workflow_inputs = inputs("workflow.ga")
workflow_outputs = outputs(format2_dict)
all_steps = steps(normalized_native_workflow)
```

{py:func}`~gxformat2.normalize.inputs` returns a list of {py:class}`~gxformat2.schema.gxformat2.WorkflowInputParameter` models, {py:func}`~gxformat2.normalize.outputs` returns {py:class}`~gxformat2.schema.gxformat2.WorkflowOutputParameter` models, and {py:func}`~gxformat2.normalize.steps` returns input parameters followed by {py:class}`~gxformat2.normalized.NormalizedWorkflowStep` models. These are the same objects you'd get from the full {py:class}`~gxformat2.normalized.NormalizedFormat2` model, just without needing to build one yourself.

All three accept the same arguments:

```python
from gxformat2.options import ConversionOptions

# With conversion options and expansion
opts = ConversionOptions(workflow_directory="/path/to/dir")
expanded_inputs = inputs(workflow_dict=wf, options=opts, expand=True)
```

### Dict-returning variants (deprecated)

The older {py:func}`~gxformat2.normalize.inputs_normalized`, {py:func}`~gxformat2.normalize.outputs_normalized`, and {py:func}`~gxformat2.normalize.steps_normalized` functions return the same data as plain dicts instead of typed models. These are retained for backward compatibility with tools like Planemo:

```python
from gxformat2.normalize import inputs_normalized

# Returns list[dict]; each dict has "id", "type", "default", etc.
input_dicts = inputs_normalized(workflow_path="workflow.gxwf.yml")
```

New code should prefer the typed accessors above.