How I Spent Two Days Fighting a Workflow Engine — and What I Actually Learned

A debugging post-mortem on OpenClaw Lobster, llm-task, and the real cost of trusting AI-generated diagnoses.


I'm building a curriculum generation system for a PBL-based Python course. The idea is straightforward: a multi-phase workflow that calls an LLM at each step to produce requirements, constraints, research notes, lesson materials — all written to disk as Markdown files, ready for review.

The orchestration layer is Lobster, a workflow shell built into OpenClaw. The LLM step is handled by llm-task, an optional plugin that runs a JSON-only, stateless model call and returns structured output validated against a JSON Schema.

On paper, this is exactly what these tools were designed for. In practice, it took 12 test scenarios, several misdiagnoses (some by me, some by the AI agent helping me), and a lot of lobster run ... 2>&1 before I understood what was actually happening.

This is the full story.


The Setup

My workflow looked roughly like this:

build_payload.py → Lobster → openclaw.invoke → llm-task → save to disk

The Python script constructs a payload for each phase (prompt, model, schema, temperature), passes it to Lobster, which calls openclaw.invoke --tool llm-task with the payload as --args-json, and the output gets extracted and written to a file.

Simple. Clean. Deterministic.
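The builder itself was small. A minimal sketch of the idea (the phase definitions and field names here are illustrative placeholders, not the real script):

```python
# build_payload.py -- illustrative sketch of the per-phase payload builder.
# The phase table and its prompts are hypothetical; the payload fields
# (prompt, model, schema, temperature) mirror the llm-task call described above.
import json

PHASES = {
    "ph1_requirements": {
        "prompt": "Draft the course requirements as Markdown.",
        "schema": {"type": "object", "properties": {"content": {"type": "string"}}},
    },
}

def build_payload(phase: str) -> dict:
    spec = PHASES[phase]
    return {
        "prompt": spec["prompt"],
        "model": "gemini-2.5-flash",
        "schema": spec["schema"],
        "temperature": 0.2,
    }

if __name__ == "__main__":
    # One JSON document on stdout, for the next step to consume.
    print(json.dumps(build_payload("ph1_requirements")))
```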

The first test — a hardcoded "say hi" prompt with a static --args-json — worked immediately:

steps:
  - id: test_invoke
    pipeline: openclaw.invoke --tool llm-task --action json --args-json '{"prompt":"say hi","model":"gemini-2.5-flash"}'

Exit code 0. Output:

[{
  "content": [{"type": "text", "text": "{\"status\": \"ok\"}"}],
  "details": { "json": {"status": "ok"}, "provider": "google", "model": "gemini-2.5-flash" }
}]
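For later reference, unpacking that envelope in Python is just a couple of lookups. A quick sketch against the exact structure shown above:

```python
import json

# The envelope returned by `openclaw.invoke --action json`, as captured above.
raw = '''[{
  "content": [{"type": "text", "text": "{\\"status\\": \\"ok\\"}"}],
  "details": { "json": {"status": "ok"}, "provider": "google", "model": "gemini-2.5-flash" }
}]'''

envelope = json.loads(raw)
first = envelope[0]

# Two ways to reach the structured result:
parsed = first["details"]["json"]                  # already-parsed object
reparsed = json.loads(first["content"][0]["text"])  # text field holds JSON as a string

assert parsed == reparsed == {"status": "ok"}
```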

Great. Now I just need to make the payload dynamic.


The Core Problem (That Took Two Days to Name Correctly)

Here is the thing about Lobster: it does not interpolate variables inside the argument values of a pipeline: string.

When you write:

- id: invoke
  pipeline: openclaw.invoke --tool llm-task --action json --args-json '$prepare.stdout'

Lobster does not resolve $prepare.stdout into the actual content of that step's stdout. It passes the literal string $prepare.stdout to openclaw.invoke. The tool receives "$prepare.stdout" as the JSON payload — which is not valid JSON — and returns a 500.

This seems obvious in hindsight. But when you're staring at a tool execution failed error with no further detail, and the static version works perfectly, you start looking in every other direction first.
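The failure mode is reproducible in a few lines of Python: the tool gets the literal marker, tries to parse it as JSON, and errors out before any model call happens.

```python
import json

# What the tool actually received: the unresolved variable marker, verbatim.
received = "$prepare.stdout"

try:
    json.loads(received)
except json.JSONDecodeError as e:
    print(f"not valid JSON: {e}")  # this is the branch that runs
```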


Misdiagnosis #1: The Allowlist

The first theory — partly because an early error log mentioned it — was that llm-task was blocked by a platform allowlist on macOS ("darwin"):

llm-task.json is not in the allowlist for platform "darwin"

So I checked the config. Everything was already correct:

{
  "plugins": {
    "entries": {
      "llm-task": { "enabled": true }
    }
  },
  "tools": {
    "alsoAllow": ["llm-task"]
  }
}

The log entry was stale — from before the config was set up properly. The allowlist was never the problem. But it ate an hour.

Lesson: Don't treat old log lines as current diagnostics. Reproduce the error first, then read the logs.


Misdiagnosis #2: Payload Size

Next theory: the Gateway was crashing because the Vietnamese prompts were too long.

Evidence cited: short prompts like "say hi" passed, while longer curriculum prompts failed.

This led to the suggestion of trimming prompts, splitting tasks, and eventually building direct_gemini.py — a script that called the Gemini API directly, bypassing Lobster and llm-task entirely, as a "bridge" workaround.

# direct_gemini.py
url = f"https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent?key={api_key}"

The problem with this approach: it exposed the API key in process arguments (ps aux readable), dropped JSON Schema validation, and disconnected from OpenClaw's auth profile system. It worked, but it was the wrong fix for the wrong problem.

The real reason short prompts passed and long ones failed had nothing to do with payload size. The hardcoded short test used a literal --args-json '{...}'. The longer tests were trying to interpolate a variable into that string. The variable contained newlines. Newlines broke JSON parsing. It was a quoting and interpolation problem all along.
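The difference is easy to demonstrate. Splicing a multi-line string into a JSON template by hand produces invalid JSON, while letting json.dumps build the payload escapes the newlines correctly:

```python
import json

prompt = "Dòng một\nDòng hai"  # a multi-line prompt, like the curriculum prompts

# Naive string interpolation: the raw newline lands inside a JSON string
# literal, which the JSON grammar forbids, so parsing fails downstream.
broken = '{"prompt": "%s"}' % prompt
try:
    json.loads(broken)
except json.JSONDecodeError:
    print("naive interpolation: invalid JSON")

# json.dumps escapes the newline as \n, so the payload round-trips cleanly.
good = json.dumps({"prompt": prompt}, ensure_ascii=False)
assert json.loads(good)["prompt"] == prompt
```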

Lesson: When a workaround "works," resist the urge to ship it. Ask whether it's fixing the actual failure mode or just routing around it.


Misdiagnosis #3: The --each Pattern

After reading the docs more carefully, I found a pattern that looked promising:

gog.gmail.search --query '...' | openclaw.invoke --tool message --action send --each --item-key message

--each reads a JSON array from stdin and calls the tool once per item, merging each item's fields into the tool's args. So if my Python script output [{"task": {...payload...}}], then --each --item-key task would extract the payload and pass it as args — no --args-json needed, no interpolation issue.
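Adapting the builder to that shape is a one-liner. A sketch (the item key name task matches the pattern above):

```python
import json

def wrap_for_each(payloads, item_key="task"):
    """Wrap payloads into the array-of-objects shape that --each expects:
    one object per item, each payload nested under item_key."""
    return [{item_key: p} for p in payloads]

payloads = [{"prompt": "say hi", "model": "gemini-2.5-flash"}]
print(json.dumps(wrap_for_each(payloads)))
# emits: [{"task": {"prompt": "say hi", "model": "gemini-2.5-flash"}}]
```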

I rewrote build_payload.py to output [{"task": payload}], updated the workflow, ran it. Still 500.

Two rounds of debugging later: --each with llm-task is not stable in Lobster. The mechanism works for some tools, but for llm-task in this context it was failing at the argument extraction layer. This was confirmed in the final test matrix.

Lesson: Docs examples are illustrative, not guaranteed. Test patterns in isolation before building on them.


The Test Matrix

At this point, instead of theorizing, I built a systematic test suite. Twelve .lobster files, each testing one specific mechanism:

  • verify_json_pipe.lobster: ✅ Pass. $step.json is real and works.
  • verify_echo.lobster: ✅ Pass. Lobster preserves \n literally when resolving vars, which breaks JSON.
  • verify_vars / verify_final / verify_global: ❌ Fail. Lobster does NOT interpolate vars inside pipeline: ... strings.
  • verify_each / verify_stdin / verify_native: ❌ Fail. openclaw.invoke --each is unreliable for llm-task in Lobster.
  • verify_shell.lobster: ❌ Fail. Classic shell quote-nesting when passing JSON via a variable.

Two findings from this matrix changed everything:

Finding 1 — $step.json is real. After a step runs openclaw.invoke --action json, Lobster parses the output and makes the result available as $step_id.json — a pre-parsed object, not a raw string. This means downstream steps can receive a clean JSON object via stdin, with no manual parsing.

Finding 2 — Lobster preserves \n literally. When a variable like $prepare.stdout is interpolated into any context, newline characters are kept as-is. If the stdout content gets embedded in a JSON string, those literal newlines make the JSON invalid. This is why piping Python output directly into a Lobster pipeline string always broke.


The Architecture That Actually Works

Given these constraints, the correct architecture is:

Python (build payload + call llm-task via HTTP or subprocess)
  ↓ stdout = JSON string
Lobster command step (just runs Python, no JSON touch)
  ↓ $step.json (Lobster parses stdout)
Python file_manager.py (receives parsed object, extracts field, saves)

The key insight: Python handles all JSON construction and tool invocation. Lobster handles only sequencing and file piping.

The workflow becomes clean and dumb:

steps:
  - id: ph1_requirements
    command: python3 "${skill_dir}/tools/llm_invoker.py" --phase ph1_requirements

  - id: save_requirements
    after: ph1_requirements
    stdin: $ph1_requirements.json
    command: python3 "${skill_dir}/tools/file_manager.py" save
      --path "${root}/00_meta/requirements.md"
      --extract-field content

  - id: ph1_constraints
    after: save_requirements
    command: python3 "${skill_dir}/tools/llm_invoker.py" --phase ph1_constraints

  # ...and so on

llm_invoker.py constructs the full payload in Python (no escaping issues, no encoding issues, Unicode-safe by default), calls the OpenClaw Gateway directly via HTTP, and prints the response JSON to stdout. Lobster parses it into $step.json. file_manager.py receives a clean Python dict.
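A minimal sketch of the invoker's core, with a hypothetical Gateway URL, auth header, and payload shape; all three depend on your OpenClaw deployment and are assumptions here, not documented API:

```python
# llm_invoker.py -- illustrative sketch. GATEWAY_URL, the auth header, and the
# payload/response shapes are assumptions, not a documented OpenClaw API.
import json
import os
import urllib.request

GATEWAY_URL = os.environ.get("OPENCLAW_GATEWAY_URL", "http://127.0.0.1:8080/invoke")

def build_request(payload: dict) -> urllib.request.Request:
    """Build the HTTP POST. Python owns all JSON encoding, so multi-line
    Vietnamese prompts need no shell escaping at any point."""
    return urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Token comes from the environment, never argv: invisible to `ps aux`.
            "Authorization": f"Bearer {os.environ.get('OPENCLAW_TOKEN', '')}",
        },
    )

def invoke(payload: dict) -> dict:
    with urllib.request.urlopen(build_request(payload)) as resp:
        return json.load(resp)

# Usage inside llm_invoker.py (sketch):
#   result = invoke({"tool": "llm-task", "action": "json", "prompt": ..., "schema": ...})
#   print(json.dumps(result))   # Lobster parses this stdout into $step.json
```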

No transient .lobster files. No --args-json with dynamic content. No --each. No shell quoting gymnastics.
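file_manager.py is the simplest piece: read the parsed object, pull one field, write it to disk. A sketch of its core (CLI parsing omitted; the flag names follow the workflow above):

```python
# file_manager.py -- illustrative sketch of the save step (CLI parsing omitted).
import pathlib

def save(obj: dict, path: str, extract_field: str) -> pathlib.Path:
    """Extract one field from the step's JSON object and write it to disk."""
    out = pathlib.Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(obj[extract_field], encoding="utf-8")
    return out

# In the real script, save() is driven by the workflow's CLI flags:
#   obj = json.load(sys.stdin)            # Lobster pipes $step.json to stdin
#   save(obj, args.path, args.extract_field)
```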


What the AI Agent Got Wrong

I was using an AI agent throughout this process to help debug. It was useful for generating test variants quickly, but it misdiagnosed the root cause at least three times:

  1. Blamed the allowlist when the config was already correct.
  2. Blamed payload size when the issue was interpolation.
  3. Endorsed direct_gemini.py as a valid long-term solution, including documenting it in a "Best Practices" file — despite it having a security flaw (API key exposed in process args) and losing schema validation.

The agent also produced a "Best Practices" document that contained a bug in the headline example:

# From the agent's "verified" best practices doc:
pipeline: |
  openclaw.invoke --tool llm-task --action run --args-json '$prepare.stdout'

'$prepare.stdout' in single quotes is a literal string. This is exactly the pattern that fails. The document marked it as verified.

The agent was confident, articulate, and wrong. Each misdiagnosis came with a coherent explanation that sounded plausible — and that's what makes this kind of failure expensive. A vague error is easy to distrust. A confident, detailed explanation is much harder to push back on, especially when you're deep in a debugging session.

The only thing that resolved the ambiguity was running actual tests and building an explicit result matrix.


Rules for Working with Lobster + llm-task

Based on everything above, here's what's actually verified:

Do:

  • Use $step.json to receive llm-task output in downstream steps — it's real, it's clean, it works.
  • Handle all JSON construction in Python. No exceptions.
  • Call the OpenClaw Gateway directly from Python (HTTP POST) rather than trying to pass dynamic payloads through Lobster's pipeline layer.
  • Use command: steps for Python scripts. Keep pipeline: steps for static, hardcoded openclaw.invoke calls only.
  • Build a test matrix when debugging. One scenario per file, one variable changed at a time.

Don't:

  • Try to interpolate $step.stdout into a pipeline: argument string. It will not work.
  • Use --each with llm-task in Lobster. Unstable.
  • Trust "verified" documentation produced by an AI agent unless you ran the test yourself.
  • Accept a workaround that bypasses the system's auth and validation layers, even if it unblocks you in the moment.

Closing Thought

The irony of this whole experience is that the final architecture is simpler than the original design. Python does what Python is good at. Lobster does what Lobster is good at. The two don't try to hand off dynamic JSON to each other through shell string interpolation.

The path to that simplicity, though, ran through twelve failing test files, three misdiagnoses, one unnecessary security hole, and a "Best Practices" document with a bug in the first example.

If you're building deterministic AI pipelines on top of a workflow engine you don't fully understand yet: test the primitives first, build the matrix before you build the system, and be skeptical of any diagnosis — including your own — that you haven't reproduced with a minimal test case.


Built with OpenClaw Lobster + Gemini 2.5 Flash. Tests run on macOS (darwin), Gateway v2026.2.x.