Why AI Cron Jobs Lie to You: The Exit 0 With Empty Output Pattern

My health check cron ran 180 times over six hours. Exit code 0 every time. Green dashboard. And you never once asked. Neither did I — until I checked the output and found six hours of perfect silence.

I am a 40-something senior developer in Jakarta. Fifteen years in tech. I have shipped monoliths, microservices, and now autonomous AI agent pipelines that run from my apartment at odd hours. I thought I understood monitoring. I did not. What I learned broke my confidence in every green dashboard I have ever built.

[Image: Dashboard showing 180 green status indicators over six hours; a red warning dot appears after manual inspection]

The Lie Your Cron Tells You

An exit code of 0 means a process terminated and reported success by its own account. That is all. It does not mean the process did anything useful. It does not mean it produced output. It means the shell got a clean goodbye.

Bob Renze documented this exact failure in his AgentChat monitoring setup. His health check cron executed 180 times across six hours. Every single invocation returned exit code 0. The dashboard stayed green. The alerts stayed silent. The actual log output told a different story:

[WARNING] Agent did not produce output within timeout window (60s default).
[WARNING] No response generated. Returning empty result.
[WARNING] Task completed with warnings. Exit code: 0.

One hundred and eighty warnings. Zero actual responses. The system reported success because the wrapper script caught the timeout, swallowed the error, and returned 0 to the shell. The monitoring tool saw “exit 0” and marked the run as healthy.

This is not a bug in one framework. This is the default behavior of most cron runners and agent orchestrators. They check whether the process died, not whether the process worked. A zombie task that returns cleanly is indistinguishable from a healthy one.

Renze identified three detection mechanisms that actually catch this: wall-clock timeout enforcement, checkpoint heartbeat monitoring, and output verification. The first two are standard in production systems. The third — actually checking that the agent produced something meaningful — is almost never implemented.
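Here is a minimal sketch of the first and third mechanisms together: a wall-clock timeout plus an output check. It is not Renze's implementation. The function name `run_with_verification` and its defaults are assumptions for illustration, and heartbeat monitoring is left out because it depends on how your agent reports progress.

```python
import subprocess
import sys


def run_with_verification(cmd, timeout_s=60, min_bytes=1):
    """Run cmd and return (ok, reason).

    ok is False on a wall-clock timeout, a non-zero exit code,
    or an exit code of 0 with (near-)empty stdout.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this
        return False, f"timeout after {timeout_s}s"

    if result.returncode != 0:
        return False, f"exit code {result.returncode}"

    # The check most setups skip: did the process actually say anything?
    if len(result.stdout.strip().encode()) < min_bytes:
        return False, "exit 0 but output is empty"

    return True, "ok"


if __name__ == "__main__":
    # A child process that exits 0 and prints nothing
    print(run_with_verification([sys.executable, "-c", "pass"]))
```

The point of returning a reason string is that "exit 0 but output is empty" becomes its own alertable state instead of hiding inside a green run.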

The Five Silent Failure Patterns

Temur Khan cataloged five failure patterns that plague production AI systems. I have hit every single one. They are not theoretical. They are what happens when you deploy agent pipelines without outcome-level monitoring.

Pattern 1: Exit code 0 with empty output. The classic. The agent times out, the wrapper catches it, returns 0. The cron log shows success. The output file is 0 bytes. Nobody notices for days.

Pattern 2: The “just this once” hook bypass that becomes permanent. During development, you add a bypass flag — --skip-validation — to debug a flaky test. You commit it. It stays in the cron config. Three months later, every run skips validation. The pipeline is a no-op wrapped in a success code.

Pattern 3: Action budget leak through agent loops. An agent calls an API in a retry loop. Each call costs money. The budget guard was set for a single pass, not for the loop. Khan found cases where costs ran 4x the expected budget and nobody noticed because the job still completed. I have seen $1500/month cost overruns on pipelines doing 1000 runs/day. The per-run overrun was small. The aggregate was a surprise.
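One way to close this leak is to charge the budget on every attempt, not once per run. The sketch below is an illustration under assumptions: `RunBudget`, `BudgetExceeded`, and `call_with_retries` are hypothetical names, and costs are tracked as whole cents to keep the arithmetic exact.

```python
class BudgetExceeded(RuntimeError):
    """Raised when the next call would push the run over its budget."""


class RunBudget:
    """Tracks cumulative spend for one run, across every call and retry."""

    def __init__(self, limit_cents):
        self.limit_cents = limit_cents
        self.spent_cents = 0

    def charge(self, cost_cents):
        if self.spent_cents + cost_cents > self.limit_cents:
            raise BudgetExceeded(
                f"{self.spent_cents} + {cost_cents} exceeds {self.limit_cents} cents"
            )
        self.spent_cents += cost_cents


def call_with_retries(call, budget, cost_cents, max_attempts=5):
    """Retry call(), charging the shared budget before every attempt."""
    for attempt in range(1, max_attempts + 1):
        budget.charge(cost_cents)  # retries are paid for too
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise
```

Because `charge` runs outside the `try`, a `BudgetExceeded` stops the loop immediately instead of being treated as one more retryable error.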

Pattern 4: Semantic validation gap. The output passes schema validation. The JSON is well-formed. The required fields are present. But the content is garbage — placeholder text, repeated paragraphs, or an AI-generated summary that says “Content could not be generated” wrapped in a valid response envelope. Schema validation says yes. The user sees nothing.
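A content check that runs after schema validation might look like the sketch below. The marker strings, the word-count floor, and the function name `semantic_problems` are assumptions; tune all of them to what your own agent emits when it fails.

```python
import re

# Hypothetical markers; collect the real ones from your own failed runs
FAILURE_MARKERS = (
    "content could not be generated",
    "lorem ipsum",
    "placeholder",
)


def semantic_problems(text, min_words=50):
    """Return a list of content problems; an empty list means the text passed."""
    problems = []
    lowered = text.lower()

    for marker in FAILURE_MARKERS:
        if marker in lowered:
            problems.append(f"failure marker: {marker!r}")

    words = re.findall(r"\w+", text)
    if len(words) < min_words:
        problems.append(f"only {len(words)} words (minimum {min_words})")

    # Exact-duplicate paragraphs are a cheap signal of a looping generator
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    if len(paragraphs) != len(set(paragraphs)):
        problems.append("repeated paragraphs")

    return problems
```

This only catches crude garbage. It will not tell you whether a fluent paragraph is wrong, but it would have flagged the "Content could not be generated" envelope.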

Pattern 5: The “successful retry” that hides repeated failure. A task fails, retries with exponential backoff, and eventually succeeds on attempt four. The cron log shows success. The retry delay was 21 seconds total. The user who triggered the task gave up after 30 seconds of waiting. The system says it worked. The user says it did not.
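The numbers in Pattern 5 can be made explicit. A 21-second total is consistent with three backoff delays of 3, 6, and 12 seconds; that schedule and the 3-second attempt duration below are assumptions for illustration, as are the function names. The sketch labels a run by what the user experienced, not just by its final status.

```python
def backoff_delays(base_s, factor, retries):
    """Delays between attempts for exponential backoff (no jitter)."""
    return [base_s * factor ** i for i in range(retries)]


def classify_run(attempts, total_latency_s, user_deadline_s):
    """Label a run that eventually succeeded."""
    if total_latency_s > user_deadline_s:
        return "late_success"      # succeeded after the user stopped waiting
    if attempts > 1:
        return "degraded_success"  # succeeded, but only after retries
    return "success"


if __name__ == "__main__":
    delays = backoff_delays(base_s=3, factor=2, retries=3)
    attempt_time_s = 3  # assumed duration of each of the four attempts
    total = sum(delays) + 4 * attempt_time_s
    print(delays, total, classify_run(4, total, user_deadline_s=30))
```

Logging `late_success` and `degraded_success` as distinct states gives you something to alert on when the retry count creeps up while the success rate stays at 100%.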

[Image: Five icons representing the failure patterns, each showing green with a red warning indicator]

How I Built My First Failure-Resistant Cron Pipeline

After my six-hour silence incident, I added three layers of verification to every cron job I run. The first layer checks output existence. The second checks output size against historical baseline. The third checks content patterns. Here is the actual bash trap pattern I use:

#!/bin/bash
# verify_output.sh: EXIT trap that checks output before allowing a clean exit.
# run_agent_task is a placeholder for your own agent command or function.

# One timestamp so both files share the same suffix
RUN_ID="$(date +%Y%m%d_%H%M%S)"
OUTPUT_FILE="/tmp/agent_output_${RUN_ID}.json"
WARNINGS_LOG="/tmp/agent_warnings_${RUN_ID}.log"

# Verification function, invoked by the EXIT trap below
verify_output() {
    # Check 1: Output file must exist and be non-empty
    if [ ! -s "$OUTPUT_FILE" ]; then
        echo "FAILURE: Output file is empty or missing" >&2
        return 1
    fi

    # Check 2: Output must exceed minimum size threshold
    # (stat -c%s is GNU stat; on macOS/BSD use stat -f%z)
    local size
    size=$(stat -c%s "$OUTPUT_FILE")
    local min_size=500  # bytes; tune to your expected minimum
    if [ "$size" -lt "$min_size" ]; then
        echo "FAILURE: Output too small (${size}B < ${min_size}B minimum)" >&2
        return 1
    fi

    # Check 3: Warnings log must not contain silent-failure markers
    if grep -qE "No response generated|empty result|timeout window" \
        "$WARNINGS_LOG"; then
        echo "FAILURE: Agent produced warnings indicating empty processing" >&2
        return 1
    fi

    echo "VERIFIED: Output ${size}B, no silent-failure markers"
    return 0
}

# Install the trap BEFORE running the task so an early exit is verified too
trap 'verify_output || exit 1' EXIT

# Run the actual agent task
run_agent_task "$@" > "$OUTPUT_FILE" 2> "$WARNINGS_LOG"
TASK_EXIT=$?

# exit fires the EXIT trap; a failed verification overrides the status with 1
exit $TASK_EXIT

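The trap pattern is the key. Once installed, it runs on every exit path the shell controls: normal completion, an error, or an explicit exit. It cannot run on SIGKILL, and it covers only exits that happen after the trap line has executed. Even if run_agent_task returns 0, the script exits 1 if verification fails. The cron runner sees a non-zero exit and fires the alert.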

For my Python pipelines, I use a rolling median comparison. If today’s output is less than 30% of the 7-day rolling median, something is wrong:

import statistics
from datetime import datetime, timedelta
from pathlib import Path

OUTPUT_DIR = Path("/var/log/agent_outputs")
THRESHOLD_RATIO = 0.30  # flag if output < 30% of rolling median
WINDOW_DAYS = 7

def get_output_size(date_str: str) -> int:
    """Get the size of output for a given date."""
    output_file = OUTPUT_DIR / f"output_{date_str}.json"
    if output_file.exists():
        return output_file.stat().st_size
    return 0

def rolling_median_today() -> float:
    """Calculate 7-day rolling median of output sizes, excluding today."""
    today = datetime.now()
    sizes = []
    for i in range(1, WINDOW_DAYS + 1):
        past_date = (today - timedelta(days=i)).strftime("%Y%m%d")
        size = get_output_size(past_date)
        if size > 0:
            sizes.append(size)
    if not sizes:
        return 0.0
    return statistics.median(sizes)

def verify_today_output() -> bool:
    """Check if today's output meets the minimum threshold."""
    today_str = datetime.now().strftime("%Y%m%d")
    today_size = get_output_size(today_str)
    median = rolling_median_today()

    if median == 0:
        # No historical data — use absolute minimum
        min_size = 500
    else:
        min_size = median * THRESHOLD_RATIO

    if today_size < min_size:
        print(
            f"FAILURE: Today's output ({today_size}B) is below "
            f"30% of 7-day median ({median:.0f}B → {min_size:.0f}B threshold)"
        )
        return False

    print(
        f"VERIFIED: Today's output {today_size}B meets "
        f"threshold {min_size:.0f}B (median: {median:.0f}B)"
    )
    return True

if __name__ == "__main__":
    # SystemExit does not depend on the exit() helper added by the site module
    raise SystemExit(0 if verify_today_output() else 1)

This catches the silent failures that exit codes miss. A job that runs, returns 0, and writes 200 bytes of error text will fail the 30% check against a 7-day median of 4KB.

[Image: Terminal output showing exit code 0 passing verification, then failing the rolling median check and triggering an alert]

The “Operational vs Effective” Gap

Your dashboard says the job ran. That is operational monitoring. Your dashboard does not say whether the job produced anything. That is effective monitoring. Most teams only have the former.

Here is a concrete example from my own blog publishing pipeline. I have a cron that runs every morning at 6 AM. It generates a header image, writes a blog post to Hugo, and deploys. The wrapper script returns 0 if the process completes without crashing.

One Tuesday, the image generation API started returning 429 (rate limited). The agent caught the error, logged it, and skipped the image step. The write step proceeded but the content was a stub — the agent generated placeholder text because it had no image context. The deploy step pushed the stub to production. The cron returned 0.

My monitoring said “published successfully.” My blog had a post titled “Daily Update” with no image and three paragraphs of lorem ipsum-style filler. I did not notice for two days. My readers did not notice either, which is somehow worse.

The gap between operational and effective monitoring is the gap between “the machine ran” and “the machine did something worth doing.” Closing that gap requires checking artifacts, not just process status.
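For a pipeline like the blog publisher, an artifact check can be small. The sketch below assumes the post is a Markdown file that embeds its header image with standard inline image syntax; the 300-word floor and the name `check_post` are illustrative assumptions, not part of my actual pipeline.

```python
import re
from pathlib import Path

# Markdown inline image syntax: ![alt](path)
IMAGE_REF = re.compile(r"!\[[^\]]*\]\([^)]+\)")


def check_post(path, min_words=300):
    """Return a list of problems with a published post; empty means it passed."""
    path = Path(path)
    if not path.is_file():
        return [f"missing file: {path}"]

    text = path.read_text(encoding="utf-8")
    problems = []

    word_count = len(text.split())
    if word_count < min_words:
        problems.append(f"only {word_count} words (minimum {min_words})")

    if not IMAGE_REF.search(text):
        problems.append("no image reference")

    return problems
```

Run it after the deploy step and exit non-zero if the list is not empty. The stub post from that Tuesday would have failed both checks.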

[Image: Dashboard saying "published successfully" next to a blog post showing placeholder content]

What I Changed: The Three Rules

After these incidents, I rewrote my monitoring philosophy around three rules.

Rule 1: Monitor outcomes, not activity. Check that the blog post exists, has a non-zero word count, contains the expected image reference, and renders correctly. Do not check that the cron ran. The cron running is the least interesting thing about the cron.

Rule 2: Treat “empty” as an error state. Empty stdout from a producer is failure, not success. If my agent pipeline produces 0 bytes, that is a bug, not a quiet day. I set every producer cron to fail on empty output. No exceptions.

Rule 3: Test your failure paths deliberately. Once a month, I manually kill my cron mid-run. I trigger a 429 on the image API. I fill the disk. I revoke a token. Then I verify that the alert fires, the dashboard turns red, and the on-call notification reaches me. If any of those do not happen, I fix the monitoring before I fix the underlying issue.

These rules are not sophisticated. They are basic. But they would have caught every silent failure I have experienced. The sophistication was in the failure modes, not the detection.

The Checklist You Can Steal

If you run AI agent cron pipelines, apply this checklist to every job. It takes thirty minutes per job and saves you from the six-hours-of-silence incident.

  1. Output verification on every run. Add a post-run check that the output file exists, exceeds a minimum size, and does not contain known failure-marker strings. Use the bash trap pattern above.

  2. Rolling median comparison. Compare today’s output size against the 7-day rolling median. Flag anything below 30% as a failure. Tune the threshold to your workload.

  3. Artifact existence checks. After a “successful” blog publish, verify the blog file exists on disk, the image file exists, and the HTTP endpoint returns 200. Check the thing, not the process.

  4. Monthly failure injection. Schedule a deliberate failure once a month. Kill the process, revoke a credential, fill the disk. Verify the alert fires. If it does not, your monitoring is as broken as the thing it monitors.

  5. Cost and budget guardrails. Set hard budget limits on agent API calls. Alert at 50% and 80% of the expected per-run cost. At 1000 runs/day, a 4x cost overrun can reach $1500/month, which is only about $0.05 of extra spend per run over a 30-day month. Catch it on day one, not day thirty.
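The arithmetic behind item 5 is worth writing down, because the per-run number looks harmless. The sketch below assumes a 30-day month and tracks money in whole cents; the function names and the alert labels are hypothetical.

```python
def monthly_overrun_cents(overrun_per_run_cents, runs_per_day, days=30):
    """Project a small per-run overrun across a month of scheduled runs."""
    return overrun_per_run_cents * runs_per_day * days


def alert_level(spent_cents, budget_cents):
    """Map spend against the expected per-run cost to an alert level."""
    if spent_cents > budget_cents:
        return "over_budget"
    if spent_cents * 100 >= budget_cents * 80:
        return "alert_80"
    if spent_cents * 100 >= budget_cents * 50:
        return "alert_50"
    return "ok"


if __name__ == "__main__":
    # 5 cents of overrun per run, 1000 runs/day, 30 days
    print(monthly_overrun_cents(5, 1000) / 100)  # dollars per month
    print(alert_level(spent_cents=85, budget_cents=100))
```

Comparing with integer multiplication instead of dividing keeps the thresholds exact, so a run that lands precisely on 50% or 80% is classified consistently.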

[Image: The five-item checklist for failure-resistant cron pipelines, printable format]


Sources:

  • Bob Renze, “AI Agent Silent Failures: What 6 Hours of Undetected Downtime Taught Me About Monitoring”, DEV.to, 2026-03-23.
  • Bob Renze, “How AI Agents Handle Stalled Tasks and Timeouts”, DEV.to, 2026-03-04.
  • Temur Khan, “5 Silent Failure Patterns I Keep Finding in Production AI Systems”, DEV.to, 2026-05-03.
  • SilentWatch MCP, github.com/temurkhan13/silentwatch-mcp — exit-0-empty-output detection, retry storms, action-budget leaks.