Skillforge · Research

Does the compiler actually work?

Most “AI for the enterprise” products show a polished demo and skip the hard question: does it work on real, messy data? We built a 10-case benchmark across deploys, incidents, pricing, refunds, hiring, comms, procurement, support, and compliance. The eval runner is open source. You can reproduce these numbers.

18.6 / 20 · Mean score across 10 cases
100% · A or B grade, 10/10 cases
4.9 / 5 · Format compliance (Anthropic Skills spec)
4.8 / 5 · Faithfulness (stays grounded in the source)

Last run: Wed, 29 Apr 2026 19:18:11 GMT · models: claude-haiku-4-5-20251001, claude-opus-4-7 (graded by claude-opus-4-7) · cost $0.91

Case | Skill | Score | Grade
01-deploy-freeze | handling-billing-migration-deploy-freeze | 19/20 | A
02-cache-incident | responding-to-search-service-stale-cache-incidents | 19/20 | A
03-usage-pricing | handling-usage-based-pricing-exceptions | 19/20 | A
04-refund-policy | handling-customer-refund-requests | 19/20 | A
05-press-policy | handling-press-and-analyst-inquiries | 19/20 | A
06-vendor-procurement | handling-software-procurement | 17/20 | A
07-hiring-loop | running-engineering-hiring-loop | 18/20 | A
08-pricing-discount | handling-sales-discount-exceptions | 19/20 | A
09-routing-migration | migrating-tenants-to-routing-v2 | 18/20 | A
10-data-retention | handling-customer-data-retention-requests | 19/20 | A

Per-axis means: Faithfulness 4.80 · Executability 5.00 · Format 4.90 · Conciseness 3.90

Methodology

We constructed 10 evaluation cases from realistic B2B SaaS company knowledge: Slack threads, Notion policy pages, GitHub post-mortems, Intercom tickets. Cases are rated easy / medium / hard based on input length, ambiguity, and the presence of conflicting or implicit rules.

For each case we hand-wrote a reference SKILL.md capturing what an ideal compiler would produce, then ran the Skillforge pipeline (Claude Haiku for fact extraction, Claude Opus for skill synthesis) against the raw input. The generated skill is graded by a separate Claude Opus pass on four axes:

  • Faithfulness: every claim grounded in the source; hard rules carried verbatim.
  • Executability: an agent can act on the skill alone; decisions are precise.
  • Format compliance: follows Anthropic's exact Agent Skills spec.
  • Conciseness: every section earns its place; no redundancy.

Each axis is scored 1-5, so a case total maxes out at 20. Grades: A ≥ 17, B 13-16, C 9-12, D < 9.
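
To make the flow concrete, here is a minimal sketch of one case in TypeScript using the Anthropic SDK. The prompts, file names, and the grader's JSON output shape are placeholders for illustration only; the three stages, the four axes, and the grade thresholds are the ones described above, and the actual runner lives under /evals.

typescript
// Minimal sketch of one benchmark case: extract facts with Haiku, synthesize a
// SKILL.md with Opus, grade it with a separate Opus pass, then turn axis scores
// into a letter grade. Prompts, paths, and the grader's JSON shape are
// illustrative placeholders, not the real runner's.
import Anthropic from "@anthropic-ai/sdk";
import { readFile } from "node:fs/promises";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const EXTRACT_MODEL = "claude-haiku-4-5-20251001"; // fact extraction
const OPUS_MODEL = "claude-opus-4-7"; // synthesis + grading, as reported in the run metadata above

async function ask(model: string, prompt: string): Promise<string> {
  const res = await client.messages.create({
    model,
    max_tokens: 4096,
    messages: [{ role: "user", content: prompt }],
  });
  const first = res.content[0];
  return first?.type === "text" ? first.text : "";
}

// Grade thresholds from the methodology: A >= 17, B 13-16, C 9-12, D < 9.
function letterGrade(total: number): "A" | "B" | "C" | "D" {
  if (total >= 17) return "A";
  if (total >= 13) return "B";
  if (total >= 9) return "C";
  return "D";
}

async function runCase(rawInput: string, referenceSkill: string) {
  // Stage 1: fact extraction (Haiku).
  const facts = await ask(EXTRACT_MODEL, `Extract every policy-relevant fact from:\n\n${rawInput}`);

  // Stage 2: skill synthesis (Opus).
  const skill = await ask(OPUS_MODEL, `Write a SKILL.md from these facts:\n\n${facts}`);

  // Stage 3: grading (separate Opus pass) on the four axes, each scored 1-5.
  const graded = await ask(
    OPUS_MODEL,
    "Score the generated skill against the reference on faithfulness, executability, " +
      "format, and conciseness, 1-5 each. Reply with JSON only.\n\n" +
      `<reference>\n${referenceSkill}\n</reference>\n<generated>\n${skill}\n</generated>`,
  );
  const scores = JSON.parse(graded) as Record<string, number>;
  const total = Object.values(scores).reduce((a, b) => a + b, 0);
  return { skill, scores, total, grade: letterGrade(total) };
}

// Example: case 01 (file names are illustrative, not the repo's actual layout).
const input = await readFile("evals/cases/01-deploy-freeze/input.md", "utf8");
const reference = await readFile("evals/cases/01-deploy-freeze/reference.md", "utf8");
console.log(await runCase(input, reference));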

Reproduce

The eval runner is open source under /evals in the repo. To reproduce these results yourself:

bash
git clone https://github.com/AhmedTariqCS/skillforge
cd skillforge
npm install
export ANTHROPIC_API_KEY=sk-ant-...
npm run eval
# Results written to evals/results.json
# Per-case generated SKILL.md files written to evals/cases/*/generated.md

Full-suite cost: about $2.40 across all 10 cases on the current models. Per-case cost is logged in the results file.

What we're measuring

We report three things on every run (see the sketch after this list for reading them back out of the results file):

  • Per-axis means — where the pipeline is strong and where it isn't, by category.

  • Per-case grade — a holistic A/B/C/D rating for each case so we can spot regressions.

  • Cost per skill — both extraction and synthesis. Real $ numbers, not estimates.
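
All three can be read back out of evals/results.json after a run. A rough sketch follows; the field names here are an assumed shape for illustration, not the runner's documented schema.

typescript
// Illustrative aggregation over evals/results.json. The fields used here
// (cases, scores, grade, costUsd) are an ASSUMED shape, not the runner's
// documented schema; inspect the file a real run produces before relying on it.
import { readFile } from "node:fs/promises";

type CaseResult = {
  id: string;
  scores: { faithfulness: number; executability: number; format: number; conciseness: number };
  grade: "A" | "B" | "C" | "D";
  costUsd: number;
};

const { cases } = JSON.parse(await readFile("evals/results.json", "utf8")) as { cases: CaseResult[] };

// Per-axis means across all cases.
const axes = ["faithfulness", "executability", "format", "conciseness"] as const;
for (const axis of axes) {
  const mean = cases.reduce((sum, c) => sum + c.scores[axis], 0) / cases.length;
  console.log(`${axis}: ${mean.toFixed(2)}`);
}

// Per-case grade and cost per skill.
for (const c of cases) {
  console.log(`${c.id}: ${c.grade} ($${c.costUsd.toFixed(2)})`);
}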

Honest about limits

  • Single-document only. The full Skillforge product synthesizes across multiple sources; that's not in this benchmark yet.

  • The grader is itself a Claude model. We spot-checked it against human graders on 30 outputs and found an average disagreement of 0.4 points per axis; acceptable for a relative benchmark, not gospel.

  • 10 cases is small. We're growing toward 50, and we accept community-contributed cases via PR.

  • The cases are well-structured by design (real but readable). Production data is messier; expect lower scores when running against raw exports.

What's next

  • Multi-source synthesis benchmark. When two sources disagree, can we produce a skill that handles the disagreement correctly?

  • Eval expansion to 50 cases across 20 domains. Public leaderboard for community-contributed pipelines.

  • Continuous re-evaluation: every Claude model upgrade triggers a full benchmark re-run. Results published here on every commit.

Try it yourself

The CLI uses the exact same multi-stage pipeline. Run it on your own inputs and see the validation output, the trace, and the cost.