Does the compiler actually work?
Most “AI for the enterprise” products show a polished demo and skip the hard question: does it work on real, messy data? We built a 10-case benchmark across deploys, incidents, pricing, refunds, hiring, comms, procurement, support, and compliance. The eval runner is open source. You can reproduce these numbers.
- **18.6 / 20** mean score across 10 cases
- **100% A or B grade** across 10/10 cases
- **4.9 / 5** format compliance (Anthropic Skills spec)
- **4.8 / 5** faithfulness (stays grounded in the source)
Last run: Wed, 29 Apr 2026 19:18:11 GMT · models: claude-haiku-4-5-20251001 → claude-opus-4-7 (graded by claude-opus-4-7) · cost $0.91
| Case | Skill | Score | Grade |
|---|---|---|---|
| 01-deploy-freeze | handling-billing-migration-deploy-freeze | 19/20 | A |
| 02-cache-incident | responding-to-search-service-stale-cache-incidents | 19/20 | A |
| 03-usage-pricing | handling-usage-based-pricing-exceptions | 19/20 | A |
| 04-refund-policy | handling-customer-refund-requests | 19/20 | A |
| 05-press-policy | handling-press-and-analyst-inquiries | 19/20 | A |
| 06-vendor-procurement | handling-software-procurement | 17/20 | A |
| 07-hiring-loop | running-engineering-hiring-loop | 18/20 | A |
| 08-pricing-discount | handling-sales-discount-exceptions | 19/20 | A |
| 09-routing-migration | migrating-tenants-to-routing-v2 | 18/20 | A |
| 10-data-retention | handling-customer-data-retention-requests | 19/20 | A |
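The headline numbers fall straight out of the per-case table. As a quick sanity check, this sketch recomputes the mean and the A-grade count from the scores above (the 17/20 cutoff comes from the methodology section):

```typescript
// Per-case totals from the results table (max 20 each).
const scores = [19, 19, 19, 19, 19, 17, 18, 19, 18, 19];

const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
const aGrades = scores.filter((s) => s >= 17).length; // A cutoff: 17/20

console.log(mean.toFixed(1), `${aGrades}/${scores.length}`); // 18.6 10/10
```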
Per-axis means:

| Axis | Mean |
|---|---|
| Faithfulness | 4.80 |
| Executability | 5.00 |
| Format | 4.90 |
| Conciseness | 3.90 |

The four axis means sum to 18.60, matching the 18.6/20 overall mean.
Methodology
We constructed 10 evaluation cases from realistic B2B SaaS company knowledge: Slack threads, Notion policy pages, GitHub post-mortems, Intercom tickets. Cases are graded easy / medium / hard based on input length, ambiguity, and the presence of conflicting or implicit rules.
For each case we hand-wrote a reference SKILL.md capturing what an ideal compiler would produce, then ran the Skillforge pipeline (Claude Haiku for fact extraction, Claude Opus for skill synthesis) against the raw input. Each generated skill is graded by a separate Claude Opus pass on four axes:
- Faithfulness: every claim grounded in the source; hard rules carried verbatim.
- Executability: an agent can act on the skill alone; decisions are precise.
- Format compliance: follows Anthropic's exact Agent Skills spec.
- Conciseness: every section earns its place; no redundancy.
Each axis is scored 1-5, so a case total maxes out at 20. Grades: A ≥ 17, B 13-16, C 9-12, D < 9.
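The scoring scheme can be written down as a small helper. A minimal sketch; the `caseTotal` and `caseGrade` names are ours, not from the eval runner:

```typescript
type AxisScores = {
  faithfulness: number;  // 1-5
  executability: number; // 1-5
  format: number;        // 1-5
  conciseness: number;   // 1-5
};

// Case total is the sum of the four axes (max 20).
function caseTotal(s: AxisScores): number {
  return s.faithfulness + s.executability + s.format + s.conciseness;
}

// Grade cutoffs as stated: A >= 17, B 13-16, C 9-12, D < 9.
function caseGrade(total: number): "A" | "B" | "C" | "D" {
  if (total >= 17) return "A";
  if (total >= 13) return "B";
  if (total >= 9) return "C";
  return "D";
}

// Example: a case scoring 5/5/5/4 totals 19 and grades A.
const total = caseTotal({ faithfulness: 5, executability: 5, format: 5, conciseness: 4 });
console.log(total, caseGrade(total)); // 19 A
```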
Reproduce
The eval runner is open source under /evals in the repo. To reproduce these results yourself:
```shell
git clone https://github.com/AhmedTariqCS/skillforge
cd skillforge
npm install
export ANTHROPIC_API_KEY=sk-ant-...
npm run eval
# Results written to evals/results.json
# Per-case generated SKILL.md files written to evals/cases/*/generated.md
```
Full-suite cost: about $2.40 across all 10 cases on the current models. Per-case cost is logged in the results file.
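After a run you can slice `evals/results.json` however you like, for instance to recompute the per-axis means. A sketch under an assumed schema; inspect the file for the real field names before relying on it:

```typescript
// Assumed result shape; check evals/results.json for the actual schema.
type Axes = { faithfulness: number; executability: number; format: number; conciseness: number };
type CaseResult = { id: string; axes: Axes; costUsd: number };

// Mean score per axis across all cases in a run.
function axisMeans(results: CaseResult[]): Axes {
  const n = results.length;
  const sum = (k: keyof Axes) => results.reduce((acc, r) => acc + r.axes[k], 0);
  return {
    faithfulness: sum("faithfulness") / n,
    executability: sum("executability") / n,
    format: sum("format") / n,
    conciseness: sum("conciseness") / n,
  };
}

// Usage after `npm run eval`:
// import { readFileSync } from "node:fs";
// const results: CaseResult[] = JSON.parse(readFileSync("evals/results.json", "utf8"));
// console.log(axisMeans(results));
```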
What we're measuring
We report three things on every run:
- **Per-axis means:** where the pipeline is strong and where it isn't, by category.
- **Per-case grade:** a holistic A/B/C/D rating for each case so we can spot regressions.
- **Cost per skill:** both extraction and synthesis. Real $ numbers, not estimates.
Honest about limits
- **Single-document only.** The full Skillforge product synthesizes across multiple sources; that's not in this benchmark yet.
- **The grader is itself a Claude model.** We spot-checked it against human graders on 30 outputs and found an average disagreement of 0.4 points across axes; acceptable for a relative benchmark, not gospel.
- **10 cases is small.** We're growing toward 50, and we accept community-contributed cases via PR.
- **The cases are well-structured by design** (real but readable). Production data is messier; expect lower scores when running against raw exports.
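The 0.4-point grader-vs-human figure is a mean absolute disagreement per axis. Reproducing it for your own spot-check is straightforward; the score pairs below are illustrative, not our actual spot-check data:

```typescript
// Mean absolute disagreement between model-assigned and human-assigned axis scores.
function meanAbsDisagreement(model: number[], human: number[]): number {
  if (model.length !== human.length) throw new Error("score lists must align");
  const total = model.reduce((sum, m, i) => sum + Math.abs(m - human[i]), 0);
  return total / model.length;
}

// Illustrative: the grader and a human differ on one of four axis scores.
console.log(meanAbsDisagreement([5, 4, 5, 3], [5, 5, 5, 3])); // 0.25
```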
What's next
- **Multi-source synthesis benchmark.** When two sources disagree, can we produce a skill that handles the disagreement correctly?
- **Eval expansion** to 50 cases across 20 domains, plus a public leaderboard for community-contributed pipelines.
- **Continuous re-evaluation:** every Claude model upgrade triggers a full benchmark re-run, with results published here on every commit.
Try it yourself
The CLI uses the exact same multi-stage pipeline. Run it on your own inputs and see the validation output, the trace, and the cost.