Skillforge · Research

Does the compiler actually work?

Most “AI for the enterprise” products show a polished demo and skip the hard question: does it work on real, messy data? We built a 10-case benchmark across deploys, incidents, pricing, refunds, hiring, comms, procurement, support, and compliance. The eval runner is open source. You can reproduce these numbers.

18.6 / 20 · Mean score across 10 cases
100% · A or B grade, 10/10 cases
4.9 / 5 · Format compliance (Anthropic Skills spec)
4.8 / 5 · Faithfulness (stays grounded in the source)

Last run: Wed, 29 Apr 2026 19:18:11 GMT · models: claude-haiku-4-5-20251001, claude-opus-4-7 (graded by claude-opus-4-7) · cost $0.91

Case | Skill | Score | Grade
01-deploy-freeze | handling-billing-migration-deploy-freeze | 19/20 | A
02-cache-incident | responding-to-search-service-stale-cache-incidents | 19/20 | A
03-usage-pricing | handling-usage-based-pricing-exceptions | 19/20 | A
04-refund-policy | handling-customer-refund-requests | 19/20 | A
05-press-policy | handling-press-and-analyst-inquiries | 19/20 | A
06-vendor-procurement | handling-software-procurement | 17/20 | A
07-hiring-loop | running-engineering-hiring-loop | 18/20 | A
08-pricing-discount | handling-sales-discount-exceptions | 19/20 | A
09-routing-migration | migrating-tenants-to-routing-v2 | 18/20 | A
10-data-retention | handling-customer-data-retention-requests | 19/20 | A

Per-axis means: Faithfulness 4.80 · Executability 5.00 · Format 4.90 · Conciseness 3.90

Methodology

We constructed 10 evaluation cases from realistic B2B SaaS company knowledge: Slack threads, Notion policy pages, GitHub post-mortems, Intercom tickets. Cases are rated easy / medium / hard based on input length, ambiguity, and the presence of conflicting or implicit rules.

For each case we hand-wrote a reference SKILL.md capturing what an ideal compiler would produce, then ran the Skillforge pipeline (Claude Haiku for fact extraction, Claude Opus for skill synthesis) against the raw input. The generated skill is graded by a separate Claude Opus pass on four axes:

  • Faithfulness: every claim grounded in the source; hard rules carried verbatim.
  • Executability: an agent can act on the skill alone; decisions are precise.
  • Format compliance: follows Anthropic's exact Agent Skills spec.
  • Conciseness: every section earns its place; no redundancy.

Each axis is scored 1-5, so a case total maxes out at 20. Grades: A ≥ 17, B 13-16, C 9-12, D < 9.
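
To make the flow concrete, here is a minimal sketch of one case in TypeScript using the Anthropic SDK. The prompts, file names, and the grader's JSON output shape are placeholders for illustration only; the three stages, the four axes, and the grade thresholds are the ones described above, and the actual runner lives under /evals.

typescript
// Minimal sketch of one benchmark case: extract facts with Haiku, synthesize a
// SKILL.md with Opus, grade it with a separate Opus pass, then turn axis scores
// into a letter grade. Prompts, paths, and the grader's JSON shape are
// illustrative placeholders, not the real runner's.
import Anthropic from "@anthropic-ai/sdk";
import { readFile } from "node:fs/promises";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const EXTRACT_MODEL = "claude-haiku-4-5-20251001"; // fact extraction
const OPUS_MODEL = "claude-opus-4-7"; // synthesis + grading, as reported in the run metadata above

async function ask(model: string, prompt: string): Promise<string> {
  const res = await client.messages.create({
    model,
    max_tokens: 4096,
    messages: [{ role: "user", content: prompt }],
  });
  const first = res.content[0];
  return first?.type === "text" ? first.text : "";
}

// Grade thresholds from the methodology: A >= 17, B 13-16, C 9-12, D < 9.
function letterGrade(total: number): "A" | "B" | "C" | "D" {
  if (total >= 17) return "A";
  if (total >= 13) return "B";
  if (total >= 9) return "C";
  return "D";
}

async function runCase(rawInput: string, referenceSkill: string) {
  // Stage 1: fact extraction (Haiku).
  const facts = await ask(EXTRACT_MODEL, `Extract every policy-relevant fact from:\n\n${rawInput}`);

  // Stage 2: skill synthesis (Opus).
  const skill = await ask(OPUS_MODEL, `Write a SKILL.md from these facts:\n\n${facts}`);

  // Stage 3: grading (separate Opus pass) on the four axes, each scored 1-5.
  const graded = await ask(
    OPUS_MODEL,
    "Score the generated skill against the reference on faithfulness, executability, " +
      "format, and conciseness, 1-5 each. Reply with JSON only.\n\n" +
      `<reference>\n${referenceSkill}\n</reference>\n<generated>\n${skill}\n</generated>`,
  );
  const scores = JSON.parse(graded) as Record<string, number>;
  const total = Object.values(scores).reduce((a, b) => a + b, 0);
  return { skill, scores, total, grade: letterGrade(total) };
}

// Example: case 01 (file names are illustrative, not the repo's actual layout).
const input = await readFile("evals/cases/01-deploy-freeze/input.md", "utf8");
const reference = await readFile("evals/cases/01-deploy-freeze/reference.md", "utf8");
console.log(await runCase(input, reference));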

Reproduce

The eval runner is open source under /evals in the repo. To reproduce these results yourself:

bash
git clone https://github.com/AhmedTariqCS/skillforge
cd skillforge
npm install
export ANTHROPIC_API_KEY=sk-ant-...
npm run eval
# Results written to evals/results.json
# Per-case generated SKILL.md files written to evals/cases/*/generated.md

Full-suite cost: about $2.40 across all 10 cases on the current models. Per-case cost is logged in the results file.

What we're measuring

We report three things on every run (see the sketch after this list for reading them back out of the results file):

  • Per-axis means — where the pipeline is strong and where it isn't, by category.

  • Per-case grade — a holistic A/B/C/D rating for each case so we can spot regressions.

  • Cost per skill — both extraction and synthesis. Real $ numbers, not estimates.
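
All three can be read back out of evals/results.json after a run. A rough sketch follows; the field names here are an assumed shape for illustration, not the runner's documented schema.

typescript
// Illustrative aggregation over evals/results.json. The fields used here
// (cases, scores, grade, costUsd) are an ASSUMED shape, not the runner's
// documented schema; inspect the file a real run produces before relying on it.
import { readFile } from "node:fs/promises";

type CaseResult = {
  id: string;
  scores: { faithfulness: number; executability: number; format: number; conciseness: number };
  grade: "A" | "B" | "C" | "D";
  costUsd: number;
};

const { cases } = JSON.parse(await readFile("evals/results.json", "utf8")) as { cases: CaseResult[] };

// Per-axis means across all cases.
const axes = ["faithfulness", "executability", "format", "conciseness"] as const;
for (const axis of axes) {
  const mean = cases.reduce((sum, c) => sum + c.scores[axis], 0) / cases.length;
  console.log(`${axis}: ${mean.toFixed(2)}`);
}

// Per-case grade and cost per skill.
for (const c of cases) {
  console.log(`${c.id}: ${c.grade} ($${c.costUsd.toFixed(2)})`);
}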

Honest about limits

  • Single-document only. The full Skillforge product synthesizes across multiple sources; that's not in this benchmark yet.

  • The grader is itself a Claude model. We spot-checked it against human graders on 30 outputs and found an average disagreement of 0.4 points per axis; acceptable for a relative benchmark, not gospel.

  • 10 cases is small. We're growing toward 50, and we accept community-contributed cases via PR.

  • The cases are well-structured by design (real but readable). Production data is messier; expect lower scores when running against raw exports.

What's next

  • Multi-source synthesis benchmark. When two sources disagree, can we produce a skill that handles the disagreement correctly?

  • Eval expansion to 50 cases across 20 domains. Public leaderboard for community-contributed pipelines.

  • Continuous re-evaluation: every Claude model upgrade triggers a full benchmark re-run. Results published here on every commit.

Try it yourself

The CLI uses the exact same multi-stage pipeline. Run it on your own inputs and see the validation output, the trace, and the cost.