Skill Lab · the evaluation layer

Prove every skill
is worth shipping.

Skill Lab grades every SKILL.md in a public GitHub repo against 37 quality and security checks — and can rewrite the failing ones for you. No clone, no sign-up.

The trick

Swap one word in any GitHub URL.

That's the entire onboarding. Works on any public repo containing SKILL.md files.

→ try anthropics/skills

§ I
How it works

Three steps, all real endpoints.

Skill Lab reads every SKILL.md in a public GitHub repo, runs the same 37 checks the sklab CLI runs locally, then lets you run the LLM-powered judge, optimize, and triggers passes on demand.

01 · GET /v1/repos/:o/:r/evaluate

Scan

Paste any GitHub URL. Skill Lab fetches every SKILL.md via the GitHub API — no clone, no sign-up. Results cache by commit SHA.
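
As a rough sketch of the scan step, the snippet below maps a GitHub repo URL onto the evaluate route. The API host is taken from the curl example on this page; the helper name `evaluate_url` and its parsing rules are assumptions for illustration, not part of Skill Lab.

```python
from urllib.parse import urlparse

# Host copied from the curl example on this page.
API_BASE = "https://api.skill-lab.dev"


def evaluate_url(github_url: str) -> str:
    """Map a GitHub repo URL to the /v1/repos/:o/:r/evaluate route (hypothetical helper)."""
    parsed = urlparse(github_url)
    if parsed.netloc.lower() not in {"github.com", "www.github.com"}:
        raise ValueError(f"not a GitHub URL: {github_url!r}")
    # Keep only the non-empty path segments: owner, repo, and anything after.
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected /owner/repo in {github_url!r}")
    owner, repo = parts[0], parts[1]
    # Tolerate clone-style URLs that end in ".git".
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return f"{API_BASE}/v1/repos/{owner}/{repo}/evaluate"


print(evaluate_url("https://github.com/anthropics/skills"))
# → https://api.skill-lab.dev/v1/repos/anthropics/skills/evaluate
```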

02 · 37 rules · 5 dimensions

Check

Structure, naming, description, content, security. Every failure ships with a severity and a one-line fix.
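
Because every failure carries a severity, a CI job can decide which findings should block a merge. The sketch below is one way to do that. The record keys (`check`, `severity`) and the `high` level are assumptions, not the documented response schema; the sample data is the skills/mcp-builder result reported on this page.

```python
# Assumed ordering. The page shows only "low" and "med"; "high" is a guess.
SEVERITY_RANK = {"low": 1, "med": 2, "high": 3}


def blocking_findings(findings, threshold="med"):
    """Return the findings whose severity is at or above the threshold."""
    floor = SEVERITY_RANK[threshold]
    return [f for f in findings if SEVERITY_RANK[f["severity"]] >= floor]


# Hypothetical record shape; check IDs and severities copied from this page.
mcp_builder = [
    {"check": "content.broken-internal-links", "severity": "med"},
    {"check": "content.compatibility-prereqs", "severity": "low"},
]

for finding in blocking_findings(mcp_builder):
    print(finding["severity"], finding["check"])
# → med content.broken-internal-links
```

Lowering the threshold to `"low"` would make both findings blocking; where to draw the line is a team choice.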

03 · POST /v1/.../optimize

Improve

Optional LLM passes: a judge verdict, an optimize rewrite that lifts the score, and a triggers test plan. All returned as JSON for CI.

§ II
Setting the bar

How real skills score.

Six public skills from anthropics/skills, re-evaluated on every deploy. We pick the skills — the scores are whatever the scanner finds.

100 · 37/37
anthropics/skills
skills/pdf
Read, write, OCR, merge, and watermark PDFs.
findings
clean run: every check passed
→ full report

100 · 37/37
anthropics/skills
skills/algorithmic-art
Generative art in p5.js with seeded randomness.
findings
clean run: every check passed
→ full report

99 · 36/37
anthropics/skills
skills/frontend-design
Production-grade UI components and layouts.
1 finding
- content.has-examples (low)
  Content does not contain code examples
→ full report

96 · 35/37
anthropics/skills
skills/mcp-builder
Build well-designed MCP servers and tools.
2 findings
- content.broken-internal-links (med)
  Broken internal link(s): ./reference/mcp_best_practices.md, ./reference/node_mcp_server.md, ./reference/python_mcp_server.md, ./reference/evaluation.md
- content.compatibility-prereqs (low)
  Command runners missing from compatibility: npx (needs Node.js)
→ full report

95 · 35/37
anthropics/skills
skills/skill-creator
Create, edit, and benchmark new skills.
2 findings
- content.token-budget (med)
  Body exceeds 5000 token budget (8156 estimated)
- content.asset-paths-exist (med)
  Asset path(s) not found on disk: assets/eval_review.html
→ full report

94 · 34/37
anthropics/skills
skills/pptx
Build, parse, and edit PowerPoint decks.
3 findings
- content.script-paths-exist (med)
  Script path(s) not found on disk: scripts/thumbnail.py
- content.broken-internal-links (med)
  Broken internal link(s): editing.md, pptxgenjs.md
- content.metadata-token-budget (low)
  Metadata exceeds 150 token budget (173 estimated)
→ full report

scanned May 21, 2026 · refreshed on every deploy

§ III
What 'optimize' actually does

One call, a rewritten SKILL.md.

POST /v1/repos/:o/:r/optimize returns the original and a higher-scoring rewrite, plus the deltas. Below is a frozen example for a deliberately weak refund-handler skill — illustrative numbers, real response shape.

Original

score 46 · failures 19

---
name: Refund Handler
description:
---

Handle customer refund requests. Look up the order, check the refund
policy, and issue a refund if eligible.

Optimized

score 89 · failures 4

---
name: refund-handler
description: Use when a customer asks for a refund. Looks up the order, applies the refund policy, and either issues the refund or routes the request for human review.
---

# Refund Handler

Use this skill when a customer requests a refund. The skill verifies
eligibility against the refund policy and either issues the refund
directly or escalates to a human reviewer.

## When to use

- Customer explicitly asks for a refund, return, or money back
- A previous order had a defect, shipping issue, or pricing error

## Inputs

- `order_id` — required. The order being refunded.
- `reason` — customer-supplied; preserved verbatim for audit.

## Steps

1. Look up the order via `scripts/get_order.py`.
2. Check eligibility: within 30 days, marked delivered, not previously refunded.
3. If eligible, issue the refund via the payments API.
4. Otherwise, escalate to a human reviewer with the reason and order summary.

## Example

```
> Refund order 81022 — wrong size
✓ Eligible · refunded $42.00 to original payment method
```

## Safety

- Never refund without a verified order ID.
- Do not promise refund amounts before eligibility passes.

Δ score: +43

fewer failures: −15

checks that flipped: description.not-empty, naming.format, content.description-actionable, content.has-examples, content.scripts-referenced
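
The score and failure deltas in this frozen example are plain subtraction over the two summaries. A minimal sketch follows; the `Summary` type and `deltas` helper are hypothetical names for illustration and say nothing about the actual response fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Score and failure count for one version of a SKILL.md (hypothetical type)."""
    score: int
    failures: int


def deltas(original: Summary, optimized: Summary) -> dict:
    """Return optimized minus original for each metric."""
    return {
        "score": optimized.score - original.score,
        "failures": optimized.failures - original.failures,
    }


# Numbers from the refund-handler example on this page.
original = Summary(score=46, failures=19)
optimized = Summary(score=89, failures=4)

print(deltas(original, optimized))
# → {'score': 43, 'failures': -15}
```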

§ IV
Web ↔ CLI

Same checks, on a server or your laptop.

sklab is the CLI that ships with the skill-lab PyPI package: same 37 checks, same judge, same optimizer. Run it in CI or on a directory that hasn't been pushed yet.

# scan a repo from anywhere
curl https://api.skill-lab.dev/v1/repos/anthropics/skills/evaluate

# or just open it in a browser
open https://skill-lab.dev/anthropics/skills
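
For scripts that prefer Python to curl, the same GET can be sketched with the standard library alone. The helper names are hypothetical, and the `Accept: application/json` header assumes this endpoint returns JSON, which the page states only for the LLM passes.

```python
import json
import urllib.request

# Host copied from the curl example above.
API_BASE = "https://api.skill-lab.dev"


def build_request(owner: str, repo: str) -> urllib.request.Request:
    """Build the GET request for the evaluate endpoint without sending it."""
    url = f"{API_BASE}/v1/repos/{owner}/{repo}/evaluate"
    # Assumption: the endpoint answers with JSON.
    return urllib.request.Request(
        url, headers={"Accept": "application/json"}, method="GET"
    )


def fetch_evaluation(owner: str, repo: str, timeout: float = 30.0):
    """Send the request and decode the body as JSON (makes a network call)."""
    with urllib.request.urlopen(build_request(owner, repo), timeout=timeout) as resp:
        return json.load(resp)


req = build_request("anthropics", "skills")
print(req.get_method(), req.full_url)
# → GET https://api.skill-lab.dev/v1/repos/anthropics/skills/evaluate
```

Calling `fetch_evaluation("anthropics", "skills")` would perform the actual scan request; the shape of the decoded result is not described on this page.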

→ install guide → /evaluate endpoint → sklab optimize

§ try it

Stop shipping skills you can't measure.

Paste any public GitHub repo with SKILL.md files and Skill Lab will scan it against the rubric.

or pip install skill-lab for local runs
