The Headless Index, Methodology v1
Reader’s note
This is the published methodology for The Headless Index (THI). Every scorecard at headless-systems.com cites a methodology version. Scorecards stamped THI-v1 were produced under the rules in this document.
The single principle behind every decision: no score without verifiable evidence. If we cannot show our work, we do not score. “Unknown” is a valid, non-penalizing result.
1. What THI scores
THI evaluates a single construct: machine consumability.
Operational definition: Given a vendor’s public documentation and API surface, can an AI agent (LLM with tool use, autonomous workflow, or any non-human automated client) correctly read, write, and observe events on this software, with bounded engineering effort and predictable behavior?
This is the headless mode of operation: the architectural pattern where the API is the product and the UI is one client among many. THI applies to any software with a public API, regardless of how the vendor markets itself. A vendor does not need to call itself “headless” to be scored by THI; what matters is whether the software can be operated in headless mode by machines.
The thesis: UI is becoming optional. THI measures how well each vendor has internalized that thesis.
2. What THI explicitly does NOT measure
So readers understand the limits of the score:
- Pricing or licensing fit
- Support quality, response times, account management
- Feature parity, ecosystem maturity beyond MCP
- Performance at scale (we do not load-test)
- Security audits (we observe declared posture; we do not run penetration tests)
- Vendor stability: financial health, leadership turnover, acquisition rumors
- Suitability for any specific use case
A vendor scoring well on THI is not endorsed for purchase. THI is one input among many, scoped tightly to one question.
3. The score
Each vendor receives two scores, a band, and an editorial verdict:
| Element | Range | Source |
|---|---|---|
| JAIRF Score | 0-100 (or N/A) | Jentic API AI-Readiness Framework, cli-jentic against vendor’s OpenAPI |
| The Headless Index | 0-100 (or partial with Unknown sub-criteria) | THI’s own rubric, five sub-criteria, evidence-cited |
| THI Band | A / B / C / D / F | Aggregated from the two scores per Section 9 |
| Editorial Verdict | 250-400 word paragraph | Written by the editor, cited from scorecard evidence |
JAIRF and The Headless Index measure different things and are reported separately. JAIRF measures OpenAPI spec quality from a strictly technical standpoint. The Headless Index measures the broader machine-consumability picture: API design intent, lifecycle support, agent-native investment, schema discoverability, event-driven posture.
A vendor’s headline number is simply “The Headless Index”. Like “S&P 500” doubling as the publication and the number, “The Headless Index” is both the program and the per-vendor score.
4. JAIRF: the technical foundation
THI adopts the Jentic API AI-Readiness Framework (JAIRF) v1.0.0 as its technical scoring layer. JAIRF is open-source (Apache 2.0), vendor-neutral, and documented in a public specification at github.com/jentic/api-ai-readiness-framework.
Per JAIRF’s attribution requirement, every scorecard carries “Powered by JAIRF v1.0.0 by Jentic” with a link to the framework repository.
JAIRF evaluates an OpenAPI specification across six dimensions in three pillars:
| Pillar | Dimension | What it measures |
|---|---|---|
| FDX | Foundational Compliance (FC) | Structural validity, standards conformance, parsability |
| FDX | Developer Experience & Tooling (DXJ) | Documentation clarity, example coverage, response completeness |
| AIRU | AI-Readiness & Agent Experience (ARAX) | Semantic clarity, intent expression, datatype specificity, error standardization |
| AIRU | Agent Usability (AU) | Composability, complexity comfort, navigation, safety patterns |
| TSD | Security (SEC) | Authentication strength, transport security, secret hygiene, OWASP posture |
| TSD | AI Discoverability (AID) | Descriptive richness, intent phrasing, registry signals |
Scores aggregate via a weighted harmonic mean (weak dimensions disproportionately drag the total). The final 0-100 score maps to five readiness levels:
- Level 0 (under 40): Not Ready
- Level 1 (40-60): Foundational
- Level 2 (60-75): AI-Aware
- Level 3 (75-90): AI-Ready
- Level 4 (90+): Agent-Optimized
We do not modify JAIRF. A computed JAIRF score on any THI scorecard is reproducible by running cli-jentic against the same OpenAPI spec at the same point in time.
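To illustrate why a harmonic mean punishes a single weak dimension, here is a minimal sketch. It is not the cli-jentic implementation: the equal weights are placeholders (JAIRF's actual weights are not reproduced in this document), the handling of a zero score is an assumption, and so is the choice to assign shared boundaries (60, 75, 90) to the higher level.

```python
def weighted_harmonic_mean(scores, weights):
    """Weighted harmonic mean: sum(w) / sum(w / x)."""
    if any(s <= 0 for s in scores.values()):
        # Assumption: the harmonic mean is undefined at zero, so treat it as 0.
        return 0.0
    total_weight = sum(weights[k] for k in scores)
    return total_weight / sum(weights[k] / scores[k] for k in scores)


def readiness_level(score):
    """Map a 0-100 score to a readiness level (boundaries lower-inclusive, assumed)."""
    if score >= 90:
        return 4, "Agent-Optimized"
    if score >= 75:
        return 3, "AI-Ready"
    if score >= 60:
        return 2, "AI-Aware"
    if score >= 40:
        return 1, "Foundational"
    return 0, "Not Ready"


# Hypothetical vendor: five strong dimensions and one weak one.
scores = {"FC": 90, "DXJ": 90, "ARAX": 90, "AU": 90, "SEC": 90, "AID": 30}
weights = {k: 1.0 for k in scores}  # placeholder weights, not JAIRF's

harmonic = weighted_harmonic_mean(scores, weights)
arithmetic = sum(scores.values()) / len(scores)
print(round(harmonic, 1), round(arithmetic, 1), readiness_level(harmonic))
```

With these made-up inputs the arithmetic mean is 80 while the harmonic mean lands near 67.5, which moves the vendor from Level 3 to Level 2.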
5. The JAIRF coverage gap
JAIRF requires an OpenAPI 3.x or 2.x specification. Many vendors do not publish OpenAPI. For these vendors we cannot compute a JAIRF score. The scorecard displays:
“This vendor does not publish a machine-readable OpenAPI specification. JAIRF cannot be computed. The Headless Index score and editorial verdict below assess machine consumability through other observable signals.”
JAIRF N/A is itself a data point the reader should weigh. It is not a measurement failure on our side.
6. The Headless Index rubric
The Headless Index uses five sub-criteria, each scored 0 to 20, aggregating to 0 to 100. Where a sub-criterion is legitimately N/A, or Unknown for lack of evidence, it is excluded from the denominator and the displayed Headless Index is computed proportionally from the remaining sub-criteria.
| Criterion | Question | Max | N/A allowed |
|---|---|---|---|
| API-first design intent (`api_first`) | Was the API documented before or alongside the UI, or retrofitted later? | 20 | no |
| Headless mode of operation (`headless_operation`) | Can this vendor be operated end-to-end with no UI involvement? | 20 | no |
| MCP and agent integration posture (`mcp_posture`) | Has the vendor invested in agent-native integration? | 20 | no |
| Schema observability for agents (`schema_observability`) | Can an agent introspect the schema at runtime and act on it? | 20 | no |
| Webhook and event-driven posture (`webhooks_events`) | Can an agent subscribe to state changes rather than polling? | 20 | yes |
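The proportional calculation can be sketched as follows. This is an illustration of the rule as stated, not the production scoring code; the function name and the use of `None` to represent N/A or Unknown are assumptions.

```python
from typing import Optional

MAX_PER_CRITERION = 20


def headless_index(sub_scores: dict[str, Optional[int]]) -> Optional[float]:
    """Scale the scored sub-criteria to 0-100, skipping N/A and Unknown (None)."""
    scored = {k: v for k, v in sub_scores.items() if v is not None}
    if not scored:
        return None  # nothing could be scored, so there is no index to display
    return 100 * sum(scored.values()) / (MAX_PER_CRITERION * len(scored))


# Hypothetical read-only API: webhooks are N/A and leave the denominator.
example = {
    "api_first": 15,
    "headless_operation": 20,
    "mcp_posture": 10,
    "schema_observability": 15,
    "webhooks_events": None,
}
print(headless_index(example))  # 60 of a possible 80 points -> 75.0
```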
6.1 API-first design intent (0-20)
Question: Was the API documented before or alongside the UI, or retrofitted later?
Evidence sources: earliest public mention of API in changelog or docs; whether the API is featured on the marketing site or buried in a “for developers” section; total npm/PyPI weekly downloads across vendor SDKs (real-world consumption signal); number of operations exposed in the API versus operations available only in the UI.
Bands:
- 20: API is the product. Marketing leads with the API. UI is one client among several. Massive download volume (1M+/wk).
- 15: API and UI co-equal. Strong SDK ecosystem (100k+/wk downloads or 200+ stars).
- 10: API is documented and reasonably complete but the product is positioned as UI-first.
- 5: API exists but covers a partial subset of UI workflows. Thin SDK ecosystem.
- 0: API is bolted on. Major workflows require the UI.
6.2 Headless mode of operation (0-20)
Question: Can this vendor be operated end-to-end with no UI involvement?
Evidence sources: account setup, project creation, schema changes available via API; billing and team management programmatic; critical workflows gated to the UI; CLI tool availability; number of OpenAPI operations exposed; docs site topic coverage (setup, billing, teams, CLI, schema).
Bands:
- 20: Full lifecycle via API. UI is optional. 80+ OpenAPI operations and 3+ lifecycle topics documented.
- 15: Operational workflows are API-driven; only meta-actions like billing require UI.
- 10: Day-to-day operations API-driven; setup or schema changes require UI.
- 5: Half the workflows require UI clicks.
- 0: Vendor’s core value depends on UI interaction.
6.3 MCP and agent integration posture (0-20)
Question: Has the vendor invested in agent-native integration?
Evidence sources: official MCP server in vendor’s GitHub org (verified via direct repo lookups for common patterns: mcp, mcp-server, agent-toolkit, {org}-mcp-server); official server in the MCP registry; community MCP servers and their maintenance signal; SDK design for LLM tool calling; public statements or guides about agent integration.
Bands:
- 20: Official, well-maintained MCP server (50+ stars OR vendor-tagged with mcp + active commits in last 90 days). LLM-tool-call-ready SDK. Public agent guidance.
- 15: Official MCP server or strong community MCP with active maintenance.
- 10: Community MCP exists and is maintained.
- 5: No MCP, but SDK usable for agents with significant scaffolding.
- 0: No MCP, no agent guidance, SDK assumes human-driven application development.
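The direct repo lookups mentioned in the evidence sources could be driven by a candidate list like the one below. This is a hedged sketch: the function name is hypothetical, and the only external fact relied on is that the GitHub REST API exposes repositories at `/repos/{owner}/{repo}`, where a 404 response means the repository is not publicly visible under that name.

```python
def mcp_repo_candidates(org: str) -> list[str]:
    """Build GitHub API URLs for the common MCP repo naming patterns."""
    names = ["mcp", "mcp-server", "agent-toolkit", f"{org}-mcp-server"]
    return [f"https://api.github.com/repos/{org}/{name}" for name in names]


# "acme" is a made-up organization name used only for illustration.
for url in mcp_repo_candidates("acme"):
    print(url)
```

A hit on one of these names is only a lead. Whether the server is official and maintained still has to be established from the repository itself.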
6.4 Schema observability for agents (0-20)
Question: Can an agent introspect the schema at runtime and act on it?
Evidence sources: public OpenAPI specification (discovered, valid, current); GraphQL introspection enabled; vendor-specific schema endpoints (Content Types API, GROQ schema, JSON Schema for resources); schema types exported by SDKs.
Bands:
- 20: Schema fully introspectable. OpenAPI and GraphQL where applicable. Current within 30 days.
- 15: Schema introspectable. OpenAPI present and current within 90 days.
- 10: Schema present but stale or partial. Or vendor-specific schema endpoint requiring auth.
- 5: Schema requires manual discovery from human docs.
- 0: No machine-readable schema published.
6.5 Webhook and event-driven posture (0-20)
Question: Can an agent subscribe to state changes rather than polling?
Evidence sources: declared webhook events; signing scheme (HMAC variants, JWT, none); retry policy and replay endpoint; payload completeness (full state or just “go fetch”).
Bands:
- 20: Comprehensive webhooks covering all resource lifecycle events. Signed. Documented retry and replay. Full payloads.
- 15: Webhooks cover most events with signing and retry.
- 10: Webhooks exist with partial coverage or weak signing/retry semantics.
- 5: Webhooks exist as a checkbox: minimal events, thin payloads.
- 0: No webhooks. Agents must poll.
N/A allowed: read-only or essentially-static APIs may legitimately not need webhooks. Scored N/A and excluded from the denominator.
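For readers unfamiliar with why signing matters to an agent, here is a minimal sketch of verifying an HMAC-SHA256 webhook signature with Python's standard library. The hex encoding and the idea that the signature covers the raw request body are assumptions; vendors differ in header names, encodings, and whether a timestamp is included in the signed content.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, payload: bytes, signature_hex: str) -> bool:
    """Return True if signature_hex is the HMAC-SHA256 of payload under secret."""
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking timing information about the mismatch position.
    return hmac.compare_digest(expected, signature_hex)


# Made-up secret and payload for illustration only.
secret = b"example-shared-secret"
body = b'{"event": "entry.published", "id": "123"}'
signature = hmac.new(secret, body, hashlib.sha256).hexdigest()

print(verify_signature(secret, body, signature))         # genuine delivery
print(verify_signature(secret, body + b" ", signature))  # tampered body
```

An unsigned webhook gives the agent no equivalent check, which is why signing separates the upper bands from the lower ones.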
7. Evidence-first scoring
Every numeric score on every scorecard must reference a verifiable artifact:
- A URL with the observation date
- A commit SHA in a public repo
- A recorded HTTP response captured by the data collection script
- A public statement (blog post, changelog, GitHub issue)
- An AI review evidence file (reading actual pages, citing direct quotes)
Three implications:
- If we cannot cite, we cannot score. The result is Unknown.
- Unknown is not penalized. The criterion is excluded from the denominator.
- The data collection script logs every probe. Raw evidence is downloadable as JSON from each scorecard page.
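The "no citation, no score" rule can be expressed as a small data model. The field names below are hypothetical and do not describe the actual evidence JSON; the sketch only shows a proposed score collapsing to Unknown when nothing is cited.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Evidence:
    kind: str         # e.g. "url", "commit", "http_response", "public_statement", "ai_review"
    reference: str    # URL, commit SHA, or path to the captured artifact
    observed_on: str  # ISO date of the observation


@dataclass
class CriterionResult:
    criterion: str
    proposed_score: Optional[int]
    evidence: list[Evidence] = field(default_factory=list)

    @property
    def status(self) -> str:
        # A score counts only when at least one artifact backs it.
        if self.proposed_score is not None and self.evidence:
            return "scored"
        return "unknown"

    @property
    def score(self) -> Optional[int]:
        return self.proposed_score if self.status == "scored" else None


uncited = CriterionResult("mcp_posture", 15)
cited = CriterionResult(
    "mcp_posture", 15,
    [Evidence("url", "https://example.com/docs/mcp", "2025-01-01")],  # placeholder URL and date
)
print(uncited.status, uncited.score)
print(cited.status, cited.score)
```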
8. The two-pass scoring architecture
THI uses two complementary passes.
Pass 1, mechanical: automated probes of OpenAPI URLs, GitHub APIs, MCP registry, npm registry, docs crawl, external calibration. Fast, deterministic, evidence-rich. Produces an initial scorecard.
Pass 2, AI deep review: a human-supervised AI reviewer reads real pages, extracts cited signals, and writes structured answer files with evidence quotes. A review can override a mechanical score when the override is grounded in cited evidence. Such overrides are marked [AI review] in the rationale.
A scorecard is considered editorially complete only after Pass 2.
9. The THI Band
The band is derived from JAIRF and The Headless Index together.
Band A: Designed for machine consumption. Agents can use this with minimal scaffolding.
- Headless Index ≥ 75
- JAIRF ≥ 75
Band B: Solid agent-readiness with at least one notable gap worth knowing.
- Headless Index ≥ 60
- JAIRF ≥ 60
Band C: Usable for agents but expect to write integration scaffolding around gaps.
- Headless Index ≥ 40
- JAIRF ≥ 40
Band D: Significant agent-readiness gaps. Recommendable only when other constraints dominate.
- any score < 40
Band F: Designed for human use only. Agent integration requires substantial workarounds and operational risk.
- any score < 30
Zero-floor rule: if any Headless Index sub-criterion scored exactly 0 with high evidence confidence, the maximum band is capped at C regardless of other scores.
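One way to read the band rules is sketched below. Several points are assumptions rather than stated methodology: both thresholds must be met for a band, "any score" refers to the two headline scores, the F condition is checked before D because it is the stricter one, and the JAIRF N/A case is out of scope because the rules above do not describe it.

```python
def thi_band(headless_index: float, jairf: float,
             zero_with_high_confidence: bool = False) -> str:
    """Derive a band from the two headline scores (one interpretation of Section 9)."""
    lowest = min(headless_index, jairf)
    if lowest < 30:
        band = "F"
    elif lowest < 40:
        band = "D"
    elif lowest >= 75:
        band = "A"
    elif lowest >= 60:
        band = "B"
    else:
        band = "C"
    # Zero-floor rule: a high-confidence 0 on any sub-criterion caps the band at C.
    if zero_with_high_confidence and band in ("A", "B"):
        band = "C"
    return band


print(thi_band(82, 78))        # both scores at or above 75
print(thi_band(82, 65))        # JAIRF holds the band back
print(thi_band(82, 78, True))  # zero-floor cap applies
```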
10. Editorial verdict
Each scorecard includes a verdict paragraph, 250-400 words. The verdict:
- Must reference at least 3 cited evidence items from the scorecard
- Must include at least one explicit strength AND one explicit weakness
- Must end with exactly one of the four recommendation phrases:
  - “Strong fit for agent-driven use cases.”
  - “Workable but requires scaffolding.”
  - “Use only when locked in.”
  - “Not currently suitable for agent consumption.”
- Must avoid em dashes and banned marketing phrases
- Must not introduce claims not in the scorecard
Verdicts are drafted with AI assistance using the AI review evidence, then reviewed and signed off by the editor before publication.
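The mechanical parts of the verdict rules can be checked automatically before editor sign-off. The sketch below is an assumption about how such a check might look, not a description of THI tooling; it covers length, the closing phrase, and em dashes, and leaves citation counts, strengths and weaknesses, and banned phrases to the editor.

```python
RECOMMENDATIONS = (
    "Strong fit for agent-driven use cases.",
    "Workable but requires scaffolding.",
    "Use only when locked in.",
    "Not currently suitable for agent consumption.",
)


def verdict_problems(text: str) -> list[str]:
    """Return the mechanical rule violations found in a verdict draft."""
    problems = []
    word_count = len(text.split())
    if not 250 <= word_count <= 400:
        problems.append(f"word count {word_count} outside 250-400")
    # str.endswith accepts a tuple, so any one of the four phrases passes.
    if not text.rstrip().endswith(RECOMMENDATIONS):
        problems.append("missing closing recommendation phrase")
    if "\u2014" in text:  # U+2014 is the em dash
        problems.append("contains an em dash")
    return problems


draft = "word " * 260 + "Workable but requires scaffolding."
print(verdict_problems(draft))         # an empty list means no mechanical problems
print(verdict_problems("Too short."))
```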
11. Publication and corrections
THI publishes when the editor signs off, with full evidence trails. We do not run a pre-publication vendor review process; THI is an independent publication, not a tribunal.
Correction process:
- A correction form is available on every scorecard page
- Vendors and readers can submit corrections with evidence
- Each correction is reviewed within 7 business days
- Accepted corrections update the scorecard with a public changelog entry showing what changed and why
- Rejected corrections are still logged publicly so the dispute is part of the evidence trail
12. Re-verification cadence
- Top bands (A, B): quarterly
- Middle band (C): semi-annually
- Bottom bands (D, F): annually, or sooner if the vendor announces a major API change
Re-verification reruns the full pipeline with the same methodology version. Score changes greater than 10 points or band changes trigger editor review before re-publication. The changelog is public.
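The cadence and the editor-review trigger reduce to a lookup and a comparison. In this sketch the month values are our reading of quarterly, semi-annually, and annually, and the function is applied to one score at a time since the rule does not say which score the 10-point threshold refers to.

```python
# Months between re-verifications, keyed by band.
REVERIFY_MONTHS = {"A": 3, "B": 3, "C": 6, "D": 12, "F": 12}


def needs_editor_review(old_score: float, new_score: float,
                        old_band: str, new_band: str) -> bool:
    """True when a re-verification result must be reviewed before re-publication."""
    return abs(new_score - old_score) > 10 or old_band != new_band


print(REVERIFY_MONTHS["C"])
print(needs_editor_review(70, 81, "B", "B"))  # 11-point move, same band
print(needs_editor_review(70, 72, "B", "A"))  # small move, band changed
```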
13. Conflict of interest
Publisher: Bitvea, a custom software development company in the Czech Republic. Bitvea builds bespoke headless systems for clients. Bitvea does not sell any product in the categories THI scores. Bitvea does not compete commercially with the vendors being evaluated.
Disclosed on every scorecard: “Bitvea’s relationship with [vendor]: none / past client engagement / current client engagement / partner / integrator.”
Standing commitments:
- No payment for scoring, score adjustments, or removal
- No paid placement, no paid exclusion
- No NDA-only information goes into THI
- No exclusive briefings as a condition of coverage
14. Versioning
This document is THI Methodology v1, in production. Patch updates (clarifications, typo fixes, additional evidence sources for existing criteria) increment to v1.0.1, v1.0.2 without re-scoring. Minor changes (new evidence requirements, refined bands) increment to v1.1 and re-scoring becomes a soft requirement within 6 months. Major changes (changed criteria, changed weights) increment to v2 and trigger re-scoring of all published vendors.
The methodology version is stamped on every scorecard so a reader can always tell which rules produced a score.
15. Categories
THI is not restricted to any product category. The methodology applies to any software exposing an API. Active and planned categories include: Content Management, Payments, Auth and Identity, Search and Vector Databases, Communications, Observability, Project Management, AI Platforms, Workflow and Automation, Feature Flags, Object Storage, Analytics, Commerce.
Vendor selection within a category is editorial. Selection criteria are public on each category page.
16. Known limitations
We commit to publishing our own limitations.
- Inter-rater reliability is unproven at v1. A single editor has reviewed all current scorecards. We will address this by onboarding a second reviewer before the index exceeds 50 vendors, and by periodically blind-rescoring a sample.
- Editorial subjectivity in verdicts. Verdicts are an editorial product. They are reviewed against the rubric but the choice of emphasis is the editor’s. The aggregate band is determined by scorecard data, not the verdict.
- AI review depth varies across the first cohort. Some early scorecards received AI review for a subset of sub-criteria rather than the full five. Re-verification will bring all vendors to consistent depth.
- The JAIRF coverage gap. Vendors without public OpenAPI cannot receive a JAIRF score. The Headless Index and editorial verdict carry the readiness assessment in those cases.
- Observable, not exhaustive. We score what is publicly observable. Private partner programs, undocumented features, and paid-tier APIs that we do not test are not reflected in the score; the vendor is scored only on what we can see.
17. How to read a scorecard
A THI scorecard has four layers:
- Hero: vendor name, THI Band (A/B/C/D/F), Headless Index total, JAIRF total, methodology version, last verified date.
- Editorial verdict: 250-400 words of cited analysis written by the editor.
- Scorecard detail: five Headless Index sub-criteria with score, status, rationale, and citations. Plus the six JAIRF dimensions when computed.
- Evidence trail: raw evidence file (JSON) downloadable. External calibration block where available.
The band is the headline. The verdict is the interpretation. The sub-criteria are the working. The evidence trail is the receipts.
18. Sources and references
- Jentic API AI-Readiness Framework: github.com/jentic/api-ai-readiness-framework (Apache 2.0)
- cli-jentic scoring engine: pypi.org/project/cli-jentic
- Fern Agent Score: buildwithfern.com/agent-score
- CLIRank Agent-Friendliness Score: clirank.dev
- Cloudflare Agent Readiness Score: isitagentready.com
- Model Context Protocol Registry: registry.modelcontextprotocol.io
Status: Production. Scorecards are stamped THI Methodology v1.