Methodology, reviewed 2026-05-16
Talvio's score is a work-activity-exposure signal. It helps organizations decide where AI training and workflow discovery should start by reading the O*NET activities that make up each occupation, then weighting those activities by their importance and level in the role.
The Talvio Augmentation Potential score is designed for training prioritization, workflow discovery, and budget planning. It is not an individual performance measure and it is not a staffing-impact forecast.
The methodology keeps a high-resolution internal value for validation. Product screens display TAP on a 0-10 scale; this is a rescaling of the same value for display, not a separate score.
TAP is deterministic: the same reviewed inputs always produce the same score, and the scoring run is arithmetic only, with no model inference at scoring time. The score combines O*NET data on what an occupation does with reviewed AI capability maturity values drawn from public benchmarks.
TAP is a within-occupation share of addressable weighted work, not a rank and not an absolute capability claim. Occupations with the same TAP have similar proportions of addressable weighted work, not necessarily similar job content. In the current release, product score explanations are work-activity-based, so top O*NET work-activity drivers are the honest explanation for a role's score.
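To make the "share of addressable weighted work" idea concrete, here is a minimal sketch. It is an illustration under stated assumptions, not Talvio's production formula: the weight (importance multiplied by level), the per-activity `addressable` fraction, the 0-1 internal range, and all numeric values are hypothetical. Only the O*NET activity names and rating scales are real.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    name: str
    importance: float   # O*NET importance rating (1-5 scale)
    level: float        # O*NET level rating (0-7 scale)
    addressable: float  # hypothetical reviewed addressable fraction, 0..1


def activity_weight(a):
    # Assumption: weight = importance x level. The source says activities are
    # weighted by importance and level but does not state the exact formula.
    return a.importance * a.level


def weighted_share(activities):
    """Addressable weighted work divided by all weighted work in the role."""
    total = sum(activity_weight(a) for a in activities)
    if total == 0:
        raise ValueError("occupation has no weighted work activities")
    addressable = sum(activity_weight(a) * a.addressable for a in activities)
    return addressable / total


def top_drivers(activities, k=3):
    """Activities contributing the most addressable weighted work."""
    return sorted(
        activities,
        key=lambda a: activity_weight(a) * a.addressable,
        reverse=True,
    )[:k]


def to_display(share):
    # Assumption: the internal value lies in 0..1 and the display value is a
    # linear rescaling to 0-10, rounded to one decimal place.
    return round(share * 10, 1)


# Illustrative ratings only; these are not real O*NET or Talvio values.
role = [
    Activity("Documenting/Recording Information", 4.0, 5.0, 0.8),
    Activity("Handling and Moving Objects", 2.0, 3.0, 0.0),
    Activity("Analyzing Data or Information", 3.5, 4.0, 0.5),
]

share = weighted_share(role)  # (16 + 0 + 7) / (20 + 6 + 14) = 0.575
drivers = [a.name for a in top_drivers(role, k=2)]
```

Because the calculation is plain arithmetic over fixed inputs, calling `weighted_share(role)` again returns the same value, and `drivers` lists the work activities that would explain the role's score.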
The reviewed capability scores help shape training design and capability views. They answer "what AI skills are relevant here?" while the priority ranking remains work-activity-exposure-based in the current validated release.
| Capability | Reviewed maturity | Primary benchmark |
|---|---|---|
| Written content generation and editing | 9.2/10 | HELM Capabilities |
| Information synthesis and research | 7.8/10 | Artificial Analysis Intelligence Index |
| Structured data analysis and quantitative reasoning | 8.1/10 | Artificial Analysis Intelligence Benchmarking |
| Coding and software engineering | 8.4/10 | SWE-bench |
| Conversational support and customer interaction | 8.4/10 | Arena Text Leaderboard |
| Translation and cross-language work | 8.5/10 | Artificial Analysis Multilingual Index |
| Speech and audio processing | 7.7/10 | Artificial Analysis Speech to Text Leaderboard |
| Image and document understanding | 8.6/10 | MMMU-Pro |
| Image, video, and design generation | 8.2/10 | Artificial Analysis Image Model Leaderboard |
| Planning, scheduling, and structured decision support | 7.6/10 | GDPval |
| Tool use and autonomous agents | 7.1/10 | METR Time Horizons |
| Domain-specialist reasoning | 7.7/10 | GPQA |
The score is work-activity-exposure-based. The training-prioritization ranking is validated against Felten AIOE (Spearman rho=0.920, n=682 SOC6, O*NET-structure-derived) and independently corroborated against Anthropic Economic Index (AEI) observed AI task-use (Spearman rho=0.424, n=480 SOC6 occupations, release_2026_03_24). The AEI correlation is moderate and computed on a 54% SOC6 subset. The claim is that TAP corresponds to observed use about as well as the best structural exposure measure does (Felten AIOE vs. AEI rho=0.414 on the same n=480 common set), not that it is highly predictive of AI use.
AEI is independent of O*NET structure: it is observed AI task-use rather than an O*NET-structure-derived exposure score, and it corroborates TAP about as strongly as it corroborates AIOE (rho=0.424 vs rho=0.414, n=480 SOC6), addressing the structural-circularity limitation.
GDPval's near-zero correlation is explained by construct differences. TAP correlates strongly with the exposure measure Felten AIOE (rho=0.920, n=682 SOC6, O*NET-structure-derived), moderately with Anthropic Economic Index observed AI task-use (rho=0.424, n=480 SOC6 occupations, release_2026_03_24), and is near-orthogonal to peak-deliverable-quality benchmarks such as GDPval. For Talvio's purpose, that orthogonality is correct behavior: GDPval measures peak deliverable quality on selected tasks, while TAP measures exposure of an occupation's work-activity mix.
The validated claim is deliberately scoped: Talvio corresponds to observed AI use about as well as Felten AIOE on the matched AEI set (Talvio-vs-AEI rho=0.424; Felten-vs-AEI rho=0.414). It does not claim high prediction of observed use for every occupation or every organization.
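The matched-set comparison above can be sketched as follows: restrict every measure to the occupations they all share, then compute Spearman's rho of each measure against the observed-use reference. This is an illustrative sketch, not Talvio's validation code. The function names and the toy values are hypothetical, and the occupation keys are placeholders rather than real SOC6 codes.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rho_on_common_set(reference, *measures):
    """Return (n, rhos): each measure vs. the reference on shared keys only."""
    common = sorted(set(reference).intersection(*measures))
    ref = [reference[c] for c in common]
    rhos = [spearman([m[c] for c in common], ref) for m in measures]
    return len(common), rhos


# Toy values for illustration only; not real TAP, AIOE, or AEI data.
tap = {"occ_a": 0.7, "occ_b": 0.5, "occ_c": 0.2, "occ_d": 0.9, "occ_e": 0.4}
aioe = {"occ_a": 1.1, "occ_b": 0.3, "occ_c": -0.8, "occ_d": 1.5}
aei = {"occ_a": 0.03, "occ_b": 0.05, "occ_c": 0.01, "occ_d": 0.08, "occ_f": 0.02}

n_common, (rho_tap, rho_aioe) = rho_on_common_set(aei, tap, aioe)
```

Computing both coefficients on the same common set is what makes the two numbers comparable; a rho computed on a larger or different set of occupations would not support the "about as well as" claim.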
The remaining detail is for readers who want benchmark comparisons, validation narrative, limitations, references, and audit detail.
The production model carries 122 explicit exclusions across residual, split-code, and out-of-scope military rows. They remain visible as excluded rows with provenance rather than being silently scored.
Reason categories are A residual (residual_soc_no_descriptor_data), B split-code (split_code_no_work_activities), and C military (out_of_scope_military).
Donor imputation was analyzed and is not included in the production method.
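One way to keep excluded rows visible rather than silently scored or dropped is sketched below. The three reason codes come from the text above; the `ScoreRow` shape, the `build_row` helper, the provenance strings, and the occupation keys are hypothetical and are not Talvio's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Reason codes as listed in the methodology text.
EXCLUSION_REASONS = {
    "residual_soc_no_descriptor_data": "A residual",
    "split_code_no_work_activities": "B split-code",
    "out_of_scope_military": "C military",
}


@dataclass(frozen=True)
class ScoreRow:
    occupation: str
    score: Optional[float]           # None for excluded rows
    exclusion_reason: Optional[str]  # one of EXCLUSION_REASONS, or None
    provenance: str                  # hypothetical free-text source note


def build_row(occupation, provenance, score=None, exclusion_reason=None):
    """Return a scored row or an explicit excluded row, never a silent gap."""
    if exclusion_reason is not None:
        if exclusion_reason not in EXCLUSION_REASONS:
            raise ValueError(f"unknown exclusion reason: {exclusion_reason}")
        # Excluded rows carry no score; nothing is imputed from donor rows.
        return ScoreRow(occupation, None, exclusion_reason, provenance)
    if score is None:
        raise ValueError("a row without a score needs an exclusion reason")
    return ScoreRow(occupation, score, None, provenance)


# Placeholder occupations for illustration only.
rows = [
    build_row("occ_a", "scored from work activities", score=0.575),
    build_row("occ_b", "no descriptor data",
              exclusion_reason="residual_soc_no_descriptor_data"),
    build_row("occ_c", "military occupation",
              exclusion_reason="out_of_scope_military"),
]

excluded = [r for r in rows if r.exclusion_reason is not None]
```

Requiring either a score or a known reason code means every occupation appears in the output with an auditable status.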