Talvio Methodology White Paper

A methodology paper for Talvio's work-activity-based AI augmentation potential score. The summary states the scoped claim up front; the methodological tail shows the validation path, failed gates, coverage exclusions, and the inert-capability diagnostic.

Section 1

Key Findings / Plain Summary

Talvio is a work-activity-based method for ranking occupations and departments by AI augmentation potential as a training-prioritization signal. It is meant to help learning and development teams decide where structured AI enablement is most worth examining first, which work activities explain that priority, and which capability areas should shape training.

In plain terms, Talvio's ranking matches established external exposure measures strongly and matches real-world AI-use data moderately.

The validation claim is deliberately narrow. Talvio's priority ranking aligns strongly with Felten, Raj, and Seamans's AI Occupational Exposure measure, with the published validation line using rho = 0.920 on n = 682 SOC6 occupations. Because that referent is also derived from O*NET structure, Talvio also tests against Anthropic Economic Index observed AI task-use. On the n = 480 SOC6 common set, Talvio's correlation with AEI is rho = 0.424, and Felten AIOE's correlation with AEI is rho = 0.414. The AEI result is moderate and covers a 54% SOC6 subset. The claim is therefore not that Talvio highly predicts AI use. It is that Talvio corresponds to observed AI task-use about as well as the best structural exposure measure while remaining useful as a training-prioritization signal.

The main limitation is also straightforward. In the corrected validation, replacing the 12 reviewed AI capability maturity scores with a uniform capability value produced effectively identical occupation rankings: rho approximately 1.00 on n = 894 direct-scored occupations. As of the 2026-05-16 corrected run, capability maturity is not the driver of the occupation-priority score. The score is driven primarily by O*NET work-activity structure, importance/level weighting, and the physical/embodied work ceiling. The capability layer remains useful for training design: it describes what AI skills to teach once a role or department has been prioritized.

Talvio directly scores 894 occupations and explicitly carries 122 occupations as excluded because they lack O*NET 30.2 Work Activities coverage. Those exclusions are part of the method's scope boundary.

Section 2

Why Measuring This Is Hard

AI workforce measurement is difficult because nearby ideas are easy to confuse. Task exposure, observed AI task-use, and model performance on realistic deliverables are related, but they are not interchangeable. A role can have high work-activity exposure without frequent current AI use. A benchmark can show strong model performance on selected deliverables without validating an occupation-wide training-priority ranking.

Talvio borrows the exposure-measurement machinery of the AIOE/Eloundou lineage, but it is scoped to training prioritization rather than staffing-impact or unemployment forecasting, and it should not be evaluated as a staffing-outcome predictor.

Talvio sits in the task-based exposure lineage. Felten, Raj, and Seamans provide the load-bearing external validator and methodological ancestor: AIOE maps AI applications onto occupational structure and is available at SOC granularity. Eloundou, Manning, Mishkin, and Rock provide a task-level rubric for large language model exposure. Webb's patent-to-task mapping is a contrasting path that Talvio did not adopt, because the present method needed an auditable work-activity spine that maps cleanly into training conversations. Frey and Osborne remain a useful foil for how influential staffing-impact prediction frames can travel farther than their practical uncertainty should allow.

The lesson from that literature is not that measurement should be abandoned. It is that construct boundaries matter. Talvio therefore separates the evidence tests: Felten AIOE is the structural exposure check, AEI is the independent observed-use check, and GDPval is treated as a real-world deliverable-quality benchmark rather than as validation of occupation-priority ranking.

The numbers reinforce the distinction. On the n = 480 SOC6 common set, Talvio vs. AEI is rho = 0.424, while Felten AIOE vs. AEI is rho = 0.414. GDPval produces near-zero correlations with Talvio under the primary weighting approach, at -0.03, 0.08, and 0.07 on n = 41 GDPval occupations. Talvio treats that GDPval result as construct divergence: GDPval evaluates peak deliverable quality in selected occupations, while Talvio ranks work-activity exposure for training sequence.
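The rank comparisons above are ordinary Spearman correlations. A minimal sketch of the computation on synthetic stand-in data (the array names and the simulated relationship are illustrative, not the production data):

```python
import numpy as np

def spearman(a, b):
    """Spearman rho for tie-free data: Pearson correlation of the ranks."""
    rank = lambda x: np.argsort(np.argsort(x))
    return np.corrcoef(rank(a), rank(b))[0, 1]

# Hypothetical stand-ins for occupation-level scores on a shared SOC6 set.
rng = np.random.default_rng(0)
n = 480
talvio_tap = rng.normal(68, 7, n)                  # structural exposure score
aei_use = 0.45 * talvio_tap + rng.normal(0, 7, n)  # weakly related observed use

rho = spearman(talvio_tap, aei_use)
print(f"rho = {rho:.3f} on n = {n}")
```

The same routine applied to the real Talvio, Felten AIOE, and AEI series on the identical common set produces the paired correlations reported above; the simulation simply illustrates what a moderate rank correlation looks like at this sample size.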

Section 3

The Capability-First Approach As Designed

Talvio began with a capability-first hypothesis: score occupations by combining O*NET work-activity importance and level with a reviewed matrix linking work activities to AI capability areas. The occupation layer uses O*NET 30.2 Work Activities, including 41 generalized work activity elements with importance and level scales.

The capability layer contains 12 reviewed AI capability maturity scores. In the corrected 2026-05-16 input, those scores range from 7.1 to 9.2. A separate physical/embodied work ceiling is retained at 1.5 and is not sourced from the capability-score workbook. The primary scoring approach multiplies work-activity importance by work-activity level; an importance-only version is retained as a sensitivity check.

The corrected scoring output has 1,016 rows: 894 direct-scored rows and 122 excluded rows. Among direct-scored rows, the primary 0-100 TAP score distribution has mean 68.38, standard deviation 7.07, minimum 45.10, maximum 82.14, and modal 10-point band share 44.1%.

The design remains useful, but the validation changed what can be claimed about it. The capability layer was designed as part of the scoring architecture. The corrected testing showed that capability maturity differences do not materially explain the resulting rank order.

Figure 1. Two-layer methodology architecture. Talvio separates the ranking spine from the training-design layer: O*NET work activities, importance/level weighting, and the physical ceiling drive TAP, while the 12 capabilities describe what to teach. The 41 O*NET 30.2 activities define the task spine; importance x level weights emphasize central role work; embodied and real-time work is bounded by the 1.5/10 ceiling; 894 occupations receive direct training-priority scores; and the 122 exclusions stay visible for honest coverage.

Section 4

What Testing Showed

The key diagnostic was simple. The corrected rerun recomputed Talvio's priority score after setting the 12 reviewed AI capability maturity scores to their uniform mean of 8.11 while retaining the physical/embodied ceiling at 1.5. The comparison between the production score and the uniform-capability score is effectively identical at rho approximately 1.00 on n = 894 direct-scored occupations. The importance-only sensitivity comparison is also effectively identical at rho approximately 1.00 on n = 894 direct-scored occupations.

This means the capability maturity spread is not materially changing occupation ranks. The occupation-priority score is instead dominated by O*NET work-activity structure, importance/level weighting, and the physical/embodied ceiling. The published configuration therefore treats capability maturity as non-explanatory for the current ranking.

The other validation gates are mixed but useful. The cross-industry diagnostic panel passed 50 of 50 checks, with 0 total fails and 0 healthcare fails. The driver consistency gate found 0 implausible top-3 GWA drivers. The distribution gate failed because the primary span was 37.04 against a threshold of at least 40 and the modal band share was 44.1% against a threshold of at most 25%. The legacy-comparison gate also failed: the corrected ranking remained closer to the previous ranking than the gate allowed.

The result is a scoped claim rather than a broad one. The method is validated as a work-activity-based training-prioritization signal with specific external support and specific limitations. It is not evidence that capability maturity scores drive occupation priority.

Figure 2. Theoretical surface vs realized training-priority area. The centerpiece area view uses diagnostic-panel category means: higher teal area means more work-activity exposure for training priority, while each category still retains non-teal area for human context, embodied work, judgment, or local workflow fit. Panel means (TAP): knowledge professional 78.03 (n=4), technical knowledge 74.82 (n=5), office admin 73.84 (n=4), care and public-facing 68.16 (n=8), physical service 60.20 (n=11). White square = 100-point surface; teal fill = realized training-priority area.
Figure 3. Uniform-capability diagnostic. Replacing the 12 reviewed AI capability maturity scores with their uniform mean of 8.11 leaves the primary occupation ranking effectively identical (rho approximately 1.00, n = 894; physical/embodied ceiling retained at 1.5/10), so the capability layer is disclosed as training context rather than the ranking driver. Examples across the score range (real TAP -> uniform-capability TAP): heavy truck drivers 58.3 -> 58.6; construction laborers 61.4 -> 61.8; retail salespersons 66.1 -> 66.3; registered nurses 68.0 -> 68.4; office clerks, general 72.2 -> 72.6; software developers 79.6 -> 80.1.

Section 5

What It Is Validated To Do

Talvio is validated for one main use: ranking occupation and department training priority from work-activity exposure. The published validation line pairs Felten AIOE rho = 0.920 on n = 682 SOC6 occupations with AEI observed AI task-use rho = 0.424 on the n = 480 SOC6 common set from the March 24, 2026 Anthropic Economic Index release.

The Felten validation is strong but structurally close. In the corrected validation package, Felten AIOE SOC6 validation has rho = 0.920 on n = 682 SOC6 occupations. The SOC3 gate result is rho = 0.953 on n = 89 SOC3 groups. That is the strongest single validation referent, but it carries the circularity caveat because both Talvio and AIOE are O*NET-structure-derived.

The AEI validation addresses that caveat for the scoped claim. On the identical n = 480 SOC6 common set, Talvio vs. AEI is rho = 0.424, Talvio vs. Felten AIOE is rho = 0.924, and Felten AIOE vs. AEI is rho = 0.414. AEI is independent of O*NET structure and measures observed AI task-use, so the moderate AEI result is useful corroboration rather than another structural mirror.

GDPval is the third referent, and it behaves differently. The primary GDPval panel uses n = 41 GDPval occupations and produces rho values of -0.03, 0.08, and 0.07. On the same direct-TAP GDPval pool with Felten values available, the GDPval vs. Felten diagnostic uses n = 38 GDPval/Felten same-pool occupations and produces rho values of -0.07, 0.03, and 0.02. Talvio treats this as expected near-orthogonality: GDPval measures a different construct from work-activity exposure.

The cross-sector panel supports face validity for training priority. The diagnostic-panel means, a subset of which anchor Figure 5, are knowledge-professional 78.03, technical-knowledge 74.82, care/public-facing 68.16, service-physical 62.04, and physical-service 60.20.

Figure 4. External-validation comparison. The external-validation pattern is strong with Felten AIOE (rho = 0.920, n = 682 SOC6; O*NET-structure-derived exposure), moderate with AEI observed task use (rho = 0.424, n = 480 SOC6 common set), and near zero with GDPval (rho range -0.03 to 0.08, n = 41); GDPval is labeled as a different construct rather than treated as capability-level validation.
Figure 5. Cross-sector ranking pattern. Representative category means under the primary importance-and-level weighting approach follow the intended ordering for training priority: knowledge-professional roles average 78.03 (n=4), care and public-facing roles 68.16 (n=8), and physical-service roles 60.20 (n=11).

Section 6

What It Does Not Claim

Talvio does not claim to forecast staffing changes or labor-market outcomes. It ranks training priority.

Talvio does not claim that capability maturity scores explain the occupation ranking as of 2026-05-16. The uniform-capability diagnostic is effectively identical at rho approximately 1.00 on n = 894 direct-scored occupations, so the production method treats capability maturity as training-design context rather than ranking causality.

Talvio does not claim that GDPval validates capability-level differentiation. GDPval is valuable evidence about model performance on real-world deliverables, but the Talvio-GDPval correlations are near zero with n = 41 GDPval occupations and are treated as construct divergence.

Talvio does not silently fill occupations that lack O*NET 30.2 Work Activities coverage. The current coverage gap is 122 rows: 77 residual SOC/no descriptor rows, 26 split-code/no Work Activities rows, and 19 out-of-scope military rows. Donor imputation remains outside the production method, and split-code occupations are deferred to a future research path pending validated donor rules and disclosure standards.

Section 7

Methodological Detail / Rigorous Tail

7.1 Inputs and Claim Contract

The corrected Phase 4 validation run is dated 2026-05-16. The governing scoped-claims contract is human-reviewed. It defines the public claim as AI augmentation potential for training prioritization, not a staffing-impact forecast.

The occupation substrate is O*NET 30.2 Work Activities. O*NET 30.2 is distributed by the U.S. Department of Labor / Employment and Training Administration through the National Center for O*NET Development under CC BY 4.0 for the downloadable database. Talvio uses the Work Activities file and associated generalized work activity structure as the occupation spine.

The capability substrate is the human-reviewed capability registry. The 12 AI capability contexts and their primary benchmark anchors are:

Capability context | Primary benchmark anchor | As of
Written content generation and editing | HELM Capabilities | 2026-05-16
Information synthesis and research | Artificial Analysis Intelligence Index | 2026-05-16
Structured data analysis and quantitative reasoning | Artificial Analysis Intelligence Benchmarking | 2026-05-16
Coding and software engineering | SWE-bench | 2026-05-16
Conversational support and customer interaction | Arena Text Leaderboard | 2026-05-16
Translation and cross-language work | Artificial Analysis Multilingual Index | 2026-05-16
Speech and audio processing | Artificial Analysis Speech to Text Leaderboard | 2026-05-16
Image and document understanding | MMMU-Pro | 2026-05-16
Image, video, and design generation | Artificial Analysis Image Model Leaderboard | 2026-05-16
Planning, scheduling, and structured decision support | GDPval | 2026-05-16
Tool use and agent workflows | METR Time Horizons | 2026-05-16
Domain-specialist reasoning | GPQA | 2026-05-16

The benchmark URLs are carried in the audit registry used to build this paper. Where canonical papers exist, the reference list cites them directly. Where a benchmark is maintained as a live leaderboard, the registry URL and retrieval date are the audit source.

7.2 Scoring Spine

The primary scoring approach multiplies O*NET Work Activities importance by level. For each directly scored occupation, Talvio uses those values, combines them with the reviewed work-activity-to-capability matrix, applies the physical/embodied ceiling, and converts the result to a 0-100 TAP score. A secondary importance-only version is retained for sensitivity checks.
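The steps above can be sketched schematically. All values and dimensions below are invented toy data (the real 41-GWA matrix, embodied-activity flags, and rescaling are not reproduced here); the sketch only shows the shape of the computation: importance x level weights, a capability-mix addressability per activity, the physical ceiling, and conversion to a 0-100 score.

```python
import numpy as np

# Toy dimensions: 3 occupations x 4 generalized work activities, 2 capability
# areas (the production method uses 41 GWAs and 12 capabilities).
importance = np.array([[4.2, 3.1, 2.0, 4.8],
                       [3.0, 4.5, 4.1, 2.2],
                       [2.1, 2.0, 4.9, 3.3]])
level = np.array([[5.1, 4.0, 3.2, 5.5],
                  [3.8, 5.0, 4.6, 3.0],
                  [2.9, 2.5, 5.2, 4.1]])

# Hypothetical activity-to-capability links (rows: GWAs, cols: capabilities).
matrix = np.array([[1.0, 0.0],
                   [0.5, 0.5],
                   [0.0, 1.0],
                   [1.0, 0.0]])
capability = np.array([8.5, 7.5])              # toy reviewed maturity scores
embodied = np.array([False, False, True, False])  # toy embodied-activity flag

addressability = matrix @ capability            # per-GWA AI addressability (0-10)
addressability = np.where(embodied,
                          np.minimum(addressability, 1.5),  # physical ceiling
                          addressability)

weights = importance * level                    # primary weighting approach
tap = 10 * (weights * addressability).sum(1) / weights.sum(1)  # 0-100 scale
print(tap.round(2))
```

The importance-only sensitivity version replaces `weights = importance * level` with `weights = importance` and leaves the rest unchanged.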

The corrected scoring output has 1,016 rows, 894 direct-scored rows, and 122 excluded rows. The direct-scored primary 0-100 TAP distribution is:

Metric | Value
Direct-scored occupations | 894
Mean | 68.38
Median | 68.84
Minimum | 45.10
Maximum | 82.14
Standard deviation | 7.07
Span | 37.04
Modal band count | 394
Modal band share | 44.1%

The distribution concentration is disclosed as a characteristic of the validated method. It is not hidden by recalibration.
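The span and modal-band metrics, and the distribution-gate thresholds applied to them in Section 7.3, are mechanical. A sketch on synthetic scores, assuming the modal band is the most populated 10-point bin on the 0-100 scale (the real 894-row output and the exact gate wiring are not reproduced here):

```python
import numpy as np

def distribution_report(scores, span_min=40.0, modal_share_max=0.25):
    """Summarize a 0-100 score array and apply the distribution gate."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    # Modal 10-point band: the most populated of [0,10), [10,20), ..., [90,100].
    bands = np.floor(scores / 10).astype(int)
    modal_share = np.bincount(bands).max() / scores.size
    return {
        "mean": scores.mean(),
        "std": scores.std(),
        "span": span,
        "modal_band_share": modal_share,
        "gate_pass": bool(span >= span_min and modal_share <= modal_share_max),
    }

rng = np.random.default_rng(1)
toy_scores = np.clip(rng.normal(68.4, 7.1, 894), 0, 100)  # toy stand-in
print(distribution_report(toy_scores))
```

A distribution shaped like the corrected output, with span 37.04 and modal band share 44.1%, fails both thresholds, which is exactly what the distribution gate records.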

7.3 Validation Gates

The corrected validation gates are:

Gate | Result | Observed value
External structural validation | pass | Felten SOC3 rho = 0.953
Diagnostic panel | pass | 50/50 checks pass; 0 total fails; 0 healthcare fails
Driver consistency | pass | 0 implausible top-3 GWA drivers
Distribution range and concentration | fail | span = 37.04; modal band share = 44.1%
Relationship to previous ranking | fail | rho = 0.915
Healthcare service-line review | pass | deviations documented
GDPval capability-compression review | manual review required | all primary GDPval series rho < 0.30

The two failed gates are part of the evidence trail. The distribution gate shows that the score distribution remains concentrated. The legacy-comparison gate shows that the corrected work-activity-based ranking is still very close to the previous ranking. Neither gate is converted into a stronger claim.

7.4 External Referents

Felten AIOE is the primary structural referent. Talvio's primary strategy correlates with Felten AIOE at rho = 0.920 on n = 682 SOC6 occupations and rho = 0.953 on n = 89 SOC3 groups.

AEI is the independent observed-use referent. The validation addendum uses the March 24, 2026 Anthropic Economic Index release, based on Claude conversations from February 5 to February 12, 2026. The denominator path is explicit: 894 direct-scored Talvio occupation rows, 774 unique direct SOC6 codes, 670 direct rows matching AEI SOC6, 554 unique direct SOC6 codes matching AEI, 575 direct rows in the common AEI/Felten set, and 480 unique SOC6 codes in that common set.
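The denominator path is a chain of SOC6 joins: direct rows, unique codes, AEI matches, then the common AEI/Felten set. A pandas sketch of the matching logic with toy frames (the SOC6 codes, column names, and values here are hypothetical stand-ins, not the production schema):

```python
import pandas as pd

# Toy frames keyed by SOC6 code; note the deliberate duplicate SOC6 row,
# which is why row counts and unique-code counts diverge in the real path.
talvio = pd.DataFrame({"soc6": ["11-1011", "15-1252", "15-1252", "53-3032"],
                       "talvio_tap": [74.1, 79.6, 79.6, 58.3]})
aei = pd.DataFrame({"soc6": ["15-1252", "53-3032"], "aei_use": [0.31, 0.02]})
felten = pd.DataFrame({"soc6": ["11-1011", "15-1252"], "felten_aioe": [0.8, 1.1]})

direct_rows = len(talvio)                    # analog of 894 direct rows
unique_soc6 = talvio["soc6"].nunique()       # analog of 774 unique codes
aei_match = talvio.merge(aei, on="soc6")     # rows matching AEI SOC6
common = aei_match.merge(felten, on="soc6")  # common AEI/Felten set
print(direct_rows, unique_soc6, len(aei_match), common["soc6"].nunique())  # 4 3 3 1
```

Inner joins at each step make every denominator in the chain reproducible from the row counts, which is how the 894 / 774 / 670 / 554 / 575 / 480 sequence is audited.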

The common-set correlations are:

Comparison | n | Spearman rho
Talvio primary 0-100 TAP score vs AEI observed AI task-use | 480 | 0.424
Felten AIOE vs AEI observed AI task-use | 480 | 0.414
Talvio primary 0-100 TAP score vs Felten AIOE | 480 | 0.924

GDPval is the peak-deliverable-quality referent. The GDPval panel under the primary weighting approach uses n = 41 and produces rho values of -0.03, 0.08, and 0.07. GDPval join coverage is 41 of 44 occupations with direct TAP scores. Per-occupation GDPval values were transcribed from OpenAI GDPval PDF Figure 11 rather than taken from a machine-readable per-occupation result table.

7.5 Causal Diagnostic

The uniform-capability diagnostic is the key causal check on the capability-first hypothesis. It sets the 12 reviewed AI capability maturity scores to a uniform mean of 8.11 while retaining the physical/embodied ceiling at 1.5.

The production-score versus uniform-capability-score result is effectively identical at rho approximately 1.00 on n = 894 direct-scored occupations. The importance-only sensitivity result is also effectively identical at rho approximately 1.00 on n = 894 direct-scored occupations. The rigorous tail reports this as rho approximately 1.00 (effectively identical; machine output reported as 0.9999), n = 894 direct-scored occupations.
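The diagnostic is a one-variable ablation: rescore with the capability vector replaced by its mean, then rank-correlate against the production score. A sketch on fully synthetic data (all arrays below are invented; the point is the mechanism, under the assumption that activity structure and the physical ceiling, not capability spread, carry the between-occupation variation):

```python
import numpy as np

def spearman(a, b):
    """Spearman rho for tie-free data: Pearson correlation of the ranks."""
    rank = lambda x: np.argsort(np.argsort(x))
    return np.corrcoef(rank(a), rank(b))[0, 1]

rng = np.random.default_rng(2)
n_occ, n_gwa, n_cap = 894, 41, 12

importance = rng.uniform(1, 5, (n_occ, n_gwa))      # toy O*NET importance
level = rng.uniform(1, 7, (n_occ, n_gwa))           # toy O*NET level
matrix = rng.dirichlet(np.ones(n_cap), size=n_gwa)  # toy GWA-to-capability links
capability = rng.uniform(7.1, 9.2, n_cap)           # toy reviewed maturity scores
embodied = rng.random(n_gwa) < 0.3                  # toy embodied-activity flags

def tap(cap_scores):
    addr = matrix @ cap_scores                              # per-GWA addressability
    addr = np.where(embodied, np.minimum(addr, 1.5), addr)  # physical ceiling
    weights = importance * level                            # primary weighting
    return 10 * (weights * addr).sum(1) / weights.sum(1)

real = tap(capability)
uniform = tap(np.full(n_cap, capability.mean()))  # ablation: uniform capability
print(f"rho = {spearman(real, uniform):.4f}")
```

Even this toy setup reproduces the qualitative finding: when every activity mixes many capabilities whose scores sit in a narrow band, flattening the capability vector barely moves the rank order, and the ceiling plus weight structure dominate.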

The interpretation is narrow but important: capability maturity scores are training-design context, not the current driver of occupation-priority rank. The capability layer remains useful for curriculum design, but score explanation should remain work-activity-based until a future validation demonstrates material capability-layer differentiation.

7.6 Coverage Handling

Talvio carries explicit coverage exclusions. The 122 excluded rows are split into 77 residual SOC/no descriptor rows, 26 split-code/no Work Activities rows, and 19 out-of-scope military rows. These rows remain visible as exclusions rather than being silently dropped from the denominator.

Residual "All Other" SOC rows are category-level non-coverage by construction: these codes do not have O*NET Work Activities profiles by design, so they are not 77 individually meaningful omissions.

Military rows are out of scope by design for Talvio's civilian occupation scope, not a coverage failure.

The 26 split-code occupations are: Legislators; Public Relations Managers; Project Management Specialists; Financial and Investment Analysts; Financial Risk Specialists; Web and Digital Interface Designers; Data Scientists; Calibration Technologists and Technicians; Hydrologic Technicians; Special Education Teachers, Kindergarten; Disc Jockeys, Except Radio; Lighting Technicians; Cardiologists; Orthopedic Surgeons, Except Pediatric; Pediatric Surgeons; Emergency Medical Technicians; Paramedics; Medical Records Specialists; Health Information Technologists and Medical Registrars; School Bus Monitors; First-Line Supervisors of Entertainment and Recreation Workers, Except Gambling Services; Crematory Operators; Sales Representatives of Services, Except Advertising, Insurance, Financial Services, and Travel; First-Line Supervisors of Passenger Attendants; Taxi Drivers; Aircraft Service Attendants.

The complete 122-row exclusion list is maintained as a supplementary coverage-exclusion list for audit review.

Donor mapping review materials exist, but donor imputation is outside the production scoring scope. The split-code group remains a future research path, contingent on a validated donor method and a clear disclosure standard.

Section 8

Limitations + DWA-Spine Open Question

The first limitation is the inert-capability finding. Talvio's design began with a capability-first hypothesis, but the corrected diagnostic showed an effectively identical relationship, rho approximately 1.00 on n = 894 direct-scored occupations, between real TAP and uniform-capability TAP. The method should therefore be described as work-activity-driven for ranking and capability-informed for training design.

The second limitation is AEI scope. AEI provides independent observed-use corroboration, but the published AEI claim is moderate and computed on a 54% SOC6 subset. The common-set audit table shows the exact denominator path: 894 direct-scored occupation rows, 774 unique direct SOC6 codes, 670 direct rows matching AEI SOC6, 554 unique direct SOC6 codes matching AEI, 575 direct rows in the common AEI/Felten set, and 480 unique SOC6 codes in that common set.

The third limitation is GDPval data shape. GDPval is a valuable real-world capability benchmark, but Talvio's per-occupation GDPval correlations rely on values transcribed from OpenAI GDPval PDF Figure 11. The n = 41 primary panel should be treated as evidence about construct divergence, not as a machine-readable benchmark integration.

The fourth limitation is coverage. Talvio directly scores 894 occupations and carries 122 exclusions: 77 residual SOC/no descriptor rows, 26 split-code/no Work Activities rows, and 19 out-of-scope military rows. Any interpretation of rank gaps must account for those exclusions.

The fifth limitation is granularity. The current method uses the GWA spine because it gives a stable cross-occupation structure with Work Activities coverage. A DWA-spine method might preserve more occupation-specific variation, especially for split-code occupations, but it would require new donor decisions, a new validation cycle, and a clean rule for when detailed activity data can be used without overstating precision. For the production method, DWA-spine and donor imputation remain open questions rather than current scoring claims.

Section 9

References

External Literature and Data Sources

Capability Benchmark Sources