Talvio Methodology White Paper

A methodology paper for Talvio's work-activity-based AI augmentation potential score. The summary states the scoped claim up front; the methodological tail shows the validation path, failed gates, coverage exclusions, and the inert-capability diagnostic.

Section 1

Key Findings / Plain Summary

Talvio is a work-activity-based method for ranking occupations and departments by AI augmentation potential as a training-prioritization signal. It is meant to help learning and development teams decide where structured AI enablement is most worth examining first, which work activities explain that priority, and which capability areas should shape training.

In plain terms, Talvio's ranking matches established external exposure measures strongly and matches real-world AI-use data moderately.

The validation claim is deliberately narrow. Talvio's priority ranking aligns strongly with Felten, Raj, and Seamans's AI Occupational Exposure measure, with the published validation line using rho = 0.920 on n = 682 SOC6 occupations. Because that referent is also derived from O*NET structure, Talvio also tests against Anthropic Economic Index observed AI task-use. On the n = 480 SOC6 common set, Talvio's correlation with AEI is rho = 0.424, and Felten AIOE's correlation with AEI is rho = 0.414. The AEI result is moderate and covers a 54% SOC6 subset. The claim is therefore not that Talvio highly predicts AI use. It is that Talvio corresponds to observed AI task-use about as well as the best structural exposure measure while remaining useful as a training-prioritization signal.

The main limitation is also straightforward. In the corrected validation, replacing the 12 reviewed AI capability maturity scores with a uniform capability value produced effectively identical occupation rankings: rho approximately 1.00 on n = 894 direct-scored occupations. As of the 2026-05-16 corrected run, capability maturity is not the driver of the occupation-priority score. The score is driven primarily by O*NET work-activity structure, importance/level weighting, and the physical/embodied work ceiling. The capability layer remains useful for training design: it describes what AI skills to teach once a role or department has been prioritized.

Talvio directly scores 894 occupations and explicitly carries 122 occupations as excluded because they lack O*NET 30.2 Work Activities coverage. Those exclusions are part of the method's scope boundary.

Section 2

Why Measuring This Is Hard

AI workforce measurement is difficult because nearby ideas are easy to confuse. Task exposure, observed AI task-use, and model performance on realistic deliverables are related, but they are not interchangeable. A role can have high work-activity exposure without frequent current AI use. A benchmark can show strong model performance on selected deliverables without validating an occupation-wide training-priority ranking.

Talvio borrows the exposure-measurement machinery of the AIOE/Eloundou lineage, but it is scoped to training prioritization rather than staffing-impact or unemployment forecasting, and it should not be evaluated as a staffing-outcome predictor.

Talvio sits in the task-based exposure lineage. Felten, Raj, and Seamans provide the load-bearing external validator and methodological ancestor: AIOE maps AI applications onto occupational structure and is available at SOC granularity. Eloundou, Manning, Mishkin, and Rock provide a task-level rubric for large language model exposure. Webb's patent-to-task mapping is a contrasting path that Talvio did not adopt, because the present method needed an auditable work-activity spine that maps cleanly into training conversations. Frey and Osborne remain a useful foil for how influential staffing-impact prediction frames can travel farther than their practical uncertainty should allow.

The lesson from that literature is not that measurement should be abandoned. It is that construct boundaries matter. Talvio therefore separates the evidence tests: Felten AIOE is the structural exposure check, AEI is the independent observed-use check, and GDPval is treated as a real-world deliverable-quality benchmark rather than as validation of occupation-priority ranking.

The numbers reinforce the distinction. On the n = 480 SOC6 common set, Talvio vs. AEI is rho = 0.424, while Felten AIOE vs. AEI is rho = 0.414. GDPval produces near-zero correlations with Talvio under the primary weighting approach, at -0.03, 0.08, and 0.07 on n = 41 GDPval occupations. Talvio treats that GDPval result as construct divergence: GDPval evaluates peak deliverable quality in selected occupations, while Talvio ranks work-activity exposure for training sequence.
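The rank comparisons above are ordinary Spearman correlations. A minimal sketch of the computation on synthetic stand-in data (the array names and the simulated relationship are illustrative, not the production data):

```python
import numpy as np

def spearman(a, b):
    """Spearman rho for tie-free data: Pearson correlation of the ranks."""
    rank = lambda x: np.argsort(np.argsort(x))
    return np.corrcoef(rank(a), rank(b))[0, 1]

# Hypothetical stand-ins for occupation-level scores on a shared SOC6 set.
rng = np.random.default_rng(0)
n = 480
talvio_tap = rng.normal(68, 7, n)                  # structural exposure score
aei_use = 0.45 * talvio_tap + rng.normal(0, 7, n)  # weakly related observed use

rho = spearman(talvio_tap, aei_use)
print(f"rho = {rho:.3f} on n = {n}")
```

The same routine applied to the real Talvio, Felten AIOE, and AEI series on the identical common set produces the paired correlations reported above; the simulation simply illustrates what a moderate rank correlation looks like at this sample size.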

Section 3

The Capability-First Approach As Designed

Talvio began with a capability-first hypothesis: score occupations by combining O*NET work-activity importance and level with a reviewed matrix linking work activities to AI capability areas. The occupation layer uses O*NET 30.2 Work Activities, including 41 generalized work activity elements with importance and level scales.

The capability layer contains 12 reviewed AI capability maturity scores. In the corrected 2026-05-16 input, those scores range from 7.1 to 9.2. A separate physical/embodied work ceiling is retained at 1.5 and is not sourced from the capability-score workbook. The primary scoring approach multiplies work-activity importance by work-activity level; an importance-only version is retained as a sensitivity check.

The corrected scoring output has 1,016 rows: 894 direct-scored rows and 122 excluded rows. Among direct-scored rows, the primary 0-100 TAP score distribution has mean 68.38, standard deviation 7.07, minimum 45.10, maximum 82.14, and modal 10-point band share 44.1%.

The design remains useful, but the validation changed what can be claimed about it. The capability layer was designed as part of the scoring architecture. The corrected testing showed that capability maturity differences do not materially explain the resulting rank order.

Figure 1. Two-layer methodology architecture. Talvio separates the ranking spine from the training-design layer: O*NET work activities, importance/level weighting, and the physical ceiling drive TAP, while the 12 capabilities describe what to teach. The 41 O*NET 30.2 activities define the task spine; importance x level weights emphasize central role work; embodied and real-time work is bounded by the 1.5/10 ceiling; 894 occupations receive direct training-priority scores; and the 122 exclusions stay visible for honest coverage.

Section 4

What Testing Showed

The key diagnostic was simple. The corrected rerun recomputed Talvio's priority score after setting the 12 reviewed AI capability maturity scores to their uniform mean of 8.11 while retaining the physical/embodied ceiling at 1.5. The comparison between the production score and the uniform-capability score is effectively identical at rho approximately 1.00 on n = 894 direct-scored occupations. The importance-only sensitivity comparison is also effectively identical at rho approximately 1.00 on n = 894 direct-scored occupations.

This means the capability maturity spread is not materially changing occupation ranks. The occupation-priority score is instead dominated by O*NET work-activity structure, importance/level weighting, and the physical/embodied ceiling. The published configuration therefore treats capability maturity as non-explanatory for the current ranking.

The other validation gates are mixed but useful. The cross-industry diagnostic panel passed 50 of 50 checks, with 0 total fails and 0 healthcare fails. The driver consistency gate found 0 implausible top-3 GWA drivers. The distribution gate failed because the primary span was 37.04 against a threshold of at least 40 and the modal band share was 44.1% against a threshold of at most 25%. The legacy-comparison gate also failed: the corrected ranking remained closer to the previous ranking than the gate allowed.

The result is a scoped claim rather than a broad one. The method is validated as a work-activity-based training-prioritization signal with specific external support and specific limitations. It is not evidence that capability maturity scores drive occupation priority.

Figure 2. Theoretical surface vs realized training-priority area. The centerpiece area view uses diagnostic-panel category means: higher teal area means more work-activity exposure for training priority, while each category still retains non-teal area for human context, embodied work, judgment, or local workflow fit. Panel means (TAP): knowledge professional 78.03 (n=4), technical knowledge 74.82 (n=5), office admin 73.84 (n=4), care and public-facing 68.16 (n=8), physical service 60.20 (n=11). White square = 100-point surface; teal fill = realized training-priority area.
Figure 3. Uniform-capability diagnostic. Replacing the 12 reviewed AI capability maturity scores with their uniform mean of 8.11 leaves the primary occupation ranking effectively identical (rho approximately 1.00, n = 894; physical/embodied ceiling retained at 1.5/10), so the capability layer is disclosed as training context rather than the ranking driver. Examples across the score range (real TAP -> uniform-capability TAP): heavy truck drivers 58.3 -> 58.6; construction laborers 61.4 -> 61.8; retail salespersons 66.1 -> 66.3; registered nurses 68.0 -> 68.4; office clerks, general 72.2 -> 72.6; software developers 79.6 -> 80.1.

Section 5

What It Is Validated To Do

Talvio is validated for one main use: ranking occupation and department training priority from work-activity exposure. The published validation line pairs Felten AIOE rho = 0.920 on n = 682 SOC6 occupations with AEI observed AI task-use rho = 0.424 on the n = 480 SOC6 common set from the March 24, 2026 Anthropic Economic Index release.

The Felten validation is strong but structurally close. In the corrected validation package, Felten AIOE SOC6 validation has rho = 0.920 on n = 682 SOC6 occupations. The SOC3 gate result is rho = 0.953 on n = 89 SOC3 groups. That is the strongest single validation referent, but it carries the circularity caveat because both Talvio and AIOE are O*NET-structure-derived.

The AEI validation addresses that caveat for the scoped claim. On the identical n = 480 SOC6 common set, Talvio vs. AEI is rho = 0.424, Talvio vs. Felten AIOE is rho = 0.924, and Felten AIOE vs. AEI is rho = 0.414. AEI is independent of O*NET structure and measures observed AI task-use, so the moderate AEI result is useful corroboration rather than another structural mirror.

GDPval is the third referent, and it behaves differently. The primary GDPval panel uses n = 41 GDPval occupations and produces rho values of -0.03, 0.08, and 0.07. On the same direct-TAP GDPval pool with Felten values available, the GDPval vs. Felten diagnostic uses n = 38 GDPval/Felten same-pool occupations and produces rho values of -0.07, 0.03, and 0.02. Talvio treats this as expected near-orthogonality: GDPval measures a different construct from work-activity exposure.

The cross-sector panel supports face validity for training priority. The diagnostic-panel means, a subset of which anchor Figure 5, are knowledge-professional 78.03, technical-knowledge 74.82, care/public-facing 68.16, service-physical 62.04, and physical-service 60.20.

Figure 4. External-validation comparison. The external-validation pattern is strong with Felten AIOE (rho = 0.920, n = 682 SOC6; O*NET-structure-derived exposure), moderate with AEI observed task use (rho = 0.424, n = 480 SOC6 common set), and near zero with GDPval (rho range -0.03 to 0.08, n = 41); GDPval is labeled as a different construct rather than treated as capability-level validation.
Figure 5. Cross-sector ranking pattern. Representative category means under the primary importance-and-level weighting approach follow the intended ordering for training priority: knowledge-professional roles average 78.03 (n=4), care and public-facing roles 68.16 (n=8), and physical-service roles 60.20 (n=11).

Section 6

What It Does Not Claim

Talvio does not claim to forecast staffing changes or labor-market outcomes. It ranks training priority.

Talvio does not claim that capability maturity scores explain the occupation ranking as of 2026-05-16. The uniform-capability diagnostic is effectively identical at rho approximately 1.00 on n = 894 direct-scored occupations, so the production method treats capability maturity as training-design context rather than ranking causality.

Talvio does not claim that GDPval validates capability-level differentiation. GDPval is valuable evidence about model performance on real-world deliverables, but the Talvio-GDPval correlations are near zero with n = 41 GDPval occupations and are treated as construct divergence.

Talvio does not silently fill occupations that lack O*NET 30.2 Work Activities coverage. The current coverage gap is 122 rows: 77 residual SOC/no descriptor rows, 26 split-code/no Work Activities rows, and 19 out-of-scope military rows. Donor imputation remains outside the production method, and split-code occupations are deferred to a future research path pending validated donor rules and disclosure standards.

Section 7

Methodological Detail / Rigorous Tail

7.1 Inputs and Claim Contract

The corrected Phase 4 validation run is dated 2026-05-16. The governing scoped-claims contract is human-reviewed. It defines the public claim as AI augmentation potential for training prioritization, not a staffing-impact forecast.

The occupation substrate is O*NET 30.2 Work Activities. O*NET 30.2 is distributed by the U.S. Department of Labor / Employment and Training Administration through the National Center for O*NET Development under CC BY 4.0 for the downloadable database. Talvio uses the Work Activities file and associated generalized work activity structure as the occupation spine.

The capability substrate is the human-reviewed capability registry. The 12 AI capability contexts and their primary benchmark anchors are:

Capability context | Primary benchmark anchor | As of
Written content generation and editing | HELM Capabilities | 2026-05-16
Information synthesis and research | Artificial Analysis Intelligence Index | 2026-05-16
Structured data analysis and quantitative reasoning | Artificial Analysis Intelligence Benchmarking | 2026-05-16
Coding and software engineering | SWE-bench | 2026-05-16
Conversational support and customer interaction | Arena Text Leaderboard | 2026-05-16
Translation and cross-language work | Artificial Analysis Multilingual Index | 2026-05-16
Speech and audio processing | Artificial Analysis Speech to Text Leaderboard | 2026-05-16
Image and document understanding | MMMU-Pro | 2026-05-16
Image, video, and design generation | Artificial Analysis Image Model Leaderboard | 2026-05-16
Planning, scheduling, and structured decision support | GDPval | 2026-05-16
Tool use and agent workflows | METR Time Horizons | 2026-05-16
Domain-specialist reasoning | GPQA | 2026-05-16

The benchmark URLs are carried in the audit registry used to build this paper. Where canonical papers exist, the reference list cites them directly. Where a benchmark is maintained as a live leaderboard, the registry URL and retrieval date are the audit source.

7.2 Scoring Spine

The primary scoring approach multiplies O*NET Work Activities importance by level. For each directly scored occupation, Talvio uses those values, combines them with the reviewed work-activity-to-capability matrix, applies the physical/embodied ceiling, and converts the result to a 0-100 TAP score. A secondary importance-only version is retained for sensitivity checks.
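The steps above can be sketched schematically. All values and dimensions below are invented toy data (the real 41-GWA matrix, embodied-activity flags, and rescaling are not reproduced here); the sketch only shows the shape of the computation: importance x level weights, a capability-mix addressability per activity, the physical ceiling, and conversion to a 0-100 score.

```python
import numpy as np

# Toy dimensions: 3 occupations x 4 generalized work activities, 2 capability
# areas (the production method uses 41 GWAs and 12 capabilities).
importance = np.array([[4.2, 3.1, 2.0, 4.8],
                       [3.0, 4.5, 4.1, 2.2],
                       [2.1, 2.0, 4.9, 3.3]])
level = np.array([[5.1, 4.0, 3.2, 5.5],
                  [3.8, 5.0, 4.6, 3.0],
                  [2.9, 2.5, 5.2, 4.1]])

# Hypothetical activity-to-capability links (rows: GWAs, cols: capabilities).
matrix = np.array([[1.0, 0.0],
                   [0.5, 0.5],
                   [0.0, 1.0],
                   [1.0, 0.0]])
capability = np.array([8.5, 7.5])              # toy reviewed maturity scores
embodied = np.array([False, False, True, False])  # toy embodied-activity flag

addressability = matrix @ capability            # per-GWA AI addressability (0-10)
addressability = np.where(embodied,
                          np.minimum(addressability, 1.5),  # physical ceiling
                          addressability)

weights = importance * level                    # primary weighting approach
tap = 10 * (weights * addressability).sum(1) / weights.sum(1)  # 0-100 scale
print(tap.round(2))
```

The importance-only sensitivity version replaces `weights = importance * level` with `weights = importance` and leaves the rest unchanged.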

The corrected scoring output has 1,016 rows, 894 direct-scored rows, and 122 excluded rows. The direct-scored primary 0-100 TAP distribution is:

Metric | Value
Direct-scored occupations | 894
Mean | 68.38
Median | 68.84
Minimum | 45.10
Maximum | 82.14
Standard deviation | 7.07
Span | 37.04
Modal band count | 394
Modal band share | 44.1%

The distribution concentration is disclosed as a characteristic of the validated method. It is not hidden by recalibration.
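The span and modal-band metrics, and the distribution-gate thresholds applied to them in Section 7.3, are mechanical. A sketch on synthetic scores, assuming the modal band is the most populated 10-point bin on the 0-100 scale (the real 894-row output and the exact gate wiring are not reproduced here):

```python
import numpy as np

def distribution_report(scores, span_min=40.0, modal_share_max=0.25):
    """Summarize a 0-100 score array and apply the distribution gate."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    # Modal 10-point band: the most populated of [0,10), [10,20), ..., [90,100].
    bands = np.floor(scores / 10).astype(int)
    modal_share = np.bincount(bands).max() / scores.size
    return {
        "mean": scores.mean(),
        "std": scores.std(),
        "span": span,
        "modal_band_share": modal_share,
        "gate_pass": bool(span >= span_min and modal_share <= modal_share_max),
    }

rng = np.random.default_rng(1)
toy_scores = np.clip(rng.normal(68.4, 7.1, 894), 0, 100)  # toy stand-in
print(distribution_report(toy_scores))
```

A distribution shaped like the corrected output, with span 37.04 and modal band share 44.1%, fails both thresholds, which is exactly what the distribution gate records.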

7.3 Validation Gates

The corrected validation gates are:

Gate | Result | Observed value
External structural validation | pass | Felten SOC3 rho = 0.953
Diagnostic panel | pass | 50/50 checks pass; 0 total fails; 0 healthcare fails
Driver consistency | pass | 0 implausible top-3 GWA drivers
Distribution range and concentration | fail | span = 37.04; modal band share = 44.1%
Relationship to previous ranking | fail | rho = 0.915
Healthcare service-line review | pass | deviations documented
GDPval capability-compression review | manual review required | all primary GDPval series rho < 0.30

The two failed gates are part of the evidence trail. The distribution gate shows that the score distribution remains concentrated. The legacy-comparison gate shows that the corrected work-activity-based ranking is still very close to the previous ranking. Neither gate is converted into a stronger claim.

7.4 External Referents

Felten AIOE is the primary structural referent. Talvio's primary strategy correlates with Felten AIOE at rho = 0.920 on n = 682 SOC6 occupations and rho = 0.953 on n = 89 SOC3 groups.

AEI is the independent observed-use referent. The validation addendum uses the March 24, 2026 Anthropic Economic Index release, based on Claude conversations from February 5 to February 12, 2026. The denominator path is explicit: 894 direct-scored Talvio occupation rows, 774 unique direct SOC6 codes, 670 direct rows matching AEI SOC6, 554 unique direct SOC6 codes matching AEI, 575 direct rows in the common AEI/Felten set, and 480 unique SOC6 codes in that common set.
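The denominator path is a chain of SOC6 joins: direct rows, unique codes, AEI matches, then the common AEI/Felten set. A pandas sketch of the matching logic with toy frames (the SOC6 codes, column names, and values here are hypothetical stand-ins, not the production schema):

```python
import pandas as pd

# Toy frames keyed by SOC6 code; note the deliberate duplicate SOC6 row,
# which is why row counts and unique-code counts diverge in the real path.
talvio = pd.DataFrame({"soc6": ["11-1011", "15-1252", "15-1252", "53-3032"],
                       "talvio_tap": [74.1, 79.6, 79.6, 58.3]})
aei = pd.DataFrame({"soc6": ["15-1252", "53-3032"], "aei_use": [0.31, 0.02]})
felten = pd.DataFrame({"soc6": ["11-1011", "15-1252"], "felten_aioe": [0.8, 1.1]})

direct_rows = len(talvio)                    # analog of 894 direct rows
unique_soc6 = talvio["soc6"].nunique()       # analog of 774 unique codes
aei_match = talvio.merge(aei, on="soc6")     # rows matching AEI SOC6
common = aei_match.merge(felten, on="soc6")  # common AEI/Felten set
print(direct_rows, unique_soc6, len(aei_match), common["soc6"].nunique())  # 4 3 3 1
```

Inner joins at each step make every denominator in the chain reproducible from the row counts, which is how the 894 / 774 / 670 / 554 / 575 / 480 sequence is audited.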

The common-set correlations are:

Comparison | n | Spearman rho
Talvio primary 0-100 TAP score vs AEI observed AI task-use | 480 | 0.424
Felten AIOE vs AEI observed AI task-use | 480 | 0.414
Talvio primary 0-100 TAP score vs Felten AIOE | 480 | 0.924

GDPval is the peak-deliverable-quality referent. The GDPval panel under the primary weighting approach uses n = 41 and produces rho values of -0.03, 0.08, and 0.07. GDPval join coverage is 41 of 44 occupations with direct TAP scores. Per-occupation GDPval values were transcribed from OpenAI GDPval PDF Figure 11 rather than taken from a machine-readable per-occupation result table.

7.5 Causal Diagnostic

The uniform-capability diagnostic is the key causal check on the capability-first hypothesis. It sets the 12 reviewed AI capability maturity scores to a uniform mean of 8.11 while retaining the physical/embodied ceiling at 1.5.

The production-score versus uniform-capability-score result is effectively identical at rho approximately 1.00 on n = 894 direct-scored occupations. The importance-only sensitivity result is also effectively identical at rho approximately 1.00 on n = 894 direct-scored occupations. The rigorous tail reports this as rho approximately 1.00 (effectively identical; machine output reported as 0.9999), n = 894 direct-scored occupations.
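The diagnostic is a one-variable ablation: rescore with the capability vector replaced by its mean, then rank-correlate against the production score. A sketch on fully synthetic data (all arrays below are invented; the point is the mechanism, under the assumption that activity structure and the physical ceiling, not capability spread, carry the between-occupation variation):

```python
import numpy as np

def spearman(a, b):
    """Spearman rho for tie-free data: Pearson correlation of the ranks."""
    rank = lambda x: np.argsort(np.argsort(x))
    return np.corrcoef(rank(a), rank(b))[0, 1]

rng = np.random.default_rng(2)
n_occ, n_gwa, n_cap = 894, 41, 12

importance = rng.uniform(1, 5, (n_occ, n_gwa))      # toy O*NET importance
level = rng.uniform(1, 7, (n_occ, n_gwa))           # toy O*NET level
matrix = rng.dirichlet(np.ones(n_cap), size=n_gwa)  # toy GWA-to-capability links
capability = rng.uniform(7.1, 9.2, n_cap)           # toy reviewed maturity scores
embodied = rng.random(n_gwa) < 0.3                  # toy embodied-activity flags

def tap(cap_scores):
    addr = matrix @ cap_scores                              # per-GWA addressability
    addr = np.where(embodied, np.minimum(addr, 1.5), addr)  # physical ceiling
    weights = importance * level                            # primary weighting
    return 10 * (weights * addr).sum(1) / weights.sum(1)

real = tap(capability)
uniform = tap(np.full(n_cap, capability.mean()))  # ablation: uniform capability
print(f"rho = {spearman(real, uniform):.4f}")
```

Even this toy setup reproduces the qualitative finding: when every activity mixes many capabilities whose scores sit in a narrow band, flattening the capability vector barely moves the rank order, and the ceiling plus weight structure dominate.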

The interpretation is narrow but important: capability maturity scores are training-design context, not the current driver of occupation-priority rank. The capability layer remains useful for curriculum design, but score explanation should remain work-activity-based until a future validation demonstrates material capability-layer differentiation.

7.6 Coverage Handling

Talvio carries explicit coverage exclusions. The 122 excluded rows are split into 77 residual SOC/no descriptor rows, 26 split-code/no Work Activities rows, and 19 out-of-scope military rows. These rows remain visible as exclusions rather than being silently dropped from the denominator.

Residual "All Other" SOC rows are category-level non-coverage by construction: these codes do not have O*NET Work Activities profiles by design, so they are not 77 individually meaningful omissions.

Military rows are out of scope by design for Talvio's civilian occupation scope, not a coverage failure.

The 26 split-code occupations are: Legislators; Public Relations Managers; Project Management Specialists; Financial and Investment Analysts; Financial Risk Specialists; Web and Digital Interface Designers; Data Scientists; Calibration Technologists and Technicians; Hydrologic Technicians; Special Education Teachers, Kindergarten; Disc Jockeys, Except Radio; Lighting Technicians; Cardiologists; Orthopedic Surgeons, Except Pediatric; Pediatric Surgeons; Emergency Medical Technicians; Paramedics; Medical Records Specialists; Health Information Technologists and Medical Registrars; School Bus Monitors; First-Line Supervisors of Entertainment and Recreation Workers, Except Gambling Services; Crematory Operators; Sales Representatives of Services, Except Advertising, Insurance, Financial Services, and Travel; First-Line Supervisors of Passenger Attendants; Taxi Drivers; Aircraft Service Attendants.

The complete 122-row exclusion list is maintained as a supplementary coverage-exclusion list for audit review.

Donor mapping review materials exist, but donor imputation is outside the production scoring scope. The split-code group remains a future research path, contingent on a validated donor method and a clear disclosure standard.

Section 8

Limitations + DWA-Spine Open Question

The first limitation is the inert-capability finding. Talvio's design began with a capability-first hypothesis, but the corrected diagnostic showed an effectively identical relationship, rho approximately 1.00 on n = 894 direct-scored occupations, between real TAP and uniform-capability TAP. The method should therefore be described as work-activity-driven for ranking and capability-informed for training design.

The second limitation is AEI scope. AEI provides independent observed-use corroboration, but the published AEI claim is moderate and computed on a 54% SOC6 subset. The common-set audit table shows the exact denominator path: 894 direct-scored occupation rows, 774 unique direct SOC6 codes, 670 direct rows matching AEI SOC6, 554 unique direct SOC6 codes matching AEI, 575 direct rows in the common AEI/Felten set, and 480 unique SOC6 codes in that common set.

The third limitation is GDPval data shape. GDPval is a valuable real-world capability benchmark, but Talvio's per-occupation GDPval correlations rely on values transcribed from OpenAI GDPval PDF Figure 11. The n = 41 primary panel should be treated as evidence about construct divergence, not as a machine-readable benchmark integration.

The fourth limitation is coverage. Talvio directly scores 894 occupations and carries 122 exclusions: 77 residual SOC/no descriptor rows, 26 split-code/no Work Activities rows, and 19 out-of-scope military rows. Any interpretation of rank gaps must account for those exclusions.

The fifth limitation is granularity. The current method uses the GWA spine because it gives a stable cross-occupation structure with Work Activities coverage. A DWA-spine method might preserve more occupation-specific variation, especially for split-code occupations, but it would require new donor decisions, a new validation cycle, and a clean rule for when detailed activity data can be used without overstating precision. For the production method, DWA-spine and donor imputation remain open questions rather than current scoring claims.

Section 9

References

External Literature and Data Sources

Capability Benchmark Sources