Injuries, Illnesses, and Fatalities

Automated Coding of Injury and Illness Data

The Survey of Occupational Injuries and Illnesses (SOII) is an establishment-based survey used to estimate incidence rates and counts of nonfatal workplace injuries and illnesses. It also provides detailed case and demographic data for cases that involved 1 or more days away from work (DAFW) and/or days of job transfer or restriction (DJTR). BLS generates these detailed statistics from the information that surveyed establishments report on their OSHA forms.

To generate these statistics, BLS must convert text entries from the OSHA forms to standard codes used by BLS, as indicated in the table below:

OSHA field | SOII code | Coding taxonomy used
Job title | Occupation | Standard Occupational Classification
What was the employee doing just before the incident occurred? | Event or exposure | Occupational Injury and Illness Classification System
What happened? | Nature of injury or illness and Event or exposure | Occupational Injury and Illness Classification System
What was the injury or illness? | Nature of injury or illness and Part of body | Occupational Injury and Illness Classification System
What object or substance directly harmed the employee? | Source and Secondary source of injury or illness | Occupational Injury and Illness Classification System
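
The field-to-code relationships in the table can also be written as a small lookup structure. The sketch below is only an illustration: the dictionary name and the helper function are hypothetical and are not part of any BLS system. It shows that some codes, such as event or exposure, draw on more than one OSHA field.

```python
# Illustrative lookup built from the table above; names are hypothetical.
FIELD_TO_CODES = {
    "Job title": ["Occupation"],
    "What was the employee doing just before the incident occurred?": [
        "Event or exposure",
    ],
    "What happened?": ["Nature of injury or illness", "Event or exposure"],
    "What was the injury or illness?": [
        "Nature of injury or illness",
        "Part of body",
    ],
    "What object or substance directly harmed the employee?": [
        "Source",
        "Secondary source",
    ],
}


def fields_for_code(code):
    """Return the OSHA fields that inform a given SOII code."""
    return [field for field, codes in FIELD_TO_CODES.items() if code in codes]


def all_codes():
    """Return the set of distinct SOII codes named in the table."""
    return {code for codes in FIELD_TO_CODES.values() for code in codes}
```

For example, fields_for_code("Event or exposure") returns two fields, while fields_for_code("Occupation") returns only the job title.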

The set of all fields, taken together, is considered the case "narrative." Prior to survey year 2014, BLS relied exclusively on humans to code cases. For survey year 2014, BLS began using machine learning to code a subset of cases. To use machine learning, BLS first selected a learning algorithm and trained it on large quantities of previously coded SOII narratives. During this process, the algorithm calculated how strongly various features, such as words, pairs of words, and other items, were associated with the codes that could be assigned.
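
To make the idea of word and word-pair features concrete, here is a minimal sketch of how such features might be counted from a narrative. The tokenization rule and the function name are assumptions made for illustration; the source does not describe how BLS actually tokenizes text or which other features it uses.

```python
import re
from collections import Counter


def extract_features(narrative):
    """Count single words and adjacent word pairs in a narrative.

    Assumption: tokens are lowercase runs of letters and digits. The
    actual BLS feature set is not specified in the text.
    """
    tokens = re.findall(r"[a-z0-9]+", narrative.lower())
    features = Counter(tokens)
    # Add each pair of neighboring words as its own feature.
    features.update(f"{a} {b}" for a, b in zip(tokens, tokens[1:]))
    return features


example = extract_features("Employee slipped on wet floor")
```

Here the five-word narrative yields five word features and four word-pair features, such as "wet floor". A learning algorithm would then estimate a weight for each feature and code combination.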

After training, BLS used the algorithm to estimate the best codes for each uncoded narrative and assigned those codes if the model’s confidence exceeded a predetermined threshold. For survey years 2014-17, BLS used regularized multinomial logistic regression. Starting with survey year 2018, BLS switched to deep neural network architectures. For survey years 2018-20, BLS used an architecture with character-level convolutional embeddings and long short-term memory (LSTM) recurrent layers (BLS has made the source code available). Starting with survey year 2021, BLS has used a transformer architecture.
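
The threshold rule described above can be sketched with a toy multinomial logistic regression scorer. Everything specific here is an assumption for illustration: the weights, the code labels CODE_A and CODE_B, and the 0.9 threshold are hypothetical, and the sketch covers only prediction, not the regularized training step.

```python
import math


def softmax(scores):
    """Convert per-code scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {code: math.exp(s - peak) for code, s in scores.items()}
    total = sum(exps.values())
    return {code: e / total for code, e in exps.items()}


def assign_code(features, weights, threshold):
    """Return (code, confidence); code is None when confidence is too low."""
    scores = {
        code: sum(w.get(name, 0.0) * count for name, count in features.items())
        for code, w in weights.items()
    }
    probs = softmax(scores)
    best = max(probs, key=probs.get)
    confidence = probs[best]
    # Assign only when confidence exceeds the threshold; otherwise the
    # case would be left for a human coder.
    return (best if confidence > threshold else None), confidence


# Hypothetical weights and labels, not real OIICS codes.
WEIGHTS = {
    "CODE_A": {"slipped": 3.0, "wet floor": 2.0},
    "CODE_B": {"cut": 3.0, "knife": 2.0},
}

clear_case = assign_code({"slipped": 1, "wet floor": 1}, WEIGHTS, 0.9)
unclear_case = assign_code({"slipped": 1, "cut": 1}, WEIGHTS, 0.9)
```

In this sketch the first narrative is assigned CODE_A with confidence above 0.99, while the second splits evenly between the two codes and is left unassigned.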

BLS autocoding of SOII data has expanded over time. For survey year 2014, only 26 percent of occupation codes were assigned by machine learning. By survey year 2019, automatic coding had expanded to include all six coding tasks (occupation, nature, part, source, secondary source, and event), with the model assigning approximately 85 percent of all codes. For survey year 2020, all cases mentioning ‘covid’ or ‘corona’ were manually coded because of their novel nature and prevalence, which lowered the percentage of autocoded cases. Since then, COVID-19 cases have been integrated into the Autocoder training process, allowing automated coding of approximately 92 percent of all codes for survey years 2021-22.
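
The survey year 2020 rule for COVID-19 cases amounts to a keyword screen applied before autocoding. A minimal sketch follows, assuming a case-insensitive substring match; the source says only that cases "mentioning" these terms were manually coded, so the exact matching rule and the function names are hypothetical.

```python
# Keywords taken from the text; the matching rule is an assumption.
MANUAL_KEYWORDS = ("covid", "corona")


def needs_manual_coding(narrative):
    """Return True when the narrative mentions any screened keyword."""
    text = narrative.lower()
    return any(keyword in text for keyword in MANUAL_KEYWORDS)


def split_cases(narratives):
    """Separate narratives into (manual, eligible_for_autocoding) lists."""
    manual = [n for n in narratives if needs_manual_coding(n)]
    eligible = [n for n in narratives if not needs_manual_coding(n)]
    return manual, eligible


manual, eligible = split_cases([
    "Tested positive for COVID-19 after workplace exposure",
    "Coronavirus contracted from patient",
    "Slipped on ice in parking lot",
])
```

In this sample the first two narratives are routed to manual coding and the third remains eligible for the autocoder.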

Starting with survey year 2021, BLS expanded collection of case data from all sampled establishments to include details for cases involving days of job transfer or restriction (DJTR) only. Previously, BLS collected details only for cases involving days away from work (DAFW). Biennial (2-year) estimates of detailed case circumstances and worker characteristics for cases involving days away from work, job transfer, or restriction are available for the first time, covering survey years 2021-22. Single-year estimates for case circumstances and worker demographics are not available after survey year 2020; however, the chart below illustrates the SOII Autocoder performance for case data collected annually.

Starting with survey year 2023, the SOII Autocoder will begin using the new Occupational Injury and Illness Classification System (OIICS) v3.01, which requires migrating the training data to the new coding system. For survey years 2021-22, the Autocoder used Standard Occupational Classification (SOC) 2018 and OIICS v2.01 codes.
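
One way such a migration could work is to relabel previously coded narratives through a crosswalk between the old and new code sets. The sketch below is an assumption about the general approach, not a description of the BLS process: the crosswalk contents, the placeholder codes (OLD-1, NEW-1, and so on), and the function name are all hypothetical.

```python
def migrate_codes(records, crosswalk):
    """Relabel (narrative, old_code) pairs using a crosswalk.

    Records whose old code maps to exactly one new code are relabeled.
    Records with no mapping or several candidate mappings are set aside,
    since choosing among candidates would need human review.
    """
    migrated, needs_review = [], []
    for narrative, old_code in records:
        targets = crosswalk.get(old_code, [])
        if len(targets) == 1:
            migrated.append((narrative, targets[0]))
        else:
            needs_review.append((narrative, old_code))
    return migrated, needs_review


# Placeholder codes only; these are not real OIICS values.
CROSSWALK = {
    "OLD-1": ["NEW-1"],           # one-to-one mapping
    "OLD-2": ["NEW-2", "NEW-3"],  # old code split into two new codes
}

migrated, needs_review = migrate_codes(
    [("narrative a", "OLD-1"), ("narrative b", "OLD-2"), ("narrative c", "OLD-9")],
    CROSSWALK,
)
```

Only the first record is relabeled automatically here; the split code and the unmapped code are both flagged for review.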


For additional technical information on SOII Autocoding techniques, please submit your questions here.


Last Modified Date: January 12, 2024