Automated Coding of Injury and Illness Data
The Survey of Occupational Injuries and Illnesses (SOII) collects data from sampled establishments on OSHA forms 300 and 301. We use the information provided on these forms to generate detailed statistics on the characteristics of cases involving injury or illness.
To generate these statistics, survey staff must convert the text entries on the OSHA forms to the standard codes used by BLS, as indicated in the table below:
|OSHA field||SOII Code||Coding Taxonomy Used|
|Job title||Occupation||Standard Occupational Classification|
|What was the employee doing just before the incident occurred?||Event or exposure||Occupational Injury and Illness Classification System|
|What happened?||Nature of injury or illness and Event or exposure||Occupational Injury and Illness Classification System|
|What was the injury or illness?||Nature of injury or illness and Part of body||Occupational Injury and Illness Classification System|
|What object or substance directly harmed the employee?||Source of injury or illness||Occupational Injury and Illness Classification System|
The set of all fields, taken together, is considered the case "narrative." Prior to survey year 2014, BLS relied exclusively on humans to code cases, based on a careful reading and analysis of the case narrative. In 2014, BLS began using computer-assisted coding to code a subset of cases.
BLS uses logistic regression as the machine learning technique to assign the case codes. We first train the logistic regression model on large quantities of previously coded SOII narratives. During training, the algorithm learns how strongly various words, pairs of words, and other features are associated with each code that can be assigned. After training, we use the model to estimate the best codes for uncoded narratives.
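The train-then-predict workflow described above can be illustrated with a small sketch. This is not the BLS production system: the sample narratives, the plain-text labels ("fall", "cut", "burn"), the function names, and the choice of simple gradient descent are all assumptions made for illustration. It shows only the general idea of a multinomial logistic regression that learns weights linking words and word pairs to codes.

```python
import math
from collections import defaultdict

# Hypothetical training data: invented narratives with invented labels,
# standing in for previously coded SOII cases.
EXAMPLES = [
    ("worker slipped on wet floor and fell", "fall"),
    ("employee fell from ladder while painting", "fall"),
    ("cut finger with knife while slicing meat", "cut"),
    ("box cutter slipped and cut hand", "cut"),
    ("burned arm on hot oven rack", "burn"),
    ("hot oil splashed and burned hand", "burn"),
]

def features(narrative):
    """Return the set of words and adjacent word pairs in a narrative."""
    words = narrative.lower().split()
    pairs = [a + " " + b for a, b in zip(words, words[1:])]
    return set(words + pairs)

def probabilities(weights, feats, codes):
    """Softmax over per-code scores, where a score is a sum of feature weights."""
    raw = {c: sum(weights[c][f] for f in feats) for c in codes}
    top = max(raw.values())  # subtract the max for numerical stability
    exp = {c: math.exp(v - top) for c, v in raw.items()}
    total = sum(exp.values())
    return {c: e / total for c, e in exp.items()}

def train(examples, epochs=100, rate=0.1):
    """Fit multinomial logistic regression with per-example gradient steps."""
    codes = sorted({code for _, code in examples})
    weights = {c: defaultdict(float) for c in codes}
    data = [(features(text), code) for text, code in examples]
    for _ in range(epochs):
        for feats, code in data:
            probs = probabilities(weights, feats, codes)
            for c in codes:
                # Gradient of the log-likelihood: observed minus predicted.
                grad = (1.0 if c == code else 0.0) - probs[c]
                for f in feats:
                    weights[c][f] += rate * grad
    return weights, codes

def predict(weights, codes, narrative):
    """Return the most probable code and its estimated probability."""
    probs = probabilities(weights, features(narrative), codes)
    best = max(probs, key=probs.get)
    return best, probs[best]

weights, codes = train(EXAMPLES)
print(predict(weights, codes, "fell from ladder"))
```

In a sketch like this, the estimated probability could serve as a confidence score for deciding which cases to leave to human coders, although the source does not say how BLS selects the subset of cases coded automatically.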
In 2014, the model assigned only occupation codes, and only 5 percent of codes were assigned by machine learning. In 2015, we expanded the use of automated coding to also include part of body and nature of injury or illness. In 2016, we expanded automated coding again to include the five primary components of the narrative (occupation, nature, part, source, and event) and used it to code approximately 50 percent of these elements. For 2017, the machine learning algorithm coded 61.5 percent of the components.
For additional technical information on our techniques, please contact OSHS_Autocoding@bls.gov.
Last Modified Date: October 1, 2018