Much of the information about work-related injuries and illnesses in the U.S. is recorded only as short text narratives on Occupational Safety and Health Administration (OSHA) logs and Workers’ Compensation records. Analysis of these data has the potential to answer many important questions about workplace safety, but typically requires that the individual cases first be “coded” to indicate their specific characteristics. Unfortunately, the process of assigning these codes is often manual, time-consuming, and prone to human error. This paper compares manual and automated approaches to assigning detailed occupation, nature of injury, part of body, event resulting in injury, and source of injury codes to narratives collected through the Survey of Occupational Injuries and Illnesses, an annual survey of U.S. establishments that collects OSHA logs describing approximately 300,000 work-related injuries and illnesses each year. We review previous efforts to automate similar coding tasks and demonstrate that machine learning coders based on the logistic regression and support vector machine algorithms outperform those based on naïve Bayes, and achieve coding accuracies comparable to or better than those of trained human coders.
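To make the idea of automated coding concrete, the following is a minimal sketch of a bag-of-words naïve Bayes coder, the simplest of the three algorithm families named above and the one reported here as the weakest. It is not the implementation evaluated in this paper. The narratives, the two category labels ("fall", "cut"), and the function names are hypothetical and chosen only for illustration; they are not real survey data or official codes.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy narratives with made-up labels (not real survey data or codes).
TRAIN = [
    ("worker slipped on wet floor and fell", "fall"),
    ("employee fell from ladder while painting", "fall"),
    ("tripped over cable and fell on stairs", "fall"),
    ("cut finger with box knife opening carton", "cut"),
    ("lacerated hand on sharp metal edge", "cut"),
    ("knife slipped and cut thumb", "cut"),
]


def tokenize(text):
    """Lowercase the narrative and split it on whitespace."""
    return text.lower().split()


def train_naive_bayes(examples):
    """Count documents per label and word occurrences per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log posterior under
    multinomial naive Bayes with add-one (Laplace) smoothing."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Log prior: share of training narratives carrying this label.
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes(TRAIN)
print(predict(model, "fell off ladder"))
print(predict(model, "cut hand with knife"))
```

A logistic regression or support vector machine coder would use the same word-count features but learn a weight for each word and code jointly, rather than treating words as independent given the code.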