The Census of Fatal Occupational Injuries (CFOI) collects and publishes a complete count of work-related fatal injuries and detailed data on their circumstances. The CFOI program uses state, federal, and independent data sources to identify, verify, and describe these fatal work injuries. This ensures that counts are as complete and accurate as possible. Overall, more than 20,000 individual source documents, or about four per fatal injury, are collected each year. Some of these sources are collected under a pledge of confidentiality, and the data collected from them are protected under the Confidential Information Protection and Statistical Efficiency Act of 2002 (CIPSEA). This factsheet describes the methodology used to prevent the disclosure of these sensitive data in the CFOI. The disclosure avoidance methodology prevents direct disclosure as well as indirect disclosure through inference and deduction.
CFOI uses a series of rules to determine which cells contain sensitive information and should be suppressed. This is referred to as primary suppression. Other cells are then suppressed to prevent the value of the originally suppressed cells from being calculated. This is referred to as secondary suppression. This factsheet describes how these secondary suppressions are applied to CFOI data to prevent disclosure. It also describes other methodologies that can be used to avoid disclosure in statistical datasets. Some of the methodologies considered and researched for CFOI data are discussed in the Considerations section at the end of this factsheet.
A sensitive cell is a data point that could directly disclose confidential information. The first step in protecting confidential information in CFOI data is to suppress sensitive cells. This ensures that these counts are not published in tables or in online database query tools, such as the Census of Fatal Occupational Injuries (2011 forward) One-screen data search tool.
Table 1 below displays fictional data to show an example of a sensitive cell. For this, and the following examples, assume that the list of subcategories in each example is comprehensive and adds to the true total. Table 1 has one census unit in state A, categorized in the Finance industry, which has been determined to be a sensitive cell. Table 1a shows what would be the published data table with the necessary primary suppression applied.
Table 1. Fatal injuries in State A by industry
| State | Industry | Count |
| --- | --- | --- |
| State A | Total | 47 |
| State A | Construction | 23 |
| State A | Manufacturing | 15 |
| State A | Natural Resources and Mining | 8 |
| State A | Finance | 1 |
Table 1a. Fatal injuries in State A by industry, primary suppression applied
| State | Industry | Count |
| --- | --- | --- |
| State A | Total | 47 |
| State A | Construction | 23 |
| State A | Manufacturing | 15 |
| State A | Natural Resources and Mining | 8 |
| State A | Finance | -- |
Removing only sensitive cells from a tabular dataset leaves a problem. Publishing the Total together with all the other contributing cell counts allows the suppressed cell to be deduced easily. In Table 2, a data user can subtract the Construction, Manufacturing, and Natural Resources and Mining counts from the Total for State A and determine that the Finance sector had one fatality in State A. This computation does not require sophisticated software, just basic arithmetic.
Table 2. Fatal injuries in State A by industry
| State | Industry | Count |
| --- | --- | --- |
| State A | Total | 47 |
| State A | Construction | 23 |
| State A | Manufacturing | 15 |
| State A | Natural Resources and Mining | 8 |
| State A | Finance | 47 - 23 - 15 - 8 = 1 |
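The subtraction in Table 2 can be written as a short check. The following is a minimal sketch, not part of the CFOI processing system; the function name `recover_single_suppressed` and the use of `None` to mark a suppressed cell are assumptions made for illustration, and the counts are the fictional ones from Table 1a.

```python
# Fictional published counts from Table 1a; None marks a suppressed cell.
published = {
    "Total": 47,
    "Construction": 23,
    "Manufacturing": 15,
    "Natural Resources and Mining": 8,
    "Finance": None,
}


def recover_single_suppressed(table, total_key="Total"):
    """Return (cell name, value) when exactly one component cell is suppressed.

    Returns None when zero or several cells are suppressed, because this
    table alone then gives no unique answer.
    """
    components = {k: v for k, v in table.items() if k != total_key}
    hidden = [k for k, v in components.items() if v is None]
    if len(hidden) != 1:
        return None
    known_sum = sum(v for v in components.values() if v is not None)
    return hidden[0], table[total_key] - known_sum


recovered = recover_single_suppressed(published)
print(recovered)  # ('Finance', 1)
```

A single suppressed cell under a published total is therefore never protected by primary suppression alone.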
The use of secondary suppressions addresses this deduction problem. Suppressing a second cell protects the sensitive data point by introducing uncertainty. By removing the fatality count for Natural Resources and Mining, shown in Table 2a, a data user can only deduce that there are nine fatalities between the Natural Resources and Mining industry and Finance industry.
Table 2a. Fatal injuries in State A by industry, primary and secondary suppressions applied
| State | Industry | Count |
| --- | --- | --- |
| State A | Total | 47 |
| State A | Construction | 23 |
| State A | Manufacturing | 15 |
| State A | Natural Resources and Mining | -- |
| State A | Finance | -- |
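To see why the second suppression in Table 2a introduces uncertainty, one can list every pair of values consistent with the published cells. This is an illustrative sketch using the fictional counts above; the variable names are hypothetical, and it assumes only that counts are non-negative integers.

```python
# Fictional published counts from Table 2a; None marks a suppressed cell.
published = {
    "Total": 47,
    "Construction": 23,
    "Manufacturing": 15,
    "Natural Resources and Mining": None,
    "Finance": None,
}

# The only thing a data user can compute is the combined remainder.
known = sum(v for k, v in published.items() if k != "Total" and v is not None)
remainder = published["Total"] - known

# Every split of the remainder into two non-negative counts is possible:
# (Natural Resources and Mining, Finance)
candidates = [(nrm, remainder - nrm) for nrm in range(remainder + 1)]

print(remainder)        # 9
print(len(candidates))  # 10
```

The true pair (8, 1) is only one of ten equally consistent possibilities, so the Finance count can no longer be deduced from this table alone.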
The secondary suppression example is sufficient for any table with only one characteristic. However, the CFOI collects numerous characteristics on each fatal occupational injury and publishes cross-tabulated data. Table 3 shows state A fatalities by industry and by the event precipitating the fatal injury (for simplicity listed as Event 1, 2 and 3). The table requires primary and secondary suppressions applied to both the rows and the columns, since both row and column totals are included in the table.
Table 3. Fatal injuries in State A by industry and event
| State | Industry | Count | Event 1 | Event 2 | Event 3 |
| --- | --- | --- | --- | --- | --- |
| State A | Total | 47 | 26 | 11 | 10 |
| State A | Construction | 23 | 13 | 8 | 2 |
| State A | Manufacturing | 15 | 10 | 2 | 3 |
| State A | Natural Resources and Mining | -- | 3 | 1 | 4 |
| State A | Finance | -- | 0 | 0 | 1 |
Table 3a shows the primary suppressions applied to sensitive cells. In the Event 2 column, the total count is 11 and the counts for Manufacturing, Natural Resources and Mining, and Finance have all been primarily suppressed. Because more than one cell in the Event 2 column is suppressed, no secondary suppression is needed there; the three primary suppressions introduce enough uncertainty on their own. The Construction row, however, has just one primary suppression, for Event 3. A data user could subtract the other event counts from the Construction row total (23 - 13 - 8 = 2) and infer with certainty that Event 3 accounted for 2 Construction fatalities. The Construction count for Event 2 is therefore secondarily suppressed to protect that cell from deduction, as shown in Table 3b.
Table 3a. Fatal injuries in State A by industry and event, primary suppressions applied
| State | Industry | Count | Event 1 | Event 2 | Event 3 |
| --- | --- | --- | --- | --- | --- |
| State A | Total | 47 | 26 | 11 | 10 |
| State A | Construction | 23 | 13 | 8 | -- |
| State A | Manufacturing | 15 | 10 | -- | -- |
| State A | Natural Resources and Mining | -- | -- | -- | 4 |
| State A | Finance | -- | -- | -- | -- |
Table 3b. Fatal injuries in State A by industry and event, primary and secondary suppressions applied
| State | Industry | Count | Event 1 | Event 2 | Event 3 |
| --- | --- | --- | --- | --- | --- |
| State A | Total | 47 | 26 | 11 | 10 |
| State A | Construction | 23 | 13 | -- | -- |
| State A | Manufacturing | 15 | 10 | -- | -- |
| State A | Natural Resources and Mining | -- | -- | -- | 4 |
| State A | Finance | -- | -- | -- | -- |
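The row and column reasoning for the cross-tabulated tables can be sketched as a simple audit: any row or column with a published total and exactly one suppressed cell exposes that cell. The function `exposed_lines` and the data layout below are assumptions for illustration, using the fictional values from Tables 3a and 3b. The check covers only the single-line subtraction described here and is not a complete disclosure audit.

```python
def exposed_lines(row_labels, col_labels, row_totals, col_totals, cells):
    """List (row, column, value) for cells recoverable by one subtraction.

    A cell is recoverable when it is the only suppressed (None) cell in a
    row or column whose total is published.
    """
    exposed = []
    for r, label in enumerate(row_labels):
        hidden = [c for c in range(len(col_labels)) if cells[r][c] is None]
        if row_totals[r] is not None and len(hidden) == 1:
            value = row_totals[r] - sum(v for v in cells[r] if v is not None)
            exposed.append((label, col_labels[hidden[0]], value))
    for c, label in enumerate(col_labels):
        column = [cells[r][c] for r in range(len(row_labels))]
        hidden = [r for r, v in enumerate(column) if v is None]
        if col_totals[c] is not None and len(hidden) == 1:
            value = col_totals[c] - sum(v for v in column if v is not None)
            exposed.append((row_labels[hidden[0]], label, value))
    return exposed


rows = ["Construction", "Manufacturing", "Natural Resources and Mining", "Finance"]
cols = ["Event 1", "Event 2", "Event 3"]
row_totals = [23, 15, None, None]  # Count column; None marks a suppressed total
col_totals = [26, 11, 10]          # Total row

# Table 3a: primary suppressions only.
cells_3a = [[13, 8, None], [10, None, None], [None, None, 4], [None, None, None]]
# Table 3b: Construction / Event 2 secondarily suppressed as well.
cells_3b = [[13, None, None], [10, None, None], [None, None, 4], [None, None, None]]

print(exposed_lines(rows, cols, row_totals, col_totals, cells_3a))
# [('Construction', 'Event 3', 2)]
print(exposed_lines(rows, cols, row_totals, col_totals, cells_3b))
# []
```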
The final consideration for disclosure avoidance is table differencing. This refers to using other published data tables or data points to deduce sensitive data cells. Table 4 shows the count in State A's Natural Resources and Mining industry by gender. Neither cell was determined to be sensitive. However, because CFOI currently publishes data on only two gender values, Male and Female, these data points allow a data user to infer with certainty that there was a total of 8 fatalities in State A's Natural Resources and Mining industry. Table 4a, below Table 4, replicates Table 2a, which had sufficient protection from primary and secondary suppressions within the table itself. The data provided by Table 4 allow a data user to deduce the counts for both suppressed cells, shown as the numbers in parentheses. Applying additional suppressions to prevent table differencing is therefore the final step of disclosure avoidance in CFOI data.
Table 4. Fatal injuries in State A, Natural Resources and Mining industry by gender
| State | Industry | Gender | Count |
| --- | --- | --- | --- |
| State A | Natural Resources and Mining | Male | 4 |
| State A | Natural Resources and Mining | Female | 4 |
Table 4a. Fatal injuries in State A by industry, secondary (and primary) suppressions disclosed
| State | Industry | Count |
| --- | --- | --- |
| State A | Total | 47 |
| State A | Construction | 23 |
| State A | Manufacturing | 15 |
| State A | Natural Resources and Mining | -- (8) |
| State A | Finance | -- (1) |
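Table differencing amounts to borrowing a total from one table to fill a gap in another. The sketch below reproduces the deduction with the fictional values from Tables 2a and 4; the variable names are hypothetical and the code is only an illustration of the arithmetic.

```python
# Table 2a as published; None marks a suppressed cell.
table_2a = {
    "Total": 47,
    "Construction": 23,
    "Manufacturing": 15,
    "Natural Resources and Mining": None,
    "Finance": None,
}

# Table 4: the same industry broken out by gender, with nothing suppressed.
table_4 = {"Male": 4, "Female": 4}

# Step 1: the gender breakdown is exhaustive, so its sum is the industry total.
nrm_total = sum(table_4.values())

# Step 2: fill that value into Table 2a, leaving a single suppressed cell.
filled = dict(table_2a)
filled["Natural Resources and Mining"] = nrm_total

# Step 3: ordinary subtraction recovers the sensitive Finance count.
known = sum(v for k, v in filled.items() if k != "Total" and v is not None)
finance = filled["Total"] - known

print(nrm_total, finance)  # 8 1
```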
Consider the published CFOI characteristics: ownership, industry, occupation, gender, race, age group, event or exposure, source, secondary source, nature, part, time, state, location, employee status, and worker activity. Though the mechanics of the primary suppressions and initial secondary suppressions are simple, the evaluation becomes increasingly complex with each additional characteristic. The goal of the disclosure methodology is to ensure confidentiality across all published CFOI data. Suppressions must be applied in such a way that no combination of published counts and totals discloses sensitive information. This is accomplished by an algorithm that takes CFOI microdata as input and evaluates all possible combinations of variables, applying primary and secondary suppressions using the methodology discussed above. This program is referred to as the “hypercube” in CFOI data processing, a name derived from the idea that as more data points and totals are included, the relationships that must be evaluated for disclosure risk become multidimensional.
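The growth in complexity can be illustrated by tabulating a toy microdata file over every combination of its variables. This sketch is not the CFOI hypercube program: the records, the three variable names, and the function `all_marginal_tables` are invented for illustration, and the code shows only the enumeration step, not the suppression rules.

```python
from collections import Counter
from itertools import combinations

# Fictional microdata: one record per fatal injury.
records = [
    {"industry": "Construction", "event": "Event 1", "gender": "Male"},
    {"industry": "Construction", "event": "Event 2", "gender": "Male"},
    {"industry": "Manufacturing", "event": "Event 1", "gender": "Female"},
    {"industry": "Finance", "event": "Event 3", "gender": "Male"},
]
variables = ["industry", "event", "gender"]


def all_marginal_tables(records, variables):
    """Build one count table for every non-empty combination of variables."""
    tables = {}
    for size in range(1, len(variables) + 1):
        for combo in combinations(variables, size):
            tables[combo] = Counter(
                tuple(rec[v] for v in combo) for rec in records
            )
    return tables


tables = all_marginal_tables(records, variables)
print(len(tables))                                 # 7, i.e. 2**3 - 1
print(tables[("industry",)][("Construction",)])    # 2
```

With n variables there are 2**n - 1 such tables, and every one of them shares totals with the others, which is why suppressions have to be evaluated jointly rather than table by table.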
Federal statistical agencies avoid disclosure in several ways, and the methodologies range in complexity. Some general tools used for disclosure avoidance are suppression, coarsening, record swapping, sampling, and noise injection.[1] Each of these contains a variety of more specific implementation methods, depending on the data. Of these tools, BLS research determined that suppression is the most apt for CFOI data.
Both coarsening and record swapping are often used to avoid disclosure in the release of microdata. Coarsening broadens the categorization of variables until the cells are secure. For instance, rather than include microdata records with exact ages, the microdata may instead have 10-year age categories. Record swapping switches some characteristics on a record with those of a record that matches on other characteristics, while maintaining the same characteristic totals. These techniques are effective for avoiding disclosure in microdata, but record swapping could not be used to avoid disclosure in tabular data, and coarsening CFOI data would result in excessively broad categorizations.
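As a small illustration of the age example, coarsening can be sketched as mapping each exact age to a 10-year band. The function name `ten_year_band`, the band boundaries, and the ages below are assumptions made for this example and do not reflect any published CFOI age groups.

```python
def ten_year_band(age):
    """Map an exact age to a 10-year band label such as '30-39'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


# Fictional exact ages from a microdata file.
ages = [31, 37, 39, 42]
bands = [ten_year_band(a) for a in ages]
print(bands)  # ['30-39', '30-39', '30-39', '40-49']
```

Three distinct ages collapse into one band, which protects the individual records at the cost of detail.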
Samples are at the core of statistics. From a carefully designed sample, much can be inferred about a population with statistical confidence. Data published from samples give inherent protection to the sampled units because there is uncertainty about which units were selected and which units participated. The CFOI, by contrast, is a complete census of all fatal occupational injuries in the US. It is not a sample and is not protected by the disclosure uncertainty a sample can provide.
Finally, another technique considered for protecting CFOI data was injecting variability, or noise, into sensitive cells while maintaining high-level totals. However, injecting noise into already very small counts risks changing their meaning. For example, changing a data cell from 1 fatality to 3 fatalities increases the cell by 200 percent and could have unintended policy implications. Additionally, a variability of 2 could change a data cell from 1 fatality to -1 fatality, which is nonsensical. Larger datasets are less sensitive to a small amount of injected noise, but the nature of CFOI data means this technique would undermine data usability.
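The arithmetic behind the noise example can be checked directly. This is an illustrative sketch only; the helper `percent_change` and the comparison count of 1,000 are assumptions introduced here, not figures from CFOI data.

```python
def percent_change(original, noisy):
    """Relative change, in percent, caused by replacing original with noisy."""
    return 100 * (noisy - original) / original


noise = 2

# A cell of 1 fatality perturbed upward by 2.
small_up = percent_change(1, 1 + noise)

# The same noise applied downward gives an impossible negative count.
small_down = 1 - noise

# The same noise applied to a hypothetical large count barely moves it.
large_up = percent_change(1000, 1000 + noise)

print(small_up)    # 200.0
print(small_down)  # -1
print(large_up)    # 0.2
```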
This is not a comprehensive list of disclosure avoidance methods; it describes the alternatives that were considered but not implemented for CFOI disclosure avoidance.
Background on cell suppression, cell sensitivity, and the protection of statistical data can be obtained from the Federal Committee on Statistical Methodology's Working Paper 22.
[1] Cole, S., Dhaliwal, I., Sautmann, A., & Vilhuber, L. (2020, September 25). Handbook on using administrative data for research and evidence-based policy. https://admindatahandbook.mit.edu/book/v1.0-rc2/index.html
Last Modified Date: September 27, 2023