Protection of Respondent Confidentiality


The internal CE microdata contains information that could reveal the identity of respondents. BLS changes potentially revealing information to ensure that data users cannot identify survey respondents. We call this process "topcoding" for reported values that exceed a positive threshold and "bottom coding" for reported values that fall below a negative threshold. For simplicity, this document refers only to "topcoding". The specific values are listed in "2015 Topcodes and Suppression".

This page discusses the main aspects of topcoding: topcoding basic variables, recoding summary variables, conditional recoding, geographic recoding, and transferring topcoded observations to related files.

If CE topcodes an observation, it sets its flag variable to 'T'.


Variable topcoding

Variable topcoding replaces a reported observation that exceeds a prescribed critical value. CE calculates critical values using the guidelines of the Census Disclosure Review Board. This method is applied to observations in the FMLI, FMLD, EXPN, and EXPD files. From 1996 forward, the topcoded variables are listed in the topcoding file, which gives the variable, the relation, the topcode and bottom code values, and the upper and lower critical values. For pre-1996 topcoded values, see the respective documentation.

Topcoding involves five steps:


  1. Determine the critical value, which is the threshold above which a reported value could allow users to identify the respondent. (See chart 1)
  2. Identify reported values that exceed the critical value. (See chart 1)
  3. Calculate the topcoded value by averaging all reported values that exceed the critical value. (See chart 2)
  4. Replace the reported values with the topcoded value. (See chart 2)
  5. Set the flag value to 'T' for that observation.

Chart 1 with pre-topcoded data
Chart 2 with post-topcoded data

All five quarters of data in the CE microdata release are used to determine critical values and topcode amounts. Since the critical value and set of values that need to be topcoded may differ with each annual release, the topcode values may change annually and be applied at a different starting point. By topcoding values in this manner, means are preserved for each five-quarter data release when using the total sample. This will not be the case when means are estimated by characteristic.
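
The five steps can be sketched in a few lines of Python. This is a minimal illustration, not the CE production procedure: the function name `topcode`, the sample values, and the empty-string flag for untouched observations are assumptions made for the example, and bottom coding is left out for brevity.

```python
from statistics import mean

def topcode(values, critical_value):
    """Return (recoded values, flags) for one variable.

    Values above critical_value are replaced by the mean of all values
    above critical_value, and their flag is set to 'T'. The empty-string
    flag for other observations is a placeholder for this sketch.
    """
    above = [v for v in values if v > critical_value]
    if not above:
        return list(values), [""] * len(values)
    topcode_value = mean(above)
    recoded = [topcode_value if v > critical_value else v for v in values]
    flags = ["T" if v > critical_value else "" for v in values]
    return recoded, flags

# Made-up reported values and critical value, for illustration only.
reported = [40_000, 90_000, 160_000, 200_000, 300_000]
recoded, flags = topcode(reported, critical_value=150_000)

# The mean over the total sample is unchanged by the recode.
total_mean_preserved = mean(recoded) == mean(reported)

# The mean over a subset (here the last two observations) is not.
subset_mean_preserved = mean(recoded[3:]) == mean(reported[3:])
```

In this sketch `total_mean_preserved` is true and `subset_mean_preserved` is false, which mirrors the caveat above about means estimated by characteristic.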

 

Summary variable recode

Recoding summary variables occurs when an aggregate variable includes a 'feeder' variable that has been topcoded. A feeder variable is one of the variables that are summed to produce an aggregate variable. This method is applied to variables in the FMLI, FMLD, MEMI, and MEMD files.

This method involves three steps:

  1. Identify topcoded feeder variables and their aggregate variables.
  2. Calculate the aggregate variable with the topcoded feeder variable.
  3. Set the flag value to 'T' for that observation of the aggregate variable.

The example below clarifies this method. The variable FSMPFRMX (family income or loss from self-employment) is computed as the sum of the values for the variable SEMPFRMX (member income or loss from self-employment) from the MEMI file. For SEMPFRMX, all values above the upper critical value of $150,000 are topcoded to $321,846, and all values below the lower critical value of -$170,000 are bottom coded to -$435,000.

 
              SEMPFRMX    SEMPFRMX    FSMPFRMX    FSMPFRMX
              reported    topcoded    topcoded    topcoding flag

CU 1
  Member 1     $95,000     $95,000    $170,000    No
  Member 2     $75,000     $75,000

CU 2
  Member 1    $160,000    $321,846    $331,846    Yes
  Member 2     $10,000     $10,000

CU 3
  Member 1    $450,000    $321,846    $643,692    Yes
  Member 2    $350,000    $321,846

CU 4
  Member 1    $300,000    $321,846   -$113,154    Yes
  Member 2   -$200,000   -$435,000

The cases of CU 1 and CU 2 demonstrate that aggregate values can differ after topcoding even if the values before topcoding sum to the same amount. CU 1 and CU 2 both reported $170,000 for FSMPFRMX; however, CE topcodes only the value reported by member 1 of CU 2. Thus, the value of FSMPFRMX for CU 2 is higher than for CU 1 and is flagged as topcoded, while the value for CU 1 is not. Because the topcode amount is the mean of the subset of observations that are above (below) the critical value, values on the public use data can be either below or above the actual reported value.

The case of CU 3 demonstrates that the topcoded value can be lower than the reported value.

The case of CU 4 demonstrates that the reported value for FSMPFRMX can be positive, while the topcoded value can be negative. The reverse can also occur.
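
The four cases can be reproduced with a short Python sketch. The critical values and topcode amounts are the ones quoted in the example above; the function names and the empty-string flag for non-topcoded aggregates are assumptions made for illustration.

```python
# Values quoted in the SEMPFRMX example above.
TOP_CRITICAL, TOP_VALUE = 150_000, 321_846
BOTTOM_CRITICAL, BOTTOM_VALUE = -170_000, -435_000

def topcode_member(value):
    """Return (recoded member value, whether it was top or bottom coded)."""
    if value > TOP_CRITICAL:
        return TOP_VALUE, True
    if value < BOTTOM_CRITICAL:
        return BOTTOM_VALUE, True
    return value, False

def recode_family(member_values):
    """Sum the recoded feeder values; flag the aggregate if any feeder was recoded."""
    recoded = [topcode_member(v) for v in member_values]
    total = sum(value for value, _ in recoded)
    flag = "T" if any(hit for _, hit in recoded) else ""
    return total, flag

reported = {
    "CU 1": [95_000, 75_000],
    "CU 2": [160_000, 10_000],
    "CU 3": [450_000, 350_000],
    "CU 4": [300_000, -200_000],
}
results = {cu: recode_family(values) for cu, values in reported.items()}
for cu, (fsmpfrmx, flag) in results.items():
    print(cu, fsmpfrmx, flag)
```

The printed totals match the FSMPFRMX column of the table: 170,000 for CU 1 (not flagged), then 331,846, 643,692, and -113,154 for CU 2 through CU 4 (all flagged).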

Conditional recode

Conditional recoding is applied when data users could use a known formula to deduce the original value of a topcoded variable from related variables that are not topcoded. A feeder variable is a variable that is used to sum an aggregate variable. This method is used for the MEMI and MEMD files.

This method involves three steps:

  1. Identify topcoded feeder variables and the variables that sum them.
  2. Calculate the sum with the topcoded feeder variable.
  3. Set the flag value to 'T' for that observation of the aggregate variable.

The example below clarifies this method for MEMI but applies as well to MEMD. The five MEMI file variables AMTFED, GOVRETX, PRIVPENX, RRRDEDX, and SLTAXX describe deductions from the most recent pay. These variables are used in conjunction with GROSPAYX (amount of last gross pay) and SALARYXM (annual wage and salary income) to derive ANFEDTXM, ANGOVRTM, ANPRVPNM, ANRRDEDM, and ANSLTXM, which represent the estimated annual deductions for each of these income deduction categories. The estimated annual Federal income tax deduction from pay is calculated as

        (1) ANFEDTXM = SALARYXM × (AMTFED / GROSPAYX)

SALARYXM can be estimated by using the above terms and rearranging such that

        (2) SALARYXM = ANFEDTXM × (GROSPAYX / AMTFED)

In the above example, a disclosure problem may arise when SALARYXM is topcoded but ANFEDTXM, GROSPAYX, and AMTFED are not. In this situation, the original value of SALARYXM can be recalculated by inserting the non-topcoded values into equation (2). To prevent this, the non-topcoded terms in equation (2) are suppressed (blanked out) and their associated flags are assigned a value of 'T'.

The following list describes the specific rules that CE applies to prevent the potential disclosure outlined above.

  • If SALARYXM is greater than the critical value but ANFEDTXM, GROSPAYX, and AMTFED are not, then the values for ANFEDTXM, GROSPAYX, and AMTFED are suppressed and their flag variables are assigned a value of 'T'.
  • If SALARYXM is greater than the critical value but ANGOVRTM, GROSPAYX, and GOVRETX are not, then the values for ANGOVRTM, GROSPAYX, and GOVRETX are suppressed and their flag variables are assigned a value of 'T'.
  • If SALARYXM is greater than the critical value but ANPRVPNM, GROSPAYX, and PRIVPENX are not, then the values for ANPRVPNM, GROSPAYX, and PRIVPENX are suppressed and their flag variables are assigned a value of 'T'.
  • If SALARYXM is greater than the critical value but ANRRDEDM, GROSPAYX, and RRRDEDX are not, then the values for ANRRDEDM, GROSPAYX, and RRRDEDX are suppressed and their flag variables are assigned a value of 'T'.
  • If SALARYXM is greater than the critical value but ANSLTXM, GROSPAYX, and SLTAXX are not, then the values for ANSLTXM, GROSPAYX, and SLTAXX are suppressed and their flag variables are assigned a value of 'T'.

The same special suppression for MEMI file variables occurs with the original (pre-income imputation) variables that correspond to the variables noted above (SALARYX, ANFEDTX).
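
The disclosure risk and the suppression rule for the Federal income tax variables can be sketched as follows. This is an illustration under assumptions: the function names, the dictionary layout, the use of `None` for a blanked value, and the sample amounts are all hypothetical, and the rule is read here as "SALARYXM carries a 'T' flag while the three related variables do not".

```python
RELATED = ("ANFEDTXM", "GROSPAYX", "AMTFED")

def recover_salary(anfedtxm, grospayx, amtfed):
    """Equation (2): back out SALARYXM from the three related values."""
    return anfedtxm * (grospayx / amtfed)

def conditional_suppress(values, flags):
    """Blank the related variables when only SALARYXM is topcoded.

    values and flags are dicts keyed by variable name; copies are returned.
    """
    values, flags = dict(values), dict(flags)
    salary_topcoded = flags.get("SALARYXM") == "T"
    related_untouched = all(flags.get(name) != "T" for name in RELATED)
    if salary_topcoded and related_untouched:
        for name in RELATED:
            values[name] = None  # suppressed (blanked out)
            flags[name] = "T"
    return values, flags

# Made-up amounts: a true salary of 400,000 and a last paycheck of 10,000
# with 2,500 withheld, so equation (1) gives 400,000 * 0.25 = 100,000.
true_salary = 400_000
values = {"SALARYXM": 321_846, "ANFEDTXM": 100_000.0,
          "GROSPAYX": 10_000, "AMTFED": 2_500}
flags = {"SALARYXM": "T", "ANFEDTXM": "", "GROSPAYX": "", "AMTFED": ""}

# Without suppression, the topcoded salary can be recovered exactly.
leaked = recover_salary(values["ANFEDTXM"], values["GROSPAYX"], values["AMTFED"])

# With suppression, the inputs to equation (2) are no longer available.
safe_values, safe_flags = conditional_suppress(values, flags)
```

Here `leaked` equals the original 400,000 even though the published SALARYXM is the topcode amount, which is the situation the rules above are designed to prevent.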

 

Geographic recode

Geographic recoding refers to the process of replacing or suppressing the state code if topcoding is not feasible. This method applies to the FMLI and FMLD files.

The value of the variable STATE identifies the state of residence. This variable must be suppressed for some observations to meet the Census Disclosure Review Board's criterion that the smallest geographically identifiable area must have a population of at least 100,000. STATE data were evaluated in conjunction with the POPSIZE, REGION, and BLS_URBN variables, which show the population size of the geographic area that is sampled, the four Census regions, and urban/rural status, respectively. Some STATE codes were suppressed because, in combination with these variables, they could be used to identify areas of 100,000 or less. The STATE variable is blank on approximately 14 percent of the records in the FMLI files.

A small proportion of STATE codes are replaced with codes of states other than the state where the CU resides. By recoding in this manner, suppression of POPSIZE may be avoided. REGION is suppressed in some states. (In past releases, selected observations of POPSIZE required suppression.) In total, approximately 4 percent of observations are recoded.

States not listed are not in the CE sample.

The table below lists the code CE uses to identify the state, the type of suppression, and the name of the state.

State codes, type of suppression, and state name

FIPS code   Flag (prior to 2015, 2015 forward)   State

1           D                                    Alabama
2                                                Alaska
4                                                Arizona
5           A A                                  Arkansas
6           B                                    California
8           B                                    Colorado
9                                                Connecticut
10          C D                                  Delaware
11                                               District of Columbia
12                                               Florida
13          F                                    Georgia
15                                               Hawaii
16          A                                    Idaho
17          B F                                  Illinois
18          B B                                  Indiana
20          B                                    Kansas
21                                               Kentucky
22                                               Louisiana
23          B                                    Maine
24          D D                                  Maryland
25                                               Massachusetts
26          B B                                  Michigan
27          C F                                  Minnesota
28          A                                    Mississippi
29          B                                    Missouri
30          A A                                  Montana
31          A                                    Nebraska
32                                               Nevada
33                                               New Hampshire
34          D                                    New Jersey
35          A                                    New Mexico
36          B                                    New York
37          A                                    North Carolina
38          A                                    North Dakota
39          B B                                  Ohio
40                                               Oklahoma
41          B B                                  Oregon
42          F                                    Pennsylvania
44                                               Rhode Island
45                                               South Carolina
46          A                                    South Dakota
47          B                                    Tennessee
48          B B                                  Texas
49                                               Utah
50          A                                    Vermont
51          B F                                  Virginia
53          B                                    Washington
54          B F                                  West Virginia
55          F D                                  Wisconsin

Explanation of suppression codes:

  1. A: STATE codes have been suppressed for all sampled CUs in that state.
  2. B: STATE codes have been suppressed for some sampled CUs in that state.
  3. C: STATE codes have either been re-coded for all observations or all strata of observations from this state include "re-codes" from other states.*
  4. D: STATE codes have either been re-coded for some observations from this state or at least one stratum of observations from this state includes "re-codes" from other states.*
  5. E: STATE code has been suppressed for some sampled CUs in that state and, either STATE has been re-coded or the state includes "re-codes" from other states in all strata.*
  6. F: STATE code has been suppressed for some sampled CUs in that state and, either STATE has been re-coded or the state includes "re-codes" from other states in at least one stratum.*
*A STATE stratum is a unique POPSIZE and BLS_URBN combination.

Interfile transfer

Interfile recoding is used to topcode variables that appear in multiple files. This method is also called "mapping" and is used to topcode observations in the MTBI, ITBI, and DTBD files.

This method uses three steps:

  1. Variables are topcoded in the EXPN, FMLI, and FMLD files.
  2. The topcoded variables are mapped to their appropriate UCC.
  3. If the variable was topcoded in the EXPN, FMLI, or FMLD file, then the associated UCC receives a topcoded value, and the flag value is set to 'T' for that observation of the mapped variable.

The concordance file, called the Parse file, lists which EXPN variables are mapped to which UCC. To obtain the Parse file, please contact the Consumer Expenditure Survey at the phone number or email address at the bottom of this page. Some UCCs have multiple topcode values, depending on which variable the original value is mapped from.
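
The mapping step can be sketched as a simple lookup. Everything named here is hypothetical: the variable names, the UCC label, the record layout, and the amounts are placeholders, since the real variable-to-UCC pairs come from the Parse file. The sketch only illustrates how a topcoded value and its 'T' flag carry over, and why one UCC can end up with more than one topcode value.

```python
# Hypothetical concordance (variable name -> UCC). Real pairs are listed
# in the Parse file and are not reproduced here.
CONCORDANCE = {
    "EXPN_VAR_A": "UCC_X",
    "EXPN_VAR_B": "UCC_X",  # two source variables feeding the same UCC
}

def map_to_ucc(rows):
    """Carry each already-topcoded source value and its flag to the mapped UCC.

    rows is a list of dicts with the hypothetical keys
    'variable', 'value', and 'flag'. Unmapped variables are skipped.
    """
    mapped = []
    for row in rows:
        ucc = CONCORDANCE.get(row["variable"])
        if ucc is None:
            continue
        mapped.append({"ucc": ucc, "value": row["value"], "flag": row["flag"]})
    return mapped

# Made-up source rows: both were topcoded, but to different amounts.
source_rows = [
    {"variable": "EXPN_VAR_A", "value": 5_000, "flag": "T"},
    {"variable": "EXPN_VAR_B", "value": 900, "flag": "T"},
    {"variable": "UNMAPPED_VAR", "value": 10, "flag": ""},
]
mapped_rows = map_to_ucc(source_rows)

# Distinct topcoded amounts that landed on the same UCC.
topcodes_for_ucc_x = sorted(
    {r["value"] for r in mapped_rows if r["ucc"] == "UCC_X" and r["flag"] == "T"}
)
```

In this sketch `topcodes_for_ucc_x` holds two different amounts, one from each source variable mapped to the same UCC.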

 

 

Last Modified Date: August 30, 2016