The Current Population Survey (CPS) classifies respondents' jobs into hundreds of detailed industry and occupation categories. The classification systems change periodically, creating breaks in time series. Standard concordances bridge the periods, but they often leave empty cells or introduce spurious sharp changes in the series. They also usually assume that a single period is representative, at more aggregate levels, of other historical periods. For each employed CPS respondent classified under an earlier classification system, we apply prediction algorithms, principally random forests, to impute standardized industry, occupation, and related variables. The imputations use microdata on each individual and large training data sets on the population. In some of the training data sets, specialists have classified industry and occupation under two category systems; that is, the records are dual-coded. We train a random forest classifier to handle the classification changes between the 1990s and 2000s, largely on the dual-coded data set, and apply it to the full CPS and IPUMS-CPS to impute several variables, including industry and occupation. Where a classification change splits an industry or occupation, we train the algorithms on observations coded under the new, split categories to predict how the historical observations would have been classified. The result is an augmented CPS with additional columns of standardized industry and occupation. Augmented data sets of this kind can serve research on many topics.
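To make the dual-coding idea concrete, the following is a minimal, from-scratch sketch of a random forest trained on dual-coded records and then applied to historical records that carry only the old code. It is an illustration under stated assumptions, not the authors' implementation: the field names (`old_occ`, `industry`, `educ`), the codes (`A`, `A1`, `A2`, `B`, `B1`), the toy data, and all function names are hypothetical, and a production analysis would more likely rely on an established library.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, rng, n_features, depth=0, max_depth=5):
    """Grow one classification tree using equality splits on categorical features."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or depth == max_depth:
        return majority  # leaf: predict the majority class
    best = None
    # Consider only a random subset of features at each node.
    for f in rng.sample(sorted(rows[0]), n_features):
        for v in sorted({r[f] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[f] == v]
            right = [i for i, r in enumerate(rows) if r[f] != v]
            if not left or not right:
                continue
            score = sum(len(s) * gini([labels[i] for i in s]) for s in (left, right))
            if best is None or score < best[0]:
                best = (score, f, v, left, right)
    if best is None:
        return majority  # no usable split among the sampled features
    _, f, v, left, right = best

    def grow(idx):
        return build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                          rng, n_features, depth + 1, max_depth)

    return (f, v, grow(left), grow(right))

def predict_tree(node, row):
    """Walk a tree until reaching a leaf (a label string)."""
    while isinstance(node, tuple):
        f, v, yes, no = node
        node = yes if row[f] == v else no
    return node

def train_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    """Fit each tree on a bootstrap sample of the dual-coded records."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.choices(range(len(rows)), k=len(rows))
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], rng, n_features))
    return forest

def impute(forest, row):
    """Impute the new code by majority vote across trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]

# Hypothetical dual-coded records: features include the OLD code,
# the label is the NEW code assigned to the same job.
rows, labels = [], []
for educ in ("HS", "BA", "MA"):
    for _ in range(5):
        # Old occupation "A" splits into "A1"/"A2" in the new system.
        rows.append({"old_occ": "A", "industry": "software", "educ": educ})
        labels.append("A1")
        rows.append({"old_occ": "A", "industry": "hardware", "educ": educ})
        labels.append("A2")
        # Old occupation "B" carries over as "B1".
        rows.append({"old_occ": "B", "industry": "retail", "educ": educ})
        labels.append("B1")

forest = train_forest(rows, labels)

# Historical records carry only the old code; the new code is imputed.
historical = [
    {"old_occ": "A", "industry": "software", "educ": "BA"},
    {"old_occ": "B", "industry": "retail", "educ": "HS"},
]
imputed = [impute(forest, r) for r in historical]
print(imputed)
```

The imputed codes would be appended to the historical records as an additional column, which mirrors the augmented data set the abstract describes. In the split case, the individual covariates (here the hypothetical `industry` field) are what allow the classifier to divide one old category between two new ones, something a one-to-one concordance cannot do.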