This case study describes the data quality assessment of a new method for collecting Consumer Price Index (CPI) gasoline price data at the Bureau of Labor Statistics (BLS): obtaining prices from retailers or data aggregators instead of through a sample survey. While crowdsourcing data directly from retailers makes collection more efficient and less costly, the method also introduces the potential for increased errors in collection. This case study highlights how BLS mitigates threats to the accuracy and reliability of these crowdsourced data and how it uses the Federal Committee on Statistical Methodology (FCSM) Framework for Data Quality to guide the expansion of alternative data into CPI estimation.
BLS traditionally collects prices by hand for goods and services sampled in the survey underlying the CPI.[1] However, collecting data directly from retailers or data aggregators in a more automated fashion (“crowdsourcing”) makes collection more efficient and frees resources for reallocation, which improves cost-effectiveness. Automated collection can also capture significantly more price observations, thereby improving accuracy.
One example is the Crowdsourced Motor Fuels Data project, which led to the replacement of the traditional CPI gasoline sample with data from a secondary source.
The CPI program has been collecting daily motor fuel price data for regular, midgrade, and premium gasoline from the secondary source since June 2017.
Crowdsourced secondary source data are collected from all gas stations within CPI’s 75 geographic sampling areas. While the secondary source data consist of millions more observations per month than the traditionally collected CPI data, they are not considered a census of all gasoline price observations.
Between 2017 and 2019, the CPI program conducted research on the data, including index simulations. After being vetted internally and in multiple outside venues, such as the American Economic Association (AEA) and Eurostat, the dataset was approved for implementation, and the secondary source data were included in the CPI in July 2021.
Attribute | Survey data | Secondary source data |
---|---|---|
Frequency | Monthly | Daily |
Number of price observations | 4,000 price quotes/month | 6.1 million observations/month |
Number of retail outlets | 1,400 outlets/month | 91,272 stations/day |
Data characteristics | Price, type of service, gasoline content, octane level, payment type, special pricing, brand name, address, collected throughout the month | Daily average price, number of valid reports, station ID, ZIP code, state, posted time |
BLS and the CPI program have long been involved in evaluating alternative datasets.[2] Before the FCSM Framework for Data Quality (the DQ framework) existed, the CPI and Producer Price Index (PPI) programs collaborated on a “scorecard” for alternative data that is very similar to the DQ framework. The scorecard relied on the “qualitative” and “quantitative” analyses produced by researchers to evaluate the alternative data.
Researchers used the quantitative analysis to interpret summary statistics from the alternative data and used the qualitative analysis to summarize the data in narrative form, much like the DQ framework. The qualitative analysis in the scorecard included narrative summaries on how many observations will be provided, how the data are compiled in terms of database structure and software format, how the data meet the agency’s coverage requirements for the product category and geography, sufficiency of the level of characteristic details, and timeliness, security, and reliability of data delivery within a monthly production cycle. The qualitative analysis rated the alternative data in the categories analyzed and provided a recommendation on whether to move forward with the data.
The CPI program views the secondary source data as a success and a significant step forward in expanding the use of alternative data into CPI estimation. The cost of research and development was not insignificant. The research spanned 2 years, and the development and testing required another year. Still, the CPI program views the cost associated with the implementation of this project as an investment that is spread out over all the CPI alternative data research and development projects, given that many other projects will use a similar approach.
The secondary source provides daily gasoline prices for thousands of gas stations across the United States. The CPI program uses that data directly to produce the CPI and average price products for gasoline and individual fuel types, including regular, midgrade, and premium fuel types.
The secondary source is providing the data on a voluntary basis at no cost to the CPI. The data source does not limit the CPI program’s ability to release the data to users. We have now implemented the secondary source data into the production process, meaning we are currently releasing the CPI and average price products as scheduled using the secondary source data and associated methodology. The CPI program is still releasing the same products at the same level of granularity as in the past.
The daily prices are collected throughout the month. The data also include weekend and holiday observations, providing pricing data for days that were previously not reflected in the traditionally collected data.
The secondary source data were implemented into the CPI’s monthly production schedule, and no schedule accommodations were needed. The secondary source typically provides the data in a timely manner with relatively few interruptions. If a day’s secondary source data collection is missed for any reason, the CPI program either retrieves the missing data later in the reference period or publishes the CPI without the missing data. Such gaps have been rare: the CPI program has been able to collect data on over 95% of the days since it began using the secondary source data.
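The share of days with a successful collection can be tracked with a simple calendar check. The following is a minimal sketch, not the CPI program's actual tooling: the function name `collection_coverage`, the example dates, and the choice to count calendar days in a single month are all assumptions made for illustration.

```python
from datetime import date, timedelta

def collection_coverage(received_days, start, end):
    """Return (share of calendar days with a delivery, sorted missing days).

    received_days: iterable of datetime.date values on which data arrived.
    start, end: first and last day of the period, both inclusive.
    """
    total_days = (end - start).days + 1
    expected = {start + timedelta(days=i) for i in range(total_days)}
    missing = sorted(expected - set(received_days))
    return 1 - len(missing) / total_days, missing

# Hypothetical month: every day of July 2021 arrived except two.
july = [date(2021, 7, d) for d in range(1, 32)]
received = [d for d in july if d.day not in (4, 18)]
share, missing = collection_coverage(received, date(2021, 7, 1), date(2021, 7, 31))
```

The returned list of missing days identifies what would need to be retrieved later in the reference period.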
We produce price indexes and average price products at the same level of granularity as we have historically, including U.S.-, regional-, and city-level products. The large sample of gas stations protects confidentiality by including price changes across the thousands of locations. Thus, data users are unable to ascertain whether a particular gas station is in the sample or not. Station IDs provided by the secondary source further mask the identity of the stations. With so many observations from stations across the country, the CPI program is not concerned about insufficient data.
The data provided by the secondary source are more granular in terms of unit of time: the CPI program receives daily prices, whereas the CPI is a monthly measure. A method was therefore developed to convert the daily prices into monthly prices. For each station and fuel type, we calculate the arithmetic average of the daily prices across the days of the month.
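The daily-to-monthly conversion described above can be sketched as follows. This is an illustration under stated assumptions rather than the production method: the record layout, the function name `monthly_average_prices`, and the sample prices are hypothetical, and the sketch simply averages whichever days are present for a station and fuel type.

```python
from collections import defaultdict
from statistics import mean

def monthly_average_prices(daily_records):
    """Average daily prices by (station_id, fuel_type, month).

    daily_records: iterable of (station_id, fuel_type, day, price) tuples,
    where day is an ISO date string such as '2021-07-01'.
    """
    grouped = defaultdict(list)
    for station_id, fuel_type, day, price in daily_records:
        month = day[:7]  # 'YYYY-MM' prefix of the ISO date
        grouped[(station_id, fuel_type, month)].append(price)
    # Arithmetic mean over the days observed in each month.
    return {key: mean(prices) for key, prices in grouped.items()}

# Hypothetical daily observations (prices in dollars per gallon).
records = [
    ("S1", "regular", "2021-07-01", 3.00),
    ("S1", "regular", "2021-07-02", 3.10),
    ("S1", "regular", "2021-07-03", 3.20),
    ("S1", "premium", "2021-07-01", 3.60),
    ("S2", "regular", "2021-07-01", 2.90),
    ("S2", "regular", "2021-08-01", 2.80),
]
monthly = monthly_average_prices(records)
```

Here `monthly[("S1", "regular", "2021-07")]` holds the average of the three July prices for station S1. Each station and fuel type is averaged separately, and a day with no report just drops out of that station's average.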
Research results from the secondary source compared favorably to the CPI for gasoline at the U.S. level. The secondary source research process did not find differences greater than 1.0% at the U.S. level over 3.5 years.
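A comparison of this kind can be expressed as a check on the largest gap between the two index series. The sketch below is only illustrative: the source does not state how the differences were measured, so the percent difference in index levels is an assumption, as are the function name `max_abs_pct_difference` and the made-up index values.

```python
def max_abs_pct_difference(official, experimental):
    """Largest absolute percent gap between two equally long index series."""
    if len(official) != len(experimental):
        raise ValueError("series must cover the same months")
    return max(abs(e - o) / o * 100.0 for o, e in zip(official, experimental))

# Made-up monthly index levels, for illustration only.
official_index = [100.0, 102.0, 98.5, 101.0]
experimental_index = [100.0, 102.5, 98.0, 101.4]

gap = max_abs_pct_difference(official_index, experimental_index)
within_tolerance = gap <= 1.0  # threshold taken from the 1.0% figure in the text
```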
In terms of reliability, the CPI program studied the minimum amount of data needed from the secondary source to publish a gasoline index. The CPI program still collects gasoline data using its traditional method as a fallback to the secondary source data, but is also working to reduce the amount of data collected in the traditional way to lower costs. Additionally, the fallback data will serve as a baseline quality check on the secondary source data.
As mentioned in the relevance section, the gasoline prices provided by the secondary source match the definition of gasoline used in the CPI. Furthermore, the index methodology for use with the secondary source data aligns with best practices of price index theory as described in the International CPI Manual.[3] Finally, as mentioned in the accuracy and reliability section, the results of the alternative index (using the secondary source data) closely match the results of the traditional CPI for gasoline.
The probability and impact of malicious or unintentional interference by data providers with the data in a way that affects the estimates are low. The CPI program perceives no incentive for a provider to manipulate the information. Data providers have an incentive for their data to be as accurate as possible since they also publish this information on an even more granular level. In general, the secondary source’s and the CPI’s incentives align for accurate and reliable data.
Since we began researching the data, there has been particular interest in the gasoline index.
As stated in the accuracy and reliability section of our assessment, the CPI program compared over 3 years’ worth of price indexes and found little difference between the official CPI for gasoline and the experimental index using secondary source data.
We also compared additional months as part of our acceptance testing and parallel testing process during the implementation of the secondary source data into our published index. The secondary source is also an often-cited source in news organizations and is widely accepted by its users as a credible source of gasoline price information.
The CPI program collects data from the secondary source via Secure File Transfer Protocol (SFTP). The risk to computer and physical security is considered low based on our collection of secondary source data over time, which has been relatively consistent.
As previously mentioned, the CPI program collects data from thousands of gas stations across the country, which generally masks the identity of individual stations. Furthermore, the CPI program uses pseudonyms to protect the confidentiality of the data provider.
Lessons learned and sustainability
The FCSM framework is an important tool to guide the expansion of alternative data into CPI estimation. However, the CPI program quickly recognized it is neither a “one size fits all” nor a “be all and end all” approach. Rather, BLS views the DQ framework as a guideline that needs to be adaptable to an organization’s unique circumstances and concerns.
For instance, BLS developed a set of questions to add context to the 11 dimensions. To evaluate the dimension of relevance, BLS asks, “Are the data a relevant input to our data products and measurement objective?” To evaluate timeliness, the CPI program asks, “Are the data representative of the index reference period?” The CPI program also considers an additional question about cost-effectiveness to complement the DQ framework: “Are the new data and methods more cost-effective than the data and methods they are replacing?”
Additionally, since adopting the DQ framework, the CPI program has replaced the “quantitative” and “qualitative” reports mentioned above with a single “Alternative Data Methods Summary,” a living document that is updated regularly to reflect changes in the methods over time. The summary document is very similar to the qualitative analysis mentioned above. Methods related to the collection of alternative data are subject to change throughout the approval process, as analysts and stakeholders familiarize themselves with the data. Stakeholders use the summary document to assess the adherence to the DQ framework over time.
The CPI program has also established an alternative data approval process that includes two approval groups: the technical experts and the approval board. The technical experts group, which consists of senior BLS economists and statisticians, is working with BLS staff to develop a new methodology to ensure that the use of alternative data does not increase the total measurement error relative to traditional methods of data collection or previously implemented non-traditional methods. These groups will help the CPI program sustain adherence to the DQ framework by ensuring that all future alternative data projects follow its guidelines.
Since there is no single metric to assess total measurement error, the technical experts make an overall qualitative assessment. In general, they consider both the statistical viewpoint (i.e., can the portion of the marketplace not in the sample be considered missing at random?) and the economic viewpoint (i.e., are the new data source and method consistent with the scope of the CPI, and do they measure what we intend to measure?). They consider each area of the methodology summaries, including geography, price specifics, item definition, item eligibility, item classification, sampling, sample rotation, index methodology and index formula, item substitution/quality adjustment/comparability, and imputation.
Once the technical experts approve the methodology associated with an alternative data source, the proposal is then reviewed by the approval board for final authorization. The approval board is a cross-program group of managers charged with approving methodologies for implementation into the CPI. It is the responsibility of the approval board to ensure that the alternative data source and methodology being considered align with the data quality framework as outlined by FCSM.
The approval board can either approve the methodology for implementation or send it back to the research team with comments. A consensus agreement must be reached within 2 weeks of the recommendation for the approval board to approve or disapprove a proposal for implementation. When it approves a proposal, the approval board sends its approval to the technical experts group, the alternative data oversight group, and the CPI management group. In all other cases, the issues preventing approval must be documented and returned to the research team so that it can mitigate them.
As the CPI program expands its use of alternative data sources, the application of the FCSM framework will continue to guide our data quality assessment process. However, alternative data sources are typically unique, and, thus, we recognize that the FCSM framework may require refining and adjustment as we encounter new data scenarios.
Domain | Dimension | Definition | BLS Question(s) |
---|---|---|---|
Utility | Relevance | Relevance refers to whether the data product is targeted to meet current and prospective user needs. | What is the probability of unknown sources of bias? |
Utility | Accessibility | Accessibility relates to the ease with which data users can obtain an agency’s products and documentation in forms and formats that are understandable to data users. | Are the costs to access the data an effective use of resources? |
Utility | Timeliness | Timeliness is the length of time between the event or phenomenon described by the data and their availability. | Did a lack of timeliness impact how the data for the index could be used for the reference period? |
Utility | Punctuality | Punctuality is measured as the time lag between the actual release of the data and the planned target date for data release. | Can the methodology be implemented within the typical production processing schedule? |
Utility | Granularity | Granularity refers to the amount of disaggregation available for key data elements. Granularity can be expressed in units of time, level of geographic detail, or the amount of detail on any of a number of characteristics (e.g., demographic, socio-economic). | Are there adequate data to support the current level of granularity in data products? |
Objectivity | Accuracy and reliability | Accuracy measures the closeness of an estimate from a data product to its true value. Reliability, a related concept, characterizes the consistency of results when the same phenomenon is measured or estimated more than once under similar conditions. | Are there any concerns with the technical experts’ qualitative assessment of total measurement error? |
Objectivity | Coherence | Coherence is defined as the ability of the data product to maintain common definitions, classification, and methodological processes, to align with external statistical standards, and to maintain consistency and comparability with other relevant data. | Does the methodology impact the ability to compare CPI data with external sources? |
Integrity | Scientific integrity | Scientific integrity refers to an environment that ensures adherence to scientific standards and use of established scientific methods to produce and disseminate objective data products and one that shields these products from inappropriate political influence. | What is the probability and impact of the data provider (either maliciously or unintentionally) interfering with the data in a way that impacts estimates? |
Integrity | Credibility | Credibility characterizes the confidence that users place in data products based simply on the qualifications and past performance of the data producer. | Based on a review of the output of index simulations and an assessment of the differences, how much does the simulation deviate from production? |
Integrity | Computer and physical security | Computer and physical security of data refer to the protection of information throughout the collection, production, analysis, and development process from unauthorized access or revision to ensure that the information is not compromised through corruption or falsification. | What is the probability of a loss of data or data quality issues due to technical issues? |
Integrity | Confidentiality | Confidentiality refers to a quality or condition of information that is protected by an obligation not to disclose the information to an unauthorized party. | Are there any confidentiality concerns related to the use or announcement of this methodology? |
[1] Bieler, John, et al. A Nontraditional Data Approach to the CPI Gasoline Index: CPI Crowd-Sourced Motor Fuels Data Analysis Project, https://www.aeaweb.org/conference/2020/preliminary/paper/n8b4hBsT.
[2] Konny, Crystal G., Brendan Williams, and David M. Friedman. Big Data in the US Consumer Price Index: Experiences and Plans. NBER Chapters, National Bureau of Economic Research, Inc., 2019.
[3] Consumer Price Index Manual: Theory and Practice. International Labour Office (ILO), 2004.