The promise and problems of including 'big data' in official government statistics
- By Fleur Johns, Caroline Compton, Wayne Wobcke
- Nov 13, 2018
This article was first posted to The Conversation.
The Australian Bureau of Statistics (ABS) will soon announce the kinds of information it will collect in the next national census in 2021. If international trends are a guide, “big data” will comprise a growing part of ABS data collection and analysis.
This may promise greater timeliness and efficiency compared to the traditional paper-based census, but using big data to measure populations and economies is not without challenges.
Debates about how democratic governments should count the people they serve are ongoing in Australia, the U.S. and India. The use of digital technologies for state measurement seems likely to intensify these debates as significant questions emerge around the practice.
Public data gathering has high stakes
For centuries, states have counted and categorized people. Census data and other official statistics are used for government planning and budgeting, to determine political districts for elections, and for many other purposes. Official statistics also help to shape a population’s sense of itself. For these reasons, state counting practices have often been controversial.
In Australia, changing census practice has been part of an ongoing debate about ensuring First Nations people are properly represented. Historic undercounting of Aboriginal and Torres Strait Islander people was redressed by several factors, among them the abandonment of census language that referred to blood quantums, which are now widely accepted as racist.
In the U.S., state counting is likewise a matter of intense dispute. California is among those states currently suing the U.S. federal government because of a question about citizenship status the Trump administration has proposed adding to the 2020 Census. California argues fewer non-citizens will complete the census if the question is included. This would lead to a lower population count and reduced federal funding for states with high numbers of non-citizens.
India has also seen heated national debate about the gathering of caste data and the categorization of “housewives” as non-workers.
Big data use in official statistics is growing
New issues of this kind are likely to emerge as government statistics offices around the world introduce digital data into their work.
The UN is currently spearheading efforts by member states to explore the use of new, digital data sources and technologies for official statistics. The ABS is involved in this endeavor. Since late 2017, for example, the ABS has been analyzing supermarket scanner data to try to improve CPI (inflation) measurement.
Other possibilities for using digital data to improve state measurement are also being explored.
The promise and the problems
The aim of these efforts is to make official statistics more accurate, more affordable to gather, and more attentive to geographically remote or otherwise marginalized communities. While there may be enormous potential to improve official statistics in these ways, big data use for state measurement raises thorny issues.
The first of these is the difficulty of auditing such data sources. All datasets come with blind spots and biases. Given the contentiousness of state counting, and the potentially high stakes of miscounting, it’s important that the public maintains an overall sense of how, where, and why data are being collected, along with the capacity to query those practices. This may be difficult to ensure when data used for official measures are privately sourced.
While the ABS has the legal right to compel the provision of information, including from data providers, insight into how private companies collect and process data may be hard to obtain, and may not be shareable publicly.
Reliance on commercial data sources could also leave official statisticians dependent on privately owned infrastructure, such as cell towers. The distribution and maintenance of this infrastructure are driven by commercial interests, potentially working against the needs of responsible public data collection.
Another problem with the use of big data in official statistics is that data gathered are often not fit for the kinds of purposes states are pursuing. Data of this kind are messy and unstructured, and it can be hard to separate information from noise in their analysis. Because machine-learning methods for unstructured data are never 100 percent accurate, any inferences drawn must be carefully validated.
Statisticians are well aware of these limitations, but face challenges communicating with policymakers and the general public about them.
Enthusiasm must not outrun public engagement
There is a risk that, because digital data are relatively abundant, those in charge of state measurement practices will make use of those data without due regard to questions of what should, and should not, be measured for particular purposes.
Without knowing when and how they are being counted, the public cannot be part of that discussion. It is incumbent on governments to bridge that gap, and incumbent on all Australians to take an active interest in these practices as they develop.