Making population data count: The Census Data Lake
- By Stephanie Kanowitz
- Dec 22, 2020
To more efficiently tackle the gargantuan task of securely collecting, storing and analyzing data on every person living in the United States this year, the Census Bureau created a central data repository called the Census Data Lake (CDL).
In 2017, the Census Bureau was using the Unified Tracking System – a legacy data warehouse – as the repository for decennial response data and paradata (the data collected about interviews and the survey process) during earlier census tests. To get the performance needed for 2020, officials decided to leverage big data technologies to meet the expected demand, a Census spokesperson told GCN. CDL replaced UTS as the central data repository beginning with the 2018 census test.
CDL streamlines the collection of data and demographic metrics across 52 systems that support 24 operations, including various data collection and post-processing systems plus human resources programs for payroll, training and expenses. The centralized system includes the master address file as well as systems for field operations, disclosure avoidance and survey operational control. To feed data from these disparate systems into CDL, developers created interfaces that route each system's data into the lake via an enterprise service bus.
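The article does not describe the bus's message format, but the pattern it outlines – each source system publishing through a common interface, with the lake side routing records by type – can be sketched roughly as follows. All names here (`EsbEnvelope`, `DataLakeIngest`, the system and record-type strings) are illustrative assumptions, not the bureau's actual interfaces.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class EsbEnvelope:
    """Hypothetical envelope a source system might publish to the service bus."""
    source_system: str   # e.g. a field-operations or payroll system (illustrative)
    record_type: str     # e.g. "response" or "paradata" (illustrative)
    payload: dict
    sent_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # Serialize the envelope for transport over the bus.
        return json.dumps({
            "source_system": self.source_system,
            "record_type": self.record_type,
            "payload": self.payload,
            "sent_at": self.sent_at,
        })


class DataLakeIngest:
    """Toy stand-in for the lake side of the bus: routes envelopes
    into per-record-type stores as they arrive."""

    def __init__(self) -> None:
        self.stores: dict[str, list[dict]] = {}

    def receive(self, message: str) -> None:
        envelope = json.loads(message)
        self.stores.setdefault(envelope["record_type"], []).append(envelope)


# One system publishing one record through the bus into the lake:
lake = DataLakeIngest()
lake.receive(
    EsbEnvelope(
        source_system="field-ops",
        record_type="paradata",
        payload={"case_id": "A1", "contact_attempts": 2},
    ).to_json()
)
```

The point of the envelope is that every system speaks one wire format, so the lake can accept new sources without per-system ingest code – only the payload varies.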
The Census Data Lake provides big data processing capability to fulfill petabyte-scale data management and analytics while satisfying security and privacy requirements and controlling costs. The initiative is transforming how the agency processes demographic and economic data using open source technology and high-performance cloud infrastructure, the spokesperson said.
Beyond consolidating data from these various systems, CDL supports the creation of business intelligence reports for field operations, statistical analysis for coverage-improvement work and the Self Response Quality Assurance operation. It also delivers all response data to the Decennial Response Processing System for post-processing.
Individuals can request access to CDL data via a request process implemented in the bureau’s Remedy IT service management tool. Once the proper approvals are obtained, the user is placed into the corresponding group in the Census Bureau’s Active Directory. CDL integrates with AD to authenticate users and leverages group memberships to enforce authorized access to data assets within CDL.
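The group-membership check described above is a standard authorization pattern: a user may read a data asset only if one of their directory groups is authorized for it. A minimal sketch, assuming invented group and asset names (the actual AD groups and CDL assets are not public):

```python
# Hypothetical group-based authorization in the style of the AD integration
# described above. In practice GROUP_MEMBERSHIP would be looked up from
# Active Directory; both tables here are illustrative.

GROUP_MEMBERSHIP = {
    "analyst1": {"CDL-Paradata-Readers"},
    "fieldmgr": {"CDL-FieldOps-Readers", "CDL-Paradata-Readers"},
}

ASSET_GROUPS = {
    "paradata": {"CDL-Paradata-Readers"},
    "responses": {"CDL-Response-Readers"},
}


def can_access(user: str, asset: str) -> bool:
    """True if any of the user's groups is authorized for the asset."""
    user_groups = GROUP_MEMBERSHIP.get(user, set())
    required = ASSET_GROUPS.get(asset, set())
    return bool(user_groups & required)
```

Because approval only ever changes a group membership (via the Remedy workflow), the data-access rules themselves never need editing when an individual user is onboarded or offboarded.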
Information security is critical for all aspects of census operations. CDL enables security, privacy and policy controls for all types of sensitive data and code at an enterprise level. As a result, the bureau can effectively manage and secure multiple large datasets via automation and use metadata to monitor, link and aggregate datasets through the survey lifecycle until the final products are disseminated.
As the Census Bureau’s first high-performance, secure and scalable cloud computing environment, CDL enables survey owners to ingest, process, analyze, manipulate and share survey data. It will be used to develop data products using modern infrastructure and open source tools that require high-speed parallel processing, artificial intelligence and machine learning.
Under CDL, customized computing environments can be provisioned in hours rather than weeks, without the high upfront costs, acquisition delays and many of the security concerns the bureau’s customers faced under the previous approach, the spokesperson said. CDL users can achieve their goals without having to administer, patch or otherwise maintain hardware and software, letting them focus on their work instead of IT duties.
Stephanie Kanowitz is a freelance writer based in northern Virginia.