example from IBM

Combatting facial recognition bias with better data

A major criticism of facial recognition technology is that it can only deliver results based on the data used to train it. And when it learns from datasets where young white men are overrepresented, it has difficulty correctly identifying women and people with darker skin tones, leading to false positives and charges of bias in the artificial intelligence algorithms.


Why confidence matters in facial recognition systems

Concerns over the accuracy of facial recognition systems came under the spotlight when Amazon’s Rekognition incorrectly matched 28 members of Congress to criminal mugshots. Read more.

Combining the facial recognition decisions of humans and computers can prevent costly mistakes

Improving the accuracy of facial-recognition systems can make society safer by protecting against terrorism, organized crime and identity fraud. Read more.

To increase the variety of the training data for facial recognition algorithms, IBM has released the Diversity in Faces dataset, an annotated collection of 1 million human facial images.

The DiF dataset is drawn from the publicly available YFCC-100M Creative Commons dataset, 100 million images and videos compiled from data available on Flickr.  IBM researchers tagged selected photos with 47 feature dimensions and used 10 coding schemes to obtain objective measurements of features of human faces.

To select the photos from Flickr, IBM researchers first filtered the YFC-100M images for pictures with Creative Common licenses, according to a research paper describing the process. They then used a convolutional neural network to identify faces and a machine-learning toolkit to select photos of a usable size where the subject was facing the camera.

The resulting 1 million face images were tagged with 19 landmarks that are used in the coding schemes to describe facial features in terms of their distance from each other, size and symmetry as well as skin color, age and gender predictions and contrast, which tends to vary with age. The researchers also used crowdsourcing to obtain human-labeled gender and age annotations.

While the IBM researchers acknowledged the 47 total feature dimensions may not evenly cover the diversity among human faces, "we believe the approach outlined in this work provides a needed methodology for advancing the study of diversity for face recognition," they said.

Although facial recognition systems are just starting to be used by government agencies, their reliability has been questioned.  In July 2018, the ACLU released the results of a study that found Amazon’s version of the technology, Rekognition, misidentified 28 members of Congress.

These examples, and other have prompted calls from academics and tech companies for federal leadership over privacy protections and how the technology will be deployed.

The technology has enjoyed some successes.

Many states are using facial recognition technology to prevent the issuance of fraudulent drivers licenses that contribute to identity theft. The software compares facial characteristics on a driver’s license or ID photos with other images on file with the Department of Motor Vehicles.  Officials in New York, which has been using the program since 2010, in August 2017 said the technology has resulted in spotting some 21,000 cases of possible identity fraud, with the number of positive matches increasing as the technology improves.

In October 2018, Customs and Border Protection’s tech trial of biometric entry/exit identification nabbed two imposters attempting to cross from Mexico into the U.S. near Yuma, Ariz., using someone else’s border-crossing cards. Dulles International Airport, an early adopter of the system, identified its first imposter in August.

The IBM dataset is available to the global research community upon request.

About the Author

Susan Miller is executive editor at GCN.

Over a career spent in tech media, Miller has worked in editorial, print production and online, starting on the copy desk at IDG’s ComputerWorld, moving to print production for Federal Computer Week and later helping launch websites and email newsletter delivery for FCW. After a turn at Virginia’s Center for Innovative Technology, where she worked to promote technology-based economic development, she rejoined what was to become 1105 Media in 2004, eventually managing content and production for all the company's government-focused websites. Miller shifted back to editorial in 2012, when she began working with GCN.

Miller has a BA and MA from West Chester University and did Ph.D. work in English at the University of Delaware.

Connect with Susan at [email protected] or @sjaymiller.


  • Records management: Look beyond the NARA mandates

    Pandemic tests electronic records management

    Between the rush enable more virtual collaboration, stalled digitization of archived records and managing records that reside in datasets, records management executives are sorting through new challenges.

  • boy learning at home (Travelpixs/Shutterstock.com)

    Tucson’s community wireless bridges the digital divide

    The city built cell sites at government-owned facilities such as fire departments and libraries that were already connected to Tucson’s existing fiber backbone.

Stay Connected