How accurate are face recognition technologies, and how well do they perform?
There is no simple answer, but this third piece in the series on five key face recognition technology governance challenges explains some of what publicly available third-party testing data reveals. Specifically, this piece shares insights from the National Institute of Standards and Technology (NIST) Face Recognition Vendor Test (FRVT) program and the Department of Homeland Security (DHS) Science and Technology Directorate (S&T) Biometric Technology Rallies. This piece also considers other types of testing that could be helpful.
In a deployment setting, assessing face recognition systems’ “performance” depends on user requirements and priorities. Each aspect of the face recognition system, including algorithm accuracy, image capture device quality, system speed and calibration, deployment environmental conditions, and the human operator’s effectiveness, affects the full system’s ability to meet those requirements and priorities. Different types of face recognition technology testing evaluate different face recognition system components under different settings. Each type of testing provides insight into one or more aspects of face recognition systems’ performance, but no single test currently provides a dispositive answer to the broad question of how well face recognition technologies perform.
The NIST FRVT program evaluates face recognition algorithms that vendors voluntarily submit for benchmark testing. NIST evaluates both face verification (“one-to-one” or “1:1”) and face identification (“one-to-many” or “1:N”) algorithms and assesses overall accuracy and accuracy across specific demographic groups. (The first blog in this series explains the difference between face identification, face verification, and other technologies that analyze human faces.)
To perform overall and demographic-specific accuracy tests, NIST uses probe and gallery images of varying styles and quality, such as passport, mugshot, border crossing, and webcam photos. NIST also evaluates algorithm performance across galleries of multiple sizes. Because NIST tests hundreds of algorithms from many developers in a standardized format, FRVT testing provides useful information about how well individual algorithms perform tasks and the overall state and trajectory of face recognition algorithm accuracy across the market. (Analyzing historical NIST testing data demonstrates that face recognition algorithms have become increasingly accurate over time.)
Below, this piece summarizes test results that provide insight into (1) face identification algorithms’ overall accuracy, (2) face verification algorithms’ overall accuracy, and (3) face identification and face verification algorithms’ accuracy across demographic groups.
Face Identification Algorithms’ Overall Accuracy (as of April 21, 2023)
NIST’s ongoing face identification testing reveals that accuracy varies notably across algorithms and that algorithms’ accuracy varies across different types of images. This testing evaluates algorithm accuracy by measuring the false negative identification rate (FNIR) at a threshold that limits the false positive identification rate (FPIR) to 0.003. In other words, this testing measures how often algorithms produce a false negative error (i.e., incorrectly indicating that two templates depict different individuals) while only producing false positive errors (i.e., incorrectly indicating that two templates depict the same individual) 0.3% of the time.
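To make the "FNIR at a fixed FPIR" metric concrete, here is a minimal Python sketch. It is a simplification under stated assumptions, not NIST's actual test harness. The function names, the synthetic scores, and the reduction of each search to a single similarity score are all hypothetical.

```python
def threshold_for_fpir(nonmated_top_scores, target_fpir):
    """Pick a threshold so that at most target_fpir of non-mated searches
    return a candidate scoring strictly above it."""
    ranked = sorted(nonmated_top_scores, reverse=True)
    # round() guards against floating-point error in target_fpir * len(ranked)
    allowed = int(round(target_fpir * len(ranked), 9))
    if allowed >= len(ranked):
        return float("-inf")  # every non-mated search may be a false positive
    return ranked[allowed]


def fpir(nonmated_top_scores, threshold):
    """Share of searches for people NOT in the gallery that still return a hit."""
    hits = sum(score > threshold for score in nonmated_top_scores)
    return hits / len(nonmated_top_scores)


def fnir(mated_scores, threshold):
    """Share of searches for people who ARE in the gallery whose true mate is missed."""
    misses = sum(score <= threshold for score in mated_scores)
    return misses / len(mated_scores)


# Synthetic, illustrative scores only (not NIST data).
nonmated_top_scores = [i / 1000 for i in range(1000)]
mated_scores = [0.9995, 0.9991, 0.9988, 0.9983, 0.9979,
                0.9975, 0.9972, 0.9968, 0.9964, 0.9902]

threshold = threshold_for_fpir(nonmated_top_scores, 0.003)
print("threshold:", threshold)
print("FPIR:", fpir(nonmated_top_scores, threshold))
print("FNIR:", fnir(mated_scores, threshold))
```

With this synthetic data, the threshold lands at 0.996. Three of the 1,000 non-mated searches exceed it (FPIR of 0.003), and one of the ten mated searches falls at or below it (FNIR of 0.1). An algorithm described as ">99% accurate" under this metric would have an FNIR below 0.01 at the same threshold.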
Forty-five of the 105 identification algorithms evaluated were >99% accurate when comparing probe templates from high-quality images (i.e., mugshots) to a gallery of 1.6 million templates from high-quality images (i.e., mugshots). However, only three algorithms maintained this >99% accuracy level when the gallery size increased to 3 million templates and 12 or more years elapsed between the time the probe image was captured and the time the gallery image was captured. Three algorithms from three developers were >99% accurate when comparing probe templates from lower-quality images (i.e., webcams) to 1.6 million gallery templates from high-quality images (i.e., mugshots). Three algorithms from three developers were >90% accurate when comparing probe templates derived from off-angle (90-degree rotation) images to 1.6 million gallery templates from mugshot images taken straight-on.
Face Verification Algorithms’ Overall Accuracy (as of April 21, 2023)
NIST’s ongoing face verification testing similarly shows that different algorithms exhibit different accuracy levels and that the same algorithm may perform more or less accurately depending on the quality of images used to form the probe and gallery face templates. NIST’s face verification testing evaluates algorithm accuracy by measuring the false non-match rate at a threshold that limits the false match rate to 10⁻⁶ or 10⁻⁵. In other words, this testing measures how often algorithms produce a false negative error (i.e., incorrectly indicating that two templates depict different individuals) while only producing false positive errors (i.e., incorrectly indicating that two templates depict the same individual) 0.0001% of the time or 0.001% of the time.
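The conversion between a false match rate written as a power of ten and the percentages quoted above can be checked with a few lines of Python. This is a small arithmetic sketch. The helper name `describe_fmr` and the figure of 10 million impostor comparisons are hypothetical, chosen only to show the scale involved.

```python
from fractions import Fraction


def describe_fmr(fmr, impostor_comparisons):
    """Return the FMR as a percentage and the expected number of false matches."""
    percent = float(fmr * 100)
    expected_false_matches = fmr * impostor_comparisons
    return percent, expected_false_matches


# Exact fractions avoid floating-point rounding in the tiny rates.
for exponent in (6, 5):
    fmr = Fraction(1, 10**exponent)
    percent, expected = describe_fmr(fmr, 10_000_000)  # hypothetical volume
    print(f"FMR 1e-{exponent}: {percent}% -> about {expected} false matches "
          "per 10 million impostor comparisons")
```

At a false match rate of one in a million, 10 million impostor comparisons would be expected to yield about 10 false matches. At one in 100,000, the same volume would yield about 100.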
Based on these parameters, 233 of 503 evaluated algorithms were >99% accurate when comparing a template derived from a constrained, cooperative image (i.e., a carefully posed photo) to a template derived from a visa photo. No algorithm was able to achieve >99% accuracy when comparing a template derived from an unconstrained, uncooperative probe image (i.e., an unposed photo in which the subject’s face may be pitched, turned, or partially obscured) to a template derived from a kiosk photo.
Face Identification and Face Verification Algorithms’ Accuracy Across Demographic Groups
NIST’s 2019 report specifically evaluated face recognition algorithms’ performance across demographic groups based on race and sex. The results showed notable variations across face identification algorithms. The most accurate face identification algorithms showed “undetectable” false positive error rate differentials across demographic groups. Many less accurate algorithms showed higher false positive error rates for “West and East African and East Asian people” than for “Eastern European individuals,” for women than for men, and for children and elderly people than for middle-aged adults.
NIST’s ongoing face verification accuracy testing across demographic groups also shows that many algorithms achieve higher accuracy on individuals with lighter, rather than darker, skin tones and on men than women. The relative performance differences are often substantial (i.e., false positive error rates that are often >100 times worse for the lowest-accuracy demographic group than for the highest-accuracy demographic group). Nonetheless, the absolute performance differences across demographic groups are often <1%, and many algorithms are 98-99% accurate (expressed in terms of false non-match rate at a threshold that limits the false match rate to 0.03%) across every demographic group.
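A short worked example shows how a relative difference of 100 times can coexist with an absolute difference of less than one percentage point. The two error rates below are hypothetical values chosen for illustration, not figures from NIST's results.

```python
from fractions import Fraction


def compare_groups(best_rate, worst_rate):
    """Return (relative ratio, absolute gap in percentage points)."""
    relative = worst_rate / best_rate
    absolute_points = (worst_rate - best_rate) * 100
    return relative, absolute_points


# Hypothetical false positive rates for two demographic groups.
best = Fraction(1, 100_000)   # 0.001% for the highest-accuracy group
worst = Fraction(1, 1_000)    # 0.1% for the lowest-accuracy group

relative, absolute_points = compare_groups(best, worst)
print("relative difference:", float(relative), "times")
print("absolute difference:", float(absolute_points), "percentage points")
```

Here the lowest-accuracy group's error rate is 100 times the highest-accuracy group's, yet the gap between them is only 0.099 percentage points. Both framings describe the same data, which is why reports of demographic differentials should state which one they use.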
The Department of Homeland Security’s Science and Technology Directorate conducts performance evaluations of face recognition systems at the Maryland Test Facility. These Biometric Technology Rallies evaluate face recognition systems in conditions similar to operational environments.
The 2022 Biometric Technology Rally evaluated “face biometric acquisition systems and face biometric matching algorithms that are designed to reliably identify small groups of people opting-in to the biometric process, and to not process bystanders who have not opted-in.” The 2022 Rally used a group of “575 volunteers diverse in race, gender, age, and skin tone” to evaluate “four acquisition systems and 10 matching systems . . . based on their efficiency, user satisfaction, effectiveness, and privacy.” Volunteers “passed through each acquisition system in a group of two and a group of four while 8-10 others passed around the system via adjacent bypass lanes.” Acquisition systems had to capture “one best photo per person passing through the system,” and matching systems “had to identify the individual in each submitted photo.”
On average, the face recognition systems that the 2022 Rally tested identified <1% of non-users. The top system accurately identified 97.4% of people in groups of two and groups of four. The median system accurately identified 93% of people in groups of two and groups of four. The worst system accurately identified 64.4% of people in groups of two and 59.8% of people in groups of four. A total of 17 out of 40 system combinations met the >95% true identification rate (TIR) threshold over all volunteers in groups of two. (TIR is “the percentage of people that used the system who were correctly identified.”) Of these 17 system combinations (which included two acquisition systems and nine matching algorithms), most were more accurate for females than males. One acquisition system produced higher accuracy results across all matching systems for darker-skinned individuals than for lighter-skinned individuals. The median processing speed across all the tested systems was 1.9 seconds per person in a group of two and 1.7 seconds per person in a group of four.
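The true identification rate and the 95% threshold check can be sketched as follows. The system names and counts are hypothetical and are not Rally data. Only the 17-of-40 figure comes from the results described above.

```python
RALLY_TIR_THRESHOLD = 95.0  # percent, the threshold cited for the Rally


def true_identification_rate(correct, users):
    """Percentage of people who used the system and were correctly identified."""
    return 100 * correct / users


# Hypothetical (correctly identified, total users) counts per system combination.
hypothetical_results = {
    "combo_a": (195, 200),
    "combo_b": (186, 200),
    "combo_c": (129, 200),
}

passing = [
    name
    for name, (correct, users) in hypothetical_results.items()
    if true_identification_rate(correct, users) > RALLY_TIR_THRESHOLD
]
print("combinations above threshold:", passing)

# Share of 2022 Rally combinations that met the threshold in groups of two.
share_passing_2022 = 100 * 17 / 40
print("share of 2022 combinations passing:", share_passing_2022, "%")
```

In this illustration only `combo_a` (97.5%) clears the threshold. Applied to the reported result, 17 of 40 combinations corresponds to 42.5% of the tested system combinations.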
Overall, the 2022 Rally found that “group processing can be effective” and that “cameras respect the privacy of bystanders and non-users.” The 2022 Rally results also attributed approximately 97% of the system errors to the acquisition system camera and 3% of the system errors to the face recognition algorithm.
Previous Biometric Technology Rallies evaluated acquisition and matching systems’ overall performance and performance across demographic groups when trying to identify individual subjects (not groups of subjects) who were and were not wearing face masks. The 2021 Rally found that 26 out of 50 system combinations achieved >99% matching-TIR (i.e., the true identification rate, expressed as the percentage of submitted images that were correctly identified) without face masks. Three system combinations achieved >99% matching-TIR with face masks. “Some system combinations were able to meet the 95% Rally TIR threshold for all demographic groups.” Nonetheless, without masks, median system performance was lower for female than male volunteers, volunteers who self-identified as “Black or African-American,” and volunteers “with relatively darker skin tones.”
In general, these Biometric Technology Rally results also underscore that performance depends on the technology itself (i.e., the matching algorithm and the acquisition system) and on the inputs it receives (i.e., the quality of the captured images).
NIST’s and DHS S&T’s face recognition technology evaluations can help face recognition technology vendors and users predict how well a particular face recognition algorithm or system may perform in a real-world use case. Yet, due to the uniqueness of each deployment environment (among other factors), operational performance test results may differ from NIST and DHS S&T results. No independent third-party entity conducts and publishes results from operational performance testing across diverse deployment environments.
Several face recognition technology vendors state that they work with their customers to conduct pre- and/or post-deployment operational performance testing on their face recognition systems. Many vendors also say that they work with their customers to conduct ongoing face recognition system performance monitoring. However, vendors and users often do not make the results of this testing and monitoring public. The difficulties associated with standardizing operational testing and the lack of publicly available data on overall operational performance and operational performance across demographic groups can make building public trust in face recognition technologies challenging.
By many accounts, face recognition technologies often perform effectively in their operational environments. Face recognition technologies helped law enforcement officials generate investigative leads after the January 6, 2021, attack on the U.S. Capitol. Millions of Americans use their faces to unlock their smartphones every day. Yet, whether due to technical or nontechnical factors, not all face recognition technology deployments seem to achieve a similar level of operational performance success.
Media reports described challenges that individuals (especially those with darker skin tones) have had using CBP One, a U.S. Customs and Border Protection (CBP) app, to apply for asylum. The extent to which the app’s face recognition technology caused these issues remains unclear, given that the media reports identified multiple factors that may have been impeding users’ ability to effectively use CBP One. Nonetheless, these stories have contributed to public uncertainty about how well face recognition technologies perform in their operational environments.
The false arrests of three Black men, Nijeer Parks, Robert Williams, and Michael Oliver, in investigations involving face recognition technologies also raised concerns about operational performance. Not all the details about the role that face recognition technologies played in these investigations have become publicly available, but Detroit Police Chief James Craig has attributed Robert Williams’s false arrest to “sloppy, sloppy investigative work,” not a face recognition technology failure. The paucity of publicly available operational performance data from law enforcement agencies across the United States limits the public’s ability to place these false arrests in the context of broader performance trends. Without more data, we cannot determine whether these false arrests are anomalies.
The reports of false arrests and issues with CBP One raise questions about how people and face recognition technology systems work in tandem. Although researchers have started studying how, and how well, people and technologies work together, many questions about the prevalence and impact of biases and other issues remain unanswered. For example, there is no standardized third-party testing that measures the impact of automation bias (i.e., the propensity to overly rely on or trust technological system outputs) on face recognition technology operators. Some research suggests that face recognition technology results may bias human reviewers’ analysis of face similarity, at least when the human reviewers are not fully informed about the potential for face recognition algorithm errors. The same research also shows that informing the human reviewers about the potential for algorithm errors can help mitigate risks of bias.
Understanding how well face recognition technologies perform requires careful, nuanced analysis of highly technical information from several different sources. No single source of test data is dispositive across all possible use cases, and old testing data’s relevance quickly diminishes as technologies improve. As the next piece in this series will explain, face recognition technologies’ risks and benefits vary based on the technologies’ performance and several other factors. Consequently, understanding how accurate various face recognition technologies are and how well they perform under different conditions can help policymakers more effectively evaluate potential risks and benefits and tailor legislative requirements accordingly.
|Scope of Testing|Reference Source|Key Takeaways|
|Algorithm|National Institute of Standards and Technology (NIST)|• The most accurate identification and verification algorithms are highly accurate overall and across demographic groups. • Overall accuracy and accuracy across demographic groups vary widely across algorithms. • Image quality, face position, gallery size, and the number of years elapsed between the date on which a probe image and a gallery image are captured can all impact an algorithm’s accuracy. • In general, algorithms are becoming more accurate over time.|
|Full system (controlled scenarios)|U.S. Department of Homeland Security Science and Technology Directorate (DHS S&T)|• The most accurate face recognition systems are highly accurate overall and across demographic groups, but accuracy varies across systems. • The number of people in a group of data subjects, the demographic characteristics of the data subjects, the presence of face masks, and the quality of individual system components can all impact face recognition systems’ accuracy. • Processing speed varies across face recognition systems and data subject group sizes, but most systems’ processing speed is <2 seconds/person.|
|Full system (operational environment)|Individual face recognition technology vendors, users, and/or (in limited cases) government auditors|• Some vendors and users provide publicly available information about operational testing, but operational testing is not standardized, and operational test data often are not publicly available.|