Biometric Datasets and Benchmarks
One-line summary: The major public datasets, evaluation protocols, competitions, and NIST programs that drive biometric research — plus the evolving challenges around dataset ethics, consent, and the rise of synthetic training data.
Modality: Cross-modal
Related concepts: Facial Recognition Systems, Iris Recognition, Fingerprint Recognition, Gait Analysis, Voice Biometrics, Bias and Fairness in Biometrics, Anti Spoofing Techniques, Biometric Image Quality
Last updated: 2026-04-04
Overview
Biometric research depends on standardized datasets and evaluation protocols. The field has a rich tradition of government-sponsored evaluations (NIST FRVT, IREX, PFT, SRE) alongside academic benchmarks and competitions (LFW, VoxCeleb, LivDet, ASVspoof).
Key tension: Large, diverse, real-world datasets are essential for training and evaluation, but collecting biometric data raises serious consent and privacy concerns. Several major datasets (MS-Celeb-1M, MegaFace) have been retracted or restricted. Synthetic data is emerging as a partial solution.
Technical Details
Face Datasets
| Dataset | Images/Videos | Subjects | Year | Status | Notes |
|---|---|---|---|---|---|
| LFW | 13K images | 5.7K | 2007 | Active | Saturated (>99.8%); still used for sanity checks |
| IJB-B | 76K images + video | 1.8K | 2017 | Active (NIST) | Mixed media; standard evaluation |
| IJB-C | 148K images + video | 3.5K | 2018 | Active (NIST) | Extended IJB-B |
| IJB-S | 202 surveillance videos | — | 2018 | Active | Low-quality surveillance benchmark |
| MegaFace | 4.7M images / 672K ids | — | 2016 | Deprecated (challenge retracted) | Consent issues; UW retracted |
| MS-Celeb-1M | 10M images / 100K ids | — | 2016 | Retracted | Microsoft retracted due to consent |
| MS1MV2 (cleaned MS-Celeb) | 5.8M / 85K | — | 2019 | Widely used | Cleaned version; consent controversy persists |
| WebFace260M | 260M / 4M | — | 2021 | Active | Largest public; noisy |
| WebFace42M (cleaned) | 42M / 2M | — | 2021 | Active | Cleaned subset; common training set |
| CASIA-WebFace | 500K / 10.5K | — | 2014 | Active | Smaller; good for ablations |
| VGGFace2 | 3.3M / 9.1K | — | 2018 | Restricted | Oxford restricted download |
| CelebA | 202K images / 10.2K ids | — | 2015 | Active | Attribute annotations (40 binary attributes) |
| RFW | 40K / 10K (balanced) | — | 2019 | Active | Racially balanced evaluation |
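Most of the evaluation sets above are scored with a pair-verification protocol: embed both images, compute cosine similarity, and sweep a decision threshold. Below is a minimal sketch of that protocol; the random embeddings and the `verification_accuracy` helper are illustrative placeholders, not any dataset's official tooling (the official LFW protocol additionally selects the threshold with 10-fold cross-validation).

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two matrices of embeddings."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def verification_accuracy(emb_a, emb_b, same_identity, num_thresholds=400):
    """Pair-verification accuracy: sweep thresholds, keep the best one.

    emb_a, emb_b  -- (N, D) embeddings of the two images in each pair
    same_identity -- (N,) boolean labels, True for genuine pairs
    """
    scores = cosine_similarity(emb_a, emb_b)
    thresholds = np.linspace(scores.min(), scores.max(), num_thresholds)
    accuracies = [np.mean((scores >= t) == same_identity) for t in thresholds]
    best = int(np.argmax(accuracies))
    return accuracies[best], thresholds[best]

# Toy usage: random vectors stand in for a real face model's embeddings.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(600, 512))
emb_b = emb_a + rng.normal(scale=0.1, size=(600, 512))  # genuine pairs: small perturbation
labels = np.ones(600, dtype=bool)
emb_b[300:] = rng.normal(size=(300, 512))                # impostor pairs: unrelated vectors
labels[300:] = False
accuracy, threshold = verification_accuracy(emb_a, emb_b, labels)
print(f"best accuracy {accuracy:.3f} at threshold {threshold:.3f}")
```

The same genuine/impostor score lists can also be summarized as ROC curves or TAR at a fixed FAR, which is how IJB-B/C results are more commonly quoted.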
Iris Datasets
| Dataset | Size | Spectrum | Notes |
|---|---|---|---|
| ND-IRIS-0405 | 64K / 356 subjects | NIR | Academic standard |
| CASIA-Iris-V4 | 54K (Thousand + Lamp + Distance + Twins) | NIR | Multi-condition |
| UBIRIS v2 | 11K / 261 | Visible | Unconstrained VIS benchmark |
| ND-CrossSensor | 116K / 676 | NIR | Cross-sensor matching |
| LivDet-Iris | Varies | NIR | PAD competition |
Fingerprint Datasets
| Dataset | Size | Type | Notes |
|---|---|---|---|
| FVC 2000–2006 | ~3.5K each | Contact | Classic competition series |
| NIST SD 4/14/27/302 | 2K–88K | Rolled/plain/latent | Federal benchmark data |
| LivDet 2009–2023 | Varies | Live + spoof | Liveness detection competition |
| PrintsGAN | Unlimited | Synthetic | GAN-generated; privacy-safe training |
Voice Datasets
| Dataset | Size | Speakers | Notes |
|---|---|---|---|
| VoxCeleb1 | 153K utterances | 1.2K | Dev/test benchmark |
| VoxCeleb2 | 1.1M utterances | 6.1K | Training set |
| NIST SRE (2008–2021) | Varies | Varies | Operational telephony evaluation |
| ASVspoof 2019/2021/2024 | ~600K utterances | — | Anti-spoofing challenge |
| CN-Celeb | 1.2M utterances | 3K | Chinese speakers |
| LibriSpeech | 1K hours | 2.5K | Pre-training; not biometric-specific |
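Speaker-verification benchmarks such as the VoxCeleb test lists, NIST SRE, and VoxSRC are usually summarized by the equal error rate (EER), the operating point where false rejection and false acceptance rates meet. A minimal sketch follows, assuming genuine and impostor trial scores have already been produced by some verification system; the score distributions below are synthetic.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """EER from raw trial scores (higher score = more likely the same speaker).

    Sweeps every observed score as a threshold and returns the point where the
    false rejection rate (FRR) and false acceptance rate (FAR) are closest.
    """
    genuine = np.sort(np.asarray(genuine_scores))
    impostor = np.sort(np.asarray(impostor_scores))
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    # FRR: fraction of genuine trials scoring below the threshold (rejected).
    frr = np.searchsorted(genuine, thresholds, side="left") / len(genuine)
    # FAR: fraction of impostor trials scoring at or above the threshold (accepted).
    far = 1.0 - np.searchsorted(impostor, thresholds, side="left") / len(impostor)
    idx = int(np.argmin(np.abs(frr - far)))
    return (frr[idx] + far[idx]) / 2.0, float(thresholds[idx])

# Toy usage: well-separated synthetic score distributions give a low EER.
rng = np.random.default_rng(1)
eer, threshold = equal_error_rate(rng.normal(2.0, 1.0, 5_000), rng.normal(-2.0, 1.0, 50_000))
print(f"EER ~ {eer:.4f} at threshold {threshold:.3f}")
```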
Gait Datasets
| Dataset | Subjects | Setting | Notes |
|---|---|---|---|
| CASIA-B | 124 | Indoor, 11 views | Classic; limited scale |
| OU-MVLP | 10.3K | Indoor, 14 views | Largest lab-based |
| GREW | 26.3K | Outdoor wild | In-the-wild benchmark |
| Gait3D | 4K | In-the-wild, 3D | 3D body meshes from multi-camera video |
Palm Datasets
| Dataset | Size | Type | Notes |
|---|---|---|---|
| PolyU Palmprint v2 | 7.7K / 386 | Contact | Standard benchmark |
| IITD Palmprint | 2.6K / 460 | Contactless | Academic |
| Tongji Contactless | 12K / 600 | Contactless | Large-scale |
| VERA Palm Vein | 2.2K / 110 | NIR vein | Vein recognition |
NIST Evaluation Programs
| Program | Modality | Status | Description |
|---|---|---|---|
| FRVT (Face Recognition Vendor Test) | Face | Ongoing | Largest independent face recognition benchmark; 500+ algorithm submissions |
| IREX (Iris Exchange) | Iris | Ongoing | Iris recognition algorithm evaluation |
| PFT (Proprietary Fingerprint Template) | Fingerprint | Ongoing | One-to-one matching accuracy with proprietary fingerprint templates |
| MINEX | Fingerprint | Ongoing | Interoperability of standard minutiae templates |
| SRE (Speaker Recognition Evaluation) | Voice | Biennial | Speaker verification challenge |
| ELFT-EFS | Fingerprint (latent) | Ongoing | Latent fingerprint identification evaluation |
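NIST evaluations such as FRVT and IREX report accuracy mainly as the false non-match rate (FNMR) at a fixed false match rate (FMR), e.g. FNMR at FMR = 1e-5, rather than a single accuracy figure. Below is a minimal sketch of reading such an operating point off genuine/impostor score lists; the quantile-based thresholding and synthetic scores are illustrative assumptions, not NIST tooling.

```python
import numpy as np

def fnmr_at_fmr(genuine_scores, impostor_scores, target_fmr=1e-3):
    """FNMR at a fixed FMR operating point (higher score = stronger match).

    The threshold is set at the (1 - target_fmr) quantile of impostor scores, so
    roughly target_fmr of impostor comparisons score at or above it; FNMR is then
    the fraction of genuine comparisons that fall below that threshold.
    """
    impostor = np.asarray(impostor_scores)
    genuine = np.asarray(genuine_scores)
    threshold = float(np.quantile(impostor, 1.0 - target_fmr))
    fnmr = float(np.mean(genuine < threshold))
    return fnmr, threshold

# Toy usage with synthetic genuine/impostor score distributions.
rng = np.random.default_rng(2)
fnmr, threshold = fnmr_at_fmr(rng.normal(3.0, 1.0, 10_000),
                              rng.normal(0.0, 1.0, 200_000),
                              target_fmr=1e-3)
print(f"FNMR @ FMR=1e-3 ~ {fnmr:.4f} (threshold {threshold:.3f})")
```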
Competitions and Challenges
| Competition | Modality | Frequency | Organizer |
|---|---|---|---|
| NIST FRVT | Face | Ongoing | NIST |
| FG/IJCB Face Anti-Spoofing | Face PAD | Annual | IEEE |
| LivDet (Fingerprint) | Fingerprint PAD | Biennial | University of Cagliari |
| LivDet (Iris) | Iris PAD | Biennial | Various |
| ASVspoof | Voice PAD | Biennial | Interspeech community |
| VoxSRC | Voice | Annual | VoxCeleb team |
| FairFace Challenge | Face fairness | 2024 | IJCB |
| GaitBenchmark (OpenGait) | Gait | Ongoing | OpenGait team |
Challenges
- Consent and ethics — MS-Celeb-1M and MegaFace collected faces without explicit consent; both were retracted. New datasets must navigate GDPR, BIPA, and institutional review boards.
- Dataset bias — Most face datasets over-represent young, light-skinned individuals. See Bias and Fairness in Biometrics.
- Label noise — Web-scraped datasets contain significant identity label errors (10–30% noise in uncleaned WebFace); a minimal centroid-filtering sketch follows this list.
- Benchmark saturation — LFW is effectively solved; even IJB-C is approaching ceiling for top methods.
- Real-world representativeness — Lab benchmarks don't capture operational conditions (surveillance, mobile, extreme weather).
- Synthetic data gap — Models trained on synthetic data still underperform real-data training by 2–5% on hard benchmarks.
- Data availability — Many strong datasets are behind license agreements, institutional affiliation requirements, or have been retracted entirely.
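As referenced in the label-noise item above, a common first-pass cleaning heuristic for web-scraped identity datasets is to embed every image, compute a per-identity centroid, and flag samples with low similarity to their claimed identity's centroid. The sketch below illustrates the idea; the threshold value, array shapes, and `flag_label_outliers` helper are illustrative assumptions, and cleaned training sets such as MS1MV2 typically combine automatic filtering with additional review.

```python
import numpy as np

def flag_label_outliers(embeddings, identity_ids, threshold=0.45):
    """Flag probable label-noise samples: for each claimed identity, compute the
    centroid of its L2-normalized embeddings and flag samples whose cosine
    similarity to that centroid falls below `threshold`."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    identity_ids = np.asarray(identity_ids)
    flagged = np.zeros(len(identity_ids), dtype=bool)
    for identity in np.unique(identity_ids):
        mask = identity_ids == identity
        centroid = embeddings[mask].mean(axis=0)
        centroid /= np.linalg.norm(centroid)
        flagged[mask] = embeddings[mask] @ centroid < threshold
    return flagged

# Toy usage: a handful of "wrong identity" vectors get flagged as outliers.
rng = np.random.default_rng(4)
clean = rng.normal(loc=1.0, size=(95, 128))
noisy = rng.normal(loc=-1.0, size=(5, 128))
flags = flag_label_outliers(np.vstack([clean, noisy]), np.zeros(100, dtype=int))
print(f"{flags.sum()} of 100 samples flagged as probable label noise")
```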
State of the Art (SOTA)
- Leading evaluation platform: NIST FRVT for face, NIST IREX for iris, VoxCeleb + NIST SRE for voice.
- Trend: shift from single-benchmark evaluation to multi-benchmark + disaggregated (demographic-aware) evaluation; see the per-group sketch after this list.
- Synthetic data: becoming viable for training (~95% of real-data accuracy) and mandatory for some new research due to privacy constraints.
- Benchmarks emerging for cross-domain, cross-sensor, and cross-spectral scenarios.
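A minimal sketch of the disaggregated evaluation mentioned in the trend item above: the decision threshold is fixed once on pooled impostor scores, and FNMR is then reported separately per demographic group. The group labels, score distributions, and `per_group_fnmr` helper are hypothetical inputs for illustration, not any evaluation's official protocol.

```python
import numpy as np

def per_group_fnmr(genuine_scores, genuine_groups, impostor_scores, target_fmr=1e-3):
    """Disaggregated FNMR: one shared threshold (set on pooled impostor scores),
    with FNMR reported separately for each demographic group of genuine trials."""
    threshold = float(np.quantile(np.asarray(impostor_scores), 1.0 - target_fmr))
    genuine_scores = np.asarray(genuine_scores)
    genuine_groups = np.asarray(genuine_groups)
    report = {}
    for group in np.unique(genuine_groups):
        mask = genuine_groups == group
        report[str(group)] = float(np.mean(genuine_scores[mask] < threshold))
    return report, threshold

# Toy usage: group "B" gets a slightly weaker genuine-score distribution, so its
# FNMR at the shared threshold comes out higher than group "A"'s.
rng = np.random.default_rng(3)
genuine = np.concatenate([rng.normal(3.0, 1.0, 5_000), rng.normal(2.5, 1.0, 5_000)])
groups = np.array(["A"] * 5_000 + ["B"] * 5_000)
impostor = rng.normal(0.0, 1.0, 100_000)
report, threshold = per_group_fnmr(genuine, groups, impostor, target_fmr=1e-3)
print({g: round(v, 4) for g, v in report.items()}, "threshold:", round(threshold, 3))
```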
Open Questions
- Will synthetic biometric datasets fully replace real datasets for training in 5 years?
- How to create large-scale, diverse, ethically collected biometric datasets?
- Should benchmark results be disaggregated by demographic group as standard practice?
- Can few-shot evaluation protocols replace large gallery benchmarks for rapid system assessment?
References
- Maze, B. et al. (2018). IARPA Janus Benchmark — C (IJB-C). ICB.
- Grother, P. et al. (ongoing). NIST FRVT. https://pages.nist.gov/frvt/
- Nagrani, A. et al. (2020). VoxCeleb: Large-Scale Speaker Verification in the Wild. Computer Speech and Language.
- Zhu, Z. et al. (2021). WebFace260M: A Benchmark for Million-Scale Deep Face Recognition. CVPR.
Backlinks: Facial Recognition Systems, Iris Recognition, Fingerprint Recognition, Gait Analysis, Voice Biometrics, Bias and Fairness in Biometrics, Anti Spoofing Techniques, Biometric Image Quality, Palm Recognition