
Biometric Datasets and Benchmarks

One-line summary: The major public datasets, evaluation protocols, competitions, and NIST programs that drive biometric research — plus the evolving challenges around dataset ethics, consent, and the rise of synthetic training data.

Modality: Cross-modal
Related concepts: Facial Recognition Systems, Iris Recognition, Fingerprint Recognition, Gait Analysis, Voice Biometrics, Bias and Fairness in Biometrics, Anti Spoofing Techniques, Biometric Image Quality
Last updated: 2026-04-04


Overview

Biometric research depends on standardized datasets and evaluation protocols. The field has a rich tradition of government-sponsored evaluations (NIST FRVT, IREX, PFT, SRE) alongside academic benchmarks and competitions (LFW, VoxCeleb, LivDet, ASVspoof).
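Nearly all of these evaluations reduce a set of genuine (same-identity) and impostor (different-identity) comparison scores to a true accept rate (TAR) at a fixed false accept rate (FAR). A minimal sketch of that primitive, using synthetic Gaussian scores as stand-ins for a real matcher's output:

```python
# Minimal sketch of the 1:1 verification metric most benchmarks report:
# TAR at a fixed FAR. The score distributions below are illustrative only.
import numpy as np

def tar_at_far(genuine: np.ndarray, impostor: np.ndarray, far: float) -> float:
    """True accept rate at the threshold that admits `far` of impostors."""
    # Threshold = impostor-score quantile letting through `far` of impostors.
    threshold = np.quantile(impostor, 1.0 - far)
    return float(np.mean(genuine >= threshold))

rng = np.random.default_rng(0)
genuine = rng.normal(0.7, 0.1, 10_000)    # same-identity comparison scores
impostor = rng.normal(0.3, 0.1, 100_000)  # different-identity comparison scores

for far in (1e-2, 1e-3, 1e-4):
    print(f"TAR @ FAR={far:g}: {tar_at_far(genuine, impostor, far):.4f}")
```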

Key tension: Large, diverse, real-world datasets are essential for training and evaluation, but collecting biometric data raises serious consent and privacy concerns. Several major datasets (MS-Celeb-1M, MegaFace) have been retracted or restricted. Synthetic data is emerging as a partial solution.

Technical Details

Face Datasets

Dataset                    | Images/Videos            | Subjects | Year | Status        | Notes
LFW                        | 13K images               | 5.7K     | 2007 | Active        | Saturated (>99.8% accuracy); still used for sanity checks
IJB-B                      | 76K images + video       | 1.8K     | 2017 | Active (NIST) | Mixed media; standard evaluation
IJB-C                      | 148K images + video      | 3.5K     | 2018 | Active (NIST) | Extension of IJB-B
IJB-S                      | 202 surveillance videos  | n/a      | 2018 | Active        | Low-quality surveillance benchmark
MegaFace                   | 4.7M images              | 672K     | 2016 | Retracted     | Consent issues; UW retracted the challenge
MS-Celeb-1M                | 10M images               | 100K     | 2016 | Retracted     | Microsoft retracted it over consent issues
MS1MV2 (cleaned MS-Celeb)  | 5.8M images              | 85K      | 2019 | Widely used   | Cleaned version; consent controversy persists
WebFace260M                | 260M images              | 4M       | 2021 | Active        | Largest public face dataset; noisy labels
WebFace42M                 | 42M images               | 2M       | 2021 | Active        | Cleaned subset; common training set
CASIA-WebFace              | 500K images              | 10.5K    | 2014 | Active        | Smaller; good for ablations
VGGFace2                   | 3.3M images              | 9.1K     | 2018 | Restricted    | Oxford restricted downloads
CelebA                     | 202K images              | 10.2K    | 2015 | Active        | 40 binary attribute annotations
RFW                        | 40K images               | 10K      | 2019 | Active        | Racially balanced evaluation set
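LFW's verification protocol, referenced above, is 6,000 labeled pairs evaluated over ten folds: a decision threshold is tuned on nine folds and accuracy is reported on the held-out fold. A hedged sketch of that protocol shape (the fold structure is LFW's; the pair scores below are invented):

```python
# LFW-style 10-fold pair verification: threshold chosen on 9 folds,
# accuracy measured on the 10th, averaged across folds.
import numpy as np

def lfw_style_accuracy(scores, labels, n_folds=10):
    scores, labels = np.asarray(scores), np.asarray(labels)
    folds = np.array_split(np.arange(len(scores)), n_folds)
    accs = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Sweep candidate thresholds on the nine training folds only.
        candidates = np.quantile(scores[train_idx], np.linspace(0, 1, 201))
        train_acc = [np.mean((scores[train_idx] >= t) == labels[train_idx])
                     for t in candidates]
        best_t = candidates[int(np.argmax(train_acc))]
        accs.append(np.mean((scores[test_idx] >= best_t) == labels[test_idx]))
    return float(np.mean(accs)), float(np.std(accs))

rng = np.random.default_rng(1)
labels = rng.integers(0, 2, 6000).astype(bool)   # same-person vs different-person
scores = np.where(labels, rng.normal(0.8, 0.15, 6000),
                  rng.normal(0.2, 0.15, 6000))   # synthetic pair similarities
mean_acc, std_acc = lfw_style_accuracy(scores, labels)
print(f"10-fold accuracy: {mean_acc:.4f} +/- {std_acc:.4f}")
```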

Iris Datasets

Dataset         | Size                                                  | Spectrum | Notes
ND-IRIS-0405    | 64K images / 356 subjects                             | NIR      | Academic standard
CASIA-Iris-V4   | 54K images (Thousand, Lamp, Distance, Twins subsets)  | NIR      | Multi-condition
UBIRIS v2       | 11K images / 261 subjects                             | Visible  | Unconstrained visible-light benchmark
ND-CrossSensor  | 116K images / 676 subjects                            | NIR      | Cross-sensor matching
LivDet-Iris     | Varies by edition                                     | NIR      | PAD competition
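Most of these benchmarks are matched with a Daugman-style fractional Hamming distance over binary iris codes, restricted to bits valid in both samples. A minimal sketch (random codes stand in for real Gabor-phase bits; the rotation-alignment search is omitted):

```python
# Masked fractional Hamming distance between two binary iris codes.
import numpy as np

def fractional_hamming(code_a, code_b, mask_a, mask_b):
    valid = mask_a & mask_b                  # bits usable in both captures
    disagree = (code_a ^ code_b) & valid     # XOR restricted to valid bits
    return disagree.sum() / max(int(valid.sum()), 1)

rng = np.random.default_rng(2)
code_a = rng.integers(0, 2, 2048, dtype=np.uint8)   # 2048-bit iris code
code_b = code_a.copy()
code_b[rng.random(2048) < 0.10] ^= 1                # ~10% bit noise between captures
mask = (rng.random(2048) > 0.2).astype(np.uint8)    # eyelid/eyelash occlusion mask
print(f"HD: {fractional_hamming(code_a, code_b, mask, mask):.3f}")  # ~0.10 -> match
```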

Fingerprint Datasets

Dataset              | Size                     | Type                    | Notes
FVC 2000–2006        | ~3.5K images per edition | Contact                 | Classic competition series
NIST SD 4/14/27/302  | 2K–88K images            | Rolled / plain / latent | Federal benchmark data
LivDet 2009–2023     | Varies by edition        | Live + spoof            | Liveness detection competition
PrintsGAN            | Unlimited                | Synthetic               | GAN-generated; privacy-safe training
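FVC results are traditionally summarized by the equal error rate (EER), the operating point where the false match rate equals the false non-match rate. A minimal sketch over synthetic score distributions:

```python
# EER: threshold where FMR (accepted impostors) crosses FNMR (rejected genuines).
import numpy as np

def equal_error_rate(genuine, impostor):
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    fnmr = np.array([np.mean(genuine < t) for t in thresholds])   # rejected genuines
    fmr = np.array([np.mean(impostor >= t) for t in thresholds])  # accepted impostors
    i = int(np.argmin(np.abs(fnmr - fmr)))
    return float((fnmr[i] + fmr[i]) / 2)

rng = np.random.default_rng(3)
genuine = rng.normal(0.75, 0.10, 5000)    # same-finger comparison scores
impostor = rng.normal(0.25, 0.10, 5000)   # different-finger comparison scores
print(f"EER: {equal_error_rate(genuine, impostor):.4f}")
```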

Voice Datasets

Dataset                  | Size              | Speakers | Notes
VoxCeleb1                | 153K utterances   | 1.2K     | Dev/test benchmark
VoxCeleb2                | 1.1M utterances   | 6.1K     | Training set
NIST SRE (2008–2021)     | Varies            | Varies   | Operational telephony evaluation
ASVspoof 2019/2021/2024  | ~600K utterances  | n/a      | Anti-spoofing challenge
CN-Celeb                 | 1.2M utterances   | 3K       | Chinese-language speakers
LibriSpeech              | 1K hours          | 2.5K     | Pre-training; not biometric-specific
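VoxCeleb-style evaluation scores a trial list of utterance pairs, typically by cosine similarity between speaker embeddings. A sketch with synthetic embeddings standing in for the output of a real speaker encoder (e.g., an x-vector-style model):

```python
# Cosine scoring of speaker-verification trials (enroll vs. test embedding).
import numpy as np

def cosine_score(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(4)
spk_a = rng.normal(size=192) * 3            # latent "voice" of speaker A
spk_b = rng.normal(size=192) * 3            # latent "voice" of speaker B
enroll_a = spk_a + rng.normal(size=192)     # enrollment utterance embedding
target = spk_a + rng.normal(size=192)       # test utterance, same speaker
nontarget = spk_b + rng.normal(size=192)    # test utterance, other speaker

print(f"target score:    {cosine_score(enroll_a, target):.3f}")     # high
print(f"nontarget score: {cosine_score(enroll_a, nontarget):.3f}")  # near zero
```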

Gait Datasets

Dataset  | Subjects | Setting           | Notes
CASIA-B  | 124      | Indoor, 11 views  | Classic; limited scale
OU-MVLP  | 10.3K    | Indoor, 14 views  | Largest lab-based set
GREW     | 26.3K    | Outdoor, wild     | In-the-wild benchmark
Gait3D   | 4K       | Outdoor, 3D       | 3D meshes recovered from multi-camera video
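Gait benchmarks such as OU-MVLP and GREW report rank-1 identification accuracy under a gallery/probe protocol: each probe is assigned the identity of its nearest gallery template. A minimal sketch with synthetic embeddings (a real system would get them from a gait encoder, e.g., an OpenGait model):

```python
# Rank-1 closed-set identification against a gallery of templates.
import numpy as np

def rank1_accuracy(gallery, gallery_ids, probe, probe_ids):
    # Euclidean distance between every probe and every gallery template.
    d = np.linalg.norm(probe[:, None, :] - gallery[None, :, :], axis=-1)
    predicted = gallery_ids[np.argmin(d, axis=1)]
    return float(np.mean(predicted == probe_ids))

rng = np.random.default_rng(5)
ids = np.arange(100)
gallery = rng.normal(size=(100, 256)) * 5        # one template per subject
probe = gallery + rng.normal(size=(100, 256))    # noisy probe of each subject
print(f"rank-1: {rank1_accuracy(gallery, ids, probe, ids):.3f}")
```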

Palm Datasets

Dataset             | Size                        | Type        | Notes
PolyU Palmprint v2  | 7.7K images / 386 palms     | Contact     | Standard benchmark
IITD Palmprint      | 2.6K images / 460 palms     | Contactless | Academic benchmark
Tongji Contactless  | 12K images / 600 palms      | Contactless | Large-scale
VERA Palm Vein      | 2.2K images / 110 subjects  | NIR vein    | Vein recognition

NIST Evaluation Programs

Program                                 | Modality             | Status   | Description
FRVT (Face Recognition Vendor Test)     | Face                 | Ongoing  | Largest independent face recognition benchmark; 500+ algorithm submissions; renamed FRTE/FATE in 2023
IREX (Iris Exchange)                    | Iris                 | Ongoing  | Iris recognition algorithm evaluation
PFT (Proprietary Fingerprint Template)  | Fingerprint          | Ongoing  | Fingerprint template interoperability and accuracy
MINEX                                   | Fingerprint          | Ongoing  | Minutiae interoperability exchange
SRE (Speaker Recognition Evaluation)    | Voice                | Biennial | Speaker verification challenge
ELFT-EFS                                | Fingerprint (latent) | Ongoing  | Latent fingerprint identification evaluation

Competitions and Challenges

Competition                 | Modality         | Frequency | Organizer
NIST FRVT                   | Face             | Ongoing   | NIST
FG/IJCB Face Anti-Spoofing  | Face PAD         | Annual    | IEEE
LivDet (Fingerprint)        | Fingerprint PAD  | Biennial  | University of Cagliari
LivDet (Iris)               | Iris PAD         | Biennial  | Various
ASVspoof                    | Voice PAD        | Biennial  | Interspeech community
VoxSRC                      | Voice            | Annual    | VoxCeleb team
FairFace Challenge          | Face fairness    | 2024      | IJCB
GaitBenchmark (OpenGait)    | Gait             | Ongoing   | OpenGait team

Challenges

  • Consent and ethics — MS-Celeb-1M and MegaFace collected faces without explicit consent; both were retracted. New datasets must navigate GDPR, BIPA, and institutional review boards.
  • Dataset bias — Most face datasets over-represent young, light-skinned individuals. See Bias and Fairness in Biometrics.
  • Label noise — Web-scraped datasets contain significant identity label errors (10–30% noise in uncleaned WebFace); a common cleaning heuristic is sketched after this list.
  • Benchmark saturation — LFW is effectively solved; even IJB-C is approaching ceiling for top methods.
  • Real-world representativeness — Lab benchmarks don't capture operational conditions (surveillance, mobile, extreme weather).
  • Synthetic data gap — Models trained on synthetic data still underperform real-data training by 2–5% on hard benchmarks.
  • Data availability — Many strong datasets are behind license agreements, institutional affiliation requirements, or have been retracted entirely.
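As referenced in the label-noise item above, one common cleaning heuristic (a sketch only, not the actual WebFace42M pipeline) drops samples whose embedding sits far from its identity centroid, on the assumption that such samples are often mislabeled:

```python
# Centroid-based identity-label cleaning (illustrative heuristic).
import numpy as np

def clean_by_centroid(emb: np.ndarray, labels: np.ndarray, min_cos: float = 0.5):
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    keep = np.zeros(len(labels), dtype=bool)
    for identity in np.unique(labels):
        idx = np.where(labels == identity)[0]
        centroid = emb[idx].mean(axis=0)
        centroid /= np.linalg.norm(centroid)
        keep[idx] = emb[idx] @ centroid >= min_cos   # keep only coherent samples
    return keep

# Synthetic demo: 20 identity clusters, then 20% of labels randomly corrupted.
rng = np.random.default_rng(6)
centers = rng.normal(size=(20, 128)) * 3
labels = rng.integers(0, 20, 1000)
emb = centers[labels] + rng.normal(size=(1000, 128))
noisy = rng.random(1000) < 0.2
labels[noisy] = rng.integers(0, 20, int(noisy.sum()))
keep = clean_by_centroid(emb, labels)
print(f"kept {int(keep.sum())}/1000; corrupted samples kept: {int((keep & noisy).sum())}")
```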

State of the Art (SOTA)

  • Leading evaluation platform: NIST FRVT for face, NIST IREX for iris, VoxCeleb + NIST SRE for voice.
  • Trend: shift from single-benchmark evaluation to multi-benchmark + disaggregated (demographic-aware) evaluation; see the sketch after this list.
  • Synthetic data: becoming viable for training (~95% of real-data accuracy) and mandatory for some new research due to privacy constraints.
  • Emerging protocols: benchmarks for cross-domain, cross-sensor, and cross-spectral scenarios.
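A minimal sketch of what disaggregated evaluation means in practice: one global threshold is set at the target FAR, then TAR is reported separately per demographic group. Groups and scores below are synthetic; real protocols (e.g., RFW, NIST's demographic reports) define groups and comparisons far more carefully:

```python
# Disaggregated TAR at a single global FAR threshold, reported per group.
import numpy as np

rng = np.random.default_rng(7)
groups = np.array(["A", "B"] * 5000)                 # group label per genuine trial
genuine = np.where(groups == "A",
                   rng.normal(0.75, 0.10, 10_000),   # group A: easier distribution
                   rng.normal(0.65, 0.12, 10_000))   # group B: harder distribution
impostor = rng.normal(0.30, 0.10, 100_000)

threshold = np.quantile(impostor, 1.0 - 1e-3)        # one global FAR = 0.1% threshold
for g in ("A", "B"):
    tar = np.mean(genuine[groups == g] >= threshold)
    print(f"group {g}: TAR@FAR=1e-3 = {tar:.4f}")
```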

Open Questions

  • Will synthetic biometric datasets fully replace real datasets for training in 5 years?
  • How to create large-scale, diverse, ethically collected biometric datasets?
  • Should benchmark results be disaggregated by demographic group as standard practice?
  • Can few-shot evaluation protocols replace large gallery benchmarks for rapid system assessment?

References

  • Maze, B. et al. (2018). IARPA Janus Benchmark-C (IJB-C). ICB.
  • Grother, P. et al. (ongoing). NIST FRVT. https://pages.nist.gov/frvt/
  • Nagrani, A. et al. (2020). VoxCeleb: Large-Scale Speaker Verification in the Wild. Computer Speech and Language.
  • Zhu, Z. et al. (2021). WebFace260M: A Benchmark for Million-Scale Deep Face Recognition. CVPR.

Backlinks: Facial Recognition Systems, Iris Recognition, Fingerprint Recognition, Gait Analysis, Voice Biometrics, Bias and Fairness in Biometrics, Anti Spoofing Techniques, Biometric Image Quality, Palm Recognition