
Facial Recognition Systems

One-line summary: End-to-end pipelines that detect, align, embed, and match human faces for verification (1:1) and identification (1:N).

Modality: Face
Related concepts: Deep Learning Architectures for Biometrics, Transformer Architectures for Biometrics, Anti Spoofing Techniques, Bias and Fairness in Biometrics, Biometric Image Quality, Multimodal Biometrics
Last updated: 2026-04-04


Overview

Modern facial recognition operates as a four-stage pipeline:

  1. Detection — Localize faces in an image (RetinaFace, SCRFD, YOLOv8-Face).
  2. Alignment — Warp the detected crop to a canonical pose using 5-point or 68-point landmarks.
  3. Embedding — Map the aligned face to a compact feature vector (typically 128–512 dims) using a deep network trained with a margin-based loss.
  4. Matching — Compare embeddings via cosine similarity or L2 distance against a gallery for verification or identification.
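
The alignment step above can be sketched as a least-squares similarity transform (the Umeyama method) that maps detected 5-point landmarks onto a fixed template. This is a minimal sketch, not the method of any particular library: the template coordinates, the 112x112 crop size, and the names `estimate_similarity` and `apply_transform` are illustrative assumptions.

```python
import numpy as np

# Hypothetical 5-point template for a 112x112 crop (illustrative values only).
TEMPLATE_112 = np.array([
    [38.0, 52.0],   # left eye
    [74.0, 52.0],   # right eye
    [56.0, 72.0],   # nose tip
    [42.0, 92.0],   # left mouth corner
    [70.0, 92.0],   # right mouth corner
])


def estimate_similarity(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Least-squares similarity transform (Umeyama) mapping src onto dst.

    Returns a 2x3 matrix M with dst ~= src @ M[:, :2].T + M[:, 2].
    """
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - src_mean, dst - dst_mean
    cov = dst_c.T @ src_c / len(src)           # 2x2 cross-covariance
    u, s, vt = np.linalg.svd(cov)
    d = np.ones(2)
    if np.linalg.det(u) * np.linalg.det(vt) < 0:
        d[-1] = -1.0                           # forbid reflections
    rot = u @ np.diag(d) @ vt
    var_src = (src_c ** 2).sum() / len(src)
    scale = (s * d).sum() / var_src
    trans = dst_mean - scale * rot @ src_mean
    return np.hstack([scale * rot, trans[:, None]])


def apply_transform(m: np.ndarray, pts: np.ndarray) -> np.ndarray:
    """Apply a 2x3 similarity/affine matrix to an (n, 2) array of points."""
    return pts @ m[:, :2].T + m[:, 2]
```

The 2x3 matrix has the layout accepted by affine warping routines such as OpenCV's `cv2.warpAffine(image, M, (112, 112))`, which would produce the aligned crop from the source image.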

The field is dominated by metric-learning losses that enforce inter-class separation and intra-class compactness in the embedding space.
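
Because these losses pull same-identity embeddings together and push different identities apart, matching reduces to a similarity comparison. The sketch below illustrates 1:1 verification and 1:N identification with cosine similarity. The function names and the 0.35 threshold are hypothetical; a real threshold would be calibrated on held-out genuine and impostor pairs.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def verify(probe: np.ndarray, reference: np.ndarray, threshold: float = 0.35):
    """1:1 verification: accept if cosine similarity reaches the threshold.

    The default threshold is a placeholder, not a recommended value.
    """
    score = float(l2_normalize(probe) @ l2_normalize(reference))
    return score >= threshold, score


def identify(probe: np.ndarray, gallery: np.ndarray, top_k: int = 5):
    """1:N identification: return the top_k (gallery index, score) pairs."""
    scores = l2_normalize(gallery) @ l2_normalize(probe)
    order = np.argsort(-scores)[: min(top_k, len(scores))]
    return [(int(i), float(scores[i])) for i in order]
```

For unit-length embeddings the squared L2 distance equals 2 - 2·cos, so cosine similarity and L2 distance produce the same ranking; the choice mostly affects how thresholds are expressed.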

Technical Details

Loss Functions (Evolution)

| Loss | Year | Key Idea |
|---|---|---|
| Contrastive Loss | 2006 | Pair-based distance learning |
| Triplet Loss (FaceNet) | 2015 | Anchor-positive-negative margin |
| Center Loss | 2016 | Penalizes distance to class center |
| SphereFace (A-Softmax) | 2017 | Angular margin in weight space |
| CosFace (LMCL) | 2018 | Additive cosine margin |
| ArcFace | 2019 | Additive angular margin; current default |
| AdaFace | 2022 | Quality-adaptive margin |
| UniTSFace | 2024 | Unified sample-to-sample loss with hard-pair mining |
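
To make the difference between the additive cosine margin (CosFace) and the additive angular margin (ArcFace) concrete, the NumPy sketch below computes the modified logits that feed a standard softmax cross-entropy. It is a simplified illustration, not a reference implementation: the function names are hypothetical, the defaults `s=64.0` and `m=0.5` follow the values reported in the ArcFace paper, and the handling of angles where θ + m exceeds π (which training code usually special-cases) is omitted.

```python
import numpy as np


def margin_logits(embeddings, weights, labels, kind="arcface", s=64.0, m=0.5):
    """Scaled cosine logits with a margin applied to the target class only.

    embeddings: (n, d) features, weights: (classes, d), labels: (n,) ints.
    """
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = np.clip(x @ w.T, -1.0, 1.0)          # cos(theta) for every class
    rows = np.arange(len(labels))
    target = cos[rows, labels]
    if kind == "arcface":
        target = np.cos(np.arccos(target) + m)  # cos(theta + m)
    elif kind == "cosface":
        target = target - m                     # cos(theta) - m
    else:
        raise ValueError(f"unknown margin type: {kind}")
    out = cos.copy()
    out[rows, labels] = target
    return s * out


def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy computed with a numerically stable log-softmax."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(labels)), labels].mean())
```

In both variants the margin lowers only the target-class logit, so the network has to produce a larger angular gap to other classes before the loss becomes small.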

Backbone Architectures

  • ResNet-100/200 — Workhorse backbone; ArcFace + R100 remains a strong baseline.
  • MobileFaceNet — Lightweight backbone for on-device inference (~1M params).
  • EfficientNet-B4 — Balanced accuracy-efficiency trade-off.
  • ViT-based — See Transformer Architectures for Biometrics; ViT-B/16 + ArcFace now competitive with CNNs on IJB-C.
  • EdgeFace (2024) — Hybrid CNN-ViT optimized for mobile via NAS.

Inference Pipeline Considerations

  • Template aggregation — When multiple frames are available (video, multi-crop), embeddings are pooled (mean, quality-weighted, attention-based).
  • Score normalization — Z-norm, T-norm, and S-norm calibrate raw cosine scores for large-scale identification.
  • Quantization — INT8 / FP16 embeddings reduce storage for billion-scale galleries with minimal accuracy loss.
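
The three considerations above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the quality weights are assumed to come from some external quality estimator, the cohort is assumed to be a set of impostor scores, and the INT8 scheme shown (symmetric, one scale per vector) is just one of several possible quantization choices.

```python
import numpy as np


def aggregate_template(embeddings, quality=None):
    """Pool per-frame embeddings (n, d) into one unit-length template.

    With quality=None this is a plain mean; otherwise a quality-weighted mean.
    """
    if quality is None:
        quality = np.ones(len(embeddings))
    w = np.asarray(quality, dtype=np.float64)
    pooled = (w[:, None] * embeddings).sum(axis=0) / w.sum()
    return pooled / np.linalg.norm(pooled)


def _standardize(score, cohort_scores):
    cohort_scores = np.asarray(cohort_scores, dtype=np.float64)
    return (score - cohort_scores.mean()) / cohort_scores.std()


def z_norm(score, enroll_vs_cohort):
    """Z-norm: standardize by the enrolled template's scores against a cohort."""
    return _standardize(score, enroll_vs_cohort)


def t_norm(score, probe_vs_cohort):
    """T-norm: standardize by the probe's scores against a cohort."""
    return _standardize(score, probe_vs_cohort)


def s_norm(score, enroll_vs_cohort, probe_vs_cohort):
    """S-norm: average of the Z-norm and T-norm scores."""
    return 0.5 * (z_norm(score, enroll_vs_cohort) + t_norm(score, probe_vs_cohort))


def quantize_int8(embedding):
    """Symmetric INT8 quantization with a single scale per vector."""
    scale = max(float(np.abs(embedding).max()), 1e-12) / 127.0
    q = np.clip(np.round(embedding / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize_int8(q, scale):
    """Recover an approximate float32 embedding from INT8 codes."""
    return q.astype(np.float32) * scale
```

A stored template then costs one byte per dimension plus one scale value, and the cosine similarity between the original and the dequantized vector stays close to 1 for typical embeddings.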

Key Models & Papers

| Model / Paper | Year | Contribution |
|---|---|---|
| DeepFace (Facebook) | 2014 | First deep-learning face verification system, 97.35% on LFW |
| FaceNet (Google) | 2015 | Triplet loss + 128-d embeddings; 99.63% LFW |
| ArcFace (Deng et al.) | 2019 | Additive angular margin; SOTA across IJB-B/C, MegaFace |
| AdaFace (Kim et al.) | 2022 | Quality-adaptive margin; strong on low-quality benchmarks (IJB-S, TinyFace) |
| TransFace | 2023 | Pure ViT backbone with patch-level attention for face recognition |
| TopoFR | 2024 | Topology-preserving face recognition with persistent homology regularization |
| UniTSFace | 2024 | Unified triplet-softmax with curriculum hard-pair mining |

Datasets

| Dataset | Size | Year | Notes |
|---|---|---|---|
| LFW | 13K images / 5.7K identities | 2007 | Saturated benchmark (>99.8% for SOTA) |
| MS1MV2 (MS-Celeb-1M cleaned) | 5.8M / 85K ids | 2019 | Standard training set after noise cleaning |
| WebFace260M | 260M / 4M ids | 2022 | Largest public face dataset; noisy |
| CASIA-WebFace | 500K / 10.5K ids | 2014 | Smaller training set, useful for ablations |
| IJB-B / IJB-C | IJB-B: ~77K images and frames / 1.8K ids; IJB-C: ~149K / 3.5K ids | 2017/2018 | Mixed media (still + video); standard eval |
| IJB-S | 202 videos / surveillance | 2018 | Low-quality surveillance benchmark |
| TinyFace | 169K / 5.1K ids | 2017 | Low-resolution faces in the wild |
| BUPT-BalancedFace | 1.3M / 28K ids | 2020 | Racially balanced training set |

Challenges

  • Low-quality / unconstrained faces — Pose, illumination, occlusion, low resolution, and motion blur degrade embedding quality. Biometric Image Quality plays a critical role.
  • Demographic bias — Accuracy varies across race, gender, and age. See Bias and Fairness in Biometrics.
  • Aging and appearance change — Longitudinal stability of embeddings over years.
  • Billion-scale search — Efficient ANN indexing (FAISS, ScaNN) needed for national ID or web-scale galleries.
  • Presentation attacks — Print, replay, 3D mask, and deepfake attacks. See Anti Spoofing Techniques.
  • Privacy regulation — GDPR, Illinois BIPA, EU AI Act restrict collection and use. See Privacy Preserving Biometrics.

State of the Art (SOTA)

As of early 2026:

  • IJB-C TAR@FAR=1e-4: ~97.5% (ArcFace-R200 + quality-weighted fusion), ~97.8% with ViT-L ensembles.
  • LFW: 99.87%+ (essentially saturated).
  • IJB-S (surveillance): AdaFace and quality-aware methods lead (~72% Rank-1 at far end).
  • Real-world error rates: NIST FRVT ongoing evaluations show top commercial systems at FNMR < 0.2% @ FMR=1e-6 for frontal images.
  • On-device: Sub-5ms inference on flagship mobile SoCs with MobileFaceNet/EdgeFace.
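
For readers less familiar with the TAR@FAR and FNMR@FMR operating points quoted above, the sketch below shows one way to compute TAR at a fixed FAR from arrays of genuine and impostor comparison scores. The function name and the thresholding convention (accept when the score is strictly greater than the threshold) are assumptions made for illustration; evaluation protocols differ in how they interpolate and break ties.

```python
import numpy as np


def tar_at_far(genuine, impostor, far=1e-4):
    """Return (TAR, threshold) at the strictest threshold allowing <= far.

    genuine:  similarity scores of same-identity pairs.
    impostor: similarity scores of different-identity pairs.
    A comparison is accepted when its score is strictly above the threshold.
    """
    genuine = np.asarray(genuine, dtype=np.float64)
    impostor_desc = np.sort(np.asarray(impostor, dtype=np.float64))[::-1]
    # Number of impostor comparisons that may be (falsely) accepted.
    allowed = int(np.floor(far * len(impostor_desc) + 1e-9))
    allowed = min(allowed, len(impostor_desc) - 1)
    threshold = impostor_desc[allowed]
    tar = float(np.mean(genuine > threshold))
    return tar, float(threshold)
```

At the same threshold, FNMR is 1 - TAR, and FMR plays the role of FAR when acquisition failures are ignored. Measuring an operating point such as FAR=1e-4 requires at least 10,000 impostor comparisons, and considerably more for a stable estimate.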

Open Questions

  • Can self-supervised or foundation-model pre-training (DINOv2, MAE) replace supervised ArcFace training for face recognition?
  • How to build truly fair systems that equalize error rates across all demographic groups without sacrificing overall accuracy?
  • What is the theoretical limit of face recognition under extreme pose/illumination variation?
  • Will synthetic training data (generated by diffusion models) fully replace privacy-sensitive real face datasets?

References

  • Deng, J., Guo, J., Xue, N., & Zafeiriou, S. (2019). ArcFace: Additive Angular Margin Loss for Deep Face Recognition. CVPR.
  • Kim, M., Jain, A. K., & Liu, X. (2022). AdaFace: Quality Adaptive Margin for Face Recognition. CVPR.
  • Dan, J. et al. (2024). TopoFR: A Closer Look at Topology Alignment on Face Recognition. NeurIPS.
  • Grother, P. et al. (ongoing). NIST FRVT. https://pages.nist.gov/frvt/

Backlinks: Deep Learning Architectures for Biometrics, Transformer Architectures for Biometrics, Anti Spoofing Techniques, Bias and Fairness in Biometrics, Multimodal Biometrics, Biometric Image Quality, Biometric Datasets and Benchmarks, Real World Biometric Deployments