Deep Learning Architectures for Biometrics

One-line summary: The CNN, ViT, GCN, and hybrid architectures — plus the loss functions, training strategies, and pooling mechanisms — that power modern biometric recognition across all modalities.

Modality: Cross-modal
Related concepts: Transformer Architectures for Biometrics, Facial Recognition Systems, Iris Recognition, Fingerprint Recognition, Voice Biometrics, Gait Analysis, Biometric Image Quality, Biometric Datasets and Benchmarks
Last updated: 2026-04-04


Overview

Deep learning has transformed biometrics from hand-crafted feature engineering (Gabor filters, LBP, SIFT) to end-to-end learned representations. The field is architecturally diverse: face and iris use CNNs and ViTs, voice uses TDNNs and Transformers, gait uses GCNs and set-based networks, and fingerprint uses U-Nets and ResNets.

The unifying theme is metric learning: learning an embedding space where samples from the same identity are close and samples from different identities are far apart.
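As a minimal illustration of this idea (with synthetic embeddings, not a trained model), cosine similarity on L2-normalized vectors is the standard comparison: two samples of the same identity should score higher than a genuine/impostor pair.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Project an embedding onto the unit hypersphere."""
    return x / (np.linalg.norm(x) + eps)

def cosine_similarity(a, b):
    """Similarity of two embeddings after L2 normalization (1 = identical direction)."""
    return float(np.dot(l2_normalize(a), l2_normalize(b)))

# Toy 512-d embeddings: an anchor, a same-identity sample, and an impostor.
rng = np.random.default_rng(0)
anchor = rng.normal(size=512)
genuine = anchor + 0.1 * rng.normal(size=512)   # small intra-class perturbation
impostor = rng.normal(size=512)                 # independent identity

# A well-trained embedding space satisfies exactly this ordering.
assert cosine_similarity(anchor, genuine) > cosine_similarity(anchor, impostor)
```

In deployment, a verification decision is simply a threshold on this score.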

Technical Details

Backbone Architectures

Convolutional Neural Networks (CNNs)

Architecture | Year | Biometric Applications
VGGNet-16 | 2014 | Early face (VGGFace), iris
ResNet-50/100/200 | 2016 | Face (ArcFace backbone), fingerprint, palmprint
Inception / GoogLeNet | 2015 | FaceNet's original backbone
MobileNetV2/V3 | 2018/2019 | On-device face (MobileFaceNet), finger vein
EfficientNet-B0–B7 | 2019 | Face, iris, PAD; balanced accuracy-efficiency
SE-ResNet (Squeeze-and-Excitation) | 2018 | Channel attention; used in ECAPA-TDNN (voice)
ConvNeXt | 2022 | Modernized CNN; competitive with ViTs for gait (DeepGaitV2)

Vision Transformers (ViTs)

See Transformer Architectures for Biometrics for full details.

  • ViT-B/16 and ViT-L/16 used for face recognition (TransFace, 2023).
  • Swin Transformer for face anti-spoofing and gait.
  • DeiT distillation for efficient biometric ViTs.

Graph Convolutional Networks (GCNs)

  • GCN on skeleton graphs for gait recognition (GaitGraph, GaitGraph2).
  • GCN for fingerprint minutiae graph matching.
  • AASIST uses graph attention for voice anti-spoofing.
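The core operation these models share is graph convolution over a joint or minutiae graph. A minimal numpy sketch of one symmetric-normalized GCN layer (the Kipf-Welling formulation, applied here to a toy 3-joint skeleton; the graph and weights are illustrative, not from any cited model):

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph convolution: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(d ** -0.5)
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt       # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)         # aggregate neighbors, ReLU

# Toy 3-joint "skeleton" chain: hip - knee - ankle.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
rng = np.random.default_rng(1)
H = rng.normal(size=(3, 4))    # 4-d feature per joint (e.g., 2D position + velocity)
W = rng.normal(size=(4, 8))    # learnable projection to 8 channels
out = gcn_layer(H, A, W)
print(out.shape)               # (3, 8)
```

Stacking such layers lets each joint's feature absorb context from progressively larger neighborhoods of the skeleton.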

Time-Delay Neural Networks (TDNNs)

  • Dominant in speaker verification: x-vector, ECAPA-TDNN.
  • Frame-level feature extraction + temporal context via dilated 1D convolutions.
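The dilated-convolution mechanism can be sketched directly: each output frame is a learned combination of input frames spaced `dilation` steps apart, which widens temporal context without widening the kernel. This is an illustrative numpy version (shapes and values are made up), not the ECAPA-TDNN implementation:

```python
import numpy as np

def tdnn_layer(x, W, dilation=1):
    """Dilated 1D convolution over time (valid padding).
    x: (T, F) frame features; W: (K, F, O) kernel; output: (T - (K-1)*d, O)."""
    K, F, O = W.shape
    T = x.shape[0]
    span = (K - 1) * dilation
    out = np.zeros((T - span, O))
    for k in range(K):                       # sum over kernel taps at offset k*d
        out += x[k * dilation : k * dilation + T - span] @ W[k]
    return out

rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 40))          # 100 frames of 40-d filterbanks
W = rng.normal(size=(3, 40, 64)) * 0.1       # kernel size 3, 64 output channels
ctx = tdnn_layer(frames, W, dilation=2)      # each output sees frames t, t+2, t+4
print(ctx.shape)                             # (96, 64)
```

Stacking layers with increasing dilation (1, 2, 3, ...) gives each frame-level feature a receptive field of tens of frames, which the pooling layer then aggregates into an utterance-level embedding.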

Hybrid / NAS-Derived

  • EdgeFace (2024): NAS-optimized hybrid CNN+ViT for mobile face recognition.
  • EfficientFormer: Efficient ViT variant used in on-device biometrics.

Loss Functions for Biometric Embedding Learning

The loss function is arguably the most critical design choice in biometric deep learning.

Loss Family | Examples | Key Property
Softmax variants | Softmax, L-Softmax, A-Softmax (SphereFace), CosFace, ArcFace, AdaFace | Classification-based; angular/cosine margins enforce separation
Metric losses | Contrastive, Triplet, N-pair, Multi-Similarity | Distance-based; operate on pairs/tuples of embeddings
Center-based | Center Loss, Ring Loss | Penalize distance from class center; reduce intra-class variance
Prototype-based | Prototypical Loss, ProxyNCA, Proxy-Anchor | Episode-based; compare embeddings to class proxies
Self-supervised | SimCLR, BYOL, DINO, MAE | Pre-training without identity labels; fine-tune for biometrics

ArcFace (Deng et al., 2019) remains the de facto standard for face, iris, and palmprint recognition. Its additive angular margin, which replaces the target-class logit cos(θ) with cos(θ + m), provides geometrically interpretable, consistent inter-class separation on the hypersphere.
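The margin mechanism fits in a few lines. A simplified single-sample numpy sketch of the ArcFace logit computation (omitting the easy-margin and numerical-stability details of the reference implementation; class centers here are random for illustration):

```python
import numpy as np

def arcface_logits(emb, centers, label, s=64.0, m=0.5):
    """ArcFace: replace the target-class logit cos(theta) with cos(theta + m),
    then scale by s. emb: (D,); centers: (C, D) class-center rows; label: int."""
    e = emb / np.linalg.norm(emb)                        # unit-norm embedding
    W = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    cos = W @ e                                          # cosine to each class center
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    logits = s * cos                                     # non-target classes: plain cosine
    logits[label] = s * np.cos(theta[label] + m)         # margin on the true class only
    return logits

rng = np.random.default_rng(0)
emb = rng.normal(size=128)
centers = rng.normal(size=(10, 128))
logits = arcface_logits(emb, centers, label=3)
```

Because the margin lowers the target logit, cross-entropy on these logits forces the embedding to sit at least m radians closer to its class center than to any other, which is exactly the inter-class separation described above.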

Pooling and Aggregation

Converting variable-length inputs to fixed-length embeddings:

Strategy | Domain | Description
Global Average Pooling (GAP) | Image | Standard spatial pooling
GeM (Generalized Mean Pooling) | Image | Learnable power parameter; emphasizes discriminative regions
Attentive Statistics Pooling | Voice | Attention-weighted mean + std across time frames (ECAPA-TDNN)
Multi-Head Attention Pooling | Voice, Video | Transformer-style attention over temporal dimension
Quality-Weighted Pooling | Face (video) | Weight frames by quality score; down-weight blurry/occluded frames
Set Pooling | Gait | Permutation-invariant set aggregation (GaitSet)
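GeM is a good example of how a single learnable parameter interpolates between these strategies: p = 1 recovers average pooling and large p approaches max pooling. A minimal numpy sketch with a fixed p (in practice p is learned jointly with the backbone):

```python
import numpy as np

def gem_pool(feat, p=3.0, eps=1e-6):
    """Generalized-mean pooling over spatial positions: ((1/N) sum x^p)^(1/p).
    feat: (C, H, W) feature map -> (C,) descriptor."""
    x = np.clip(feat, eps, None)   # GeM assumes non-negative (post-ReLU) activations
    return (x.reshape(x.shape[0], -1) ** p).mean(axis=1) ** (1.0 / p)

fmap = np.random.default_rng(0).uniform(0, 1, size=(256, 7, 7))  # toy C,H,W map
v_avg = gem_pool(fmap, p=1.0)     # p=1: plain average pooling
v_gem = gem_pool(fmap, p=3.0)     # p=3: weights strong responses more heavily

# Power-mean inequality: the pooled value grows monotonically with p.
assert np.all(v_gem >= v_avg - 1e-9)
```

This bias toward strong local responses is why GeM tends to emphasize the most discriminative regions of the input.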

Training Strategies

  • Large-scale classification → fine-tuning — Pre-train backbone on large identity-labeled dataset (MS1MV2, VoxCeleb2); fine-tune with margin loss on target domain.
  • Two-stage margin training — First train with softmax, then fine-tune with ArcFace/AAM-Softmax with increased margin and longer inputs (common in speaker verification).
  • Knowledge distillation — Distill large teacher (R200) into compact student (MobileFaceNet) for edge deployment.
  • Curriculum learning — Start with easy samples, gradually introduce hard negatives (AdaFace's quality-adaptive approach).
  • Mixed precision (FP16/BF16) — Essential for training on large face datasets, where the softmax layer alone evaluates billions of sample-class logit pairs.
  • Distributed training — Model-parallel softmax (partial FC) to handle millions of training identities.

Partial FC (Sub-Center Strategies)

When training with very large numbers of identities (MS1MV2: 85K; WebFace42M: 2M), the classification layer becomes enormous. Partial FC (An et al., 2022) samples a random subset of negative classes per mini-batch, reducing memory by 10–100× with no accuracy loss. Sub-center ArcFace assigns K sub-centers per class to handle label noise.
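The sampling step itself is simple to sketch. A toy numpy version of the per-batch class selection (the real Partial FC also shards the sampled centers across GPUs, which is omitted here; function name and shapes are illustrative):

```python
import numpy as np

def partial_fc_sample(num_classes, batch_labels, sample_rate=0.1, rng=None):
    """Select the class-center columns used in one mini-batch's softmax:
    every in-batch (positive) class, plus randomly sampled negatives up to
    sample_rate * num_classes columns in total."""
    if rng is None:
        rng = np.random.default_rng()
    positives = np.unique(batch_labels)
    negatives = np.setdiff1d(np.arange(num_classes), positives)
    n_neg = max(0, int(sample_rate * num_classes) - len(positives))
    sampled_neg = rng.choice(negatives, size=n_neg, replace=False)
    return np.concatenate([positives, sampled_neg])

cols = partial_fc_sample(100_000, batch_labels=[3, 42, 42, 7],
                         sample_rate=0.1, rng=np.random.default_rng(0))
print(len(cols))   # 10000 columns instead of 100000
```

Only the selected columns of the center matrix participate in the forward/backward pass, which is where the 10–100× memory reduction comes from.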

Key Models & Papers

Paper Year Contribution
Schroff et al., FaceNet 2015 Triplet loss + end-to-end face embedding
Deng et al., ArcFace 2019 Additive angular margin; standard face recognition loss
Kim et al., AdaFace 2022 Quality-adaptive margin for face recognition
Desplanques et al., ECAPA-TDNN 2020 SE + multi-scale + attentive stat pooling for speaker verification
Dosovitskiy et al., ViT 2020 Vision Transformer; later adopted for biometrics
An et al., Partial FC 2022 Scalable training with sampled softmax
Oquab et al., DINOv2 2023 Self-supervised ViT; foundation model for visual features
He et al., MAE 2022 Masked autoencoder pre-training; applied to face/iris

Challenges

  • Identity-labeled data at scale — Training modern face models requires millions of identities; privacy concerns limit new large-scale dataset creation.
  • Domain gap — Models trained on curated datasets underperform on operational data (surveillance, forensic).
  • Computational cost — Billion-parameter softmax layers for millions of classes (e.g., 2M identities × 512-d embeddings ≈ 1B parameters); Partial FC helps but doesn't eliminate the issue.
  • Architecture search overhead — NAS for biometric-specific architectures is expensive; most work still uses off-the-shelf backbones.
  • Explainability — Deep embeddings are black boxes; forensic and legal applications demand interpretability.

State of the Art (SOTA)

  • Face: ArcFace + ViT-L + quality-adaptive margin achieves SOTA across IJB-B/C/S.
  • Voice: WavLM-Large + AAM-Softmax fine-tuning; ECAPA2 for efficient deployment.
  • Iris: CNN-based (ResNet/EfficientNet) with ArcFace loss; ViT-based models emerging.
  • Gait: ResNet-based GaitBase + augmentation remains surprisingly strong; ConvNeXt backbone in DeepGaitV2.
  • Fingerprint: U-Net segmentation + DeepPrint embeddings; attention-based models for contactless.
  • General trend: Foundation models (DINOv2, MAE, WavLM) pre-trained on massive unlabeled data, then fine-tuned with biometric-specific losses.

Open Questions

  • Will a single foundation model (biometric GPT) handle all modalities, or will modality-specific architectures persist?
  • Can self-supervised pre-training eliminate the need for large identity-labeled datasets?
  • How far can knowledge distillation push accuracy on sub-1M-parameter mobile models?
  • Should the biometric community move to standardized architecture benchmarks (like ImageNet for classification)?

References

  • Deng, J. et al. (2019). ArcFace: Additive Angular Margin Loss for Deep Face Recognition. CVPR.
  • An, X. et al. (2022). Killing Two Birds with One Stone: Efficient and Robust Training of Face Recognition CNNs by Partial FC. CVPR.
  • Oquab, M. et al. (2023). DINOv2: Learning Robust Visual Features without Supervision. arXiv.

Backlinks: Transformer Architectures for Biometrics, Facial Recognition Systems, Iris Recognition, Fingerprint Recognition, Voice Biometrics, Gait Analysis, Anti Spoofing Techniques, Biometric Image Quality, Biometric Datasets and Benchmarks