Deep Learning Architectures for Biometrics¶
One-line summary: The CNN, ViT, GCN, and hybrid architectures — plus the loss functions, training strategies, and pooling mechanisms — that power modern biometric recognition across all modalities.
Modality: Cross-modal
Related concepts: Transformer Architectures for Biometrics, Facial Recognition Systems, Iris Recognition, Fingerprint Recognition, Voice Biometrics, Gait Analysis, Biometric Image Quality, Biometric Datasets and Benchmarks
Last updated: 2026-04-04
Overview¶
Deep learning has transformed biometrics from hand-crafted feature engineering (Gabor filters, LBP, SIFT) to end-to-end learned representations. The field is architecturally diverse: face and iris use CNNs and ViTs, voice uses TDNNs and Transformers, gait uses GCNs and set-based networks, and fingerprint uses U-Nets and ResNets.
The unifying theme is metric learning: learning an embedding space where samples from the same identity are close and samples from different identities are far apart.
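The verification decision that metric learning enables can be sketched in a few lines of pure Python. The function names and the threshold value below are illustrative, not from any specific system; deployed systems calibrate the threshold against a target false-match rate:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def verify(probe, gallery, threshold=0.35):
    """Accept if the two embeddings are close enough on the hypersphere.
    The threshold here is illustrative only."""
    return cosine_similarity(probe, gallery) >= threshold

# Embeddings of the same identity point in similar directions;
# different identities do not.
same = verify([0.9, 0.1, 0.4], [0.8, 0.2, 0.5])
diff = verify([0.9, 0.1, 0.4], [-0.7, 0.6, -0.3])
```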
Technical Details¶
Backbone Architectures¶
Convolutional Neural Networks (CNNs)¶
| Architecture | Year | Biometric Applications |
|---|---|---|
| VGGNet-16 | 2014 | Early face (VGGFace), iris |
| ResNet-50/100/200 | 2016 | Face (ArcFace backbone), fingerprint, palmprint |
| Inception / GoogLeNet | 2015 | FaceNet's original backbone |
| MobileNetV2/V3 | 2018/2019 | On-device face (MobileFaceNet), finger vein |
| EfficientNet-B0–B7 | 2019 | Face, iris, PAD; balanced accuracy-efficiency |
| SE-ResNet (Squeeze-Excitation) | 2018 | Channel attention; used in ECAPA-TDNN (voice) |
| ConvNeXt | 2022 | Modernized CNN; competitive with ViT for gait (DeepGaitV2) |
Vision Transformers (ViTs)¶
See Transformer Architectures for Biometrics for full details.
- ViT-B/16, ViT-L/16 used for face recognition (TransFace, 2023).
- Swin Transformer for face anti-spoofing and gait.
- DeiT distillation for efficient biometric ViTs.
Graph Convolutional Networks (GCNs)¶
- GCN on skeleton graphs for gait recognition (GaitGraph, GaitGraph2).
- GCN for fingerprint minutiae graph matching.
- AASIST uses graph attention for voice anti-spoofing.
Time-Delay Neural Networks (TDNNs)¶
- Dominant in speaker verification: x-vector, ECAPA-TDNN.
- Frame-level feature extraction + temporal context via dilated 1D convolutions.
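The temporal-context mechanism can be sketched with a single-channel dilated convolution in pure Python; `dilated_conv1d` and the kernel values are illustrative, not taken from any published model:

```python
def dilated_conv1d(frames, kernel, dilation):
    """1-D convolution with dilation over a frame sequence.
    Each output frame aggregates context at spacing `dilation`,
    so stacked layers cover a wide temporal window cheaply."""
    k = len(kernel)
    span = (k - 1) * dilation  # temporal context consumed by this layer
    out = []
    for t in range(len(frames) - span):
        out.append(sum(kernel[i] * frames[t + i * dilation] for i in range(k)))
    return out

# Three stacked 3-tap layers with dilations 1, 2, 3 cover
# 1 + 2*(1 + 2 + 3) = 13 frames of context per output frame.
x = list(range(20))
h = dilated_conv1d(x, [1.0, 1.0, 1.0], dilation=1)
h = dilated_conv1d(h, [1.0, 1.0, 1.0], dilation=2)
h = dilated_conv1d(h, [1.0, 1.0, 1.0], dilation=3)
```

The growing receptive field at near-constant cost per frame is why TDNNs remain the workhorse for frame-level speaker embeddings.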
Hybrid / NAS-Derived¶
- EdgeFace (2024): NAS-optimized hybrid CNN+ViT for mobile face recognition.
- EfficientFormer: Efficient ViT variant used in on-device biometrics.
Loss Functions for Biometric Embedding Learning¶
The loss function is arguably the most critical design choice in biometric deep learning.
| Loss Family | Examples | Key Property |
|---|---|---|
| Softmax variants | Softmax, L-Softmax, A-Softmax (SphereFace), CosFace, ArcFace, AdaFace | Classification-based; angular/cosine margins enforce separation |
| Metric losses | Contrastive, Triplet, N-pair, Multi-Similarity | Distance-based; operate on pairs/tuples of embeddings |
| Center-based | Center Loss, Ring Loss | Penalize distance from class center; reduce intra-class variance |
| Prototype-based | Prototypical Loss, ProxyNCA, Proxy-Anchor | Episode-based; compare to class proxies |
| Self-supervised | SimCLR, BYOL, DINO, MAE | Pre-training without identity labels; fine-tune for biometrics |
ArcFace (Deng et al., 2019) remains the de facto standard for face, iris, and palmprint recognition. Its additive angular margin cos(θ + m) provides geometrically interpretable, consistent inter-class separation on the hypersphere.
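The margin's effect on the target-class logit can be shown numerically. `arcface_logit` is an illustrative helper, not the reference implementation; the defaults s = 64 and m = 0.5 follow the paper's common settings:

```python
import math

def arcface_logit(cos_theta, margin=0.5, scale=64.0):
    """ArcFace target-class logit: s * cos(theta + m).
    Adding the margin to the angle (rather than the cosine) gives a
    constant angular penalty, so the decision boundary is identical
    for every class on the hypersphere."""
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    return scale * math.cos(theta + margin)

# The margin shrinks the target logit, so the network must pull
# same-identity embeddings closer than plain softmax would require.
plain = 64.0 * 0.8            # unmargined logit: 51.2
penalized = arcface_logit(0.8)  # strictly smaller
```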
Pooling and Aggregation¶
Converting variable-length inputs to fixed-length embeddings:
| Strategy | Domain | Description |
|---|---|---|
| Global Average Pooling (GAP) | Image | Standard spatial pooling |
| GeM (Generalized Mean Pooling) | Image | Learnable power parameter; emphasizes discriminative regions |
| Attentive Statistics Pooling | Voice | Attention-weighted mean + std across time frames (ECAPA-TDNN) |
| Multi-Head Attention Pooling | Voice, Video | Transformer-style attention over temporal dimension |
| Quality-Weighted Pooling | Face (video) | Weight frames by quality score; down-weight blurry/occluded frames |
| Set Pooling | Gait | Permutation-invariant set aggregation (GaitSet) |
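GeM, from the table above, can be sketched over a single channel's spatial activations; `gem_pool` is an illustrative helper, and in a real network the power p is a learnable parameter (p ≈ 3 is a common setting):

```python
def gem_pool(activations, p=3.0, eps=1e-6):
    """Generalized Mean (GeM) pooling over one channel's spatial map.
    p = 1 recovers global average pooling; large p approaches max
    pooling, emphasizing the most discriminative regions."""
    clamped = [max(a, eps) for a in activations]  # GeM requires positive inputs
    return (sum(a ** p for a in clamped) / len(clamped)) ** (1.0 / p)

feats = [0.1, 0.2, 0.9, 0.3]
avg = gem_pool(feats, p=1.0)  # equals the plain mean, 0.375
gem = gem_pool(feats, p=3.0)  # pulled toward the 0.9 peak
```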
Training Strategies¶
- Large-scale classification → fine-tuning — Pre-train backbone on large identity-labeled dataset (MS1MV2, VoxCeleb2); fine-tune with margin loss on target domain.
- Two-stage margin training — First train with softmax, then fine-tune with ArcFace/AAM-Softmax with increased margin and longer inputs (common in speaker verification).
- Knowledge distillation — Distill large teacher (R200) into compact student (MobileFaceNet) for edge deployment.
- Curriculum learning — Start with easy samples, gradually introduce hard negatives (AdaFace's quality-adaptive approach).
- Mixed precision (FP16/BF16) — Essential for training on large face datasets (millions of identities, tens of millions of images).
- Distributed training — Model-parallel softmax (partial FC) to handle millions of training identities.
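For the distillation strategy above, one common formulation for embedding models is an embedding-alignment loss; `embedding_distill_loss` is an illustrative sketch (logit-based KL distillation is an alternative when both models keep a classification head):

```python
import math

def embedding_distill_loss(student, teacher):
    """Align the compact student's embedding direction with the large
    teacher's: loss = 1 - cosine similarity, zero when the two
    embeddings are parallel."""
    dot = sum(s * t for s, t in zip(student, teacher))
    ns = math.sqrt(sum(s * s for s in student))
    nt = math.sqrt(sum(t * t for t in teacher))
    return 1.0 - dot / (ns * nt)

# Parallel embeddings (same direction, different norm) incur zero loss,
# since only the direction on the hypersphere matters for matching.
aligned = embedding_distill_loss([0.6, 0.8], [0.3, 0.4])
misaligned = embedding_distill_loss([0.6, 0.8], [0.8, -0.6])
```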
Partial FC (Sub-Center Strategies)¶
When training with millions of identities (MS1MV2: 85K; WebFace42M: 2M), the classification layer becomes enormous. Partial FC (An et al., 2022) samples a random subset of negative classes per mini-batch, reducing memory by 10–100× with no accuracy loss. Sub-center ArcFace assigns K sub-centers per class to handle label noise.
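The sampling idea can be sketched in pure Python; `sample_partial_fc_classes` is an illustrative helper (the actual Partial FC implementation shards the FC layer across GPUs and samples negatives per shard):

```python
import random

def sample_partial_fc_classes(batch_labels, num_classes, sample_ratio=0.1, seed=None):
    """Partial-FC-style class sampling: compute the softmax over all
    classes present in the mini-batch (positives) plus a random subset
    of the remaining classes (negatives), instead of over all
    `num_classes` columns of the FC layer."""
    rng = random.Random(seed)
    positives = set(batch_labels)
    budget = max(len(positives), int(num_classes * sample_ratio))
    negatives = [c for c in range(num_classes) if c not in positives]
    sampled_neg = rng.sample(negatives, budget - len(positives))
    return sorted(positives) + sorted(sampled_neg)

# With 1M identities and a 10% ratio, each step touches ~100K FC columns.
classes = sample_partial_fc_classes([3, 7, 7, 42], num_classes=1000,
                                    sample_ratio=0.1, seed=0)
```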
Key Models & Papers¶
| Paper | Year | Contribution |
|---|---|---|
| Schroff et al., FaceNet | 2015 | Triplet loss + end-to-end face embedding |
| Deng et al., ArcFace | 2019 | Additive angular margin; standard face recognition loss |
| Kim et al., AdaFace | 2022 | Quality-adaptive margin for face recognition |
| Desplanques et al., ECAPA-TDNN | 2020 | SE + multi-scale + attentive stat pooling for speaker verification |
| Dosovitskiy et al., ViT | 2020 | Vision Transformer; later adopted for biometrics |
| An et al., Partial FC | 2022 | Scalable training with sampled softmax |
| Oquab et al., DINOv2 | 2023 | Self-supervised ViT; foundation model for visual features |
| He et al., MAE | 2022 | Masked autoencoder pre-training; applied to face/iris |
Challenges¶
- Identity-labeled data at scale — Training modern face models requires millions of identities; privacy concerns limit new large-scale dataset creation.
- Domain gap — Models trained on curated datasets underperform on operational data (surveillance, forensic).
- Computational cost — Billion-parameter softmax layers for millions of classes (e.g. 2M identities × 512-d embeddings ≈ 1B FC parameters); Partial FC helps but doesn't eliminate the issue.
- Architecture search overhead — NAS for biometric-specific architectures is expensive; most work still uses off-the-shelf backbones.
- Explainability — Deep embeddings are black boxes; forensic and legal applications demand interpretability.
State of the Art (SOTA)¶
- Face: ArcFace + ViT-L + quality-adaptive margin achieves SOTA across IJB-B/C/S.
- Voice: WavLM-Large + AAM-Softmax fine-tuning; ECAPA2 for efficient deployment.
- Iris: CNN-based (ResNet/EfficientNet) with ArcFace loss; ViT-based models emerging.
- Gait: ResNet-based GaitBase + augmentation remains surprisingly strong; ConvNeXt backbone in DeepGaitV2.
- Fingerprint: U-Net segmentation + DeepPrint embeddings; attention-based models for contactless.
- General trend: Foundation models (DINOv2, MAE, WavLM) pre-trained on massive unlabeled data, then fine-tuned with biometric-specific losses.
Open Questions¶
- Will a single foundation model (biometric GPT) handle all modalities, or will modality-specific architectures persist?
- Can self-supervised pre-training eliminate the need for large identity-labeled datasets?
- How far can knowledge distillation push accuracy on sub-1M-parameter mobile models?
- Should the biometric community move to standardized architecture benchmarks (like ImageNet for classification)?
References¶
- Deng, J. et al. (2019). ArcFace: Additive Angular Margin Loss for Deep Face Recognition. CVPR.
- An, X. et al. (2022). Killing Two Birds with One Stone: Efficient and Robust Training of Face Recognition CNNs by Partial FC. CVPR.
- Oquab, M. et al. (2023). DINOv2: Learning Robust Visual Features without Supervision. arXiv.
Backlinks: Transformer Architectures for Biometrics, Facial Recognition Systems, Iris Recognition, Fingerprint Recognition, Voice Biometrics, Gait Analysis, Anti Spoofing Techniques, Biometric Image Quality, Biometric Datasets and Benchmarks