Evaluating Fine-Tuned BERT Beyond Accuracy: Latency, Calibration, and Failure Modes
Transformer classification projects are frequently reported with a single number: accuracy or macro F1. That is useful, but incomplete. A model that wins by one point offline can still be the wrong production choice if its latency is unstable, its confidence is poorly calibrated, or its error distribution is unacceptable for the application domain. Real evaluation should help you choose a deployable model, not just a leaderboard winner.
Why Macro F1 Matters
In mental health, moderation, and support-oriented NLP, class imbalance is often severe. Accuracy can look strong while minority classes remain poorly detected. Macro F1 is a better default because it weights each class equally and exposes whether the model is genuinely learning across the label space. This is especially important when false negatives are more harmful than false positives.
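A tiny sketch makes the gap concrete. The labels and counts below are hypothetical, but they show how a majority-class predictor can score 90% accuracy while macro F1 exposes the complete miss on the minority class:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced test set: 9 "neutral" posts, 1 "distress" post.
y_true = ["neutral"] * 9 + ["distress"]
# A degenerate classifier that always predicts the majority class.
y_pred = ["neutral"] * 10

acc = accuracy_score(y_true, y_pred)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(f"accuracy={acc:.2f} macro_f1={macro:.2f}")  # accuracy=0.90 macro_f1=0.47
```

Accuracy reports 0.90, but the per-class F1 for "distress" is zero, dragging macro F1 down to roughly 0.47; that is the signal you want surfaced before deployment.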
Compare Against Simpler Baselines
A fine-tuned BERT should not be evaluated in isolation. Naive Bayes, logistic regression, and Linear SVC often provide strong baselines on text classification tasks. They are cheap, interpretable, and fast to serve. If a Transformer only offers a marginal quality improvement, the operational cost may not justify deployment. The right question is not whether BERT wins. It is whether BERT wins enough.
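One way to keep that comparison honest is to run all three baselines through identical pipelines. The corpus below is a toy stand-in; in practice you would reuse the exact train/test split used for the fine-tuned model:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny illustrative corpus with hypothetical labels.
train_texts = ["i feel hopeless today", "nothing matters anymore", "i cannot cope",
               "great game last night", "loved the new recipe", "sunny weekend plans"]
train_labels = ["distress", "distress", "distress", "neutral", "neutral", "neutral"]
test_texts = ["i cannot go on", "fun trip with friends"]
test_labels = ["distress", "neutral"]

baselines = {
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "logreg": make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)),
    "linear_svc": make_pipeline(TfidfVectorizer(), LinearSVC()),
}
scores = {name: f1_score(test_labels,
                         model.fit(train_texts, train_labels).predict(test_texts),
                         average="macro")
          for name, model in baselines.items()}
print(scores)
```

With identical splits and the same macro-F1 metric, the delta between the best baseline and the Transformer becomes a concrete number you can weigh against serving cost.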
Latency Is a Product Metric
Average inference time is not enough. Tail latency matters because user experience is shaped by slow cases, not just typical ones. You should measure p50, p95, and p99 latency under realistic batch sizes and hardware constraints. Tokenization overhead also deserves attention; in lightweight deployments it can be a non-trivial share of total response time.
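Measuring those percentiles requires nothing exotic. The sketch below uses a stand-in `predict` function (the `time.sleep` is a placeholder for tokenization plus the forward pass) and a simple nearest-rank percentile over per-request wall times:

```python
import time

def predict(text):
    # Stand-in for tokenization plus the model forward pass; swap in
    # the real inference pipeline when profiling a deployed model.
    time.sleep(0.001)
    return "neutral"

latencies_ms = []
for _ in range(200):
    start = time.perf_counter()
    predict("sample request text")
    latencies_ms.append((time.perf_counter() - start) * 1000.0)

latencies_ms.sort()

def percentile(p):
    # Nearest-rank percentile over the sorted sample.
    idx = min(len(latencies_ms) - 1, int(round(p / 100.0 * len(latencies_ms))))
    return latencies_ms[idx]

p50, p95, p99 = percentile(50), percentile(95), percentile(99)
print(f"p50={p50:.2f}ms p95={p95:.2f}ms p99={p99:.2f}ms")
```

Run the same loop at the batch sizes and on the hardware you actually plan to serve with; a p99 measured on a development GPU says little about a CPU-only container.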
Calibration and Confidence Quality
Many classification systems are used downstream in ranking, escalation, or human-in-the-loop workflows. In those cases, confidence quality matters almost as much as raw label quality. Poor calibration means a model may appear certain when it is actually wrong. Temperature scaling or isotonic calibration can materially improve decision quality even when the base classifier remains unchanged.
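Temperature scaling in particular is cheap to sketch: it fits a single scalar T on held-out logits and leaves the argmax (and therefore accuracy) untouched. The logits below are synthetic, scaled up to mimic an overconfident model, and the fit uses a plain grid search on validation NLL rather than a gradient optimizer:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    # Mean negative log-likelihood after dividing logits by temperature T.
    probs = softmax(logits / T)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

# Synthetic "overconfident" validation logits: the decision boundary is
# reasonable, but scores are scaled up so confidences are inflated.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))
labels = np.argmax(base + rng.normal(scale=1.5, size=(500, 3)), axis=1)
logits = base * 4.0

# Fit the single temperature parameter by grid search on validation NLL.
grid = np.linspace(0.5, 5.0, 46)  # step 0.1, so T = 1.0 (no scaling) is included
best_T = grid[np.argmin([nll(logits, labels, T) for T in grid])]
print(f"fitted temperature: {best_T:.2f}")
```

Because dividing all logits by the same T never changes the predicted label, the fitted temperature can only improve (or leave unchanged) the NLL relative to the uncalibrated T = 1 model on that validation set.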
Error Analysis Is Where the Real Work Starts
Confusion matrices, class-wise precision and recall, and manual inspection of false positives and false negatives reveal what the model is actually learning. In mental health text, common failure categories include sarcasm, indirect distress signals, context collapse on short messages, and domain shift across communities. These are not edge cases. They define whether the system is safe and useful.
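The mechanics are simple; the value is in reading the output. The examples below are invented to mirror the failure categories above (sarcasm, indirect distress, context collapse), and the last step pulls false negatives out for manual review:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical predictions on a small audit sample.
texts = ["i guess everything is just GREAT lately",  # sarcasm
         "haven't slept properly in weeks",          # indirect distress signal
         "fine.",                                    # short message, context collapse
         "new puppy photos!"]
y_true = ["distress", "distress", "distress", "neutral"]
y_pred = ["neutral", "distress", "neutral", "neutral"]

labels = ["distress", "neutral"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))

# Pull the false negatives out for manual inspection.
false_negatives = [t for t, yt, yp in zip(texts, y_true, y_pred)
                   if yt == "distress" and yp == "neutral"]
print(false_negatives)
```

Tagging each false negative with a failure category (rather than just counting them) is what turns this from a metrics exercise into a safety review.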
A Better Evaluation Framework
My preferred decision table combines macro F1, latency percentiles, calibration error, memory footprint, and qualitative failure notes. Once those signals are visible together, the deployment choice becomes clearer. Evaluation should be multi-objective because production systems are multi-objective. A classifier is not finished when it predicts well. It is finished when it predicts well enough, fast enough, and reliably enough for the environment it will live in.
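That decision table can be made executable. Every number below is hypothetical, but the shape is the point: filter candidates by hard operational budgets first, then pick the best macro F1 among the survivors, with qualitative notes carried alongside:

```python
# Illustrative decision table; all figures and model names are hypothetical.
candidates = {
    "linear_svc": {"macro_f1": 0.81, "p95_ms": 4, "ece": 0.06, "mem_mb": 60,
                   "notes": "misses sarcasm and indirect distress"},
    "bert_base": {"macro_f1": 0.85, "p95_ms": 38, "ece": 0.04, "mem_mb": 420,
                  "notes": "better on indirect distress; heavier tail latency"},
}
budget = {"p95_ms": 50, "ece": 0.08, "mem_mb": 512}

# Keep only candidates that meet every operational budget, then pick
# the best macro F1 among the survivors.
eligible = {name: m for name, m in candidates.items()
            if all(m[k] <= v for k, v in budget.items())}
choice = max(eligible, key=lambda name: eligible[name]["macro_f1"])
print(choice, eligible[choice]["notes"])
```

Tighten the p95 budget to 10 ms and the selection flips to the linear baseline, which is exactly the trade-off the table exists to surface.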