From Tokens to Trees: What Makes Sequence Generation Hard in OCR-to-LaTeX Models
Image-to-LaTeX modeling is often described as OCR, but that label is misleading. Standard OCR extracts surface text. LaTeX generation must recover structured mathematical intent. Superscripts, subscripts, nested fractions, matrix layouts, and bracket scope all need to be reconstructed as a valid token sequence. The difficulty is not only visual recognition. It is structured generation under strong syntactic constraints.
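To make the contrast concrete, here is a toy illustration of surface text versus structured output for an image of "x squared" (the token inventory is illustrative, not from any particular system):

```python
# Plain OCR output vs. the structured LaTeX a math recognizer must produce
# for the same image of "x squared". Token names are illustrative.
surface_text = "x2"                       # what character-level OCR sees
latex_tokens = ["x", "^", "{", "2", "}"]  # superscript scope made explicit
print("".join(latex_tokens))  # x^{2}
```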
Why This Problem Is Different from Plain OCR
Mathematical expressions are hierarchical, but autoregressive decoders emit them linearly. That creates a representation mismatch. The model must map a two-dimensional visual structure into a one-dimensional token sequence while preserving nesting and operator relationships. A small local mistake can break the global expression even if most characters are recognized correctly.
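A minimal sketch of how fragile the linearization is: dropping a single closing brace from the token sequence for `\frac{a+b}{c}` silently changes the denominator scope, even though eight of nine tokens are correct. The token lists and the balance check below are illustrative, not from any particular system.

```python
# Hypothetical token sequences for \frac{a+b}{c}.
correct = ["\\frac", "{", "a", "+", "b", "}", "{", "c", "}"]
# One dropped "}" shifts the denominator scope and breaks the expression.
broken  = ["\\frac", "{", "a", "+", "b", "{", "c", "}"]

def braces_balanced(tokens):
    """Check that every '{' has a matching '}' (a minimal well-formedness test)."""
    depth = 0
    for t in tokens:
        if t == "{":
            depth += 1
        elif t == "}":
            depth -= 1
            if depth < 0:   # a '}' appeared before its matching '{'
                return False
    return depth == 0

print(braces_balanced(correct))  # True
print(braces_balanced(broken))   # False
```

Checks like this catch only syntactic breakage; a balanced but wrongly scoped sequence still renders to the wrong expression.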
Encoder Choice Matters Less Than You Think
It is tempting to focus on stronger visual encoders such as larger CNNs or ViT variants. These help, but they do not solve the full problem. Once the encoder produces sufficiently informative visual features, the decoder and search strategy often become the real bottleneck. Weak decoding can waste good visual representations.
The Decoder Carries Structural Burden
LSTM decoders are still competitive because they are stable and efficient, but Transformer decoders better capture long-range dependencies in nested expressions. This becomes important when bracket matching, denominator scope, or multi-symbol operators depend on tokens generated many steps earlier. Better sequence memory often translates into cleaner structural outputs.
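The long-range advantage can be seen in the causal attention mask itself. In the sketch below (step indices are illustrative), the step that must emit a closing `}` can attend directly to the step that emitted the matching `{`, however far back it is; an LSTM has to carry that bracket state through every intermediate hidden state.

```python
import numpy as np

def causal_mask(t):
    """Lower-triangular boolean mask: position i may attend to positions <= i."""
    return np.tril(np.ones((t, t), dtype=bool))

# Suppose step 1 emitted a "{" and step 8 must now emit its matching "}".
# With causal self-attention, step 8 reads step 1 directly.
mask = causal_mask(9)
print(mask[8, 1])  # True  -- direct access to a token emitted 7 steps earlier
print(mask[1, 8])  # False -- attention never looks at future tokens
```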
Why Beam Search Helps
Greedy decoding commits too early. In LaTeX generation, early ambiguity is common: a token may plausibly start a fraction, a grouped expression, or a plain symbol sequence. Beam search keeps multiple hypotheses alive long enough for later evidence to disambiguate them. That usually improves BLEU and exact-match metrics, because structural consistency emerges over multiple decoding steps rather than from a single local choice.
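The early-commitment failure can be reproduced with a toy distribution. In the sketch below (the probabilities and tokens are invented for illustration), greedy decoding takes the locally likely token and ends with a low-probability sequence, while a beam of size 2 keeps the `\frac` branch alive and wins overall:

```python
import math

def beam_search(next_logprobs, beam_size=2, max_len=10, eos="</s>"):
    """Minimal beam search sketch. `next_logprobs(prefix)` is assumed to
    return {token: log_prob} for the next step given a token prefix."""
    beams = [([], 0.0)]                      # (sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_logprobs(seq).items():
                if tok == eos:
                    finished.append((seq + [tok], score + lp))
                else:
                    candidates.append((seq + [tok], score + lp))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return max(finished or beams, key=lambda c: c[1])[0]

# Toy next-token distributions (illustrative numbers, not a real model).
TOY = {
    (): {"x": 0.6, "\\frac": 0.4},
    ("x",): {"</s>": 0.1},
    ("\\frac",): {"</s>": 0.9},
}

def toy_model(prefix):
    return {t: math.log(p) for t, p in TOY[tuple(prefix)].items()}

print(beam_search(toy_model, beam_size=1))  # ['x', '</s>']      (greedy: 0.6 * 0.1 = 0.06)
print(beam_search(toy_model, beam_size=2))  # ['\\frac', '</s>'] (beam:   0.4 * 0.9 = 0.36)
```

Real decoders also apply length normalization and prune finished hypotheses differently; this sketch keeps only the core expand-score-prune loop.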
Metrics Need Context
BLEU is useful for approximate sequence similarity, but it can hide semantically catastrophic mistakes. Exact match is much harsher and better reflects whether the output is fully usable. Ideally, evaluation should include render-aware checks or symbolic equivalence for at least a validation subset. Two LaTeX strings can differ lexically while rendering identically, and they can also look nearly identical lexically while changing mathematical meaning.
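Both failure modes are easy to demonstrate. The normalization below is a deliberately crude sketch (whitespace stripping plus removal of `\left`/`\right` sizing commands), standing in for real render-aware or symbolic equivalence checks rather than replacing them:

```python
import re

def exact_match(pred, ref):
    return pred == ref

def normalized_match(pred, ref):
    """Crude sketch: strip whitespace and \\left/\\right sizing commands
    before comparing. A stand-in for render-aware or symbolic checks."""
    def norm(s):
        s = re.sub(r"\s+", "", s)
        return s.replace(r"\left", "").replace(r"\right", "")
    return norm(pred) == norm(ref)

# Lexically different, renders identically:
ref  = r"\left( \frac{a}{b} \right)"
pred = r"(\frac{a}{b})"
print(exact_match(pred, ref), normalized_match(pred, ref))  # False True

# Lexically near-identical, mathematically different:
print(normalized_match(r"\frac{a}{b+c}", r"\frac{a}{b}+c"))  # False
```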
What Good Systems Optimize For
The best OCR-to-LaTeX systems balance visual robustness, structural decoding, and search quality. Data augmentation helps the model tolerate scan noise and layout variability, but architecture still matters. If you want reliable output, think in terms of structured prediction, not character recognition. The model is not only reading symbols. It is reconstructing syntax.