Evaluating OCR-to-Markdown Systems Is Fundamentally Broken (and Why That’s Hard to Fix)



Evaluating optical character recognition (OCR) systems that convert PDF files or document images to Markdown is more complicated than it seems. Unlike plain-text OCR, OCR-to-Markdown requires jointly recovering textual content, layout, reading order, and formatting choices. Today's benchmarks attempt to capture this through a combination of string matching, heuristic alignment, and formatting rules, but in practice these approaches routinely misclassify correct output as failure.

This post explains why OCR-to-Markdown evaluation is inherently underdetermined, examines common evaluation techniques and their failure patterns, highlights concrete problems observed in two widely used benchmarks, and explains why LLM-as-a-judge is currently the most practical way to evaluate these systems, despite its drawbacks.


Why is OCR-to-Markdown so difficult to evaluate?

At its core, OCR-to-Markdown does not have a single valid output.

Multiple outputs can be equally valid:

  • Multi-column layouts can be linearized with different reading orders.
  • Equations can be represented using LaTeX, Unicode, HTML, or hybrids.
  • Headers, footers, watermarks, and marginal text may or may not count as “content” depending on the task.
  • Spacing, punctuation, and Unicode normalization often differ without affecting meaning.

To a human or a downstream system, these outputs are equivalent. To a benchmark, they often are not.


Common evaluation techniques and their limitations

1. String-based metrics (edit distance, exact match)

Most OCR-to-Markdown benchmarks are based on normalized string comparison or edit distance.

Limitations

  • Markdown is treated as a flat character sequence, ignoring syntax.
  • Small formatting differences incur large penalties.
  • Structurally incorrect output can still score well if the raw text overlaps.
  • Scores correlate poorly with human judgment.

These metrics reward format compliance rather than correctness.
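To make the failure mode concrete, here is a minimal sketch (not any benchmark's actual code) of how a flat character-level similarity metric penalizes two equivalent renderings of the same sentence:

```python
# Sketch: string metrics see Markdown as a flat character sequence, so a
# Unicode superscript and its LaTeX spelling of the same unit diverge.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], as string metrics see it."""
    return SequenceMatcher(None, a, b).ratio()

ground_truth = "The area is 5 km$^2$."
prediction = "The area is 5 km\u00b2."  # same meaning, Unicode superscript

# Both renderings are correct, yet the score drops well below 1.0.
print(similarity(ground_truth, prediction))
```

The metric charges the model for a representation choice, not an OCR error.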


2. Order-sensitive block matching

Some benchmarks split documents into blocks and score outputs by how well the blocks match and how closely their order agrees.

Limitations

  • Valid alternative reading orders (as in multi-column documents) are penalized.
  • A small footer or a bit of marginal text can break strict ordering constraints.
  • Matching heuristics degrade quickly as layout complexity increases.

Correct content is often flagged as incorrect because of ordering assumptions.
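A minimal sketch of this failure, assuming a strictly position-aligned block matcher (a simplification of what such benchmarks do):

```python
# Sketch, assuming strict positional alignment: blocks must appear in the
# same sequence as the ground truth, so a valid alternative reading order
# of a two-column page loses credit on every swapped block.
def ordered_block_score(gt_blocks, pred_blocks):
    """Fraction of ground-truth blocks matched at the same position."""
    return sum(g == p for g, p in zip(gt_blocks, pred_blocks)) / len(gt_blocks)

gt = ["Title", "Left column para", "Right column para", "Footer"]
pred = ["Title", "Right column para", "Left column para", "Footer"]  # valid order

print(ordered_block_score(gt, pred))  # 0.5: half the page counted as wrong
```

Every block is present and correct; only the linearization differs.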


3. Matching equations via LaTeX normalization

Math-heavy benchmarks usually expect equations to be rendered as complete LaTeX.

Limitations

  • Unicode or partially converted equivalents are penalized.
  • Equivalent LaTeX expressions that use different macros fail to match.
  • Mixed LaTeX/Markdown/HTML representations are not handled.
  • Equations that render correctly still fail string-level checks.

This conflates choice of representation with mathematical correctness.
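As an illustration, here is a sketch, assuming a naive whitespace-stripping normalizer, of why equivalent representations of the same formula fail a string-equality check:

```python
# Sketch: whitespace stripping is a common normalization, but it cannot
# recognize that two spellings denote the same formula.
import re

def normalize(tex: str) -> str:
    """Strip all whitespace: a common, and insufficient, normalization."""
    return re.sub(r"\s+", "", tex)

# Two spellings of the same chemistry expression; the second uses
# Markdown-style inline math. Both render the same content.
a = r"5g \mathrm{silica} + 3g \mathrm{Al}_{2}\mathrm{O}_{3}"
b = r"5 g silica + 3 g Al$_2$O$_3$"

print(normalize(a) == normalize(b))  # False: penalized despite equivalence
```

Deciding equivalence properly would require parsing and rendering both sides, which is exactly what string-level benchmarks avoid.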


4. Assumptions about formatting

Benchmarks implicitly encode a preferred output style.

Limitations

  • Embedded HTML tags cause matching failures.
  • Unicode symbols (for example, km²) are penalized against their LaTeX equivalents.
  • Spacing and punctuation inconsistencies in the ground truth amplify errors.

Models tuned to the benchmark's preferred format outperform more general OCR systems.


Problems observed in existing benchmarks

Benchmark A: olmOCR-Bench

Manual inspection reveals that several subsets encode implicit content-deletion rules:

  • Headers, footers, and watermarks that are clearly present in documents are marked as absent in the ground truth.
  • Models trained to extract all visible text are penalized for being correct.
  • These subsets effectively evaluate selective suppression, not OCR quality.

In addition:

  • Math-heavy subsets fail predictions whose equations are not rendered as complete LaTeX.
  • Correct predictions are penalized for representation differences.

As a result, scores depend strongly on whether a model's output philosophy matches the benchmark's hidden assumptions.

Example 1

In the image above, Nanonets-OCR2 correctly predicts the watermark on the right side of the page, but the ground truth annotation penalizes the model for that correct prediction.

{
	"pdf": "headers_footers/ef5e1f5960b9f865c8257f9ce4ff152a13a2559c_page_26.pdf",
	"page": 1,
	"id": "ef5e1f5960b9f865c8257f9ce4ff152a13a2559c_page_26.pdf_manual_01",
	"type": "absent",
	"text": "Document t\u00e9l\u00e9charg\u00e9 depuis www.cairn.info - Universit\u00e9 de Marne-la-Vall\u00e9e - - 193.50.159.70 - 20/03/2014 09h07. \u00a9 S.A.C.",
	"case_sensitive": false,
	"max_diffs": 3,
	"checked": "verified",
	"first_n": null,
	"last_n": null,
	"url": ""
}

The type absent means that this text must not appear in the prediction.
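Our reading of this rule can be sketched as follows (absent_test_passes is a hypothetical helper illustrating the rule, not the benchmark's actual code):

```python
# Sketch of how an "absent" test case appears to be scored: the case passes
# only if the listed text does NOT occur in the prediction, so a model that
# faithfully transcribes a visible watermark fails the case.
def absent_test_passes(prediction: str, text: str, case_sensitive: bool = False) -> bool:
    if not case_sensitive:
        prediction, text = prediction.lower(), text.lower()
    return text not in prediction

# A faithful transcription that includes the real, visible watermark:
prediction = "... Document téléchargé depuis www.cairn.info ..."
print(absent_test_passes(prediction, "www.cairn.info"))  # False: correct OCR fails
```

Under this rule, the more complete the transcription, the worse the score on these cases.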

Example 2

The benchmark also disregards text in the document footer.


For example, in this document the benchmark requires Alcoholics Anonymous® and www.aa.org to be absent from the prediction, even though both are clearly present on the page.

{
	"pdf": "headers_footers/3754542bf828b42b268defe21db8526945928834_page_4.pdf", 
	"page": 1, 
	"id": "3754542bf828b42b268defe21db8526945928834_page_4_header_00", 
	"type": "absent", 
	"max_diffs": 0, 
	"checked": "verified", 
	"url": "", 
	"text": "Alcoholics Anonymous\u00ae", 
	"case_sensitive": false, "first_n": null, "last_n": null
	}
{
	"pdf": "headers_footers/3754542bf828b42b268defe21db8526945928834_page_4.pdf", 
	"page": 1, 
	"id": "3754542bf828b42b268defe21db8526945928834_page_4_header_01", 
	"type": "absent", 
	"max_diffs": 0, 
	"checked": "verified", 
	"url": "", 
	"text": "www.aa.org", 
	"case_sensitive": false, "first_n": null, "last_n": null}

Benchmark B: OmniDocBench

OmniDocBench presents similar issues, but on a larger scale:

  • Equation evaluation relies on strict LaTeX string equality.
  • Semantically identical equations fail due to macro, spacing, or symbol differences.
  • Several ground-truth annotation errors (missing tokens, garbled math, incorrect spacing) were observed.
  • Unicode normalization and spacing differences systematically lower scores.
  • Prediction-selection heuristics can fail even when the correct answer is fully present.

In many cases, low scores reflect benchmark artifacts, not model errors.

Example 1


In the example above, the Nanonets-OCR2-3B model predicts 5 g silica + 3 g Al$_2$O$_3$, but the ground truth renders it as $ 5g \\mathrm{\\ s i l i c a}+3g \\mathrm{\\ A l}*{2} \\mathrm{O*{3}} $ . The benchmark therefore marks the model's prediction as incorrect, even though both are correct.

The full ground truth and prediction for this test case are shown below:

'pred': 'The collected eluant was concentrated by rotary evaporator to 1 ml. The extracts were finally passed through a final column filled with 5 g silica + 3 g Al$_2$O$_3$ to remove any co-extractive compounds that may cause instrumental interferences durin the analysis. The extract was eluted with 120 ml of DCM:n-hexane (1:1), the first 18 ml of eluent was discarded and the rest were collected, which contains the analytes of interest. The extract was exchanged into n-hexane, concentrated to 1 ml to which 1 μg/ml of internal standard was added.'
'gt': 'The collected eluant was concentrated by rotary evaporator to 1 ml .The extracts were finally passed through a final column filled with $ 5g \\mathrm{\\ s i l i c a}+3g \\mathrm{\\ A l}*{2} \\mathrm{O*{3}} $ to remove any co-extractive compounds that may cause instrumental
interferences during the analysis. The extract was eluted with 120 ml of DCM:n-hexane (1:1), the first 18 ml of eluent was discarded and the rest were collected, which contains the analytes of interest. The extract was exchanged into n - hexane, concentrated to 1 ml to which $ \\mu\\mathrm{g / ml} $ of internal standard was added.'

Example 2

We also found clearly incorrect annotations in OmniDocBench.


In the ground truth annotation, the 1 in 1 μg/ml is missing.

'text': 'The collected eluant was concentrated by rotary evaporator to 1 ml .The extracts were finally passed through a final column filled with $ 5g \\mathrm{\\ s i l i c a}+3g \\mathrm{\\ A l}*{2} \\mathrm{O*{3}} $ to remove any co-extractive compounds that may cause instrumental interferences during the analysis. The extract was eluted with 120 ml of DCM:n-hexane (1:1), the first 18 ml of eluent was discarded and the rest were collected, which contains the analytes of interest. The extract was exchanged into n - hexane, concentrated to 1 ml to which $ \\mu\\mathrm{g / ml} $ of internal standard was added.'
