The frontier table extraction benchmark

An open, multilingual benchmark for evaluating table extraction from document images. 1,820 real-world tables spanning 9 languages, scored with T-LAG, a graph-based metric that captures both structural fidelity and cell-content accuracy in a single number.

1,820 tables · 9 languages · 380 source documents · 48% with spanning cells · 7 providers

Leaderboard

Overall T-LAG F1 scores across all 1,820 samples. Providers are scored only on samples they successfully processed.

Rank  Provider                        T-LAG Score
1     Pulse Ultra 2                   93.5%
2     Gemini 3.1                      81.5%
3     Azure Document Intelligence     76.1%
4     Snowflake Document AI           70.8%
5     Databricks ai_parse_document    63.0%
6     AWS Textract                    60.3%
7     Unstructured                    36.0%

Head-to-Head

Select any two providers to compare T-LAG scores. Overall performance and per-language breakdown side by side.

Pulse Ultra 2 (93.5%) vs Gemini 3.1 (81.5%): +11.9% difference
Perfect extractions: 1053 vs 518
Coverage: 100% vs 99.5%

Example extraction

Source table
Pulse Ultra 2: T-LAG 92.3%
Gemini 3.1: T-LAG 70.1%


Dataset

PulseBench-Tab draws from 380 real-world documents including financial filings, government reports, medical records, and academic papers. Tables range from simple 2-cell headers to dense 1,183-cell spreadsheets. Ground truth was human-labeled by subject matter experts.

Language       Samples  Share
🇺🇸 English     594      32.6%
🇨🇳 Chinese     213      11.7%
🇪🇸 Spanish     176      9.7%
🇷🇺 Russian     170      9.3%
🇫🇷 French      165      9.1%
🇯🇵 Japanese    159      8.7%
🇸🇦 Arabic      146      8.0%
🇩🇪 German      113      6.2%
🇰🇷 Korean      84       4.6%

11.3 avg rows
5.0 avg columns
54.1 avg cells
1,183 max cells

Performance by Language

Table extraction quality varies dramatically across scripts. Arabic and Korean are the hardest. Most providers drop 15-30 points on non-Latin languages.

Language       Samples  Pulse Ultra 2  Gemini 3.1  Azure DI  Snowflake Document AI  Databricks ai_parse_document
🇺🇸 English     594      91             78          75        72                     67
🇨🇳 Chinese     213      96             87          71        68                     59
🇪🇸 Spanish     176      94             85          85        78                     70
🇷🇺 Russian     170      94             87          79        75                     65
🇫🇷 French      165      97             90          84        79                     72
🇯🇵 Japanese    159      96             83          80        71                     60
🇸🇦 Arabic      146      92             66          61        54                     41
🇩🇪 German      113      95             84          77        74                     68
🇰🇷 Korean      84       94             84          74        69                     56
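As a rough consistency check, the per-language scores can be combined into an overall figure. The sketch below assumes the overall score is a per-sample mean, so a sample-weighted average of the per-language values should land near the leaderboard number; the page does not state how the overall score is aggregated, and the names `SAMPLES`, `PULSE`, `GEMINI`, and `weighted_mean` are illustrative only.

```python
# Sample counts and per-language T-LAG scores copied from the tables above.
SAMPLES = {"English": 594, "Chinese": 213, "Spanish": 176, "Russian": 170,
           "French": 165, "Japanese": 159, "Arabic": 146, "German": 113,
           "Korean": 84}
PULSE = {"English": 91, "Chinese": 96, "Spanish": 94, "Russian": 94,
         "French": 97, "Japanese": 96, "Arabic": 92, "German": 95,
         "Korean": 94}
GEMINI = {"English": 78, "Chinese": 87, "Spanish": 85, "Russian": 87,
          "French": 90, "Japanese": 83, "Arabic": 66, "German": 84,
          "Korean": 84}


def weighted_mean(scores, counts):
    """Average per-language scores, weighting each language by its sample count.

    Assumption: the overall score is a plain mean over samples.
    """
    total = sum(counts.values())
    return sum(scores[lang] * counts[lang] for lang in counts) / total


print(round(weighted_mean(PULSE, SAMPLES), 1))
print(round(weighted_mean(GEMINI, SAMPLES), 1))
```

Under that assumption the weighted averages come out at roughly 93.6 for Pulse Ultra 2 and 81.8 for Gemini 3.1, close to the reported 93.5% and 81.5%. The small gaps are plausibly due to rounding in the per-language table and, for Gemini 3.1, to the 99.5% coverage.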

How T-LAG Works

T-LAG models each table as a directed graph of cell adjacencies, then finds the optimal matching between the ground-truth and prediction graphs. Unlike TEDS, which operates on DOM trees, T-LAG evaluates the 2D logical structure directly.

What is T-LAG?

T-LAG (Table Logical Adjacency Graph) represents each table as a directed graph where nodes are cells and edges connect horizontally or vertically adjacent cells. The score measures how well the predicted graph matches the ground truth graph, capturing both structure and content in a single F1 metric.

Why not TEDS?

TEDS (Tree Edit Distance Similarity) is the most common table evaluation metric, but it has well-documented weaknesses. It operates on the DOM tree rather than the logical 2D grid, so it conflates formatting changes (like wrapping cells in <thead>) with actual structural errors. It also scales poorly for large tables.

T-LAG vs TEDS
T-LAG: Evaluates 2D logical grid structure directly
TEDS: Evaluates DOM tree edit distance

T-LAG: Ignores formatting-only differences
TEDS: Penalizes formatting changes as errors

T-LAG: Optimal matching (Hungarian algorithm)
TEDS: Greedy tree edit operations

Pipeline

1. Build adjacency graphs

Parse each HTML table into a grid, then extract directed edges: RIGHT for horizontal neighbors, BELOW for vertical neighbors. Edges from spanning cells are deduplicated so merged regions don't dominate.
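A minimal sketch of the graph-building step, assuming the HTML has already been parsed into cells with a row, column, rowspan, and colspan. The `Cell` type and `build_edges` function are hypothetical names for illustration, not the benchmark's actual code.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    text: str
    row: int
    col: int
    rowspan: int = 1
    colspan: int = 1


def build_edges(cells):
    """Return a set of (source_index, target_index, direction) edges."""
    # Lay each cell onto every grid position it covers.
    grid = {}
    for idx, cell in enumerate(cells):
        for r in range(cell.row, cell.row + cell.rowspan):
            for c in range(cell.col, cell.col + cell.colspan):
                grid[(r, c)] = idx

    # Look right and below from every grid position. Using a set means a
    # spanning cell contributes each neighbor edge once, not once per
    # covered grid position.
    edges = set()
    for (r, c), src in grid.items():
        for direction, pos in (("RIGHT", (r, c + 1)), ("BELOW", (r + 1, c))):
            dst = grid.get(pos)
            if dst is not None and dst != src:
                edges.add((src, dst, direction))
    return edges


# A header spanning two columns, with two cells beneath it.
cells = [Cell("Region", 0, 0, colspan=2), Cell("North", 1, 0), Cell("South", 1, 1)]
edges = build_edges(cells)
```

Here the spanning header yields one BELOW edge to each cell under it, and the two body cells are linked by a single RIGHT edge.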

2. Weight edges with the Psi kernel

For each candidate pair of ground-truth and predicted edges, compute a similarity weight. Cell text similarity uses normalized Levenshtein similarity (one minus the normalized edit distance) raised to the 7th power, which sharply penalizes even small character-level errors.
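The effect of the 7th power can be illustrated with a short sketch. It assumes the similarity is one minus the edit distance divided by the longer string's length; the exact normalization, and how the two endpoint cells of an edge are combined into one weight, are not specified here. The function names are illustrative.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def text_similarity(a, b, power=7):
    """Normalized Levenshtein similarity raised to `power`.

    Assumption: distance is normalized by the longer string's length.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty cells are treated as identical
    similarity = 1.0 - levenshtein(a, b) / longest
    return similarity ** power


exact = text_similarity("Revenue", "Revenue")
one_typo = text_similarity("Revenue", "Revenne")
```

With one wrong character out of seven, the raw similarity is 6/7 (about 0.86), but after the 7th power the weight falls to about 0.34, which is why small OCR slips are costly under this kernel.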

3. Optimal matching

Run the Hungarian algorithm on the weight matrix to find the optimal one-to-one edge assignment. Matching is direction-constrained: RIGHT edges only match RIGHT edges, and BELOW edges only match BELOW edges.
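The sketch below shows direction-constrained one-to-one matching on a tiny example. To stay dependency-free it swaps in a brute-force search over permutations in place of the Hungarian algorithm; both return a maximum-weight assignment, but brute force is only practical for a handful of edges. The function names and the sample weights are made up for illustration.

```python
from itertools import permutations


def best_assignment(weights):
    """Maximum-weight one-to-one assignment by exhaustive search.

    Stand-in for the Hungarian algorithm; suitable for tiny matrices only.
    Returns (total_weight, [(row, col), ...]).
    """
    if not weights or not weights[0]:
        return 0.0, []
    n_rows, n_cols = len(weights), len(weights[0])
    if n_rows > n_cols:
        # Transpose so there are never more rows than columns.
        transposed = [list(col) for col in zip(*weights)]
        total, pairs = best_assignment(transposed)
        return total, sorted((i, j) for j, i in pairs)
    best_total, best_pairs = -1.0, []
    for cols in permutations(range(n_cols), n_rows):
        total = sum(weights[i][j] for i, j in enumerate(cols))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(cols))
    return best_total, best_pairs


def match_edges(gt_dirs, pred_dirs, weights):
    """Match edges separately per direction so RIGHT never pairs with BELOW."""
    matches = []
    for direction in ("RIGHT", "BELOW"):
        gt_idx = [i for i, d in enumerate(gt_dirs) if d == direction]
        pred_idx = [j for j, d in enumerate(pred_dirs) if d == direction]
        sub = [[weights[i][j] for j in pred_idx] for i in gt_idx]
        _, pairs = best_assignment(sub)
        matches += [(gt_idx[a], pred_idx[b], weights[gt_idx[a]][pred_idx[b]])
                    for a, b in pairs]
    return matches


gt_dirs = ["RIGHT", "RIGHT", "BELOW"]
pred_dirs = ["RIGHT", "BELOW", "RIGHT"]
weights = [[0.9, 1.0, 0.8],
           [0.8, 1.0, 0.1],
           [1.0, 0.6, 1.0]]
matches = match_edges(gt_dirs, pred_dirs, weights)
```

In this example a greedy matcher would take the 0.9 pair first and be left with 0.1 for the other RIGHT edge, while the optimal assignment pairs both at 0.8. The cross-direction weights of 1.0 are never considered.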

4. Score

Compute weighted precision, recall, and F1 from the matched edges. The F1 is the final T-LAG score. No additional structural penalty is needed, because structural errors surface as unmatched edges.
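One plausible reading of the scoring step, sketched under an assumption: precision is the sum of matched weights divided by the number of predicted edges, and recall is the same sum divided by the number of ground-truth edges. The exact formula is not given here, and `tlag_scores` is a hypothetical name.

```python
def tlag_scores(matched_weights, n_gt_edges, n_pred_edges):
    """Return (precision, recall, f1) from the weights of matched edge pairs.

    Assumption: unmatched edges contribute zero weight, so they lower
    precision (extra predicted edges) or recall (missing ground-truth edges).
    """
    total = sum(matched_weights)
    precision = total / n_pred_edges if n_pred_edges else 0.0
    recall = total / n_gt_edges if n_gt_edges else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Three predicted edges matched against four ground-truth edges:
# two exact matches and one partial match.
precision, recall, f1 = tlag_scores([1.0, 1.0, 0.5], n_gt_edges=4, n_pred_edges=3)
```

In this example the missing fourth edge lowers recall and the partial match lowers both precision and recall, so no separate structural penalty term is required.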