Mistral OCR

vikp · 2025-03-06T23:01:56 1741302116

I ran a partial benchmark against marker - https://github.com/VikParuchuri/marker .

Across 375 samples with an LLM as a judge, Mistral scores 4.32 and marker 4.41. Marker can run inference at between 20 and 120 pages per second on an H100.

You can see the samples here - https://huggingface.co/datasets/datalab-to/marker_comparison... .

The code for the benchmark is here - https://github.com/VikParuchuri/marker/tree/master/benchmark... . Will run a full benchmark soon.

Mistral OCR is an impressive model, but OCR is a hard problem, and there is a significant risk of hallucinations/missing text with LLMs.

lolinder · 2025-03-07T02:17:16 1741313836

> with LLM as a judge

For anyone else interested, prompt is here [0]. The model used was gemini-2.0-flash-001.

Benchmarks are hard, and I understand the appeal of having something that seems vaguely deterministic rather than having a human in the loop, but I have a very hard time accepting any LLM-judged benchmarks at face value. This is doubly true when we're talking about something like OCR which, as you say, is a very hard problem for computers of any sort.

I'm assuming you've given this some thought—how did you arrive at using an LLM to benchmark OCR vs other LLMs? What limitations with your benchmark have you seen/are you aware of?

[0] https://github.com/VikParuchuri/marker/blob/master/benchmark...

themanmaran · 2025-03-07T05:52:57 1741326777

We also ran an OCR benchmark with LLM as judge using structured outputs. You can check out the full methodology on the repo [1]. But the general idea is:

- Every document has ground truth text, a JSON schema, and the ground truth JSON.

- Run OCR on each document and pass the result to GPT-4o along with the JSON Schema

- Compare the predicted JSON against the ground truth JSON for accuracy.

In our benchmark, the ground truth text => gpt-4o was 99.7%+ accuracy. Meaning whenever gpt-4o was given the correct text, it could extract the structured JSON values ~100% of the time. So if we pass in the OCR text from Mistral and it scores 70%, that means the inaccuracies are isolated to OCR errors.
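
A minimal sketch of the last step (comparing predicted JSON against ground truth JSON), assuming accuracy means the fraction of ground-truth leaf fields that match exactly. The `flatten` and `json_accuracy` helpers and the sample invoice are hypothetical and are not taken from the benchmark repo, which may score differently.

```python
from typing import Any


def flatten(obj: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts/lists into {"a.b[0]": value} leaf pairs."""
    if isinstance(obj, dict):
        items = [(f"{prefix}.{k}" if prefix else str(k), v) for k, v in obj.items()]
    elif isinstance(obj, list):
        items = [(f"{prefix}[{i}]", v) for i, v in enumerate(obj)]
    else:
        return {prefix: obj}
    flat: dict[str, Any] = {}
    for path, value in items:
        flat.update(flatten(value, path))
    return flat


def json_accuracy(predicted: Any, truth: Any) -> float:
    """Fraction of ground-truth leaf fields reproduced exactly in the prediction."""
    truth_flat = flatten(truth)
    pred_flat = flatten(predicted)
    if not truth_flat:
        return 1.0
    correct = sum(
        1 for path, value in truth_flat.items()
        if path in pred_flat and pred_flat[path] == value
    )
    return correct / len(truth_flat)


# Hypothetical example: the OCR step misread two zeros as the letter O.
truth = {"invoice": {"number": "A-1001", "total": 250.0}, "items": ["pen", "ink"]}
predicted = {"invoice": {"number": "A-1OO1", "total": 250.0}, "items": ["pen", "ink"]}
score = json_accuracy(predicted, truth)
print(f"field accuracy: {score:.2f}")
```

Exact matching is deliberately strict here; a real benchmark might normalize whitespace or number formats before comparing.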

https://github.com/getomni-ai/benchmark

cdolan · 2025-03-07T05:54:52 1741326892

Were you guys able to finish running the benchmark with Mistral, and did it get a 70% score? Missed that.

Edit - I see it on the Benchmark page now. Woof, low 70% scores in some areas!

https://getomni.ai/ocr-benchmark

vikp · 2025-03-07T02:41:23 1741315283

Benchmarking is hard for markdown because of the slight formatting variations between different providers. With HTML, you can use something like TEDS (although there are issues with this, too), but with markdown, you don't have a great notion of structure, so you're left with edit distance.

I think blockwise edit distance is better than full page (find the ground truth blocks, then infer each block separately and compare), but many providers only do well on full pages, which doesn't make it fair.
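
A rough sketch of what a blockwise edit-distance score could look like, assuming the predicted and ground-truth blocks are already paired one to one. This is not the marker benchmark code; the function names and the normalization by the longer block length are assumptions for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def block_similarity(predicted: str, truth: str) -> float:
    """1.0 for identical blocks, 0.0 when every character differs."""
    longest = max(len(predicted), len(truth))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(predicted, truth) / longest


def blockwise_score(predicted_blocks: list[str], truth_blocks: list[str]) -> float:
    """Mean similarity over blocks that are assumed to be aligned already."""
    if len(predicted_blocks) != len(truth_blocks) or not truth_blocks:
        raise ValueError("expected two non-empty, equally long block lists")
    pairs = zip(predicted_blocks, truth_blocks)
    return sum(block_similarity(p, t) for p, t in pairs) / len(truth_blocks)


truth_blocks = ["Louis, commandeur de Malte", "2 juin 1679"]
predicted_blocks = ["Louis, commandeur de Malte", "2 juin 1678"]
score = blockwise_score(predicted_blocks, truth_blocks)
print(f"blockwise similarity: {score:.3f}")
```

Pairing the blocks is the hard part in practice, and this sketch leaves it out.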

There are a few different benchmark types in the marker repo:

  - Heuristic (edit distance by block with an ordering score)
  - LLM judging against a rubric
  - LLM win rate (compare two samples from different providers)

None of these are perfect, but LLM against a rubric has matched visual inspection the best so far.

I'll continue to iterate on the benchmarks. It may be possible to do a TEDS-like metric for markdown. Training a model on the output and then benchmarking could also be interesting, but it gets away from measuring pure extraction quality (the model benchmarking better is only somewhat correlated with better parse quality). I haven't seen any great benchmarking of markdown quality, even at research labs - it's an open problem.

arthurcolle · 2025-03-07T03:18:53 1741317533

You can use structured outputs, or something like my https://arthurcolle--dynamic-schema.modal.run/

to extract real data from unstructured text (like that produced by an LLM) to make benchmarks slightly easier if you have a schema

carlgreene · 2025-03-07T00:22:38 1741306958

Thank you for your work on Marker. It is the best OCR for PDFs I’ve found. The markdown conversion can get wonky with tables, but it still does better than anything else I’ve tried

vikp · 2025-03-07T02:42:46 1741315366

Thanks for sharing! I'm training some models now that will hopefully improve this and more :)

DeathArrow · 2025-03-07T05:33:05 1741325585

>Mistral OCR is an impressive model, but OCR is a hard problem, and there is a significant risk of hallucinations/missing text with LLMs.

To fight hallucinations, can't we use more LLMs and pick blocks where the majority of LLMs agree?

boxed · 2025-03-07T05:52:59 1741326779

Why wouldn't hallucinations be agreed upon if they have roughly the same training data?

TJSomething · 2025-03-07T06:52:57 1741330377

A hallucination is often an indication that the model doesn't know something. Then, the internal signal gets dominated by noise from the seeded training weights. Efforts to eliminate hallucinations with a single model have found success by asking the same question in different ways and only taking answers that agree. Logically, you could get more durable results from multiple models on the same prompt.
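
The voting idea in this subthread could be sketched roughly as follows, assuming each model's output has already been split into the same blocks. `majority_vote` and the `min_agreement` threshold are hypothetical names chosen for illustration, and whitespace is the only thing normalized before voting.

```python
from collections import Counter


def majority_vote(block_readings: list[str], min_agreement: float = 0.5):
    """Return (text, agreed) for one block given readings from several models."""
    # Collapse runs of whitespace so trivial spacing differences do not split the vote.
    normalized = [" ".join(reading.split()) for reading in block_readings]
    text, votes = Counter(normalized).most_common(1)[0]
    agreed = votes / len(normalized) > min_agreement
    return text, agreed


readings = [
    "capitaine aux gardes, 2 juin 1679.",
    "capitaine  aux gardes, 2 juin 1679.",
    "mortilhomme aux gardes, 2 juin 1679.",
]
text, agreed = majority_vote(readings)
print(text, agreed)
```

Blocks where `agreed` is false would go to manual review. As the reply above points out, models trained on similar data can still agree on the same wrong reading, so agreement is evidence rather than proof.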

codelion · 2025-03-07T03:53:52 1741319632

Really interesting benchmark, thanks for sharing! It's good to see some real-world comparisons. The hallucinations issue is definitely a key concern with LLM-based OCR, and it's important to quantify that risk. Looking forward to seeing the full benchmark results.

bambax · 2025-03-06T21:42:46 1741297366

It's not bad! But it still hallucinates. Here's an example of an (admittedly difficult) image:

https://i.imgur.com/jcwW5AG.jpeg

For the blocks in the center, it outputs:

> Claude, duc de Saint-Simon, pair et chevalier des ordres, gouverneur de Blaye, Senlis, etc., né le 16 août 1607 , 3 mai 1693 ; ép. 1○, le 26 septembre 1644, Diane - Henriette de Budos de Portes, morte le 2 décembre 1670; 2○, le 17 octobre 1672, Charlotte de l'Aubespine, morte le 6 octobre 1725.

This is perfect! But then the next one:

> Louis, commandeur de Malte, Louis de Fay Laurent bre 1644, Diane - Henriette de Budos de Portes, de Cressonsac. du Chastelet, mortilhomme aux gardes, 2 juin 1679.

This is really bad because

1/ a portion of the text of the previous block is repeated

2/ a portion of the next block is imported here where it shouldn't be ("Cressonsac"), as is a portion of the rightmost block ("Chastelet")

3/ but worst of all, a whole word is invented, "mortilhomme", which appears nowhere in the original. (The word doesn't exist in French, so in this case it is easier to spot; the real risk is invented words that do exist and "feel right" in context.)

(Correct text for the second block should be:

> Louis, commandeur de Malte, capitaine aux gardes, 2 juin 1679.)

layer8 · 2025-03-06T22:34:37 1741300477

> This is perfect!

Just a nit, but I wouldn’t call it perfect when using U+25CB ○ WHITE CIRCLE instead of what should be U+00BA º MASCULINE ORDINAL INDICATOR, or alternatively a superscript “o”. These are https://fr.wikipedia.org/wiki/Adverbe_ordinal#Premiers_adver....

There’s also extra spaces after the “1607” and around the hyphen in “Diane-Henriette”.

Lastly, U+2019 instead of U+0027 would be more appropriate for the apostrophe, all the more since in the image it looks like the former and not like the latter.

MatthiasPortzel · 2025-03-07T01:55:22 1741312522

Slightly unrelated, but I once used Apple’s built-in OCR feature LiveText to copy a short string out of an image. It appeared to work, but I later realized it had copied “M” as U+041C (Cyrillic Capital Letter Em), causing a regex to fail to match. OCR giving identical characters is only good enough until it’s not.
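
One cheap guard against this kind of lookalike is to flag letters whose Unicode name does not belong to the script you expect. This is a sketch using Python's standard `unicodedata` module; `suspicious_chars` and the sample string are made up for illustration, and the name-prefix check is a simplification rather than a full script lookup.

```python
import unicodedata


def suspicious_chars(text: str, expected_script: str = "LATIN"):
    """List letters whose Unicode name does not start with the expected script."""
    flagged = []
    for index, ch in enumerate(text):
        if ch.isalpha():
            name = unicodedata.name(ch, "UNKNOWN")
            if not name.startswith(expected_script):
                flagged.append((index, f"U+{ord(ch):04X}", name))
    return flagged


ocr_text = "\u041cODEL-42"  # first letter is Cyrillic Em, not Latin M
print(suspicious_chars(ocr_text))
```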

jorvi · 2025-03-06T23:22:22 1741303342

> Just a nit, but I wouldn’t call it perfect when using U+25CB ○ WHITE CIRCLE instead of what should be U+00BA º MASCULINE ORDINAL INDICATOR, or alternatively a superscript “o”

Or degree symbol. Although it should be able to figure out which to use according to the context.

TeMPOraL · 2025-03-06T22:45:02 1741301102

This is "reasoning model" stuff even for humans :).

layer8 · 2025-03-06T22:50:53 1741301453

There is OCR software that analyses which language is used, and then applies heuristics for the recognized language to steer the character recognition in terms of character sequence likelihoods and punctuation rules.

I don’t think you need a reasoning model for that, just better training; although conversely a reasoning model should hopefully notice the errors — though LLM tokenization might still throw a wrench into that.

bambax · 2025-03-06T21:55:51 1741298151

Another test with a text in English, which is maybe more fair (although Mistral is a French company ;-). This image is from Parliamentary debates of the parliament of New Zealand in 1854-55:

https://i.imgur.com/1uVAWx9.png

Here's the output of the first paragraph, with mistakes in brackets:

> drafts would be laid on the table, and a long discussion would ensue; whereas a Committee would be able to frame a document which, with perhaps a few verbal emundations [emendations], would be adopted; the time of the House would thus be saved, and its business expected [expedited]. With regard to the question of the comparative advantages of The-day [Tuesday]* and Friday, he should vote for the amendment, on the principle that the wishes of members from a distance should be considered on all sensations [occasions] where a principle would not be compromised or the convenience of the House interfered with. He hoped the honourable member for the Town of Christchurch would adopt the suggestion he (Mr. Forssith [Forsaith]) had thrown out and said [add] to his motion the names of a Committee.*

Some mistakes are minor (emundations/emendations or Forssith/Forsaith), but others are very bad, because they are unpredictable and don't correspond to any pattern, and therefore can be very hard to spot: sensations instead of occasions, or expected in lieu of expedited... That last one really changes the meaning of the sentence.

spudlyo · 2025-03-06T22:02:50 1741298570

I want to rejoice that OCR is now a "solved" problem, but I feel like hallucinations are just as problematic as the kind of stuff I have to put up with from tesseract -- both require careful manual proofreading for an acceptable degree of confidence. I guess I'll have to try it and see for myself just how much better these solutions are for my public domain archive.org Latin language reader & textbook projects.

qingcharles · 2025-03-07T00:49:01 1741308541

It depends on your use-case. For mine, I'm mining millions of scanned PDF pages to get approximate short summaries of long documents. The occasional hallucination won't damage the project. I realize I'm an outlier, and I would obviously prefer a solution that was as accurate as possible.

eMPee584 · 2025-03-07T00:44:24 1741308264

possibly doing both & diffing the output to spot contested bits?

spudlyo · 2025-03-07T01:17:22 1741310242

that’s my current idea, use two different ocr models and diff the results to spot check for errors. at these prices why not?
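
A small sketch of the diffing idea with Python's standard `difflib`, comparing two OCR outputs word by word and returning only the spans where they disagree. `contested_spans` is a hypothetical helper, and the two sample strings reuse the "sensations"/"occasions" error from the New Zealand example earlier in the thread.

```python
import difflib


def contested_spans(text_a: str, text_b: str):
    """Return (words_a, words_b) pairs where two OCR outputs disagree."""
    words_a, words_b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    spans = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            spans.append((" ".join(words_a[a_start:a_end]), " ".join(words_b[b_start:b_end])))
    return spans


engine_a = "considered on all sensations where a principle"
engine_b = "considered on all occasions where a principle"
print(contested_spans(engine_a, engine_b))
```

This only catches errors the two engines do not share, so it narrows the proofreading work rather than replacing it.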

thomasfromcdnjs · 2025-03-07T01:48:53 1741312133

Does anyone know the correlation between our ability to parse PDFs and the quality of our LLMs' training datasets?

If a lot of scientific papers have been PDFs and have hitherto had bad conversions to text/tokens, can we expect to see major gains in our training and therefore better outputs?

samstave · 2025-03-07T01:02:20 1741309340

[flagged]

Biganon · 2025-03-07T02:27:14 1741314434

...are you okay?

Kokichi · 2025-03-07T00:57:15 1741309035

All it ever does is hallucinate

raunakchowdhuri · 2025-03-07T03:54:06 1741319646

We ran some benchmarks comparing against Gemini Flash 2.0. You can find the full writeup here: https://reducto.ai/blog/lvm-ocr-accuracy-mistral-gemini

A high level summary is that while this is an impressive model, it underperforms even current SOTA VLMs on document parsing and tends to hallucinate OCR text and table structure and to drop content.

hackernewds · 2025-03-07T06:12:23 1741327943

meanwhile, you're comparing it to the output of almost a trillion dollar company

HaZeust · 2025-03-07T06:27:40 1741328860

... And? We're judging it for the merits of the technology it purports to be, not the pockets of the people that bankroll them. Probably not fair - sure, but when I pick my OCR, I want to pick SOTA. These comparisons and announcements help me find those.

owenpalmer · 2025-03-06T18:48:32 1741286912

This is incredibly exciting. I've been pondering/experimenting on a hobby project that makes reading papers and textbooks easier and more effective. Unfortunately the OCR and figure extraction technology just wasn't there yet. This is a game changer.

Specifically, this allows you to associate figure references with the actual figure, which would allow me to build a UI that solves the annoying problem of looking for a referenced figure on another page, which breaks up the flow of reading.

It also allows a clean conversion to HTML, so you can add cool functionality like clicking on unfamiliar words for definitions, or inserting LLM generated checkpoint questions to verify understanding. I would like to see if I can automatically integrate Andy Matuschak's Orbit[0] SRS into any PDF.

Lots of potential here.

[0] https://docs.withorbit.com/

NalNezumi · 2025-03-06T21:36:12 1741296972

>a UI that solves the annoying problem of looking for a referenced figure on another page, which breaks up the flow of reading.

A tangent, but this exact issue frustrated me for a long time with PDF readers and science papers. Then I found sioyek, which pops up a small window when you hover over links (references, equations, and figures), and it solved it.

Granted, the PDF file must be in the right format, so OCR could make this experience better. Just saying the UI component of that already exists.

https://sioyek.info/

PerryStyle · 2025-03-06T22:36:34 1741300594

Zotero's PDF viewer also does this now. Being able to annotate PDFs and having a reference manager has been a life saver.

owenpalmer · 2025-03-07T00:43:02 1741308182

Thanks for the link! Good to know someone is working on something similar.

generalizations · 2025-03-06T19:24:56 1741289096

Wait does this deal with images?

ezfe · 2025-03-06T19:42:00 1741290120

The output includes images from the input. You can see that on one of the examples where a logo is cropped out of the source and included in the result.

Asraelite · 2025-03-06T19:00:41 1741287641

I never thought I'd see the day where technology finally advanced far enough that we can edit a PDF.

randomNumber7 · 2025-03-06T19:16:07 1741288567

I never thought driving a car is harder than editing a pdf.

pzo · 2025-03-06T19:40:10 1741290010

It's not about which is harder but about what error rate you can tolerate. Here, 99% accuracy is enough for many applications. If a self-driving car completes 99% of trips without a crash, you're very likely going to be dead within a year.

For cars we need accuracy of at least 99.99%, and that's very hard.

rtsil · 2025-03-06T20:29:27 1741292967

I doubt most people have 99% accuracy. The threshold of tolerance for error is just much lower for any self-driving system (and with good reason, because we're not familiar with them yet).

KeplerBoy · 2025-03-06T21:11:11 1741295471

How do you define 99% accuracy?

I guess something like success rate for a trip (or mile) would be a more reasonable metric. Most people have a success rate far higher than 99% for average trips.

Most people who commute daily are probably doing something like 1,000 car rides a year and have minor accidents every few years. 99% success rates would mean monthly accidents.
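
The arithmetic behind that estimate, as a quick sketch. It assumes 1,000 independent trips a year, which is a simplification, and `yearly_outcomes` is just an illustrative helper.

```python
def yearly_outcomes(success_rate: float, trips_per_year: int = 1000):
    """Expected failed trips per year and the chance of a year with no failure."""
    expected_failures = trips_per_year * (1 - success_rate)
    p_no_failure = success_rate ** trips_per_year
    return expected_failures, p_no_failure


for rate in (0.99, 0.9999):
    failures, p_clean = yearly_outcomes(rate)
    print(f"{rate:.2%} per trip: ~{failures:.1f} failures/year, P(clean year) = {p_clean:.3g}")
```

At 99% per trip that is about 10 failures a year, or roughly one every five weeks. At 99.99% it drops to about one every ten years.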

toephu2 · 2025-03-06T21:03:34 1741295014

I've been able to edit PDFs (95%+ of them) accurately for the past 10 years...

Apofis · 2025-03-06T19:44:55 1741290295

Foxit PDF exists...

Beijinger · 2025-03-07T00:15:42 1741306542

Master PDF Editor?

kbyatnal · 2025-03-06T19:32:33 1741289553

We're approaching the point where OCR becomes "solved" — very exciting! Any legacy vendors providing pure OCR are going to get steamrolled by these VLMs.

However IMO, there's still a large gap for businesses in going from raw OCR outputs —> document processing deployed in prod for mission-critical use cases. LLMs and VLMs aren't magic, and anyone who goes in expecting 100% automation is in for a surprise.

You still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation over time, but it's going to take time and effort. But the future is on the horizon!

Disclaimer: I started an LLM doc processing company to help companies solve problems in this space (https://extend.app/)

dml2135 · 2025-03-06T20:14:54 1741292094

One problem I’ve encountered at my small startup in evaluating OCR technologies is precisely convincing stakeholders that the “human-in-the-loop” part is both unavoidable, and ultimately beneficial.

PMs want to hear that an OCR solution will be fully automated out-of-the-box. My gut says that anything offering that is snake-oil, and I try to convey that the OCR solution they want is possible, but if you are unwilling to pay the tuning cost, it’s going to flop out of the gate. At that point they lose interest and move on to other priorities.

kbyatnal · 2025-03-06T20:37:34 1741293454

Yup definitely, and this is exactly why I built my startup. I've heard this a bunch across startups & large enterprises that we work with. 100% automation is an impossible target, because even humans are not 100% perfect. So how can we expect LLMs to be?

But that doesn't mean you have to abandon the effort. You can still definitely achieve production-grade accuracy! It just requires having the right tooling in place, which reduces the upfront tuning cost. We typically see folks get there on the order of days or 1-2 weeks (it doesn't necessarily need to take months).

golergka · 2025-03-07T00:38:00 1741307880

It really depends on their fault tolerance. I think there are a ton of useful applications where OCR that is 99.9%, 99%, or even 98% reliable is good enough. A skillful product manager can keep these limitations in mind and work around them.

techwizrd · 2025-03-06T20:47:09 1741294029

The challenge I have is how to get bounding boxes for the OCR, for things like redaction/de-identification.

dontlikeyoueith · 2025-03-06T22:24:03 1741299843

AWS Textract works pretty well for this and is much cheaper than running LLMs.

daemonologist · 2025-03-06T22:48:09 1741301289

Textract is more expensive than this (for your first 1M pages per month at least) and significantly more than something like Gemini Flash. I agree it works pretty well though - definitely better than any of the open source pure OCR solutions I've tried.

kbyatnal · 2025-03-06T21:18:22 1741295902

yeah that's a fun challenge — what we've seen work well is a system that forces the LLM to generate citations for all extracted data, map that back to the original OCR content, and then generate bounding boxes that way. Tons of edge cases for sure that we've built a suite of heuristics for over time, but overall works really well.
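
A toy sketch of the mapping step described above, assuming the OCR layer returns words with boxes and the LLM returns a verbatim citation. The `Word` type and `locate_citation` are hypothetical. It only handles exact, case-insensitive word matches, so none of the edge-case heuristics mentioned in the comment are covered.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    box: tuple[float, float, float, float]  # (x0, y0, x1, y1)


def locate_citation(words: list[Word], citation: str):
    """Find the cited phrase in the OCR word list and return its merged box."""
    target = [token.lower() for token in citation.split()]
    if not target:
        return None
    tokens = [word.text.lower() for word in words]
    for start in range(len(tokens) - len(target) + 1):
        if tokens[start:start + len(target)] == target:
            hits = words[start:start + len(target)]
            # Union of the matched word boxes.
            return (
                min(w.box[0] for w in hits),
                min(w.box[1] for w in hits),
                max(w.box[2] for w in hits),
                max(w.box[3] for w in hits),
            )
    return None  # citation not found verbatim, which is a signal for review


words = [
    Word("Account", (10, 20, 70, 32)),
    Word("No:", (75, 20, 100, 32)),
    Word("12345", (105, 19, 150, 33)),
]
box = locate_citation(words, "No: 12345")
print(box)
```

A citation that cannot be located is itself useful, since it suggests the extracted value may not appear in the document at all.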

dontlikeyoueith · 2025-03-06T22:24:38 1741299878

Why would you do this and not use Textract?

schcrosby · 2025-03-07T00:04:05 1741305845

I too have this question.

risyachka · 2025-03-06T20:08:44 1741291724

>> Any legacy vendors providing pure OCR are going to get steamrolled by these VLMs.

-OR- they can just use these APIs. Considering that they have a client base which would prefer not to rewrite integrations to get the same result, they can get rid of most of the code base, replace it with an LLM API, increase margins by 90%, and enjoy the good life.

esafak · 2025-03-06T21:55:23 1741298123

They're going to become commoditized unless they add value elsewhere. Good news for customers.

TeMPOraL · 2025-03-06T22:53:16 1741301596

They are (or at least could easily be) adding value in the form of an SLA - charging money for giving guarantees on accuracy. This is better both for the customer, who gets concrete guarantees and someone to shift liability to, and for the vendor, who can focus on creating techniques and systems for getting that extra % of reliability out of the LLM OCR process.

All of the above are things companies - particularly larger ones - are happy to pay for, because OCR is just a cog in the machine, and this makes it more reliable and predictable.

On top of the above, there are auxiliary value-adds such a vendor could provide - such as, being fully compliant with every EU directive and regulation that's in power, or about to be. There's plenty of those, they overlap, and no one wants to deal with it if they can outsource it to someone who already figured it out.

(And, again, will take the blame for fuckups. Being a liability sink is always a huge value-add, in any industry.)

nextworddev · 2025-03-06T23:49:53 1741304993

Your customers include Checkr? Impressive. Are they referenceable?

einpoklum · 2025-03-06T22:30:17 1741300217

An LLM with billions of parameters for extracting text from a PDF (which isn't even a rasterized image) really does not "solve OCR".

mvac · 2025-03-06T19:34:08 1741289648

Great progress, but unfortunately, for our use case (converting medical textbooks from PDF to MD), the results are not as good as those by MinerU/PDF-Extract-Kit [1].

Also, the Colab link in the article is broken; I found a functional one [2] in the docs.

[1] https://github.com/opendatalab/MinerU

[2] https://colab.research.google.com/github/mistralai/cookbook/...

owenpalmer · 2025-03-06T20:47:58 1741294078

I've been searching relentlessly for something like this! I wonder why it's been so hard to find... is it the Chinese?

In any case, thanks for sharing.

thelittleone · 2025-03-06T23:14:31 1741302871

Have you had a chance to compare results from MinerU vs an LLM such as Gemini 2.0 or Anthropic's native PDF tool?

shekhargulati · 2025-03-07T02:02:40 1741312960

Mistral OCR made multiple mistakes in extracting this [1] document. It is a two-page-long PDF in Arabic from the Saudi Central Bank. The following errors were observed:

- Referenced Vision 2030 as Vision 2.0.

- Failed to extract the table; instead, it hallucinated and extracted the text in a different format.

- Failed to extract the number and date of the circular.

I tested the same document with ChatGPT, Claude, Grok, and Gemini. Only Claude 3.7 extracted the complete document, while all others failed badly. You can read my analysis here [2].

[1] https://rulebook.sama.gov.sa/sites/default/files/en_net_file...

[2] https://shekhargulati.com/2025/03/05/claude-3-7-sonnet-is-go...

vessenes · 2025-03-06T18:02:26 1741284146

Dang. Super fast and significantly more accurate than google, Claude and others.

Pricing : $1/1000 pages, or per 2k pages if “batched”. I’m not sure what batching means in this case: multiple pdfs? Why not split them to halve the cost?

Anyway this looks great at pdf to markdown.

sophiebits · 2025-03-06T18:06:04 1741284364

Batched often means a higher latency option (minutes/hours instead of seconds), which providers can schedule more efficiently on their GPUs.

abiraja · 2025-03-06T18:05:30 1741284330

Batching likely means the response is not real-time. You set up a batch job and they send you the results later.

ozim · 2025-03-06T19:00:37 1741287637

If only the business people I work with would understand that even a 100GB transfer over the network is not going to return results immediately ;)

vessenes · 2025-03-06T18:06:31 1741284391

That makes sense. Idle time is nearly free after all.

kapitalx · 2025-03-06T18:58:48 1741287528

From my testing so far, it seems it's super fast and responded synchronously. But it decided that the entire page is an image and returned `![img-0.jpeg](img-0.jpeg)` with coordinates in the metadata for the image, which is the entire page.

Our tool, doctly.ai, is much slower and async, but much more accurate, and gets you the content itself as markdown.

ralusek · 2025-03-06T19:12:45 1741288365

I thought we stopped -ly company names ~8 years ago?

kapitalx · 2025-03-06T19:27:19 1741289239

Haha for sure. Naming isn't just the hardest problem in computer science, it's always hard. But at some point you just have to pick something and move forward.

DonHopkins · 2025-03-07T05:07:31 1741324051

But doctr.ai was taken.

yieldcrv · 2025-03-06T19:21:35 1741288895

if you talk to people gen-x and older, you still need .com domains

for all those people that aren't just clicking on a link on their social media feed, chat group, or targeted ad

odiroot · 2025-03-06T18:52:39 1741287159

May I ask as a layperson, how would you go about using this to OCR multiple hundreds of pages? I tried the chat but it pretty much stops after the 2nd page.

beklein · 2025-03-06T19:58:15 1741291095

You can check the example code in the Mistral documentation. You would _only_ have to change the value of the variable `document_url` to the URL of your uploaded PDF... and you need to change the `MISTRAL_API_KEY` to the value of your specific key, which you can get from the La Plateforme webpage.

https://docs.mistral.ai/capabilities/document/#ocr-with-pdf

odiroot · 2025-03-06T20:17:39 1741292259

Thanks!

sneak · 2025-03-06T18:56:15 1741287375

Submit the pages via the API.

odiroot · 2025-03-06T21:04:47 1741295087

This worked indeed. Although I had to cut my document into smaller chunks. 900 pages at once ended with a timeout.
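
A sketch of the chunking step, assuming you split by page ranges before submitting each piece. `page_chunks` is a hypothetical helper, and the chunk size of 250 is arbitrary, since the comment does not say what size avoided the timeout. The actual PDF splitting and API calls are left out.

```python
def page_chunks(total_pages: int, chunk_size: int = 100):
    """Yield inclusive 1-based (first_page, last_page) ranges covering the document."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


ranges = list(page_chunks(900, chunk_size=250))
print(ranges)
```

Each range would then be written out as its own PDF and submitted separately, with the results concatenated in page order.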

Tostino · 2025-03-06T18:08:45 1741284525

Usually (with OpenAI; I haven't checked Mistral yet) it means an async API rather than a sync API.

e.g. you submit multiple requests (pdfs) in one call, and get back an id for the batch. You then can check on the status of that batch and get the results for everything when done.

It lets them use their available hardware to its full capacity much better.

jacksnipe · 2025-03-06T18:05:25 1741284325

I would assume this is 1 request containing 2k pages vs N requests whose total pages add up to 1000.

serjester · 2025-03-06T19:07:13 1741288033

This is cool! With that said, for anyone looking to use this in RAG, the downside to specialized models instead of general VLMs is that you can't easily tune them to your specific use case. So for example, we use Gemini to add very specific alt text to images in the extracted Markdown. It's also 2-3x the cost of Gemini Flash - hopefully the increased performance is significant.

Regardless excited to see more and more competition in the space.

Wrote an article on it: https://www.sergey.fyi/articles/gemini-flash-2-tips

hyuuu · 2025-03-06T20:15:31 1741292131

Gemini Flash is notorious for hallucinating OCR output, so be careful with it. For straightforward, semi-structured, low-page-count documents (under 5 pages) it should perform well, but the more the context window is stretched, the more unreliable the output gets.

porphyra · 2025-03-07T00:45:57 1741308357

I uploaded a picture of my Chinese mouthwash [0] and it made a ton of mistakes and hallucinated a lot. Very disappointing. For example, it says the usage instruction is to use 80 ml each time, even though the actual instruction on the bottle says to use 5-20 mL each time, three times a day, and gargle for 1 minute.

[0] https://i.imgur.com/JiX9joY.jpeg

[1] https://chat.mistral.ai/chat/8df2c9b9-ee72-414b-81c3-843ce74...

opwieurposiu · 2025-03-06T18:21:58 1741285318

Related, does anyone know of an app that can read gauges from an image and log the number to influx? I have a solar power meter in my crawlspace, it is inconvenient to go down there. I want to point an old phone at it and log it so I can check it easily. The gauge is digital and looks like this:

https://www.pvh2o.com/solarShed/firstPower.jpg

ubergeek42 · 2025-03-06T18:43:26 1741286606

This[1] is something I've come across but not had a chance to play with, designed for reading non-smart meters that might work for you. I'm not sure if there's any way to run it on an old phone though.

[1] https://github.com/jomjol/AI-on-the-edge-device

jasonjayr · 2025-03-06T21:51:30 1741297890

Wow. I was looking at hooking my water meter into Home Assistant, and was going to investigate just counting an optical pulse (it has a white portion on the gear that is in a certain spot every 0.1 gal). This is like the same meter I use, and perfect.

(It turns out my electric meter, though analog, blasts out its reading on RF every 10 seconds unencrypted. I got that via my RTL-SDR receiver :) )

timc3 · 2025-03-06T21:24:22 1741296262

I use this for a watermeter. Works quite well as long as you have a good SD card

dehrmann · 2025-03-06T18:34:02 1741286042

You'll be happier finding a replacement meter that has an interface to monitor it directly or a second meter. An old phone and OCR will be very brittle.

haswell · 2025-03-06T18:44:37 1741286677

Not OP, but it sounds like the kind of project I’d undertake.

Happiness for me is about exploring the problem within constraints and the satisfaction of building the solution. Brittleness is often of less concern than the fun factor.

And some kinds of brittleness can be managed/solved, which adds to the fun.

arcfour · 2025-03-06T18:53:49 1741287229

I would posit that learning how the device works, and how to integrate with a newer digital monitoring device would be just as interesting and less brittle.

haswell · 2025-03-06T19:08:10 1741288090

Possibly! But I’ve recently wanted to dabble with computer vision, so I’d be looking at a project like this as a way to scratch a specific itch. Again, not OP so I don’t know what their priorities are, but just offering one angle for why one might choose a less “optimal” approach.

ramses0 · 2025-03-06T18:50:52 1741287052

https://www.home-assistant.io/integrations/seven_segments/

https://www.unix-ag.uni-kl.de/~auerswal/ssocr/

https://github.com/tesseract-ocr/tesseract

https://community.home-assistant.io/t/ocr-on-camera-image-fo...

https://www.google.com/search?q=home+assistant+ocr+integrati...

https://www.google.com/search?q=esphome+ocr+sensor

https://hackaday.com/2021/02/07/an-esp-will-read-your-meter-...

...start digging around and you'll likely find something. HA has integrations which can support writing to InfluxDB (local for sure, and you can probably configure it for a remote influxdb).

You're looking at 1xRaspberry PI, 1xUSB Webcam, 1x"Power Management / humidity management / waterproof electrical box" to stuff it into, and then either YOLO and DIY to shoot over to your influxdb, or set up a Home Assistant and "attach" your frankenbox as some sort of "sensor" or "integration" which spits out metrics and yadayada...

BonoboIO · 2025-03-06T19:19:59 1741288799

Gemini Free Tier would surely work

renewiltord · 2025-03-06T18:40:04 1741286404

4o transcribes it perfectly. You can usually root an old Android and write this app in ~2h with LLMs if unfamiliar. The hard part will be maintaining camera lens cleanliness and alignment etc.

The time cost is so low that you should give it a gander. You'll be surprised how fast you can do it. If you just take screenshots every minute it should suffice.

pavl · 2025-03-06T22:30:23 1741300223

What software tools do you use to program the app?

renewiltord · 2025-03-06T23:23:30 1741303410

Since it's at home, you'll have WiFi access, so it's pretty much a rudimentary Kotlin app on Android. You can just grab a photo and ship it to the GPT-4o API, get the response, and then POST it somewhere.

evmar · 2025-03-06T19:23:04 1741288984

I noticed on the Arabic example they lost a space after the first letter on the third to last line, can any native speakers confirm? (I only know enough Arabic to ask dumb questions like this, curious to learn more.)

Edit: it looks like they also added a vowel mark not present in the input on the line immediately after.

Edit2: here's a picture of what I'm talking about, the before/after: https://ibb.co/v6xcPMHv

resiros · 2025-03-06T19:38:17 1741289897

Arabic speaker here. No, it's perfect.

evmar · 2025-03-06T19:54:27 1741290867

I am pretty sure it added a kasrah not present in the input on the 2nd to last line. (Not saying it's not super impressive, and also that almost certainly is the right word, but I think that still means not quite "perfect"?)

gl-prod · 2025-03-06T19:59:11 1741291151

Yes, it looks like it did add a kasrah to the word ظهري

yoda97 · 2025-03-06T22:09:18 1741298958

Yep, and فمِنا too. This is not just OCR; it made some post-processing corrections or "enhancements". That could be good, but it could also be trouble in the 1% of cases where it makes a mistake in critical documents.

gl-prod · 2025-03-06T19:58:00 1741291080

He means the space between the wāw (و) and the word

evmar · 2025-03-06T20:00:53 1741291253

I added a pic to the original comment, sorry for not being clear!

albatrosstrophy · 2025-03-06T22:00:36 1741298436

And here I thought after reading the headline: finally a reliable Arabic OCR. I've never in my life found a good one that does the job decently, especially for a scanned document. Or is there something out there I don't know about?

sbarre · 2025-03-06T18:08:42 1741284522

6 years ago I was working with a very large enterprise that was struggling to solve this problem, trying to scan millions of arbitrary forms and documents per month to clearly understand key points like account numbers, names and addresses, policy numbers, phone numbers, embedded images or scribbled notes, and also draw relationships between these values on a given form, or even across forms.

I wasn't there to solve that specific problem but it was connected to what we were doing so it was fascinating to hear that team talk through all the things they'd tried, from brute-force training on templates (didn't scale as they had too many kinds of forms) to every vendor solution under the sun (none worked quite as advertised on their data)..

I have to imagine this is a problem shared by so many companies.

hdjrudni · 2025-03-07T06:10:50 1741327850

Still terrible at handwriting.

I signed up for the API and cobbled together a script from their tutorial (https://docs.mistral.ai/capabilities/document/) -- why can't they give the full script instead of little bits?

Tried uploading a TIFF, they rejected it. Tried uploading a JPG, they rejected it (even though they supposedly support images?). Tried resaving as PDF. It took that, but the output was just bad. Then tried ChatGPT on the original .tiff (not using the API), and it got it perfectly. Honestly I could barely make out the handwriting with my eyes, but now that I see ChatGPT's version I think it's right.

lysace · 2025-03-06T20:17:21 1741292241

Nit: Please change the URL from

https://mistral.ai/fr/news/mistral-ocr

to

https://mistral.ai/news/mistral-ocr

The article is the same, but the site navigation is in English instead of French.

Unless it's a silent statement, of course. =)

lblume · 2025-03-06T21:11:27 1741295487

For me, the second page redirects to the first. (And I don't live in France.)

ChemSpider · 2025-03-06T18:23:05 1741285385

"World's best OCR model" - that is quite a statement. Are there any well-known benchmarks for OCR software?

themanmaran · 2025-03-06T18:31:43 1741285903

We published this benchmark the other week. We can update and run with Mistral today!

https://github.com/getomni-ai/benchmark

themanmaran · 2025-03-06T23:14:02 1741302842

Update: Just ran our benchmark on the Mistral model and results are.. surprisingly bad?

Mistral OCR:

- 72.2% accuracy

- $1/1000 pages

- 5.42s / page

Which is a pretty far cry from the 95% accuracy they were advertising from their private benchmark. The biggest thing I noticed is how it skips anything it classifies as an image/figure. So charts, infographics, some tables, etc. all get lifted out and returned as [image](image_002). Compared to the other VLMs that are able to interpret those images into a text representation.

https://github.com/getomni-ai/benchmark

https://huggingface.co/datasets/getomni-ai/ocr-benchmark

https://getomni.ai/ocr-benchmark
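One rough way to see how much of a page was "lifted out" as described above is to count the image placeholders in the returned text. This is a minimal sketch, assuming placeholders look like the form quoted in the comment, `[image](image_002)`, optionally with a leading `!` as in Markdown image syntax; the exact format of the real API output is not confirmed here.

```python
import re

# Assumption: placeholders follow the quoted form, e.g. [image](image_002),
# with an optional leading "!" as in Markdown image syntax.
PLACEHOLDER = re.compile(r"!?\[[^\]]*\]\((image_\d+)\)")


def find_image_placeholders(markdown: str) -> list[str]:
    """Return the placeholder ids found in the OCR output, in order."""
    return PLACEHOLDER.findall(markdown)


def placeholder_share(markdown: str) -> float:
    """Fraction of non-empty lines that consist only of an image placeholder."""
    lines = [line.strip() for line in markdown.splitlines() if line.strip()]
    if not lines:
        return 0.0
    only_placeholder = sum(1 for line in lines if PLACEHOLDER.fullmatch(line))
    return only_placeholder / len(lines)


sample = "Revenue by quarter\n\n[image](image_002)\n\nSee chart above.\n![img](image_003)"
print(find_image_placeholders(sample))
print(placeholder_share(sample))
```

A high share on chart-heavy documents would be consistent with the observation that figures and some tables come back as references rather than text.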

Thaxll · 2025-03-07T02:14:06 1741313646

Do you benchmark the right thing though? It seems to focus a lot on image / charts etc...

The 95% from their benchmark: "we evaluate them on our internal “text-only” test-set containing various publication papers, and PDFs from the web; below:"

Text only.

kergonath · 2025-03-06T18:37:04 1741286224

Excellent. I am looking forward to it.

cdolan · 2025-03-06T18:52:49 1741287169

Came here to see if you all had run a benchmark on it yet :)

WhitneyLand · 2025-03-06T18:39:32 1741286372

It’s interesting that none of the existing models can decode a Scrabble board screen shot and give an accurate grid of characters.

I realize it’s not a common business case; I came across it while testing how well LLMs can solve simple games. On a side note, if you bypass OCR and give models a text layout of a board, standard LLMs cannot solve Scrabble boards but the thinking models usually can.

xnx · 2025-03-06T18:25:53 1741285553

https://huggingface.co/spaces/echo840/ocrbench-leaderboard

ChemSpider · 2025-03-06T18:34:13 1741286053

Interesting. But no mistral on it yet?

resource_waste · 2025-03-06T21:33:37 1741296817

It's Mistral, they are the only homegrown AI Europe has, so people pretend they are meaningful.

I'll give it a try, but I'm not holding my breath. I'm a huge AI Enthusiast and I've yet to be impressed with anything they've put out.

neom · 2025-03-06T21:47:25 1741297645

I gave it a bunch of my wife's 18th century English scans to transcribe. It mostly couldn't do it, and it's been doing this for 15 minutes now. Not sure why, but I find it quite amusing: https://share.zight.com/L1u2jZYl

michaelbuckbee · 2025-03-07T02:25:46 1741314346

I'd mentioned this on HN last month, but I took a picture of a grocery list and then pasted it into ChatGPT to have it written out and it worked flawlessly...until I discovered that I'd messed up the picture when I took it at an angle and had accidentally cut off the first character or two of the bottom half of the list.

ChatGPT just inferred that I wanted the actual full names of the items (aka "flour" instead of "our").

Depending on how you feel about it, this is either an absolute failure of OCR or wildly useful and much better.

t_sea · 2025-03-07T06:24:48 1741328688

They really went for it with the hieroglyphs opening.

SilentM68 · 2025-03-06T18:35:07 1741286107

I would like to see how it performs with massively warped and skewed scanned text images, basically a scanned image where the text lines are wavy as opposed to straight and horizontal, where the letters are elongated, and where the line widths differ depending on the position on the scanned image. I once had to deal with such a task: somebody gave me a document that OCR software, Acrobat, and other tools could not decode, so I had to recreate the 30 pages myself, manually. Not a fun thing to do, but that is a real use case.

amelius · 2025-03-06T23:05:25 1741302325

Are you trying to build a captcha solver?

SilentM68 · 2025-03-06T23:15:23 1741302923

No, not a captcha solver. When I worked in education, I was given a 90s paper document that a teacher needed OCR'd, but it was completely warped. It was my job to remediate those types of documents for accessibility reasons. I had to scan and OCR it, but the result was garbage. Mind you, I had access to Windows, Linux and macOS tools, but it was still difficult to do. I had to guess what it said, which was not impossible but was time-consuming and not doable in the time frame I was given, so I had no option but to manually retype all the information into a new document and convert it that way. Document remediation and accessibility should be a good use case for A.I. in education.

arcfour · 2025-03-06T18:52:16 1741287136

Garbage in, garbage out?

edude03 · 2025-03-06T18:55:55 1741287355

"Yes" but if a human could do it "AI" should be able to do it too.

cxie · 2025-03-06T18:27:56 1741285676

The new Mistral OCR release looks impressive - 94.89% overall accuracy and significantly better multilingual support than competitors. As someone who's built document processing systems at scale, I'm curious about the real-world implications.

Has anyone tried this on specialized domains like medical or legal documents? The benchmarks are promising, but OCR has always faced challenges with domain-specific terminology and formatting.

Also interesting to see the pricing model ($1/1000 pages) in a landscape where many expected this functionality to eventually be bundled into base LLM offerings. This feels like a trend where previously encapsulated capabilities are being unbundled into specialized APIs with separate pricing.

I wonder if this is the beginning of the componentization of AI infrastructure - breaking monolithic models into specialized services that each do one thing extremely well.

epolanski · 2025-03-06T18:45:15 1741286715

At my client we want to provide an AI that can retrieve relevant information from documentation (home building business, documents detail how to install a solar panel or a shower, etc) and we've set up an entire system with benchmarks, agents, etc, yet the bottleneck is OCR!

We have millions and millions of pages of documents, and even a 1% OCR error rate compounds with the AI's own error, which compounds with the documentation itself being incorrect at times. The result is not production ready (and indeed the project has never been released), not even close.

We simply cannot afford to give our customers incorrect information.

We have set up a backoffice app so that when users have questions, it sends them to our workers along with the response given by our AI application, and the person can review it and ideally correct the OCR output.

Honestly, after a year of working on this, it feels like AI right now can only be useful when supervised all the time (such as when coding). Otherwise I just find LLMs still too unreliable besides basic bogus tasks.

PeterStuer · 2025-03-06T18:56:00 1741287360

As someone who has had a home built, and nearly all my friends and acquaintances report the same thing, having a 1% error on information in this business would mean not a 10x but a 50x improvement over the current practice in the field.

If nobody is supervising building documents all the time during the process, every house would be a pile of rubbish. And even when you do, stuff still creeps in and has to be redone, often more than once.

themanmaran · 2025-03-06T18:53:14 1741287194

Excited to test this out on our side as well. We recently built an OCR benchmarking framework specifically for VLMs[1][2], so we'll do a test run today.

From our last benchmark run, some of these numbers from Mistral seem a little bit optimistic. Side by side of a few models:

model  | omni | mistral |
gemini | 86%  | 89%     |
azure  | 85%  | 89%     |
gpt-4o | 75%  | 89%     |
google | 68%  | 83%     |

Currently adding the Mistral API and we'll get results out today!

[1] https://github.com/getomni-ai/benchmark

[2] https://huggingface.co/datasets/getomni-ai/ocr-benchmark

themanmaran · 2025-03-06T23:11:23 1741302683

Update: Just ran our benchmark on the Mistral model and results are.. surprisingly bad?

Mistral OCR:

- 72.2% accuracy

- $1/1000 pages

- 5.42s / page

Which is a pretty far cry from the 95% accuracy they were advertising from their private benchmark. The biggest thing I noticed is how it skips anything it classifies as an image/figure. So charts, infographics, some tables, etc. all get lifted out and returned as [image](image_002). Compared to the other VLMs that are able to interpret those images into a text representation.

https://github.com/getomni-ai/benchmark

https://huggingface.co/datasets/getomni-ai/ocr-benchmark

https://getomni.ai/ocr-benchmark

jaggs · 2025-03-06T21:40:47 1741297247

By optimistic, do you mean 'tweaked'? :)

janalsncm · 2025-03-06T18:41:02 1741286462

I have done OCR on leases. It’s hard. You have to be accurate and they all have bespoke formatting.

It would almost be easier to switch everyone to a common format and spell out important entities (names, numbers) multiple times similar to how cheques do.

The utility of the system really depends on the makeup of that last 5%. If problematic documents are consistently predictable, it’s possible to do a second pass with humans. But if they’re random, then you have to do every doc with humans and it doesn’t save you any time.
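The "second pass with humans" idea can be sketched as a simple triage step. This assumes the OCR system exposes some per-document confidence score, which is a hypothetical field here; the names `OcrResult`, `confidence`, and the 0.9 threshold are illustrative, not from any specific product. As the comment notes, this only saves time if low confidence actually predicts the problematic documents.

```python
from dataclasses import dataclass


@dataclass
class OcrResult:
    doc_id: str
    text: str
    confidence: float  # hypothetical score in [0, 1] reported by the OCR step


def triage(results: list[OcrResult], threshold: float = 0.9) -> tuple[list[str], list[str]]:
    """Split documents into auto-accepted ids and ids queued for human review."""
    auto_accept: list[str] = []
    needs_review: list[str] = []
    for result in results:
        if result.confidence >= threshold:
            auto_accept.append(result.doc_id)
        else:
            needs_review.append(result.doc_id)
    return auto_accept, needs_review


batch = [
    OcrResult("lease-001", "...", 0.98),
    OcrResult("lease-002", "...", 0.71),
    OcrResult("lease-003", "...", 0.93),
]
accepted, review_queue = triage(batch)
print(accepted, review_queue)
```

If the errors are effectively random with respect to the score, the review queue catches nothing useful and every document still needs a human, which is the failure mode described above.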

PeterStuer · 2025-03-06T18:40:43 1741286443

I'd love to try it for my domain (regulation), but $1/1000 pages is significantly more expensive than my current local Docling based setup that already does a great job of processing PDF's for my needs.

yawnxyz · 2025-03-06T19:41:05 1741290065

I think for regulated fields / high impact fields $1/1000 is well-worth the price; if the accuracy is close to 100% this is way better than using people, who are still error-prone

kbyatnal · 2025-03-06T19:40:05 1741290005

re: real world implications, LLMs and VLMs aren't magic, and anyone who goes in expecting 100% automation is in for a surprise (especially in domains like medical or legal).

IMO there's still a large gap for businesses in going from raw OCR outputs —> document processing deployed in prod for mission-critical use cases.

e.g. you still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation over time, but it's going to take time and effort.

But for RAG and other use cases where the error tolerance is higher, I do think these OCR models will get good enough to just solve that part of the problem.

Disclaimer: I started a LLM doc processing company to help companies solve problems in this space (https://extend.app/)

kergonath · 2025-03-06T18:35:15 1741286115

> Has anyone tried this on specialized domains like medical or legal documents?

I’ll try it on a whole bunch of scientific papers ASAP. Quite excited about this.

janis1234 · 2025-03-06T20:14:36 1741292076

$1 for 1000 pages seems high to me. Doing a google search

Rent and Reserve NVIDIA A100 GPU 80GB - Pricing Starts from $1.35/hour

I just don't know if in 1 hour and with an A100 I can process more than 1000 pages. I'm guessing yes.
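The comparison above comes down to a break-even throughput. A minimal sketch of the arithmetic, using only the two prices quoted in the comment ($1.35/hour for the GPU, $1 per 1000 pages for the API); it ignores everything else (whether the model weights are available at all, setup time, idle GPU hours), so treat it as a lower bound on what self-hosting would need to achieve.

```python
def break_even_pages_per_hour(gpu_dollars_per_hour: float,
                              api_dollars_per_1000_pages: float) -> float:
    """Pages per hour a rented GPU must process to match the API's per-page price."""
    return gpu_dollars_per_hour / api_dollars_per_1000_pages * 1000


pages_per_hour = break_even_pages_per_hour(1.35, 1.0)
seconds_per_page = 3600 / pages_per_hour

print(f"break-even: {pages_per_hour:.0f} pages/hour")
print(f"that is one page every {seconds_per_page:.2f} seconds")
```

With these prices the break-even works out to about 1350 pages per hour, or roughly one page every 2.7 seconds.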

blackoil · 2025-03-06T20:25:36 1741292736

Is the model Open Source/Weight? Else the cost is for the model, not GPU.

salynchnew · 2025-03-06T18:36:47 1741286207

Also interesting to see that parts of the training infrastructure used to create frontier models are themselves being monetized.

amelius · 2025-03-06T23:26:19 1741303579

> 94.89% overall accuracy

There are about 47 characters on average in a sentence. So does this mean it gets around 2 or 3 mistakes per sentence?
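The back-of-the-envelope estimate above can be checked directly. This sketch assumes the 94.89% figure is a per-character accuracy and that errors are independent; neither is stated in the source, and the benchmark metric may well be defined differently.

```python
def expected_errors(accuracy: float, chars: int) -> float:
    """Expected number of wrong characters in a span of `chars` characters."""
    return (1.0 - accuracy) * chars


def prob_error_free(accuracy: float, chars: int) -> float:
    """Probability that every character is right, assuming independent errors."""
    return accuracy ** chars


ACCURACY = 0.9489        # figure quoted from the announcement
SENTENCE_CHARS = 47      # average sentence length used in the comment

print(f"expected errors per sentence: {expected_errors(ACCURACY, SENTENCE_CHARS):.2f}")
print(f"chance of an error-free sentence: {prob_error_free(ACCURACY, SENTENCE_CHARS):.1%}")
```

Under those assumptions the expected count is 0.0511 × 47 ≈ 2.4 errors per sentence, which matches the "2 or 3" guess.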

stavros · 2025-03-06T18:35:23 1741286123

What do you mean by "free"? Using the OpenAI vision API, for example, for OCR is quite a bit more expensive than $1/1k pages.

unboxingelf · 2025-03-06T18:31:44 1741285904

We’ll just stick LLM Gateway LLM in front of all the specialized LLMs. MicroLLMs Architecture.

cxie · 2025-03-06T18:38:31 1741286311

I actually think you're onto something there. The "MicroLLMs Architecture" could mirror how microservices revolutionized web architecture.

Instead of one massive model trying to do everything, you'd have specialized models for OCR, code generation, image understanding, etc. Then a "router LLM" would direct queries to the appropriate specialized model and synthesize responses.

The efficiency gains could be substantial - why run a 1T parameter model when your query just needs a lightweight OCR specialist? You could dynamically load only what you need.

The challenge would be in the communication protocol between models and managing the complexity. We'd need something like a "prompt bus" for inter-model communication with standardized inputs/outputs.

Has anyone here started building infrastructure for this kind of model orchestration yet? This feels like it could be the Kubernetes moment for AI systems.
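The router idea above can be sketched as a dispatch table. Everything here is hypothetical: the specialist functions are stubs, and `classify` is a keyword rule standing in for the "router LLM" so the example runs without any model. A real system would replace both.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


# Stub specialists; in practice each would call a separate model or service.
def ocr_specialist(query: str) -> str:
    return f"[ocr] {query}"


def code_specialist(query: str) -> str:
    return f"[code] {query}"


def general_model(query: str) -> str:
    return f"[general] {query}"


ROUTES: Dict[str, Handler] = {
    "ocr": ocr_specialist,
    "code": code_specialist,
}


def classify(query: str) -> str:
    """Keyword stand-in for the router LLM: pick a route label for the query."""
    lowered = query.lower()
    if any(word in lowered for word in ("scan", "pdf", "image")):
        return "ocr"
    if any(word in lowered for word in ("function", "bug", "compile")):
        return "code"
    return "general"


def route(query: str) -> str:
    """Send the query to the matching specialist, falling back to the general model."""
    handler = ROUTES.get(classify(query), general_model)
    return handler(query)


print(route("Extract the text from this scan"))
print(route("Fix the bug in my function"))
print(route("What is the capital of France?"))
```

The shared `Handler` signature plays the role of the "standardized inputs/outputs" mentioned above; the hard parts (synthesis across specialists, loading models on demand) are not addressed by this sketch.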

fnordpiglet · 2025-03-06T19:07:37 1741288057

I’m doing this personally for my own project - essentially building an agent graph that starts with the image output, orients and cleans, does a first pass with tesseract LSTM best models to create PDF/HOCR/Alto, then pass to other LLMs and models based on their strengths to further refine towards markdown and latex. My goal is less about RAG database population but about preserving in a non manually typeset form the structure and data and analysis, and there seems to be pretty limited tooling out there since the goal generally seems to be the obviously immediately commercial goal of producing RAG amenable forms that defer the “heavy” side of chart / graphic / tabular reproduction to a future time.

arcfour · 2025-03-06T18:50:00 1741287000

This is already done with agents. Some agents only have tools and the one model, some agents will orchestrate with other LLMs to handle more advanced use cases. It's a pretty obvious solution when you think about how to get good performance out of a model on a complex task when useful context length is limited: just run multiple models with their own context and give them a supervisor model, just like how humans organize themselves in real life.

unboxingelf · 2025-03-06T23:46:16 1741304776

Take a look at MCP, Model Context Protocol.

s4i · 2025-03-06T19:57:57 1741291077

I wonder how good it would be to convert sheet music to MusicXML. All the current tools more or less suck with this task, or maybe I’m just ignorant and don’t know what lego bricks to put together.

adrianh · 2025-03-06T20:18:36 1741292316

Try our machine-learning powered sheet music scanning engine at Soundslice:

https://www.soundslice.com/sheet-music-scanner/

Definitely doesn't suck.

sunami-ai · 2025-03-06T18:37:55 1741286275

Making Transformers the same cost as CNNs (which are used in character-level OCR, as opposed to image-patch-level) is a good thing. The problem with CNN-based character-level OCR is not the recognition models but the detection models. In a former life, I found a way to increase detection accuracy, and, therefore, overall OCR accuracy, and used that as an enhancement on top of Amazon and Google OCR. It worked really well. But the transformer approach is more powerful and if it can be done for $1 per 1000 pages, that is a game changer, IMO, at least for incumbents offering traditional character-level OCR.

menaerus · 2025-03-06T18:45:42 1741286742

It certainly isn't the same cost if expressed as a non-subsidized $$$ one needs for the Transformers compute aka infra.

CNNs trained specifically for OCR can run in real time on compute as small as a mobile device.

anon373839 · 2025-03-06T23:32:07 1741303927

A bit of a tangent, but aren’t CNNs still dominating over ViTs among computer vision competition winners?

yoelhacks · 2025-03-07T01:16:37 1741310197

I was curious about Mistral so I made a few visualizations.

A high level diagram w/ links to files: https://eraser.io/git-diagrammer?diagramId=uttKbhgCgmbmLp8OF...

Specific flow of an OCR request: https://eraser.io/git-diagrammer?diagramId=CX46d1Jy5Gsg3QDzP...

(Disclaimer - uses a tool I've been working on)

simonw · 2025-03-07T02:48:50 1741315730

I built a CLI script for feeding PDFs into this API - notes on that and my explorations of Mistral OCR here: https://simonwillison.net/2025/Mar/7/mistral-ocr/

TriangleEdge · 2025-03-06T18:19:50 1741285190

One of my hobby projects while in University was to do OCR on book scans. Doing character recognition was solved, but finding the relationship between characters was very difficult. I tried "primitive" neural nets, but edge cases would often break what I built. Super cool to me to see such an order of magnitude in improvement here.

Does it do handwritten notes and annotations? What about meta information like highlighting? I am also curious if LLMs will get better because of more access to information, if it can be effectively extracted from PDFs.

jcuenod · 2025-03-06T18:22:11 1741285331

* Character recognition on monolingual text in a narrow domain is solved

z2 · 2025-03-06T18:25:20 1741285520

Is there a reliable handwriting OCR benchmark out there (updated, not a blog post)? Despite the gains claimed for printed text, I found (anecdotally) that Mistral OCR on my messy cursive handwriting was much less accurate than GPT-4o, in the ballpark of 30% wrong vs closer to 5% wrong for GPT-4o.

Edit: answered in another post: https://huggingface.co/spaces/echo840/ocrbench-leaderboard

dannyobrien · 2025-03-06T18:29:05 1741285745

Simon Willison linked to an impressive demo of Qwen2-VL in this area: I haven't found a version of it that I could run locally yet to corroborate. https://simonwillison.net/2024/Sep/4/qwen2-vl/

oysterville · 2025-03-06T19:10:46 1741288246

Dupe of a post from an hour earlier: https://news.ycombinator.com/item?id=43282489

qwertox · 2025-03-06T19:54:14 1741290854

We developers seem to really dislike PDFs, to a degree that we'll build LLMs and have them translate it into Markdown.

Jokes aside, PDFs really serve a good purpose, but getting data out of them is usually really hard. They should have something like an embedded Markdown version with a JSON structure describing the layout, so that machines can easily digest the data they contain.

siva7 · 2025-03-07T04:12:43 1741320763

> We developers seem to really dislike PDFs, to a degree that we'll build LLMs and have them translate it into Markdown.

Why "jokes aside"? Markdown/HTML is better suited for the web than PDF.

jgalt212 · 2025-03-06T20:01:15 1741291275

I think you might be looking for PDF/A.

https://www.adobe.com/uk/acrobat/resources/document-files/pd...

For example, if you print a word doc to PDF, you get the raw text in PDF form, not an image of the text.

gpvos · 2025-03-06T21:20:13 1741296013

PDF/A doesn't require preserving the document structure, only that any text is extractable.

climb_stealth · 2025-03-06T20:10:57 1741291857

Does this support Japanese? They list a table of language comparisons against other approaches but I can't tell if it is exhaustive.

I'm hoping that something like this will be able to handle 3000-page Japanese car workshop manuals. Because traditional OCR really struggles with it. It has tables, graphics, text in graphics, the whole shebang.

dotnetkow · 2025-03-06T22:11:29 1741299089

Congrats to the Mistral team for launching! A general-purpose OCR model is useful, of course. However, more purpose-built solutions are a must to convert business documents reliably. AI models pre-trained on specific document types perform better and are more accurate. Coming soon from the ABBYY team, we're shipping a new OCR API designed to be consistent, reliable, and hallucination-free. Check it out if you're looking for best-in-class DX: https://digital.abbyy.com/code-extract-automate-your-new-mus...

protonbob · 2025-03-06T20:03:22 1741291402

Wow this basically "solves" DRM for books as well as opening up the door for digitizing old texts more accurately.

bsnnkv · 2025-03-06T19:43:10 1741290190

Someone working there has good taste to include a Nizar Qabbani poem.

sixhobbits · 2025-03-06T20:15:38 1741292138

Nice demos but I wonder how well it does on longer files. I've been experimenting with passing some fairly neat PDFs to various LLMs for data extraction. They're created from Excel exports and some of the data is cut off or badly laid out, but it's all digitally extractable.

The challenge isn't so much the OCR part, but just the length. After one page the LLMs get "lazy" and just skip bits or stop entirely.

And page by page isn't trivial as header rows are repeated or missing etc.

So far my experience has definitely been that the last 2% of the content still takes the most time to accurately extract for large messy documents, and LLMs still don't seem to have a one-shot solve for that. Maybe this is it?

hack_ml · 2025-03-06T20:45:53 1741293953

You will have to send one page at a time, most of this work has to be done via RAG. Adding a large context (like a whole PDF), still does not work that well in my experience.

andoando · 2025-03-06T18:18:18 1741285098

Bit unrelated but is there anything that can help with really low resolution text? My neighbor got hit and run the other day for example, and I've been trying every tool I can to make out some of the letters/numbers on the plate

https://ibb.co/mr8QSYnj

zinglersen · 2025-03-06T18:24:46 1741285486

Finding the right subreddit and asking there is probably a better approach if you want to maximize the chances of getting the plate 'decrypted'.

rvnx · 2025-03-06T18:35:30 1741286130

If it’s a video, sharing a few frames can help as well

dewey · 2025-03-06T18:30:09 1741285809

To even get started on this you'd also need to share some contextual information like continent, country etc. I'd say.

andoando · 2025-03-06T20:50:03 1741294203

It's in CA, looks like paper plates, which follow a specific format, and the last two characters seem to be the numbers '64'. Police should be able to search for a temp tag with a partial match and match the make/model. Was curious to see if any software could help though.

flutas · 2025-03-06T18:23:50 1741285430

Looks like a paper temp tag. Other than that, I'm not sure much can be had from it.

busymom0 · 2025-03-06T18:21:01 1741285261

There are photo enhancers online. But your picture is way too pixelated to get any useful info from it.

tjoff · 2025-03-06T18:24:27 1741285467

If you know the font in advance (which you often do in these cases) you can do insane reconstructions. Also keep in mind that it doesn't have to be a perfect match, with the help of the color and other facts (such as likely location) about the car you can narrow it down significantly.

zellyn · 2025-03-06T18:25:32 1741285532

Maybe if you had multiple frames, and used something very clever?

notepad0x90 · 2025-03-06T18:16:41 1741285001

I was just watching a science-related video containing math equations. I wondered how soon will I be able to ask the video player "What am I looking at here, describe the equations" and it will OCR the frames, analyze them and explain them to me.

It's only a matter of time before "browsing" means navigating HTTP sites via LLM prompts. Although, I think it is critical that LLM input should NOT be restricted to verbal cues. Not everyone is an extrovert who longs to hear the sound of their own voice. A lot of human communication is non-verbal.

Once we get over the privacy implications (and I do believe this can only be done by worldwide legislative efforts), I can imagine looking at a "website" or video, and my expressions, mannerisms and gestures will be considered prompts.

At least that is what I imagine the tech would evolve into in 5+ years.

abrichr · 2025-03-06T18:20:50 1741285250

> I wondered how soon will I be able to ask the video player "What am I looking at here, describe the equations" and it will OCR the frames, analyze them and explain them to me.

Seems like https://aiscreenshot.app might fit the bill.

devmor · 2025-03-06T18:20:08 1741285208

Good lord, I dearly hope not. That sounds like a coddled hellscape world, something you'd see made fun of in Disney's Wall-E.

notepad0x90 · 2025-03-06T18:36:12 1741286172

hence my comment about privacy and need for legislation :)

It isn't the tech that's the problem but the people that will abuse it.

devmor · 2025-03-06T18:40:29 1741286429

While those are concerns, my point was that having everything on the internet navigated to, digested and explained to me sounds unpleasant and overall a drain on my ability to think and reason for myself.

It is specifically how you describe using the tech that provokes a feeling of revulsion to me.

notepad0x90 · 2025-03-06T20:35:51 1741293351

Then I think you misunderstand. The ML system would know when you want things digested for you or not. Right now companies are assuming this and forcing LLM interaction. But when properly done, the system would know based on your behavior or explicit prompts what you want and provide the service. If you're staring at a paragraph intently and confused, it might start highlighting common phrases or parts of the text/picture that might be hard to grasp, and based on your reaction to that, it might start describing things via audio, tooltips, side pane, etc. In other words, if you don't like how and when you're interacting with the LLM ecosystem, then that is an immature and failing ecosystem. In my vision this would be a largely solved problem, like how we interact with keyboards, mice and touchscreens today.

devmor · 2025-03-06T23:05:32 1741302332

No, I fully understand.

I am saying that this type of system, that deprives the user of problem solving, is itself a problem. A detriment to the very essence of human intelligence.

groby_b · 2025-03-06T18:24:49 1741285489

Now? OK, you need to screencap and upload to LLM, but that's well established tech by now. (Where by "well established", I mean at least 9 months old ;)

Same goes for "navigating HTTP sites via LLM prompts". Most LLMs have web search integration, and the "Deep Research" variants do more complex navigation.

Video chat is there partially, as well. It doesn't really pay much attention to gestures & expressions, but I'd put the "earliest possible" threshold for that a good chunk closer than 5 years.

notepad0x90 · 2025-03-06T18:37:32 1741286252

Yeah, all these things are possible today, but getting them well polished and integrated is another story. Imagine all this being supported by "HTML6" lol. When Apple gets around to making this part of Safari, then we know it's ready.

groby_b · 2025-03-06T20:44:48 1741293888

That's a great upper-bound estimator ;)

But kidding aside - I'm not sure people want this being supported by web standards. We could be a huge step closer to that future had we decided to actually take RDF/Dublin Core/Microdata seriously. (LLMs perform a lot better with well-annotated data)

The unanimous verdict across web publishers was "looks like a lot of work, let's not". That is, ultimately, why we need to jump through all the OCR hoops. Not only did the world not annotate the data, it then proceeded to remove as many traces of machine readability as possible.

So, the likely gating factor is probably not Apple & Safari & "HTML6" (shudder!)

If I venture my best bet what's preventing polished integration: It's really hard to do via foundational models only, and the number of people who want to have deep & well-informed conversations via a polished app enough that they're willing to pay for an app that does that is low enough that it's not the hot VC space. (Yet?)

Crystal ball: Some OSS project will probably get within spitting distance of something really useful, but also probably flub the UX. Somebody else will take up these ideas while it's hot and polish it in a startup. So, 18-36 months for an integrated experience from here?

hubraumhugo · 2025-03-06T19:17:19 1741288639

It will be interesting to see how all the companies in the document processing space adapt as OCR becomes a commodity.

The best products will be defined by everything "non-AI", like UX, performance and reliability at scale, and human-in-the-loop feedback for domain experts.

trollied · 2025-03-06T19:45:38 1741290338

They will offer integrations into enterprise systems, just like they do today.

Lots of big companies don't like change. The existing document processing companies will just silently start using this sort of service to up their game, and keep their existing relationships.

hyuuu · 2025-03-06T20:18:29 1741292309

I 100% agree with this. I think you can even extend it to any AI: in the end, IMO, as LLMs become more commoditized, the surface through which the value is delivered will matter more.

low_tech_punk · 2025-03-06T23:50:11 1741305011

This might be a contrarian take: the improvement against gpt-4o and gemini-1.5 flash, both of which are general-purpose multi-modal models, seems to be underwhelming.

I'm sensing another bitter lesson coming, where domain-optimized AI will hold a short-term advantage but will be outdated quickly as frontier models advance.

Oras · 2025-03-06T20:22:33 1741292553

I feel this is created for RAG. I tried a document [0] that I tested with OCR; it got all the table values correctly, but the page's footer was missing.

Headers and footers are a real pain with RAG applications: they are not needed, yet most OCR or PDF parsers return them, so there is extra work to do to remove them.

[0] https://github.com/orasik/parsevision/blob/main/example/Mult...
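
To make the "extra work" concrete, here is a minimal sketch of one common heuristic: treat any line that recurs at the top or bottom of most pages as a header or footer and drop it. The function name `strip_repeated_edges` and the `edge_lines` / `min_share` parameters are made up for illustration, and it assumes you already have one plain-text string per page. It will not catch footers that change per page (e.g. "Page 3 of 10").

```python
from collections import Counter


def strip_repeated_edges(pages, edge_lines=2, min_share=0.6):
    """Drop lines that recur at the top or bottom edge of most pages."""

    def edge_indexes(lines):
        # Indexes of the first/last `edge_lines` non-empty lines on a page.
        idx = [i for i, ln in enumerate(lines) if ln.strip()]
        return set(idx[:edge_lines] + idx[-edge_lines:])

    split_pages = [page.splitlines() for page in pages]

    # Count each distinct edge line once per page.
    counts = Counter()
    for lines in split_pages:
        counts.update({lines[i].strip() for i in edge_indexes(lines)})

    # A line counts as header/footer if it shows up on enough pages.
    threshold = max(2, int(min_share * len(pages)))
    repeated = {text for text, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in split_pages:
        edges = edge_indexes(lines)
        kept = [
            ln for i, ln in enumerate(lines)
            if not (i in edges and ln.strip() in repeated)
        ]
        cleaned.append("\n".join(kept))
    return cleaned
```

Only edge positions are filtered, so a body line that happens to match a header is left alone.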

kccqzy · 2025-03-07T02:35:50 1741314950

I have an actually hard OCR exercise for an AI model: I take this image of Chinese text on one of the memorial stones on the Washington Monument https://www.nps.gov/articles/american-mission-ningpo-china-2... and ask the model to do OCR. Not a single model I've seen can OCR this correctly. Mistral is especially bad here: it gets stuck in an endless loop of nonsensical hallucinated text. Insofar as Mistral is designed for "preserving historical and cultural heritage", it can't do that very well yet.

A good model can recognize that the text is written top to bottom and then right to left and perform OCR in that direction. Apple's Live Text can do that, though it makes plenty of mistakes otherwise. Mistral is far from that.

pqdbr · 2025-03-06T18:37:53 1741286273

I tried with both PDFs and PNGs in Le Chat and the results were the worst I've ever seen when compared to any other model (Claude, ChatGPT, Gemini).

So bad that I think I need to enable the OCR function somehow, but couldn't find it.

troyvit · 2025-03-06T21:10:12 1741295412

It worked perfectly for me with a simple 2-page PDF that contained no graphics or formatting beyond headers and list items. Since it was so small I had the time to proofread it and there were no errors. It added some formatting, such as bolding headers in list items and putting backticks around file and function names. I won't complain.

computergert · 2025-03-06T20:43:13 1741293793

I'm experiencing the same. Maybe the sentence "Mistral OCR capabilities are free to try on le Chat." was a hallucination.

kapitalx · 2025-03-06T18:44:06 1741286646

Co-founder of doctly.ai here (OCR tool)

I love mistral and what they do. I got really excited about this, but a little disappointed after my first few tests.

I tried a complex table that we use as a first test of any new model, and Mistral OCR decided the entire table should just be extracted as an 'image' and returned this markdown:

``` ![img-0.jpeg](img-0.jpeg) ```

I'll keep testing, but so far, very disappointing :(

The document I tried is the entire reason we created Doctly to begin with. We needed an OCR tool for regulatory documents we use and nothing could really give us the right data.

Doctly uses a judge: it OCRs a document with multiple LLMs and decides which output to pick. It keeps re-running the page until the judge's score is above a certain threshold.
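
A rough sketch of what such a judge loop could look like, not Doctly's actual implementation. Everything here is an assumption for illustration: `engines` are plain callables standing in for the different LLM OCR backends, `judge` is a callable returning a numeric score, and `max_rounds` is a cap added so the loop always terminates.

```python
def ocr_with_judge(page, engines, judge, min_score=0.9, max_rounds=3):
    """Run every engine on the page, keep the best-scoring text,
    and retry until the judge's score clears `min_score`.

    `engines`: callables page -> text (hypothetical OCR backends).
    `judge`: callable (page, text) -> float (hypothetical scorer).
    """
    best_text, best_score = None, float("-inf")
    for _ in range(max_rounds):
        for engine in engines:
            text = engine(page)
            score = judge(page, text)
            if score > best_score:
                best_text, best_score = text, score
        if best_score >= min_score:
            break  # good enough, stop re-running the page
    return best_text, best_score
```

Retrying only helps if the engines are non-deterministic (or the prompt changes between rounds); with deterministic engines a single round is enough.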

I would have loved to add this into the judge list, but might have to skip it.

bambax · 2025-03-06T19:28:03 1741289283

Where did you test it? At the end of the post they say:

> Mistral OCR capabilities are free to try on le Chat

but when asked, Le Chat responds:

> can you do ocr?

> I don't have the capability to perform Optical Character Recognition (OCR) directly. However, if you have an image with text that you need to extract, you can describe the text or provide details, and I can help you with any information or analysis related to that text. If you need OCR functionality, you might need to use a specialized tool or service designed for that purpose.

Edit: Tried anyway by attaching an image; it said it could do OCR and then output... completely random text that had absolutely nothing to do with the text in the image!... Concerning.

Tried again with a higher-resolution image; it output only the first twenty words or so of the page.

Did you try using the API?

kapitalx · 2025-03-06T19:41:05 1741290065

Yes I used the API. They have examples here:

https://docs.mistral.ai/capabilities/document/

I used base64 encoding of the image of the PDF page. The output was an object that has the markdown, and coordinates for the images:

[OCRPageObject(index=0, markdown='![img-0.jpeg](img-0.jpeg)', images=[OCRImageObject(id='img-0.jpeg', top_left_x=140, top_left_y=65, bottom_right_x=2136, bottom_right_y=1635, image_base64=None)], dimensions=OCRPageDimensions(dpi=200, height=1778, width=2300))] model='mistral-ocr-2503-completion' usage_info=OCRUsageInfo(pages_processed=1, doc_size_bytes=634209)
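
For anyone who wants to catch this failure mode automatically, a small sketch that flags a page whose markdown is nothing but image placeholders, as in the output above. It works on the markdown string only and makes no API calls; the helper name `is_image_only` and the regex are made up for illustration and only cover the plain `![alt](target)` image syntax.

```python
import re

# Matches plain markdown image references such as ![img-0.jpeg](img-0.jpeg)
IMAGE_REF = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def is_image_only(markdown: str) -> bool:
    """True if the page has at least one image reference and no other text."""
    if not IMAGE_REF.search(markdown):
        return False
    remainder = IMAGE_REF.sub("", markdown)
    return remainder.strip() == ""
```

A page flagged this way could be re-sent to another OCR engine instead of being accepted as-is.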
