
Do any LLM OCRs give bounding boxes anyway? Per character and per block.


Gemini does, but it's not as good as Google Vision, and the format is different. Here is the documentation: https://cloud.google.com/vertex-ai/generative-ai/docs/boundi...

Also, Simon Willison wrote a blog post that might be helpful: https://simonwillison.net/2024/Aug/26/gemini-bounding-box-vi...

I hope this capability improves so I can use only the Gemini API.
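On the format difference: as far as I understand it, Gemini returns each box as [ymin, xmin, ymax, xmax] normalized to a 0-1000 grid, rather than pixel vertices. Treat that as an assumption and check the docs linked above. Here is a rough sketch of converting one box to pixel coordinates; the function name and the `scale` parameter are mine, not part of any API.

```python
from typing import Sequence, Tuple


def gemini_box_to_pixels(
    box: Sequence[float],
    width: int,
    height: int,
    scale: int = 1000,  # assumed normalization grid (0..1000)
) -> Tuple[int, int, int, int]:
    """Convert an assumed [ymin, xmin, ymax, xmax] box on a 0..scale grid
    to (left, top, right, bottom) in pixels for an image of the given size."""
    ymin, xmin, ymax, xmax = box
    left = round(xmin * width / scale)
    top = round(ymin * height / scale)
    right = round(xmax * width / scale)
    bottom = round(ymax * height / scale)
    return left, top, right, bottom


# Example: a box on a 2000x1000 image.
print(gemini_box_to_pixels([100, 200, 500, 800], width=2000, height=1000))
```

Note the y-before-x ordering: if the boxes come out transposed when you draw them, that is the first thing to check.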


Try MinerU 2.5 with two-step parsing. It gives good results with bounding boxes per block. Not sure if you can get it to go more fine-grained, such as word or character level.



