
Hmm, at first I was thinking "why OCR?", but maybe the reason is to ingest more types of training data for LLM improvement, e.g. scanned academic papers? I imagine all the frontier labs have a solution for this due to the value of academic papers as a data source.

Edit: Oh I see the paper abstract says this explicitly: "In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G)". This is just part of the training data ingestion pipeline for their real models. Explains why the architecture is not using all of their latest tricks: it's already good enough for their use case and it's not the main focus.
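For a sense of scale, here is a back-of-envelope sketch that takes the abstract's 200k pages/day figure at face value. The 10-million-page corpus is a made-up number for illustration, not something from the paper, and the helper names are hypothetical.

```python
# Rough throughput arithmetic based on the "200k+ pages per day" figure
# quoted from the abstract (single A100-40G). Treated as a lower bound.
PAGES_PER_DAY = 200_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def pages_per_second(pages_per_day=PAGES_PER_DAY):
    """Average sustained pages per second implied by a daily rate."""
    return pages_per_day / SECONDS_PER_DAY


def gpu_days(total_pages, pages_per_day=PAGES_PER_DAY):
    """GPU-days needed for a corpus, assuming the daily rate holds."""
    return total_pages / pages_per_day


if __name__ == "__main__":
    print(round(pages_per_second(), 2))  # 2.31
    # Hypothetical corpus size, chosen only for illustration.
    print(gpu_days(10_000_000))  # 50.0
```

So one GPU averages a bit over two pages per second, and throughput should scale roughly linearly with GPU count if the work is spread across many cards.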



If we get OCR working, it becomes possible to store all the human knowledge currently locked in PDFs with far fewer resources.

https://annas-archive.org/blog/critical-window.html



