Google Indexes Images from PDF Files (googlesystem.blogspot.com)
47 points by tomkwok on Aug 10, 2015 | 4 comments


It's interesting that they do this, but they don't appear to be doing anything difficult. The linked PDFs have images embedded, as opposed to, say, scanned pages where image borders would have to be recognized automatically.


This only encourages restaurants to keep their menus in PDFs.


Which is great, because 80% of the websites are for otherwise acceptable restaurants with a mediocre site presence and implementation. The 20% that have a mobile implementation worth visiting still don't support "view the desktop version of this site", which forces me to re-download and install Dolphin Browser on Android just so I can manually spoof the browser's user agent as "Desktop".

I wonder if there would be a market for a rancur-approved sticker/logo that sites could place on their pages, and that could be indexed and searched by, to help usability-conscious users rid their online experience of pesky mobile implementations.


> Back in 2008, Google started to use OCR to index the full text of scanned PDF files. Now Google extracts images from PDF files and makes them searchable.

The astounding rate of progress of Google! In only 7 years they have gone from extracting PDF text to extracting the embedded images. When will the miracles of those technical wizards cease to astonish all who gaze upon their brilliance?

I wonder if there was some other reason Google waited so long to include PDF images? Perhaps something legal. The actual technical requirement is really clear, since the photographic image bytestreams are simply stored in the PDF, and the utility for people seems quite large, given how many images are stored in so many PDFs.
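
To illustrate the "simply stored in the PDF" point, here is a rough sketch, not how Google does it. It assumes the images are JPEGs stored in unencrypted stream objects with the /DCTDecode filter, in which case the stream body is an ordinary JPEG file. The function name `extract_jpeg_streams` is made up for this example, and it scans with a regex instead of parsing the PDF properly (it ignores /Length, cross-reference tables, and every other image encoding).

```python
import re
from typing import List

# A stream object looks like: << ...dictionary... >> stream <bytes> endstream
# This pattern is a naive approximation, not a real PDF parser.
_STREAM_RE = re.compile(
    rb"<<(?P<dict>.*?)>>\s*stream\r?\n(?P<body>.*?)\r?\nendstream",
    re.DOTALL,
)


def extract_jpeg_streams(pdf_bytes: bytes) -> List[bytes]:
    """Return the raw bytes of every stream that looks like an embedded JPEG."""
    images = []
    for match in _STREAM_RE.finditer(pdf_bytes):
        is_dct = b"/DCTDecode" in match.group("dict")
        body = match.group("body")
        # JPEG data starts with the SOI marker FF D8.
        if is_dct and body.startswith(b"\xff\xd8"):
            images.append(body)
    return images
```

Each returned item could be written to disk as a .jpg without further decoding. Images stored with other filters (for example /FlateDecode) would need real decoding, so a proper PDF library is the better choice for anything beyond a demonstration.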

Perhaps it was simply overlooked, or not on the roadmap until they had made a lot of other changes to image search that were perhaps judged more important, such as changes to their image representation indexing (which they seem to do by majority color) and image content indexing (which they seem to contribute to using DNN-generated descriptions).

It seems likely they rolled this out in a limited fashion a few times over the years, then held off on a general release pending whatever was missing.



