Hacker News | mirandrom's comments

I went down the same rabbit hole and did the exact same thing last week in a fit of procrastination. https://news.ycombinator.com/item?id=46185128

Would appreciate a shout-out if you saw it and were inspired, otherwise it's nice to see others converging independently on the same thing.


Oh wow! Did not know that. I went off the original post by Greg, and after I sent him this link he mentioned that someone had looked at Common Crawl as well.

Either way, I updated both the git repo and the webpage with a shout-out to your findings from the week before! I linked directly to your website, lmk if that's how you prefer it.

Cheers!


Much appreciated! I'll update my page to link to your much more polished one too :)


I went down a rabbit hole and found most of the missing lists on Common Crawl: https://mirandrom.github.io/bourdain-lists/

Unfortunately, AFAICT, the embedded image data were not included in the Common Crawl scrapes, and a few of the image URLs I tried don't seem indexed by Common Crawl. I only just started playing around with these tools so I might've missed something.
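
For anyone who wants to check whether a given URL shows up in the index, here is a rough sketch of the kind of lookup involved, not necessarily how I did it. It assumes the public CDX index server at index.commoncrawl.org with its `url` and `output=json` query parameters; the crawl ID and the helper names (`build_index_query`, `parse_index_lines`) are just illustrative. It only builds the query URL and parses a response body, so fetching is left to whatever HTTP client you like.

```python
import json
from urllib.parse import urlencode

# Public Common Crawl index server (one index per crawl).
INDEX_SERVER = "https://index.commoncrawl.org"


def build_index_query(crawl_id: str, url_pattern: str) -> str:
    """Build a CDX index query URL for one crawl.

    crawl_id is something like "CC-MAIN-2024-10" (an example, pick the
    crawl you care about); url_pattern may end in "/*" for a prefix match.
    """
    params = {"url": url_pattern, "output": "json"}
    return f"{INDEX_SERVER}/{crawl_id}-index?{urlencode(params)}"


def parse_index_lines(body: str) -> list[dict]:
    """Parse a JSON-lines response body: one JSON record per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def only_images(records: list[dict]) -> list[dict]:
    """Keep records whose reported MIME type starts with "image/"."""
    return [r for r in records if r.get("mime", "").startswith("image/")]
```

An empty result for an image URL only means that crawl didn't capture it; other crawls would have to be queried separately.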


Common Crawl is a text-only crawl.


I'm not so sure, they say "The crawled content is dominated by HTML pages and contains only a small percentage of other document formats." https://commoncrawl.github.io/cc-crawl-statistics/plots/mime...

In any case, all the images were external CloudFront URLs that have not been archived anywhere afaict.


Hi. I'm the CTO at Common Crawl. Nice to meet you. There's a small amount of "bycatch", and you already discovered how to see it. Notice that it went down after I was hired.


Can't argue with those credentials. Thanks for confirming/clarifying!

