mirandrom's comments

mirandrom · 2025-12-14T03:00:14 1765681214

I went down the same rabbit hole and did the exact same thing last week in a fit of procrastination. https://news.ycombinator.com/item?id=46185128

Would appreciate a shout-out if you saw it and were inspired, otherwise it's nice to see others converging independently on the same thing.

thecsw · 2025-12-14T03:21:18 1765682478

Oh wow! Did not know that. I went off the original post by Greg, and after I sent him this link he mentioned that someone had looked at Common Crawl as well.

Either way, I updated both the git repo and the webpage to shout out your findings from the week before! I linked directly to your website, lmk if that's how you prefer it.

Cheers!

mirandrom · 2025-12-15T16:54:02 1765817642

Much appreciated! I'll update my page to link to your much more polished one too :)

mirandrom · 2025-12-07T21:02:24 1765141344

I went down a rabbit hole and found most of the missing lists on Common Crawl: https://mirandrom.github.io/bourdain-lists/

Unfortunately, AFAICT, the embedded image data were not included in the Common Crawl scrapes, and a few of the image URLs I tried don't seem indexed by Common Crawl. I only just started playing around with these tools so I might've missed something.

ccgreg · 2025-12-08T02:37:20 1765161440

Common Crawl is a text-only crawl.

mirandrom · 2025-12-08T23:50:39 1765237839

I'm not so sure. They say "The crawled content is dominated by HTML pages and contains only a small percentage of other document formats." https://commoncrawl.github.io/cc-crawl-statistics/plots/mime...

In any case, all the images were external CloudFront URLs that have not been archived anywhere afaict.

ccgreg · 2025-12-09T02:43:19 1765248199

Hi. I'm the CTO at Common Crawl. Nice to meet you. There's a small amount of "bycatch", and you already discovered how to see it. Notice that it went down after I was hired.

mirandrom · 2025-12-14T02:50:59 1765680659

Can't argue with those credentials. Thanks for confirming/clarifying!