Oh wow, I did not know that! I went off Greg's original post, and after I sent him this link he mentioned that someone had looked at Common Crawl as well.
Either way, I updated both the git repo and the webpage to give a shout-out to the findings from the week before! I linked directly to your website, lmk if that's how you prefer it.
Unfortunately, AFAICT, the embedded image data were not included in the Common Crawl scrapes, and a few of the image URLs I tried don't seem to be indexed by Common Crawl.
I only just started playing around with these tools so I might've missed something.
Hi. I'm the CTO at Common Crawl. Nice to meet you. There's a small amount of "bycatch", and you already discovered how to see it. Notice that it went down after I was hired.
Would appreciate a shout-out if you saw it and were inspired, otherwise it's nice to see others converging independently on the same thing.