
LLM training input treasure? Signal-to-noise ratio is going to be lowish though.


I think Discord's data has insane value because it has real-time reasoning steps, much more than any big social media platform. Obviously you might want to filter out low-information messages like "Hello", but model-based filtering for signal is more or less solved already.

Discord has a lot of very technical channels, like the one that solved BB(5) after decades of research.


Agreed. Discord has a unique structure with threaded conversations, context carryover, and back-and-forth reasoning that you don't get from places like Twitter or even Reddit. It's especially useful for training models on collaborative problem solving or exploratory dialogue. Filtering is a challenge but definitely solvable with current tools.


If you want to train a groomer then Discord messages are the way to do it.


If we're acting like Redditors: "you can just block Minecraft Discords, and remove most of the grooming"



