
> information (like titles) can be meaningfully extracted

From a technical perspective it's probably simpler to just grep the page for the user-submitted title.



Sometimes you have to paraphrase the title due to the length limit.


The issue is probably mostly paywalled sites and SPAs, where this isn't as simple.


Sure, and I agree that this problem isn't worth spending a lot of resources on if the level of toil is acceptable.

But most paywalled sites do display the title so that you know what great journalism you're being asked to pay for. I suspect that the typical SPA shows titles as well. So grepping should work in those cases.

But my point was that while title extraction is a hard problem requiring you to solve lots of corner cases, title grepping is simple and handles the vast majority of cases. The corner cases are then handled by humans (as, IIUC, all cases are currently).

Accepting a user-generated title and comparing it to the text gives you a boolean. If they don't match you can just ask the user to affirm that what they submitted is really the title. Then, if you like, you can have a "this title may be dodgy" icon on posts that don't match.
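
To make that concrete, here's a rough sketch of the check in Python. Everything in it is my own assumption, not how HN actually works: the function names (`title_matches`, `needs_confirmation`) are made up, the tag stripping is a crude regex rather than a real HTML parser, and it assumes you already have the page's HTML as a string.

```python
import html
import re


def strip_tags(page_html: str) -> str:
    """Crudely replace tags with spaces. Not a real parser, just enough for a substring check."""
    return re.sub(r"<[^>]+>", " ", page_html)


def normalize(text: str) -> str:
    """Unescape HTML entities, collapse whitespace, and casefold for comparison."""
    text = html.unescape(text)
    text = re.sub(r"\s+", " ", text)
    return text.strip().casefold()


def title_matches(page_html: str, submitted_title: str) -> bool:
    """Return True if the submitted title appears somewhere in the page text."""
    needle = normalize(submitted_title)
    if not needle:
        return False  # an empty title never counts as a match
    return needle in normalize(strip_tags(page_html))


def needs_confirmation(page_html: str, submitted_title: str) -> bool:
    """Hypothetical hook: True means ask the user to affirm the title (or flag the post)."""
    return not title_matches(page_html, submitted_title)
```

A paraphrased or shortened title will come back as a mismatch, which is the point: those are the cases that get kicked to the submitter or a human.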


Yet another reason to ban paywalled sites.

I really don't understand why a site that wants people to actually read the article and discuss the contents promotes articles that at least 90% of readers won't be able to (easily) access.



