
The submitted title ("New Leaked Documents Point to Engineered Lab Origin for SARS‑CoV‑2") broke the site guidelines badly by editorializing. Submitters: please don't do that—it will eventually cause you to lose submission privileges on HN. Instead, follow the site guidelines, which include: "Please use the original title, unless it is misleading or linkbait; don't editorialize."

https://news.ycombinator.com/newsguidelines.html

(I'm assuming, of course, that it wasn't the article title that got subsequently changed. If that was the case, ignore the above.)



> (I'm assuming, of course, that it wasn't the article title that got subsequently changed. If that was the case, ignore the above.)

Not the first time I've seen you say this. Would it be worthwhile to fetch articles when they're submitted, if only for your own sanity?


Fetching them in a way that information (like titles) can be meaningfully extracted from is a lot harder than it sounds - we've worked on it in the past and got bogged down in lots of details and corner cases etc. An easier way might be to rely on one of the archiving services, e.g. archive.org. If a snapshot could be taken at submission time then it would be there to refer to later.

On the other hand, titles changing on the fly isn't that big a headache as far as sanity-affecting headaches go. NYT does it all the time, or used to. The main thing I don't like to do is scold someone for breaking the title guideline and then finding out later that it was the site, not the submitter, that changed it.


> information (like titles) can be meaningfully extracted

From a technical perspective it's probably simpler to just grep the page for the user-submitted title.


Sometimes you have to paraphrase the title due to the length limit.


The issue is likely mostly paywalled sites and SPAs where this isn't as simple.


Sure, and I agree that this problem isn't worth spending a lot of resources on if the level of toil is acceptable.

But most paywall sites do display the title so that you know what great journalism you're being asked to pay for. I suspect that the typical SPA shows titles as well. So grepping should work in those cases.

But my point was that while title extraction is a hard problem requiring you to solve lots of corner cases, title grepping is simple and handles the vast majority of cases. The corner cases are then handled by humans (as, IIUC, all cases are currently).

Accepting a user-generated title and comparing it to the text gives you a boolean. If they don't match you can just ask the user to affirm that what they submitted is really the title. Then, if you like, you can have a "this title may be dodgy" icon on posts that don't match.
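To make the "grep gives you a boolean" idea concrete, here is a minimal sketch in Python. It is not how HN works; the function names (`strip_tags`, `normalize_text`, `title_appears_in_page`) are made up for illustration, and it assumes the page HTML has already been fetched somehow. It only normalizes entities, quotes, dashes, whitespace, and case before doing a substring check.

```python
import html
import re
import unicodedata

# Map common typographic punctuation to ASCII so "smart" quotes and
# dashes on the page still match a plainly typed title.
_PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    "\u2010": "-", "\u2011": "-", "\u2013": "-", "\u2014": "-",
})


def strip_tags(page_html: str) -> str:
    """Crudely replace anything that looks like a tag with a space."""
    return re.sub(r"<[^>]+>", " ", page_html)


def normalize_text(text: str) -> str:
    """Unescape entities, fold punctuation variants, collapse whitespace, casefold."""
    text = html.unescape(text)
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(_PUNCT_MAP)
    return re.sub(r"\s+", " ", text).strip().casefold()


def title_appears_in_page(submitted_title: str, page_html: str) -> bool:
    """Return True if the submitted title occurs anywhere in the page text."""
    needle = normalize_text(submitted_title)
    haystack = normalize_text(strip_tags(page_html))
    return bool(needle) and needle in haystack


page = (
    "<html><head><title>Foo&nbsp;Bar: A &ldquo;Study&rdquo;</title></head>"
    "<body><h1>Foo Bar: A \u201cStudy\u201d</h1><p>Body text.</p></body></html>"
)
matches = title_appears_in_page('Foo Bar: A "Study"', page)
mismatch = title_appears_in_page("Leaked Documents Prove Everything", page)
```

A `False` result here is only a hint, not a verdict: a title paraphrased to fit the length limit, or a page rendered client-side, would also fail the check, which is why the confirm-with-the-submitter step still matters.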


Yet another reason to ban paywalled sites.

I really don't understand why a site that wants people to actually read the article and discuss the contents promotes articles that at least 90% of readers won't be able to (easily) access.


The truly irritating thing is that even that wouldn't necessarily be enough, because so many sites actually do live A/B/C/[n] title tests simultaneously on randomized sets of users, then choose whichever one gets the most clicks or whatever metric first. Even without any manual shenanigans. So there's a window where merely refreshing or browsing from a different IP will yield a different title. Sometimes evidence is left in the URL or in interactions with older systems on the site, but that's all baroque. So so so many edge cases in grabbing titles.

Probably not worth the effort on HN to try to automate vs just treating it case by case. It doesn't usually seem to be a problem. "Premature optimization is the root of all evil" and all that.

Edit: or archive.org as dang says, but I don't know if even they see all versions of a page if there is a simultest situation. Regrettably seems pretty SOP on even reputable places.


Maybe putting that specific guideline on the submission page might help?



