Dan Romero@dwr
2/11/2023

Wonder if ChatGPT will be the last major model to be trained on the open web? robots.txt specifically disallowing crawling from LLMs unless getting paid for the data?

In reply to @dwr
Justin Hunter@polluterofminds
2/11/2023

Aren’t robots.txt files just suggestions? Any crawler can ignore those files if they want and Google often does IIRC
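
To make the "just suggestions" point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the `ExampleLLMBot` user agent are hypothetical names chosen for illustration, not rules published by any real site or crawler. The check only runs if the crawler chooses to call it, which is why the file is advisory rather than enforced.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: "ExampleLLMBot" is an assumed crawler name.
ROBOTS_TXT = """\
User-agent: ExampleLLMBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A well-behaved crawler asks first; nothing stops a crawler that skips this step.
llm_allowed = is_allowed("ExampleLLMBot", "https://example.com/post/1")
other_allowed = is_allowed("SomeOtherBot", "https://example.com/post/1")
```

Under these assumed rules the LLM crawler is refused and every other agent is allowed, but only crawlers that voluntarily run a check like this are affected.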

In reply to @dwr
Cameron Armstrong@cameron
2/11/2023

Man it’ll be so dumb if ChatGPT is what makes everyone on the internet panic and close itself off to keep from getting ingested. Not impossible, but just dumb

In reply to @dwr
William Saar@saarw
2/11/2023

If AIs can generate enough value, it might be worth paying armies of Mechanical Turk-style workers to manually visit and rewrite websites for copyright-approved training. Facts and ideas can't be copyrighted, only particular expression.

In reply to @dwr
0xbyron@byron
2/11/2023

I'm curious what's the law around crawling sites that disregard robots.txt and post mirrors of content.

In reply to @dwr
MxVoid@mxvoid
2/11/2023

Could be. Microsoft is already being sued for Copilot; Stability AI, Midjourney, and DeviantArt are being sued for Stable Diffusion; it’s just a matter of time before OpenAI gets sued for their products, too. When the lawsuits start flying, so do the CYA measures.

In reply to @dwr
tim 🥝@timdaub
2/11/2023

lol we must be listening to All-In around the same time mark

In reply to @dwr
WakΞ@wake
2/11/2023

Crawl me baby.

In reply to @dwr
Venkatesh Rao ☀️@vgr
2/11/2023

I doubt it. We’re at the start of an arms race between training and membership inference algorithms. https://arxiv.org/abs/2301.09956 Even if Western majors respect regulatory-type regimes and robots.txt directives, many won’t. The only defense is encryption, not regulation.

In reply to @dwr
Katherine@keccers
2/11/2023

I hope so. I wasn’t attuned to the risk previously but now that I am I don’t want MegaCorpLLM getting a scrap, save for my illegible tweets lmao

In reply to @dwr
phil@phil
2/11/2023

I don’t think so. If we continue to see model sizes increase I would expect GPT-4, 5 to also be trained on a similar corpus with better results. What ~might~ happen is that new webpages have protection against this kind of scraping. Hard to do retroactively since the data is probably already cached

In reply to @dwr
Adam Baybutt@baybutt
2/11/2023

How do LLMs incentivize users to give feedback on answer quality? Offer fee? But then just max number of feedbacks. Offer token for ~shared rev? Incentivize credible feedback.

In reply to @dwr
2/11/2023

First, I’m impressed by the thoughtfulness of your responses - very bullish on what you’re building here. Second, I think the knee-jerk reactions will settle down.

In reply to @dwr
Shashank@0xshash
2/12/2023

might be interesting if chatgpt can include citations in the results but it might become more like Google at that point