Searchcaster

Wonder if ChatGPT will be the last major model to be trained on the open web? Will robots.txt files start specifically disallowing LLM crawlers unless sites get paid for the data?

In reply to @dwr

Aren’t robots.txt files just suggestions? Any crawler can ignore them if it wants to, and Google often does, IIRC.
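
To illustrate the "just suggestions" point: a robots.txt rule only has an effect when the crawler itself chooses to check it. Below is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `ExampleLLMBot` user agent, the `is_allowed` helper, and the example.com URLs are all made up for illustration, not taken from any real site or crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one made-up LLM crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleLLMBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the hypothetical robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite crawler runs this check before fetching; nothing forces it to.
print(is_allowed("ExampleLLMBot", "https://example.com/post/1"))  # False
print(is_allowed("SomeOtherBot", "https://example.com/post/1"))   # True
```

The check lives entirely in the crawler's own code, so a crawler that skips the call, or reports a different user agent, is not stopped by the file.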

In reply to @dwr

Man, it’ll be so dumb if ChatGPT is what makes everyone on the internet panic and close themselves off to keep from getting ingested. Not impossible, but just dumb.

In reply to @dwr

If AIs can generate enough value, it might be worth paying armies of Mechanical Turk-style workers to manually visit and rewrite websites for copyright-approved training. Facts and ideas can't be copyrighted, only particular expression.

In reply to @dwr

I'm curious what the law says about crawler sites that disregard robots.txt and post mirrors of content.

In reply to @dwr

Could be. Microsoft is already being sued over Copilot; Stability AI, Midjourney, and DeviantArt are being sued over Stable Diffusion; it’s just a matter of time before OpenAI gets sued for their products, too. When the lawsuits start flying, so do the CYA measures.

In reply to @dwr

lol we must be listening to All-In around the same time mark

In reply to @dwr

Crawl me baby.

In reply to @dwr

I doubt it. We’re at the start of an arms race between training and membership inference algorithms. https://arxiv.org/abs/2301.09956 Even if the Western majors respect regulatory regimes and robots.txt directives, many others won’t. The only defense is encryption, not regulation.

In reply to @dwr

I hope so. I wasn’t attuned to the risk previously, but now that I am, I don’t want MegaCorpLLM getting a scrap, save for my illegible tweets lmao

In reply to @dwr

I don’t think so. If we continue to see model sizes increase, I would expect GPT-4 and GPT-5 to also be trained on a similar corpus, with better results. What ~might~ happen is that new webpages get protection against this kind of scraping. That's hard to do retroactively, since the existing data is probably already cached.

In reply to @dwr

How do LLMs incentivize users to give feedback on answer quality? Offer a fee? But then users just maximize the number of feedback submissions. Offer a token for ~shared rev? The goal is to incentivize credible feedback.

In reply to @dwr

First, I’m impressed by the thoughtfulness of your responses - very bullish on what you’re building here. Second, I think the knee-jerk reactions will settle down.

In reply to @dwr

Might be interesting if ChatGPT could include citations in its results, but it might become more like Google at that point.
