(Reuters) – Social media platform Reddit said on Tuesday it would update the web standard it uses to block automated data collection from its website, following reports that artificial intelligence startups were circumventing the rule to scrape content for their systems.
The move comes as artificial intelligence companies face accusations of plagiarizing publishers' content to create AI-generated summaries without attribution or permission.
Reddit said it would update its Robots Exclusion Protocol file, or robots.txt, a widely used standard that tells automated crawlers which parts of a site they are allowed to visit.
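In rough terms, a crawler that honors the standard fetches a site's robots.txt before crawling and checks each URL against its rules. The sketch below, using Python's standard urllib.robotparser module, illustrates that check; the user agent "ExampleBot" is hypothetical, and Reddit's live file may differ from what this example assumes.

    # Illustrative only: checks whether a hypothetical crawler may fetch a page,
    # according to the rules published in a site's robots.txt file.
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://www.reddit.com/robots.txt")
    rp.read()  # download and parse the robots.txt file

    # "ExampleBot" is a made-up user agent used for illustration.
    allowed = rp.can_fetch("ExampleBot", "https://www.reddit.com/r/technology/")
    print(allowed)  # False if the site's rules disallow this crawler for that path

A crawler that ignores this file can still request the pages; the standard relies on voluntary compliance, which is why publishers pair it with other controls.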
The company also said it will maintain rate limiting, a technique used to cap the number of requests from a single entity, and will block unknown bots and crawlers from scraping (collecting and storing raw information) from its website.
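Rate limiting can be implemented in several ways; one common approach, shown as a rough sketch below and not a description of Reddit's actual mechanism, is a token bucket that gives each client a budget of requests that refills over time.

    # Minimal token-bucket rate limiter: each client gets a budget of requests
    # that refills at a fixed rate; requests beyond the budget are rejected.
    # Illustrative sketch only; the capacity and refill rate are arbitrary.
    import time

    class TokenBucket:
        def __init__(self, capacity: int, refill_per_sec: float):
            self.capacity = capacity            # maximum burst of requests
            self.tokens = float(capacity)       # current remaining budget
            self.refill_per_sec = refill_per_sec
            self.last = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            # Refill the budget based on elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.refill_per_sec)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    # One bucket per client: a burst of up to 60 requests, refilled at 1 per second.
    bucket = TokenBucket(capacity=60, refill_per_sec=1.0)
    print(bucket.allow())  # True while the client stays within its budget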
More recently, the robots.txt file has become a key tool that publishers use to stop tech companies from freely using their content to train AI algorithms and create summaries in response to some search queries.
Last week, content licensing startup TollBit sent a letter to publishers warning that several artificial intelligence companies were circumventing the web standard to scrape publisher sites.
This follows a Wired investigation that found AI search startup Perplexity may have bypassed attempts to block its web crawler via robots.txt.
Earlier in June, business media publisher Forbes accused Perplexity of plagiarizing its investigative stories for use in generative artificial intelligence systems without giving due credit.
Reddit said Tuesday that researchers and organizations such as the Internet Archive will continue to have access to its content for non-commercial use.