Gary Marcus is a leading artificial intelligence researcher who is increasingly horrified by what he sees. He has founded at least two AI startups, one of which was sold to Uber, and has researched the field for more than two decades. Just this past weekend the Financial Times described him as perhaps the loudest AI skeptic and reported that Marcus suggested he was the target of Sam Altman’s barbed post on X: “Give me the confidence of a mediocre deep learning skeptic.”
Marcus doubled down on his criticism the day after his appearance in the FT, writing on his Substack about “generative AI as Shakespearean tragedy.” The subject was a bombshell New York Times report that OpenAI had violated YouTube’s terms of service by copying more than a million hours of user content. To make matters worse, Google’s need for data to train its own AI models was so insatiable that it did the same, potentially violating the copyrights of the creators whose videos it used without their consent.
Back in 2018, Marcus noted, he had expressed reservations about the “data ingestion” approach to building AI models, which aimed to fill them with as much content as possible. In fact, he listed eight of his warnings, starting with his diagnosis of hallucinations back in 2001, all of which have since come true, like a curse in Macbeth or Hamlet coming to pass in the fifth act. “What makes all of this tragic is that so many of us tried so hard to warn the field that we would end up here,” Marcus wrote.
Although Marcus declined to comment to Fortune, the tragedy goes far beyond the fact that no one listened to critics like him and Ed Zitron, another well-known skeptic quoted by the FT. According to the Times, which cites numerous sources, both Google and OpenAI knew their actions were legally questionable (betting that copyright in the age of AI had yet to be litigated) but felt they had no choice except to keep pumping data into their large language models to stay ahead of their competitors. And in Google’s case, the company was potentially harmed by OpenAI’s massive scraping efforts, but because it had broken the rules to scrape that same data itself, it was left with the proverbial hand tied behind its back.
Did OpenAI use YouTube videos?
Google employees learned that OpenAI was using YouTube content to train its models, violating both YouTube’s terms of service and, possibly, the copyright protections of the creators who own the videos. Caught in this bind, Google decided not to publicly condemn OpenAI for fear of drawing attention to its own use of YouTube videos to train AI models, the Times reported.
A Google spokesperson told Fortune the company has “seen unconfirmed reports” that OpenAI was using YouTube videos. They added that YouTube’s terms of service “prohibit the unauthorized copying or downloading” of videos, and that the company “has a long history of implementing technical and legal measures to prevent this.”
Marcus says the behavior of these Big Tech companies was predictable, because data is the key ingredient in the artificial intelligence tools they are racing to develop. Without quality data, such as well-written novels, podcasts from knowledgeable hosts, or professionally produced films, chatbots and image generators risk producing mediocre content. The idea is summed up in the data science adage “garbage in, garbage out.” In an article for Fortune, Jim Stratton, CTO of HR software company Workday, said that “data is the lifeblood of AI,” making “the need for quality, timely data more important than ever.”
Around 2021, OpenAI ran into a data shortage. Desperate for more human speech to keep improving its ChatGPT tool, which was still about a year away from release, the company decided to source it from YouTube. Staff discussed whether copying YouTube videos might be prohibited. Eventually, a group that included OpenAI president Greg Brockman put the plan into action.
That a figure as high-ranking as Brockman was involved in the scheme is a testament to how fruitful such data-gathering methods have been for AI development, Marcus said. Brockman did so “probably knowing he was entering a legal gray area but desperate to feed the beast,” Marcus wrote. “If everything falls apart for legal or technical reasons, that image may persist.”
When reached for comment, an OpenAI spokesperson did not respond to specific questions about using YouTube videos to train its models. “Each of our models has a unique dataset that we curate to help them understand the world and remain globally competitive in research,” they wrote in an email. “We use multiple sources, including public data and proprietary data partnerships, and are also exploring the possibility of creating synthetic data,” they said, referring to the practice of using AI-generated content to train AI models.
When OpenAI CTO Mira Murati was asked in a Wall Street Journal interview whether the company’s new Sora video generator was trained on YouTube videos, she replied, “I’m actually not sure about that.” Last week, YouTube CEO Neal Mohan said that while he doesn’t know whether OpenAI actually used YouTube data to train Sora or any other tool, doing so would violate the platform’s rules. Mohan did note that Google uses some YouTube content to train its own artificial intelligence tools, under contracts it has with individual creators, a statement a Google spokesperson repeated to Fortune in an email.
Meta decided licensing deals would take too long
OpenAI wasn’t the only company to face a data shortage; Meta grappled with the same problem. When Meta realized its artificial intelligence products were not as advanced as OpenAI’s, it held numerous meetings with senior executives to figure out how to get more data to train its systems. Executives considered options such as paying a licensing fee of $10 per book for new releases and buying Simon & Schuster outright. In those meetings, executives acknowledged that they had already used copyrighted material without the authors’ permission. Ultimately, they decided to press on even if it meant possible future lawsuits, according to The New York Times.
Meta did not respond to a request for comment.
Meta’s lawyers believed that if the matter ever went to trial, the company would be protected by a 2015 case that Google won against a consortium of authors. In that case, a judge ruled that Google could use the authors’ books without paying a licensing fee because it had used their work to build a search engine, a purpose transformative enough to qualify as fair use.
OpenAI is making a similar argument in the case The New York Times brought against it in December. The Times alleges that OpenAI used its copyrighted material without paying for it, while OpenAI argues that its use of the material is fair because it gathered the work to train a large language model, not to operate a competing news organization.
For Marcus, the hunger for more data is evidence that the entire proposition of AI rests on shaky ground. For AI to live up to the hype surrounding it, it simply needs more data than exists. “All of this came about from the realization that their systems simply couldn’t succeed without even more data than the internet-scale data they had already been trained on,” Marcus wrote on Substack.
OpenAI appeared to admit as much in written testimony before the UK House of Lords in December. “It would be impossible to train today’s leading AI models without the use of copyrighted materials,” the company wrote.