How Tech Giants Cut Corners to Harvest Data for their AI Models

In late 2021, OpenAI faced a supply problem. The artificial intelligence lab had exhausted every reservoir of reputable English-language text on the internet as it developed its latest AI system. It needed more data to train the next version of its technology — lots more.

So OpenAI researchers created a speech recognition tool called Whisper. It could transcribe the audio from YouTube videos, yielding new conversational text that would make an AI system smarter. Some OpenAI employees discussed how such a move might go against YouTube’s rules, three people with knowledge of the conversations said. YouTube, which is owned by Google, prohibits use of its videos for applications that are “independent” of the video platform.

Ultimately, an OpenAI team transcribed more than 1 million hours of YouTube videos, the people said. The team included Greg Brockman, OpenAI’s president, who personally helped collect the videos, two of the people said. The texts were then fed into a system called GPT-4, which was widely considered one of the world’s most powerful AI models and was the basis of the latest version of the ChatGPT chatbot.

The race to lead AI has become a desperate hunt for the digital data needed to advance the technology. To obtain that data, tech companies including OpenAI, Google and Meta have cut corners, ignored corporate policies and debated bending the law, according to an examination by The New York Times.

At Meta, which owns Facebook and Instagram, managers, lawyers and engineers last year discussed buying the publishing house Simon & Schuster to procure long works. They also conferred on gathering copyrighted data from across the internet, even if that meant facing lawsuits.

Like OpenAI, Google transcribed YouTube videos to harvest text for its AI models, five people with knowledge of the company’s practices said. That potentially violated the copyrights to the videos, which belong to their creators. Last year, Google also broadened its terms of service. One motivation for the change, according to members of the company’s privacy team, was to allow Google to tap publicly available Google Docs, restaurant reviews on Google Maps and other online material for more of its AI products.

The companies’ actions illustrate how online information — news stories, fictional works, message board posts, Wikipedia articles, computer programs, photos, podcasts — has increasingly become the lifeblood of the booming AI industry.
