Tech giants use YouTube subtitles for AI training without permission


Cryptopolitan, 2024/07/16 23:55

By Brenda Kanana

In this post: Apple and other AI developers, such as Anthropic and Nvidia, have been caught using YouTube subtitles without permission to train their AI systems. The “YouTube Subtitles” dataset was developed by EleutherAI and published in 2020. OpenAI used a million hours of YouTube videos to train its GPT-4 model.

Apple, Nvidia, and Anthropic have been found to have used YouTube subtitles to train AI models, in violation of YouTube's policies. A report by Proof News and Wired showed that these firms used a dataset of transcripts from thousands of YouTube videos without obtaining a license to do so.


The report revealed that Apple, Nvidia, and Anthropic used the YouTube Subtitles dataset. This dataset consists of transcripts from 173,536 YouTube videos from 48,000 channels. The videos come from educational channels like Khan Academy and MIT, news outlets like The Wall Street Journal, and top creators like MrBeast and Marques Brownlee.

Popular YouTubers react to data exploitation

Marques Brownlee, a popular YouTuber, commented on the issue on X. He said, “Apple has gathered data for AI from other firms. One of them collected a lot of data/transcripts from YouTube videos, including mine.” While Apple may not have scraped the data directly, Brownlee pointed out that this problem will persist.

The “YouTube Subtitles” dataset was developed by EleutherAI and published in 2020. It contains 5.7GB of data, including subtitles from YouTube videos that have since been removed from the platform.

According to YouTube’s terms and conditions, accessing videos by “automated means” is prohibited. The existence of subtitles from removed videos only adds to the issue, raising questions about privacy and copyright infringement.

Salesforce, another organization implicated in the probe, has admitted to having used the dataset.

“The Pile dataset referred to in the research paper was trained in 2021 for academic and research purposes. The dataset was publicly available and released under a permissive license.”
Salesforce spokesperson

The use of YouTube content without permission nevertheless remains controversial. In April, YouTube CEO Neal Mohan said that using YouTube videos, transcripts, or clips for AI training is a “clear violation” of the platform's policies. Even so, according to the New York Times, OpenAI used a million hours of YouTube videos to train its GPT-4 model.

Legal battles erupt over AI companies’ use of internet content

Disputes over AI companies using internet content without authorization have intensified since the launch of ChatGPT. Content creators are suing Stability AI and Midjourney for allegedly scraping copyrighted works without permission. YouTube’s owner, Google, has faced class-action lawsuits over similar claims and has stated that legal actions of this kind threaten the basis of generative AI.

In an interview with The Wall Street Journal, OpenAI’s CTO Mira Murati did not elaborate on whether the company used videos from social media platforms to train its then-new video model, Sora. Microsoft AI CEO Mustafa Suleyman stated that content on the open web had been considered fair use since the 1990s, based on what he called the “social contract.”
