AI Music, Stolen Songs, and the Problem Nobody Seems to Be Solving

When I started reading about AI companies training music-generation models on copyrighted songs, the story seemed straightforward.

  • Artists claimed their music had been used without permission.
  • AI companies were accused of training models on copyrighted recordings.
  • Lawsuits followed. Settlements and licensing agreements began appearing.

It seemed like a familiar story: someone used copyrighted material, artists weren't compensated, and the legal system stepped in to correct the problem.

Then I started looking deeper. I realized I wasn't sure what problem was actually being solved.

What Is the Problem?

Most discussions focus on the outcome. AI models can generate music that resembles human-created music. Artists are concerned about their work being used without permission.

Those concerns are understandable. But understanding an outcome is not the same thing as understanding the mechanism that produced it. As I investigated the issue, I found myself asking a surprisingly simple question:

How did the music actually get into the training systems? The answer seemed obvious. Until I started looking for it.

The Dataset That Wasn't What I Expected

One of the datasets discussed in reporting about AI music training was LAION-DISCO-12M. The dataset is often described as containing more than 12 million songs. Like many readers, I initially imagined something like a massive hard drive full of music files.

When I looked at the dataset itself, I found something different.

The dataset contains metadata:

  • song titles
  • artist names
  • album information
  • identifiers
  • YouTube URLs

It does not contain the music itself. In other words, it is closer to a catalog than a music library. The distinction matters. A library catalog is not a collection of books. A search index is not a collection of websites. A map is not the territory.

Likewise, a database of song metadata and links is not the same thing as a database of audio recordings. That immediately raised a new question.

If the dataset doesn't contain the music, where did the music come from?
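
To make the catalog-versus-library distinction concrete, here is a minimal Python sketch of what a metadata-only record might look like. The class name, field names, and example values are my own assumptions for illustration; they mirror the kinds of fields listed above, not the dataset's actual schema, and the URL is a placeholder.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CatalogEntry:
    """One row of a metadata-only dataset (hypothetical field names)."""
    title: str
    artist: str
    album: str
    identifier: str
    url: str  # a link to where the audio is hosted, not the audio itself


def contains_audio(entry: CatalogEntry) -> bool:
    """Return True only if some field holds raw bytes, i.e. actual audio data."""
    return any(
        isinstance(getattr(entry, f.name), (bytes, bytearray))
        for f in fields(entry)
    )


# Illustrative values only; none of these come from a real dataset.
entry = CatalogEntry(
    title="Example Song",
    artist="Example Artist",
    album="Example Album",
    identifier="track-0001",
    url="https://www.youtube.com/watch?v=PLACEHOLDER",
)

print(contains_audio(entry))  # False: every field is text
```

Every field here is a string. Getting from a record like this to an audio file requires a separate step that the record itself does not describe.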
 

The Question That Seemed Harder Than It Should Be

The more I looked, the more I noticed something surprising. There was extensive discussion about AI training, copyright, artists' rights, lawsuits, and settlements, but comparatively little discussion of the pipeline itself.

How did recordings move from artists and platforms into AI training systems? What tools were used? Were songs downloaded? Were they streamed? Were they licensed? Were they obtained from other sources entirely?

The public debate seemed focused on the beginning and the end.

Input: Music exists.

Output: AI generates music.

The middle of the system received far less attention. Yet the middle is often where the most important insights live.


The Assumption Hidden in the Narrative

As I followed the reporting, I noticed assumptions that many readers might never question. Two stand out:

  • If a song appears in a dataset, it was used for training.
  • If a song does not appear in a dataset, it was not used for training.

Neither conclusion necessarily follows. A dataset may be evidence of a process. It is not necessarily the process itself. The existence of a catalog does not tell us exactly what happened after someone opened it.

Similarly, discovering a song inside a dataset does not automatically tell us:

  • whether it was downloaded
  • whether it was processed
  • whether it influenced a model
  • whether it appeared in generated outputs

Those are separate questions. Yet public discussion often treats them as though they are the same.
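
One way to keep those questions apart is to record each one separately and let "unknown" be a valid answer. The Python sketch below is purely illustrative: the class, the field names, and the helper function are hypothetical, and it does not describe how any real training pipeline tracks its data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SongEvidence:
    """What is known about one song. None means 'unknown', not 'no'."""
    listed_in_dataset: bool
    downloaded: Optional[bool] = None
    processed: Optional[bool] = None
    influenced_model: Optional[bool] = None
    appeared_in_output: Optional[bool] = None


def open_questions(evidence: SongEvidence) -> List[str]:
    """Return the downstream questions that remain unanswered."""
    stages = ["downloaded", "processed", "influenced_model", "appeared_in_output"]
    return [name for name in stages if getattr(evidence, name) is None]


# Finding a song in a catalog answers only the first question.
song = SongEvidence(listed_in_dataset=True)
print(open_questions(song))
# ['downloaded', 'processed', 'influenced_model', 'appeared_in_output']
```

In this sketch, knowing that a song is listed leaves all four downstream questions open. Treating a catalog entry as proof of training would amount to setting every field to True at once.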


Solving the Wrong Problem

The most interesting part of this discussion may not be AI at all. It may be problem definition. Suppose artists receive compensation through lawsuits or licensing agreements. That may solve one problem: Artists weren't getting paid.

But does it solve another problem? Do creators actually understand how their work enters AI systems? A photographer deciding whether to post images online may care about that. A musician deciding where to distribute music may care about that. A writer publishing articles may care about that.

Understanding the mechanism allows people to make informed decisions. Compensation addresses consequences. Understanding addresses causes.

Those are not the same thing.

The Difference Between Fixing and Understanding

This pattern appears far beyond AI. In mold investigations, people often focus on symptoms rather than moisture sources. In engineering, teams often debate solutions before agreeing on the problem. In public policy, entire industries can emerge around solving consequences while leaving underlying mechanisms poorly understood.

AI music may be another example.

The public conversation largely asks:

Who should pay?

But another question may be equally important: How does creative work actually move from public platforms into AI training systems? If creators understand that process, they can make informed decisions about exposure, licensing, and risk. If they do not, they are left hoping that courts, regulations, or settlements will protect them after the fact.
 

The Real Question

I am not arguing that artists are wrong.

I am not arguing that AI companies are right.

I am not arguing that copyright concerns are unimportant.

I am suggesting that before deciding whether a solution worked, we should understand the problem it was intended to solve. The more I investigated AI music training, the more I realized I was asking the wrong question.

The question wasn't: Did AI train on music?

The question was: How did the music get there?

Understanding the outcome is useful. Understanding the mechanism is what allows us to prevent, improve, or redesign the system itself. If history has taught us anything, it is:

Solutions work best when they solve the actual problem.
