Microsoft is launching a research project to assess how specific training data, whether text, images, or other media, shapes the outputs that AI models generate.
This initiative was revealed through a job posting initially listed in December and recently recirculated on LinkedIn.
According to the job description, which seeks a research intern, the project’s goal is to demonstrate that AI models can be trained in a way that allows the influence of particular datasets—such as books or photographs—to be both effectively measured and meaningfully utilized.
“Today’s neural network architectures do not transparently reveal the origins of their generated content, and there are compelling reasons to change that,” the listing states. “One such reason is to create incentives, recognition, and potentially financial compensation for those whose contributions help shape future AI models—models that may surprise us in ways we cannot yet anticipate.”
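The listing doesn't explain how such measurement would work, but one published approach to the problem is gradient-based data attribution, such as TracIn (Pruthi et al., 2020), which scores a training example by how well its loss gradient aligns with that of a given output. The sketch below is purely illustrative and is not Microsoft's method; the toy model, data, and scoring are assumptions chosen for demonstration.

```python
# Illustrative only: a gradient-based data-attribution score in the spirit
# of TracIn (Pruthi et al., 2020). The job listing gives no technical
# details, so the model, data, and scoring here are invented for demo use.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression model standing in for a real network.
model = nn.Linear(4, 1)
loss_fn = nn.MSELoss()

train_x, train_y = torch.randn(8, 4), torch.randn(8, 1)  # "training set"
test_x, test_y = torch.randn(1, 4), torch.randn(1, 1)    # one "output" to attribute

def loss_grad(x, y):
    """Return the flattened gradient of the loss w.r.t. all parameters."""
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

# Influence score: the dot product between the test example's loss gradient
# and each training example's. A large positive score means a gradient step
# on that training example would also lower the test loss, i.e. the example
# pushes the model toward producing this output.
test_grad = loss_grad(test_x, test_y)
scores = [
    torch.dot(loss_grad(train_x[i : i + 1], train_y[i : i + 1]), test_grad).item()
    for i in range(len(train_x))
]

for i, s in sorted(enumerate(scores), key=lambda p: -p[1]):
    print(f"train example {i}: influence {s:+.4f}")
```

Real systems would aggregate such scores across training checkpoints and vastly larger corpora; the point here is only that a training example's "influence" can, in principle, be made a computable quantity.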
The debate over AI-generated content and intellectual property rights has gained significant momentum. Several generative AI tools—including those that produce text, images, code, videos, and music—are at the center of multiple lawsuits. Many AI companies scrape vast amounts of publicly available data to train their models, often incorporating copyrighted material. While these companies frequently argue that their practices fall under fair use, artists, writers, and software developers strongly contest this claim.
Microsoft itself faces multiple legal challenges. In December, The New York Times filed a lawsuit against Microsoft and its AI partner, OpenAI, alleging that their models were trained on millions of the newspaper’s copyrighted articles without permission. Additionally, a group of software developers has sued Microsoft over claims that its AI-powered coding assistant, GitHub Copilot, was unlawfully trained on proprietary code.
The new research initiative, referred to in the listing as “training-time provenance,” reportedly involves Jaron Lanier, a renowned technologist and researcher at Microsoft. In an April 2023 op-ed for The New Yorker, Lanier discussed the concept of “data dignity”—the idea of establishing a clear link between digital creations and their human originators.
Lanier argued that if an AI model produces a valuable output, it should be possible to trace the key contributors who influenced that creation. “For instance,” he wrote, “if you ask a model to generate ‘an animated movie of my children in an oil-painting world of talking cats on an adventure,’ certain artists, portraitists, voice actors, and writers—or their estates—should be recognized as essential to the creative process. They should be acknowledged, motivated, and possibly even compensated.”
Some AI companies have already begun exploring compensation models. Bria, an AI model developer that recently secured $40 million in funding, claims to compensate data owners based on their relative influence on AI outputs. Meanwhile, companies like Adobe and Shutterstock distribute royalties to contributors whose works were used in training datasets, though the specifics of these payments remain largely opaque.
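Bria hasn't published its formula, but the simplest version of such a scheme is a proportional split: normalize per-contributor influence scores and divide a revenue pool accordingly. The contributors, scores, and pool size below are entirely hypothetical.

```python
# Hypothetical payout split from influence scores. Bria's actual formula is
# not public; the contributors, scores, and pool size here are invented.
influence = {"artist_a": 0.42, "writer_b": 0.35, "photographer_c": 0.23}
pool = 1_000.00  # revenue notionally attributable to one model output

total = sum(influence.values())
for contributor, score in influence.items():
    share = pool * score / total
    print(f"{contributor}: ${share:,.2f} ({score / total:.0%} of pool)")
```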
Despite these efforts, most leading AI research labs have yet to establish direct financial incentives for individual contributors, relying instead on licensing agreements with major publishers and data platforms. In many cases, they have implemented opt-out mechanisms for copyright holders, though these often only apply to future AI training rather than existing models.
Ultimately, Microsoft's new initiative may amount to little more than a theoretical exploration. There's precedent for such projects failing to materialize: OpenAI announced in May 2024 that it was building Media Manager, a tool to let creators specify whether they wanted their works included in AI training datasets. Nearly a year later, the tool has yet to be released and appears to be a low priority internally.
There’s also speculation that Microsoft’s move could be an attempt at “ethics-washing”—a preemptive strategy to mitigate regulatory scrutiny or legal risks.
Nonetheless, Microsoft’s research stands out, especially considering the ongoing debate over fair use in AI development. Many leading AI firms, including OpenAI and Google, have actively pushed for changes to copyright law. Some have even lobbied for policies that would weaken intellectual property protections in favor of AI innovation. OpenAI, in particular, has urged the U.S. government to codify fair use for AI training, arguing that restrictive copyright rules could hinder technological progress.
As the legal landscape around AI continues to evolve, Microsoft's efforts to explore attribution and compensation mechanisms could have significant implications, whether they lead to concrete solutions or merely placate critics in the broader debate over AI ethics and copyright.