How a Scrappy Team Built an Ethically Sourced AI to Prove Big Tech Wrong
AI researchers trained a large language model using only openly licensed or public domain data. Their ethically sourced model matches the performance of industry models of similar size released roughly two years earlier.

Scientists Build an AI Model Using Only Ethically Sourced Data
A group of more than two dozen AI researchers from institutions including MIT, Cornell University, and the University of Toronto has demonstrated that it is possible to train a large language model (LLM) exclusively on data that is openly licensed or in the public domain. The effort challenges a widespread belief in the tech industry that ethically sourcing data for AI development is "impossible."
The team compiled an extensive dataset called the Common Pile v0.1, which contains over eight terabytes of text. The biggest hurdle, however, was not computing power but the manual labor involved in cleaning, formatting, and verifying the copyright status of every piece of data. Many online sources are mislabeled or carry unclear licensing terms, requiring painstaking human review to ensure compliance.
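As a rough illustration of what such a license gate might involve, here is a minimal Python sketch. Everything in it is an assumption for illustration only: the Document structure, the field names, and the license allowlist are hypothetical and do not reflect the Common Pile's actual tooling.

```python
# Hypothetical sketch of a license gate for a text corpus. The field
# names, Document structure, and allowlist are illustrative assumptions,
# not the Common Pile's actual tooling.
from dataclasses import dataclass

# Licenses broadly considered safe for open reuse; a real pipeline would
# need much finer-grained handling (license versions, NC/ND variants, etc.).
OPEN_LICENSES = {
    "public-domain",
    "cc0-1.0",
    "cc-by-4.0",
    "cc-by-sa-4.0",
    "mit",
}

@dataclass
class Document:
    text: str
    license_tag: str          # machine-readable label; often missing or wrong
    verified_by_human: bool   # manual review flag, the expensive part

def is_usable(doc: Document) -> bool:
    """Keep a document only if its license is on the allowlist AND a human
    has confirmed the label, since online license tags are unreliable."""
    return doc.license_tag.lower() in OPEN_LICENSES and doc.verified_by_human

docs = [
    Document("An 1890s novel, long in the public domain...", "public-domain", True),
    Document("A blog post tagged CC0 but never checked...", "cc0-1.0", False),
]
usable = [d for d in docs if is_usable(d)]
print(f"{len(usable)} of {len(docs)} documents pass the license gate")
```

What the sketch makes concrete is that the automated check is the easy half; the human-verification flag is where the painstaking labor described above actually lives.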
Manual Effort Over Raw Computing
Stella Biderman, a computer scientist and executive director of the nonprofit EleutherAI, emphasized that automated tools helped, but human annotation and verification were ultimately essential. "This isn't a thing where you can just scale up the resources that you have available," she explained. The process demanded significant human effort rather than simply better hardware or scraping technology.
Despite these challenges, the researchers successfully trained a seven-billion-parameter language model on this guilt-free dataset. The resulting AI performed comparably to Meta's Llama 1 and Llama 2 7B models, which were released over two years earlier, a noteworthy achievement given the team's limited resources compared to those of industry giants.
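For readers curious to try a comparable open 7B-class model, the snippet below shows a standard Hugging Face transformers loading-and-generation workflow. The model identifier is a hypothetical placeholder, not the confirmed name of the team's released model.

```python
# Generic Hugging Face transformers workflow for a 7B-class causal LM.
# The model id is a hypothetical placeholder, not the team's actual release.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "example-org/openly-licensed-7b"  # placeholder identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",   # use the checkpoint's native precision
    device_map="auto",    # spread layers across available devices
)

prompt = "The hardest part of building an openly licensed dataset is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

As a practical note, loading a seven-billion-parameter model in half precision requires roughly 14 GB of memory for the weights alone.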
Resourcefulness and Overlooked Data
The team's scrappy approach included discovering a collection of more than 130,000 English-language books from the Library of Congress that had been overlooked in previous datasets. This unexpected source enriched the dataset without compromising ethical standards.
Copyright and Ethical Challenges in AI
Copyright remains a critical issue in AI development. Major companies like OpenAI and Google have trained their models on massive amounts of data scraped from the web, including news articles and personal social media posts. Meta is currently facing lawsuits over allegations of using seven million copyrighted books without authorization. The industry often defends this by citing "fair use," arguing that comprehensive data scraping is necessary to advance AI.
This new research counters that narrative by showing that AI can be built from ethically sourced data, though the authors acknowledge that ethical questions remain. Large language models inherently risk displacing jobs, and even a model trained on public domain works may reproduce that content in ways not every creator would welcome.
Even if firms are eventually compelled to use only authorized data, the pressure on copyright holders to grant permission is likely to grow as AI technology continues to spread.
Transparency as a First Step
Biderman does not expect major corporations like OpenAI to suddenly adopt ethical data practices wholesale. However, she hopes that her team's work encourages greater transparency about training data. "Even partial transparency has a huge amount of social value and a moderate amount of scientific value," she noted.
This breakthrough provides a practical blueprint for developers and researchers interested in building AI models with clear ethical standards. It also offers a foundation for discussions around fair data use and copyright compliance in the AI community.