"Open Source AI" is a lie, but it doesn't have to be

Jacob-Haimes

Executive summary: The term "open source AI" is frequently misused by companies to gain positive perception without meeting the actual criteria for open source, which hinders meaningful discussion about AI governance and regulation.

Key points:

Open source software is clearly defined, but current AI models don't fit neatly into this definition due to their unique components (architecture, training process, weights).
The Open Source AI Definition (OSAID) is still being developed, so there is no formal definition of "open source AI" yet.
Many prominent AI models (GPT-4, Llama3, Gemma, Mistral, BLOOMZ) claim to be open source but do not meet the criteria, while only a few (Amber, Crystal, OpenELM) can be considered truly open source.
Companies misuse the "open source" label for PR benefits and to lobby for reduced regulations without sacrificing their competitive advantage.
To clarify the space, the author proposes categorizing models as Open Source (per OSAID), Shared Weights (released weights only), Open Release (encompasses both previous categories), and Closed Source.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

Jobst Heitzig (EMPO project)

What about EleutherAI?

Jacob-Haimes

Good call. I just did some more investigating, and I agree that EleutherAI's Pythia is Open Source. I'll update the post with a new image and wording shortly.

As a side note, the extra research I did as a result of your comment led me to find another Open Source model from the Allen Institute for AI (OLMo), and the Model Openness Framework, which I will also be adding.

Thanks!

Jacob-Haimes

The post has now been updated. Please let me know if you think the modifications were insufficient so that I can fix them.

Now to make the changes on the other platforms!

The source code of a program is the file or set of files, written in a human-readable programming language, that defines how that program operates. To create an executable file (a.k.a. a binary file), the source code is compiled into machine code.

How I got that number: Epoch says that the largest amount of data used to train a single model is approximately 9 trillion words; they also say that the Common Crawl dataset has 100 trillion words, so the largest training set is equivalent to roughly 9% of the Common Crawl by word count. Wikipedia reports the most recent version of the Common Crawl to be 454 tebibytes = 464,896 gibibytes.
→ 454 TiB × 0.09 ≈ 41,841 GiB ≈ 44,926 GB
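For anyone who wants to re-run this estimate, here is a minimal Python sketch. It assumes that a model's share of the Common Crawl by word count can stand in for its share by bytes, which is only a rough proxy. The function and variable names are illustrative and come from neither Epoch nor Wikipedia. The result depends on whether gigabytes are counted in binary (GiB) or decimal (GB) units, so it reports both.

```python
# Rough re-derivation of the storage estimate in this footnote.
# Assumption: share of words is used as a proxy for share of bytes.
# All names below are illustrative.

TIB = 2**40   # bytes in one tebibyte
GIB = 2**30   # bytes in one gibibyte
GB = 10**9    # bytes in one (decimal) gigabyte


def estimate_text_size(corpus_tib, words_used, words_in_corpus):
    """Scale the corpus size by the fraction of its words used in training."""
    share = words_used / words_in_corpus
    size_bytes = corpus_tib * TIB * share
    return {
        "share": share,
        "gib": size_bytes / GIB,
        "gb": size_bytes / GB,
    }


# Figures quoted in the footnote: 454 TiB, 9 trillion and 100 trillion words.
estimate = estimate_text_size(
    corpus_tib=454,
    words_used=9e12,
    words_in_corpus=100e12,
)

print(f"share of corpus: {estimate['share']:.2%}")
print(f"binary units:    {estimate['gib']:,.2f} GiB")
print(f"decimal units:   {estimate['gb']:,.2f} GB")
```

Swapping in a different corpus size or word count only requires changing the three arguments.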

It is worth noting that the OSAID leans heavily on the Model Openness Framework, which was published by White et al. in March of 2024. The group that conducted this research is called the Generative AI Commons, and is funded through the Linux Foundation. The Model Openness Framework already has a domain registered for its pending tool, isitopen.ai.

This is also an issue, but it is far less pressing, and more just annoying.

Namely, the license prohibits Llama3’s use by Meta’s competitors, and anyone who might make a significant amount of money off of it.

Yes, I know the title has the word license in it twice, that’s how it’s written, don’t @ me.

Although I am by no means a legal expert, I believe that the special provisions made for Open Source models are described entirely in the EU AI Act recital 104.

It is important to note that this table covers only instruction-tuned LLMs, meaning that base models which were not instruction-tuned do not appear on the list. The paper which accompanied this table, “Opening up ChatGPT: Tracking openness, transparency, and accountability in instruction-tuned text generators,” was released as a preprint in mid-2023 and published for the Conversational User Interfaces conference in December of 2023. The table does appear to have been updated since the conference, as OLMo now appears on it. I am not sure how frequently it is updated.

Open Source AI

What is Open Source AI?

Wait… why are groups saying that their models are open source when they aren’t?

Ok, so what do people mean when they refer to “open source” AI, at the time I am writing this article (April 2024)?

What do we do about it?

Acknowledgements

Model Name	Group
Amber	LLM360
Crystal	LLM360
OLMo	Allen Institute for AI
OpenELM	Apple
Pythia	EleutherAI