AI Industry is Trying to Subvert the Definition of “Open Source AI”

The Open Source Initiative has published (news article here) its definition of “open source AI,” and it’s terrible. It allows for secret training data and mechanisms. It allows for development to be done in secret. Since for a neural network, the training data is the source code—it’s how the model gets programmed—the definition makes no sense.

And it’s confusing; most “open source” AI models—like LLAMA—are open source in name only. But the OSI seems to have been co-opted by industry players that want both corporate secrecy and the “open source” label. (Here’s one rebuttal to the definition.)

This is worth fighting for. We need a public AI option, and open source—real open source—is a necessary component of that.

But while open source should mean open source, there are some partially open models that need some sort of definition. There is a big research field of privacy-preserving, federated methods of ML model training and I think that is a good thing. And OSI has a point here:

Why do you allow the exclusion of some training data?

Because we want Open Source AI to exist also in fields where data cannot be legally shared, for example medical AI. Laws that permit training on data often limit the resharing of that same data to protect copyright or other interests. Privacy rules also give a person the rightful ability to control their most sensitive information — like decisions about their health. Similarly, much of the world’s Indigenous knowledge is protected through mechanisms that are not compatible with later-developed frameworks for rights exclusivity and sharing.

How about we call this “open weights” and not open source?

Tags: artificial intelligence, machine learning, open source, privacy

Posted on November 8, 2024 at 7:03 AM • 32 Comments

Comments

tfb • November 8, 2024 7:14 AM

The weights for a neural network are, in fact, its machine code: a bunch of numbers which, when fed to a suitable machine, will cause it to execute a certain program. In this case the machine is the neural network, which in turn sits on top of what you might call ‘microcode’ which implements the machine in terms of a bunch of hardware.

Publishing the weights is publishing machine code: calling it open source is like calling any random free binary open source.
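
To make the analogy concrete, here is a minimal sketch (toy Python with purely hypothetical numbers): the forward-pass code is the “machine”, and swapping one set of weights for another swaps the program it runs.

    def neuron(weights, bias, x):
        # The "machine": a weighted sum followed by a step activation.
        s = sum(w * xi for w, xi in zip(weights, x)) + bias
        return 1 if s > 0 else 0

    # Same machine, different "machine code" (weights): two different programs.
    AND_PROGRAM = ([1.0, 1.0], -1.5)   # fires only when both inputs are 1
    OR_PROGRAM  = ([1.0, 1.0], -0.5)   # fires when either input is 1

    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x, neuron(*AND_PROGRAM, x), neuron(*OR_PROGRAM, x))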

It’s exactly the sort of lie I would expect the ‘AI’ techbros to be emitting.

Keller • November 8, 2024 8:56 AM

Open weights is like calling proprietary software open bytecode. If you can’t train it yourself from what they provide, it’s just freeware.

Clive Robinson • November 8, 2024 1:15 PM

I suspect it is a battle either already lost, or at best one where we will have a Pyrrhic victory after a protracted fight.

The reasons are severalfold:

1, Current AI LLM and ML systems are not paying the bills and are likely never to in their current forms for what is being claimed [1].

2, Those who have invested trillions of dollars one way or another are going to hold out for some hope of return and will more than double down on their journey to the equivalent of the “Giltspur Street Compter”[2].

3, The changes in legislation in other parts of the world to do with “personal information” will kill off other “business plans” that in effect use AI as a surveillance tool.

If “the truth” about current AI LLM and ML systems becomes sufficiently “obvious”, then “the crowd will see behind the curtain” and the current hype will in effect die, and the bubble will burst or deflate, taking with it most of the money “invested” and many people’s “Get Rich Quick” schemes as well.

Thus, just as accusations of “Green Washing” are starting to be heard, we are going to get accusations of “Open Washing” appearing.

I suspect that this will get marked with hindsight in future history as a significant event or pivot point…

[1] Most people can now see Generative AI output “by eye” as it looks like “marketing copy” (as at the very least that is a lot of what is in the input corpus). Worse, it’s often ludicrous, with “model look-alikes” having very asymmetric limbs, too many fingers, and similar. Likewise the written output is much like marketing speak and has some weird issues to do with span (coherent sentences incoherently or disjointedly formed into paragraphs etc.).

[2] The “Giltspur Street Compter” in London was a purpose-built debtors’ prison and took on inmates from the Poultry Compter. It had a bad reputation for various reasons and was torn down rather than be repurposed when the debt laws were changed:

https://en.m.wikipedia.org/wiki/Giltspur_Street_Compter

Winter • November 9, 2024 4:15 AM

Open Source was coined as a corporate counter to Free Software. It has been corporate-driven from the start.

As LLM/AI is data + software, what counts is a combination of open data + open source. We have good definitions for both.

Open Science is such a combination that is already successful.

We do not need Open Source for AI, we need Open AI.

Winter • November 9, 2024 8:06 AM

I think the problems with open data for AI are well summed up in the Slashdot link:

[W]e want Open Source AI to exist also in fields where data cannot be legally shared, for example medical AI. Laws that permit training on data often limit the resharing of that same data to protect copyright or other interests. Privacy rules also give a person the rightful ability to control their most sensitive information — like decisions about their health. Similarly, much of the world’s Indigenous knowledge is protected through mechanisms that are not compatible with later-developed frameworks for rights exclusivity and sharing.

A 1TB data set is legally and practically not the same as even a 1GB source code base. Open Source is based on copyright law. Data are ruled more by privacy and medical laws. Trying to treat them the same way as source code is ignoring reality.

Winter • November 9, 2024 9:01 AM

Reading the definition of Open Source AI from the web page, a user gets the four freedoms to use, study, modify, and share the system. That is relevant, as I have seen many examples of student projects where even large systems (e.g., Llama) were adapted and modified for scientific or practical purposes.

Then the definition adds that everything needed to reconstruct the AI save the original data must be delivered under the same conditions. That includes all the information needed to actually reconstruct the system from scratch if you had the original data.

However, the data falls under different license and legal regimes. It is still unclear whether the original data was obtained legally and it is almost certainly illegal to distribute the data.

To summarize: if we require the data to be under an open data or CC license, or else take nothing, we will get nothing.

Paul Sagi • November 10, 2024 12:15 AM

3 comments:

1) I don’t know if it’s possible to reverse engineer AI and recover the training data from the AI source code. If not possible, it negates

“Because we want Open Source AI to exist also in fields where data cannot be legally shared, for example medical AI. Laws that permit training on data often limit the resharing of that same data to protect copyright or other interests. Privacy rules also give a person the rightful ability to control their most sensitive information — like decisions about their health. Similarly, much of the world’s Indigenous knowledge is protected through mechanisms that are not compatible with later-developed frameworks for rights exclusivity and sharing.”

2) Misleading naming of software and what it does has a long history; Microsoft has done it, and three-letter agencies have done it.

3) There’s (IMHO) a need for legislation that forbids misleading labelling of software, something similar to the labelling laws that already exist for some consumer products and foods.

lurker • November 10, 2024 12:56 PM

@Winter, ALL

It sounds a bit like the old Atari, Mac &c. emulators where the emulator code was freely distributed, but the user needed to provide their own ROM, and those were proprietary under draconian licence conditions.

The OpenSourceAI definition requires you to build your own dataset:

Sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system. Data Information shall be made available under OSI-approved terms.

When some datasets have been built by legally and/or morally dubious methods, good luck with building your own by the same methods.

So an OpenSource LLM can be something, but an OpenSource AI is an oxymoron.

cls • November 10, 2024 4:08 PM

I’m noodling on the idea of a “data set”, being a collection of publicly available “facts”, as analogous to the copyright on a phone book.

Modulo training data being congruent to facts.

For youngsters who’ve never seen a phone book, or oldsters who’ve forgotten what it was, it was an organized cross-reference list of phone numbers and subscribers.

The individual telephone numbers were the facts. I’m not sure where the subscriber name and address facts stood legally; maybe there was telecom law to force disclosure? Anyway, these items were facts, and facts are not copyrightable.

But the assembled directory was copyrightable.

So by analogy, a true open source AI would make the software, and the training data, freely available.

In the open source model, the entity that created the weights could make money renting or selling the assembled weights.

Stefano Maffulli • November 10, 2024 4:22 PM

I’m the executive director of the Open Source Initiative. I’m surprised by a couple of your assertions and I’d like the opportunity to add more elements so your readers can form their own opinions.

1) Corporations like Meta are vehemently opposing the Open Source AI Definition, defending their “privilege” to define Open Source AI the way they see fit.
2) The OSAID doesn’t allow for secret training data and mechanisms, as you and the sources you quote claim. Quite the contrary: to be called Open Source, an AI system must release its weights, all the code used to build it, and all the data that one can legally share.

This is what the Definition says precisely for code:

The complete source code used to train and run the system. The Code shall represent the full specification of how the data was processed and filtered, and how the training was done.

And this is what it says about data:

Sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system.

There are a few actors repeating FUD and misinformation in bad faith, playing into Meta’s argument that “there are multiple definitions of Open Source AI” and therefore theirs is as good as anybody else’s.

The FAQ maintained by the OSI has a more detailed explanation of why the data requirement is written that way. The gist of it can be summarized by these questions: Do you want Open Source AI built on medical data? Can you imagine an Open Source AI built with federated learning? If the answer is Yes, let’s discuss how to improve the text of the Definition in future versions.

Paul Sagi • November 10, 2024 4:48 PM

“lurker” and “cls” made what I think are pertinent comments.

Rontea • November 10, 2024 5:19 PM

I am afraid that without legislation, we will not be able to define the word open.

Winter • November 10, 2024 6:00 PM

@lurker

When some datasets have been built by legally and/or morally dubious methods, good luck with building your own by the same methods.

The whole problem is that the idea and formulation of Open Source is based on copyright law, more specifically, copyright law of textual data.

LLM/AI is software + data. The point of AI is that the training data are “environmental” data. They are what you find in the world outside of the software.

And these data sets are not simply subject to copyright law and licensing, or patent law and licensing. They are subject to other laws covering privacy, health care, and community rights. Data that cannot be shared freely.

And what use is a medical AI when it cannot be trained on real data of real patients with real life privacy needs? But the same AI can be extremely valuable for study, adaptation, and application development without the original data.

Furthermore, current datasets have grown so large that “distributing” them has become a mostly theoretical option. With text data sets at the multi-TB scale, distribution and processing have become a burden most users cannot bear anyway.

There is a real need for more distinctions between closed, fully proprietary LLMs and fully Open Data based AI. Many AI systems that are FLOSS in the sense of the Open Source AI Definition’s four freedoms, but come without the original data, are already extremely useful and valuable.

So an OpenSource LLM can be something, but an OpenSource AI is an oxymoron.

Open AI = FLOSS + Open Data[1]

[1] ‘https://opendatahandbook.org/guide/en/what-is-open-data/

Clive Robinson • November 10, 2024 7:08 PM

@ Bruce, ALL,

Re : How tough is the hide on the pachyderm in the room?

As several commenters have noted there is a series of issues to do with the process of,

“Input data to network weights”

However whilst this is serious, much of it is based on a very flawed assumption,

‘Input Data to network weights transformation is a “One Way Function”(OWF).’

For various reasons I think this is a wrong assumption.

Part of the assumption is that the input data size is many many times the network weights size.

Thus there is the notion of the transformation function being a “compression function” that is “implicitly one way”.

When you think about it, it’s potentially a false assumption, in that the transformation function may mostly be a “deduplication function”. Thus removing redundancy in the input data, but not in any way changing the “mapping” of inputs to outputs. Thus leaving the mappings reversible.
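
As a crude sketch of why deduplication on its own is not one way (toy Python, nothing to do with any actual training pipeline): a deduplicating encoder keeps the full mapping around, so the original can be reconstructed exactly.

    def dedup_encode(tokens):
        # Keep one copy of each distinct token plus the index sequence.
        vocab, index, coded = [], {}, []
        for t in tokens:
            if t not in index:
                index[t] = len(vocab)
                vocab.append(t)
            coded.append(index[t])
        return vocab, coded

    def dedup_decode(vocab, coded):
        # Nothing was lost: the original sequence comes back exactly.
        return [vocab[i] for i in coded]

    corpus = "the cat sat on the mat the cat".split()
    vocab, coded = dedup_encode(corpus)
    assert dedup_decode(vocab, coded) == corpus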

We’ve seen this “assumption of OWF” in the past with anonymizing data in databases. To cut a long story short, Prof. Ross Anderson showed that there was a threshold issue.

That is based on two basic points,

1, The more you try to anonymize data the less useful it becomes.

And importantly,

2, Anything that remains useful in the anonymized data allows the anonymization to be reversed by additional data (even if that is also anonymized).

Thus it will in effect always be possible to deanonymize “useful data”…
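
The classic demonstration of point 2 is a linkage attack: join two independently “anonymized” releases on whatever quasi-identifiers were left in to keep them useful. A toy sketch with entirely made-up records (Python):

    # Entirely made-up records, kept "useful" via zip/dob/sex quasi-identifiers.
    medical = [
        {"zip": "12345", "dob": "1970-03-14", "sex": "F", "diagnosis": "hypertension"},
        {"zip": "12346", "dob": "1981-11-02", "sex": "M", "diagnosis": "asthma"},
    ]
    electoral = [
        {"zip": "12345", "dob": "1970-03-14", "sex": "F", "name": "A. Example"},
    ]

    def link(a, b, keys=("zip", "dob", "sex")):
        # Join the two "anonymized" data sets on the attributes they share.
        return [{**ra, **rb} for ra in a for rb in b
                if all(ra[k] == rb[k] for k in keys)]

    print(link(medical, electoral))   # name and diagnosis now sit in one record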

Whilst it’s not yet been demonstrated, I suspect it will not be very long before we see the weights reversed.

As has been observed in the past,

“Water runs down hill, this enables you at the very least to walk uphill backwards toward the source.”

If people think back to “cracking passwords”, it was always possible, no matter how complex the algorithm, to build an input-to-output table using a brute force search. The search was drastically reduced by using a “dictionary” or similar, where the input set had “implicit meaning” and was thus a tiny, tiny fraction of the brute force search.
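
For anyone who never saw it in practice, the dictionary trick in miniature (Python sketch, hypothetical hash target and word list):

    import hashlib

    def crack(target_hex, wordlist):
        # Rather than brute forcing every possible string, try only inputs
        # with "implicit meaning" -- here, a short word list.
        for word in wordlist:
            if hashlib.sha256(word.encode()).hexdigest() == target_hex:
                return word
        return None

    words = ["letmein", "password", "hunter2", "correcthorse"]
    target = hashlib.sha256(b"hunter2").hexdigest()
    print(crack(target, words))   # "hunter2", found in a tiny fraction of the full search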

Obviously the AI input data set for any particular topic has “implicit meaning”. Thus it is actually a very tiny fraction of the potential input set, and quite likely reproducible.

But unlike “encryption” and the addition of “salts” to heavily randomise password databases, the AI network weights have to have meaning and in effect be subject to the equivalent of running downhill predictably.

This loosely implies that you can observe the weights transform and thus find out what went into the AI system, its order, and other information.

It may not be practical to “just brute force”, but I suspect others will find ways to see how the weights evolve and thus bring a “search” within range of what is possible, to the extent that the input data will start to become deanonymized to “a useful extent”.

ResearcherZero • November 11, 2024 1:46 AM

They open sourced the data they put into their closed source products, and added more bugs.

Winter • November 11, 2024 3:19 AM

@Clive

‘Input Data to network weights transformation is a “One Way Function”(OWF).’

This is not an assumption in the field. It has been shown time and again that deep learning networks leak training data.

‘https://arxiv.org/abs/2402.04013
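
One common way such leakage is measured is loss-based membership inference: if a model is markedly more confident on a candidate text than on comparable text it has never seen, that candidate was probably in the training set. A rough sketch (toy numbers, not the method of any particular paper):

    import math

    def avg_nll(token_probs):
        # Average negative log-likelihood a model assigns to a text's tokens.
        return -sum(math.log(p) for p in token_probs) / len(token_probs)

    def looks_like_training_data(candidate, reference, margin=1.0):
        # Flag the candidate if the model is much more confident on it
        # than on comparable held-out text.
        return avg_nll(candidate) + margin < avg_nll(reference)

    # Hypothetical per-token probabilities reported by some model:
    suspiciously_well_predicted = [0.9, 0.8, 0.95, 0.85]
    held_out_text = [0.2, 0.1, 0.3, 0.25]
    print(looks_like_training_data(suspiciously_well_predicted, held_out_text))   # True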

Matthew Newton • November 11, 2024 5:32 AM

The way the OSI is trying to redefine “open source AI” is straight-up bogus. They’re letting the industry pull a fast one, slapping an “open source” label on stuff that’s about as transparent as the mud on my boots. It’s like saying, “Yeah, you can have the car, but we’re keeping the keys.”

When it comes to AI, the training data is the source code. That’s the heart and soul of these models. Hiding it means you’re just feeding us a black box wrapped in pretty packaging. Models like LLaMA pretend to be “open source” while actually keeping critical pieces locked up tighter than a Texas liquor store on a Sunday. If you can’t see what’s under the hood, how are we supposed to trust these systems not to be biased, corrupt, or just plain messed up?

Now, sure, OSI’s got a point about privacy and legal stuff. You can’t just release medical records or sensitive Indigenous data for the sake of transparency. But c’mon, don’t twist the meaning of open source just to make corporations happy. Let’s call it something else—like “open weights” or “semi-see-through AI.” Leave the “open source” label for what’s really open.

Here’s how we fix this mess: tighten up the definition of “open source” and stop letting industry bigwigs play fast and loose with it. Invest in federated learning and privacy-preserving techniques so we can have transparent models without airing out everyone’s secrets. And, for the love of punk rock, let’s incentivize the real open-source rebels out there with funding, grants, and whatever else it takes to keep them going.

In the end, calling these half-open AI models “open source” is like ordering a Texas BBQ plate and getting a salad. Sure, it’s food, but it sure as hell ain’t what we signed up for.

Garabaldi • November 11, 2024 8:43 AM

@Winter

LLM/AI is software + data. The point of AI is that the training data are “environmental” data. They are what you find in the world outside of the software.

As I understand it, training an LLM/AI takes a significant percentage of the world’s computation. Given the software + data, would it actually be practicable for an organization that is not in the UN or FAANG to recreate the weights?

At some level the effort needed to train the model, to “compile” the data into weights so the AI can run on a phone, deserves consideration in its own right.

How useful is Open Source if you can’t run the compiler?

Perhaps

LLM/AI is software + data + computation

Almost unattainable amounts of computation.

Winter • November 11, 2024 12:50 PM

@Garabaldi

As I understand it, training an LLM/AI takes a significant percentage of the world’s computation.

That question is orthogonal to the question of how to license the product and the components to build it.

But indeed, distributing the compiled weights and binaries is the more useful thing in almost every case.

Still, having the option to rebuild the system from scratch, with modifications, is extremely important to keep everyone “honest” and protect the freedom to innovate.

Clive Robinson • November 11, 2024 5:02 PM

@ Bruce, Students in need of a project and other researchers.

At the moment there is an argument about who owns the input data to AI ML systems.

That is who has the IP to,

1, The original corpus data.
2, The current users questions.
3, Previous users questions.
4, The output from the LLM.

Obviously the AI LLM and ML systems, being “Deterministic Systems”, do not “create” original material. All output is based on 1-3. Thus the owner/operator has no claim on the IP of 1-3, and 4 is at best a “derivative work” “lacking in originality”. Thus in effect plagiarism.

The “weights” in the LLM can be seen as the original corpus data, deduplicated and encoded. Arguably this does not take away the IP of the original corpus data.

But an interesting thought arises.

The weights are clearly a derivative work but kind of in the sense of “using a different font” or swapping words with a thesaurus, rather than building on the corpus by “adding or creating” new “meaning or reasoning”.

A question thus arises as to whether the current AI LLM systems can pick out what is actually the IP and work out who the originator holding that IP actually is.

That is can someone come up with an automated system that in effect does the,

“Who watches the watchers”

Activity to,

1, Keep other AI systems “honest”.
2, Check data in the input corpus for IP before it gets added to the weights.
3, Minimises “leak through” from one user’s input to other users’ output.
4, Checks AI output for others’ IP.

Whilst fairly easy to state as potential objectives, I suspect that achieving any of them reliably will be difficult in practice.

As those with a long term interest/association with AI systems will know, they were in effect “in the doldrums” prior to the “transformer” identifying where the “attention” is.

Thus a question arises as to what relationship “actual IP” has with the smaller parts of the input data that are seen as the “attention”.

That is, in effect you distill the input corpus down by stripping off the semantic and syntactic error-correcting fluff and redundancy, leaving just the “raw information” that forms the IP, which can then be used as a comparator.

Clive Robinson • November 11, 2024 5:06 PM

@ Daniel Popescu,

Thank you for the compliment.

Now of course we all have to “earn it” by being even more “fascinating”.

I guess I’ll have to put on my thinking cap 😉

Nick Vidal • November 11, 2024 7:29 PM

The OSAID advocates for full transparency of the training data, including its provenance. Also, the source code with full specification of how the data was processed and filtered, and how the training was done, is required.

The OSI has reached out to several organizations and individuals over the course of two years to draft v1.0.0 of the OSAID, from legal experts and open data foundations to companies and academia.

Lastly, I would like to recommend a very informative interview with Percy Liang, Associate Professor of Computer Science at Stanford University and the director of the Center for Research on Foundation Models (CRFM). He is also a co-founder of Together AI.

https://press.airstreet.com/p/percy-liang-on-truly-open-ai

Clive Robinson • November 12, 2024 7:20 PM

@ Winter,

With regards,

“This is not an assumption in the field. It has been shown time and again that deep learning networks leak training data.”

I think you are misreading what I wrote.

Yes, LLMs leak “training data” in their output; this is an expected consequence of the way they work.

What is assumed is that you can not “work the weights backwards” to find the “training data”. Which is something altogether different.

It’s one of the reasons that the many types of “bias” can be hidden away, and it’s been said in the past that the weights “can not be explained to judges”.

As I indicated, I’ve reason to believe that the learning process is not just deterministic; I think the way the weights evolve is sufficiently predictable that you can actually watch them evolve during a training run.

That is, after each token is added, the changes in the weights will act as a discriminator that can be used to determine what each token was and in what order the various tokens were added. Thus the input corpus is identified.

Interestingly, I suspect that the weights could be reduced from, say, 32-bit floats down to single bits (think about the “rectifier function” to see why). Thus “Hamming weights” could be used, with simple XOR and AND gates in an adder or a population (1’s) count on the vectors, rather than having to use multipliers.
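
As a rough sketch of the XOR-and-popcount idea (plain Python, a hypothetical pair of 4-component ±1 vectors packed into integers):

    def binary_dot(a_bits, b_bits, n):
        # Encode +1 as a 1 bit and -1 as a 0 bit. For such +/-1 vectors,
        # dot(a, b) = n - 2 * HammingDistance(a, b), and the Hamming distance
        # is just a popcount of the XOR -- no multipliers required.
        return n - 2 * bin(a_bits ^ b_bits).count("1")

    a = 0b1011   # (+1, -1, +1, +1)
    b = 0b0110   # (-1, +1, +1, -1)
    print(binary_dot(a, b, 4))   # -2, the same as the floating point dot product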

The thing about “Hamming Distances” is they have interesting features that make them amenable to not just “error correction” but other “coding” processes.

There has been research on using them to break certain types of linear codes both effectively and quickly. Which is why quite a few post WWII stream cipher generators became vulnerable.

I suspect if you hunt around you will find supporting evidence sufficient for you to have a think on.

Giacomo Tesio • November 13, 2024 3:46 AM

@StefanoMaffulli

to be called Open Source, an AI system must release […] all the data that one can legally share.

Except that the OSAID allows unshareable non-public data, turning any black box into an “open source AI”. This is open washing, plain and simple.

If you cannot share the data that algorithmically produced the weights that define the behavior of the inference engine, your AI system is not open for any reasonable definition of “open”. When something is open, you can look inside, but you can’t look inside an AI model without the training data.

And as I explained to Carlo Piana (OSI board member) in a now censored post, you cannot really modify an LLM without training data: for example, you cannot reduce its vocabulary as you like, which is a totally reasonable and usable modification.

The gist of it can be summarized by these questions: Do you want Open Source AI built on medical data? Can you imagine an Open Source AI built with federated learning? If the answer is Yes, let’s discuss how to improve the text of the Definition in future versions.

You should hire more competent lawyers: you can totally build open source AI on top of medical data; you just need to collect proper consent from the people who are described by such data.

There is no need to choose between data protection and powerful AIs: it’s just a false dichotomy spread by Big Tech lobbyists.

Corporations like Meta are vehemently opposing the Open Source AI Definition, defending their “privilege” to define Open Source AI the way they see fit.

OK Stefano, let’s bet a pizza with @Bruce that the moment the EU starts enforcing the AI Act and requires Meta to allow independent researchers to review its model training data, they will play the fool and capitulate to the OSAID, changing the license of their weights to MIT and writing a couple of fairy tales about the training data to sell as “sufficiently detailed information”.

That day, OSI will pose as the winner and Zuckerberg as the loser.

But in fact, Zuck will win, because by posing as “open source AI” he will avoid any legal or technical scrutiny from independent researchers over his data practices.

And like Meta, many others will do the same, adding a couple of pieces of PII to the dataset just to be able to elude any technical scrutiny and to plant all sorts of backdoors in the distributed models.

Win-win, isn’t it?

Giacomo Tesio • November 13, 2024 4:10 AM

@Garabaldi

How useful is Open Source if you can’t run the compiler?

Emacs was written on the PDP-10.
By 1970s standards a PDP-10 was just as expensive as the hardware required to train an LLM today.

Yet free software and the GPL were created to grant you the four freedoms.

Are you arguing that FOSS hasn’t proved its utility?

Winter • November 13, 2024 12:10 PM

Ibrahim Haddad of the Linux Foundation gives good arguments to distinguish two levels of “Open” for AI systems:

Embracing the Future of AI with Open Source and Open Science Models
‘https://lfaidata.foundation/blog/2024/10/25/embracing-the-future-of-ai-with-open-source-and-open-science-models/

While OSI’s work on defining “Open Source AI” is invaluable, our focus on these specific definitions is much narrower in scope, addressing the unique challenges and opportunities that arise from making AI models open and innovating with them in a community context.

Winter • November 13, 2024 3:01 PM

@Clive

Interestingly, I suspect that the weights could be reduced from, say, 32-bit floats down to single bits

Has already been done:

‘https://arxiv.org/abs/1802.08530

‘https://huggingface.co/papers/2402.17764

Clive Robinson • November 14, 2024 8:01 PM

@ Winter,

Funny you should mention Hugging Face; this might be of interest:

Margaret Mitchell, Hugging Face’s chief ethics scientist, was quoted by Bloomberg a couple of days ago:

https://www.bloomberg.com/news/articles/2024-11-13/openai-google-and-anthropic-are-struggling-to-build-more-advanced-ai

Basically both Margaret and the Bloomberg writer are throwing cold water on the “AGI tomorrow” notion that you just need to throw a little more coal in the boiler, to in effect explain why the “Big Three” of OpenAI, Google, and Anthropic have effectively stalled in their latest efforts to move AI forward as projected/promised.

It appears that “scaling” is not delivering as hyped and the curves are now looking less and less favourable.

That is, rather than Nobel Laureate or better output by the end of the decade, they are not even on track to reach pre-graduate level in general (hence, in part, a reason those trying to “cheat their way through” education coursework are getting caught).

Apparently it’s our fault, as we humans are not providing high-quality input for the Big Three to “read in”, or as others say “scrape”, to “steal”. In short, current AI LLM systems excel at regurgitating the known, not reasoning out the new.

But others point out that, due to the way the current models work, the very best they can achieve will always be “below average”, decidedly lagging behind the best of humanity by a long way.

When you think about it, it’s actually not really that surprising. It is kind of the low-average resulting output of the below-average input, and getting worse due to the pollution from other current systems…

Clive Robinson • November 16, 2024 6:27 PM

Can we trust the AI Industry?

The title of this thread,

AI Industry is Trying to Subvert the Definition of “Open Source AI”

Calls the “AI Industry’s” trustworthiness into question, and many would say rightly so.

Because it can be shown with little difficulty that they are trying to “hide the harms” that current AI systems cause. That is, very real harms to humans, and without the current AI having any “physical capabilities” or “agency” as such.

And it gets worse: where AI has been given physical capabilities, as in driverless vehicles, the harms have been quite horrifying, as a search on news articles will show. Also, the places that allowed driverless vehicles on the streets are now looking again at the decision to do so.

But as I’ve pointed out AI has as such no understanding of the environment it’s in. We humans are clearly bad at understanding let alone teaching the attached implications. So how do we expect even future AI systems to do so?

One answer is “as we do with human children” you just “throw them in at the deep end” and hope they “teach themselves” in time…

Back a lifetime ago Science Fiction became very popular. In part this was due to the jump forward in technology and science World War 2 had brought into the public view and mind.

Even though the idea of androids and robots long predated it, post-war science fiction started to bring the ethics of non-human systems into the mainstream, at the same time as Alan Turing was talking about the notion of Artificial Intelligence. He caused a bit of an upset by pointing out the logical fallacies of the theological arguments about God giving souls:

https://genius.com/Alan-turing-computing-machinery-and-intelligence-annotated

This was actually part of a more general air, and others thought about the ethics in other ways. One, Isaac Asimov, quite famously came up with the three fundamental laws of robotics.

For some reason they captured people’s minds, which was unfortunate. Without going into the depths of the problem, put simply they appeal to people because they are “human traits of morality, as seen by humans”. But if you think into the issue you have to ask why a non-human would have human morality, or any morality at all for that matter. That is, even in humans there is something much more basic that underpins morality, something that must at some very, very fundamental level “confer advantage” of some form. But we actually have no idea what it is, or if it’s general to everything, or specific to something. All we really know is that it is some elusive quality that gives humans direction and to some degree purpose.

So why on earth would anyone actually think throwing some automaton in at the deep end and seeing if it can pick up human morality and impose it on itself is going to happen?

(It’s kind of like the joke about striking a match to examine a barrel of gun powder…)

But aside from the fact that I very much doubt current LLM/ML systems are capable of doing this, for fairly obvious reasons (I’ve mentioned before), that has not stopped some of those in the AI industry “running with scissors” or worse.

And people are starting to talk about it as the harms are being highlighted in the public gaze by the press,

https://www.theregister.com/2024/11/16/chatbots_run_robots/

But importantly the AI researchers are “not learning” from “known” serious issues, and are just “running ahead” without care or consideration.

As can be seen, the view is that in this respect the AI industry really cannot be trusted.

mr. Foss • November 26, 2024 5:28 PM

Because we want Open Source AI to exist also in fields where data cannot be legally shared

That’s like defining Good People as saviors, altruists and, heck, pirates, because we want Good People to exist also in lawless places. Such tactical naming isn’t going to bring about any good will to said lawless places.

But a definition is just a jargon element, so it can’t be bad, right? Wrong. As to why we do not want the definition of open source to include closed-data AI models: For the last few decades, people have been writing about the benefits of free and open source software:

(1) FOSS provides for individual autonomy by allowing individuals to take control over the behavior of the software after inspecting and understanding it;
(2) FOSS provides for greater safety (particularly cyber-security) and transparency in general (e.g. absence of backdoors) by leveraging public review/oversight; and
(3) FOSS is significantly instrumental to the advancement of the human race, for which the developers are owed due credit and recognition for their good will; plus,
(4) copyright infringement in FOSS can be readily detected in the plain source code.

These ideas about open source are now tacitly ingrained in everyone’s mind, cf. “it’s open source, so it must be safe”, and let’s not forget about worshipping historical figures of FOSS. But if we now change the definition to also include closed-data AI, then what? People, at large, will be thinking about points (1) to (4) without realizing that they no longer hold, and they might bow down in front of closed-data AI model developers! This is brainwashing! In reality, LLaMA doesn’t allow the public to scan it for backdoors, StableDiffusion is the BitTorrent technology of the 2020s for art piracy, and Zuckerberg is really just unworthy of praise!

Response to a message by Winter and another message by Winter:

Listen, it doesn’t matter if I’m not sharing the data with you
– because it’s statutory-wise legally prohibited (i.e. by privacy laws);
– because it’s prohibited by an NDA or in the licensing terms;
– because I want to cover my ass; or
– because I just don’t want to, for business reasons or otherwise.
And it doesn’t matter if it’s data or source code. If I’m not sharing it, it’s not open.

This fact doesn’t force the data to be CC-licensed, just that the AI model can’t be called open.

Response to message by Stefano Maffulli:

I don’t know if you’re really a human from the OSI or just a bot, but you’re contradicting yourself.

For one, you say that “The OSAID doesn’t allow for secret training data”. But then you quote the definition, which only requires “Sufficiently detailed information about the data”, not the data itself.

Response to message by Clive Robinson:

Yes, you can also reverse-engineer binary code.
