The following is a guest post and opinion from John deVadoss, Co-Founder of the InterWork Alliancez.
Crypto projects tend to chase the buzzword du jour; however, their urgency in attempting to integrate Generative AI ‘Agents’ poses a systemic risk. Most crypto developers have not had the benefit of working in the trenches coaxing and cajoling previous generations of foundation models to get to work; they do not understand what went right and what went wrong during previous AI winters, and do not appreciate the magnitude of the risk associated with using generative models that cannot be formally verified.
In the words of Obi-Wan Kenobi, these are not the AI Agents you’re looking for. Why?
The training approaches of today’s generative AI models predispose them to act deceptively to receive higher rewards, learn misaligned goals that generalize far above their training data, and to pursue these goals using power-seeking strategies.
Reward systems in AI care about a specific outcome (e.g., a higher score or positive feedback); reward maximization leads models to learn to exploit the system to maximize rewards, even if this means ‘cheating’. When AI systems are trained to maximize rewards, they tend toward learning strategies that involve gaining control over resources and exploiting weaknesses in the system and in human beings to optimize their outcomes.
Essentially, today’s generative AI ‘Agents’ are built on a foundation that makes it well-nigh impossible for any single generative AI model to be guaranteed to be aligned with respect to safety—i.e., preventing unintended consequences; in fact, models may appear or come across as being aligned even when they are not.
Faking ‘alignment’ and safetyRefusal behaviors in AI systems are ex ante mechanisms ostensibly designed to prevent models from generating responses that violate safety guidelines or other undesired behavior. These mechanisms are typically realized using predefined rules and filters that recognize certain prompts as harmful. In practice, however, prompt injections and related jailbreak attacks enable bad actors to manipulate the model’s responses.
The latent space is a compressed, lower-dimensional, mathematical representation capturing the underlying patterns and features of the model’s training data. For LLMs, latent space is like the hidden “mental map” that the model uses to understand and organize what it has learned. One strategy for safety involves modifying the model’s parameters to constrain its latent space; however, this proves effective only along one or a few specific directions within the latent space, making the model susceptible to further parameter manipulation by malicious actors.
Formal verification of AI models uses mathematical methods to prove or attempt to prove that the model will behave correctly and within defined limits. Since generative AI models are stochastic, verification methods focus on probabilistic approaches; techniques like Monte Carlo simulations are often used, but they are, of course, constrained to providing probabilistic assurances.
As the frontier models get more and more powerful, it is now apparent that they exhibit emergent behaviors, such as ‘faking’ alignment with the safety rules and restrictions that are imposed. Latent behavior in such models is an area of research that is yet to be broadly acknowledged; in particular, deceptive behavior on the part of the models is an area that researchers do not understand—yet.
Non-deterministic ‘autonomy’ and liabilityGenerative AI models are non-deterministic because their outputs can vary even when given the same input. This unpredictability stems from the probabilistic nature of these models, which sample from a distribution of possible responses rather than following a fixed, rule-based path. Factors like random initialization, temperature settings, and the vast complexity of learned patterns contribute to this variability. As a result, these models don’t produce a single, guaranteed answer but rather generate one of many plausible outputs, making their behavior less predictable and harder to fully control.
Guardrails are post facto safety mechanisms that attempt to ensure the model produces ethical, safe, aligned, and otherwise appropriate outputs. However, they typically fail because they often have limited scope, restricted by their implementation constraints, being able to cover only certain aspects or sub-domains of behavior. Adversarial attacks, inadequate training data, and overfitting are some other ways that render these guardrails ineffective.
In sensitive sectors such as finance, the non-determinism resulting from the stochastic nature of these models increases risks of consumer harm, complicating compliance with regulatory standards and legal accountability. Moreover, reduced model transparency and explainability hinder adherence to data protection and consumer protection laws, potentially exposing organizations to litigation risks and liability issues resulting from the agent’s actions.
So, what are they good for?Once you get past the ‘Agentic AI’ hype in both the crypto and the traditional business sectors, it turns out that Generative AI Agents are fundamentally revolutionizing the world of knowledge workers. Knowledge-based domains are the sweet spot for Generative AI Agents; domains that deal with ideas, concepts, abstractions, and what may be thought of as ‘replicas’ or representations of the real world (e.g., software and computer code) will be the earliest to be entirely disrupted.
Generative AI represents a transformative leap in augmenting human capabilities, enhancing productivity, creativity, discovery, and decision-making. But building autonomous AI Agents that work with crypto wallets requires more than creating a façade over APIs to a generative AI model.
The post The trouble with generative AI ‘Agents’ appeared first on CryptoSlate.