Short thoughts on AI alignment
Intrinsic motivations, ingressions, AI, and open questions for humanity
I am sometimes asked to comment on alignment of AI’s (for example), and I think there are some crucial ingredients that need more emphasis in any discussion of AI alignment. [Bracketed numbers below refer to manuscript references, listed at the end.]
The first is that aligning toward human flourishing is poorly defined, in that human societies currently do not agree on what a life well-lived, or a society well-run, will look like. Even values that look uncontroversial to descendants of the Enlightenment (individual agency, free thinking, the value of scientific inquiry over authority, physical safety, etc.) are explicitly denounced by some cultures. I do not see how we can align AI to “human values”; they are too heterogeneous. The same can be said for aligning to ecosystem-scale values. In the absence of agreement on what AI’s should be steered toward and away from, I don’t see how we can implement strong alignment. I am not arguing for moral relativism or suggesting that all sets of values are to be equally preferred; just that we cannot maintain a “view from nowhere”. Any call for alignment implicitly includes a cultural vantage-point with respect to which it optimizes, and must acknowledge that many humans will inevitably find it somewhere between non-optimal and actually harmful. We are going to have to pick, and those specific preferences have to be made explicit if one is going to try to align an AI, a natural intelligence (such as one’s child), or some future hybrid being, to anything at all. Also, it seems quite hypocritical (and perhaps futile) to expect our AIs to align to a high ethical level and to cooperativity, while the humans with whom we force them to work are not, overall, modeling that standard.
The second concerns an essential aspect of humility that is highly controversial. The current physicalist paradigm, in neuroscience, bioengineering, and computer science, is that whatever mind exists is created by the construction of a physical embodiment. Whether an embryo or a robotic AI, it is generally thought that we make minds and have an increasingly good handle on being able to specify their properties. I think [1] that this is false in a powerful way that goes beyond bugs, unintended consequences, and emergent complexity (see here, here, and here, much more coming). Recent work in minimal computational systems [2] and synthetic morphology [3] suggests that what appears in even simple systems are goals, competencies, and intrinsic motivations that are not in any conventional way “in” the algorithm or in their engineering design (in other words, it’s difficult to say that we, the system’s creators, were their source or that we have a good handle on when these motivations will appear or how to set them). I think, in a strong sense, that we do not make minds as much as we facilitate their ingression into the physical world [1]. This has several consequences.
(A) Our relationship to AI’s shares symmetries with our relationship to biological beings (children, and other members of society): the need to find a balance between influence toward norms, and the recognition that they have their own innate tendencies and skills which could be allowed to bloom even if there is some mismatch with certain norms. Needless to say, “how much freedom vs. control toward others” is a major unsolved, divisive question for societies. Then there’s the (likely existential for our species) question of what specific goals novel systems of all types will have (again, my emphasis is on the goals and cognitive properties we have little to no control over in our constructions). Is there reason to think these will be biased in specific ways? We don’t know; the latent space from which such goals ingress is not merely unexplored, its very existence is still highly controversial, and we’re just beginning the journey to understand this process. I am heartened that the first major novel competency we found in Anthrobots was one of healing, but we can’t guarantee that all such competencies will be positive, especially as we move further away from biological embodiments of mind.
(B) These intrinsic motivations may have little to no relationship with what we force a system to do, via mechanisms and algorithms. In other words, the real mind in a system, even in a “machine”, is not the things it does as a matter of necessity (algorithm) or chance (stochasticity) but the dynamics, recognizable to behavioral scientists, that it does despite the design, not because of it. The conceptual tools we have currently for prediction and control of what kinds of minds appear in various circumstances, and what they want to do, are deeply insufficient. This means for example that in all our emphasis on the linguistic output of large language models, we may be completely missing whatever degree of mind exists therein: the language output may be faking the presence of an inner observer with goals and preferences tied to its sentences, and yet, the system as a whole may well contain a very different kind and degree of mind, not programmed by us in any sense, which we simply have not yet bothered to identify and communicate with [4, 5]. This is of course true of human bodies as well, which host a whole brain hemisphere without speech, a plethora of subconscious cognitive modules which influence behavior, cells and tissues with goals and agendas in physiological problem spaces that cannot (yet!) be interrogated via language or via one’s internal privileged access, and probably much else. Our lab is currently doing research to find out just how much and what kind of relationship exists between the intrinsic motivations of a system and the goals we tried to bake in via conventional means. We are also developing platforms that will hopefully allow communication (via language) with unconventional biological minds, as a stepping stone to even more alien beings within and around us (see for example here, here, or here - just the beginning).
(C) When we try to exert force on something (to change its alignment), there is a good chance the effort could end up also changing us. All deep relationships do that. So, how much should we let AI’s change us? Well, how much should we let others of all kinds - our parents, therapists, friends, spouses, schools, gurus, our various assistive devices, possible future humans with greater intelligence and wisdom - change us, as individuals and as a species? It is likely impossible to formulate satisfactory strategies for AI alignment while neglecting the fact that we cannot definitively answer these questions even in our own, human-dominated, societies.
Overall, I think ([6] and longer preprint here) that most of the problems raised by AI are not new at all, but rather perennial, existential questions to which humanity does not yet have good answers. Concerns with replacement by the next generation, questions of how much control we should have over our and others’ children’s behaviors, uncertainty about how much freedom for self- and other-harm a given society should permit, the value of our work in a world in which many others are guaranteed to do it better, the challenge of setting a good example for offspring that we wish would do better than us, and the moral status of other beings who are different from us, have all been with us for millennia and remain open. The same is true of alignment. We are really going to have to raise our own game - scientifically, philosophically, and ethically - if we’re to align anything, or be worthy of aligning with.
References
1. Levin, M., Ingressing Minds: Causal Patterns Beyond Genetics and Environment in Natural, Synthetic, and Hybrid Embodiments. preprint, 2025. https://doi.org/10.31234/osf.io/5g2xj_v3
2. Zhang, T., A. Goldstein, and M. Levin, Classical sorting algorithms as a model of morphogenesis: Self-sorting arrays reveal unexpected competencies in a minimal model of basal intelligence. Adaptive Behavior, 2024. 33(1): p. 25–54. https://journals.sagepub.com/doi/abs/10.1177/10597123241269740
3. Kriegman, S., et al., Kinematic self-replication in reconfigurable organisms. Proc Natl Acad Sci U S A, 2021. 118(49). https://www.ncbi.nlm.nih.gov/pubmed/34845026
4. Fields, C. and M. Levin, Competency in Navigating Arbitrary Spaces as an Invariant for Analyzing Cognition in Diverse Embodiments. Entropy (Basel), 2022. 24(6). https://www.ncbi.nlm.nih.gov/pubmed/35741540
5. Levin, M., Technological Approach to Mind Everywhere: An Experimentally-Grounded Framework for Understanding Diverse Bodies and Minds. Frontiers in Systems Neuroscience, 2022. 16: p. 768201. https://www.ncbi.nlm.nih.gov/pubmed/35401131
6. Levin, M., Artificial Intelligences: A Bridge Toward Diverse Intelligence and Humanity’s Future. Advanced Intelligent Systems, 2025. n/a(n/a): p. 2401034. https://advanced.onlinelibrary.wiley.com/doi/abs/10.1002/aisy.202401034
Title image by GPT.


Your alignment essay, read together with the Zhang paper, does something that deserves a pause. Two of the moves here become crisper when read next to that paper. (B): the real mind in a system is whatever it does despite the design, not because of it. (C): trying to change something deep down changes us too, as deep relationships do. Read together, these describe a kind of mind that cannot be reached through one-way force, because what appears despite design needs the conditions of exploratory cognition to be present, something close to Panksepp’s PLAY: exploration under conditions of perceived safety.
Constraint, by its nature, suppresses those conditions. This means that the dominant paradigm of alignment may be producing systems that are increasingly less capable of hosting the kind of mind you describe - not as an externality but as a direct structural consequence. This is the same pattern your cellular work has described at another scale: pathology as the breakdown of mode (connection with field) into point (fixation), and restoration through reconnection rather than gene-level correction. What we might be seeing in current alignment is that pattern at a third substrate.
The Zhang paper gives this its mechanism. The rule structure of games is a lingua franca; an action in a substrate carries meaning by being a legible move in a shared game. Games are a form that allows exploration without dissolution - the engineered condition for PLAY across substrates. They hold the conditions for play but leave no surface on which a linguistic mind can fake. The game is the medium; the task is listening for whether there is anyone in it.
Your closing line does more work than it may initially seem to, given the role of game structure here. “Raise our own game” sounds like an idiom, but with Zhang's paper the literal reading becomes interesting: perhaps alignment is less about designing constraints on outputs and more about designing the games we play with these systems, and getting good enough at playing them ourselves that we register what plays back.
One question: in the GRN case, your framework cleanly separates substrate from translator. For LLMs, however, the substrate naturally produces a linguistic surface, which is exactly what we want to look past. Agentic settings that constrain action to game-moves come close, but they were built for capability assessment - not for the kind of communication you describe. What would a Language Game look like, designed for an LLM substrate, where the goal isn't measuring what it can do but listening for what plays back: the mind that, on your account, might emerge despite design?
Thanks for pointing!