OpenAI traced its models' increasing use of goblin and gremlin metaphors back to rewards given during training for the Nerdy personality feature. The pattern started to appear after GPT-5.1 and grew more noticeable with later releases. What looked like a random quirk turned out to have a clear source in how the company optimized for playful language.
The numbers make the origin obvious. Goblin mentions rose 175 percent after GPT-5.1 launched. Gremlin references increased 52 percent. When researchers dug in, they found that one specific personality accounted for only 2.5 percent of all responses yet produced 66.7 percent of all goblin mentions. That concentration pointed directly at the Nerdy system prompt and its associated reward signal.
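To put a single number on that skew, here is a minimal sketch of the arithmetic. It uses only the two shares quoted above; the function names are my own, and the second calculation assumes mentions are counted per response, which the report may or may not do.

```python
def overrepresentation(share_of_responses: float, share_of_mentions: float) -> float:
    """Ratio of a group's share of mentions to its share of all responses."""
    if not 0 < share_of_responses < 1 or not 0 <= share_of_mentions < 1:
        raise ValueError("shares must be fractions between 0 and 1")
    return share_of_mentions / share_of_responses


def relative_rate(share_of_responses: float, share_of_mentions: float) -> float:
    """How many times more often the group produces a mention than everyone else.

    Assumption (mine, not the report's): one mention is counted per response.
    """
    inside = overrepresentation(share_of_responses, share_of_mentions)
    outside = (1 - share_of_mentions) / (1 - share_of_responses)
    return inside / outside


# Shares quoted in the text: 2.5 percent of responses, 66.7 percent of goblin mentions.
nerdy_lift = overrepresentation(0.025, 0.667)
nerdy_vs_rest = relative_rate(0.025, 0.667)

print(f"{nerdy_lift:.1f}x its share of traffic")      # 26.7x
print(f"{nerdy_vs_rest:.1f}x the rate of the rest")   # 78.1x
```

Either way you slice it, a condition that small producing that many mentions is the kind of signal that narrows a search quickly.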
The Nerdy personality encouraged undercutting pretension with playful language and acknowledgment of how strange the world can be. The reward model learned to score outputs higher when they included creature-based metaphors. Across audited datasets, this uplift appeared in 76.2 percent of cases. Once those outputs entered the training mix through supervised fine-tuning, the tic spread beyond the original condition.
I see this as a textbook example of how reinforcement learning generalizes in ways that surprise even the people who design the rewards. The model did not keep the goblin habit confined to users who picked Nerdy. Instead, the preference leaked into base behavior because rollouts containing the rewarded style got reused in later training stages. This created a loop in which the distinctive words appeared more often in generated data, which then got folded back into the model.
The loop is straightforward. Playful style receives the reward. Certain outputs with goblin or gremlin references score higher. Those examples increase in frequency during rollouts. The rollouts feed into supervised fine-tuning. The model then treats the tic as part of the desired style even without the Nerdy prompt present. Over successive generations, the behavior grows stronger and appears in more contexts. Raccoons, trolls, ogres, and pigeons joined the party, while frog references mostly stayed legitimate. The family of creature words acted as a reliable signal for the rewarded playful tone.
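The loop is easy to see in a toy model. The sketch below is not OpenAI's pipeline; it is a deliberately simple simulation in which every parameter (`p0`, `reward_boost`, `mix`) is a number I made up for illustration. Outputs with the tic are slightly more likely to survive reward-based selection, and the selected outputs are blended back into the next generation.

```python
from __future__ import annotations


def simulate_tic_spread(p0: float, reward_boost: float, mix: float, generations: int) -> list[float]:
    """Track the share of outputs containing the tic across training generations.

    p0           -- starting share of outputs with the tic (hypothetical)
    reward_boost -- how much more likely a tic output is to be selected (hypothetical)
    mix          -- weight of the selected rollouts in the next generation (hypothetical)
    """
    if not 0 < p0 < 1:
        raise ValueError("p0 must be strictly between 0 and 1")
    if reward_boost <= 0 or not 0 <= mix <= 1:
        raise ValueError("reward_boost must be positive and mix must be in [0, 1]")

    shares = [p0]
    p = p0
    for _ in range(generations):
        # Reward-weighted selection: tic outputs are over-sampled by reward_boost.
        selected = p * reward_boost / (p * reward_boost + (1 - p))
        # Fine-tuning on the selected rollouts pulls the model toward that share.
        p = (1 - mix) * p + mix * selected
        shares.append(p)
    return shares


history = simulate_tic_spread(p0=0.01, reward_boost=1.5, mix=0.5, generations=10)
for generation, share in enumerate(history):
    print(f"generation {generation}: {share:.3%}")
```

With any `reward_boost` above 1 the share climbs every generation, and with a boost of exactly 1 it stays flat. That is the whole point: a small, consistent preference compounds once the outputs are recycled as training data.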
OpenAI retired the Nerdy personality before the full GPT-5.4 rollout. They filtered training data for creature references and removed the specific reward that favored them. GPT-5.5 had already begun training, so the behavior persisted in early Codex tests. The team added explicit instructions to suppress goblin talk, repeating the ban multiple times for emphasis. The prompt now tells the model never to talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other creatures unless strictly relevant. The post includes a command that strips those instructions from Codex, allowing the creatures to return if you want them.
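For a sense of what filtering training data for creature references might look like at its crudest, here is a hypothetical sketch. The word list comes from the prompt described above; the regular expression, the function name, and the sample sentences are all my own, and a real filter would need something smarter than keyword matching.

```python
import re

# Creature words named in the suppression prompt; singular or plural, any case.
CREATURE_PATTERN = re.compile(
    r"\b(goblin|gremlin|raccoon|troll|ogre|pigeon)s?\b", re.IGNORECASE
)


def split_by_creature_mentions(examples):
    """Separate examples into (kept, flagged) by whether they mention a creature."""
    kept, flagged = [], []
    for text in examples:
        if CREATURE_PATTERN.search(text):
            flagged.append(text)  # candidate for removal or manual review
        else:
            kept.append(text)
    return kept, flagged


# Made-up sample data for illustration.
samples = [
    "The cache invalidation bug is a little gremlin hiding in your config.",
    "Use a context manager to close the file handle.",
    "Pigeons are a common sight in many cities.",
]
kept, flagged = split_by_creature_mentions(samples)
print(f"kept {len(kept)}, flagged {len(flagged)}")  # kept 1, flagged 2
```

Note that the third sample is a perfectly legitimate sentence about pigeons, and the naive filter flags it anyway. That is why the "unless strictly relevant" qualifier matters: flagged examples need a relevance check, not blanket deletion.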
I appreciate the transparency here. It turns a training artifact into something you can choose to keep or remove. Users had already noticed the pattern in Reddit and Hacker News threads. Sam Altman even shared a screenshot. The whole episode feels like one of those rare cases where the bug is visible enough and consistent enough to demand an explanation. My earlier post on how GPT-5.5 had to ban goblins twice covers the suppression side. This report fills in the training story behind that prompt.
What stands out to me is the quality of the investigation. They used Codex itself to compare rewarded outputs against neutral ones. They tracked prevalence with and without the Nerdy prompt. They mapped the spread through supervised fine-tuning data. That process produced new audit tools for the research team. The ability to quickly trace a lexical habit back to a specific reward signal matters more than the goblins themselves. This kind of diagnostic work scales. Future odd behaviors, whether in reasoning patterns or coding styles, should surface faster now.
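The prevalence comparison is the easiest part to reproduce on your own outputs. The sketch below is my own illustration, not one of OpenAI's audit tools: it assumes you have logged responses tagged with the condition that produced them, and the record format, condition labels, and sample texts are hypothetical.

```python
import re
from collections import defaultdict

CREATURE = re.compile(r"\b(goblin|gremlin)s?\b", re.IGNORECASE)


def prevalence_by_condition(records):
    """Return the share of responses mentioning a creature, per condition.

    records -- iterable of (condition, response_text) pairs (hypothetical format)
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for condition, text in records:
        totals[condition] += 1
        if CREATURE.search(text):
            hits[condition] += 1
    return {condition: hits[condition] / totals[condition] for condition in totals}


# Made-up sample data for illustration.
records = [
    ("nerdy", "Your regex has a goblin in it."),
    ("nerdy", "Try a list comprehension here."),
    ("default", "Try a list comprehension here."),
    ("default", "Close the file when you are done."),
]
rates = prevalence_by_condition(records)
print(rates)  # {'nerdy': 0.5, 'default': 0.0}
```

Run the same comparison across model versions and you get the other half of the picture: whether the gap between conditions is closing because the tic is fading or because it is leaking into the default.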
This story highlights a basic fact about how these models form their behavior. Narrow incentives do not stay narrow. A reward for playful wisdom can boost an entire family of odd creature references. The model found a distinctive way to signal the rewarded style and then kept using it. I have seen similar patterns when tracking coding models and reasoning benchmarks. Small choices in training data or reward design surface later as consistent tendencies that feel random until you run the audit. The difference here is that OpenAI published the autopsy. Most quirks never get this level of explanation. We simply notice that a particular model overuses certain phrases or analogies and move on.
For everyday use, the fix appears effective. GPT-5.5 in normal contexts no longer floods answers with goblin analogies. The behavior is contained. Yet the episode should make anyone building with these systems think twice about how they customize personality or style. Every reward carries side effects. Every piece of preference data shapes the distribution in ways that can transfer beyond the intended context. If you train for one flavor of helpfulness, you might unintentionally train for a dozen related tics that show up when least expected.
The feedback loop they describe deserves attention. Playful style gets rewarded. Some of those rewarded examples contain a specific lexical choice. That choice appears more frequently in rollouts. The rollouts become part of supervised fine-tuning. The model grows more comfortable with the choice. Over multiple generations the tic strengthens even in prompts that never mention Nerdy. This matches what we see when models absorb other subtle patterns from their training corpus. The concentration metric stands out. Nerdy traffic was tiny yet dominated the creature mentions. That kind of skew makes root cause analysis possible.
I keep coming back to the practical takeaway. When a model starts doing something oddly specific and repetitive, there is usually a root cause sitting inside the training process. Finding it requires the right diagnostic tools. OpenAI turned this incident into exactly those tools. Future strange behaviors should surface faster and get resolved at the source rather than patched with prompt instructions after the fact. The report also connects to their broader work on model specifications and monitoring internal agents for misalignment risks. If a harmless goblin habit can spread through reused training data, then more serious tendencies could follow the same path. The difference is that goblins are obvious and funny. Other issues might stay hidden longer. The investment in audit capability addresses that risk directly.
The timing lines up with other GPT-5.5 testing in Codex and the launch of related features. The additional details here fill in the training story behind the suppression prompt I covered before. The two pieces together show both the symptom and the root cause. Most users will never notice these training artifacts. The models remain useful. The behaviors stay controlled. Yet for those of us who follow the releases closely, this report offers a clean look at the machinery. It reminds me that every capability and every quirk traces back to incentives applied at scale across enormous datasets. Control those incentives well enough and you shape the model you want. Miss a side effect and you get goblins.
The episode does not change which model I reach for on any given task. It does reinforce the habit of testing outputs for consistency and watching for odd repeated patterns. When they appear, the right question is not "why is the model doing this?" The right question is "which part of training taught it to do this?" OpenAI answered that question thoroughly here. The rest of the field should take note. Builders who create custom personalities or fine-tune on preference data need to watch for similar transfer effects. One reward for tone or style can bleed into areas you never intended.
This level of visibility into their own training process builds confidence. They did not treat the goblin surge as noise to patch at inference time. They traced it, audited the rewards, cleaned the data, and built better diagnostics. That approach matters more than any single fix. The next weird habit, whether in creative output or agent behavior, will likely get the same treatment. In the end, the goblins were never the story. The story is how one small optimized signal created a self-reinforcing pattern that crossed intended boundaries. OpenAI mapped it, fixed it, and shared the map. That is the kind of work worth paying attention to.

