Model collapse in large language models (LLMs) is causing quite a stir, but the concerns are largely unfounded. First and foremost, model collapse hasn't been conclusively demonstrated for LLMs. The current panic stems from research on smaller models in contrived scenarios that don't accurately reflect real-world training pipelines.

Synthetic data is already a common component of LLM training; major players like OpenAI, Meta, and Anthropic use it regularly. The key is intentional use that adds value, not indiscriminate ingestion of whatever a model happens to emit. Consider OpenAI's o1 models, which incorporated synthetic reasoning data filtered with reinforcement learning, and which are now reportedly being used to generate data for the upcoming Orion project. That approach demonstrates clear confidence in the process, free from any concern about model collapse. (A minimal sketch of this generate-and-filter pattern closes out this piece.)

LLMs may also be inherently less susceptible to collapse because language is an open system. The space of possible linguistic expression is vast and continually evolving: humans regularly coin new words and phrases, repurpose old sayings, and popularize catchy expressions. This openness leaves ample new territory for LLMs to explore, and synthetic data can serve as a tool for venturing into previously unexplored areas of language use.

As Gabriel Duncan pointed out, the current panic is reminiscent of a historical anecdote from the 1800s: when someone created a board game that used dice and rules to compose classical music, people feared we'd exhaust our capacity for musical creation. Needless to say, those fears proved unfounded.

It's time to set aside the model collapse hysteria surrounding LLMs. These models have significant room for growth and improvement, and our energy is better spent advancing them and exploring their true potential than fixating on hypothetical problems.
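To make "intentional use" concrete, here is a minimal sketch of the generate-and-filter pattern in Python. It is an illustration under assumptions, not any lab's actual pipeline: generate_candidates and quality_score are hypothetical stand-ins for a generator model and a reward model (or verifier), and the 0.7 threshold is arbitrary.

```python
import random

def generate_candidates(prompt: str, n: int = 8) -> list[str]:
    # Hypothetical stand-in for sampling n completions from a generator model.
    return [f"{prompt} -> candidate {i}" for i in range(n)]

def quality_score(text: str) -> float:
    # Hypothetical stand-in for a reward model, verifier, or heuristic filter.
    return random.random()

def build_synthetic_set(prompts: list[str], threshold: float = 0.7) -> list[str]:
    # Oversample, score, and keep only the candidates that clear the bar,
    # so the synthetic set adds signal rather than noise.
    kept = []
    for prompt in prompts:
        for candidate in generate_candidates(prompt):
            if quality_score(candidate) >= threshold:
                kept.append(candidate)
    return kept

if __name__ == "__main__":
    data = build_synthetic_set(["Explain model collapse in one paragraph."])
    print(f"Kept {len(data)} of 8 candidates")
```

The filter is the whole point: it is the difference between curated synthetic data and the indiscriminate self-ingestion that the collapse studies actually simulate.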