Image created with Midjourney using the prompt, "A cinematic photo of a group of people gathered around a large, futuristic-looking computer terminal, their faces illuminated by the glow of the screen as they intently examine the data displayed."

LMSYS: The AI Evaluation Playground

If you’re curious about the latest in large language models (LLMs), LMSYS is where the action happens. This open platform, run out of UC Berkeley’s Sky Computing Lab, lets anyone test-drive and compare AI models through blind, head-to-head evaluations.

So what exactly is LMSYS? Short for Large Model Systems, it’s essentially a virtual arena where AI models duke it out in head-to-head battles decided by user preferences. The name? You can pronounce it “el-em-sis” or “lim-sis” – dealer’s choice.
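Under the hood, those head-to-head votes are what drive the leaderboard: the arena aggregates huge numbers of blind comparisons into ratings, originally using an Elo-style system. Here’s a minimal sketch of that idea in Python; the model names, battle data, and K-factor are made up for illustration and don’t reflect LMSYS’s actual configuration.

```python
# Minimal sketch of an Elo-style rating update driven by pairwise "arena" votes.
# The model names, battle outcomes, and K-factor are illustrative, not LMSYS's real data.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_ratings(ratings: dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Nudge both ratings toward the observed outcome of one blind battle."""
    surprise = 1.0 - expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * surprise
    ratings[loser] -= k * surprise

# Every model starts from the same baseline rating.
ratings = {"model-a": 1000.0, "model-b": 1000.0, "model-c": 1000.0}

# Each tuple is one user vote: (model the user preferred, model it beat).
battles = [("model-a", "model-b"), ("model-c", "model-a"), ("model-a", "model-b")]

for winner, loser in battles:
    update_ratings(ratings, winner, loser)

# Print a tiny "leaderboard" sorted by rating.
for name, rating in sorted(ratings.items(), key=lambda item: -item[1]):
    print(f"{name}: {rating:.1f}")
```

The real leaderboard is built from a more robust statistical fit over millions of votes, but the core intuition holds: every blind comparison nudges the ratings up or down.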

Now, calling LMSYS a benchmark might be a stretch. While it offers valuable insights through blind testing, it’s not always representative of real-world applications. Users often engage in basic testing or casual experimentation, which doesn’t necessarily reflect how you’d want an AI to perform in practical scenarios.

Interestingly, speed sometimes trumps intelligence on LMSYS. Take GPT-4 Omni Mini, for instance: it’s quick and convenient, and that alone landed it the second spot on the leaderboard. But for serious tasks like writing or coding? You might want to look elsewhere.

Despite its limitations, LMSYS still outshines some traditional benchmarks. Consider MMLU, which is often cited as a measure of model intelligence despite significant contamination issues – its test questions have a way of leaking into models’ training data. In comparison, LMSYS offers a more authentic evaluation experience.

But here’s where it gets really intriguing: LMSYS isn’t just a public playground. It’s also a testing ground for AI companies before they unveil their latest creations. OpenAI and Google have been known to pilot their models here under clever codenames.

For instance, before GPT-4 Omni’s official debut, it made stealth appearances as “gpt2-chatbot,” “im-a-good-gpt2-chatbot,” and “im-also-a-good-gpt2-chatbot.” This pre-release testing gives companies valuable insights into their AI’s performance.

For AI geeks (or “Generative AI Experts” if we’re being fancy), LMSYS is like a treasure hunt. We love scouring Reddit and running tests to uncover the true identities of mysterious new models. Just this morning, I spent time investigating “guava-chatbot” and “eureka-chatbot.” The consensus? “Guava-chatbot” might be Google’s anticipated CodeGemma 2, while “eureka-chatbot” could be a smaller version of Gemma 2.

LMSYS offers a unique window into the world of AI development. It’s where cutting-edge models are put through their paces, and where the public gets a say in shaping the future of AI. Whether you’re a casual observer or a dedicated AI sleuth, LMSYS provides an engaging platform to explore the capabilities of the latest language models.