Created using AI with the prompt, "Two scientists in a research boat with headphones studying dolphin sounds, while a visualization of sound waves transforms into dolphin silhouettes above the water, cinematic shot with golden hour lighting, 35mm film"

Beyond Barks and Chirps: How NatureLM-audio and DolphinGemma Are Changing Bioacoustic AI

Two groundbreaking models are changing how we understand animal sounds: Earth Species Project’s NatureLM-audio and Google’s DolphinGemma. These audio-language models represent a significant advance in our ability to analyze and interpret animal vocalizations, opening new doors for conservation, research, and even business applications.

NatureLM-audio Overview

Developed by the Earth Species Project, NatureLM-audio is the first open-source audio-language foundation model built specifically for bioacoustics. What makes it special is its hybrid training approach: it doesn't just learn from animal sounds but combines them with human speech and music datasets.

This cross-domain training gives NatureLM-audio some impressive capabilities:

  • Zero-shot performance: it can answer natural-language questions about animal sounds without task-specific training (see the sketch after this list)
  • Species identification: the model identifies different species from audio alone
  • Counting abilities: it can count individuals in a recording, such as the number of zebra finches in a clip
  • Context recognition: the system identifies behavioral contexts such as distress calls or courtship sounds
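
To make the zero-shot idea concrete, here is a minimal Python sketch of the prompt-plus-audio interaction pattern such models expose. The `AudioLanguageModel` class is a hypothetical stand-in, not NatureLM-audio's actual API; the point is only the shape of the exchange: one model, arbitrary natural-language questions, no retraining per task.

```python
# Minimal sketch of the zero-shot query pattern used by audio-language
# models. `AudioLanguageModel` is a hypothetical stand-in, NOT the real
# NatureLM-audio interface; it illustrates the interaction shape only.
from dataclasses import dataclass


@dataclass
class AudioLanguageModel:
    name: str

    def query(self, audio_path: str, question: str) -> str:
        # A real model would encode the audio, fuse it with the text
        # prompt, and decode an answer. Here we return a placeholder.
        return f"[{self.name}] answer for {question!r} on {audio_path}"


model = AudioLanguageModel(name="naturelm-audio-stub")

# Zero-shot: plain natural-language questions, no task-specific training.
for q in [
    "What species is vocalizing in this clip?",
    "How many individuals can you hear?",
    "Is this a distress call or a courtship call?",
]:
    print(model.query("recordings/dawn_chorus.wav", q))
```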

What’s particularly notable is how the model transfers techniques from human audio analysis to the animal domain. Skills that work for analyzing human speech, like counting speakers or identifying emotions, are now being applied to animal vocalizations.

The model also sets new standards on the BEANS-Zero benchmark, excelling at captioning and classifying different call types. Its ability to generalize to unseen species makes it especially valuable for biodiversity monitoring.

| Feature | NatureLM-audio | DolphinGemma |
| --- | --- | --- |
| Primary Focus | General bioacoustics | Dolphin communication |
| Technology Base | Audio-language foundation model | Gemma-family LLMs |
| Training Data | Animal sounds, human speech, music | Bottlenose dolphin recordings |
| Deployment | Open-source | Pixel phones (field research) |
| Key Strength | Zero-shot generalization | Behavior-linked vocalization analysis |

DolphinGemma Overview

If NatureLM-audio takes a broad approach to animal sounds, Google’s DolphinGemma goes deep on a specific species. This specialized system applies Gemma-family large language models to decode bottlenose dolphin communication using field recordings from Pixel phones. This approach leverages the power of LLMs, which are typically trained on massive text datasets, and applies it to a completely different domain: animal audio.

DolphinGemma works through several key processes:

  • Audio tokenization: converting dolphin whistles and burst pulses into sequences that language models can process. This is the crucial step that makes analog audio legible to a text-trained model (a sketch follows this list).
  • Pattern recognition: mapping vocalization patterns to observed behaviors like group formation or play. By analyzing sequences of sounds, the model looks for correlations with recorded animal behaviors.
  • Field deployment: running locally on Pixel phones for real-time analysis during marine research. This edge-computing capability is vital for researchers working in remote environments without reliable internet access.
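
To illustrate what tokenization can look like, here is a minimal sketch that maps each short audio frame to a discrete token id by its dominant frequency. This frequency-binning scheme, and every name in it, is an illustrative assumption; DolphinGemma's actual tokenizer is a far more sophisticated learned component that Google has not published in this form.

```python
# Sketch of audio tokenization: turning a waveform into a discrete token
# sequence an LLM can consume. Dominant-frequency binning is an
# illustrative assumption, not DolphinGemma's real tokenizer.
import numpy as np


def tokenize_audio(waveform: np.ndarray, sr: int, n_tokens: int = 32,
                   frame_ms: int = 20) -> list[int]:
    """Map each short frame to a token id based on its dominant frequency."""
    frame_len = int(sr * frame_ms / 1000)
    tokens = []
    for start in range(0, len(waveform) - frame_len, frame_len):
        frame = waveform[start:start + frame_len]
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(frame_len)))
        dominant_hz = np.argmax(spectrum) * sr / frame_len
        # Bin 0-24 kHz (covering the dolphin whistle range) into
        # n_tokens discrete ids.
        token = min(int(dominant_hz / 24000 * n_tokens), n_tokens - 1)
        tokens.append(token)
    return tokens


# Synthetic rising whistle as a stand-in for a field recording.
sr = 48000
t = np.linspace(0, 1.0, sr, endpoint=False)
whistle = np.sin(2 * np.pi * (4000 + 8000 * t) * t)  # upward frequency sweep
print(tokenize_audio(whistle, sr)[:20])  # token ids climb with the sweep
```

Once audio is discrete, the downstream model can treat a whistle the way a text LLM treats a sentence: as a sequence of symbols whose order carries information.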

What’s particularly clever about Google’s approach is treating dolphin vocalizations like language. The system analyzes sequences using the same techniques language models use to predict text, looking for patterns that correspond to specific social contexts. This isn’t about translating individual sounds into English words, but rather identifying structured patterns within the dolphin’s vocalizations that might have meaning in their social interactions.

While DolphinGemma isn’t yet a true translator, it identifies recurring sound patterns that may correspond to different behaviors. This represents a major step toward understanding structured communication in non-human species and provides researchers with a powerful tool for hypothesis generation and testing.
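
The pattern-analysis step can be sketched as counting which short token sequences recur alongside labelled behaviors. The token sequences and behavior labels below are fabricated for illustration; a real analysis would use annotated field recordings and a trained sequence model rather than raw n-gram counts.

```python
# Sketch of the pattern-analysis step: counting which short token
# sequences (n-grams) co-occur with labelled behaviors. All data here
# is fabricated for illustration.
from collections import Counter, defaultdict

# (token_sequence, observed_behavior) pairs - hypothetical annotations.
sessions = [
    ([5, 7, 7, 9, 5, 7, 7], "group_formation"),
    ([5, 7, 7, 2, 1, 5, 7, 7], "group_formation"),
    ([12, 12, 3, 12, 12], "play"),
    ([3, 12, 12, 4], "play"),
]


def ngrams(seq, n=3):
    """All length-n subsequences of a token sequence."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


# Tally which 3-grams appear under which behavior label.
by_behavior = defaultdict(Counter)
for tokens, behavior in sessions:
    by_behavior[behavior].update(ngrams(tokens))

for behavior, counts in by_behavior.items():
    print(behavior, counts.most_common(2))
```

Recurring n-grams that cluster under one behavior, like (5, 7, 7) above, are exactly the kind of candidate pattern researchers can then test as a hypothesis in the field.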

Audio-Language Model Pipeline (diagram): animal audio input → spectrogram feature extraction → LLM processing → pattern analysis → applications in conservation, research, agriculture, and business analytics.