
Audio Spectrogram Transformers Beyond the Lab | Towards Data Science
Aug 10, 2023
A recipe for building a portable soundscape monitoring app with AudioMoth, Raspberry Pi, and a decent dose of deep learning.
Want to know what draws me to soundscape analysis?
It’s a field that combines science, creativity, and exploration in a way few others do. First of all, your laboratory is wherever your feet take you — a forest trail, a city park, or a remote mountain path can all become spaces for scientific discovery and acoustic investigation. Secondly, monitoring a chosen geographic area is all about creativity. Innovation is at the heart of environmental audio research, whether it’s rigging up a custom device, hiding sensors in tree canopies, or using solar power for off-grid setups. Finally, the sheer volume of data is truly incredible, and as we know, in spatial analysis, all methods are fair game. From hours of animal calls to the subtle hum of urban machinery, the acoustic data collected can be vast and complex, and that opens the door to using everything from deep learning to geographical information systems (GIS) in making sense of it all.
After my earlier adventures with soundscape analysis of one of Poland’s rivers, I decided to raise the bar and design and implement a solution capable of analysing soundscapes in real time. In this blog post, you’ll find a description of the proposed method, along with some code that powers the entire process, mainly using an Audio Spectrogram Transformer (AST) for sound classification.
There are many reasons why, in this particular case, I chose to use a combination of Raspberry Pi 4 and AudioMoth. Believe me, I tested a wide range of devices — from less power-hungry models of the Raspberry Pi family, through various Arduino versions, including the Portenta, all the way to the Jetson Nano. And that was just the beginning. Choosing the right microphone turned out to be even more complicated.
Ultimately, I went with the Pi 4 B (4GB RAM) because of its solid performance and relatively low power consumption (~700 mA when running my code). Additionally, pairing it with the AudioMoth in USB microphone mode gave me a lot of flexibility during prototyping. AudioMoth is a powerful device with a wealth of configuration options, e.g. a sampling rate ranging from 8 kHz up to a stunning 384 kHz. I have a strong feeling that, in the long run, this will prove to be a perfect choice for my soundscape studies.
Capturing audio from a USB microphone using Python turned out to be surprisingly troublesome. After struggling with various libraries for a while, I decided to fall back on the good old Linux arecord. The whole sound capture mechanism is encapsulated in a single arecord invocation.
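A minimal sketch of that capture step, wrapping arecord via subprocess. The device name plughw:1,0 and the 10-second window are assumptions on my part; check arecord -l for your AudioMoth's actual card number:

```python
import subprocess
from pathlib import Path

SAMPLE_RATE = 16_000  # AST expects 16 kHz input
DURATION_S = 10       # length of each capture window (assumed value)

def build_arecord_cmd(out_path: str, device: str = "plughw:1,0") -> list[str]:
    """Assemble the arecord invocation. The plug-in ('plughw') device
    lets ALSA convert formats if the USB microphone settings change."""
    return [
        "arecord",
        "-D", device,           # capture device (AudioMoth in USB mic mode)
        "-f", "S16_LE",         # 16-bit little-endian PCM
        "-r", str(SAMPLE_RATE), # resample to the AST input rate
        "-c", "1",              # mono
        "-d", str(DURATION_S),  # stop after DURATION_S seconds
        out_path,
    ]

def record(out_path: str = "/tmp/sample.wav") -> Path:
    """Block until one window has been written to out_path."""
    subprocess.run(build_arecord_cmd(out_path), check=True)
    return Path(out_path)
```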
I’m deliberately using a plug-in device to enable automatic conversion in case I later change the USB microphone configuration. AST runs on 16 kHz samples, so both the recording and the AudioMoth sampling rate are set to this value.
Pay attention to the generator in the code. It’s important that the device continuously captures audio at the time intervals I specify. I aimed to store only the most recent audio sample on the device and discard it after the classification. This approach will be especially useful later during larger-scale studies in urban areas, as it helps ensure people’s privacy and aligns with GDPR compliance.
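As a sketch, such a generator can be written so that exactly one window exists on disk at a time and is removed the moment the consumer is done with it. The capture callable here stands in for the arecord wrapper and is an assumption of mine, not the author's exact code:

```python
import os
import tempfile
from typing import Callable, Iterator

def audio_stream(capture: Callable[[str], None]) -> Iterator[str]:
    """Endlessly capture fixed-length windows and yield the path of the
    most recent one. Each file is deleted as soon as the consumer has
    processed it, so no raw audio accumulates on the device."""
    while True:
        # a fresh temporary file per window; only one exists at a time
        fd, path = tempfile.mkstemp(suffix=".wav")
        os.close(fd)
        try:
            capture(path)   # blocks while the window is recorded
            yield path      # consumer classifies the sample here
        finally:
            if os.path.exists(path):
                os.remove(path)  # discard the raw audio immediately
```

Each `next()` call hands the consumer exactly one window; by the time the next window is yielded, the previous file is already gone, which is what makes the privacy argument above hold.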
Now for the most exciting part.
Using the Audio Spectrogram Transformer (AST) and the excellent HuggingFace ecosystem, we can efficiently analyse audio and classify detected segments into over 500 categories. Note that I’ve prepared the system to support various pre-trained models. By default, I use MIT/ast-finetuned-audioset-10-10-0.4593, as it delivers the best results and runs well on the Raspberry Pi 4. However, onnx-community/ast-finetuned-audioset-10-10-0.4593-ONNX is also worth exploring, especially its quantised version, which requires less memory and returns inference results more quickly.
You may notice that I’m not limiting the model to a single classification label, and that’s intentional. Instead of assuming that only one sound source is present at any given time, I apply a sigmoid function to the model’s logits to obtain independent probabilities for each class. This allows the model to express confidence in multiple labels simultaneously, which is crucial for real-world soundscapes where overlapping sources — like birds, wind, and distant traffic — often occur together. Taking the top five results ensures that the system captures the most likely sound events in the sample without forcing a winner-takes-all decision.
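The sigmoid-plus-top-five step can be sketched as follows. The checkpoint name comes from above; the helper names and the lazy transformers import are my own choices, and the pure post-processing function is kept dependency-free:

```python
import numpy as np

def top_k_probs(logits: np.ndarray, labels: list, k: int = 5):
    """Independent per-class probabilities via a sigmoid (not softmax),
    so overlapping sources can all score highly at once."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    top = np.argsort(probs)[::-1][:k]
    return [(labels[i], float(probs[i])) for i in top]

def classify(waveform, sampling_rate: int = 16_000):
    """Run the AST checkpoint mentioned in the post on one window.
    transformers/torch are imported lazily: only needed on-device."""
    from transformers import ASTFeatureExtractor, ASTForAudioClassification
    import torch

    model_id = "MIT/ast-finetuned-audioset-10-10-0.4593"
    extractor = ASTFeatureExtractor.from_pretrained(model_id)
    model = ASTForAudioClassification.from_pretrained(model_id)
    inputs = extractor(waveform, sampling_rate=sampling_rate,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].numpy()
    labels = [model.config.id2label[i] for i in range(len(logits))]
    return top_k_probs(logits, labels)  # top five (label, probability) pairs
```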
To run the ONNX version of the model, you need to add Optimum to your dependencies.
Along with the audio classification, I capture information on sound pressure level. This approach not only identifies what made the sound but also captures how strongly each sound was present. In that way, the system builds a richer, more realistic representation of the acoustic scene and can eventually be used to derive finer-grained noise pollution information.
The gain (preamp + amp), sensitivity (dB/V), and Vadc (V) are set primarily for the AudioMoth and were confirmed experimentally. If you are using a different device, you will need to obtain these values from its technical specification.
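As a sketch, the conversion from normalised samples to dB SPL follows the chain ADC counts, volts, undo gain, pascals, dB re 20 µPa. The default gain, sensitivity, and full-scale voltage below are placeholder figures of mine, not the AudioMoth's actual calibration values:

```python
import numpy as np

P_REF = 20e-6  # reference pressure in Pa (threshold of hearing)

def spl_db(samples: np.ndarray,
           gain_db: float = 20.0,         # preamp + amp (placeholder)
           sensitivity_db: float = -38.0, # dB re 1 V/Pa (placeholder)
           v_adc: float = 1.0) -> float:  # ADC full-scale V (placeholder)
    """Convert normalised samples in [-1, 1] to dB SPL."""
    rms = np.sqrt(np.mean(samples.astype(np.float64) ** 2))
    v_rms = rms * v_adc                        # volts at the ADC input
    v_mic = v_rms / (10 ** (gain_db / 20))     # volts at the capsule
    pascals = v_mic / (10 ** (sensitivity_db / 20))
    return 20 * np.log10(max(pascals, 1e-12) / P_REF)
```

A useful sanity check: scaling the signal by a factor of 0.01 must lower the result by exactly 40 dB, independent of the calibration constants.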
Data from each sensor is synchronised with a PostgreSQL database every 30 seconds. The current urban soundscape monitor prototype uses an Ethernet connection; therefore, I am not restricted in terms of network load. The device for more remote areas will synchronise the data each hour using a GSM connection.
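A sketch of that 30-second sync loop; the table name, column layout, and connection string are hypothetical, and psycopg2 is imported lazily so the row-building helper stays testable off-device:

```python
import time

# Hypothetical schema: one row per (window, detected label).
INSERT_SQL = (
    "INSERT INTO observations "
    "(device_id, recorded_at, label, probability, spl_db) "
    "VALUES (%s, %s, %s, %s, %s)"
)

def to_rows(device_id, recorded_at, detections, spl):
    """Flatten one classified window (top-k labels + SPL) into
    parameter tuples for executemany."""
    return [(device_id, recorded_at, label, prob, spl)
            for label, prob in detections]

def sync_loop(buffer, dsn, interval_s=30):
    """Flush buffered rows to PostgreSQL every interval_s seconds."""
    import psycopg2  # lazy import: only needed on the device itself
    conn = psycopg2.connect(dsn)
    while True:
        time.sleep(interval_s)
        if buffer:
            with conn, conn.cursor() as cur:
                cur.executemany(INSERT_SQL, buffer)
            buffer.clear()  # rows are committed; drop the local copy
```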
A separate application, built using Streamlit and Plotly, accesses this data. Currently, it displays information about the device’s location, temporal SPL (sound pressure level), identified sound classes, and a range of acoustic indices.
And now we are good to go. The plan is to extend the sensor network to around 20 devices scattered across multiple places in my city. More information about a larger-area sensor deployment will be available soon.
Moreover, I’m collecting data from a deployed sensor and plan to share the data package, dashboard, and analysis in an upcoming blog post. I’ll use an interesting approach that warrants a deeper dive into audio classification. The main idea is to match different sound pressure levels to the detected audio classes. I hope to find a better way of describing noise pollution. So stay tuned for a more detailed breakdown soon.
In the meantime, you can read the preliminary paper on my soundscape studies (headphones are obligatory).
This post was proofread and edited using Grammarly to improve grammar and clarity.
