Integrating Audio Units into Tuple

Eli Goodman · @elimgoodman · July 1, 2024

Overview

We just released a build of Tuple that contains an experimental audio processing pipeline built on Audio Units. We’re excited to be shipping this - Audio Units add some powerful processing capabilities that allow us to make voices easier to hear, especially in noisy environments. But how did we get here? And what the heck is an Audio Unit? Join us on our quest through the musty catacombs of audio processing history as we try to figure out answers to these questions and more.

Demons and Chipmunks

Tuple’s audio engine has been a pain point since the early days of the company. Since the beginning, Tuple has been built on top of WebRTC. While WebRTC is generally robust when it comes to audio and video, we noticed that its stock audio capture module wasn’t well maintained and had a number of odd quirks (like not being able to switch input devices while a call was in progress).

Around 2019, we really started to feel some pain around audio capture. Our first clue came in the form of bug reports from users on AirPods. As it turns out, the sample rate of audio being played through AirPods changes depending on how they’re being used: if the AirPods’ mics were in use, the sample rate would be lower; if they weren’t, it would be higher. The audio module didn’t account for this - and since audio interpreted at a different rate than it was produced at comes out shifted in pitch, the resulting audio was scary. Unnaturally low sample rates would cause people to sound like demons; high sample rates would turn everyone into chipmunks. Awkward.
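For the curious, Core Audio exposes each device’s nominal sample rate as a property you can query (and subscribe to changes on) - a capture pipeline that assumes a fixed rate is exactly the kind of thing that produces the demon/chipmunk effect. Here’s a rough Swift sketch of reading that rate for the default input device; it’s purely illustrative, not our code or WebRTC’s:

    import CoreAudio

    // A minimal, illustrative sketch: read the nominal sample rate of the
    // default input device. A real pipeline would also listen for changes.
    func defaultInputDevice() -> AudioDeviceID? {
        var deviceID = AudioDeviceID(kAudioObjectUnknown)
        var size = UInt32(MemoryLayout<AudioDeviceID>.size)
        var address = AudioObjectPropertyAddress(
            mSelector: kAudioHardwarePropertyDefaultInputDevice,
            mScope: kAudioObjectPropertyScopeGlobal,
            mElement: kAudioObjectPropertyElementMain)
        let status = AudioObjectGetPropertyData(
            AudioObjectID(kAudioObjectSystemObject), &address, 0, nil, &size, &deviceID)
        return status == noErr ? deviceID : nil
    }

    func nominalSampleRate(of device: AudioDeviceID) -> Float64? {
        var rate = Float64(0)
        var size = UInt32(MemoryLayout<Float64>.size)
        var address = AudioObjectPropertyAddress(
            mSelector: kAudioDevicePropertyNominalSampleRate,
            mScope: kAudioObjectPropertyScopeGlobal,
            mElement: kAudioObjectPropertyElementMain)
        let status = AudioObjectGetPropertyData(device, &address, 0, nil, &size, &rate)
        return status == noErr ? rate : nil
    }

    if let device = defaultInputDevice(), let rate = nominalSampleRate(of: device) {
        print("Default input device is currently running at \(rate) Hz")
    }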

Naturally, we fixed this issue, and all of the similar ones that popped up over the years. Tuple eventually ended up with an extremely heavily-patched version of the stock WebRTC audio module. It was the kind of thing that was stable in most cases - provided you didn’t poke it too hard or look directly at it. But then came macOS 14.

With the release of macOS 14, we began to get a bunch of regressions around Bluetooth devices. While we were able to solve some of the problems that got reported, some remained a mystery. Was the OS causing the problem? Was Tuple? Was it the device itself? All of the vestigial cruft in the audio module was making it hard for us to figure it out. We began to investigate other options. Could there be a wholly better way for us to handle audio?

Look to the Browsers

The first places we went to for inspiration were the open-source repos for the big browsers - Chromium, WebKit, and Firefox. These were all projects that had extremely robust implementations of WebRTC - so we assumed that we’d be able to glean some insights from their different approaches.

And indeed, when we began to dig into those code bases, we noticed that none of them were using WebRTC’s stock audio implementation. They’d each replaced it in different ways. We also saw (from their issue trackers) that other folks struggled with supporting Bluetooth headphones perfectly - even in the big leagues! This made us feel a bit better about ourselves.

Even though there wasn’t one obvious, bullet-proof solution, we did begin to notice patterns and get ideas about better ways to structure our audio pipeline. Specifically, for the kind of low-latency audio processing we need to do, all signs pointed towards Audio Units.

The convergence on a particular technology felt good. But much to our surprise, there was shockingly little info available on the internet about Audio Units. While we were able to find a handful of docs and blog posts that involved Audio Units, it was really tough to get a high-level, holistic overview of what they are and how they work. Some of the best documentation we found, we dug up with the Wayback Machine - all the way from 2004! Thanks, Internet Archive 🙏
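For anyone who hits the same wall, the short version we pieced together: Audio Units are Core Audio’s plug-in format for real-time audio processing - small components (I/O units, effects, mixers) that you instantiate and wire together. To make that concrete, here’s a bare-bones Swift sketch of standing up Apple’s voice-processing I/O unit, the sort of building block a low-latency capture pipeline gets assembled from. It shows the shape of the API, not what our pipeline actually does (stream formats and the render callback are omitted):

    import AudioToolbox

    // Describe the unit we want: Apple's voice-processing I/O unit, which
    // handles mic capture plus echo cancellation and noise suppression.
    var description = AudioComponentDescription(
        componentType: kAudioUnitType_Output,
        componentSubType: kAudioUnitSubType_VoiceProcessingIO,
        componentManufacturer: kAudioUnitManufacturer_Apple,
        componentFlags: 0,
        componentFlagsMask: 0)

    // Find a matching component and instantiate it.
    guard let component = AudioComponentFindNext(nil, &description) else {
        fatalError("No matching Audio Unit found")
    }
    var audioUnit: AudioUnit?
    var status = AudioComponentInstanceNew(component, &audioUnit)
    guard status == noErr, let unit = audioUnit else {
        fatalError("Couldn't create the Audio Unit: \(status)")
    }

    // Enable capture on the input bus (bus 1); bus 0 renders to the output device.
    var enable: UInt32 = 1
    status = AudioUnitSetProperty(
        unit,
        kAudioOutputUnitProperty_EnableIO,
        kAudioUnitScope_Input,
        1,
        &enable,
        UInt32(MemoryLayout<UInt32>.size))

    // After configuring stream formats and a render callback (omitted here),
    // the unit gets initialized and started.
    status = AudioUnitInitialize(unit)
    status = AudioOutputUnitStart(unit)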

Having learned a bit more about Audio Units, we set out trying to figure out how to practically utilize them. Between all of the browsers, we were able to grab some useful bits of AU code that we could reuse. But trying to crib from the browsers had its own downsides. Browsers are extremely complex, and have a wide range of abstractions that different components utilize. It wasn’t clear which parts of the audio code were intrinsic to using Audio Units and which were browser-specific boilerplate.

Furthermore, working in such big, complex codebases made it hard for us to play around with stuff. It was hard to figure out which code paths were being exercised in particular circumstances - we could fiddle around and make changes, but it wasn’t clear how to actually make use of the code we changed.

We needed a way to iterate and mess around with stuff quickly and easily.

Playing in the Playground

At Tuple, we love to create playground projects. Making focused, standalone apps allows us to barf out sloppy code, validate our assumptions, formalize our conclusions, and integrate our learnings into the main codebase deliberately once we’re ready. It was time to make a playground for Audio Units.

One of the challenging aspects of working with audio processing is that there’s a subjective aspect to the whole thing. While there are some empirical markers of “good” or “bad” audio, at the end of the day, there’s a lot of personal taste and preference at play. We designed our playground with this in mind.

Once we had an initial working Audio Unit pipeline, we started messing around and doing tests. The first batch of tests was centered around latency - we’d run raw audio through the new and old audio pipelines, and compare the resulting timestamps to make sure that the new pipeline didn’t negatively impact latency (it actually decreased it! 👌). Feeling confident about that, we moved on to the fun stuff.
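To give a flavor of what “compare the resulting timestamps” means in practice, here’s a toy Swift version of the idea: stamp a buffer on its way into a pipeline, stamp it on the way out, and compare the deltas between the old and new pipelines. The process(_:) stage below is a made-up stand-in, not our real pipeline or test harness:

    import Foundation

    // mach_absolute_time() is the monotonic clock that Core Audio timestamps
    // are based on; convert its ticks to milliseconds via the timebase.
    var timebase = mach_timebase_info_data_t()
    mach_timebase_info(&timebase)

    func millisecondsBetween(_ start: UInt64, _ end: UInt64) -> Double {
        let nanos = (end - start) * UInt64(timebase.numer) / UInt64(timebase.denom)
        return Double(nanos) / 1_000_000
    }

    // Hypothetical stand-in for a pipeline stage; swap in the old or new one.
    func process(_ buffer: [Float]) -> [Float] {
        return buffer  // pass-through, purely for the sake of the sketch
    }

    let buffer = [Float](repeating: 0, count: 480)  // 10 ms of mono audio at 48 kHz
    let start = mach_absolute_time()
    _ = process(buffer)
    let end = mach_absolute_time()
    print("One 10 ms buffer took \(millisecondsBetween(start, end)) ms to process")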

One of the cool things about working at Tuple is that we’ve all got loads of machines and gear lying around (for years, if someone had an issue with a particular microphone or audio interface, someone on the team would just buy one so they could test with it). So we were well set up to build a weird audio laboratory. We’d have one machine playing a loop of, say, a noisy coffee shop, and we’d capture spoken audio with all of the hubbub going on in the background. We’d again run that audio through both pipelines and dump it to disk.
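The “dump it to disk” part is the least glamorous bit, but for completeness, here’s a sketch of the sort of thing we mean: tap the input, append each captured buffer to a file, and go listen later. The AVAudioEngine setup and file path are made up for the example - this isn’t our recording rig:

    import AVFoundation

    // Tap the default input and append every captured buffer to a CAF file.
    let engine = AVAudioEngine()
    let input = engine.inputNode
    let format = input.outputFormat(forBus: 0)

    let url = URL(fileURLWithPath: "/tmp/coffee-shop-take-01.caf")  // made-up path
    let file = try AVAudioFile(forWriting: url, settings: format.settings)

    input.installTap(onBus: 0, bufferSize: 4096, format: format) { buffer, _ in
        try? file.write(from: buffer)
    }

    try engine.start()
    // ...talk over the coffee-shop loop for a while, then call engine.stop()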

And then we’d throw it up on Slack for the whole team to listen to! Again, this wasn’t the most scientific process, but it was a lot of fun, and patterns began to emerge. The new pipeline could do stuff that the old one couldn’t do, and we used the playground to tune values for the various bits of processing it was doing.

After a few rounds of testing, we felt confident in where we’d arrived. We then did the work of rewriting the playground code properly in the main codebase.

Try it Yourself!

The logical next step was to let customers try out the new audio pipeline for themselves, and give us more of that sweet, sweet qualitative feedback. In Tuple v0.116.0 and newer, you can enable the Audio Unit pipeline in the app settings. Let us know what you think by shooting us an email, writing a note in the call feedback window (a human being will read it!), or by tweeting at us. We’ve all been using the new audio stuff ourselves for a few months, and it really does help boost voices in noisy environments. We hope you like it!