About 5 watts at the wall. A purely self-hosted live transcription of a Mumble audio chat. Transcription annotated by the speaker's name.

it's kinda hard to hear @fack, I had my phone sitting on my headphones while recording this and it didn't work out too well

Also, here is the massive alt-text I wrote for the video :P A thorough description of wtf is going on in this video:

Video of a multi-user audio transcription system based on a Pixel 6 phone, whose neural network chip enables the Google Live Transcribe app to convert audio into text without the ability to reach the internet.

The Pixel 6 is connected to a combo charger + microphone adapter. The Linux server has a sound card with its output connected to the microphone adapter's input, mediated by a special attenuator audio cable.

The Linux server is running a web application that connects to a Mumble server, enqueues audio from a conversation, and plays it back one speaker at a time.
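
A rough Python sketch of the one-speaker-at-a-time idea, for anyone curious. This is not the actual web application's code: the `SpeakerQueue` class, its method names, and the policy of "whoever has been waiting longest plays their whole backlog next" are all assumptions made up for illustration.

```python
from collections import OrderedDict, deque


class SpeakerQueue:
    """Buffers audio per speaker and releases it one speaker at a time.

    Hypothetical helper: the real application's queueing policy is not
    described in detail, so this is just one plausible approach.
    """

    def __init__(self):
        # speaker name -> FIFO of pending audio chunks.
        # Insertion order tracks who started waiting first.
        self._pending = OrderedDict()

    def enqueue(self, speaker, chunk):
        """Store a chunk of audio that arrived from the given speaker."""
        self._pending.setdefault(speaker, deque()).append(chunk)

    def next_turn(self):
        """Return (speaker, chunks) for the longest-waiting speaker.

        Returns None when nothing is buffered.
        """
        if not self._pending:
            return None
        speaker, chunks = self._pending.popitem(last=False)
        return speaker, list(chunks)


queue = SpeakerQueue()
queue.enqueue("alice", "a1")
queue.enqueue("bob", "b1")    # bob talks over alice
queue.enqueue("alice", "a2")  # alice keeps going
first = queue.next_turn()     # all of alice's audio, uninterrupted
second = queue.next_turn()    # then bob's
```

Playing each speaker's backlog in one uninterrupted block is what lets the phone hear only one voice at a time, even when people talk over each other.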

The Linux server is also attached to the Android phone via ADB. The Linux server joins the Android phone's WiFi hotspot. The Linux server is running a "uiautomator" UI test which constantly polls the transcribed text element of the Google Live Transcribe app and posts the text to the server via WiFi.
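
The polling part boils down to "read the text, and only send it if it changed". Here is a minimal Python sketch of that loop. The real thing is a uiautomator test running against the phone, so everything here is a stand-in: `read_text` and `post` are hypothetical callables you would supply, and the payload shape is an assumption.

```python
import time


def poll_transcript(read_text, post, polls, interval_s=0.0, clock=time.monotonic):
    """Poll a text source and forward only non-empty, changed snapshots.

    read_text: callable returning the currently displayed transcript text
               (hypothetical stand-in for reading the UI element).
    post:      callable receiving a dict with a timestamp and the text
               (hypothetical stand-in for the request to the server).
    Returns the number of snapshots that were posted.
    """
    last = None
    sent = 0
    for _ in range(polls):
        text = read_text()
        if text and text != last:
            # Timestamp each snapshot so it can be matched to a speaker later.
            post({"t": clock(), "text": text})
            last = text
            sent += 1
        if interval_s:
            time.sleep(interval_s)
    return sent


# Simulated screen contents over four polls.
readings = iter(["", "hello", "hello", "hello world"])
posted = []
count = poll_transcript(lambda: next(readings), posted.append, polls=4)
```

Skipping unchanged text keeps the traffic down, since the displayed text only changes when the app recognizes something new.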

The web application synthesizes the data of who was talking when, and what text was displayed when, in order to display the live conversation on a web-page, similar to a chat log.
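
One way to picture that merge step, as a hedged Python sketch. It assumes the server has a list of playback intervals (who was being played, from when to when) and a list of timestamped text snapshots, that each snapshot only contains the current speaker's words, and that transcription latency can be ignored. None of that is confirmed by the actual application, and `build_chat_log` is a made-up name.

```python
from bisect import bisect_right


def build_chat_log(turns, snapshots):
    """Attribute text snapshots to whoever was being played back at the time.

    turns:     list of (start, end, speaker), sorted by start, non-overlapping.
    snapshots: list of (t, text), sorted by t.
    Returns one (speaker, text) entry per turn that received any text,
    keeping only the latest snapshot seen during that turn.
    """
    starts = [start for start, _, _ in turns]
    latest = {}  # turn index -> most recent text seen during that turn
    for t, text in snapshots:
        i = bisect_right(starts, t) - 1  # last turn starting at or before t
        if i >= 0 and t <= turns[i][1]:
            latest[i] = text
    return [(turns[i][2], latest[i]) for i in sorted(latest)]


turns = [(0.0, 2.0, "alice"), (2.5, 4.0, "bob")]
snapshots = [(0.5, "hi"), (1.5, "hi there"), (2.2, "noise"), (3.0, "hello")]
log = build_chat_log(turns, snapshots)
```

The snapshot at 2.2 falls in the gap between turns, so it is dropped. A real version would need some slack for the delay between playback and the text showing up on screen.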

Also, it goes without saying, but it's ridiculous that if I want this capability, the only way to get it is to bootleg it like this.

Google finally achieved a technology that can transcribe spoken conversation, but they want to hoard it behind proprietary APIs and services. This bootleg is a small glimpse at what technology could be like if its goal was to provide utility instead of just make money: Accessibility technology that actually works!

Ok, ok, maybe that's a bit of a grandiose claim for something like this which is just barely demo-able and still has tons of difficult fundamental problems to overcome... But the point is, this capability would probably be commonplace already if it was open technology. The only reason this is remarkable is because I went through the effort to hack it together with DACs on both sides, special audio cables, android UI testing libraries, and heaps of good old-fashioned software duct-tape. And despite all that, it still out-performs the "Open" state of the art like "OpenAI" Whisper, while using 1/10th of the energy.
