Building a Real-Time AI Music Generator Controlled by Twitch Chat

Your chat writes the lyrics. AI makes the music. Live on your stream.
Imagine this: you’re streaming on Twitch, and someone in your chat types `!lyrics I’m dancing under neon lights`. Seconds later, another chatter adds `!lyrics the city never sleeps tonight`. A timer counts down, the AI kicks in, and suddenly a full synthwave track is playing on your stream with those exact lyrics, sung by an AI voice.
That’s Twitch Sings.
I built a real-time AI music radio system where your Twitch community collaboratively writes song lyrics in chat, and an AI music model called ACE-Step 1.5 generates full, original songs from those lyrics - live, on stream, in a continuous loop. Multiple people contribute to the same song. Their words get woven into verses, choruses, and bridges. Every song is unique. Every song is crowd-sourced chaos turned into actual music.
Here’s how I built it, what powers it, and why I think it’s the future of interactive streaming.
The Loop: How It Works
The entire system runs on a simple but powerful four-stage loop:
- Collect: A configurable timer starts (as short as 1 second for quick-fire moments). Chatters type `!lyrics <their words>` to contribute. Everyone’s submissions pile up in a lyrics buffer.
- Generate: When the timer expires, the server merges everyone’s lyrics into structured song sections (Verse 1, Chorus, Verse 2, Bridge) and fires them off to ACE-Step 1.5. The AI generates a complete song with vocals, instrumentation, and production.
- Play: The finished MP3 streams to your OBS overlay and plays live on stream. Album art is procedurally generated. Song titles are auto-created from the lyrics.
- Repeat: If loop mode is on, the cycle restarts automatically. Your stream becomes an endless AI radio station powered by chat.
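The collect phase boils down to a timer over a shared buffer. Here is a minimal sketch of that idea; the `lyricsBuffer` shape and function names are illustrative, not the app’s actual code:

```javascript
// Minimal sketch of the collect phase: gather chat submissions for a
// fixed window, then hand the buffer off to the generate step.
const lyricsBuffer = [];

function onChatMessage(user, text) {
  // Only messages starting with "!lyrics " count as submissions.
  if (text.startsWith("!lyrics ")) {
    lyricsBuffer.push({ user, text: text.slice("!lyrics ".length).trim() });
  }
}

function startCollecting(windowSeconds, onDone) {
  lyricsBuffer.length = 0; // fresh buffer every round
  setTimeout(() => onDone(lyricsBuffer.slice()), windowSeconds * 1000);
}

// Example round:
startCollecting(1, (entries) => {
  console.log(`Collected ${entries.length} submission(s)`);
});
onChatMessage("alice", "!lyrics I'm dancing under neon lights");
```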
The magic is in the collaboration. One person says something funny, another adds an emotional line, a third throws in something absurd - and the AI somehow makes it all work as a cohesive song. The results range from genuinely beautiful to hilariously chaotic, and chat *loves* it.

The Architecture: A Real-Time Web of Services
Under the hood, Twitch Sings connects a surprising number of technologies into a seamless experience. Let me walk you through the stack:

Frontend: React + Vite
The streamer’s control panel is a React 18 app built with Vite. It’s a single-page dashboard where you can:
- Start/stop the radio loop
- Pick a genre (pop, rock, synthwave, lo-fi, metal - 12 options)
- Adjust the lyrics collection window (1-120 seconds)
- Control playback (pause, skip, loop, volume)
- View song history with procedurally generated album art
- Monitor Twitch connection status in real-time
There’s also a transparent OBS overlay at /overlay that displays “Now Playing” info, lyrics collection status, and animated audio bars - all designed to sit on top of your stream without blocking gameplay.
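The overlay stays current by folding server messages into a small display state. A minimal sketch of such a client; the message types (`"state"`, `"now-playing"`, `"lyrics-count"`) are assumed for illustration and may differ from the app’s real protocol:

```javascript
// Sketch of an overlay client: fold incoming messages into display state.
function reduceOverlayState(state, msg) {
  switch (msg.type) {
    case "state":        return { ...state, phase: msg.phase };
    case "now-playing":  return { ...state, song: msg.title, genre: msg.genre };
    case "lyrics-count": return { ...state, submissions: msg.count };
    default:             return state;
  }
}

// Browser wiring (inside the React overlay page):
// const ws = new WebSocket("ws://localhost:3001");
// ws.onmessage = (ev) => {
//   overlayState = reduceOverlayState(overlayState, JSON.parse(ev.data));
// };
```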
Backend: Node.js + Express + WebSocket
The server is the orchestrator. It manages:
- Twitch EventSub: A server-side WebSocket connection to Twitch’s EventSub API that listens for chat messages in real-time. When someone types `!lyrics hello world`, the server captures it instantly.
- The Generator Queue: A state machine that drives the collect → generate → play → repeat loop. It handles errors gracefully (if generation fails, it just starts the next round), manages token economics, and broadcasts state changes to all connected clients.
- WebSocket Broadcasting: Every dashboard and OBS overlay connects via WebSocket. State changes, new songs, lyrics submissions, generation progress: everything syncs in real time across all clients.
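A minimal sketch of that broadcast fan-out (the Set-of-clients shape and message fields are illustrative assumptions, not the app’s exact code):

```javascript
// Every connected dashboard/overlay socket is tracked in a set, and each
// state change is serialized once and sent to all of them. Any object
// with a send() method works here (the "ws" library's sockets qualify).
const clients = new Set();

function broadcast(type, payload) {
  const message = JSON.stringify({ type, ...payload });
  for (const client of clients) {
    client.send(message);
  }
  return message;
}

// e.g. when the generator changes state:
// broadcast("state", { phase: "generating" });
```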
Here’s the state machine at the heart of it:
```javascript
// generator-queue.js - the engine driving the radio
this.state = "idle"; // idle | collecting | generating | playing

songFinished() {
  if (this.running && !this.paused && this.settings.loop) {
    this.startCollecting(); // Back to collecting!
  } else {
    this.setState("idle");
  }
}
```

AI Engine: ACE-Step 1.5 on Salad Cloud
This is the brain of the operation. ACE-Step 1.5 is an open-source AI music generation model that takes a text prompt and lyrics, then produces a full song - vocals, instruments, production, everything.
We host it on Salad Cloud containers with GPU acceleration. The API is straightforward:
```javascript
// acestep-client.js - submitting a generation task
const body = {
  prompt: `${genre} track, ${moodDescriptor}, vocals start immediately`,
  lyrics: "[Verse 1]\nI'm dancing under neon lights\n\n[Chorus]\nThe city never sleeps tonight",
  audio_duration: 120,
  inference_steps: 8,
  vocal_language: "en",
  thinking: true, // enables chain-of-thought for better quality
  use_cot_caption: true, // AI enhances the prompt automatically
  audio_format: "mp3",
  seed: -1, // random seed for variety
  lm_temperature: 0.85, // creativity dial
};

const taskId = await aceStep.submitTask(body);

// Poll every 3 seconds until the song is ready
const result = await aceStep.pollResult(taskId);

// Download the generated MP3
const audioBuffer = await aceStep.downloadAudio(result.file);
```

Each genre comes with its own mood descriptor that shapes the AI’s output:
```javascript
const GENRES = {
  pop: { mood: "catchy, upbeat, polished production" },
  synthwave: { mood: "nostalgic, retro-futuristic, neon-lit, driving synthesizers" },
  metal: { mood: "brutal, relentless, thundering drums and heavy riffs" },
  "lo-fi": { mood: "mellow, hazy, warm vinyl crackle and soft piano loops" },
  // … 12 genres total
};
```

Twitch Integration: EventSub WebSocket
Instead of using a traditional IRC chatbot, Twitch Sings uses Twitch’s modern EventSub WebSocket API for server-side chat listening. This means no browser windows running in the background - the server connects directly to Twitch’s infrastructure:
```javascript
// twitch-client.js - server-side Twitch chat listener
const EVENTSUB_URL = "wss://eventsub.wss.twitch.tv/ws";

// Subscribe to chat messages in your channel
// (Helix requires an OAuth token and your app's client ID on every call)
await fetch("https://api.twitch.tv/helix/eventsub/subscriptions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${this.accessToken}`,
    "Client-Id": this.clientId,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    type: "channel.chat.message",
    version: "1",
    condition: {
      broadcaster_user_id: this.broadcasterId,
      user_id: this.broadcasterId,
    },
    transport: { method: "websocket", session_id: this.sessionId },
  }),
});
```

The bot responds to four commands:
| Command | Who Can Use It | What It Does |
| --- | --- | --- |
| !lyrics <text> | Everyone | Submit lyrics for the next song |
| !song | Everyone | Show what's currently playing |
| !queue | Everyone | Check lyrics buffer status |
| !skip | Mods & Broadcaster | Skip the current song |

The Lyrics Merge: Turning Chat Chaos Into Song Structure
This is one of my favorite parts. When multiple people submit lyrics, the system intelligently assigns them to song sections:
```javascript
// lyrics-buffer.js - merging multiple contributors
formatMultipleContributors(entries) {
  const SECTION_NAMES = ["Verse 1", "Chorus", "Verse 2", "Bridge", "Verse 3", "Outro"];
  const sections = [];
  for (let i = 0; i < entries.length; i++) {
    // Wrap around if chat submits more entries than there are section names
    const name = SECTION_NAMES[i % SECTION_NAMES.length];
    sections.push(`[${name}]\n${entries[i].text}`);
  }
  return sections.join("\n\n");
}
```

So if three chatters submit lyrics:
- @alice: “Walking through the rain” → becomes Verse 1
- @bob: “You’re the sunshine in my day” → becomes Chorus
- @charlie: “We’ll find our way” → becomes Verse 2
The AI receives this structured input and generates a song that naturally transitions between sections. It’s collaborative songwriting at the speed of Twitch chat.
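Concretely, running the merge over those three submissions produces the structured lyric sheet the model receives (a standalone restatement of the merge for illustration):

```javascript
// Standalone version of the merge, for illustration.
const SECTION_NAMES = ["Verse 1", "Chorus", "Verse 2", "Bridge", "Verse 3", "Outro"];

function formatMultipleContributors(entries) {
  return entries
    .map((entry, i) => `[${SECTION_NAMES[i % SECTION_NAMES.length]}]\n${entry.text}`)
    .join("\n\n");
}

const merged = formatMultipleContributors([
  { user: "alice", text: "Walking through the rain" },
  { user: "bob", text: "You're the sunshine in my day" },
  { user: "charlie", text: "We'll find our way" },
]);
console.log(merged);
// [Verse 1]
// Walking through the rain
//
// [Chorus]
// You're the sunshine in my day
//
// [Verse 2]
// We'll find our way
```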
Procedural Album Art: Every Song Gets a Cover
Since ACE-Step generates audio only (no images), I built a Canvas-based album art generator that creates unique, deterministic artwork for every song:
```javascript
// albumArt.js - genre-based color palettes
const GENRE_PALETTES = {
  pop: ["#ff6b9d", "#c44dff", "#ff9a76"],
  synthwave: ["#e056fd", "#7d5fff", "#17c0eb"],
  metal: ["#485460", "#d2dae2", "#808e9b"],
  jazz: ["#e77f67", "#cf6a87", "#786fa6"],
};
```
The song title is hashed (DJB2) to seed a pseudo-random number generator, which determines gradient angles, shape positions, and decorative elements. Same song title + genre always produces the same album art. Each song feels like it has its own identity.
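The hash-and-seed step can be sketched like this. DJB2 is the hash named above; the mulberry32 PRNG is an illustrative stand-in for whatever seeded generator the app actually uses:

```javascript
// DJB2 string hash: deterministic 32-bit seed from the song title.
function djb2(str) {
  let hash = 5381;
  for (let i = 0; i < str.length; i++) {
    hash = ((hash << 5) + hash + str.charCodeAt(i)) >>> 0; // hash * 33 + c
  }
  return hash;
}

// mulberry32: a tiny seeded PRNG (illustrative stand-in).
function mulberry32(seed) {
  return function () {
    seed = (seed + 0x6d2b79f5) >>> 0;
    let t = seed;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Same title + genre => same seed => same gradient angle, every time.
const rand = mulberry32(djb2("Neon Nights" + "synthwave"));
const gradientAngle = Math.floor(rand() * 360);
```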
The Desktop App: Electron + Virtual Audio Routing
For streamers who want the full experience, there’s an Electron desktop app that adds virtual audio routing. This lets the generated music play through your microphone channel - so in-game voice chat, Discord calls, or any app that uses your mic can hear the AI songs.
It works with any virtual audio device: VB-Audio Cable, SteelSeries Sonar, VoiceMeeter, and Elgato Wave Link. The audio pipeline goes:
Generated MP3 → FFmpeg (transcode to PCM) → PortAudio → Virtual Audio Device → Discord/Game

There’s even a Push-to-Talk system with a rebindable hotkey, so you can gate when the music plays through your mic.
The Cloud Stack: Salad + Cloudflare + Stripe + MongoDB
The production deployment ties together several cloud services:
- Salad Cloud: Hosts the ACE-Step 1.5 model on GPU containers. Salad provides affordable GPU compute, which is critical because music generation is GPU-intensive work. Each song generation takes a few seconds on a good GPU.
- Cloudflare Pages & Workers: The React frontend deploys to Cloudflare Pages for fast global delivery. Workers handle edge logic and API routing.
- MongoDB: Stores user accounts, token balances, and payment history. The app has an optional account system where streamers can manage their generation tokens.
- Stripe: Powers the token purchase system. Streamers buy generation tokens in tiers (Starter, Pro, Ultra), and each song generation costs one token. The integration uses Stripe Checkout for a smooth payment flow:
```javascript
// Token tiers
const TOKEN_TIERS = [
  { id: "starter", name: "Starter", tokens: 10, price: 299 }, // $2.99
  { id: "pro", name: "Pro", tokens: 50, price: 999 }, // $9.99
  { id: "ultra", name: "Ultra", tokens: 150, price: 2499 }, // $24.99
];
```

The OBS Overlay: Transparent, Real-Time, Beautiful
The overlay is a React page designed specifically as an OBS Browser Source. It has a transparent background and displays:
- During collection: A pulsing indicator showing “Type `!lyrics` in chat!” with a live count of submissions
- During generation: A spinner with “Generating song…” and contributor names
- During playback: “NOW PLAYING” with the song title, genre badge, contributors, and animated audio bars

Streamers add it to OBS as a Browser Source pointing to http://localhost:3001/overlay, and it layers perfectly over their gameplay or webcam.
Speech-to-Song: Say It, Sing It
One of the wildest features: the streamer can speak into their microphone, and the app captures their speech, converts it to text, and feeds it directly into the lyrics buffer. Combined with a 1-second lyrics window, you can literally say something mid-game and hear it turned into a song seconds later.
The speech recognition uses the Web Speech API in the browser (Edge’s speech engine in the Electron app), with support for 11 languages, including English, Japanese, Korean, and Spanish.
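In the browser, that capture path uses the standard `SpeechRecognition` interface (prefixed as `webkitSpeechRecognition` in Chromium). A sketch of how a final transcript might become a lyrics submission; `transcriptToLyrics` and `submitLyrics` are illustrative names, not the app’s actual functions:

```javascript
// Turn a raw speech transcript into a lyrics submission command.
function transcriptToLyrics(transcript) {
  const text = transcript.trim();
  return text.length > 0 ? `!lyrics ${text}` : null;
}

// Browser wiring (not runnable in Node):
// const Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;
// const rec = new Recognition();
// rec.lang = "en-US";     // one of the 11 supported languages
// rec.continuous = true;
// rec.onresult = (event) => {
//   const result = event.results[event.results.length - 1];
//   if (result.isFinal) {
//     const command = transcriptToLyrics(result[0].transcript);
//     if (command) submitLyrics(command); // hypothetical helper
//   }
// };
// rec.start();
```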

What Makes This Special
There are AI music generators out there. There are Twitch bots out there. But the combination of real-time collaborative lyrics from a live audience, instant AI generation, continuous radio-style playback, and seamless streaming integration creates something genuinely new.
I’ve seen:
- Entire chat rooms riffing off each other to create absurdly catchy songs
- Streamers using it as background music that their community made together
- Songs that start as jokes and end up being legitimately good
The best part? Every song is unique. Every song is collaborative. Every song belongs to the community that created it.
Try It Yourself
- Head on over to https://twitchsings.com/#download and download the application!
- Create an account, and you’ll get 25 free tokens to generate music with.
- Open the dashboard, connect your Twitch account, configure your audio routing, hit Start, and tell your chat to type `!lyrics`. The radio starts playing.
- Streamers can also switch to the Microphone tab and click Start Listening to dictate lyrics by voice at the same time!
What’s Next
I’m exploring:
- More AI models: Testing different music generation models for variety
- Song voting: Let chat vote on the best songs
- Remix mode: Re-generate a song with the same melody but new lyrics (already partially built with ACE-Step’s repaint API)
The intersection of live streaming, audience participation, and generative AI is barely explored. Twitch Sings is my way of bridging them all.
Bonus note: Before Twitch Sings, I built a website called Dreamwav, where you can listen to infinite generated music in the genre of your choice. Check it out if you’d like!
Built with React, Node.js, Electron, ACE-Step 1.5, Twitch EventSub, Salad Cloud, Cloudflare, MongoDB, and Stripe.
Your chat writes the lyrics. AI makes the music. Live on your stream.