Building DnD Scribe: Engineering a Scalable Discord Voice Transcription & AI Campaign Memory System

Adrian Chrysanthou · 12 min read
Discord Scribe Logo

DnD Scribe began as a simple question:

Can a Discord bot listen to a live Dungeons & Dragons session, transcribe everything automatically, and turn hours of chaotic role-play into structured, searchable campaign knowledge?

What emerged is a multi-service system that captures live voice, chunks and streams audio for transcription, stores structured session data in MongoDB, and applies large language models to transform raw dialogue into usable narrative artifacts.

Live site: https://www.dndscribe.com
Repository: https://github.com/f00d4tehg0dz/DiscordTranscribeDnD

This article focuses on how it works internally: the audio pipeline, chunking strategy, data modeling, AI usage, and scalability decisions.

Transcriptions Example

System Overview

At a high level, DnD Scribe consists of:

  • Discord bot (Node.js + discord.js)
  • Audio ingestion + chunking layer
  • Speech-to-text pipeline using OpenAI Whisper
  • Summarization and enrichment using GPT models
  • Persistence layer using MongoDB
  • Web UI for browsing campaigns and sessions

Conceptual flow:

Discord Voice Channel
        ↓
Per-user audio capture
        ↓
PCM buffer → WAV chunk
        ↓
Whisper transcription
        ↓
MongoDB (raw segments)
        ↓
LLM summarization
        ↓
MongoDB (summaries + metadata)
        ↓
Web UI / Discord output

The key engineering challenge is handling long-running voice sessions reliably without exhausting memory, API limits, or losing context.

Capturing and Chunking Discord Audio

Discord voice data arrives as continuous PCM frames. Streaming that entire feed into a single transcription request is not viable:

  • Whisper has practical duration limits
  • Memory would grow unbounded
  • Network failures would corrupt large chunks

Instead, DnD Scribe uses a rolling chunk buffer.

Chunking Strategy

Each speaking user maintains:

  • An in-memory PCM buffer
  • Timestamp of the last received audio frame
  • Byte length counter

When either condition is met:

  • Buffer exceeds N seconds (e.g., 20–30s)
  • Silence gap exceeds M milliseconds

The buffer is flushed into a WAV file and queued for transcription.

Pseudo-logic:

if (bufferDuration >= MAX_CHUNK_SECONDS || silenceGap > MAX_SILENCE_MS) {
  flushBufferToWav();
  enqueueForTranscription(wavPath);
  resetBuffer();
}

This yields:

  • Predictable file sizes
  • Fast turnaround for transcripts
  • Minimal memory pressure
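The per-user buffer described above can be sketched as a small class; the thresholds, byte math, and method names here are illustrative rather than the project's actual API:

```javascript
// Sketch of the per-user rolling buffer: accumulate PCM frames, flush when
// either the duration or the silence-gap threshold is crossed.
const MAX_CHUNK_SECONDS = 25;
const MAX_SILENCE_MS = 1500;
const BYTES_PER_SECOND = 48000 * 2 * 2; // 48kHz, stereo, 16-bit PCM

class UserAudioBuffer {
  constructor() {
    this.frames = [];
    this.byteLength = 0;
    this.lastFrameAt = Date.now();
  }

  addFrame(pcmFrame, now = Date.now()) {
    this.frames.push(pcmFrame);
    this.byteLength += pcmFrame.length;
    this.lastFrameAt = now;
  }

  shouldFlush(now = Date.now()) {
    const bufferSeconds = this.byteLength / BYTES_PER_SECOND;
    const silenceGap = now - this.lastFrameAt;
    return bufferSeconds >= MAX_CHUNK_SECONDS || silenceGap > MAX_SILENCE_MS;
  }

  flush() {
    const pcm = Buffer.concat(this.frames);
    this.frames = [];
    this.byteLength = 0;
    return pcm; // caller wraps this in a WAV container and queues it
  }
}
```

The flush returns raw PCM so the WAV conversion step stays decoupled from capture.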

Why Not Stream Directly?

Streaming APIs are fragile under long sessions and transient network issues. File-based chunking provides:

  • Natural retry boundaries
  • Persistent audit trail
  • Easier debugging

If a chunk fails, it can simply be retried.

WAV Conversion Pipeline

Discord voice packets arrive Opus-encoded; each stream is decoded to PCM and written out as a WAV file.

Typical flow:

const prism = require('prism-media');

const opusDecoder = new prism.opus.Decoder({
  rate: 48000,   // 48kHz sample rate
  channels: 2,   // stereo
  frameSize: 960 // 20ms of audio per frame at 48kHz
});

// Opus packets in, decoded PCM out, then into the WAV writer
opusStream.pipe(opusDecoder).pipe(wavWriter);

Design choices:

  • 48kHz stereo to preserve clarity
  • Standard WAV container for Whisper compatibility
  • Temporary filesystem storage instead of memory buffers

This decouples capture from transcription and prevents memory ballooning.
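The WAV container itself is simple enough to build by hand. A minimal sketch, matching the 48kHz stereo 16-bit format above (a real build may use a library such as the `wav` npm package instead):

```javascript
// Wrap raw PCM in a minimal PCM WAV (RIFF) container: a fixed 44-byte
// header followed by the sample data.
function pcmToWav(pcm, sampleRate = 48000, channels = 2, bitsPerSample = 16) {
  const byteRate = sampleRate * channels * (bitsPerSample / 8);
  const blockAlign = channels * (bitsPerSample / 8);
  const header = Buffer.alloc(44);

  header.write('RIFF', 0);
  header.writeUInt32LE(36 + pcm.length, 4); // file size minus first 8 bytes
  header.write('WAVE', 8);
  header.write('fmt ', 12);
  header.writeUInt32LE(16, 16);             // fmt chunk size for PCM
  header.writeUInt16LE(1, 20);              // audio format 1 = uncompressed PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write('data', 36);
  header.writeUInt32LE(pcm.length, 40);

  return Buffer.concat([header, pcm]);
}
```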

Transcription Queue Architecture

A lightweight in-process queue is used:

Audio Chunk → Queue → Worker → Whisper API

Workers:

  • Process chunks sequentially per guild
  • Apply exponential backoff on failures
  • Tag results with guildId, sessionId, and userId

This avoids:

  • Bursting too many API calls
  • Out-of-order transcripts
  • Partial session corruption

Pseudo-worker:

while (queue.hasItems()) {
  const job = queue.next();
  const text = await whisperTranscribe(job.file);
  await storeTranscript(job.meta, text);
}
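The exponential backoff mentioned above can be factored into a small helper that wraps each transcription attempt; the delay values and names are illustrative:

```javascript
// Retry a failing async operation with exponential backoff: wait
// baseDelayMs, then 2x, 4x, ... between attempts before giving up.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withBackoff(fn, { retries = 4, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === retries) throw err;       // out of retries, surface it
      await sleep(baseDelayMs * 2 ** attempt);  // 500ms, 1s, 2s, 4s...
    }
  }
}
```

A worker would wrap `whisperTranscribe(job.file)` in `withBackoff`, so a transient API failure delays the chunk rather than dropping it.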

Using Whisper for Speech-to-Text

Whisper is used because:

  • Handles noisy audio well
  • Supports long-form speech
  • Performs reliably on multiple speakers

Each request includes:

  • Language hint (if known)
  • Temperature near zero
  • No prompt injection (pure transcription)
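These request parameters can be sketched as a small builder. Field names follow the public OpenAI audio transcription API; the helper itself is hypothetical:

```javascript
// Assemble transcription parameters: fixed model, near-zero temperature,
// a language hint only when one is known, and no prompt field at all.
function buildTranscriptionRequest(filePath, language) {
  const params = {
    model: 'whisper-1',
    temperature: 0,  // near-deterministic output
    file: filePath,  // the flushed WAV chunk
  };
  if (language) params.language = language; // hint only if known
  return params;
}
```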

Result:

{
  "text": "I cast fireball at the goblins on the ridge..."
}

This raw output is never overwritten — only appended.

That decision enables:

  • Reprocessing with improved models later
  • Auditing
  • Debugging hallucinations in summaries

MongoDB Data Modeling

Rather than one massive “session” document, data is segmented into collections:

1. Guilds

{
  guildId,
  name,
  openaiKeyEncrypted,
  settings
}

2. Campaigns

{
  campaignId,
  guildId,
  name,
  createdAt
}

3. Sessions

{
  sessionId,
  campaignId,
  startTime,
  endTime,
  status
}

4. Transcripts

{
  sessionId,
  userId,
  timestamp,
  text
}

5. Summaries

{
  sessionId,
  summaryType,   // "interval", "final"
  content,
  createdAt
}

Why This Matters

  • Transcripts scale linearly
  • Summaries can be regenerated
  • Sessions stay lightweight
  • Queries remain fast

No document grows without bounds.
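Keeping queries fast as transcripts accumulate depends on indexing each collection along its access path. An illustrative index set (mongosh syntax; the exact indexes are an assumption, not taken from the repository):

```javascript
// Transcript reads are always scoped to a session and ordered by time,
// so a compound index covers the common query shape.
db.transcripts.createIndex({ sessionId: 1, timestamp: 1 });
db.summaries.createIndex({ sessionId: 1, summaryType: 1 });
db.sessions.createIndex({ campaignId: 1, startTime: -1 });
db.campaigns.createIndex({ guildId: 1 });
```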

Periodic Summarization

Instead of summarizing only at session end, DnD Scribe performs interval summarization.

Example:

  • Every 30 minutes
  • Or after N transcript chunks

Flow:

  1. Pull the last N transcript entries
  2. Feed into LLM
  3. Store interval summary

Prompt structure:

Summarize the following Dungeons & Dragons session dialogue.
Focus on:
- Major events
- NPC interactions
- Player decisions
- Combat outcomes
Return structured bullet points.
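Assembling an interval prompt from the last N transcript entries can be sketched as follows; the helper and its parameters are illustrative:

```javascript
// Build the interval-summary prompt: the fixed instruction block from the
// article, followed by the most recent N transcript lines.
const SUMMARY_INSTRUCTIONS = [
  'Summarize the following Dungeons & Dragons session dialogue.',
  'Focus on:',
  '- Major events',
  '- NPC interactions',
  '- Player decisions',
  '- Combat outcomes',
  'Return structured bullet points.',
].join('\n');

function buildIntervalPrompt(transcripts, lastN = 50) {
  const dialogue = transcripts
    .slice(-lastN) // step 1: only the most recent N entries
    .map((t) => `[${t.userId}]: ${t.text}`)
    .join('\n');
  return `${SUMMARY_INSTRUCTIONS}\n\n${dialogue}`;
}
```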

Benefits:

  • Reduces context window size
  • Enables near-real-time summaries
  • Prevents a single huge prompt at the end

The final summary is then generated from the interval summaries plus transcripts, not from raw text alone.

Summaries Example

Hallucination Mitigation

Several guardrails are used:

  • Only feed the actual transcript text
  • No worldbuilding prompts
  • No “creative writing” instructions
  • Temperature kept low

The AI is instructed to summarize, not embellish.

This keeps the output as factual as possible.
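Those guardrails show up concretely in the request payload. A sketch, assuming a chat-completions-style API; the model name and system message wording are assumptions:

```javascript
// Build a summarization request reflecting the guardrails: low temperature,
// a summarize-only system message, and nothing but transcript text as input.
function buildSummaryRequest(transcriptText) {
  return {
    model: 'gpt-4o-mini', // assumption; any chat model fits here
    temperature: 0.2,     // kept low to discourage embellishment
    messages: [
      {
        role: 'system',
        content:
          'You summarize tabletop session transcripts. Report only events ' +
          'present in the transcript. Do not invent details.',
      },
      { role: 'user', content: transcriptText },
    ],
  };
}
```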

Horizontal Scalability

DnD Scribe can be scaled along three axes:

Bot Instances

Multiple bot processes can run:

  • Same code
  • Different guilds
  • Shared MongoDB

Transcription Workers

The queue can be externalized (Redis, SQS, RabbitMQ later).

Stateless Design

All state is stored in the database:

  • If the bot crashes, the session resumes
  • If the worker crashes, the queue continues

This makes containerized deployment straightforward.

Cost Controls

  • Per-guild API keys
  • Chunk size limits
  • Summarization intervals
  • No continuous streaming to LLMs

Guilds control their own usage footprint.

Security Considerations

  • API keys encrypted at rest
  • No transcripts exposed publicly
  • Guild isolation by ID
  • Environment variables for secrets

Why This Architecture Works Well for D&D

Tabletop sessions are:

  • Long
  • Unstructured
  • Multi-speaker
  • Noisy

Chunked audio + append-only transcripts + layered summarization matches that reality.

The system does not try to “understand” the story in real time.

It records first, then reasons later.

That separation is the core design principle.

Closing Thoughts

DnD Scribe is not just a Discord bot. It is a campaign memory engine.

By treating voice as a data stream, transcripts as immutable logs, and summaries as derived artifacts, the system stays reliable, debuggable, and scalable.

Future expansions (NPC extraction, timeline graphs, character arcs, vector search, RAG) build naturally on top of this foundation.

Bonus: A quick stat page!

Stat Example