You speak at roughly 150 words per minute. You type at about 40. That's not a minor efficiency gap. It's a throughput difference of more than 3x (150 / 40 = 3.75), and once you count the time spent correcting typos, the gap is effectively closer to 4x. Stanford researchers studying text entry on smartphones found that speech input produces about 20% fewer errors than typing in English.
The professional world is catching up to this math.
Every Zoom call now asks if you'd like an AI summary. Otter.ai says it generates over $1 billion in annual ROI for its enterprise customers by turning meeting audio into structured notes. AI transcription went from novelty to assumed workplace infrastructure in under two years. The question stopped being "should we transcribe this?" and became "which system are we using?"
Voice as input isn't a prediction anymore. It's already the default for capturing professional thinking: in meetings, on calls, in the moments between them. What hasn't happened yet is voice as the foundation for professional publishing.
That's what Proofd is building. And it's harder than it looks.
The Bet: Voice as Professional Input
Every professional platform ever built has asked users to create content. Write a post. Compose an article. Draft a thread. The result: the vast majority of professionals have never published anything original. Most people have things worth saying (expertise earned, opinions formed, patterns recognized) and no path from that knowledge to the page.
Proofd asks for something different: capture your thinking as it forms.
Not for an audience. Not as a performance. A voice note from the parking lot after a client call, 30 seconds captured before the insight fades back into the noise of the next meeting. The system takes it from there: transcription, memory extraction, importance scoring, and eventually a polished professional post that sounds unmistakably like you.
The input is low-friction by design. Speaking is what humans are wired to do. You don't plan a voice note. You don't edit it. You don't wonder if it's good enough. You just talk.
But underneath that effortless input is a pipeline that has to solve problems no meeting transcription tool or voice assistant has ever needed to.
The Pipeline: From Sound Wave to Structured Memory
Here's what actually happens when you record a 30-second voice note in Proofd:
1. Capture and durability. The app records AAC audio at 16 kHz mono, optimized for speech clarity rather than music fidelity. Every recording is written to a durable pending queue on device before upload begins. If the upload fails (tunnel, airplane mode, dead zone), a recovery service retries with backoff until the note reaches the server. No voice note is ever silently lost.
2. Transcription. The audio file lands in the cloud and kicks off an automatic transcription pipeline. Our models include per-user vocabulary boosting: if you talk about your company, your clients, or your colleagues regularly, the model learns to hear those names correctly.
3. Memory creation. When the transcript arrives, it doesn't just get stored. It gets understood. The system runs a parallel enrichment pipeline that extracts and maps dozens of data points from a single voice note: scoring, classifying, indexing, and cross-referencing against everything it already knows about you. A 30-second recording becomes a fully enriched, contextually aware memory in seconds. The result is a system that doesn't just record what you said. It comprehends why it matters.
4. Downstream intelligence. The memory doesn't stop at storage. After the initial commit, a cascade of asynchronous analysis layers kicks in, deepening the system's understanding over time. The AI extracts durable facts, resolves ambiguous entities, identifies expertise signals, and feeds everything back into your evolving professional model. Each voice note makes the system smarter: not just about what happened today, but about the structure of how you think.
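To make step 1 concrete, here is a minimal sketch of a durable pending queue with retry and backoff. It illustrates the pattern and is not Proofd's actual code: the names PendingQueue, PendingNote, retry_pending, and backoff_delay are hypothetical, the JSON-file storage is an assumption, and the upload function is supplied by the caller.

```python
import json
import random
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable, List


@dataclass
class PendingNote:
    note_id: str
    audio_path: str
    attempts: int = 0


class PendingQueue:
    """Hypothetical on-device queue: a note is persisted before any upload attempt."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, note_id: str) -> Path:
        return self.directory / f"{note_id}.json"

    def enqueue(self, note: PendingNote) -> None:
        # Write to a temp file, then rename, so a crash never leaves a half-written entry.
        final = self._path(note.note_id)
        tmp = final.with_suffix(".tmp")
        tmp.write_text(json.dumps(asdict(note)))
        tmp.replace(final)

    def pending(self) -> List[PendingNote]:
        files = sorted(self.directory.glob("*.json"))
        return [PendingNote(**json.loads(f.read_text())) for f in files]

    def mark_uploaded(self, note_id: str) -> None:
        self._path(note_id).unlink(missing_ok=True)


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Exponential backoff with full jitter, capped at `cap` seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def retry_pending(queue: PendingQueue, upload: Callable[[PendingNote], bool]) -> List[float]:
    """Run one recovery pass; return the delays to wait before retrying each failure."""
    delays = []
    for note in queue.pending():
        if upload(note):
            queue.mark_uploaded(note.note_id)  # only remove after confirmed success
        else:
            note.attempts += 1
            queue.enqueue(note)  # persist the updated attempt count
            delays.append(backoff_delay(note.attempts))
    return delays
```

The point of the design is ordering: the note hits local storage first and is deleted only after the upload confirms, so a failed or interrupted upload leaves the entry in place for the next pass.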
The Hard Problems
Casual speech is terrible data
Meeting transcription tools have it comparatively easy. Meetings have structure: agendas, turns, formal speech patterns. Voice notes from someone walking to their car after a client meeting sound like this: "So yeah the pitch went okay but I think we lost them at the pricing slide — Marcus kept coming back to it, and I don't think we actually answered his question, which was really about ROI not cost, and I need to think about how to reframe that before the follow-up."
One voice note. Embedded inside it: a relationship signal (Marcus, key decision-maker), a tactical problem (pricing objection misdiagnosed), a root cause analysis (ROI vs. cost framing), and a forward-looking action (reframe before follow-up). The system has to hear all of that.
Classical NLP would struggle. There are no clean sentence boundaries. The referents are ambiguous. The professional insight isn't in any single word. It's in the contrast between what Marcus asked and what was answered. This is the kind of language professionals use when they're thinking out loud, and it's the hardest kind to process.
One voice note, many memories
A professional says: "Just got out of the strategy meeting, we're moving forward on international expansion, and I'm also worried about the team dynamic — two people who should be collaborating aren't talking." That's three distinct signals in one breath: a strategic decision, an organizational concern, and a relationship observation. Each carries a different importance score, different entity associations, and different downstream behaviors. Proofd's memory system untangles these, splitting a natural language stream into discrete semantic events while preserving the context that spans them.
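One way to picture the output of that split is as a list of memory records that each point back into the same transcript. The sketch below shows only the data shape. It assumes an upstream model has already labeled the segments, and the names Memory, split_note, and the kind labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Memory:
    note_id: str
    kind: str                # e.g. "strategic_decision" (labels are illustrative)
    text: str
    span: Tuple[int, int]    # character offsets into the original transcript
    shared_context: str      # context that applies to every memory from this note


def split_note(
    note_id: str,
    transcript: str,
    labeled_segments: List[Tuple[str, str]],
    shared_context: str,
) -> List[Memory]:
    """Turn (kind, segment) pairs into memories anchored to the transcript."""
    memories = []
    cursor = 0
    for kind, segment in labeled_segments:
        start = transcript.find(segment, cursor)
        if start == -1:
            raise ValueError(f"segment not found in transcript: {segment!r}")
        end = start + len(segment)
        memories.append(Memory(note_id, kind, segment, (start, end), shared_context))
        cursor = end  # segments are expected in transcript order
    return memories


transcript = (
    "Just got out of the strategy meeting, we're moving forward on "
    "international expansion, and I'm also worried about the team dynamic, "
    "two people who should be collaborating aren't talking."
)
memories = split_note(
    note_id="note-1",
    transcript=transcript,
    labeled_segments=[
        ("strategic_decision", "we're moving forward on international expansion"),
        ("organizational_concern", "I'm also worried about the team dynamic"),
        ("relationship_observation", "two people who should be collaborating aren't talking"),
    ],
    shared_context="Just got out of the strategy meeting",
)
```

Keeping the character span and the shared context on every record is what lets each memory be scored and routed independently without losing where it came from.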
Personalized vocabulary is essential
Even the best speech-to-text models are general purpose. They don't know your company name, your clients, your colleagues, or your industry jargon. Proofd solves this with a personalization layer that adapts transcription to your professional world. Every transcription is shaped by what the system already knows about you. Nothing breaks the spell faster than a colleague's name consistently misspelled in the transcript.
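As a rough illustration of the idea, here is a sketch that corrects near-miss names in a finished transcript against a per-user vocabulary list. This is a deliberately simple stand-in: it post-processes text with fuzzy matching from Python's standard difflib module, which is an assumption for illustration and not a description of how Proofd's personalization layer works. The function name correct_names and the 0.8 cutoff are hypothetical.

```python
import difflib
import re
from typing import Iterable


def correct_names(transcript: str, vocabulary: Iterable[str], cutoff: float = 0.8) -> str:
    """Replace capitalized words that closely match a known term with that term."""
    vocab = {term.lower(): term for term in vocabulary}

    def fix(match: "re.Match[str]") -> str:
        word = match.group(0)
        close = difflib.get_close_matches(word.lower(), list(vocab), n=1, cutoff=cutoff)
        return vocab[close[0]] if close else word

    # Only capitalized words are candidates, to avoid rewriting ordinary vocabulary.
    # Limitation: a sentence-initial common word that resembles a name could still be changed.
    return re.sub(r"\b[A-Z][A-Za-z]*\b", fix, transcript)


fixed = correct_names(
    "Markus kept coming back to the pricing slide",
    vocabulary=["Marcus", "Proofd"],
)
```

A production system would more likely bias the recognizer itself than patch its output, but the sketch shows why the vocabulary has to be per-user: "Markus" is only wrong if the system knows your colleague spells it "Marcus".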
Importance is subjective and contextual
"Closed the deal" after six months of pipeline is not the same event as "closed the deal" on a normal Tuesday. Proofd's importance engine doesn't just score what you said. It scores what you said relative to everything it already knows about you. The system maintains a living model of your professional patterns, so it can distinguish routine from revelation. That contextual scoring changes everything downstream: what gets surfaced, what gets emphasized in the published post, and what the system decides is worth remembering at all.
Timing is trickier than it sounds
When someone records a voice note at 8 PM about something that happened in a 2 PM meeting, which timestamp matters? The recording time? The event time? The processing time? The system needs to reason about all three for different purposes. The event time (the actual moment being described) is locked inside the natural language and requires inference to extract.
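Here is a small sketch of keeping the three timestamps separate, with a naive inference step for the event time. It only handles explicit clock mentions such as "2 PM" and assumes they refer to the most recent occurrence before the recording. The names MemoryTimestamps and infer_event_time are hypothetical, and real event-time inference would need to handle phrases like "this morning" or "last week".

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Matches clock mentions like "2 PM", "2pm", or "11:30 am".
_CLOCK = re.compile(r"\b(1[0-2]|0?[1-9])(?::([0-5]\d))?\s*(am|pm)\b", re.IGNORECASE)


@dataclass(frozen=True)
class MemoryTimestamps:
    recorded_at: datetime            # when the user pressed record
    processed_at: datetime           # when the pipeline finished with the note
    event_at: Optional[datetime]     # when the described event happened, if inferable


def infer_event_time(transcript: str, recorded_at: datetime) -> Optional[datetime]:
    """Resolve the first clock mention to the latest matching time before recording."""
    match = _CLOCK.search(transcript)
    if match is None:
        return None
    hour = int(match.group(1)) % 12  # "12 am" -> 0, "12 pm" -> 12 after the shift below
    if match.group(3).lower() == "pm":
        hour += 12
    minute = int(match.group(2) or 0)
    candidate = recorded_at.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate > recorded_at:
        candidate -= timedelta(days=1)  # a time later today must mean yesterday
    return candidate


recorded = datetime(2024, 5, 1, 20, 0)
stamps = MemoryTimestamps(
    recorded_at=recorded,
    processed_at=recorded + timedelta(seconds=12),
    event_at=infer_event_time("In the 2 PM meeting we agreed to revisit pricing", recorded),
)
```

Storing all three lets each consumer pick the one it needs: ordering a feed by recording time, measuring pipeline latency from processing time, and building a timeline of what actually happened from event time.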
Why This Matters
The reason voice works as professional input isn't just speed. It's that voice captures what typing filters out.
When you write a post, you edit. You delete. You reconsider. By the time you hit publish, you've compressed your thinking into something that feels appropriate for the medium. Voice doesn't work that way. You start talking and the texture of your actual thinking comes through: the uncertainty, the conviction, the offhand observation you wouldn't have thought to include in a written draft.
That texture is exactly what makes professional content feel real. Knowing a colleague is "genuinely unsure about this one" is more useful than knowing they "considered multiple options." The intellectual honesty is the signal. Voice preserves it. Typing destroys it.
This is the core thesis: voice is the only input modality that captures thinking as it actually happens. Pair that input with AI that can listen, infer, remember, and publish, and you get something no professional platform has ever had. Not a creation tool. Not a drafting assistant. A system that knows how you think and can say it right.
We're building Proofd for professionals who have expertise worth sharing and no path to sharing it. Voice is how it starts. Everything else follows from there.
The future of professional publishing isn't drafting. It's talking.
