
Building a Video Export Pipeline Inside a Chrome Extension

How I built video export for a Chrome extension using the browser's native WebCodecs API — no server, no ffmpeg. SVG foreignObject, memory management, and encoding pitfalls.



I wanted to add video export to a Chrome extension. No server. No WASM bundles. No ffmpeg. Just whatever the browser gives you natively.

That sentence sounds simple. It took months to get right, and I still label one of the two rendering paths as "Experimental." This article is everything I learned — including the parts where I got it wrong.

Captio is a browser extension for screenshot compositing and animation. Users capture elements from web pages, arrange them with backgrounds and effects, animate them on a timeline, and export the result. When I decided that "export as video" needed to work entirely client-side, I didn't fully appreciate what I was signing up for.

If you're building anything that touches video encoding in a browser extension context, this might save you a few weeks. If you want the broader picture of where client-side video rendering stands in 2026, I wrote about that in a separate overview article.

The Pipeline: Overview

Before getting into the details, here's the full path a single frame takes in the headless rendering pipeline:

DOM Composition
→ SVG foreignObject (XMLSerializer)
→ Base64 Data URI (FileReader)
→ Image (new Image() + onload)
→ Canvas (drawImage)
→ VideoFrame (new VideoFrame(canvas))
→ VideoEncoder (encoder.encode(frame))
→ Muxer (mp4-muxer or webm-muxer)
→ Blob → Download

Every arrow in that diagram is a place where something can break — especially inside an extension's content script. Let's walk through each one.

The pipeline supports MP4 output with H.264 encoding and WebM with VP9 (falling back to VP8 where needed). Container muxing is handled by mp4-muxer and webm-muxer, both pure JavaScript libraries. Captio supports resolutions up to 4K (3840x2160) and frame rates of 24, 30, 50, and 60 fps, all encoded through the native WebCodecs API with hardware acceleration.

SVG foreignObject: The Only Way to Get DOM to Canvas

There is no canvas.drawDOM() method. The browser gives you no direct way to render an HTML element tree onto a canvas. The only sanctioned path is SVG's <foreignObject> element, which lets you embed XHTML inside an SVG, and SVGs can be drawn to canvas.

The basic idea:

```xml
<svg xmlns="http://www.w3.org/2000/svg" width="1920" height="1080">
  <foreignObject width="100%" height="100%">
    <!-- Your serialized DOM goes here, as XHTML -->
  </foreignObject>
</svg>
```

You serialize the DOM subtree you want to render, wrap it in that SVG, convert the whole thing to an image source, and draw that image to canvas. Libraries like html-to-image automate this process and do a good job for single exports.

But for video, you're doing this hundreds of times. A 10-second video at 30 fps means 300 frames. Each one needs a full serialize-encode-decode cycle.
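The wrapping step itself is just string assembly. A minimal sketch — `wrapInSvg` is a hypothetical helper, and the input must already be XML-valid XHTML (i.e. XMLSerializer output, not innerHTML):

```javascript
// Wrap a serialized XHTML string in an SVG shell sized to the export
// resolution, so the browser can rasterize the DOM subtree as an image.
function wrapInSvg(xhtml, width, height) {
  return (
    `<svg xmlns="http://www.w3.org/2000/svg" width="${width}" height="${height}">` +
    `<foreignObject width="100%" height="100%">${xhtml}</foreignObject>` +
    `</svg>`
  );
}
```

This string then becomes the frame's image source — which, inside an extension, is where the trouble starts.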

Where html-to-image falls short for video

html-to-image works by cloning the target DOM node and calling getComputedStyle on every element to inline all computed styles. For a single screenshot, that's fine. For frame 247 of 300, it's a performance problem.

In Captio, I use html-to-image for standard single-image exports. For batch and video exports, I built a custom serializer (svgSerializer.ts) that uses XMLSerializer directly. It skips the DOM clone and the getComputedStyle pass entirely. The tradeoff is that the composition needs to be fully styled through explicit attributes and stylesheets rather than relying on computed style resolution — but since the compositor controls the DOM it produces, that's manageable.

During development, I measured the difference: roughly 350ms per frame at 1080p with the optimized serializer versus around 600ms with standard html-to-image. These numbers will vary with hardware and composition complexity, but the relative improvement was consistent enough to justify the custom path.

Extension Security: Everything That Can Go Wrong

Here's where building inside a Chrome extension diverges from building in a normal web app. The content script execution environment has security restrictions that are rarely documented and almost never mentioned in WebCodecs tutorials.

Blob URLs taint the canvas

In a normal web page, you might convert your SVG to a Blob, create a Blob URL with URL.createObjectURL(), load that as an image source, and draw it to canvas. It works. Clean, straightforward.

In an extension content script, that Blob URL taints the canvas. A tainted canvas cannot be read — calling getImageData(), toBlob(), or using it as a VideoFrame source will throw a security error. No warning, no fallback, just a broken pipeline.

The workaround: skip Blob URLs entirely. Use FileReader.readAsDataURL() to convert the SVG Blob into a Base64 data URI. Data URIs don't carry the same origin tainting. It costs you the Base64 encoding overhead on every frame, but it works.
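The conversion is a one-liner wrapped in a Promise. A sketch of the FileReader path (browser-only; `blobToDataUri` is a hypothetical helper name):

```javascript
// Convert an SVG Blob to a Base64 data URI via FileReader, avoiding
// Blob URLs — which taint the canvas in an extension content script.
function blobToDataUri(blob) {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onload = () => resolve(reader.result); // "data:image/svg+xml;base64,..."
    reader.onerror = () => reject(reader.error);
    reader.readAsDataURL(blob);
  });
}
```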

createImageBitmap and InvalidStateError

I tried optimizing with createImageBitmap, which should be faster than the new Image() → onload → drawImage path. Feed it a Blob, get back an ImageBitmap, draw that to canvas.

Except when you feed it an SVG Blob inside an extension content script, it throws an InvalidStateError. Not a CORS error, not a type error — an InvalidStateError with no useful message. I spent longer than I'd like to admit debugging that one.

The fix was to stay on the data URI path. Load the data URI into a regular Image element, wait for onload, then drawImage to canvas. Less elegant, but reliable.
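The reliable path, sketched below (browser-only; helper name is hypothetical). It decodes the data URI through a plain Image element and rasterizes it onto the export canvas:

```javascript
// Load a data URI into an Image, then draw it onto the canvas once the
// decode completes. Staying on the data-URI path avoids both the
// tainted-canvas problem and createImageBitmap's InvalidStateError.
function drawDataUriToCanvas(dataUri, ctx, width, height) {
  return new Promise((resolve, reject) => {
    const img = new Image();
    img.onload = () => {
      ctx.clearRect(0, 0, width, height);
      ctx.drawImage(img, 0, 0, width, height);
      resolve();
    };
    img.onerror = reject;
    img.src = dataUri;
  });
}
```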

Cross-origin stylesheets and font embedding

Captio's compositions can include custom fonts — Google Fonts, system fonts, whatever the page being captured uses. For SVG foreignObject rendering, those fonts need to be embedded as Base64 data URIs inside the SVG itself, because the SVG is rendered in an isolated context with no network access.

I extract @font-face rules from all accessible stylesheets, including those inside Shadow DOM. For each rule, I fetch the font file, convert it to a Base64 data URI via FileReader, and inject it into the SVG.

The catch: cross-origin stylesheets throw a SecurityError when you try to read their CSS rules. You can't enumerate the rules of a stylesheet loaded from a different origin. I handle this with a try/catch that silently skips inaccessible stylesheets. It means some third-party fonts may not render correctly in the headless pipeline — one of the reasons that path is labeled "Experimental."
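The rule-enumeration guard looks roughly like this (simplified — the real extractor also walks Shadow DOM and fetches each font file for inlining):

```javascript
// Collect @font-face rules from accessible stylesheets. Reading
// .cssRules on a cross-origin sheet throws a SecurityError, so those
// sheets are skipped — their fonts may not render in the headless path.
function collectFontFaceRules() {
  const rules = [];
  for (const sheet of document.styleSheets) {
    let cssRules;
    try {
      cssRules = sheet.cssRules; // throws for cross-origin stylesheets
    } catch {
      continue; // inaccessible — silently skip
    }
    for (const rule of cssRules) {
      if (rule instanceof CSSFontFaceRule) rules.push(rule.cssText);
    }
  }
  return rules;
}
```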

Memory Management: The Hidden Boss

The security issues are frustrating but solvable. Memory management is the problem that never fully goes away.

Every frame in the headless pipeline involves significant string allocation. The serialized SVG for a complex composition can be hundreds of kilobytes. Base64 encoding inflates that by roughly 33%. The font CSS block — which includes every embedded font as a data URI — gets prepended to every frame. Multiply all of that by 300 frames and you're generating an enormous amount of short-lived data.

Why performance.memory is not enough

Chrome's performance.memory API only reports the JS heap. It tells you nothing about canvas buffer memory, decoded image data sitting in the browser's image cache, or GPU memory used by the video encoder. You can watch usedJSHeapSize stay flat at around 100 MB while the browser process balloons far beyond that.

During development, the JS heap typically stayed around 100 MB during exports, which sounds reasonable. But without active management, the browser would eventually hit resource limits and either drop frames or crash the tab.

The cooldown strategy

My approach: periodic cooldown pauses during the export. Approximately every 0.5 seconds of video output (configurable), I pause frame generation, flush the encoder, and give the garbage collector time to clean up.

After each frame, I aggressively null out references — the SVG string, the data URI, the Image element, the canvas context. I don't rely on scope-based cleanup because the GC may not run between frames if they're being generated as fast as possible.

This is also where the encoder flush matters. The VideoEncoder maintains an internal queue of frames. Flushing it forces all queued frames through encoding and releases their associated memory. Without periodic flushes, the encoder holds onto frame data longer than necessary.
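A sketch of the cooldown check, called once per frame — the constants and helper name are hypothetical, and the right values depend on composition complexity:

```javascript
// Every COOLDOWN_FRAMES frames (≈0.5 s of output at 30 fps), drain the
// encoder's internal queue and yield briefly so the GC can reclaim the
// per-frame SVG strings, data URIs, and decoded image data.
const COOLDOWN_FRAMES = 15;

async function maybeCooldown(encoder, frameIndex) {
  if (frameIndex > 0 && frameIndex % COOLDOWN_FRAMES === 0) {
    await encoder.flush();                        // force queued frames through
    await new Promise((r) => setTimeout(r, 100)); // breathing room for GC
  }
}
```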

The result is an export that's slower than it theoretically could be, but stable. I chose reliability over speed — an export that fails at frame 260 of 300 is worse than one that takes an extra minute.

Video Encoding: WebCodecs in Practice

The WebCodecs API is the foundation that makes all of this possible without ffmpeg or WASM. It provides VideoEncoder and VideoFrame as native browser APIs with hardware acceleration support.

Encoder setup

Setting up the encoder means choosing a codec (H.264 for MP4, VP9 or VP8 for WebM), configuring resolution and frame rate, and specifying hardware acceleration preferences. WebCodecs lets you request hardware acceleration but falls back to software encoding if the GPU codec isn't available.

```javascript
const encoder = new VideoEncoder({
  output: (chunk, meta) => muxer.addVideoChunk(chunk, meta),
  error: (e) => handleEncoderError(e),
});

encoder.configure({
  codec: 'avc1.640028', // H.264 High Profile Level 4.0
  width: 1920,
  height: 1080,
  bitrate: 8_000_000,
  framerate: 30,
  hardwareAcceleration: 'prefer-hardware',
});
```

The codec string matters. For H.264, avc1.640028 means High Profile, Level 4.0 — which supports 1080p at 30fps. If you need 4K or 60fps, you need a higher level. If the browser's hardware encoder doesn't support your requested profile, it falls back to software encoding automatically.

For WebM, I prefer VP9 but fall back to VP8 if VP9 encoding isn't available on the device. VP9 produces smaller files at the same quality but encodes slower in software mode.
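The fallback decision can be made up front with VideoEncoder.isConfigSupported, which probes support without allocating an encoder. A sketch — the VP9 codec string shown (profile 0, level 1.0, 8-bit) is one common choice, not the only valid one:

```javascript
// Probe VP9 encoder support for the target configuration; fall back to
// VP8 if the device can't provide it.
async function pickWebmCodec(width, height, bitrate, framerate) {
  const base = { width, height, bitrate, framerate };
  const vp9 = await VideoEncoder.isConfigSupported({
    ...base,
    codec: 'vp09.00.10.08',
  });
  if (vp9.supported) return 'vp09.00.10.08';
  return 'vp8'; // larger files at the same quality, but universally available
}
```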

Backpressure: don't outrun the encoder

This was a lesson learned the hard way. If you feed frames to the encoder faster than it can process them, the internal queue grows, memory usage spikes, and eventually you hit problems.

WebCodecs exposes encoder.encodeQueueSize — the number of frames waiting to be encoded. I check it before submitting each frame. If the queue exceeds 3 frames, the pipeline waits; if the queue doesn't drain within 5 seconds, the export aborts with a timeout.

```javascript
async function waitForEncoder(encoder) {
  if (encoder.encodeQueueSize <= 3) return;

  const start = Date.now();
  while (encoder.encodeQueueSize > 3) {
    if (Date.now() - start > 5000) {
      throw new Error('Encoder backpressure timeout');
    }
    await new Promise((r) => setTimeout(r, 10));
  }
}
```

This is a simple pattern but it prevents the most common failure mode I saw during testing: the encoder falling behind, memory filling up, and the tab crashing.

Muxer integration

WebCodecs gives you encoded video chunks. It does not give you a playable file. You need a muxer to wrap those chunks in a container format — MP4 or WebM.

I use mp4-muxer and webm-muxer, both by the same author. They share a similar API pattern: create a muxer with a target (I use ArrayBufferTarget), feed it video chunks from the encoder's output callback, and finalize to get the complete file.

Both libraries are pure JavaScript — no WASM, no native dependencies. They work in content scripts without CSP issues. On abort, I release the muxer's buffer explicitly to avoid holding onto partially muxed data.
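The MP4 end-to-end shape looks roughly like this. This sketch follows mp4-muxer's documented API as I understand it — option names may differ between versions, so treat it as an outline rather than a drop-in:

```javascript
import { Muxer, ArrayBufferTarget } from 'mp4-muxer';

const target = new ArrayBufferTarget();
const muxer = new Muxer({
  target,
  video: { codec: 'avc', width: 1920, height: 1080 },
  fastStart: 'in-memory', // place the moov atom up front
});

// The encoder's output callback feeds the muxer:
//   output: (chunk, meta) => muxer.addVideoChunk(chunk, meta)

// After the final encoder.flush():
muxer.finalize();
const blob = new Blob([target.buffer], { type: 'video/mp4' });
// blob is the finished, downloadable MP4 file
```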

One note: there is no audio. Captio's video export produces video-only files. This is a known limitation. For the use case — animated screenshots and compositions — it's acceptable. If you need audio muxing in the browser, the ecosystem is still catching up. Chrome's WebCodecs documentation discusses some of the challenges.

Capture Visible Tab: The Alternative Path

Everything described above is the headless rendering pipeline — the experimental path. The primary, production path is fundamentally different: it captures what's already on screen.

chrome.tabs.captureVisibleTab is an extension API that takes a screenshot of the currently visible browser tab. It's the same mechanism screenshot extensions use, but Captio calls it for every frame.

Why this works better (most of the time)

The result is pixel-perfect. Whatever the browser renders — CSS gradients, backdrop filters, custom fonts, complex transforms — gets captured exactly as displayed. There's no serialization, no SVG foreignObject, no font embedding. The browser already rendered it correctly; Captio just takes a picture.

For each frame, Captio updates the composition's animation state, waits for the browser to paint, calls captureVisibleTab, and feeds the resulting image to the same VideoEncoder → Muxer pipeline described above.
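A hypothetical per-frame loop for this path. Note that captureVisibleTab is only available to the extension's background/popup context, not content scripts, so the animation-state updates go over messaging — setAnimationTime, nextPaint, and encodeFrame are assumed helpers, not real APIs:

```javascript
// Drive the composition frame by frame, capture the painted result,
// and hand each capture to the WebCodecs encoding pipeline.
async function captureFrames(tabId, frameCount, fps) {
  for (let i = 0; i < frameCount; i++) {
    await setAnimationTime(tabId, i / fps); // message the content script
    await nextPaint(tabId);                 // wait for the browser to paint
    const dataUrl = await chrome.tabs.captureVisibleTab(undefined, {
      format: 'png',
    });
    await encodeFrame(dataUrl, i);          // decode → canvas → VideoFrame
  }
}
```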

Tiling for high resolutions

A browser viewport is typically 1920x1080 or smaller. If the user wants a 4K export (3840x2160), it can't be captured in a single shot. The solution is tiling: apply CSS scaling to the composition, capture multiple overlapping tiles of the viewport, and stitch them together on a canvas.

This means a 4K frame might require 4 or more captureVisibleTab calls, each with a CSS transform to position the relevant portion in the viewport. The tiling logic handles the math of how many tiles are needed and how they overlap.
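The tile math itself is simple grid arithmetic. A sketch, with a hypothetical fixed overlap to hide seams when stitching:

```javascript
// Compute the capture offsets needed to cover a target frame with
// viewport-sized tiles, overlapping by `overlap` pixels on each seam.
function tileLayout(targetW, targetH, viewW, viewH, overlap = 32) {
  const stepX = viewW - overlap;
  const stepY = viewH - overlap;
  const cols = Math.max(1, Math.ceil((targetW - overlap) / stepX));
  const rows = Math.max(1, Math.ceil((targetH - overlap) / stepY));
  const tiles = [];
  for (let r = 0; r < rows; r++) {
    for (let c = 0; c < cols; c++) {
      tiles.push({
        // Clamp so edge tiles stay inside the target frame.
        x: Math.min(c * stepX, targetW - viewW),
        y: Math.min(r * stepY, targetH - viewH),
      });
    }
  }
  return tiles;
}
```

With no overlap, a 3840x2160 target and a 1920x1080 viewport yields a 2x2 grid — four captures per frame, which matches the "4 or more" figure above once overlap is added.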

The tradeoff

The tab is blocked during export. The user sees the composition flashing through animation states. They can't switch tabs or interact with anything — the extension needs the visible tab content to match the current frame.

This is the fundamental tradeoff between the two modes. Capture Visible Tab is pixel-perfect but locks the browser. Headless Render lets the user keep working but can have visual discrepancies in edge cases.

Rate limiting reality

Chrome rate-limits captureVisibleTab. In my testing, the minimum interval between successful captures is approximately 510ms. You can call it faster, but the browser either returns the same image or delays the response.

This means a 30fps, 10-second video (300 frames) takes at minimum 300 × 510ms = roughly 2.5 minutes for the capture phase alone, assuming no tiling. With 4K tiling, multiply accordingly. Combined with encoding time, a full export takes several minutes.

There's no workaround for this rate limit. It's a Chrome-level throttle and it applies regardless of how your extension is configured.

Performance Reality

Let's be honest about timing. This export pipeline is not real-time. It's not close to real-time.

Here's where the time goes in the headless rendering path:

  • SVG Serialization: Converting the DOM tree to an XHTML string via XMLSerializer. For complex compositions, this is a significant chunk of per-frame time.
  • Base64 Encoding: FileReader.readAsDataURL() converts the SVG blob to a data URI. The Base64 expansion adds roughly 33% to the data size.
  • Image Decode: The browser needs to parse the data URI, decode the SVG, rasterize it, and make it available for canvas drawing. This is often the largest single cost.
  • Video Encoding: The actual WebCodecs encoding step is usually the fastest part, especially with hardware acceleration. The bottleneck is almost always on the rendering side.

During development, I measured roughly 350ms per frame at 1080p with the optimized serializer. For the Capture Visible Tab path, I saw 150–500ms per frame depending on Chrome's rate-limiting behavior and whether tiling was needed.

A 10-second video at 30fps is 300 frames. At 350ms per frame, that's about 105 seconds — nearly two minutes — just for rendering. Add encoder flush pauses and cooldown periods, and real-world exports typically take several minutes.

I don't hide this from users. The export dialog shows a progress bar, estimated time remaining, and the current frame count. Setting accurate expectations matters more than optimistic promises.

What I haven't solved

The headless render path is labeled "Experimental" for a reason. Certain CSS features don't serialize perfectly into SVG foreignObject. Complex backdrop-filter stacks, some clip-path combinations, and external resources that can't be inlined can all cause visual differences between the editor and the rendered output.

For the use case of screenshot-to-video workflows, where the composition is built from captured elements in a controlled compositor, these edge cases are manageable. For arbitrary DOM rendering, they'd be a much bigger problem — which is part of why Replit's time-virtualization approach is interesting, since it renders in a real browser context rather than through SVG serialization.

Browser Compatibility

The WebCodecs API is supported on Chrome 94+, Edge 94+, and Firefox 130+. Other Chromium-based browsers (Brave, Opera, Arc) generally inherit Chrome's WebCodecs support. This includes ChromeOS — Chromebooks run a full Chrome browser, so the entire pipeline works natively. For more on the Chromebook screenshot workflow, see our Chromebook screenshot guide.
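Given the uneven rollout, runtime feature detection is safer than version sniffing. A minimal check for the interfaces the pipeline needs:

```javascript
// Detect WebCodecs support by probing for the specific globals the
// export pipeline relies on, rather than parsing the user agent.
function webcodecsSupport() {
  return {
    videoEncode: typeof VideoEncoder !== 'undefined',
    videoDecode: typeof VideoDecoder !== 'undefined',
    audioEncode: typeof AudioEncoder !== 'undefined', // newest; Safari lagged here
  };
}
```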

Safari has supported VideoEncoder and VideoDecoder since version 16.4, though AudioEncoder and AudioDecoder only arrived in Safari 26. The video encoding primitives exist, but I haven't tested Captio's specific pipeline on Safari — the SVG foreignObject serialization, font embedding, and extension security workarounds may behave differently. If you're building for Safari, test thoroughly rather than assuming compatibility.

The captureVisibleTab API is Chromium-only — it's part of the Chrome extensions API. Firefox has a similar capability through browser.tabs.captureTab, but the behavior and rate-limiting characteristics differ.

Conclusion and Learnings

After months of building this pipeline, here's what stuck with me:

WebCodecs is production-ready for encoding. The API is stable, hardware acceleration works as advertised, and the encode-side performance is not the bottleneck. If your source frames are already on a canvas, the path from canvas to encoded video chunk is straightforward.

Everything before encoding is the hard part. Getting DOM content onto a canvas reliably, inside an extension's security context, with fonts and styles intact — that's where most of the development time went.

Memory management is ongoing work, not a solved problem. I have a strategy that works (cooldown pauses, aggressive cleanup, encoder flushing), but it's empirical. I tuned it based on testing, not from first principles. Different compositions with different complexities may need different tuning.

The muxing ecosystem is still young. mp4-muxer and webm-muxer do their job, but the broader landscape of browser-native multimedia tooling is still developing. Mediabunny — created by Vanilagy, the same developer behind mp4-muxer and webm-muxer, and now sponsored by Remotion — is working to change that by building a more comprehensive multimedia toolkit for the browser.

Be honest about tradeoffs. Capture Visible Tab is pixel-perfect but locks the browser. Headless Render is flexible but experimental. Export takes minutes, not seconds. No audio support. Documenting these limitations clearly saved me more support conversations than any feature I shipped.

If you want to see this pipeline in action, Captio is available as a browser extension for Chrome, Edge, and all Chromium-based browsers. For the broader context of where client-side video rendering is heading — including what Remotion and Replit are doing — check out the overview of client-side video rendering in 2026.


This article reflects my experience building video export for Captio as of early 2026. The WebCodecs specification is a W3C Working Draft and details may change. Browser behavior described here was observed on Chrome and Chromium-based browsers.