The right way to encode HTML5 video and audio
The title of this post might be a little misleading, since I'm also going to go into detail on how MediaCrush handles video, audio, and arbitrary image formats for display on the web. This is a technical article - if you're not a hacker, don't feel bad for not understanding it.
The goal of this article is to clear up the misinformation around how video and audio should be encoded for the web. We've personally spent a lot of time working on it, going through several revisions and hours of hair-tearing effort to support nearly all browsers and devices. MediaCrush can now take just about any video, audio, or image format, and turn it into something browser friendly. Feel free to watch this video on any device you can get your hands on to see that we know what we're talking about.
By the way, if you want to skip all this crap, it is very easy to embed MediaCrush videos and audio on your own website. You can just upload your media to MediaCrush proper and let us do the hard work. MediaCrush is also open-source, and we encourage you to review our code (or even use it yourself) if you get stuck, and we'd love to get your contribution if you think we can improve it.
Let's get started. You'll want the tools of the trade. We built our service on Linux, but you can probably get it working on OS X or Windows with enough effort. You'll need to acquire and become familiar with ffmpeg and ImageMagick. The latter is optional if you don't care about how we handle arbitrary image formats.
Please note that you cannot use your distribution-provided ffmpeg package. Read up on compiling ffmpeg yourself and make sure you include libx264, fdk-aac, libmp3lame, libvpx, and libopus. Our servers and dev machines run Arch Linux and we just install the ffmpeg-full package from the AUR.
Media Pipeline
Here's a simplified overview of how media goes through MediaCrush:
First, we try to identify what the uploaded file is, and then we can determine how it should be processed. We hand it off to the appropriate processor, which has an async and a sync step. The processing is considered "complete" and ready for user consumption when sync finishes, and errors in async fail silently, since the media is supposedly ready to go anyway.
The user isn't told that the file is done until "sync" finishes. This step is responsible for anything crucial to displaying the file in the browser - most of the video/audio encoding happens here. The "async" step is executed later and does things like optimizing PNG files and compressing GIFs.
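As a rough sketch of the split (the commands here are illustrative; optipng is just one example of a PNG optimizer, not necessarily what MediaCrush runs):

# sync: produce everything the browser needs in order to display the file
ffmpeg -i input.gif -vcodec libx264 -pix_fmt yuv420p output.mp4

# async: nice-to-have optimizations; a failure here is silently ignored
optipng -o5 output.png || true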
Format Detection
When a file is uploaded to MediaCrush, we don't make any assumptions about what it is. We ignore the user-supplied mimetype and file extension and instead make use of our tools to see what it is. We use ffprobe (part of ffmpeg) to examine media files and determine if ffmpeg can handle them. If ffprobe fails, we move on to ImageMagick's identify, and finally on to file (though it's worth noting that we don't actually care what file thinks, because we only let things through if ffprobe or ImageMagick can handle it).
If you want to do format detection for user-uploaded files, you'll want to flip through the man pages for ffprobe and ImageMagick identify to see how to use them. You might also want to read or use our own detection code.
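If you'd like a feel for how that cascade fits together, here's a minimal shell sketch (the file name and messages are placeholders):

# ffprobe exits non-zero when it can't make sense of the file
if ffprobe -v quiet upload.bin; then
    echo "audio/video: ffmpeg can handle it"
# identify also exits non-zero on files it can't parse
elif identify upload.bin > /dev/null 2>&1; then
    echo "image: ImageMagick can handle it"
else
    echo "unsupported: reject the upload"
fi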
Format Conversion
Once we know what a file is, we need to convert it to browser friendly formats. We take any input file (that falls within our supported formats) and convert it to one or more of the following:
- PNG, JPG, BMP (Images)
- SVG (Vectors)
- MP4, WEBM, OGV (Videos)
- MP3, OGG (Audio)
I should take a quick second to mention that GIF files uploaded to MediaCrush are treated as videos: they're re-encoded and served as HTML5 video.
We choose the appropriate processor and move the file along. We have a generic image, video, and audio processor, as well as a few extra processors that are designed for specific file types. You can read MediaCrush's processing code here.
Image Conversion
For any image that isn't already in one of those three formats, converting it with ImageMagick is very simple:
convert input.ext output.png
We just convert any image file to a PNG for browser display and offer a download to the original file.
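One caveat worth knowing about (the [0] suffix is standard ImageMagick syntax): if the input has multiple frames or pages (a multi-page TIFF, say), convert will write one output file per frame. Appending [0] to the input selects just the first frame:

convert 'input.ext[0]' output.png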
Video Encoding
This is the one you're probably waiting to hear about. You'll need to re-encode the video to three formats: mp4, webm, and ogv. You can probably get away without doing ogv, but we're still getting several hundred hits on our ogv files per month, so we're going to keep doing it for the time being. Web browsers are fickle little bastards and you can never be sure it'll work with mp4 and webm alone.
Specifically, we need the mp4 file to have h.264 and aac inside, the webm needs vp8 and vorbis, and the ogv file needs theora and vorbis. If that didn't make sense: video files are "containers" that have one or more video and audio streams inside, and the codec names above describe exactly what kind of streams we need in each file.
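If you're not sure what's inside a given file, ffprobe can list its streams; for example:

ffprobe -v error -show_entries stream=index,codec_type,codec_name -of compact input.ext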
Our videos also need to use yuv420p as the pixel format, and we need to use 8-bit libx264. This is important: 10-bit libx264 will not work. Additionally, h.264 does not allow for videos with an odd-numbered width or height, so our mp4 file is scaled to compensate for this. WebM works with odd sizes in compliant browsers; funnily enough, Chrome is not among them.
Note that you must re-encode any videos you want to use. If you already have, say, an mp4 file, it is probably not good to go as-is and you still need to process it; a quick way to check is shown below.
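This asks ffprobe for the first video stream's codec, profile, and pixel format (for an mp4 you're looking for h264 and yuv420p):

ffprobe -v error -select_streams v:0 -show_entries stream=codec_name,profile,pix_fmt -of default=noprint_wrappers=1 input.mp4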
Here's how you invoke ffmpeg for each video file (forgive me for the long commands):
mp4
ffmpeg -i input.ext -vcodec libx264 -pix_fmt yuv420p -profile:v baseline -preset slower -crf 18 -vf "scale=trunc(in_w/2)*2:trunc(in_h/2)*2" output.mp4
This takes input.ext and produces output.mp4. Here's what each option does:
- -vcodec libx264 specifies that we want to use libx264 as our video codec
- -pix_fmt yuv420p specifies the yuv420p pixel format, which works on more platforms than other formats
- -profile:v baseline uses the x264 'baseline' profile (makes Android happy)
- -preset slower tells ffmpeg to prefer slower encoding and better results over faster encoding time
- -crf 18 sets the constant rate factor to 18, which affects quality (lower numbers mean better quality)
- -vf "scale=..." uses the scale video filter to make sure that the video doesn't have an odd width/height
You'll want to modify the -crf value if you want to adjust the output quality. Further reading on the x264 encoding procedure is available on the ffmpeg wiki, but note that it's not geared towards HTML5 users.
webm
ffmpeg -i input.ext -c:v libvpx -c:a libvorbis -pix_fmt yuv420p -quality good -b:v 2M -crf 5 -vf "scale=trunc(in_w/2)*2:trunc(in_h/2)*2" output.webm
This will turn input.ext into output.webm.
- -c:v libvpx -c:a libvorbis specifies the vp8 and vorbis encodings
- -quality good can be good, best, or fast ('best' is not suggested)
- -b:v 2M sets the target bitrate to 2 Mbit/s; adjust this to fit your bandwidth requirements (it will affect quality)
- -crf 5 sets the constant rate factor, which works on a different scale than x264's
Modify -crf and -b:v to adjust the output quality. Further reading on webm and ffmpeg is available on the ffmpeg wiki.
We're experimenting with vp9; stay tuned to hear our thoughts on it later on.
ogv
ffmpeg -i input.ext -q 5 -pix_fmt yuv420p -acodec libvorbis -vcodec libtheora output.ogv
input.ext becomes output.ogv.
- -q 5 sets the desired quality
Further reading on theora/vorbis and ffmpeg is available on the ffmpeg wiki.
A note on resolutions
These commands will use the input resolution as the output resolution. If you want to encode several videos at different resolutions (like 480p/720p/1080p), you'll want to add -s right before the output video when you invoke ffmpeg. You can also drop -vf "scale=..." from the mp4 encoding if you do this.
Common sizes include:
- -s hd480
- -s hd720
- -s hd1080
- -s 4k
I suggest you make good use of ffprobe and don't encode the video any higher than the source resolution. You can also specify the width/height manually with something like -s 1920x1080.
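For example, a 720p variant of the mp4 encoding might look like this (note that -s sets an exact output size, so pick one that matches the source's aspect ratio):

ffmpeg -i input.ext -vcodec libx264 -pix_fmt yuv420p -profile:v baseline -preset slower -crf 18 -s hd720 output.mp4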
"Poster" image and thumbnail
The HTML5 video tag allows you to specify a "poster". This will take the first frame of the video:
ffmpeg -i input.ext -vframes 1 -map 0:v:0 poster.png
Apply the scale filter to get a reasonably sized thumbnail:
ffmpeg -i input.ext -vframes 1 -map 0:v:0 -vf "scale=100:100" thumbnail.png
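Note that scale=100:100 forces a square thumbnail; if you'd rather preserve the aspect ratio, let ffmpeg compute the height from the width:

ffmpeg -i input.ext -vframes 1 -map 0:v:0 -vf "scale=100:-1" thumbnail.png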
Subtitles
Note: This section was written some time after this article was first published. We did not feel comfortable offering advice on subtitles when the article was written and promised to update it later when we were more familiar with them.
You may extract a subtitle track from a video file like so:
ffmpeg -y -i INPUT -map 0:s:0 OUTPUT.{srt,ass}
This only extracts the first subtitle track from the file. A more sophisticated solution would be required to detect and extract only the default track, or perhaps to handle some more complex situations; a sketch for listing the available tracks follows the list below. When you do get a subtitle file you wish to use, you will find it in one of two formats:
- SRT (SubRip Subtitles)
- ASS (Advanced Sub-Station)
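As promised above, here's how to list the subtitle tracks a file contains before deciding which one to extract (-of compact just keeps the output to one line per stream):

ffprobe -v error -select_streams s -show_entries stream=index,codec_name:stream_tags=language -of compact INPUT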
SRT is very similar to VTT, the format the web supports natively as WebVTT. ASS is a much more versatile subtitle format, with support for more interesting subtitles and effects; VTT/SRT, on the other hand, is a glorified closed captioning format. MediaCrush has a procedure for converting SRT to VTT for use on the web, and the basic conversion is mostly mechanical: add a WEBVTT header and switch the decimal commas in the timestamps to periods.
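A minimal sed sketch of that conversion (it ignores edge cases like BOMs and styling tags, so treat it as a starting point):

(echo "WEBVTT"; echo; sed -E 's/([0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})/\1.\2/g' subtitles.srt) > subtitles.vtt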
To use VTT subtitles, be aware that only Chrome currently has native support for WebVTT. MediaCrush uses captionator.js to polyfill VTT subtitles in other browsers. Please browse captionator's documentation to learn how it may be used.
For Chrome, it's as simple as adding a track to your video tag:
<track src="/example.vtt" kind="subtitles" data-format="vtt" default />
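In context, that track sits alongside your sources (srclang and label aren't required, but they're good practice):

<video poster="poster.png" controls>
  <source src="video.mp4">
  <source src="video.webm">
  <track src="/example.vtt" kind="subtitles" srclang="en" label="English" default>
</video>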
For ASS subtitles, there is no native support. However, the advantages they present make it reasonable to wish for support. MediaCrush uses libjass to provide support for ASS subtitles. This is a bit too complex for this blog post, but I suggest you browse our code to learn more about how we applied libjass to our workflow. It's much more involved: we have to generate CSS files, extract fonts and subtitle tracks, and generally go to a lot more trouble to support ASS. Join us on IRC if you run into any problems and would like to hear more.
Audio Encoding
Audio is a little easier. Here, we've given ffmpeg the choice on quality, and it'll tend towards the highest quality possible. We need an mp3 file and an ogg file to support all browsers.
mp3
ffmpeg -i input.ext -acodec libmp3lame -q:a 0 -map 0:a:0 output.mp3
- -acodec libmp3lame chooses the LAME mp3 codec
- -q:a 0 sets the audio quality to zero, which will choose the best (for mp3, at least)
- -map 0:a:0 ensures that only the first audio stream and no additional streams are used
To expand on that last one: if input.ext were a video, that argument would ensure that only the audio track made it into the output file.
ogg
ffmpeg -i input.ext -acodec libvorbis -q:a 10 -map 0:a:0 output.ogg
- -q:a 10 chooses the vorbis quality, which works on a different scale than LAME's (for vorbis, higher is better, with 10 as the maximum)
HTML
Once you've encoded your videos/audio, you can render them like this:
<video poster="poster.png" controls>
  <source src="video.mp4">
  <source src="video.webm">
  <source src="video.ogv">
  <p>This is fallback content</p>
</video>
<audio controls>
  <source src="audio.mp3">
  <source src="audio.ogg">
  <p>This is fallback content</p>
</audio>
(Note that source is a void element, so it takes no closing tag.)
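One small improvement on the markup above (this is standard HTML5, not something the snippets above rely on): give each source a type attribute so browsers can skip formats they know they can't play without downloading them:

<source src="video.mp4" type="video/mp4">
<source src="video.webm" type="video/webm">
<source src="video.ogv" type="video/ogg">

The equivalent audio types are audio/mpeg for mp3 and audio/ogg for ogg.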
Be sure to read Mozilla's guide on using HTML5 video and audio in your web pages. Leave a comment here or join our IRC channel (#mediacrush on freenode) if you'd like to ask us any questions. You can also reach out through our subreddit, /r/MediaCrush. You're also welcome to browse our code for a working demo. And of course, if all of this seems like a pain in the ass, feel free to reference our developer docs to let us do the heavy lifting for you.
Additional Reading
Here are some more resources: