
Warcit Video Audio Conversion

Ilya Kreymer edited this page Mar 4, 2019 · 1 revision

With version 0.4.0, warcit introduces a new workflow for converting video/audio files into web-friendly formats and then placing them into WARCs, along with transclusion metadata that enables access from within a containing page.

To allow for maximum flexibility, this process is split into two phases: conversion and transclusion WARC creation.

Media Conversion

warcit includes a standalone conversion utility, warcit-converter, which can be used to batch convert files in a directory structure based on extension/regex matching.

The conversion process can be run separately from WARC creation and outputs converted files into a separate directory, recreating the same directory structure.

For example, given a directory structure:

- data/
    - media/
         - video_file.flv
         - an_audio_file.ra

Running:

warcit-converter http://example.com/ ./data/

with the default rules will write the converted files into the ./conversions directory (the default output location).

The full url of each file, as with warcit, is created by prepending the prefix to the path in the directory.

The input media in this example would have full urls of http://example.com/media/video_file.flv and http://example.com/media/an_audio_file.ra. The converted files simply have an additional extension appended to the full url, such as http://example.com/media/video_file.flv.mp4 and http://example.com/media/an_audio_file.ra.webm.

- data/
    - media/
         - video_file.flv
         - an_audio_file.ra
- conversions/
    - warcit-conversion-results.yaml
    - media/
         - video_file.mp4
         - video_file.webm
         - video_file.mkv
         - an_audio_file.mp3
         - an_audio_file.ogg
         - an_audio_file.opus
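
The url mapping described above can be sketched in a few lines of Python. This is only an illustration of the stated rule (prefix + path within the directory, plus an extra extension for converted files), not warcit's actual code; the helper names source_url and converted_url are hypothetical.

```python
from pathlib import PurePosixPath


def source_url(prefix, root, file_path):
    """Prepend the url prefix to the file's path relative to the input root."""
    rel = PurePosixPath(file_path).relative_to(PurePosixPath(root))
    return prefix.rstrip("/") + "/" + rel.as_posix()


def converted_url(url, ext):
    """Append the output extension to the url of the source file."""
    return url + "." + ext.lstrip(".")


# Hypothetical helpers applied to the example directory from this page.
url = source_url("http://example.com/", "./data", "./data/media/video_file.flv")
print(url)
print(converted_url(url, "mp4"))
```

Because the converted url only ever adds a suffix, the url of the original file can always be recovered from the url of a converted file.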
 

The results of each conversion are written into warcit-conversion-results.yaml. This file can then be used to analyze the results of the conversion, and to inform the transclusion metadata workflow.

Conversion Rules

The default rule set currently specifies conversions for .flv, .mp4, and RealMedia formats into several standardized formats, using ffmpeg.

The current output formats are two web-focused formats and a preservation format:

  • .webm -- VP9 + Opus encoded video + audio, an open format for the web
  • .mp4 -- H.264 + AAC encoded video + audio, primarily for Safari and Apple based platforms.
  • .mkv -- FFV1 codec in a Matroska container.

(For audio-only content, .webm, .mp3 and .flac are used instead.)

The first two formats are designed to be used for the web in <video> (or <audio>) tags.

The FFV1 format is a recommended preservation format and not designed to be shown in the browser.
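
As a rough illustration of what such conversions involve, the sketch below builds ffmpeg command lines for the three video outputs listed above. The codec arguments are an assumption based on common ffmpeg encoder names (libvpx-vp9, libopus, libx264, aac, ffv1, flac); the default warcit rules may use different options, and the CODEC_ARGS table and ffmpeg_command helper are hypothetical. The commands are only printed, not executed.

```python
import os
import shlex

# Assumed codec arguments per output extension (not taken from warcit's rules).
CODEC_ARGS = {
    ".webm": ["-c:v", "libvpx-vp9", "-c:a", "libopus"],  # open web format
    ".mp4": ["-c:v", "libx264", "-c:a", "aac"],          # Safari / Apple platforms
    ".mkv": ["-c:v", "ffv1", "-c:a", "flac"],            # lossless preservation copy
}


def ffmpeg_command(src, dst):
    """Build the argument list for one conversion, chosen by dst's extension."""
    ext = os.path.splitext(dst)[1]
    return ["ffmpeg", "-i", src, *CODEC_ARGS[ext], dst]


for ext in CODEC_ARGS:
    cmd = ffmpeg_command("data/media/video_file.flv",
                         "conversions/media/video_file" + ext)
    print(shlex.join(cmd))
```

Each argument list could be passed to subprocess.run to perform the conversion, provided ffmpeg is installed with the relevant encoders.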

It is also possible to specify a custom rules YAML file via warcit-converter --rules custom-rules.yaml ...

WARC Conversion Record Creation

warcit includes the capability to write converted files as WARC conversion records with a reference to the original file that was the source of the conversion.

To create conversion records along with the resource records, simply pass the conversion results output as a parameter to warcit:

warcit --conversions ./conversions/warcit-conversion-results.yaml http://example.com/ ./data/ -o output.warc.gz

The resulting WARC will contain the original urls, e.g. http://example.com/media/video_file.flv and http://example.com/media/an_audio_file.ra, as resource records, as well as all of the converted files, e.g. http://example.com/media/video_file.flv.mp4 and http://example.com/media/an_audio_file.ra.mp3, as conversion records. The conversion records will refer to the record ids and urls + timestamps of the original resource records.
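
The relationship between a conversion record and its source can be pictured as a set of WARC headers. The sketch below is an assumption about how that link might be expressed: WARC-Refers-To is the field the WARC specification defines for pointing a conversion record at its source, while the use of WARC-Refers-To-Target-URI and WARC-Refers-To-Date for the "urls + timestamps" is a guess, since this page does not name the exact headers. The record id and date are made-up sample values, and conversion_headers is a hypothetical helper.

```python
def conversion_headers(converted_url, original):
    """Build headers for a conversion record that refers back to its source."""
    return {
        "WARC-Type": "conversion",
        "WARC-Target-URI": converted_url,
        # Link to the original resource record by id, url and timestamp.
        "WARC-Refers-To": original["WARC-Record-ID"],
        "WARC-Refers-To-Target-URI": original["WARC-Target-URI"],
        "WARC-Refers-To-Date": original["WARC-Date"],
    }


# Sample (made-up) headers of the original resource record.
original = {
    "WARC-Type": "resource",
    "WARC-Record-ID": "<urn:uuid:00000000-0000-0000-0000-000000000001>",
    "WARC-Target-URI": "http://example.com/media/video_file.flv",
    "WARC-Date": "2019-03-04T00:00:00Z",
}

headers = conversion_headers("http://example.com/media/video_file.flv.mp4", original)
for name, value in headers.items():
    print(f"{name}: {value}")
```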

Transclusion Manifest and Metadata

The above procedure allows for converting files in batch and adding them as WARC conversion records. However, it is often useful to reference "transcluded" video and audio from another page, which embeds/transcludes the content.

The information on which resources are transcluded from which pages is not possible to deduce from the media itself, and so must be provided as an additional input to warcit.

warcit supports a transclusion manifest YAML file, which can map resources to their containing/transcluding pages. A manifest might look as follows:

transclusions.yaml:

transclusions:
  http://example.com/media/video_file.flv:
    url: http://example.com/watch_video.html
    timestamp: 20160102
    selector: 'object, embed'

  http://example.com/media/an_audio_file.ra:
    url: http://example.com/sample_audio.html
    timestamp: 20170102
    selector: 'a[id="#play"]'

Given this input, running warcit --transclusions transclusions.yaml will generate a reverse index: one metadata record per containing page, pointing to the media files transcluded from that page. For the above example, two metadata records will be created, with target uris metadata://example.com/watch_video.html and metadata://example.com/sample_audio.html.

The metadata records will simply point to the transclusions, http://example.com/media/video_file.flv and http://example.com/media/an_audio_file.ra respectively.
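
The reverse index can be illustrated with a short Python sketch. The manifest is written inline as a dict that mirrors the YAML above, and the reverse_index helper is hypothetical; warcit's real implementation and the exact contents of its metadata records are not shown on this page.

```python
from collections import defaultdict

# The transclusion manifest from above, as it would look once parsed from YAML.
manifest = {
    "transclusions": {
        "http://example.com/media/video_file.flv": {
            "url": "http://example.com/watch_video.html",
            "timestamp": "20160102",
            "selector": "object, embed",
        },
        "http://example.com/media/an_audio_file.ra": {
            "url": "http://example.com/sample_audio.html",
            "timestamp": "20170102",
            "selector": 'a[id="#play"]',
        },
    }
}


def reverse_index(manifest):
    """Map each containing page's metadata:// uri to the media it transcludes."""
    index = defaultdict(list)
    for media_url, page in manifest["transclusions"].items():
        # Swap the page url's scheme for metadata:// to form the target uri.
        target = "metadata://" + page["url"].split("://", 1)[1]
        index[target].append(media_url)
    return dict(index)


index = reverse_index(manifest)
for target, media in index.items():
    print(target, "->", media)
```

In this sketch the index lists only the original media urls, since the manifest alone carries no information about converted files.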

However, when combining the transclusion manifest with conversion results, all of the conversion records will also be added as metadata.

warcit might then be run as follows:

warcit --transclusions transclusions.yaml --conversions ./conversions/warcit-conversion-results.yaml http://example.com/ ./data/ -o output.warc.gz

Note that the transclusion manifest should only contain the urls of the original files, not the converted records, as the converted records will be added automatically.