Veep
Sketches for a new framework for timing video.
Case 1: Retiming
The true time of most speedruns submitted to a leaderboard is the duration in their videos between a start and an end reference frame, which manual timers can only approximate. Getting an accurate duration between two frames is therefore a fundamental problem in speedrunning.
How Videos Keep Time
Modern video containers like MP4 and Matroska do not conform to the common intuition of a sequence of equally-spaced frames – they support variable frame-rates (VFR). This has its uses, particularly in streaming: if frames have to be produced in real time, it makes sense to let the encoder submit them with uneven spacing, or not at all when it gets overloaded (instead of duplicating the previous frame and adding redundant data). However, most classic software, built in the 90s when AVI was the kingpin container, shits the bed when given VFR videos and tries really hard to quantise them to a constant frame-rate (CFR) somehow. You’ve probably seen the random numbers that Windows reports for frame-rate (in fact, that’s the average frame-rate).
What modern containers actually do is give every frame its own timestamp, measured in seconds and written out to 6 decimal places. Some notions of CFR remain – the containers have a tick frequency, usually 90kHz for MP4, to which they snap frame timestamps. Those frames are produced by a codec (like x264/AVC), which targets its own tick frequency based on the specified frame-rate. Video players and tools have heuristics, like taking the minimum duration between successive timestamps, from which they can guess the frame-rate of a CFR video, import it into a timeline for a video editor, and so on. But there’s no true notion of CFR anymore.
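To make that concrete, here’s a minimal sketch using PyAV (the `av` FFmpeg binding): each frame carries an integer `pts` counted in ticks of the stream’s rational `time_base`, and multiplying the two gives the timestamp in seconds with no rounding. The file name is a placeholder.

```python
# Minimal sketch: list every video frame's presentation time as an exact
# fraction of seconds (integer pts ticks x the stream's rational time_base).
from fractions import Fraction

import av  # PyAV: pip install av


def frame_timestamps(path: str) -> list[Fraction]:
    timestamps = []
    with av.open(path) as container:
        stream = container.streams.video[0]
        for frame in container.decode(stream):
            # pts is an integer tick count; time_base is e.g. 1/90000 for MP4.
            timestamps.append(frame.pts * stream.time_base)
    return timestamps


if __name__ == "__main__":
    for t in frame_timestamps("run.mp4")[:10]:  # "run.mp4" is a placeholder
        print(t, float(t))
```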
Speedrunners typically use tools like AviUtl and VirtualDub, which both read in frames from a modern container, all specified with timestamps, and try to force them onto a CFR grid with integer frame numbers. Rounding error can then cause two frames to hit the same integer, so one gets dropped; or the average frame-rate gets applied to every frame, and the video sporadically slows down and then snaps instantly through frame-drops instead of pausing. Videos made with OBS 26 or later truncate timestamps to 3 decimal places and liberally skip frames that were dropped or duplicated, rather than trying to make a CFR video, so you really have to come at it with an anything-can-happen assumption.
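As a toy illustration (not taken from any particular tool), here is what that rounding collision looks like when made-up uneven timestamps are forced onto a guessed 30 fps grid:

```python
# Made-up VFR timestamps (seconds) forced onto a guessed 30 fps CFR grid.
timestamps = [0.000, 0.033, 0.067, 0.130, 0.145, 0.200]
guessed_fps = 30

frame_numbers = [round(t * guessed_fps) for t in timestamps]
print(frame_numbers)  # [0, 1, 2, 4, 4, 6]: two frames collide on 4, so one
                      # gets dropped, and frame numbers 3 and 5 are left empty.
```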
Modern Solutions
Editing VFR videos is impractically difficult, because the entire notion of a grid is thrown out. To time things in a video, though, we don’t need editing ability. So, we want to recreate a frame-viewer application like VirtualDub, but only the viewing part, and have it accurately display timestamps, not frame numbers. The rest of the behaviour is largely the same, just without quantisation:
- Parse the metadata of all frames first, so that there’s an array of (unevenly spaced) frames to navigate; see the sketch after this list.
- ← and → are guaranteed to hit the previous/next frame.
- Frames are placed on the seeking bar according to timestamp, and clicking it snaps to the nearest one.
- Decoding the video to display a frame is just-in-time. This means unravelling a tree of frames specified as deltas from other frames, right down to keyframes (which are encoded without dependency on other frames).
- Units of absolute times + durations are full-precision decimals in seconds, as written in the video.
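Here’s a minimal sketch of the first and fourth points, again assuming PyAV; `FrameViewer` and its method names are made up for illustration. The index comes from demuxing packets (no pixel decoding), and displaying a frame means seeking back to the preceding keyframe and decoding forward through the delta frames:

```python
# Sketch of a timestamp-first frame viewer: index exact timestamps up front,
# decode just-in-time from the nearest preceding keyframe.
from fractions import Fraction

import av  # PyAV: pip install av


class FrameViewer:
    def __init__(self, path: str):
        self.container = av.open(path)
        self.stream = self.container.streams.video[0]
        # Index every frame's presentation timestamp by demuxing packets only.
        # Sorted, because B-frames arrive out of presentation order.
        self.pts_index = sorted(
            packet.pts
            for packet in self.container.demux(self.stream)
            if packet.pts is not None
        )

    def timestamp(self, index: int) -> Fraction:
        """Exact time of frame `index`, in seconds."""
        return self.pts_index[index] * self.stream.time_base

    def frame_at(self, index: int):
        """Decode and return the frame at `index`, just-in-time."""
        target_pts = self.pts_index[index]
        # Seek lands on the keyframe at or before target_pts; decode forward
        # through the dependent delta frames until the target is reached.
        self.container.seek(target_pts, stream=self.stream)
        for frame in self.container.decode(self.stream):
            if frame.pts >= target_pts:
                return frame
        raise IndexError(index)
```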
Case 2: Live Timing
With streaming protocols like RTMP, we can process videos as they arrive, and glean a lot of the timings that speedrunners look for, either live or retroactively. These are split timings, identified by reference frames, or by the first or last frame of a sequence of them.
We can’t identify frames live by hand, so we need image recognition algorithms. Take as a starting point one that identifies black or white frames (using, say, an Lp-norm of RGB values across pixels). The only state we need to identify frames (in the simplest implementation) is the classification of the previous frame. Whenever the frame changes between black/white/neither, we can emit the frame’s timestamp and its classification, and process that with a state machine that calculates the durations of black or white segments – which speedruns are full of. Simple calculations from there can approximately time levels and replace the splits people typically use.
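A minimal sketch of that classifier and its single piece of state, assuming PyAV frames and numpy; the black/white thresholds are made-up values, and mean intensity stands in for whichever Lp-norm you prefer:

```python
# Classify each frame as black/white/neither and emit an event only on change.
import numpy as np

BLACK_MAX = 16    # assumed: mean 8-bit intensity at or below this is "black"
WHITE_MIN = 239   # assumed: mean 8-bit intensity at or above this is "white"


def classify(frame) -> str:
    """Classify a decoded video frame as 'black', 'white', or 'neither'."""
    rgb = frame.to_ndarray(format="rgb24")  # H x W x 3, uint8
    level = float(np.mean(rgb))             # average over pixels and channels
    if level <= BLACK_MAX:
        return "black"
    if level >= WHITE_MIN:
        return "white"
    return "neither"


def change_events(frames):
    """Yield (timestamp_seconds, label) whenever the classification changes."""
    previous = None                          # the only state we carry
    for frame in frames:
        label = classify(frame)
        if label != previous:
            yield frame.pts * frame.time_base, label
            previous = label
```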
The only synchronisation requirement is order-preservation. Unlike most models of manual splitting and autosplitters, we don’t care when the state changes are identified; rather, we figure out when they actually happened according to timestamps from the true source, the video that’s being encoded as we go. Hence, our live timing perfectly agrees with any video retiming, and can be presented to the user as a stream of data that doesn’t have to be perfectly synced to the live video, since it’ll always be right, whenever it appears.
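Continuing the sketch, the downstream state machine only needs the ordered event stream; every duration it reports comes from the video’s own timestamps, so it doesn’t matter how late the events are computed or displayed:

```python
def segment_durations(events):
    """Consume ordered (timestamp, label) events; yield (label, start, duration)
    for each black or white segment once its end is known."""
    current_label, start = None, None
    for timestamp, label in events:
        if current_label in ("black", "white"):
            yield current_label, start, timestamp - start
        current_label, start = label, timestamp
```

Feeding it from a file or from a live ingest looks the same, e.g. `segment_durations(change_events(container.decode(video=0)))` with the PyAV container from earlier.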
20XX
When it comes to recognising images, we can replace the human verifier with reference images and basic algorithms (like the Lp-norm). That’s the basis for today’s visual autosplitters. But as anyone will tell you, to do this accurately and generally, you want a neural network. Either way, if we have an interface for plugging in image recognition algorithms that are specific to particular reference frames from particular games, then we can fully replace manual timing and have a consensus on every timing in a video-based speedrun. We can, of course, still choose not to base timing on video at all, as with the memory-based autosplitters typically used for PC games.
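As a sketch of that plug-in point (the names here are illustrative, not an existing interface), anything from the mean-intensity classifier above to a game-specific neural network can sit behind the same one-method protocol:

```python
from typing import Optional, Protocol


class FrameRecogniser(Protocol):
    """A game-specific recogniser: given a decoded frame, name the reference
    frame it matches (e.g. 'level_start'), or return None for no match."""

    def recognise(self, frame) -> Optional[str]: ...


class BlackWhiteRecogniser:
    """Adapter around the mean-intensity classifier sketched earlier."""

    def recognise(self, frame) -> Optional[str]:
        label = classify(frame)  # from the earlier sketch
        return None if label == "neither" else label
```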
A live algorithm of this sort can of course be run on a video file as well. It can at any point dump the frames that it identified, as well as their immediate predecessors, for a human to check that it hit the right frames and the timing is accurate.