Understanding Video Streaming: An In-Depth Look at Technology
Overview of Video Streaming
This article provides a concise introduction to video streaming. While it strives for technical accuracy, some concepts are intentionally simplified for clarity. Here, video streaming means transmitting video content over the internet for playback in a web browser or an application on common devices such as Smart TVs, smartphones, computers, and gaming consoles. Traditional cable TV is outside the scope of this discussion.
Streaming services face a significant challenge: keeping viewers engaged regardless of their location or device. Viewer-behavior studies suggest that if a video stalls for more than a few seconds, nearly half of viewers will abandon it for another option. Streaming providers must therefore deliver a compelling quality of experience whether users are at home, on a crowded subway, or in a remote area.
To initiate playback of any online video, a device must first download a portion of the video, store it in its buffer, and decode at least one Group of Pictures (GOP). Decoding converts the compressed data in a video file back into the pixels that form each image. Encoding every pixel of every frame at 25 frames per second would produce enormous files, so similar frames are grouped into GOPs, allowing frames to be encoded relative to one another to reduce file size. To cut redundancy both within a single frame and across the frames of a GOP, some image details are simplified according to the target quality.
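A rough back-of-the-envelope calculation shows why this compression is unavoidable. The figures below (1080p, 25 fps, 8-bit RGB, a 5 Mbit/s compressed stream) are illustrative assumptions, not values from this article:

```python
# Rough size of one hour of *uncompressed* 1080p video at 25 fps.
width, height = 1920, 1080
bytes_per_pixel = 3            # 8-bit RGB, an illustrative assumption
fps = 25
seconds = 3600                 # one hour

raw_bytes = width * height * bytes_per_pixel * fps * seconds
print(f"Uncompressed: {raw_bytes / 1e9:.0f} GB per hour")        # ~560 GB

# A typical compressed 1080p stream is on the order of 5 Mbit/s.
compressed_bytes = 5e6 / 8 * seconds
print(f"Compressed:   {compressed_bytes / 1e9:.2f} GB per hour") # ~2.25 GB
```

Two orders of magnitude of savings come precisely from the GOP-based prediction and detail simplification described above.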
Video Frame Types
- I-Frames (Intra-coded frames): These frames contain all the necessary information to display a full image, similar to a JPEG file, and serve as reference points in the video stream.
- P-Frames (Predictive frames): These frames are encoded based on differences between the current frame and the last I-Frame or P-Frame.
- B-Frames (Bidirectional frames): These frames derive their encoding from the differences between the previous and next I or P frame.
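Because a B-frame needs its *next* reference frame before it can be decoded, frames are stored and transmitted in decode order rather than display order. A minimal sketch of that reordering (the GOP pattern here is illustrative):

```python
# Display order of a small GOP: B-frames sit between their references.
display_order = ["I", "B", "B", "P", "B", "B", "P"]

# Decode order: each reference (I/P) must be decoded before the
# B-frames that point at it, so references are pulled forward.
decode_order = []
pending_b = []
for frame in display_order:
    if frame == "B":
        pending_b.append(frame)      # wait until the next reference arrives
    else:
        decode_order.append(frame)   # decode the reference first
        decode_order.extend(pending_b)
        pending_b.clear()
decode_order.extend(pending_b)

print("display:", display_order)  # ['I', 'B', 'B', 'P', 'B', 'B', 'P']
print("decode: ", decode_order)   # ['I', 'P', 'B', 'B', 'P', 'B', 'B']
```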
Some essential technical factors that influence user engagement include:
- Startup time: The duration it takes for the first visuals and audio of the video to appear after the user initiates playback (e.g., clicking play). Shorter is better.
- Rebuffering rate: The frequency and duration of video interruptions during playback. Lower is better.
- Video quality: Subpar video quality can make the content unenjoyable or unwatchable. Higher quality is preferable.
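As an illustration, these metrics could be computed from a simple playback event log. This is a sketch with made-up event names, not a real player API:

```python
# Hypothetical playback events: (timestamp_seconds, event_name).
events = [
    (0.0, "play_requested"),
    (1.8, "first_frame"),       # startup time = 1.8 s
    (40.0, "stall_start"),
    (43.5, "stall_end"),        # one 3.5 s rebuffering event
    (120.0, "session_end"),
]

startup = next(t for t, e in events if e == "first_frame") - events[0][0]

stalls = [t for t, e in events if e.startswith("stall")]
stall_time = sum(end - start for start, end in zip(stalls[::2], stalls[1::2]))

session = events[-1][0] - events[0][0]
print(f"startup time:      {startup:.1f} s")
print(f"rebuffering ratio: {100 * stall_time / session:.1f} % of the session")
```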
Optimizing Key Factors
- Startup Time: This is driven by the time needed to download the initial portion of the video into the device's buffer. It can be improved by reducing the video's bitrate (the amount of data per second of video, i.e., file size divided by duration) and by downloading less data upfront before playback begins.
- Rebuffering Rate: This occurs when internet speed cannot keep pace with the video bitrate. It can be improved by lowering the video's bitrate and increasing the amount of video downloaded initially.
- Video Quality: Reducing the bitrate inevitably simplifies some image details and may lower perceived video quality. To enhance quality, the bitrate must be increased (the small simulation after this list makes these trade-offs concrete).
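A tiny model shows the tension: the buffer fills at the ratio of bandwidth to bitrate, so a lower bitrate shrinks startup time, while a bitrate above the available bandwidth drains the buffer and causes rebuffering. The numbers below are illustrative assumptions:

```python
def startup_time(initial_buffer_s, bitrate_bps, bandwidth_bps):
    """Time to download the initial buffer before playback can begin."""
    return initial_buffer_s * bitrate_bps / bandwidth_bps

def buffer_drain_rate(bitrate_bps, bandwidth_bps):
    """Seconds of buffer lost per second of playback (negative = filling)."""
    return 1 - bandwidth_bps / bitrate_bps

bandwidth = 4e6  # a 4 Mbit/s connection, assumed for illustration
for bitrate in (2e6, 4e6, 6e6):
    print(f"{bitrate/1e6:.0f} Mbit/s stream: "
          f"startup {startup_time(4, bitrate, bandwidth):.1f} s, "
          f"buffer drains {buffer_drain_rate(bitrate, bandwidth):+.2f} s/s")
```

At 2 Mbit/s the buffer fills faster than it drains; at 6 Mbit/s every second of playback costs a third of a second of buffer, and a stall is only a matter of time.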
Ultimately, the goal is to find a balance between minimizing file size for optimal startup time and rebuffering while maintaining acceptable quality. This balance is crucial across various scenarios.
Scenarios for Streaming
We can classify real-world conditions into three main categories:
- A: High and stable internet speed (e.g., a Smart TV connected directly to a home internet box).
- B: Low and stable internet speed (e.g., streaming on a smartphone while camping).
- C: Unstable internet speed (e.g., watching a video on a smartphone while on a bus).
To enhance engagement in each scenario, the client application must adapt accordingly:
- In Situation A, the internet speed accommodates high-quality content, allowing for minimal initial buffering.
- In Situation B, if the video's bitrate is low enough, rebuffering is unlikely, but quality may suffer.
- In Situation C, the client must continuously adjust to provide optimal quality when the connection is stable and reduce quality when the connection deteriorates.
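This continuous adjustment is known as adaptive bitrate streaming (ABR). A minimal throughput-based heuristic might look like the sketch below; real players such as hls.js or dash.js use considerably more sophisticated logic, and the bitrate ladder here is made up:

```python
# Available renditions, in bits per second (an illustrative ladder).
LADDER = [400_000, 800_000, 1_600_000, 3_000_000, 6_000_000]

def pick_bitrate(measured_throughput_bps, safety_factor=0.8):
    """Pick the highest rendition that fits under a safety margin.

    The safety factor leaves headroom for throughput fluctuations,
    which matters most in the unstable-network scenario C.
    """
    budget = measured_throughput_bps * safety_factor
    candidates = [b for b in LADDER if b <= budget]
    return max(candidates) if candidates else LADDER[0]

print(pick_bitrate(5_000_000))  # stable Wi-Fi  -> 3_000_000
print(pick_bitrate(900_000))    # congested 4G  -> 400_000
```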
With a standard, monolithic video file, a client typically has to download everything up to the playback point, which makes switching quality on the fly impractical. The video is therefore segmented into smaller chunks (usually around 4 seconds each) so that clients can adjust quality as conditions vary. Each segment must be self-sufficient, meaning it begins with an I-Frame, so that playback and quality switches can happen at any segment boundary.
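As a concrete illustration, ffmpeg can produce such segments along with the playlist that lists them. This is a sketch; the file names and the 4-second target duration are assumptions:

```sh
# Split input.mp4 into ~4 s TS segments plus an HLS playlist (VOD).
ffmpeg -i input.mp4 -c:v libx264 -c:a aac \
  -hls_time 4 -hls_playlist_type vod \
  -hls_segment_filename 'segment%03d.ts' index.m3u8
```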
To provide clients with different quality options, the video must be encoded multiple times at various resolutions (typically 5 to 10). The client application requires a manifest file that references all available segments for playback.
Standardization in Streaming
To tackle these challenges, Apple introduced HLS (HTTP Live Streaming) in 2009, while MPEG-DASH was standardized in 2012. HLS describes how to package videos of varying qualities into small TS (Transport Stream) segments listed in Variant Playlists (text files with a .m3u8 extension). These Variant Playlists are in turn referenced by a Multivariant Playlist (historically called a Master Playlist), which the player uses to select the appropriate video quality for current conditions. In a Variant Playlist, each segment URL is preceded by its duration, which helps the client map segments to positions in the timeline.
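For illustration, a minimal Multivariant Playlist and one of its Variant Playlists might look like this (URLs, bitrates, and resolutions are made up):

```
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
360p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2800000,RESOLUTION=1280x720
720p/index.m3u8
```

Each referenced Variant Playlist then lists the actual segments with their durations:

```
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:4
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:4.000,
segment000.ts
#EXTINF:4.000,
segment001.ts
#EXT-X-ENDLIST
```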
DASH, on the other hand, describes how to divide video into small MP4 files without referencing each segment individually. Instead, DASH employs URL templates: the client substitutes placeholders in the template to build each segment's URL. All of this is described in a single XML file called the Manifest (formally, the Media Presentation Description, or MPD), which carries explicit duration information so the client can easily compute which segment it needs.
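A stripped-down MPD using such a template might look like this; the identifiers, durations, and bitrates are illustrative:

```xml
<MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="static"
     mediaPresentationDuration="PT2M" minBufferTime="PT4S">
  <Period>
    <AdaptationSet mimeType="video/mp4" contentType="video">
      <!-- One template covers every segment of every representation. -->
      <SegmentTemplate initialization="init-$RepresentationID$.mp4"
                       media="chunk-$RepresentationID$-$Number$.mp4"
                       duration="4" startNumber="1"/>
      <Representation id="360p" bandwidth="800000"  width="640"  height="360"/>
      <Representation id="720p" bandwidth="2800000" width="1280" height="720"/>
    </AdaptationSet>
  </Period>
</MPD>
```

To fetch segment 7 of the 720p rendition, the player simply expands the template to chunk-720p-7.mp4; no per-segment listing is needed.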
HLS and DASH are currently the leading standards in the industry, alongside Microsoft Smooth Streaming. Given the variety of video and audio compression methods, subtitle formats, and seeking thumbnails in use, developing a client solution or video encoder that supports all combinations can be complex.
Live Streaming vs. VOD
In the past, Flash allowed RTMP live streams to play directly in browsers with near real-time latency, but it offered little ability to buffer ahead, pause, or switch qualities. With the decline of Flash, client-side RTMP disappeared as well, and playback moved to the HTML5 video element. Today, live streams work much like VOD streams: clients can switch quality on demand and buffer ahead. The main difference is that new segments are continuously appended to the Manifest or Variant Playlists, and players must start close to the latest segment while buffering the most recent additions.
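In HLS, for instance, a live Variant Playlist simply omits the end marker and keeps growing, so the player re-fetches it periodically to discover new segments. A sketch with made-up names and sequence numbers:

```
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:4
#EXT-X-MEDIA-SEQUENCE:1042
#EXTINF:4.000,
segment1042.ts
#EXTINF:4.000,
segment1043.ts
# No #EXT-X-ENDLIST tag: the stream is still live, the media sequence
# number increments as old segments drop off the front of the playlist.
```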
With adaptive streaming, latency has typically increased to around 10 seconds or more. However, recent innovations in low-latency streaming aim to reduce this back to just a few seconds, presenting new challenges at various stages of the streaming process.
Conclusion
Traditional broadcast TV and radio, which serve one stream to numerous devices, are declining in favor of on-demand content delivery, where every video segment must be sent to each individual device. This evolution has fostered new usage patterns, a burgeoning industry, and a host of challenges the sector must navigate. This introduction to modern video streaming focuses on client experiences but highlights that the compression, packaging, and rapid delivery of small video segments to millions remain some of the industry's biggest hurdles.
The first video, titled "Pearl Overview: Simplified Capture, Streaming, and Recording," offers insights into how video streaming technology is designed to enhance user experiences.
The second video, "What is Streaming? An Introduction," provides foundational knowledge about the streaming process and its significance in today's digital landscape.