Appendix A: H.264 Video Codecs and UCConfig Modes


Overview:

This paper provides a background on H.264 Video Codecs and UCConfig Modes and how they are used by Microsoft® Skype® for Business 2015 (Lync® 2013). It is Appendix A of a series that specifically looks at Microsoft Skype for Business 2015 (Lync 2013), the challenges that must be overcome, and the existing solutions for integrating Skype for Business 2015 with H.323 or SIP standards compliant videoconferencing systems. Hence, it will focus on the communications used in A/V Conferencing and Application Sharing.

We will provide a background on how the H.26X standards have evolved since the original ITU-T developed H.261 standard, show how H.264 SVC compares to H.264 AVC, and introduce the UCConfig Modes used by Microsoft and Polycom. This will form the basis for understanding the challenges that must be overcome when integrating with H.323 or SIP based systems.

Within these papers the terms Lync, Skype, Skype for Business and SfB, unless stated otherwise, all refer to Skype for Business Server 2015. The paper is specifically based on Skype for Business 2015. Whilst Lync Server 2013 has now been renamed Skype for Business Server 2015, Skype for Business 2015 is generally backwards compatible with Lync Server 2013.

It is recommended that you look at all the papers in this series for a background into Skype for Business and a detailed explanation of the Codecs, Protocols, Procedures and some of the available solutions.

History:

Microsoft Lync is an evolutionary product for Unified Communications (UC). The initial product, Live Communications Server 2003, was only an Instant Messaging (IM) server. This then evolved through several iterations of Live Communications Server to Office Communications Server and then to Lync Server 2010, when a PBX replacement function was added. It evolved further still to Lync Server 2013, which added much more, including video conferencing, web and audio conferencing, softphone functionality and PBX replacement and/or integration. Now, Microsoft have renamed Lync to Skype for Business.

H.26X Video Codecs:

We now know that for native integration, a third-party endpoint must share a common video codec with SfB. So let's look at some of the well-known video codecs used by popular videoconferencing endpoints.

H.261 was the original ITU-T developed standard used in video conferencing. It was followed by H.263 in 1995. After this, the ITU-T Video Coding Experts Group (VCEG) started work on a standard that would significantly outperform H.263, with more features and support for higher quality at low bitrates.

In 2001, the ISO Moving Picture Experts Group (MPEG) recognised the potential of this ITU-T development and formed the Joint Video Team (JVT), which included people from both MPEG and VCEG. The result is two technically identical standards, ISO MPEG-4 Part 10 and ITU-T H.264, with the official name Advanced Video Coding (AVC).

The H.261, H.263 and H.264 algorithms are all designed to use motion prediction as well as lossy compression techniques to further reduce the amount of information to be transmitted. Whilst H.261 and H.263 images are largely limited to CIF and QCIF sizes, H.264 can support full 1080p high-definition images and graphics at WXGA resolution when used in H.239 data streaming.

The basic technique of motion prediction works by sending a full frame followed by a sequence of frames that only contain the parts of the image that have changed. Full frames are also known as 'key frames' or 'I-frames' and the predicted frames are known as 'P-frames'. Since a lost or dropped frame can leave the sequence of frames sent after it undecodable, new 'I-frames' are sent after a predetermined number of 'P-frames'. It is the combination of lossy compression and motion prediction that allows H.261, H.263 and H.264 systems to achieve the required reduction in data whilst still providing an acceptable image quality.
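
As a purely illustrative sketch (not any real codec's structure or API), the following Python fragment mimics the I-frame/P-frame pattern described above, using short lists of numbers in place of real image data and a made-up group-of-pictures length:

    GOP_LENGTH = 4  # send a fresh I-frame every 4th frame (hypothetical value)

    def encode(frames):
        """Turn raw frames into ('I', full frame) and ('P', delta) packets."""
        stream, previous = [], None
        for index, frame in enumerate(frames):
            if previous is None or index % GOP_LENGTH == 0:
                stream.append(('I', list(frame)))      # key frame: complete picture
            else:
                delta = [new - old for new, old in zip(frame, previous)]
                stream.append(('P', delta))            # predicted frame: changes only
            previous = frame
        return stream

    def decode(stream):
        """Rebuild frames; a lost packet corrupts everything up to the next I-frame."""
        frames, current = [], None
        for kind, payload in stream:
            if kind == 'I':
                current = list(payload)                # resynchronise on a key frame
            else:
                current = [c + d for c, d in zip(current, payload)]
            frames.append(list(current))
        return frames

    raw = [[i, i + 1, i + 2] for i in range(8)]        # eight tiny 'frames'
    assert decode(encode(raw)) == raw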

There is little functional difference between the elements of H.264 and those of the earlier H.261 and H.263 standards. The changes that do make the difference lie mainly in the detail within each element, how well the algorithm is implemented and whether it is performed in hardware or software.

H.264 AVC (Advanced Video Coding):

H.264 was organised into four profiles: Baseline, Extended, Main and High. Baseline is the simplest; it uses 4:2:0 chrominance sampling and splits the picture into 4x4 pixel blocks, processing each block separately. Baseline uses Universal Variable Length Coding (UVLC) and Context Adaptive Variable Length Coding (CAVLC) techniques, which are comparatively inefficient and therefore demand more network bandwidth. Virtually all vendors support H.264 Baseline and some are now also supporting H.264 High Profile.

H.264 High Profile is the most powerful and efficient. This is achieved by using Context Adaptive Binary Arithmetic Coding (CABAC). High Profile also uses adaptive transformations to decide 'on-the-fly' how to split the picture into blocks of 4x4 or 8x8 pixels. Areas of the picture with little detail use 8x8 blocks whilst more complex and detailed areas use 4x4 blocks.
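
The adaptive transform decision can be pictured roughly with the Python sketch below. It is a simplification under assumed values; the variance measure and threshold stand in for the encoder's real rate-distortion decision and are not taken from the H.264 specification:

    def variance(block):
        """Crude measure of how much detail a block of pixel values contains."""
        mean = sum(block) / len(block)
        return sum((p - mean) ** 2 for p in block) / len(block)

    def choose_transform_size(block, threshold=50.0):
        """Detailed areas get 4x4 blocks, flat areas get 8x8 blocks (threshold is made up)."""
        return 4 if variance(block) > threshold else 8

    flat_area = [128] * 64                             # little detail
    busy_area = [(i * 37) % 256 for i in range(64)]    # lots of detail
    print(choose_transform_size(flat_area))            # prints 8
    print(choose_transform_size(busy_area))            # prints 4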

H.264 SVC (Scalable Video Coding):

Vendors are now introducing H.264 SVC (Scalable Video Coding) into their products. H.264 SVC is the latest adaptive technology that delivers high quality video across networks with varying amounts of available bandwidth. Specified in Annex G of the H.264 standard, H.264 SVC promises to increase the scalability of video networks.

In stark contrast to the other members of the H.264 AVC family (including H.264 High Profile), with which video endpoints send one stream for every resolution, frame rate and quality, H.264 SVC enabled video endpoints send just one stream that contains multiple layers covering all the resolutions (spatial), frame rates (temporal) and quality levels that the endpoints and network can support. This approach provides 'scalability', as each endpoint can select which layers of video it needs without any additional encoding or decoding. This selection of video layers is independent and does not affect other endpoints. It also allows each endpoint to gracefully degrade the video quality when it or the network gets busy.
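
As a rough illustration of that difference, the following Python sketch contrasts AVC-style simulcast (one independent stream per resolution) with an SVC-style single stream carrying a base layer plus enhancement layers. The layer descriptions and field names are purely illustrative, not real bitstream syntax:

    # H.264 AVC style: one independent stream for each resolution/frame rate.
    avc_simulcast = [
        {'resolution': '1080p', 'fps': 30},
        {'resolution': '720p',  'fps': 30},
        {'resolution': '360p',  'fps': 15},
    ]

    # H.264 SVC style: a single stream carrying a base layer plus enhancement layers.
    svc_stream = {
        'base': {'resolution': '360p', 'fps': 15},
        'enhancements': [
            {'adds': 'temporal', 'fps': 30},
            {'adds': 'spatial',  'resolution': '720p'},
            {'adds': 'spatial',  'resolution': '1080p'},
        ],
    }

    def layers_for(stream, wanted_resolution):
        """Each receiver keeps the base layer plus only the enhancements it can use."""
        keep = [stream['base']]
        for layer in stream['enhancements']:
            keep.append(layer)
            if layer.get('resolution') == wanted_resolution:
                break
        return keep

    print(len(avc_simulcast))                  # three separate streams to send
    print(layers_for(svc_stream, '720p'))      # one stream; the 1080p layer is simply dropped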

However, H.264 SVC is still essentially proprietary, with vendors such as Polycom, Radvision and Vidyo each having their own flavour of H.264 SVC. Hence, back in April 2010, Microsoft, HP, Polycom and Lifesize got together and formed the Unified Communications Interoperability Forum (UCIF), with the intent of creating a set of guidelines and specifications that companies could use to build or adapt their solutions based on common interoperable protocols.

H.264 SVC UCConfig Modes:

Since then, Microsoft have published a document that defines UCConfig Modes which relate to the various scalable layers found in H.264 SVC. Effectively, Microsoft will be using the H.264 SVC technology developed by Polycom.

The document defines five UCConfig Modes, 0, 1, 2q, 2s & 3, that have different levels of video scalability. Each incremental level requires additional processing, but brings further reductions in bandwidth.

The five defined UCConfig Modes are:

  • UCConfig Mode 0: Non-Scalable Single Layer AVC Bitstream.
  • UCConfig Mode 1: SVC Temporal Scalability with Hierarchical P.
  • UCConfig Mode 2q: SVC Temporal Scalability + Quality/SNR Scalability.
  • UCConfig Mode 2s: SVC Temporal Scalability + Spatial Scalability.
  • UCConfig Mode 3: Full SVC Scalability (Temporal + SNR + Spatial).

UCConfig Mode 0:
Has no scalability, but Mode 0 still supports multiple independent simulcast streams, one for each resolution. Mode 0 is effectively the basic level for backward compatibility with H.264 AVC Baseline Profile, but the extent of such backward compatibility is questionable and vendor dependent.

UCConfig Mode 1:
Introduces Temporal Scalability, which provides the ability to send a single video stream per resolution, with each stream supporting multiple frame rates. For example, if an endpoint is requested to send two different resolutions (1080p & 720p), it will send two separate video streams at the same time. The receiving endpoint can then decide what frame rate to use to display these video streams by dropping specific frames. If the streams were sent at 30 fps, the receiving endpoint could display 30 fps, or, by dropping alternate frames, 15 fps.
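
A minimal sketch of this behaviour, assuming a simple two-layer tagging scheme (the layer numbering is illustrative, not Microsoft's or the H.264 syntax):

    def tag_temporal_layers(frame_count):
        """Even frames form the 15 fps base layer (T0); odd frames are the T1 enhancement."""
        return [{'frame': i, 'temporal_layer': i % 2} for i in range(frame_count)]

    def play_at(stream, max_layer):
        """A receiver drops every frame above the temporal layer it wants."""
        return [f['frame'] for f in stream if f['temporal_layer'] <= max_layer]

    stream_30fps = tag_temporal_layers(10)
    print(play_at(stream_30fps, 1))            # every frame kept: 30 fps playback
    print(play_at(stream_30fps, 0))            # base layer only: 15 fps playback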

UCConfig Mode 2q:
Is cumulative, taking the Temporal Scalability of Mode 1 and then applying Quality (SNR) Scalability. Mode 2q encodes additional levels of image quality at the same resolutions and frame rates as Mode 1. There are still separate video streams for each resolution, but now each stream includes different video qualities for each resolution and frame rate.

UCConfig Mode 2s:
Like Mode 2q, is cumulative. However, Mode 2s takes the Temporal Scalability of Mode 1 and applies Spatial Scalability by intermixing multiple resolutions into the same video stream. In Mode 1, each stream had a set resolution, but with Mode 2s each stream can carry additional scalable resolutions within that stream. For example, a 720p30 stream could also carry a 480p30 layer.
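
A small sketch of that example, with illustrative field names rather than real SVC syntax, might look like this:

    mode_2s_stream = [
        {'layer': 'base',        'resolution': '480p', 'fps': 30},
        {'layer': 'enhancement', 'resolution': '720p', 'fps': 30},
    ]

    def extract(stream, wanted_resolution):
        """A 480p receiver simply discards the 720p enhancement layer."""
        keep = []
        for layer in stream:
            keep.append(layer)
            if layer['resolution'] == wanted_resolution:
                break
        return keep

    print(extract(mode_2s_stream, '480p'))     # base layer only
    print(extract(mode_2s_stream, '720p'))     # base plus enhancement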

UCConfig Mode 3:
Is the next level and combines the Temporal, Quality and Spatial Scalability of the lower levels into fewer actual streams. With Mode 3, a single video stream can include a mix of resolutions with multiple frame rates and quality levels.
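
Pulling the three dimensions together, the following sketch tags each part of a hypothetical Mode 3 stream with temporal (T), spatial (S) and quality (Q) layer identifiers and lets a receiver keep only what it needs. The tags and values are illustrative assumptions, not the actual UCConfig or H.264 SVC syntax:

    full_svc_stream = [
        {'T': t, 'S': s, 'Q': q}
        for t in range(2)                      # 15 fps base + 30 fps enhancement
        for s in range(2)                      # 480p base + 720p enhancement
        for q in range(2)                      # base quality + SNR enhancement
    ]

    def select_layers(stream, max_t, max_s, max_q):
        """A busy receiver might drop to the 15 fps, 480p, base-quality layers only."""
        return [p for p in stream
                if p['T'] <= max_t and p['S'] <= max_s and p['Q'] <= max_q]

    print(len(select_layers(full_svc_stream, 1, 1, 1)))   # 8: every layer combination
    print(len(select_layers(full_svc_stream, 0, 0, 0)))   # 1: the base layer only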


For a complete picture, please take a closer look at all the other papers in this series about Skype for Business 2015. 

