Appendix C: Video and Audio Codecs used by H.323 and SIP VC systems


Overview:

This paper looks at the Video and Audio Codecs used by Standards Compliant H.323 and SIP systems and is Appendix C of a series that specifically looks at Microsoft® Skype® for Business 2015 (Lync® 2013) and the challenges and solutions for integrating Skype for Business 2015 with H.323 or SIP standards compliant videoconferencing systems. Hence, it will focus on the codecs used in A/V Conferencing and Application Sharing.

We will look at the main video and audio codecs available to standards compliant H.323 and SIP systems for A/V Conferencing and Application Data Sharing; this will then highlight the differences and challenges that need to be resolved when integrating with Microsoft Skype for Business 2015 (Lync 2013) clients.

Within these papers, the terms Lync, Skype, Skype for Business and SfB, unless stated otherwise, all refer to Skype for Business Server 2015. The paper is specifically based on Skype for Business 2015. Whilst Lync 2013 has now been renamed Skype for Business 2015, it is generally backwards compatible with Lync Server 2013.

It is recommended that you look at all the papers listed below for a background into Skype for Business and a detailed explanation of the Codecs, Protocols, Procedures and some of the available solutions.

History:

Microsoft Lync is an evolutionary product for Unified Communications (UC). The initial product, Live Communications Server 2003, was only an Instant Messaging (IM) server. This then evolved through several iterations of Live Communications Server to Office Communications Server and then to Lync Server 2010, when a PBX replacement function was added. It then evolved even further to Lync Server 2013, which added much more including video conferencing, web and audio conferencing, softphone and PBX replacement and/or integration. Now, Microsoft have renamed Lync to Skype for Business.

H.323 & SIP Video and Audio Codecs:

We now know that for native integration, a third-party endpoint must share common audio and video codecs with Skype for Business. So let's take a closer look at the latest codecs in H.323 & SIP endpoints. 

Determining what codecs are used by H.323 endpoints:

From Part 3, we know that it is not possible to have native integration with endpoints that just support H.323 as these don't use or understand MS-SIP communications, so we would need a Gateway to transcode the signalling and associated media streams. But if we really wanted to determine what codecs were used by an H.323 endpoint, we could take a communications trace of an H.323 call and then analyse the results. However, the trace is complex and the endpoint's 'capability set' is not easily found or interpreted.

Luckily, if the endpoint also supports SIP, then it would typically support the same video and audio codecs in both H.323 and SIP calls. But what's really relevant here are the codecs that the endpoint supports in a SIP call.

Determining what codecs are used by SIP endpoints:

From Appendix D, we know that in a communications trace of a SIP call, the SDP in the SIP INVITE includes a list of all the supported audio and video codecs. So we first need to capture the communications in a SIP call using an application such as Wireshark, and then analyse the SDP statement.

Real-time Transport Protocol Audio Video Profile - RTP/AVP:

In the RTP/AVP (RTP Audio Video Profile) statement, the format is:

a=rtpmap:<payload type> <codec name>/<clock rate>

The number assigned to each codec is referred to as its PT or Payload Type; it identifies the actual codec against its name. These numbers range from 0 to 127 and fall into four categories: reserved, assigned to a specific codec, unassigned or dynamic. The dynamic range is 96-127 so, as the name implies, the PT number allocated to a specific codec in this range could change between different conference sessions. For example, in the trace below, 111 relates to H264. But as it's in the dynamic range, be aware that in another trace, any number in the 96-127 range could relate to H264.
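
As a minimal sketch of the format described above, the following Python splits an a=rtpmap line into its parts and flags whether the Payload Type falls in the dynamic range. The function name parse_rtpmap and the returned field names are illustrative assumptions, not part of any standard library or vendor tool.

```python
import re

# Pattern for: a=rtpmap:<payload type> <codec name>/<clock rate>[/<channels>]
RTPMAP = re.compile(r"^a=rtpmap:(\d+)\s+([^/\s]+)/(\d+)(?:/(\d+))?\s*$")

def parse_rtpmap(line):
    """Split an a=rtpmap line into payload type, codec name, clock rate and channels."""
    match = RTPMAP.match(line.strip())
    if match is None:
        raise ValueError(f"not an rtpmap line: {line!r}")
    pt, codec, clock, channels = match.groups()
    return {
        "pt": int(pt),
        "codec": codec,
        "clock_rate": int(clock),
        # The channel count is optional and is normally only present for audio.
        "channels": int(channels) if channels else None,
        # Payload Types 96-127 are dynamic, so the number may differ per session.
        "dynamic": 96 <= int(pt) <= 127,
    }

print(parse_rtpmap("a=rtpmap:111 H264/90000"))
print(parse_rtpmap("a=rtpmap:34 H263/90000"))
```

Because a dynamic PT can change between sessions, any tooling should key on the codec name from the rtpmap line rather than on the number itself.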

Trace of the Video codecs:

The video trace taken from a Polycom RealPresence SIP call typically looks like:

m=video 3232 RTP/AVP 111 109 110 96 34 31 106 105 116
.
a=rtpmap:111 H264/90000
a=fmtp:111 profile-level-id=64001f; packetization-mode=1; max-br=20010; sar=13
a=vnd.polycom.avcplus.p2p:111 max-temp-layer-p2p=3
a=rtpmap:109 H264/90000
a=fmtp:109 profile-level-id=42801f; max-br=20010; sar=13
a=vnd.polycom.avcplus.p2p:109 max-temp-layer-p2p=3
a=rtpmap:110 H264/90000
a=fmtp:110 profile-level-id=42801f; packetization-mode=1; max-br=20010; sar=13
a=vnd.polycom.avcplus.p2p:110 max-temp-layer-p2p=3
a=rtpmap:96 H263-1998/90000
a=fmtp:96 CIF4=1;CIF=1;QCIF=1;SQCIF=1;CUSTOM=352,240,1;CUSTOM=704,480,1;
                 CUSTOM=1024,768,1;CUSTOM=800,600,1;CUSTOM=640,480,1;T
a=rtpmap:34 H263/90000
a=fmtp:34 CIF4=1;CIF=1;QCIF=1;SQCIF=1
a=rtpmap:31 H261/90000
a=fmtp:31 CIF=1;QCIF=1
a=rtpmap:106 H264-SVC/90000
a=fmtp:106 profile-level-id=56001f; packetization-mode=1; max-br=20010; sar=13
a=rtpmap:105 H264-SVC/90000
a=fmtp:105 profile-level-id=53e01f; packetization-mode=1; max-br=20010; sar=13
a=rtpmap:116 vnd.polycom.lpr/9000
a=fmtp:116 V=2;minPP=0;PP=150;RS=52;RP=10;PS=1400

m=video 3232 RTP/AVP 111 109 110 96 34 31 106 105 116 indicates that the order of preference is 111, 109, ... 116, using RTP over UDP port 3232.
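
The m= line can be split mechanically. The sketch below is an illustration only; parse_media_line and its field names are hypothetical. It keeps the payload types in the order listed, which is the order of preference.

```python
def parse_media_line(line):
    """Split an SDP m= line into media type, port, transport and format list."""
    if not line.startswith("m="):
        raise ValueError(f"not a media line: {line!r}")
    media, port, proto, *formats = line[2:].split()
    return {
        "media": media,
        "port": int(port),
        "proto": proto,
        # Formats are kept as strings in their original (preference) order.
        "formats": formats,
    }

video = parse_media_line("m=video 3232 RTP/AVP 111 109 110 96 34 31 106 105 116")
print(video["port"], video["proto"])
print("most preferred payload type:", video["formats"][0])
```

The formats are left as strings because some media lines, such as the BFCP one later in this paper, carry a * rather than a number.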

H.264 AVC:

As you can see, there are several variations of H.264 as shown in the first three trace entries:
a=rtpmap:111 H264/90000
a=rtpmap:109 H264/90000
a=rtpmap:110 H264/90000

If you dig deeper and decipher the associated extra format (a=fmtp:) code:
a=fmtp:111 profile-level-id=64001f; packetization-mode=1, the hex 64 in 64001f is H.264 High-Profile whilst packetization-mode=1 indicates non-interleaved mode
a=fmtp:109 profile-level-id=42801f, the hex 42 in 42801f is H.264 Baseline Profile
a=fmtp:110 profile-level-id=42801f; packetization-mode=1 is H.264 Baseline Profile in non-interleaved mode
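
The deciphering above can be sketched in Python. The profile-level-id is three hex bytes: the profile, a set of constraint flags and the level (stored as ten times the level number). The function name and the small PROFILES table are assumptions made for illustration; the table only lists the common AVC profiles and special cases such as level 1b are not handled.

```python
# Common H.264 AVC profile_idc values (first byte of profile-level-id).
PROFILES = {0x42: "Baseline", 0x4D: "Main", 0x58: "Extended", 0x64: "High"}

def decode_profile_level_id(value):
    """Decode a 6-hex-digit profile-level-id such as '64001f'."""
    if len(value) != 6:
        raise ValueError("profile-level-id must be 6 hex digits")
    profile_idc = int(value[0:2], 16)
    constraints = int(value[2:4], 16)
    level_idc = int(value[4:6], 16)
    return {
        "profile": PROFILES.get(profile_idc, f"unknown (0x{profile_idc:02x})"),
        "constraint_flags": f"{constraints:08b}",
        # level_idc is ten times the level number, e.g. 0x1f = 31 = level 3.1
        "level": f"{level_idc // 10}.{level_idc % 10}",
    }

print(decode_profile_level_id("64001f"))
print(decode_profile_level_id("42801f"))
```

Applied to the trace, 64001f decodes as High Profile at level 3.1 and 42801f as Baseline Profile at level 3.1.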

H.263-1998:

H.263-1998 or H.263+ are the informal names given to the second edition of H.263. It includes enhanced capabilities by adding several annexes which can improve encoding efficiency as well as robustness against data loss. H.263-1998 also added support for flexible customised picture formats and clock frequencies as indicated in the associated format parameters.

a=rtpmap:96 H263-1998/90000 indicates that PT 96 is assigned to H.263-1998 (H.263+ or H.263v2)
a=fmtp:96 CIF4=1;CIF=1;QCIF=1;SQCIF=1;CUSTOM=352,240,1;..... lists all the customised picture formats.

H.263:

H.263 is the initial release developed by the ITU-T Video Coding Experts Group, originally designed as a low bitrate compression format for H.324 based (PSTN - POTS) videoconferencing systems. It was then also used in H.323 (IP), H.320 (ISDN) and SIP (IP) videoconferencing.

a=rtpmap:34 H263/90000 indicates that PT 34 is assigned to H.263
a=fmtp:34 CIF4=1;CIF=1;QCIF=1;SQCIF=1 lists the supported picture formats as 4CIF, CIF, QCIF & SQCIF

H.261:

H.261 was the first member of the H.26x family of video codecs developed by the ITU-T (VCEG) Video Coding Experts Group and was ratified in 1988. H.261 was originally designed to work over ISDN with data sent in multiples of 64 kbps. The actual algorithm was designed to support sending video between 40 - 2048 kbps at resolutions of CIF (352x288) and QCIF (176x144).

a=rtpmap:31 H261/90000 indicates that PT 31 is assigned to H.261
a=fmtp:31 CIF=1;QCIF=1 lists the supported picture formats as CIF and QCIF

H.264-SVC:

H.264-SVC is the name given to Annex G extensions of the H.264/MPEG-4 AVC video codec. Scalable Video Coding provides a high-quality video stream that contains one or more sub-streams. These sub-streams are derived by dropping packets from larger video streams to reduce the bandwidth required for that specific sub-stream. Each sub-stream can represent a lower spatial resolution (screen size), lower temporal resolution (frame rate) or lower quality video signal.

H.264-SVC has five scalable profiles, with a base layer that can be decoded by H.264 AVC.

As you can see, there are two variations of H.264-SVC as shown in the following two trace entries:
a=rtpmap:106 H264-SVC/90000
a=rtpmap:105 H264-SVC/90000

If you dig deeper and decipher the associated extra format (a=fmtp:) code:
a=fmtp:106 profile-level-id=56001f;, the hex 56 in 56001f is H.264-SVC Scalable High Profile
a=fmtp:105 profile-level-id=53e01f;, the hex 53 in 53e01f is H.264-SVC Scalable Baseline Profile

Polycom Lost Packet Recovery - LPR:

In terms of RTP/AVP, Lost Packet Recovery - LPR is seen as a video codec. It is proprietary to Polycom and supported by their latest endpoints and software. LPR is a Polycom algorithm designed to protect IP video calls from the impact of network packet loss.

LPR involves both video systems in the call. LPR uses forward error correction (FEC) whereby the sending system adds redundant data to its outgoing data stream to allow the receiving system to detect and correct errors without having to ask the sending system to re-transmit the missing information.

LPR works by temporarily allocating a portion of the call bandwidth into a FEC data channel and uses this for sending FEC data to the receiving system. LPR then increases or decreases the size of the FEC data channel until it finds the minimum bandwidth that must be allocated to the FEC data channel in order for the receiving system to recover all lost packets.
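
To illustrate the general idea of forward error correction, the sketch below uses a simple XOR parity packet. This is NOT Polycom's LPR algorithm, which is proprietary and not described in this paper; it is a generic textbook technique, swapped in purely to show how redundant data lets a receiver rebuild a lost packet without a re-transmission. All function names are hypothetical.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(packets):
    """Sender side: build one parity packet as the XOR of equal-length packets."""
    parity = bytes(len(packets[0]))
    for packet in packets:
        parity = xor_bytes(parity, packet)
    return parity

def recover(received, parity):
    """Receiver side: rebuild the single missing packet (marked as None)."""
    missing = [i for i, packet in enumerate(received) if packet is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can repair exactly one lost packet per group")
    rebuilt = parity
    for packet in received:
        if packet is not None:
            rebuilt = xor_bytes(rebuilt, packet)
    return missing[0], rebuilt

group = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(group)
print(recover([b"AAAA", None, b"CCCC"], parity))
```

Here one parity packet per three media packets costs roughly a third more bandwidth and repairs one loss per group; the trade-off between redundancy and recoverable loss is the same kind of balance the FEC channel sizing described above is trying to strike.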

As LPR is implemented at the communications/data packet level, it is codec agnostic. Hence, LPR works in conjunction with the codecs most commonly used during video calls including H.264, H.263, G.722, G.722.1C (Siren14) etc. Furthermore, it could be used with codecs that will be introduced in the future such as the H.265 video codec.

a=rtpmap:116 vnd.polycom.lpr/9000 indicates that PT 116 is assigned to Polycom LPR

Trace of the Audio codecs:

The audio trace would typically look like:

m=audio 3230 RTP/AVP 118 115 114 113 102 101 103 99 98 97 9 18 15 0 8 119
.
a=rtpmap:118 SIRENLPR/48000/1
a=fmtp:118 bitrate=64000
a=rtpmap:115 G7221/32000
a=fmtp:115 bitrate=48000
a=rtpmap:114 G7221/32000
a=fmtp:114 bitrate=32000
a=rtpmap:113 G7221/32000
a=fmtp:113 bitrate=24000
a=rtpmap:102 G7221/16000
a=fmtp:102 bitrate=32000
a=rtpmap:101 G7221/16000
a=fmtp:101 bitrate=24000
a=rtpmap:103 G7221/16000
a=fmtp:103 bitrate=16000
a=rtpmap:99 SIREN14/16000
a=fmtp:99 bitrate=48000
a=rtpmap:98 SIREN14/16000
a=fmtp:98 bitrate=32000
a=rtpmap:97 SIREN14/16000
a=fmtp:97 bitrate=24000
a=rtpmap:9 G722/8000
a=fmtp:9 bitrate=64000
a=rtpmap:18 G729/8000
a=fmtp:18 annexb=no
a=rtpmap:15 G728/8000
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:119 telephone-event/8000
a=fmtp:119 0-15

SIREN LPR:

SIREN LPR is Polycom's proprietary implementation of SIREN audio with Lost Packet Recovery.

a=rtpmap:118 SIRENLPR/48000/1 indicates that PT 118 is assigned to SIRENLPR. The 48000 is the clock rate and /1 indicates it's single channel.

G.722.1:

G.722.1 is another ITU-T 7 kHz wide-band audio codec operating at 24 and 32 kbps. However, G.722.1 is not a derivative of G.722. It is actually based on Polycom's old Siren 7 codec that they used in the ViewStation.

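You can see that there are several derivatives of G.722.1 available in this trace, as shown below. The entries with a 32000Hz clock rate correspond to G.722.1 Annex C (G.722.1C), the 14 kHz extension of G.722.1, which is signalled under the same G7221 name.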

a=rtpmap:115 G7221/32000 indicates that PT 115 is assigned to G.722.1 with a 32000Hz sampling rate
a=fmtp:115 bitrate=48000 indicates that the associated stream is at 48000 bps

a=rtpmap:114 G7221/32000 indicates that PT 114 is assigned to G.722.1 with a 32000Hz sampling rate
a=fmtp:114 bitrate=32000 indicates that the associated stream is at 32000 bps

a=rtpmap:113 G7221/32000 indicates that PT 113 is assigned to G.722.1 with a 32000Hz sampling rate
a=fmtp:113 bitrate=24000 indicates that the associated stream is at 24000 bps

a=rtpmap:102 G7221/16000 indicates that PT 102 is assigned to G.722.1 with a 16000Hz sampling rate
a=fmtp:102 bitrate=32000 indicates that the associated stream is at 32000 bps

a=rtpmap:101 G7221/16000 indicates that PT 101 is assigned to G.722.1 with a 16000Hz sampling rate
a=fmtp:101 bitrate=24000 indicates that the associated stream is at 24000 bps

a=rtpmap:103 G7221/16000 indicates that PT 103 is assigned to G.722.1 with a 16000Hz sampling rate
a=fmtp:103 bitrate=16000 indicates that the associated stream is at 16000 bps
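
Reading each rtpmap line together with its matching fmtp line can be automated. The sketch below is a hedged illustration: the function name codec_bitrates is hypothetical and it only understands fmtp lines of the form bitrate=N, as seen in this trace.

```python
def codec_bitrates(sdp_lines):
    """Pair a=rtpmap and a=fmtp lines by payload type.

    Returns a list of (payload type, codec, clock rate, bitrate) tuples for
    every payload type that has a 'bitrate=' fmtp parameter.
    """
    encodings, bitrates = {}, {}
    for line in sdp_lines:
        if line.startswith("a=rtpmap:"):
            pt, encoding = line[len("a=rtpmap:"):].split(" ", 1)
            encodings[int(pt)] = encoding
        elif line.startswith("a=fmtp:"):
            pt, params = line[len("a=fmtp:"):].split(" ", 1)
            if params.startswith("bitrate="):
                bitrates[int(pt)] = int(params[len("bitrate="):])
    result = []
    for pt, encoding in encodings.items():
        if pt in bitrates:
            codec, clock = encoding.split("/")[:2]
            result.append((pt, codec, int(clock), bitrates[pt]))
    return result

trace = [
    "a=rtpmap:115 G7221/32000",
    "a=fmtp:115 bitrate=48000",
    "a=rtpmap:102 G7221/16000",
    "a=fmtp:102 bitrate=32000",
]
for pt, codec, clock, bitrate in codec_bitrates(trace):
    print(f"PT {pt}: {codec} clock {clock}Hz at {bitrate} bps")
```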

SIREN 14:

SIREN is a family of patented audio codecs originally developed and licensed by PictureTel, who were then acquired by Polycom. There are currently three derivatives of SIREN, namely SIREN 7, SIREN 14 and SIREN 22 and, as their names imply, they support audio bandwidths of 7, 14 and 22 kHz respectively.

SIREN 14 provides 14 kHz audio at bitrates of 24, 32 and 48 kbps for mono and 48, 64 and 96 kbps for stereo. SIREN 14 mono is the forerunner to G.722.1C.

a=rtpmap:99 SIREN14/16000 indicates that PT 99 is assigned to SIREN14 with a 16000Hz sampling rate
a=fmtp:99 bitrate=48000 indicates that the associated stream is at 48000 bps

a=rtpmap:98 SIREN14/16000 indicates that PT 98 is assigned to SIREN14 with a 16000Hz sampling rate
a=fmtp:98 bitrate=32000 indicates that the associated stream is at 32000 bps

a=rtpmap:97 SIREN14/16000 indicates that PT 97 is assigned to SIREN14 with a 16000Hz sampling rate
a=fmtp:97 bitrate=24000 indicates that the associated stream is at 24000 bps

G.722:

G.722 is a freely available ITU-T standard 7 kHz wide-band audio codec operating at 48, 56 and 64 kbps, but in practice, data is typically encoded at 64kbps. G.722 is used for VoIP applications on local area networks where bandwidth is readily available. G.722 offers a significant improvement in audio quality when compared to narrow-band codecs such as G.711.

a=rtpmap:9 G722/8000 indicates that PT 9 is assigned to G.722; but as mentioned in Appendix B, the IANA records the clock rate as 8000 when it's actually 16000Hz.
a=fmtp:9 bitrate=64000 indicates that the associated stream is at 64000 bps
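
The practical effect of the 8000 versus 16000Hz discrepancy shows up in the RTP timestamps. The following worked example is a sketch that assumes a 20 ms packetisation interval, which is a common default but is not stated in the trace; the function name is hypothetical.

```python
def g722_packet(ptime_ms=20, bitrate=64000, sample_rate=16000, rtp_clock=8000):
    """Return (payload bytes, audio samples, RTP timestamp step) per packet.

    ptime_ms is an assumed packetisation interval; the other defaults are the
    G.722 figures quoted in the text.
    """
    payload_bytes = bitrate * ptime_ms // 1000 // 8
    samples = sample_rate * ptime_ms // 1000
    timestamp_step = rtp_clock * ptime_ms // 1000
    return payload_bytes, samples, timestamp_step

payload_bytes, samples, step = g722_packet()
print(f"{payload_bytes} bytes carry {samples} samples; timestamp advances by {step}")
```

So with these assumptions each packet carries 320 audio samples in 160 bytes, yet the RTP timestamp only advances by 160, because it follows the advertised 8000Hz clock rate rather than the real sampling rate.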

G.729:

G.729 is an audio data compression algorithm for voice that compresses digital voice in packets of 10 ms duration. It is officially described as Coding of speech at 8 kbps using Conjugate-Structure Algebraic-Code-Excited Linear Prediction (CS-ACELP). G.729 has a low bandwidth requirement and is mostly used in Voice over Internet Protocol (VoIP) applications such as conference calls where bandwidth must be conserved. Standard G.729 operates at a bit rate of 8 kbps, but there are extensions which provide rates of 6.4 kbps and 11.8 kbps.

G.729a, or Annex A, is a variant that is still compatible with G.729. It is less complex and has slightly lower voice quality.

a=rtpmap:18 G729/8000 indicates that PT 18 is assigned to G.729 (which also covers the compatible G.729a) whilst
a=fmtp:18 annexb=no indicates that Annex B (silence suppression with comfort noise) is not supported

G.728:

G.728 is an ITU-T standard for speech coding operating at 16 kbps. It is officially described as Coding of speech at 16 kbps using Low-Delay Code Excited Linear Prediction (LD-CELP).

a=rtpmap:15 G728/8000 indicates that PT 15 is assigned to G.728 with a clock rate of 8000Hz.

G.711:

G.711 is the ITU-T audio standard that must be supported under the umbrella of the H.320 and H.323 video conferencing standards. Also known as Pulse Code Modulation (PCM), G.711 is a commonly used audio codec where the 300-3400 Hz analogue audio is sampled at a rate of 8000Hz with 8 bits per sample to provide toll-quality audio in a 64 kbps stream. There are two versions: PCMU (µ-law), which is mainly used in North America, and PCMA (A-law), which is used in most other countries.

a=rtpmap:0 PCMU/8000 indicates that PT 0 is assigned to PCMU (µ-law) with a clock rate of 8000Hz.
a=rtpmap:8 PCMA/8000 indicates that PT 8 is assigned to PCMA (A-law) with a clock rate of 8000Hz.

DTMF (Dual-Tone Multi-Frequency):

DTMF (Dual-Tone Multi-Frequency) signals are used to support the telephone events (functions) associated with pushing the dial-pad buttons during a call.

There are 16 standard tones assigned to 0-9, *, # plus four AUTOVON military tones defined as A, B, C and D. The unique tone created by each key is represented by the values 0-15 as shown in the associated fmtp attribute.

a=rtpmap:119 telephone-event/8000 indicates that PT 119 is assigned to sending DTMF signals.
a=fmtp:119 0-15 indicates that events 0-15, one for each of the 16 unique tones, are supported.
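
The mapping between event values and dial-pad keys can be sketched as follows. The event numbering (0-9 for the digits, 10 for *, 11 for # and 12-15 for A-D) follows the standard telephone-event assignment; the function names are illustrative assumptions.

```python
# Event code N corresponds to the key at index N of this string.
DTMF_KEYS = "0123456789*#ABCD"

def parse_event_range(fmtp_value):
    """Expand an fmtp events list such as '0-15' or '0-11,13' into event codes."""
    events = []
    for part in fmtp_value.split(","):
        low, _, high = part.strip().partition("-")
        events.extend(range(int(low), int(high or low) + 1))
    return events

def event_to_key(event):
    """Return the dial-pad key for a DTMF event code, or None if out of range."""
    return DTMF_KEYS[event] if 0 <= event < len(DTMF_KEYS) else None

supported = parse_event_range("0-15")
print([event_to_key(e) for e in supported])
```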

Trace of the Far End Camera Control - FECC:

The trace associated with FECC would typically look like:

m=application 3236 RTP/AVP 100
a=sendrecv
a=rtpmap:100 H224/4800

Far End Camera Control - FECC:

Within videoconferencing applications, FECC is used to control the far-end remote camera. FECC is a simple protocol based on ITU-T H.281 frames carried in H.224 packets in an RTP/UDP channel. H.323 Annex Q defines how to assemble the RTP packets. The default clock rate is 4800Hz and the maximum bandwidth is 6.4 kbps. FECC is uni-directional whilst H.224 is bi-directional and can be used to determine the far end's capabilities during the capabilities exchange procedure.

a=rtpmap:100 H224/4800 indicates that PT 100 is assigned to using H.224 at a clock rate of 4800Hz. a=sendrecv indicates that in this case, the sending endpoint's camera can be remotely controlled. If the initial capabilities exchange had a=sendonly, then its camera would not support remote control or it was disabled.

Trace of the Application Sharing:

The trace associated with Application Sharing would typically look like:

m=application 3238 UDP/BFCP *
a=sendrecv
a=setup:actpass
a=connection:new
a=floorctrl:c-s

Binary Floor Control Protocol - BFCP:

Sharing the SIP or H.323 endpoint's Desktop or Applications with a Skype for Business 2015 or Lync 2013 client typically uses BFCP or H.239, which effectively sends the Desktop or Application as a second video stream that the Skype for Business 2015 or Lync 2013 client can understand and display in either a second window or in place of the 'talking heads' video whilst the sharing is active.

When two endpoints establish a BFCP connection, they must determine which endpoint will act as a floor control server, then the other will act as a floor control client for that specific stream. If there are two streams, then again one endpoint must act as the floor control server, but it does not have to be the same endpoint for each stream.

m=application 3238 UDP/BFCP * indicates that the floor control channel for this particular application sharing stream uses UDP port 3238, with the BFCP messages carried directly in UDP packets.
a=setup:actpass indicates that the connection is not yet established and that the sender will accept either role; once negotiated, this would be either active or passive.
a=connection:new indicates that it is a new connection.
a=floorctrl:c-s indicates that the sender is willing to act both as a floor control client and floor control server. 
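
As a rough sketch, the attributes in the BFCP trace above can be collected into a dictionary and the floorctrl value translated into the roles on offer. The function and table names are hypothetical; the three floorctrl values shown (c-only, s-only and c-s) are the ones defined for BFCP in SDP.

```python
# Roles the sender is willing to take for each floorctrl value.
FLOORCTRL_ROLES = {
    "c-only": ("client",),
    "s-only": ("server",),
    "c-s": ("client", "server"),
}

def parse_bfcp_section(lines):
    """Collect the m= line and a= attributes of one BFCP media section."""
    info = {"direction": None}
    for line in lines:
        if line.startswith("m="):
            media, port, proto, *_ = line[2:].split()
            info.update(media=media, port=int(port), proto=proto)
        elif line.startswith("a="):
            name, _, value = line[2:].partition(":")
            if value:
                info[name] = value          # e.g. setup, connection, floorctrl
            else:
                info["direction"] = name    # e.g. sendrecv
    info["roles"] = FLOORCTRL_ROLES.get(info.get("floorctrl"), ())
    return info

section = parse_bfcp_section([
    "m=application 3238 UDP/BFCP *",
    "a=sendrecv",
    "a=setup:actpass",
    "a=connection:new",
    "a=floorctrl:c-s",
])
print(section["port"], section["setup"], section["roles"])
```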


For a complete picture, please take a closer look at all the other papers in this series about Skype for Business 2015. 


References:
List of Codecs "https://en.wikipedia.org/wiki/List_of_codecs"
Microsoft Lync Server 2013 Unleashed. ISBN-13 978-0-672-33615-7