| |
The Scalable Video
Coding Amendment of the H.264/AVC Standard
Summary
The Scalable Video
Coding amendment (SVC) of the H.264/AVC standard (H.264/AVC)
provides network-friendly scalability at a bit stream level with a
moderate increase in decoder complexity relative to single-layer
H.264/AVC. It supports functionalities such as bit rate, format, and
power adaptation, graceful degradation in lossy transmission
environments as well as lossless rewriting of quality-scalable SVC
bit streams to single-layer H.264/AVC bit streams. These
functionalities provide enhancements to transmission and storage
applications. SVC has achieved significant improvements in coding
efficiency with an increased degree of supported scalability
relative to the scalable profiles of prior video coding standards.

Fig. 1.
The Scalable Video Coding (SVC) principle.
Introduction
International
video coding standards such as H.261, MPEG-1, H.262/MPEG-2 Video,
H.263, MPEG-4 Visual, and H.264/AVC have played an important role in
the success of digital video applications. They provide
interoperability among products from different manufacturers while
allowing a high flexibility for implementations and optimizations in
various application scenarios. The H.264/AVC specification
represents the current state-of-the-art in video coding. Compared to
prior video coding standards, it significantly reduces the bit rate
necessary to represent a given level of perceptual quality – a
property also referred to as increase of the coding efficiency.
The desire for scalable video coding, which allows on-the-fly
adaptation to certain application requirements such as display and
processing capabilities of target devices, and varying transmission
conditions, originates from the continuous evolution of receiving
devices and the increasing usage of transmission systems that are
characterized by a widely varying connection quality. Video coding
today is used in a wide range of applications ranging from
multimedia messaging, video telephony and video conferencing over
mobile TV, wireless and Internet video streaming, to standard- and
high-definition TV broadcasting. In particular, the Internet and
wireless networks gain more and more importance for video
applications. Video transmission in such systems is exposed to
variable transmission conditions, which can be dealt with using
scalability features. Furthermore, video content is delivered to a
variety of decoding devices with heterogeneous display and
computational capabilities (see Fig. 2). In these heterogeneous environments,
flexible adaptation of once-encoded content is desirable, at the
same time enabling interoperability of encoder and decoder products
from different manufacturers.

Fig. 2. Example of
video
streaming with heterogeneous receiving devices and variable network
conditions.
Scalability has already been present in the video coding standards
MPEG-2 Video, H.263, and MPEG-4 Visual in the form of scalable
profiles. However, the provision of spatial and quality scalability
in these standards comes along with a considerable growth in decoder
complexity and a significant reduction in coding efficiency (i.e.,
bit rate increase for a given level a reconstruction quality) as
compared to the corresponding non-scalable profiles. These drawbacks,
which reduced the success of the scalable profiles of the former
specifications, are addressed by the new SVC amendment of the H.264/AVC
standard.
Types of
scalability
A video bit
stream is called scalable when parts of the stream can be removed in
a way that the resulting sub-stream forms another valid bit stream
for some target decoder, and the sub-stream represents the source
content with a reconstruction quality that is less than that of the
complete original bit stream but is high when considering the lower
quantity of remaining data. Bit streams that do not provide this
property are referred to as single-layer bit streams. The usual
modes of scalability are temporal, spatial, and quality scalability.
Spatial scalability and temporal scalability describe cases in which
subsets of the bit stream represent the source content with a
reduced picture size (spatial resolution) or frame rate (temporal
resolution), respectively. With quality scalability, the sub-stream
provides the same spatio-temporal resolution as the complete bit
stream, but with a lower fidelity – where fidelity is often
informally referred to as signal-to-noise ratio (SNR). Quality
scalability is also commonly referred to as fidelity or SNR
scalability. More rarely required scalability modes are
region-of-interest (ROI) and object-based scalability, in which the
sub-streams typically represent spatially contiguous regions of the
original picture area. The different types of scalability can also
be combined, so that a multitude of representations with different
spatio-temporal resolutions and bit rates can be supported within a
single scalable bit stream.

Fig. 3.
The basic types of
scalability in video coding.
Application areas
for scalable video coding
Efficient
scalable video coding provides a number of benefits in terms of
applications. Consider, for instance, the scenario of a video
transmission service with heterogeneous clients, where multiple bit
streams of the same source content differing in picture size, frame
rate, and bit rate should be provided simultaneously. With the
application of a properly configured scalable video coding scheme,
the source content has to be encoded only once – for the highest
required resolution and bit rate, resulting in a scalable bit stream
from which representations with lower resolution and/or quality can
be obtained by discarding selected data. For instance, a client with
restricted resources (display resolution, processing power, or
battery power) needs to decode only a part of the delivered bit
stream. Similarly, in a multicast scenario, terminals with different
capabilities can be served by a single scalable bit stream. In an
alternative scenario, an existing video format can be extended in a
backward compatible way by an enhancement video format.

Fig. 4. On-the-fly
adaptation of scalable coded video content in a media-aware network
element (MANE).
Another benefit of scalable video coding is that a scalable bit
stream usually contains parts with different importance in terms of
decoded video quality. This property in conjunction with unequal
error protection is especially useful in any transmission scenario
with unpredictable throughput variations and/or relatively high
packet loss rates. By using a stronger protection of the more
important information, error resilience with graceful degradation
can be achieved up to a certain degree of transmission errors.
Media-aware network elements (MANEs), which receive feedback
messages about the terminal capabilities and/or channel conditions,
can remove the non-required parts from a scalable bit stream, before
forwarding it (see Fig. 4). Thus, the loss of important transmission units due to
congestion can be avoided and the overall error robustness of the
video transmission service can be substantially improved.
Basic concepts
for extending H.264/AVC toward a scalable video coding standard
Apart from
the required support of all common types of scalability, the most
important design criteria for a successful scalable video coding
standard are coding efficiency and complexity. Since SVC was
developed as an extension of H.264/AVC with all of its well-designed
core coding tools being inherited, one of the design principles of
SVC was that new tools should only be added if necessary for
efficiently supporting the required types of scalability.
Temporal
scalability
A bit stream
provides temporal scalability when the set of corresponding access
units can be partitioned into a temporal base layer and one or more
temporal enhancement layers with the following property. Let the
temporal layers be identified by a temporal layer identifier T,
which starts from 0 for the base layer and is increased by 1 from
one temporal layer to the next. Then for each natural number k, the
bit stream that is obtained by removing all access units of all
temporal layers with a temporal layer identifier T greater than k
forms another valid bit stream for the given decoder.
For hybrid video codecs, temporal scalability can generally be
enabled by restricting motion-compensated prediction to reference
pictures with a temporal layer identifier that is less than or equal
to the temporal layer identifier of the picture to be predicted. The
prior video coding standards MPEG-1, H.262/MPEG-2 Video, H.263, and
MPEG-4 Visual all support temporal scalability to some degree.
H.264/AVC provides a significantly increased flexibility for
temporal scalability because of its reference picture memory control.
It allows the coding of picture sequences with arbitrary temporal
dependencies, which are only restricted by the maximum usable DPB
size. Hence, for supporting temporal scalability with a reasonable
number of temporal layers, no changes to the design of H.264/AVC
were required. The only related change in SVC refers to the
signalling of temporal layers.

Fig. 5. Hierarchical prediction structures for enabling temporal scalability:
(a) coding with hierarchical B or P pictures, (b) non-dyadic
hierarchical prediction structure, (c) hierarchical prediction
structure with a structural encoder/decoder delay of zero. The
numbers directly below the pictures specify the coding order; the
symbols Tk specify the temporal layers with k representing the
corresponding temporal layer identifier.
Temporal scalability with dyadic temporal enhancement layers can be
very efficiently provided with the concept of hierarchical B or P
pictures as illustrated in Fig. 5a. The enhancement layer pictures
are typically coded as B pictures, where the reference picture lists
0 and 1 are restricted to the temporally preceding and succeeding
picture, respectively, with a temporal layer identifier less than
the temporal layer identifier of the predicted picture. Since
backward prediction is not necessarily coupled with the use of B
slices in H.264/AVC, the temporal coding structure of Fig. 5a can
also be realized using P slices. Each set of temporal layers {T0,…,Tk}
can be decoded independently of all layers with a temporal layer
identifier T > k. In the following, the set of pictures between two
successive pictures of the temporal base layer together with the
succeeding base layer picture is referred to as a group of pictures
(GOP).
Although the described prediction structure with hierarchical B or P
pictures provides temporal scalability and also shows excellent
coding efficiency, it represents a special case. In general,
hierarchical prediction structures for enabling temporal scalability
can always be combined with the multiple reference picture concept
of H.264/AVC. This means that the reference picture lists can be
constructed by using more than one reference picture, and they can
also include pictures with the same temporal level as the picture to
be predicted. Furthermore, hierarchical prediction structures are
not restricted to the dyadic case. As an example, Fig. 5b illustrates a non-dyadic hierarchical prediction structure, which
provides 2 independently decodable sub-sequences with 1/9-th and
1/3-rd of the full frame rate. It is further possible to arbitrarily
adjust the structural delay between encoding and decoding a picture
by restricting motion-compensated prediction from pictures that
follow the picture to be predicted in display order. As an example,
Fig. 5c shows a hierarchical prediction structure, which does not
employ motion-compensated prediction from pictures in the future.
Although this structure provides the same degree of temporal
scalability as the prediction structure of Fig. 5a, its structural
delay is equal to zero compared to 7 pictures for the prediction
structure in Fig. 5a.
Spatial
scalability
For
supporting spatial scalable coding, SVC follows the conventional
approach of multi-layer coding, which is also used in H.262/MPEG-2
Video, H.263, and MPEG-4 Visual. In each spatial layer,
motion-compensated prediction and intra prediction are employed as
for single-layer coding. In addition to these basic coding tools of
H.264/AVC, SVC provides so-called inter-layer prediction methods (see
Fig. 6), which allow an exploitation of the statistical dependencies
between different layers for improving the coding efficiency (reducing
the bit rate) of enhancement layers.

Fig. 6. Multi-layer structure with additional inter-layer
prediction (black arrows).
In H.262/MPEG-2 Video, H.263, and MPEG-4 Visual, the only supported
inter-layer prediction methods employ the reconstructed samples of
the lower layer signal. The prediction signal is either formed by
motion-compensated prediction inside the enhancement layer, by
upsampling the reconstructed lower layer signal, or by averaging
such an upsampled signal with a temporal prediction signal. Although
the reconstructed lower layer samples represent the complete lower
layer information, they are not necessarily the most suitable data
that can be used for inter-layer prediction. Usually, the
inter-layer predictor has to compete with the temporal predictor,
and especially for sequences with slow motion and high spatial
detail, the temporal prediction signal typically represents a better
approximation of the original signal than the upsampled lower layer
reconstruction. In order to improve the coding efficiency for
spatial scalable coding, two additional inter-layer prediction
concepts have been added in SVC: prediction of macroblock modes and
associated motion parameters and prediction of the residual signal.
All inter-layer prediction tools can be chosen on a macroblock or
sub-macroblock basis allowing an encoder to select the coding mode
that gives the highest coding efficiency.
 |
Inter-layer intra prediction: For SVC enhancement layers, an
additional macroblock coding mode (signalled by the syntax element
base_mode_flag equal to 1) is provided, in which the macroblock
prediction signal is completely inferred from co-located blocks in
the reference layer without transmitting any additional side
information. When the co-located reference layer blocks are
intra-coded, the prediction signal is built by the up-sampled
reconstructed intra signal of the reference layer – a prediction
method also referred to as inter-layer intra prediction. |
| |
|
 |
Inter-layer macroblock mode and motion prediction: When
base_mode_flag is equal to 1 and at least one of the co-located
reference layer blocks is not intra-coded, the enhancement layer
macroblock is inter-picture predicted as in single-layer H.264/AVC
coding, but the macroblock partitioning – specifying the
decomposition into smaller block with different motion parameters –
and the associated motion parameters are completely derived from the
co-located blocks in the reference layer. This concept is also
referred to as inter-layer motion prediction. For the conventional
inter-coded macroblock types of H.264/AVC, the scaled motion vector
of the reference layer blocks can also be used as replacement for
usual spatial motion vector predictor. |
| |
|
 |
Inter-layer residual prediction:
A further inter-layer prediction
tool referred to as inter-layer residual prediction targets a
reduction of the bit rate required for transmitting the residual
signal of inter-coded macroblocks. With the usage of residual
prediction (signalled by the syntax element residual_prediction_flag
equal to 1), the up-sampled residual of the co-located reference
layer blocks is subtracted from the enhancement layer residual (difference
between the original and the inter-picture prediction signal) and
only the resulting difference, which often has a smaller energy then
the original residual signal, is encoded using transform coding as
specified in H.264/AVC. |

Fig. 7.
Illustration of
inter-layer prediction tools: (left) upsampling of intra-coded
macroblock for inter-layer intra prediction, (middle) upsampling of
macroblock partition in dyadic spatial scalability for inter-layer
prediction of macroblock modes, (right) upsampling of residual
signal for inter-layer residual prediction.
As an important feature of the SVC design, each spatial enhancement
layer can be decoded with a single motion compensation loop. For the
employed reference layers, only the intra-coded macroblocks and
residual blocks that are used for inter-layer prediction need to be
reconstructed (including the deblocking filter operation) and the
motion vectors need to be decoded. The computationally complex
operations of motion-compensated prediction and the deblocking of
inter-picture predicted macroblocks only need to be performed for
the target layer to be displayed.
Similar to H.262/MPEG-2 Video and MPEG-4 Visual, SVC supports
spatial scalable coding with arbitrary resolution ratios. The only
restriction is that neither the horizontal nor the vertical
resolution can decrease from one layer to the next. The SVC design
further includes the possibility that an enhancement layer picture
represents only a selected rectangular area of its corresponding
reference layer picture, which is coded with a higher or identical
spatial resolution. Alternatively, the enhancement layer picture may
contain additional parts beyond the borders of the reference layer
picture. This reference and enhancement layer cropping, which may
also be combined, can even be modified on a picture-by-picture basis.
Quality
scalability
Quality
scalability can be considered as a special case of spatial
scalability with identical picture sizes for base and enhancement
layer. This case, which is also referred to as coarse-grain quality
scalable coding (CGS), is supported by the general concept for
spatial scalable coding as described above. The same inter-layer
prediction mechanisms are employed, but without using the
corresponding upsampling operations. When utilizing inter-layer
prediction, a refinement of texture information is typically
achieved by re-quantizing the residual texture signal in the
enhancement layer with a smaller quantization step size relative to
that used for the preceding CGS layer. As a specific feature of this
configuration, the deblocking of the reference layer intra signal
for inter-layer intra prediction is omitted. Furthermore,
inter-layer intra and residual prediction are directly performed in
the transform coefficient domain in order to reduce the decoding
complexity.
The CGS concept only allows a few selected bit rates to be supported
in a scalable bit stream. In general, the number of supported rate
points is identical to the number of layers. Switching between
different CGS layers can only be done at defined points in the bit
stream. Furthermore, the CGS concept becomes less efficient, when
the relative rate difference between successive CGS layers gets
smaller. Especially for increasing the flexibility of bit stream
adaptation and error robustness, but also for improving the coding
efficiency for bit streams that have to provide a variety of bit
rates, a variation of the CGS approach, which is also referred to as
medium-grain quality scalability (MGS), is included in the SVC
design. The differences to the CGS concept are a modified high-level
signalling, which allows a switching between different MGS layers in
any access unit, and the so-called key picture concept, which allows
the adjustment of a suitable trade-off between drift and enhancement
layer coding efficiency for hierarchical prediction structures.

Fig. 8. Various concepts for trading off enhancement layer coding
efficiency and drift for packet-based quality scalable coding: (a)
base layer only control, (b) enhancement layer only control, (c)
two-loop control, (d) key picture concept of SVC for hierarchical
prediction structures, where key pictures are marked by the
black-framed boxes.
Drift describes the effect that the motion-compensated prediction
loops at encoder and decoder are not synchronized, e.g., because
quality refinement packets are discarded from a bit stream. Fig. 8 illustrates different concepts for trading off enhancement layer
coding efficiency and drift for packet-based quality scalable coding.
 |
Base layer only control: For fine-grain quality scalable (FGS)
coding in MPEG-4 Visual, the prediction structure was chosen in a
way that drift is completely omitted. As illustrated in Fig. 8a, motion compensation in MPEG-4 FGS is only performed using the base
layer reconstruction as reference, and thus any loss or modification
of a quality refinement packet doesn’t have any impact on the motion
compensation loop. The drawback of this approach, however, is that
it significantly decreases enhancement layer coding efficiency in
comparison to single-layer coding. Since only base layer
reconstruction signals are used for motion-compensated prediction,
the portion of bit rate that is spent for encoding MPEG-4 FGS
enhancement layers of a picture cannot be exploited for the coding
of following pictures that use this picture as reference. |
| |
|
 |
Enhancement layer only control: For quality scalable coding in
H.262/MPEG-2 Video, the other extreme case of possible prediction
structures was specified. Here, the reference with the highest
available quality is always employed for motion-compensated
prediction as depicted in Fig. 8b. This enables highly efficient
enhancement layer coding and ensures low complexity, since only a
single reference picture needs to be stored for each time instant.
However, any loss of quality refinement packets results in a drift
that can only be controlled by intra updates. – It should be noted
that H.262/MPEG-2 Video does not allow partial discarding of quality
refinement packets inside a video sequence, and thus the drift issue
can be completely avoided in conforming H.262/MPEG-2 Video bit
streams by controlling the reconstruction quality of both the base
and the enhancement layer during encoding. |
| |
|
 |
Two-loop control: As an alternative, a concept with two motion
compensation loops as illustrated in Fig. 8c could be employed. This
concept is similar to spatial scalable coding as specified in
H.262/MPEG-2 Video, H.263, and MPEG-4 Visual. Although the base
layer is not influenced by packet losses in the enhancement layer,
any loss of a quality refinement packet results in a drift for the
enhancement layer reconstruction. |
| |
|
 |
SVC key picture concept: For MGS coding in SVC an alternative
approach using so-called key pictures (see Fig. 8d) has been
introduced. For each picture a flag is transmitted, which signals
whether the base quality reconstruction or the enhancement layer
reconstruction of the reference pictures is employed for
motion-compensated prediction. In order to limit the memory
requirements, a second syntax element signals whether the base
quality representation of a picture is additionally reconstructed
and stored in the decoded picture buffer. In order to limit the
decoding overhead for such key pictures, SVC specifies that motion
parameters must not change between the base and enhancement layer
representations of key pictures, and thus also for key pictures, the
decoding can be done with a single motion-compensation loop. |
Fig. 8d illustrates how the key picture concept can be efficiently
combined with hierarchical prediction structures. All pictures of
the coarsest temporal layer are transmitted as key pictures, and
only for these pictures the base quality reconstruction is inserted
in the decoded picture buffer. Thus, no drift is introduced in the
motion compensation loop of the coarsest temporal layer. In contrast
to that, all temporal refinement pictures typically use the
reference with the highest available quality for motion-compensated
prediction, which enables a high coding efficiency for these
pictures. Since the key pictures serve as re-synchronization points
between encoder and decoder reconstruction, drift propagation is
efficiently limited to neighbouring pictures of higher temporal
layers. The trade-off between enhancement layer coding efficiency
and drift can be adjusted by the choice of the GOP size or the
number of hierarchy stages. It should be noted that both the quality
scalability structure in H.262/MPEG-2 Video (no picture is coded as
key picture) and the FGS coding approach in MPEG-4 Visual (all pictures are coded as key pictures) basically represent special
cases of the SVC key picture concept.

Fig. 8.
Example for a
partitioning of transform coefficients.
With the MGS concept, any enhancement layer NAL unit can be
discarded from a quality scalable bit stream, and thus packet-based
quality scalable coding is provided. In addition, SVC supports the
following two features for quality scalable coding:
 |
Partitioning of transform coefficients: SVC provides the
possibility to distribute the enhancement layer transform
coefficients among several slices. To this end, the first and the
last scan index for transform coefficients are signalled in the
slice headers, and the slice data only include transform coefficient
levels for scan indices inside the signalled range. Thus, the
information for a quality refinement picture that corresponds to a
certain quantization steps size can be distributed over several NAL
units corresponding to different quality refinement layers with each
of them containing refinement coefficients for particular transform
basis functions only and the granularity of quality scalable coding
can be increased. |
| |
|
 |
SVC-to-AVC rewriting: The SVC design also supports the creation of
quality scalable bit streams that can be converted into bit streams
that conform to one of the non-scalable H.264/AVC profiles by using
a low-complexity rewriting process. For this mode of quality
scalability, the same syntax as for CGS or MGS is used, but two
aspects of the decoding process are modified: |
| |
|
|
| |
(1) |
For the inter-layer intra prediction, the prediction signal is
not formed by the upsampled intra signal of the reference layer, but
instead the spatial intra prediction modes are inferred from the
co-located reference layer blocks, and a spatial intra prediction as
in single-layer H.264/AVC coding is performed in the target layer,
i.e., the highest quality refinement layer that is decoded for a
picture. Additionally, the residual signal is predicted as for
motion-compensated macroblock types. |
| |
|
|
| |
(2) |
The residual prediction for inter-coded macroblocks and for
inter-layer intra-coded macroblocks (base_mode_flag is equal to 1
and the co-located reference layer blocks are intra-coded) is
performed in the transform coefficient level domain, i.e., not the
scaled transform coefficients, but the quantization levels for
transform coefficients are scaled and accumulated. |
| |
|
|
| |
These two modifications ensure that such a quality scalable bit
stream can be converted into a non-scalable H.264/AVC bit stream
that yields exactly the same decoding result as the quality scalable
SVC bit stream. The conversion can be achieved by a rewriting
process which is significantly less complex than transcoding the SVC
bit stream. The usage of the modified decoding process in terms of
inter-layer prediction is signalled by a flag in the slice header of
the enhancement layer slices. |
SVC High-Level
Design
In the SVC
extension of H.264/AVC, the basic concepts for temporal, spatial,
and quality scalability can be combined. However, an SVC bit stream
does not need to provide all types of scalability. Since the support
of quality and spatial scalability usually comes along with a loss
in coding efficiency relative to single-layer coding, the trade-off
between coding efficiency and the provided degree of scalability can
be adjusted according to the needs of an application.
As described above, temporal scalability is provided on the basis of
access units (set of packets that correspond to a time instant), and
each access unit is associated with a temporal layer identifier T.
Inside an access unit, the coding structure is organized in
dependency layers as illustrated in Fig. 10. A dependency layer
usually represents a specific spatial resolution. In an extreme case
it is also possible that the spatial resolution for two dependency
layers is identical, in which case the different layers provide
coarse-grain scalability (CGS) in terms of quality. Dependency
layers are identified by a dependency identifier D. Each dependency
layer contains one or more quality layers, which represent the video
source for a time instant with a specific spatial resolution and a
specific fidelity. All quality layers inside a dependency layer
correspond to the same spatial resolution and are identified by a
quality identifier Q. For quality refinement layers with a quality
identifier Q > 0, always the preceding quality layer with a quality
identifier Q – 1 is employed for inter-layer prediction. For quality
layers with Q = 0, any present quality layer of a lower dependency
layer can be selected as reference layer for inter-layer prediction.

Fig. 10. Example for the structure of an SVC access unit.
One important difference between the concept of dependency layers
and quality refinements is that switching between different
dependency layers is only envisaged at defined switching point (IDR
access units). However, switching between different quality
refinement layers is virtually possible in any access unit. Quality
refinements can either be transmitted as new dependency layers (CGS:
different D) or as additional quality refinement layers (MGS: same
D, different Q). This does not change the basic decoding process.
Only the high-level signalling and the error-detection capabilities
are influenced. When quality refinements are coded inside a
dependency layer (same D, different Q), the decoder cannot detect
whether a quality refinement packet is missing or has been
intentionally discarded. This configuration is mainly suitable in
connection with hierarchical prediction structures and the usage of
key pictures in order to enable efficient packet-based quality
scalable coding.
In the SVC context, the classification of an access unit as IDR
access unit, and thus as a switching point between different
dependency layer identifiers D, generally depends on the target
layer. An IDR access unit for a dependency layer D signals that the
reconstruction of layer D for the current and all following access
units is independent of all previously transmitted access units.
Thus, it is always possible to switch to the dependency layer (or to
start the decoding of the dependency layer) for which the current
access unit represents an IDR access unit. But it is not required
that the decoding of any other dependency layer can be started at
that point. IDR access units only provide random access points for a
specific dependency layer. For instance, when an access unit
represents an IDR access unit for an enhancement layer and thus no
motion-compensated prediction can be used, it is still possible to
employ motion-compensated prediction in the lower layers in order to
improve their coding efficiency.
System interface
In order to
assist easy bit stream manipulations, the 1-byte header of H.264/AVC
is extended by additional 3 bytes for SVC NAL unit types. This
extended header includes the identifiers D, Q, and T as well as
additional information assisting bit stream adaptations. One of the
additional syntax elements is a priority identifier P, which signals
the importance of a NAL unit. It can be used either for simple bit
stream adaptations with a single comparison per NAL unit.
Each SVC bit stream includes a sub-stream, which is compliant to a
non-scalable profile of H.264/AVC. Standard H.264/AVC NAL units (non-SVC
NAL units) do not include the extended SVC NAL unit header. However,
these data are not only useful for bit stream adaptations, but some
of them are also required for the SVC decoding process. In order to
attach this SVC related information to non-SVC NAL units, so-called
prefix NAL units are introduced. These NAL units directly precede
all non-SVC VCL NAL units in an SVC bit stream and contain the SVC
NAL unit header extension.
SVC also specifies additional SEI messages (SEI – Supplemental
Enhancement Information), which for example contain information like
spatial resolution or bit rate of the layers that are included in an
SVC bit stream and which can further assist the bit stream
adaptation process.
Profiles
The SVC
Amendment of H.264/AVC specifies three profiles for scalable video
coding:
 |
The Scalable Baseline profile is mainly targeted for mobile
broadcast, conversational and surveillance applications that require
a low decoding complexity. In this profile, the support for spatial
scalable coding is restricted to resolution ratios of 1.5 and 2
between successive spatial layers in both horizontal and vertical
direction and to macroblock-aligned cropping. Furthermore, the
coding tools for interlaced sources are not included in this profile.
Bit streams conforming to the Scalable Baseline profile contain a
base layer bit stream that conforms to the restricted Baseline
profile H.264/AVC. It should be noted that the Scalable Baseline
profile supports B slices, weighted prediction, the CABAC entropy
coding, and the 8×8 luma transform in enhancement layers (CABAC and
the 8×8 transform are only supported for certain levels), although
the base layer has to conform to the restricted Baseline profile,
which does not support these tools. |
| |
|
 |
The Scalable High profile was designed for broadcast, streaming,
and storage applications. In this profile, the restrictions of the
Scalable Baseline profile are removed and spatial scalable coding
with arbitrary resolution ratios and cropping parameters is
supported. Bit streams conforming to the Scalable High profile
contain a base layer bit stream that conforms to the High profile of
H.264/AVC. |
| |
|
 |
The Scalable High Intra profile was mainly designed for
professional applications. Bit streams conforming to this profile
contain only IDR pictures (for all layers). Beside that, the same
set of coding tools as for the Scalable High profile is supported. |
| |
|
Performance
The diagrams
in Fig. 11 show two examples, where the coding efficiency of SVC is
compared with that of single-layer H.264/AVC coding. The coding
efficiency is measured in terms of bit rate and average luma peak
signal-to-noise ratio (PSNR) of the video frames. Temporal
scalability with 5 levels is provided by all bit streams used for
comparison; it does not have any negative impact on the
rate-distortion results and is already supported in single-layer
H.264/AVC. For both the SVC bit streams and the H.264/AVC bit
streams the same basic encoder configuration was used and a similar
type of encoder optimization was applied. For the SVC bit streams,
an additional cross-layer optimization was applied that enabled to
trade off the coding efficiency of base and enhancement layer.

Fig. 11. Rate-distortion comparison of SVC, single-layer H.264/AVC
coding and simulcast with H.264/AVC using the test sequence Soccer:
(a) SVC with quality scalability, (b) SVC with dyadic spatial
scalability. CIF (Common Intermediate Format) and 4CIF denote sizes
of pictures equal to 352x288 and 704x576 luma samples, respectively.
In Fig. 11(a), SVC with quality scalability is compared to H.264/AVC
single layer coding. All SVC rate-distortion points are extracted
from one single bit stream, while for single layer coding each
rate-distortion point represents a separate non-scalable bit stream.
The diagram additionally shows a rate-distortion curve representing
the simulcast of the single-layer H.264/AVC bit stream with the
fidelity of the SVC base layer and another single-layer H.264/AVC
bit stream with the fidelity specified in the diagram.
In Fig. 11(b), SVC with spatial scalability is compared to H.264/AVC
single layer coding and simulcast of two spatial resolutions with
H.264/AVC. Two spatial layers with a dyadic relation are provided.
This means that the horizontal and vertical resolutions of the
enhancement layer (4CIF) doubles the horizontal and vertical
resolutions of the base layer (CIF). For each point on the 4CIF
curve, the spatial scalable bit stream comprises the corresponding
point on the CIF curve with about 1/2 to 1/3 of the overall bit
rate.
The comparison shows that SVC can provide a suitable degree of
scalability at the cost of approximately 10% bit rate increase in
comparison to the bit rate of single-layer H.264/AVC coding. This
bit rate increase usually depends on the degree of scalability, the
bit rate range, and the spatial resolution of the included
representations. The comparison also shows that SVC is clearly
superior to simulcasting single-layer H.264/AVC streams for
different spatial resolutions or bit rates.
Further simulation results for evaluating the coding efficiency of
SVC in relation to single-layer H.264/AVC coding that also include
subjective test results are presented at our page on the
MPEG
Verification Test for SVC.
Conclusion
In comparison
to the scalable profiles of prior video coding standards, the H.264/AVC
extension for scalable video coding (SVC) provides various tools for
reducing the loss in coding efficiency relative to single-layer
coding. The most important differences are:
 |
The possibility to employ hierarchical prediction structures for
providing temporal scalability with several layers while improving
the coding efficiency and increasing the effectiveness of quality
and spatial scalable coding. |
| |
|
 |
New methods for inter-layer prediction of macroblock modes, motion,
and residual improving the coding efficiency of spatial scalable and
quality scalable coding. |
| |
|
 |
The concept of key pictures for efficiently controlling the drift
for packet-based quality scalable coding with hierarchical
prediction structures. |
| |
|
 |
Single motion compensation loop decoding for spatial and quality
scalable coding providing a decoder complexity close to that of
single-layer coding. |
| |
|
 |
The support of a modified decoding process that allows a lossless
and low-complexity rewriting of a quality scalable bit stream into a
bit stream that conforms to a non-scalable H.264/AVC profile. |
These new features provide SVC with a competitive rate-distortion
performance while only requiring a single motion compensation loop
at the decoder side. Our experiments further illustrate that:
 |
Temporal scalability: can be typically achieved without losses in
rate-distortion performance. |
| |
|
 |
Spatial scalability: when applying a suitable SVC encoder control,
the bit rate increase relative to non-scalable H.264/AVC coding at
the same fidelity can be as low as 10% for dyadic spatial
scalability. It should be noted that the results typically become
worse as spatial resolution of both layers decreases and results
improve as spatial resolution increases. |
| |
|
 |
Quality scalability: when applying a suitable SVC encoder control,
the bit rate increase relative to non-scalable H.264/AVC coding at
the same fidelity can be as low as 10% for all supported rate points
when spanning a bit rate range with a factor of 2-3 between the
lowest and highest supported rate point. |
Resources
Contact
Contributions of
the Image Communication Group (Fraunhofer HHI)
- First SVC
Model that became the first Working Draft
- SVC inter-layer prediction tools for quality and dyadic spatial
scalable coding
- SVC key picture concept for controlling the drift in packet-based
quality scalable coding
- Decoding of SVC enhancement layer with a single motion
compensation loop
- Investigation regarding the coding with hierarchical B pictures
Publications of
the Image Communication Group (Fraunhofer HHI)
Publications in
Journals
- Heiko
Schwarz and Mathias Wien:
The Scalable Video Coding Extension of the H.264/AVC Standard
{Standards in a nutshell}, IEEE Signal Processing Magazine,
Vol. 25, No. 2, pp. 135-141, Mar. 2008, Invited Paper. (Download as
pdf)
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
Overview of the Scalable Video Coding Extension of the H.264/AVC
Standard, IEEE Transactions on Circuits and Systems for Video Technology,
Special Issue on Scalable Video Coding, Vol. 17, No. 9, pp.
1103-1120, September 2007, Invited Paper. (Download as
pdf)
- Mathias Wien, Heiko Schwarz, and Tobias Oelbaum:
Performance Analysis of SVC, IEEE Transactions on Circuits and Systems for Video Technology,
Special Issue on Scalable Video Coding, Vol. 17, No. 9, pp.
1194-1203, September 2007, Invited Paper. (Download as
pdf)
- Stephan Wenger,
Ye-Kui Wang, and Thomas Schierl:
Transport and Signaling of SVC in IP Networks,
IEEE Transactions on Circuits and Systems for Video Technology,
Special Issue on Scalable Video Coding, Vol. 17, No. 9, pp.
1164-1173, September 2007, Invited Paper.
(Download as
pdf)
- Thomas Schierl,
Thomas Stockhammer, and Thomas Wiegand:
Mobile Video Transmission Using Scalable Video Coding,
IEEE Transactions on Circuits and Systems for Video Technology,
Special Issue on Scalable Video Coding, Vol. 17, No. 9, pp.
1204-1217, September 2007, Invited Paper.
(Download as
pdf)
Publications in Conference Proceedings
- Heiner Kirchhoffer, Detlev Marpe, Heiko Schwarz, and Thomas
Wiegand:
A low-complexity approach for increasing the granularity of
packet-based fidelity scalability in Scalable Video Coding, Picture Coding Symposium (PCS'07), Lisboa, Portugal, November 2007.
- Martin Winken, Detlev Marpe, Heiko Schwarz, and Thomas Wiegand:
Bit-Depth Scalable Video Coding,
IEEE International Conference on Image Processing (ICIP'07), San
Antonio, TX, USA, September 2007.
- Heiner Kirchhoffer, Detlev Marpe, and Thomas Wiegand:
A Context Modeling Scheme for Coding of Texture Refinement
Information, IEEE International Conference on Image Processing (ICIP'07), San
Antonio, TX, USA, September 2007.
- Heiko Schwarz and Thomas Wiegand:
R-D Optimized Multi-Layer Encoder Control for SVC,
IEEE International Conference on Image Processing (ICIP'07), San
Antonio, TX, USA, September 2007.
- Heiner Kirchhoffer, Detlev Marpe, and Thomas Wiegand:
A generic context model for uniform-reconstruction based
SNR-scalable representations of residual texture signals, European Signal Processing Conference (EUSIPCO 2007), Poznan,
Poland, September 3, 2007.
- Thomas Wiegand, Heiko Schwarz, and Detlev Marpe:
Overview of the Scalable Video Coding Extension of H.264/AVC, Proceedings of 12. Dortmunder Fernsehseminar, ITG/FKTG-Fachtagung,
Dortmund, Germany, March 20-21, 2007.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
Overview of the Scalable H.264/MPEG4-AVC Extension,
IEEE International Conference on Image Processing (ICIP'06),
Atlanta, GA, USA, October 2006, Invited Paper.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
Analysis of hierarchical B pictures and MCTF,
IEEE International Conference on Multimedia and Expo (ICME'06),
Toronto, Ontario, Canada, July 2006.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
SVC: The future Scalable Video Coding Standard,
2nd International Mobile Multimedia Communications Conference
(MobiMedia'06), Aleghero, Sardinia, Italy, September 2006, Invited
Paper.
- Thomas Wiegand, Thomas Schierl, and Karsten Grüneberg:
Skalierbare Video Codierung,
22. Jahrestagung der Fernseh- und Kinotechnischen Gesellschaft (FKTG),
Potsdam, Germany, May 2006.
- Heiko Schwarz, Tobias Hinz, Detlev Marpe, and Thomas Wiegand:
Constrained Inter-Layer Prediction for Single-Loop Decoding in
Spatial Scalability, IEEE International Conference on Image Processing (ICIP'05), Genova,
Italy, September 2005.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
Combined Scability Support for the Scalable Extension of H.264/AVC,
IEEE International Conference on Multimedia and Expo (ICME'05),
Amsterdam, The Netherlands, July 2005.
- Ralf Schäfer, Heiko Schwarz, Detlev Marpe, Thomas Schierl, and
Thomas Wiegand:
MCTF and Scalability Extension of H.264/AVC and its Application to
Video Transmission, Storage, and Surveillance, IS&T/SPIE Symposium on Visual Communications and Image Processing
(VCIP'05), Bejing, China, July 2005.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
Basic concepts for supporting spatial and SNR scalability in the
scalable H.264/MPEG4 AVC extension, IEEE International Workshop on Systems, Signals, and Image
Processing (IWSSIP'05), Chalkida, Greece, Sep. 2005.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
MCTF and Scalability Extension of H.264/AVC,
Picture Coding Symposium (PCS'04), San Francisco, CA, USA, December
2004.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
SNR-Scalable Extension of H.264/AVC,
IEEE International Conference on Image Processing (ICIP'04),
Singapore, October 2004.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
SNR-Scalable Video Coding Based on H.264/AVC,
IEEE International Workshop on Systems, Signals and Image Processing
(IWSSIP 2004), Poznan, Poland, September 2004.
Contributions to Standardization
- Martin Winken, Heiko Schwarz, and Thomas Wiegand:
SVC CE1: Results for bit-depth scalability, Joint Video Team of ISO/IEC MPEG & ITU-T VCEG, Doc. JVT-Y039,
Shenzhen, China, October 2007.
- Martin Winken, Heiko Schwarz, and Thomas Wiegand:
Bit-depth SVC: New results for Freeway sequence, Joint Video Team of ISO/IEC MPEG & ITU-T VCEG, Doc. JVT-Y032,
Shenzhen, China, October 2007.
- Martin Winken, Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
CE2: SVC bit-depth scalable coding, Joint Video Team of ISO/IEC MPEG & ITU-T VCEG, Doc. JVT-X057, Geneva,
Switzerland, July 2007.
- Heiko Schwarz, Mathias Wien, Thomas Wiegand, and Gary J. Sullivan:
SVC editors input for JD text, Joint Video Team of ISO/IEC MPEG & ITU-T VCEG, Doc. JVT-X081, Geneva,
Switzerland, July 2007.
- Heiko Schwarz and Mathias Wien:
Editors input for JD and JSVM text, Doc. JVT-W070, Joint Video Team, San Jose, CA, USA, April 2007.
- Heiko Schwarz and Thomas Wiegand:
Further results for an rd-optimized multi-loop SVC encoder, Doc. JVT-W071, Joint Video Team, San Jose, CA, USA, April 2007.
- Heiko Schwarz:
Simulation results comparing JSVM, 4-tap, and RCDO motion
compensation for luma, Doc. JVT-W072, Joint Video Team, San Jose, CA, USA, April 2007.
- Heiner Kirchhoffer, Heiko Schwarz, and Thomas Wiegand:
CE1: Simplified FGS, Doc. JVT-W090, Joint Video Team, San Jose, CA, USA, April 2007.
- Martin Winken, Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
SVC bit depth scalability, Joint Video Team, Marrakech, Morocco, Doc. JVT-V078, January 2007.
- Heiko Schwarz and Thomas Wiegand:
Verification of H.241/RCDO implementations in the JSVM software, Joint Video Team, Marrakech, Morocco, Doc. JVT-V125, January 2007.
- Heiko Schwarz and Thomas Wiegand:
Implementation and performance of FGS, MGS, and CGS, Joint Video Team, Marrakech, Morocco, Doc. JVT-V126, January 2007.
- Heiner Kirchhoffer, Detlev Marpe, and Thomas Wiegand:
CE3 Improved CABAC for PR slices, Joint Video Team, Hangzhou, China, Doc. JVT-U082, October 2006.
- M. Wien, Heiko Schwarz, and T. Oelbaum:
SVC performance analysis, Joint Video Team, Hangzhou, China, Doc. JVT-U141, October 2006.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
SVC overview, Joint Video Team, Hangzhou, China, Doc. JVT-U145, October 2006.
- Detlev Marpe, Heiner Kirchhoffer, Martin Winken, and Thomas
Wiegand:
Improved CABAC for PR slices, Joint Video Team, Klagenfurt, Austria, Doc. JVT-T077, July 2006.
- Heiko Schwarz and Thomas Wiegand:
Updated results for independent parsing of spatial and CGS layers, Joint Video Team, Klagenfurt, Austria, Doc. JVT-T079, July 2006.
- Heiko Schwarz and Thomas Wiegand:
Preliminary results for an r-d optimized multi-loop SVC encoder, Joint Video Team, Klagenfurt, Austria, Doc. JVT-T080, July 2006.
- Heiko Schwarz, Detlev Marpe, Thomas Wiegand, Julien Reichel, and
Matthias Wien:
Skip mode for SVC slice data syntax, Joint Video Team (JVT), Geneva, Switzerland, Doc. JVT-S068, April
2006.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
Independent parsing of spatial and CGS layers,
Joint Video Team (JVT), Geneva, Switzerland, Doc. JVT-S069, April
2006.
- M. Winken, Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
Adaptive Motion Refinement for FGS slices,
Joint Video Team (JVT), Bangkok, Thailand, JVT-R022, January 2006.
- Tobias Hinz, Heiko Schwarz, and Thomas Wiegand:
MBAFF implementation in the JSVM Software,
Joint Video Team (JVT), Bangkok, Thailand, JVT-R061, January 2006.
- Tobias Hinz, Heiko Schwarz, and Thomas Wiegand:
FGS for field pictures and MBAFF frames,
Joint Video Team (JVT), Bangkok, Thailand, JVT-R062, January 2006.
- Tobias Hinz, Heiko Schwarz, and Thomas Wiegand:
First concepts for inter-layer prediction with MBAFF frames,
Joint Video Team (JVT), Bangkok, Thailand, JVT-R063, January 2006.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
Closed-loop coding with quality layers,
Joint Video Team (JVT), Nice, France, Doc. JVT-Q030, October 2005.
- Martin Winken, Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
Adaptive motion refinement for FGS slices,
Joint Video Team (JVT), Nice, France, Doc. JVT-Q031, October 2005.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
Hierarchical B pictures,
Joint Video Team (JVT), Poznan, Poland, Doc. JVT-P014, July 2005.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
Comparison of MCTF and closed-loop hierarchical B pictures,
Joint Video Team (JVT), Poznan, Poland, Doc. JVT-P059, July 2005.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
Improvement of the progressive refinement coding for 8x8 luma blocks,
Joint Video Team (JVT), Busan, Korea, Doc. JVT-O073, April 2005.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
Further results on constrained inter-layer prediction,
Joint Video Team (JVT), Busan, Korea, Doc. JVT-O074, April 2005.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
Scalable extension of H.264,
MPEG Meeting - ISO/IEC JTC1/SC29/WG11, Palma de Mallorca, Spain,
VCEG-X08, October 2004.
- Heiko Schwarz, Tobias Hinz, Detlev Marpe, and Thomas Wiegand:
Further improvements of the HHI proposal for SVC CE1,
MPEG Meeting - ISO/IEC JTC1/SC29/WG11, Palma de Mallorca, Spain,
MPEG04/M11398, October 2004.
- Heiko Schwarz, Tobias Hinz, Detlev Marpe, and Thomas Wiegand:
Technical description of the HHI proposal for SVC CE2,
MPEG Meeting - ISO/IEC JTC1/SC29/WG11, Palma de Mallorca, Spain,
MPEG04/M11245, October 2004.
- Heiko Schwarz, Jikun Shen, Detlev Marpe, and Thomas Wiegand:
Technical description of the HHI proposal for SVC CE3,
MPEG Meeting - ISO/IEC JTC1/SC29/WG11, Palma de Mallorca, Spain,
MPEG04/M11246, October 2004.
- Heiko Schwarz, Heiner Kirchhoffer, Detlev Marpe, and Thomas
Wiegand:
Technical description of the HHI proposal for SVC CE4,
MPEG Meeting - ISO/IEC JTC1/SC29/WG11, Palma de Mallorca, Spain,
MPEG04/M11247, October 2004.
- Heiko Schwarz, Karsten Sühring, Detlev Marpe, and Thomas Wiegand:
Technical description of the HHI proposal for SVC CE7,
MPEG Meeting - ISO/IEC JTC1/SC29/WG11, Palma de Mallorca, Spain,
MPEG04/M11248, October 2004.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
Further improvements of the HHI proposal for SVC CE1,
MPEG Meeting - ISO/IEC JTC1/SC29/WG11, Palma de Mallorca, Spain,
MPEG04/M11398, October 2004.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
Further results for the HHI proposal on combined scalability,
MPEG Meeting - ISO/IEC JTC1/SC29/WG11, Palma de Mallorca, Spain,
MPEG04/M11399, October 2004.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
Further experiments for an MCTF extension of H.264,
VCEG Meeting - ITU-T SG16/Q.6, Redmond, VCEG-W06, July 2004.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
SVC Core Experiment 2.1: Inter-layer Prediction of Motion and
Residual Data, MPEG Meeting - ISO/IEC JTC1/SC29/WG11, Redmond, USA, 11/M11043, July
2004.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
SVC Core Experiment 2.2: Influence of the update step on the coding
efficiency, MPEG Meeting - ISO/IEC JTC1/SC29/WG11, Redmond, USA, 11/M11048, July
2004.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
SVC Core Experiment 2.3: Spatial interpolation,
MPEG Meeting - ISO/IEC JTC1/SC29/WG11, Redmond, USA, 11/M11051, July
2004.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
SVC Core Experiment 2.4: Adaptive spatial transforms,
MPEG Meeting - ISO/IEC JTC1/SC29/WG11, Redmond, USA, 11/M11052, July
2004.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
Subband Extension of H.264/AVC,
Joint Video Team (JVT), Munich, Germany, JVT-K023, March 2004.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
Subband Extension of H.264/AVC,
VCEG meeting - ITU-T SG16/Q.6, Munich, Germany, VCEG-V04, March
2004.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
Scalable Extension of H.264/AVC, ISO/IEC JTC1/SC29/WG11, Munich, Germany, MPEG04/M10569/S03, March
2004.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
SNR-Scalable Extension of H.264/AVC,
Joint Video Team (JVT), Waikoloa, HI, USA, JVT-J035, December 2003.
- Heiko Schwarz, Detlev Marpe, and Thomas Wiegand:
SNR-scalable Extension of H.264/AVC,
Joint Video Team (JVT), San Diego, USA, JVT-I032, September 2003.
|