A Review of Video Streaming over the Internet

Jane Hunter, Varuni Witana, Mark Antoniades

SuperNOVA Project

DSTC Technical Report TR97-10, August 1997

Abstract:

Ideally, video and audio are streamed across the Internet from the server to the client in response to a client request for a Web page containing embedded videos. The client plays the incoming multimedia stream in real time as the data is received. Quite a few video streamers are starting to appear, and many pseudo-streaming technologies and other potential solutions are also in the pipeline. Generally, streaming video solutions may work on a closed-loop intranet, but for mass-market Internet use they are simply dysfunctional. However, current transport protocol, codec and scalability research will eventually make video on the Web a practical reality. Below we review the currently available commercial products which purport to provide video streaming capabilities over the Internet and outline their limitations. We then describe the major research projects currently underway which are attempting to solve some of these limitations. Finally, we compare and evaluate the SuperNOVA project with respect to other research projects and the current commercial products.


1. Introduction

  For a long time now, it has been easy to download and play back high-quality audio and video files from the Internet. Current web browsers and servers support a full-file transfer mode of document retrieval. However, full file transfer means very long, unacceptable transfer times and playback latency. Ideally, video and audio should be streamed across the Internet from the server to the client in response to a client request for a Web page containing embedded videos. The client plays the incoming multimedia stream in real time as the data is received.

Audio streaming is becoming widely accepted and deployed; in particular, Progressive Networks' RealAudio has a wide following. Although streaming audio programs are considerably further along than video, they are still nowhere near typical computer-sound quality. The idea of streaming video over the network has been gaining a lot of interest. The current Internet is a best-effort network and interconnects sites with widely varying bandwidth capabilities. In the future the Internet will see the rollout of ATM, RSVP with the ability to control Quality of Service (QoS), and mobile networks with widely varying QoS. Therefore it will remain a very heterogeneous network.

In this report we first present a brief review of the current video compression standards, evolving standards and techniques, and the Internet transport protocols being deployed. In addition, issues such as the need for servers, plugins and firewall penetration are discussed. There are many commercial streaming video products becoming available as well as many research projects in this area. We then review the currently available commercial products which purport to provide video streaming capabilities over the Internet and outline their current limitations. Next we describe the major research projects currently underway which are attempting to solve some of these limitations. Finally we compare and evaluate the SuperNOVA project with respect to other research projects and the current commercial products.


2. Video Compression Standards

  The most important video codec standards for streaming video are H.261, H.263, MJPEG, MPEG1, MPEG2 and MPEG4. A brief description of these is given below. Compared to video codecs for CD-ROM or TV broadcast, codecs designed for the Internet require greater scalability, lower computational complexity, greater resiliency to network losses, and lower encode/decode latency for video conferencing. In addition, the codecs must be tightly linked to network delivery software to achieve the highest possible frame rates and picture quality. As one looks at the existing codec standards, it becomes apparent that none are ideal for Internet video. In fact, it is quite clear that over the next few years we will see a host of new algorithms that are specifically designed for the Internet and are thus more suitable for it. Research is currently underway looking at both new scalable, flexible codecs and ways of scaling existing codecs using transcoding and filters. Section 3 outlines current research in video scalability. New algorithms specifically targeted at Internet video are being developed. Consequently, application framework standards such as H.323/H.324 for videoconferencing and MPEG-4 are being designed so that these new codec innovations can be easily incorporated into applications being developed today, without significant rework.

H.261

  H.261 is also known as P*64 where P is an integer number meant to represent multiples of 64kbit/sec. H.261 was targeted at teleconferencing applications and is intended for carrying video over ISDN - in particular for face-to-face videophone applications and for videoconferencing. The actual encoding algorithm is similar to (but incompatible with) that of MPEG. H.261 needs substantially less CPU power for real-time encoding than MPEG. The algorithm includes a mechanism which optimises bandwidth usage by trading picture quality against motion, so that a quickly-changing picture will have a lower quality than a relatively static picture. H.261 used in this way is thus a constant-bit-rate encoding rather than a constant-quality, variable-bit-rate encoding.

H.263

  H.263 is a draft ITU-T standard designed for low bitrate communication. It is expected that the standard will be used for a wide range of bitrates, not just low bitrate applications, and that H.263 will replace H.261 in many applications. The coding algorithm of H.263 is similar to that used by H.261, with some changes to improve performance and error recovery. The main differences between the H.261 and H.263 coding algorithms are:

- Half-pixel precision is used for motion compensation, whereas H.261 used full-pixel precision and a loop filter.
- Some parts of the hierarchical structure of the datastream are now optional, so the codec can be configured for a lower datarate or better error recovery.
- Four negotiable options are included to improve performance: unrestricted motion vectors, syntax-based arithmetic coding, advanced prediction, and forward and backward frame prediction similar to MPEG (called P-B frames).
- H.263 supports five resolutions. In addition to QCIF and CIF, which were supported by H.261, there are SQCIF, 4CIF and 16CIF. SQCIF is approximately half the resolution of QCIF; 4CIF and 16CIF are 4 and 16 times the resolution of CIF respectively. The support of 4CIF and 16CIF means the codec can compete with higher bitrate video coding standards such as the MPEG standards.

MJPEG

  There is really no such standard as "motion JPEG" or "MJPEG" for video. Various vendors have applied JPEG to individual frames of a video sequence, and have called the result "M-JPEG". JPEG is designed for compressing either full-color or gray-scale images of natural, real-world scenes. It works well on photographs, naturalistic artwork, and similar material; not so well on lettering, simple cartoons, or line drawings. JPEG is a lossy compression algorithm which uses DCT-based encoding. JPEG can typically achieve 10:1 to 20:1 compression without visible loss, 30:1 to 50:1 compression is possible with small to moderate defects, while for very-low-quality purposes such as previews or archive indexes, 100:1 compression is quite feasible. Non-linear video editors are typically used in broadcast TV, commercial post production, and high-end corporate media departments. Low bitrate MPEG-1 quality is unacceptable to these customers, and it is difficult to edit video sequences that use inter-frame compression. Consequently, non-linear editors (e.g., AVID, Matrox, FAST, etc.) will continue to use motion JPEG with low compression factors (e.g., 6:1 to 10:1).


MPEG-1

MPEG 1, 2 and 4 are currently accepted, draft and developing standards respectively, for the bandwidth efficient transmission of video and audio. The MPEG-1 codec targets a bandwidth of 1-1.5 Mbps offering VHS quality video at CIF (352x288) resolution and 30 frames per second. MPEG-1 requires expensive hardware for real-time encoding. While decoding can be done in software, most implementations consume a large fraction of a high-end processor. MPEG-1 does not offer resolution scalability and the video quality is highly susceptible to packet losses, due to the dependencies present in the P (predicted) and B (bi-directionally predicted) frames. The B-frames also introduce latency in the encode process, since encoding frame N needs access to frame N+k, making it less suitable for video conferencing.


MPEG-2

MPEG-2 extends MPEG-1 by including support for higher resolution video and increased audio capabilities. The targeted bit rate for MPEG-2 is 4-15 Mbit/s, providing broadcast quality full-screen video. The MPEG-2 draft standard does cater for scalability: three types of scalability (Signal-to-Noise Ratio (SNR), spatial and temporal) and one extension that can be used to implement scalability (data partitioning) have been defined. Compared with MPEG-1, it requires even more expensive hardware to encode and decode. It is also prone to poor video quality in the presence of losses, for the same reasons as MPEG-1. Both MPEG-1 and MPEG-2 are well suited to the purposes for which they were developed. For example, MPEG-1 works very well for playback from CD-ROM, and MPEG-2 is well suited to high-quality archiving applications and to TV broadcast applications. In the case of satellite broadcasts, MPEG-2 allows more than 5 digital channels to be encoded in the bandwidth used by a single analog channel today, without sacrificing video quality. Given this major advantage, the large encoding costs are not really a factor. However, for existing computer and Internet infrastructures, MPEG-based solutions are too expensive and require too much bandwidth; they were not designed with the Internet in mind.


MPEG-4

The intention of MPEG-4 is to provide a compression scheme suitable for video conferencing, i.e. data rates of less than 64 kbit/s. MPEG-4 will be based on the segmentation of audiovisual scenes into AVOs or "audio/visual objects" which can be multiplexed for transmission over heterogeneous networks. The MPEG-4 framework currently being developed focuses on a language called MSDL (MPEG-4 Syntactic Description Language). MSDL allows applications to construct new codecs by composing more primitive components, and provides the ability to dynamically download these components over the Internet. This philosophy is similar to that of the multimedia APIs being developed for Sun Microsystems' Java, where it will be possible to dynamically download codec components. The same trend is seen in products from major vendors such as Microsoft and Netscape, which allow multiple audio and video codecs to be plugged into their real-time streaming solutions.


3. Scalable Video Compression Techniques

  These can be sub-divided into DCT-based schemes (which include H.261, H.263, MPEG1 and MPEG2), wavelet and sub-band schemes, fractal-based schemes and image segmentation/region based compression schemes (MPEG4).

DCT-based Filtering Methods

  H.261, H.263, MPEG-1 and MPEG-2 are all motion-compensated DCT-based schemes. Table 1 below summarizes the different filtering methods which can be applied to DCT-based compressed video. Frame dropping filters and the hierarchical splitting filter provide in-line adaptive services; all other filter mechanisms provide in-line translative services. A sketch of a simple frame-dropping filter is given after Table 1.

 
Mechanism | Operation | Compression State | Class | Implementation
Simple Frame Dropping | Drops a specified percentage of frames; applies to intra-coded frames only | Fully Compressed | Frame Dropping | Yes
Prioritised Frame Dropping | Drops frames in a way that accounts for any frame interdependencies, e.g. MPEG I, P and B frames | Fully Compressed | Frame Dropping | Yes
Compression | Encodes an uncompressed bit-stream | Uncompressed | CODEC | No
Decompression | Uncompresses an encoded bit-stream | Fully Compressed | CODEC | Yes
Transcoding | Converts a bit-stream encoded using one compression standard to another | DCT, Quantisation and Run-length encoded | CODEC | Yes
Colour to Monochrome | Zeros both chrominance blocks | DCT, Quantisation and Run-length encoded | Colour Reduction | Yes
DC Colour | Removes all AC coefficients from both chrominance blocks | DCT, Quantisation and Run-length encoded | Colour Reduction | Yes
Dithering | Reduces the number of bits per pixel | Uncompressed | Colour Reduction | No
Low Pass | Removes from all blocks AC coefficients above an index in the run-length encoding sequence | DCT, Quantisation and Run-length encoded | DCT | Yes
Re-quantisation | Applies a modified quantisation step to DCT blocks | DCT | DCT | Yes
Limiting/Smoothing | Converts a VBR bit-stream to CBR using either a dynamically adjusted low-pass or re-quantisation filter, while attempting to maintain isochronity | See low-pass or re-quantisation | DCT | Yes
Frame Interleaving | Multiplexes frames from multiple streams into one stream; identification of the originating stream is necessary | Fully Compressed | Mixing | Yes
Intra-frame Mixing | Produces a single stream where each frame is a composite of corresponding frames from multiple streams | Dependent on encoding method | Mixing | Yes
Audio and Video Multiplexing | Combines an audio and a video stream into one stream, e.g. an MPEG-1 System stream | Fully Compressed | Mixing | No
Audio Mixing | Irreversibly averages or sums audio samples from several streams into one | Not specified (Fully Compressed) | Mixing | No
QoS | Separates a mixed stream into component streams for the application of sub-stream related QoS parameters | Dependent on QoS parameters | Splitting | No
Hierarchical | Subdivides a stream not necessarily hierarchical in nature into a set of streams that are hierarchically related | Fully Compressed | Splitting | No
Table 1:   Multimedia Communication Filters
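
To make the frame dropping class of filters in Table 1 concrete, the following sketch shows how a simple frame-dropping filter might look in Java. It is purely illustrative: the Frame class, the FrameFilter interface and the drop-percentage logic are assumptions made for this example and do not correspond to any of the systems reviewed here.

    // Illustrative only: a simple frame-dropping filter in the spirit of Table 1.
    // The Frame and FrameFilter types are hypothetical.
    interface FrameFilter {
        Frame apply(Frame in);   // returns null if the frame is to be dropped
    }

    class Frame {
        boolean intraCoded;      // true for intra-coded (I) frames
        byte[] data;
    }

    // "Simple Frame Dropping": drops a specified percentage of intra-coded frames.
    class SimpleFrameDropper implements FrameFilter {
        private final int dropPercent;   // e.g. 50 drops every second I-frame
        private int accumulator = 0;

        SimpleFrameDropper(int dropPercent) { this.dropPercent = dropPercent; }

        public Frame apply(Frame in) {
            if (!in.intraCoded) return in;    // pass non-intra frames through untouched
            accumulator += dropPercent;
            if (accumulator >= 100) {         // a drop is due
                accumulator -= 100;
                return null;
            }
            return in;
        }
    }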

MPEG-1 Scalability

  Temporal scalability is possible in MPEG-1 by dropping B frames and possibly P frames. B-frames can have references to past and future frames but are not used as references themselves, so it is possible to decode a sequence at a lower temporal resolution by simply skipping B frames. However, because B frames are the most efficiently compressed, only small savings are obtained by omitting them. After the B frames, P frames can also be dropped, again with relatively small savings. This leaves a stream of I-frames only. Chang and Zakhor [11] have implemented MPEG-1 scalability by storing the frames within a Group-of-Pictures (GOP) in a specific order.
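
A minimal sketch of this kind of temporal scaling is given below. It assumes the frame types of a GOP are available as a simple list; dropping B frames (and then P frames) in this way is the frame-level view of the scalability described above, not any particular implementation.

    // Illustrative temporal scaling of an MPEG-1 GOP by frame-type dropping.
    import java.util.ArrayList;
    import java.util.List;

    class TemporalScaler {
        // level 0: keep all frames; level 1: drop B frames; level 2: keep I frames only.
        static List<Character> scale(List<Character> gopTypes, int level) {
            List<Character> kept = new ArrayList<>();
            for (char type : gopTypes) {
                if (level >= 1 && type == 'B') continue;  // B frames are never referenced,
                                                          // so they can be skipped safely
                if (level >= 2 && type == 'P') continue;  // dropping P frames leaves I only
                kept.add(type);
            }
            return kept;
        }
    }

    // Example: scaling the sequence I,B,B,P,B,B,P at level 1 keeps I,P,P; level 2 keeps I.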

MPEG-2 Scalability

  Of the standard codecs, scalability is only addressed in MPEG-2. Three techniques, namely spatial scalability, data partitioning and SNR scalability, can be used.

Spatial scalability refers to an approach where the original picture is first decomposed into several lower spatial resolutions. Each resolution is encoded with its own motion-compensated coding loop. Blocks in the higher-resolution layers can be predicted either using motion-compensated temporal prediction or from spatially interpolated blocks of a lower resolution layer. Spatial scalability has attracted considerable interest because of its potential application to HDTV transmission.

Data partitioning splits encoded data into two bit streams - one containing the more important basic data (e.g. low frequency DCT coefficients and motion vectors) and one containing the less important data. A viewable picture of lower visual quality can be decoded from the more important stream. This technique is used for transmitting MPEG-2 over ATM networks: less important packets are discarded first if congestion occurs. (A sketch of this partitioning is given after Table 2.)

SNR scalability allows for encoding of a base layer and an enhancement layer at the same spatial resolution (frame size). The base layer contains coarsely quantized DCT coefficients; the enhancement layer carries information from which finely quantized DCT coefficients can be obtained. SNR scalability is similar to data partitioning and can also be used for transmitting MPEG-2 over ATM.

There are a number of problems associated with MPEG-2 scalability:

- A spatial scalability layer increases hardware costs by around 30%.
- There is a loss in picture quality of ~0.5 dB for multi-layer scalable MPEG-2 compared with single layer coding at the same bit rate.
- Data partitioning and SNR scalability can cause a drift problem and should only be used if the loss of the higher bit-rate layer lasts for only a few seconds or if I-frames are sent more frequently in the low bit-rate layer.

Typically a combination of spatial and SNR scalability is applied to create 3-layer coding of the video signal: the base layer provides the initial resolution, an additional spatial enhancement layer allows for upsampling and hence an increase in frame size of the base layer, and an SNR enhancement layer provides an increase in visual quality of the base plus spatial enhancement layers. Table 2 below (taken from [12]) shows the incremental bit rates for each layer of a typical MPEG-2 video stream, along with the Peak Signal-to-Noise Ratios (PSNR), which are a measure of visual quality.

Layer | Avg. Bit Rate (Mbps) | Frame Size | Visual Quality | Avg. PSNR (dB)
Base | 0.32 | 304 x 112 | VHS | 35
Spatial Enhancement | 0.832 | 608 x 224 | SuperVHS | 34
SNR Enhancement | 1.856 | 608 x 224 | SuperVHS | 37
Table 2:  Comparison of MPEG2 Scalable Layers
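
The sketch below illustrates the data partitioning idea described above: the quantised DCT coefficients of a block (in zig-zag order) are split at a chosen breakpoint into a more important base partition and a less important enhancement partition. It is a simplified illustration, not the MPEG-2 syntax.

    // Illustrative MPEG-2 style data partitioning for one block of quantised
    // DCT coefficients held in zig-zag order. The breakpoint chooses how many
    // low-frequency coefficients go into the more important (base) partition.
    class DataPartitioner {
        static int[][] partition(int[] zigzagCoeffs, int breakpoint) {
            int[] base = new int[breakpoint];
            int[] enhancement = new int[zigzagCoeffs.length - breakpoint];
            System.arraycopy(zigzagCoeffs, 0, base, 0, breakpoint);
            System.arraycopy(zigzagCoeffs, breakpoint, enhancement, 0, enhancement.length);
            // A decoder receiving only 'base' still reconstructs a viewable, lower
            // quality picture; 'enhancement' packets can be discarded first under
            // congestion when carrying MPEG-2 over ATM.
            return new int[][] { base, enhancement };
        }
    }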

Subband/Wavelet Coding

The majority of scalable video codecs are based on subband coding techniques, of which the most widely used is the wavelet transform. VDOnet and VXtreme use wavelet codecs. There is also a lot of work underway in research organisations on applying wavelet and subband coding techniques to scalable video codecs - see sections 7.2, 7.3, 7.4, 7.5.

Fractal Video Coding

Various research groups [13, 14] are investigating the application of fractal compression to scalable video. Iterated Systems have developed a commercial fractal codec which has been incorporated into Progressive Networks' RealVideo product.

Image Segmentation and Object-based Video Coding

  A number of research groups are investigating the application of image segmentation to video compression. The approaches involve extracting important subsets of the image content of each frame and delivering only the most important, e.g. object boundaries and moving objects. Object-based coding can achieve very high data compression rates while maintaining an acceptable visual quality in the decoded images. However, object-based coders are computationally intensive, and to be viable as a real-time process an object-based coder would need to have the image segmentation algorithm implemented as a VLSI array. See section 7.6, the UC Davis Image Sequence Processing Group [22], and 7.7, the Video Communication Research Group (VCRG), Uni. of Western Australia [19], and the Bath Video Coding Group [21]. The MPEG-4 standard is directly related to this content-based scalable video codec approach.


4. Internet Transport Protocols

TCP Transmission Control Protocol

  HTTP (Hypertext Transfer Protocol) uses TCP as the protocol for reliable document transfer. If packets are delayed or damaged, TCP will effectively stop traffic until either the original packets or backup packets arrive. This makes TCP unsuitable for streaming video and audio: when losses occur the stream stalls while packets are retransmitted, introducing delays and jitter that real-time playback cannot tolerate.

UDP

  UDP (User Datagram Protocol) is the alternative to TCP. RealPlayer, StreamWorks and VDOLive use this approach. (RealPlayer gives you a choice of UDP or TCP, but the former is preferred.) UDP forsakes TCP's error correction and allows packets to drop out if they're late or damaged. When this happens, you'll hear or see a dropout, but the stream will continue. Despite the prospect of dropouts, this approach is arguably better for continuous media delivery. If broadcasting live events, everyone will get the same information simultaneously. One disadvantage to the UDP approach is that many network firewalls block UDP information. While Progressive Networks, Xing and VDOnet offer work-arounds for client sites (revert to TCP), some users simply may not be able to access UDP files.
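
The sketch below illustrates the loss-tolerant style of UDP reception described above. It is not any vendor's code: the packet layout (a 2-byte sequence number followed by the media payload) and the port number are assumptions made for the example.

    // Illustrative loss-tolerant UDP media receiver (assumed packet layout:
    // 2-byte sequence number followed by the compressed media payload).
    import java.net.DatagramPacket;
    import java.net.DatagramSocket;

    class UdpMediaReceiver {
        public static void main(String[] args) throws Exception {
            DatagramSocket socket = new DatagramSocket(7070);   // port chosen arbitrarily
            byte[] buf = new byte[1500];
            int expected = 0;
            while (true) {
                DatagramPacket pkt = new DatagramPacket(buf, buf.length);
                socket.receive(pkt);
                int seq = ((buf[0] & 0xff) << 8) | (buf[1] & 0xff);
                if (seq != expected) {
                    // A packet was late or lost: the user sees or hears a brief
                    // dropout, but playback continues rather than stalling.
                    System.out.println("dropout before packet " + seq);
                }
                expected = seq + 1;
                // hand buf[2 .. pkt.getLength()-1] to the decoder here
            }
        }
    }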

RTP Real-time Transport Protocol

  RTP is the Internet-standard protocol (RFC 1889, 1890) for the transport of real-time data, including audio and video. RTP consists of a data part and a control part called RTCP. The data part of RTP is a thin protocol providing support for applications with real-time properties such as continuous media (e.g. audio and video), including timing reconstruction, loss detection, security and content identification. RTCP provides support for real-time conferencing of groups of any size within an internet. This support includes source identification and support for gateways such as audio and video bridges as well as multicast-to-unicast translators. It offers quality-of-service feedback from receivers to the multicast group as well as support for the synchronization of different media streams. Few of the commercial streaming products reviewed here use RTP, a relatively new standard designed to run over UDP. Initially designed for video at T1 or higher bandwidths, it promises more efficient multimedia streaming than raw UDP. Streaming vendors are expected to adopt RTP, which is already used by the MBONE tools.
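
As an illustration of how thin the RTP data part is, the sketch below parses the fixed 12-byte RTP header defined in RFC 1889 from a received packet buffer; the sequence number drives loss detection and the timestamp drives timing reconstruction and media synchronisation.

    // Parsing the fixed 12-byte RTP header (RFC 1889) from a packet buffer.
    class RtpHeader {
        int version, payloadType, sequenceNumber;
        long timestamp, ssrc;

        static RtpHeader parse(byte[] p) {
            RtpHeader h = new RtpHeader();
            h.version        = (p[0] >> 6) & 0x03;                   // should be 2
            h.payloadType    = p[1] & 0x7f;                          // e.g. H.261, MPEG video
            h.sequenceNumber = ((p[2] & 0xff) << 8) | (p[3] & 0xff);
            h.timestamp      = ((long) (p[4] & 0xff) << 24) | ((p[5] & 0xff) << 16)
                             | ((p[6] & 0xff) << 8) | (p[7] & 0xff);
            h.ssrc           = ((long) (p[8] & 0xff) << 24) | ((p[9] & 0xff) << 16)
                             | ((p[10] & 0xff) << 8) | (p[11] & 0xff);
            return h;
        }
    }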

VDP

Vosaic uses VDP, which is an augmented RTP i.e. RTP with demand resend. VDP improves the reliability of the data stream by creating two channels between the client and server. One is a control channel the two machines use to coordinate what information is being sent across the network, and the other channel is for the streaming data. When configured in Java, this protocol, like HTTP, is invisible to the network and can stream through firewalls.

RTSP Real Time Streaming Protocol

  In October 1996, Progressive Networks and Netscape Communications Corporation announced that 40 companies including Apple Computer, Autodesk/Kinetix, Cisco Systems, Hewlett-Packard, IBM, Silicon Graphics, Sun Microsystems, Macromedia, Narrative Communications, Precept Software and Voxware would support the Real Time Streaming Protocol (RTSP), a proposed open standard for delivery of real-time media over the Internet. RTSP is a communications protocol for control and delivery of real-time media. It defines the connection between streaming media client and server software, and provides a standard way for clients and servers from multiple vendors to stream multimedia content. The first draft of the protocol specification, RTSP 1.0, was submitted to the Internet Engineering Task Force (IETF) on October 9, 1996. RTSP is built on top of Internet standard protocols, including: UDP, TCP/IP, RTP, RTCP, SCP and IP Multicast. Netscape's Media Server and Media Player products use RTSP to stream audio over the Internet.
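
The sketch below shows the general shape of an RTSP interaction: a textual request sent over a TCP control connection, with the media itself delivered separately (typically as RTP over UDP). The request syntax follows the RTSP drafts as later published and may differ in detail from the October 1996 submission; the host name, clip URL and session identifier are invented for the example.

    // Illustrative RTSP control request over TCP; the media flows separately.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.net.Socket;

    class RtspPlayExample {
        public static void main(String[] args) throws Exception {
            Socket control = new Socket("media.example.com", 554);
            Writer out = new OutputStreamWriter(control.getOutputStream(), "US-ASCII");
            out.write("PLAY rtsp://media.example.com/demo RTSP/1.0\r\n"
                    + "CSeq: 2\r\n"
                    + "Session: 12345678\r\n\r\n");
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(control.getInputStream(), "US-ASCII"));
            System.out.println(in.readLine());   // e.g. "RTSP/1.0 200 OK"
            control.close();
        }
    }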

RSVP

RSVP is an Internet Engineering Task Force (IETF) proposed standard for requesting defined quality-of-service levels over IP networks such as the Internet. The protocol was designed to allow the assignment of priorities to "streaming" applications, such as audio and video, which generate continuous traffic that requires predictable delivery. RSVP works by permitting an application transmitting data over a routed network to request and receive a given level of bandwidth. Two classes of reservation are defined: a controlled load reservation provides service approximating "best effort" service under unloaded conditions; a guaranteed service reservation provides service that guarantees both bandwidth and delay.


5. Other Related Issues

Server or Serverless

  Two major approaches are emerging for streaming multimedia content to clients. The first is the server-less approach, which uses the standard web server and the associated HTTP protocol to get the multimedia data to the client. The second is the server-based approach, which uses a separate server specialized for the video/multimedia streaming task. The specialization takes many forms, including optimized routines for reading the huge multimedia files from disk, the flexibility to choose any of UDP/TCP/HTTP/Multicast protocols to deliver data, and the option to exploit continuous contact between client and server to dynamically optimize content delivery to the client. The primary advantages of the server-less approach are that: (i) there is one less software piece to learn and manage, and (ii) from an economic perspective, there is no video server to pay for. In contrast, the server-based approach: (i) makes more efficient use of the network bandwidth, (ii) offers better video quality to the end user, (iii) supports advanced features like admission control and multi-stream multimedia content, (iv) scales to support large numbers of end users, and (v) protects content copyright. The tradeoffs clearly indicate that for serious providers of streaming multimedia content the server-based approach is the superior solution.

RealPlayer, StreamWorks and VDOnet's VDOLive require you to install their A/V server software on your Web server machine. Among other things, this software can tailor the quality and number of streams, and provide detailed reports of who requested which streams. Other programs, such as Shockwave and VivoActive, are serverless; they don't require any special A/V server software beyond your ordinary Web server software. With these programs, you simply link a file on your server's hard drive from a Web page, and when someone hits the link the file starts to download. Serverless programs are simple to incorporate into a Web site but don't have the reporting capabilities of server-based programs. And because they lack both stream- and bandwidth-management features, they may be problematic if you need to support many simultaneous streams.

Java Replayers Replacing Plugins

  New solutions are appearing which use Java to eliminate the need to download and install plugins or players. Such an approach will become standard once the Java Media Player APIs being developed by Sun, Silicon Graphics and Intel are available. This approach will also ensure client platform independence. Vosaic appears to be one of the few products with a Java replayer which supports H.263.

FireWalls

Nearly all streaming products require users behind a firewall to have a UDP port opened for the video streams to pass through (1558 for StreamWorks, 7000 for VDOLive, 7070 for RealAudio). Rather than punch security holes in the firewall, Xing (StreamWorks) has developed a proxy software package you can compile and use, while VDOnet (VDOLive) and Progressive Networks (RealPlayer) are approaching leading firewall developers to get support for their streams incorporated into upcoming products. Currently a number of products change from UDP to HTTP or TCP when UDP can't get through firewall restrictions, which reduces the quality of the video. In all cases, it is still a security issue for network managers.
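
A rough sketch of this fallback behaviour is given below: after the stream has been requested over the control connection (not shown), the client waits briefly for UDP media and reverts to a TCP connection if nothing arrives. The probe logic and timeout are assumptions, not any vendor's actual algorithm.

    // Illustrative UDP-then-TCP fallback for firewall-restricted clients.
    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.Socket;
    import java.net.SocketTimeoutException;

    class TransportFallback {
        static String chooseTransport(String host, int udpPort, int tcpPort) throws Exception {
            DatagramSocket probe = new DatagramSocket(udpPort);
            try {
                probe.setSoTimeout(3000);                 // wait up to 3 s for media packets
                byte[] buf = new byte[1500];
                probe.receive(new DatagramPacket(buf, buf.length));
                return "UDP";                             // packets are getting through
            } catch (SocketTimeoutException blocked) {
                // No UDP data arrived (e.g. blocked by a firewall): fall back to TCP,
                // accepting the reduced video quality this implies.
                Socket tcp = new Socket(host, tcpPort);
                tcp.close();
                return "TCP";
            } finally {
                probe.close();
            }
        }
    }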


6. Commercial Real Time Video Streamers

MacroMedia's Streaming Shockwave

  Shockwave for Director consists of two components. On the HTTP server side, the Afterburner tool compresses Director movies to make them available on the Internet. On the client side, the Shockwave plugin lets the user incorporate Director movies into the page layout of their HTML document. The current Shockwave plugin is not streaming: the entire Director movie must be downloaded before playback. The current release allows for a separate real-time audio stream which can be encoded at 8, 16, 32 or 64 kbps, depending on the most likely bandwidth available to users. Macromedia have just released Director 6 Multimedia Studio, which supposedly includes new Streaming Shockwave technology. Macromedia and Progressive Networks have also announced the integration of Shockwave Flash, a vector-based animation and graphics system, on top of RealMedia, to enable audio and video streaming of output from Flash. Shockwave is a serverless product which relies on the HTTP protocol only. It isn't capable of live feeds and makes no use of IP Multicast, so it can't scale well to support thousands of enterprise customers while efficiently using bandwidth.

Progressive Network's RealVideo

  Progressive Networks has recently launched RealVideo, the streaming video version of their well-known RealAudio product. Both server and client versions have been released. In addition, Progressive Networks have released a range of video-oriented content development tools, some their own, others developed by third parties. Users need to install the RealServer 4.0 and the RealPlayer Plus 4.0. It uses the RTSP protocol on top of UDP. Users apparently have a choice of either fixed or optimized frame rate encoding in the new RealVideo encoder. Users choose between a number of pre-defined encoding templates which correspond to the most appropriate audio and video formats for a given bandwidth. "Stream thinning" detects poor or congested Internet connections and dynamically adjusts the video frame rate in real time; this is presumably frame dropping. "Smart networking" automatically delivers audio and video streams via the most efficient network protocol; this is presumably a choice between TCP, UDP or UDP multicast, with TCP chosen to deal with firewall restrictions blocking UDP. Progressive Networks have recently licensed ClearVideo, a fractal-based video compression technology from Iterated Systems (see http://www.iterated.com), to complement their internally-developed compression methods. RealVideo 1.0 provides two codecs: RealVideo Standard (developed by Progressive Networks) and RealVideo Fractal (using ClearVideo technology from Iterated Systems, Inc.).

Xing Technology's StreamWorks

  StreamWorks streams video and audio over the WWW using UDP/IP. Video streams can be MPEG-1, while audio can be MPEG-1 or MPEG-1 private data streams containing MPEG-2 LBR audio. Providers encode content at 8.5, 24, 56 or 112 kbps depending on the bandwidth capabilities of the potential users. StreamWorks supports a process called thinning which reduces a high-bandwidth stream so it can be transmitted over a lower bandwidth connection. At low bandwidths, the software maintains a continuous audio stream of 8 to 10 Kbps, and the video stream uses whatever bandwidth is left. The MPEG-based compression allows the software to drop frames from the stream, creating a jerky video sequence with almost no motion, while maintaining smooth audio playback. The quality of the frames that do get through is still quite good, just not as fluid as one would expect from real video. StreamWorks is able to broadcast streams to "relay servers": by using a star configuration, it is possible to provide a video feed from a single server to regional servers that then provide that stream to desktop clients.

StreamWorks' technology includes three components: the client software, the server software, and a video capture/encoding box called the AVTrans encoder to compress audio and video streams. These streams are transferred to a Unix server running the StreamWorks server software over a TCP/IP network and, from there, are broadcast over the network to client workstations. The AVTrans encoder is capable of creating a range of compressed streams, from an 8.5-Kbps low bit rate format that produces 8-kilohertz mono audio on the client, to a 112-Kbps stream that provides 44-kHz stereo audio or 30-frames-per-second, quarter-screen video for large bandwidth connections such as Ethernet or a T1. Like the other products examined in this review, StreamWorks requires you to register its MIME type in your Web server's configuration file, and you need to open a UDP port (1558) for delivering video to client workstations.

The server software can recode the compressed streams on the fly to compensate for large numbers of users and a limited bandwidth. The server is configured from a text file, so you can limit the total bandwidth output, the maximum number of simultaneous streams and the maximum bit rate per stream. The maximum default configuration for the server is 10 Mbps for an Ethernet connection, but that can be adjusted depending on how your client machines are connecting - via 14.4-Kbps modem pool, ISDN hub or 100-Mbps backbone. With a 28.8-Kbps modem connection, the StreamWorks server drops to a much lower frame rate of 2 to 3 frames per second, producing a jerky, halting video image while maintaining continuous audio. Client performance is better than VDOLive.

VDONet's VDOLive

  VDOnet claim that content providers need only one video source, which can be scaled on the fly for both high and low bandwidth connections. They claim to be able to deliver 10-15 fps over a 28.8 kbps modem using a proprietary video compression scheme based in part on wavelet techniques (VDOWave). Under ideal conditions (minimal Internet traffic, no local network overhead, minimal load on the VDOLive On-Demand Server) the claimed rates are: up to 2 to 3 frames per second with a 14.4 kbps modem; 8 to 12 frames per second with a 28.8 kbps modem; and up to 20 frames per second with an ISDN line. VDOnet's VDOLive achieves a slightly higher frame rate over a standard 28.8-Kbps modem than StreamWorks because it uses a wavelet compression technology that lets it shave layers of quality off each frame that is transmitted, rather than dropping whole frames. This creates a stream that is smoother at low bit rates, but of lower visual clarity and quality. VDOLive appears to be the only commercial product which tries to estimate bandwidth and adapt dynamically. The image quality is very poor at times but the audio is good.

VDOLive includes two programs: VDO Capture, which lets you capture video streams, and VDO Clip, which compresses a previously captured video stream and encodes it for delivery from a VDO server. VDO Capture supports seven full-motion video cards that can capture 16- or 24-bit color images at 15 frames per second in a frame size ranging from 64-by-64 pixels to 250-by-176 pixels. Unfortunately, existing AVI files that don't meet these criteria can't be used unless they're converted. The VDOLive client is blunt, but effective. Hitting the play button calls up a window for you to enter an address for the VDOLive meta file that points to the video stream you want to launch. There are also a few user-configurable parameters behind this window. VDOLive is supported by some firewall vendors; however, if UDP-based video is blocked by a firewall, VDOLive resorts to TCP-based video instead. VDOnet's VDOWave codec has been included in the codecs shipped with Microsoft's NetShow since 1996. Microsoft holds an equity stake in VDOnet.

Vosaic

  Based on research at the University of Illinois, Vosaic uses the Video Datagram Protocol (VDP). VDP is basically an augmented RTP. VDP improves reliability by creating two separate channels between the client and server: one is a control channel the two machines use to coordinate what information is being sent across the network, and the other channel is for the streaming data. A server first sends the client what amounts to an inventory of the stream that is about to be broadcast. The client then uses this list to tell the server which segments to deliver, and if a segment of the stream is lost or delayed, the client can simply ask for that segment to be resent. The stream itself is buffered on the client side, providing smooth playback in most cases. VDP also uses adaptive flow control on the server side that can adapt the packet flow based on how well the client is doing. If the client is doing well and receiving all the frames, the server can increase the number of packets being sent out onto the network. If the client is having trouble keeping up, or the network is so loaded that packets are being delayed, the server can drop packets from the stream. VDP is designed to preserve network bandwidth in response to both network congestion and client CPU load.

Vosaic supports video and audio standards including MPEG-1, MPEG-2, GSM audio, and H.263. To view Vosaic's streaming videos you need the Vosaic plug-in, and you must download both a Vosaic client and a server. There is a new version based on Java. Vosaic MediaStudio is a Java-based authoring application which can convert AVI/ASF formats and MPEG-1/2 formats into bandwidth-compatible MPEG or H.263 files. The quality (target frame rate, quantisation, MPEG frame sequence (IPBIPB)) needs to be pre-set depending on the likely connection bandwidths of your clients.

Vosaic appears to be quite similar to SuperNOVA. It uses both feedback and a feedforward scheme to adapt to both network and end-system conditions. However, it doesn't include end-to-end QoS management with user interaction. Dynamic scaling is only frame dropping, within the boundaries pre-determined at capture time; it does not support transcoding on the fly. On a T1 link your source is MPEG while on a 28K link your source is H.263. On the plus side, they already have a 100% Java H.263 player. Vosaic had a lot of audio dropouts compared to VDOLive, which maintains audio at all costs. It delivered 8-bit video only and suffered from missing blocks due to packets being lost - a consequence of MPEG-1 encoded video.
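
The sketch below illustrates the style of server-side adaptive flow control described above: the send rate is raised while client feedback looks healthy and cut back when the client reports losses or a heavily loaded CPU. The report fields and the adjustment policy are assumptions for the example, not Vosaic's actual algorithm.

    // Illustrative feedback-driven send-rate adaptation in the spirit of VDP.
    class ClientReport {
        double lossFraction;   // fraction of packets the client failed to receive
        double cpuLoad;        // how busy the client's decoder is (0.0 - 1.0)
    }

    class AdaptiveSender {
        private double packetsPerSecond = 100.0;

        void onFeedback(ClientReport report) {
            if (report.lossFraction > 0.05 || report.cpuLoad > 0.9) {
                packetsPerSecond *= 0.8;    // back off: the server drops packets/frames
            } else if (report.lossFraction == 0.0) {
                packetsPerSecond *= 1.05;   // client is keeping up: probe for more bandwidth
            }
        }

        double currentRate() { return packetsPerSecond; }
    }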

VXtreme

  VXtreme offers a number of Web Theater products: Web Theater Client, Server, Producer, LiveStation, and Personal Edition. VXtreme's software-only compression technology automatically adapts the bandwidth of the video to the network connection. VXtreme's Web Theater software uses RTP as its network delivery mechanism, extended to include mechanisms for packet loss recovery. VXtreme's compression method is non-standard; they claim it offers bandwidth scaling and software-only capability. It is apparently not based on DCT or motion estimation (H.261, H.263, MPEG-1/2) or wavelets, which they claim are compute-intensive and require hardware support. For the multicast case, VXtreme uses a layered compression scheme to divide the compressed video into multiple streams with differing priorities (based on importance to visual quality). This layered approach reduces the jitter caused by frame dropping and delivers smoother but lower resolution video. They have a bizarre congestion control method which freezes both audio and video and then restarts. Their proprietary encoding method is just as blocky as DCT-based encoding. Microsoft has recently acquired VXtreme and will ship its codec with NetShow.

Vivoactive

  The VivoActive player supports audio/video streaming of proprietary VIV files over the web using standard HTTP connections. VIV files are compressed (up to 250:1) files created by the VivoActive Producer; presently, the Producer can be downloaded for free. The plug-in works well with VIV files, but not many sites have VIV files. The VIVO format uses H.263 video compression and G.723 audio compression. No separate video server is required, since it uses HTTP rather than UDP. While Vivo acknowledged that there is some inevitable loss in speed and quality using HTTP vs. UDP, they argued it is negligible, and that it is more than made up for by the fact that HTTP, which will continue to send streams even when packets are dropped, is more flexible and less of a bandwidth hog than UDP. It is not truly scalable: users control how a video file is compressed and delivered by specifying a bandwidth. You can choose from a variety of predefined settings to optimize your video depending on the type of content you're streaming and the network connection of your audience (modem, ISDN, T1, LAN), and the Producer lets you customize the data rate, frame rate, output size, audio quality and buffering parameters for your streaming video.

Microsoft's NetShow

  Microsoft's NetShow expects the user to first create an ASF (Active Streaming Format) stream. The user has to choose from a range of audio and video codecs depending on their bandwidth availability. Codecs on offer include MPEG Layer-3 audio, Microsoft MPEG-4, Vivo G.723 (audio) and H.263 (video). Content can be produced using VivoActive. NetShow doesn't appear to offer dynamic scalability, but relies on the user to choose from a table of codecs depending on whether they are on a 28.8 Kbps modem, 56 Kbps ISDN or 110 Kbps intranet connection. NetShow will also support the Progressive Networks RealAudio and RealVideo formats. It requires both a client (NetShow Player) and a server (NetShow Server); there is also a set of NetShow Content Creation Tools. It uses the UDP protocol and relies on port 1755 to get through firewalls. A Netscape plug-in is used to replay the video. The major limitation of NetShow is that it doesn't support high quality video formats which would be deliverable over high bandwidth connections, but it does deliver very good quality video (using the latest compression standards, H.263 and MPEG-4) at low bandwidths. The advantage of NetShow is its flexibility: it supports a range of audio and video codecs which can simply be plugged into the NetShow architecture to provide a range of video/audio streaming solutions. Codecs on offer include Duck TrueMotion RT, MPEG Layer-3, Iterated Systems' ClearVideo, Microsoft MPEG-4, VDOnet's VDOWave, Vivo H.263 and Intel H.263. In addition, Microsoft have just acquired VXtreme. See http://www.microsoft.com/netshow/codecsship.htm


7. Comparison of Commercial Video Streaming Products

  The previous section describes the 8 major players in this field. The best ones are those which deliver the highest quality video for a given bandwidth i.e. lowest delay, no jitter (low frame loss), good audio/visual synchronisation, high quality audio and image resolution. In addition, the ability to provide the best possible video quality over a range of networks/bandwidths without content duplication is highly desirable. This characteristic is referred to as scalability.

All of the commercial products, except ShockWave, claim some form of video scalability. Investigation reveals that the claims of scalability are often not what they appear to be, or are simply misleading. The scalability is more often than not static rather than dynamic, and there is little user control over the visual manifestation of this scalability.

The currently available commercial products offer two types of scalability. Firstly, there is scalability at the encoding stage. Users are given a range of encoding formats to choose from, which correspond to a range of bandwidths. The limitation of this scalability is that users need to know the bandwidth in advance. This is inflexible - any unpredicted load cannot be handled gracefully. Additionally, in a multi-receiver scenario the selected bandwidth must be that of the lowest capacity channel. This is an unrealistic restriction and a waste of bandwidth for higher capacity receivers. Forcing an individual to select a bandwidth also assumes some technical awareness, and does not easily convey the visual quality of the selected video. Multiple formats are not supported from a single source; rather, a clip must exist in each desired format, which entails an overhead in the administration and storage of audio and video material.

Secondly, some of the products also incorporate some kind of dynamic scalability based on the available bandwidth at the time. Where dynamic scalability is provided it is usually simple frame dropping. This is not ideal because it can cause jerkiness and loss of synch. Alternatively, a layered or hierarchical compression method can be used. Layered compression methods usually lose image quality or resolution but maintain frame rate as the bandwidth drops. VXtreme claims to use a layered compression method but it only supports AVI and MOV file formats.

VOSAIC supports a variety of codecs - H.263, MPEG1 and MPEG2 - to suit the available bandwidth which can range from 28.8Kbps to T1. The bandwidth must be specified at encoding so that the most appropriate codec can be selected. Limited dynamic adaption is possible through frame dropping.

VDOLive is based on a proprietary wavelet encoding which enables 10-15fps, 1/4 screen video replay over 28.8Kbps. It scales dynamically from 14.4Kbps modem to ISDN and Cable modems.

VivoActive offers a very simple solution for low bandwidth connections. It doesn't require a server since it uses HTTP and it simply uses the low bandwidth H.263 and GSM codecs to enable embedded audio/video streaming over 28.8 Kbps modems. But it doesn't support high quality video (MPEG1, MPEG2) over higher bandwidths.

Progressive Networks' RealVideo has recently incorporated Iterated Systems' fractal compression technology, which will improve its ability to dynamically scale to a range of bandwidths.

The philosophy being adopted by the major vendors such as Sun, Microsoft and Netscape is to provide the ability to dynamically download codec components over the Internet, as discussed in the MPEG-4 section above. Consequently Microsoft's NetShow, which has been designed to allow a variety of codecs suited to differing applications to be easily incorporated, offers flexibility and support for the latest scalable video compression techniques.


8. Commercial Video Servers

  High-end database-driven video servers are also available from companies like IBM, Oracle, SGI, Sun and Tektronix. These products should be considered for large scale applications or for serving large numbers of simultaneous streams.

SGI WebForce MediaBase

Sun MediaCenter Servers

IBM VideoCharger and Digital Library


9. Research on Continuous Media Toolkits and QoS Architectures

  Below are descriptions of the major research projects investigating the use of scalable video compression to dynamically adapt to variable bandwidth, to ensure multimedia delivery. We have subdivided the research into continuous media testbeds/architectures and scalable video codecs.

Berkeley Continuous Media Toolkit

  The Continuous Media Toolkit (CMT) is a toolkit for multimedia applications. It is built on top of Tcl/Tk and Tcl-DP. CMT is freely distributed and is very portable. CMT supports several audio and video encoding formats, including Sparc style audio (8-bit mu-law compressed or 16-bit linear), MPEG video, MJPEG video, and H.261 video. It contains support for a number of audio interfaces including the Sparc, Linux, and Irix devices, as well as DEC's AudioFile. It also contains software MPEG, MJPEG, and H.261 decoders, as well as the capability to perform hardware-assisted decompression using the Sun Parallax, SunVideo, DEC J300, or SGI Cosmo board.

The toolkit is implemented as a collection of objects, each of which handles a specific task, for example reading MPEG encoded video from a file or decoding and displaying MPEG encoded video. Objects can be easily created and connected to build applications. Aside from objects that read or decode audio and video, a number of other interesting objects are available, including:

- objects to support the construction of distributed applications;
- objects to transmit and receive data across a TCP/IP network using Cyclic-UDP, a best effort protocol;
- objects to transmit and receive data using the Real-time Transport Protocol, the protocol used by the MBONE tools;
- objects to filter uncompressed video;
- objects to display video on the experimental Infopad.

CMT also comes with the CMplayer, a sample CMT application that can be used to play audio and video files locally or from a CMT video file server. The CM Player employs Cyclic-UDP for the transport of video streams between the VOD server and the CM Player client. Frames are prioritized at the server, and clients request resends on detecting frame losses. Cyclic-UDP repeatedly resends high priority frames in order to give them a better chance of reaching the destination. VDP's demand resend algorithm is similar to Cyclic-UDP, except that the client decides which frames get retransmitted; in an MPEG transmission, the client can decide to tolerate the loss of B frames but require the resend of all I frames. Despite implication or reference to transcoding [1, 2], examination of the CMT 3.0 Beta 3 source code provides no evidence of such capabilities. One paper [3] explicitly notes the requirement for CMT to support transcoding to "provide end-to-end delivery". One can only conclude that in this context the meaning of transcoding differs from its conventional interpretation.
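
The sketch below gives a simplified picture of the priority-driven retransmission idea behind Cyclic-UDP: unacknowledged frames are (re)sent most important first, so I frames get more chances to arrive than B frames. The class and field names are assumptions, not the Berkeley implementation.

    // Illustrative priority-driven retransmission in the spirit of Cyclic-UDP.
    import java.util.PriorityQueue;

    class PendingFrame implements Comparable<PendingFrame> {
        int priority;       // e.g. I = 3, P = 2, B = 1
        int frameNumber;
        byte[] data;

        public int compareTo(PendingFrame other) {
            return Integer.compare(other.priority, this.priority);   // highest priority first
        }
    }

    class CyclicResender {
        private final PriorityQueue<PendingFrame> pending = new PriorityQueue<>();

        void enqueue(PendingFrame f) { pending.add(f); }

        // Called whenever there is spare bandwidth before the frame's deadline passes;
        // the caller re-enqueues the frame if it is still unacknowledged afterwards.
        PendingFrame nextToSend() { return pending.poll(); }

        void acknowledged(int frameNumber) {
            pending.removeIf(f -> f.frameNumber == frameNumber);
        }
    }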

Application Level Gateway

  The Application Level Gateway [4] is an outcome of the Daedalus (Wireless Networking and Mobile Computing) Project at the University of California, Berkeley. It [5, 6] is an attempt to handle the disparity that exists between end-to-end systems and the networks that connect them; video conferencing in particular is a focus. The application level gateway attempts to achieve its goals using dynamic and transparent "bandwidth adaption" of the continuous video bit-stream. The explicit mechanisms are video transcoding and rate control.

Video transcoding is the conversion from one encoding format to another. The simplest method involves decoding one format and re-coding it in the new format; this is also computationally the most expensive method. Alternatively, transcoding can occur in the DCT (or frequency) domain, which is computationally more efficient. The latter is the approach that this project adopts. Due to differences in the usual representation of MJPEG (NTSC and PAL) and H.261 (CIF and QCIF) frames, this process involves frame format conversion, i.e. DCT block down-sampling, associated target-format-dependent resizing, followed by additional chrominance sub-sampling that effectively combines two vertically adjacent DCT blocks into one.

Rate control is a simple time-based frame dropping algorithm. The next time to transmit a frame is derived from a desired bandwidth and includes the gateway latency. All frames that arrive prior to this calculated time are dropped. Furthermore, packets that constitute a frame are evenly distributed over the entire duration of the frame. The result is a smoothed but variable frame rate.

Prototype gateway software has been built; the most recent version is rtpgw-1.0a20. It provides a user interface written in Tcl/Tk, and video and audio gateways written in C and C++. It integrates the Real-time Transport Protocol (RTP) and multicasting support into the implementation. The video gateway transcodes intra-H.261 (an H.261 subset) [7], Motion-JPEG (MJPEG) and NetVideo (NV) to intra-H.261. The audio gateway transcodes pulse coded modulated (PCM) audio to linear predictive coded (LPC) audio. The gateway is currently used to distribute seminars over the Bay Area Gigabit Network (BAGNET).
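
The rate control mechanism just described amounts to computing a next-transmit time from the target bandwidth and discarding frames that arrive too early. A minimal sketch, with assumed names, is given below.

    // Illustrative time-based rate control: frames arriving before the next
    // permitted transmission time are dropped.
    class RateController {
        private final double targetBytesPerSecond;
        private long nextTransmitTimeMs = 0;

        RateController(double targetBytesPerSecond) {
            this.targetBytesPerSecond = targetBytesPerSecond;
        }

        // Returns true if the frame should be forwarded, false if it should be dropped.
        boolean admit(byte[] frame, long nowMs) {
            if (nowMs < nextTransmitTimeMs) {
                return false;                                  // arrived too soon: drop it
            }
            long frameDurationMs = (long) (frame.length * 1000.0 / targetBytesPerSecond);
            nextTransmitTimeMs = nowMs + frameDurationMs;      // schedule the next slot
            return true;
        }
    }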

The application level gateway is an analogue of a network gateway. As such, each node in a network needs the video gateway software to be individually installed and configured; this requires human intervention, i.e. it is a static entity. Bandwidth adaption uses only transcoding and rate-shaping; there are other operations that can be performed on a (continuous media) stream that can alter its bandwidth while retaining an acceptable degree of perceived (visual) quality. While transparent adaption is a desirable feature, the application level gateway also prevents any user control over the quality of the media delivered.

Distributed Realtime MPEG Video Audio Player

  To study the effectiveness of software feedback mechanisms for client/server synchronization, dynamic QoS control and system adaptiveness on the Internet, and to investigate the toolkit approach, the Dept. of Computer Science and Engineering at the Oregon Graduate Institute of Science & Technology has constructed a distributed real-time MPEG video and audio player [8]. The player consists of a client and audio and video servers which can be distributed across the Internet. It supports variable play speed and random positioning as well as common VCR functions. The salient features of the player include: (a) real-time, synchronized playback of MPEG video and audio streams, (b) user specification of desired presentation quality, (c) QoS adaptation to variations in the environment, and (d) a toolkit approach to building software feedback mechanisms. While their system is a stand-alone distributed MPEG player like Berkeley's CM Player, it shares many ideas in common regarding feedback in order to preserve network bandwidth. Currently it appears to support only the dropping of frames to cope with reduced bandwidth availability.

Multimedia Communication Filters

  The Multimedia Communication Filters [9] project of the Multimedia Project Group at Lancaster University, like the UCB application level gateway, attempts to address heterogeneity in end-to-end continuous media communications in a dynamically adaptive way. The primary context of this work is a continuous multimedia quality of service (QoS) architecture. Quality of service is attained via an entity, similar in nature to the application level gateway, called a filter. Conceptually a filter can be considered a container for a range of mechanisms that can perform specific operations on a continuous media stream. A filter can provide three types of services: end-to-end, in-line adaptive and in-line translative. An end-to-end filter is essentially source-based filtering. In-line adaptive filters alter data stream characteristics; such operations require little or no decompression and are therefore not computationally expensive. In-line translators change the form of encoded data, and are computationally intensive, minimally requiring partial or even full decompression. The latter two filtering services are usually provided by network entities. It should also be noted that any adaptive or translative filter entity can also provide end-to-end filtering services. Additional adaptability is obtained by the ability of network filter entities to combine and to propagate to upstream nodes, including the source. Information on specific filtering mechanisms appears in Table 1 (Multimedia Communication Filters) above.

Image and Advanced TV Lab, Columbia

  Columbia's VOD and Multimedia Research Testbed [10] supports delivery of MPEG-1 and MPEG-2 audio/video stored as transport streams over a variety of heterogeneous networks, e.g. ATM, Ethernet and wireless. A single server is capable of providing different QoS depending on the client's capabilities. Client access is via a Web browser. A software encoder is used to generate scalable hybrid 3-layer MPEG-2 transport streams, i.e. a base layer plus spatial and SNR enhancement layers.

They have implemented a technique called "Dynamic Rate Shaping" (DRS) which addresses both the difficulty of bandwidth estimation and the shaping of compressed MPEG-2 streams to a continuum of possible bandwidths. DRS selectively drops from the bit stream those coefficients which are of least importance to image quality. In addition, they are developing methods for mapping MPEG-2 bitstreams to RTP payloads, since RTP does not specify how this should be done.
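
The sketch below captures the basic idea behind coefficient dropping rather than the published DRS algorithm: within each block, the coefficients that matter least to image quality (here simply the highest-frequency ones in zig-zag order) are dropped until a per-block budget, derived from the estimated bandwidth, is met.

    // Simplified illustration of coefficient dropping for rate shaping: keep only
    // the first 'budget' coefficients of a block (zig-zag order) and zero the rest.
    class RateShaper {
        static int[] shapeBlock(int[] zigzagCoeffs, int budget) {
            int[] shaped = new int[zigzagCoeffs.length];
            int kept = Math.min(budget, zigzagCoeffs.length);
            System.arraycopy(zigzagCoeffs, 0, shaped, 0, kept);
            // the remaining high-frequency coefficients stay zero, i.e. are dropped
            return shaped;
        }
    }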

Applications being developed on top of the testbed include Columbia's Interactive Electronic News System, Digital Libraries and Interactive Video Courses on Demand.

DSTC SuperNOVA

  This project defines an end-to-end QoS framework for heterogeneous networks and end-systems. It provides the application builder with an abstraction of the underlying network and end-system resources, which can have varying QoS management capabilities. This allows multimedia applications to operate transparently over different types of networks such as ATM, Ethernet and mobile. The framework is realised as a Java class library. SuperNOVA is a video-on-demand application built using this framework; a block diagram is shown in Figure 1 below. It provides the user with the ability to control QoS preferences and cost. Depending on the underlying network, either network resources are reserved (ATM) and the application configured, or the application dynamically adapts (IP) to the available bandwidth. This adaptation (or configuration) is performed according to the user's preferences for various user-level QoS parameters such as frame rate, video quality, size, etc. In order to be able to dynamically adapt the video, a high quality video file is filtered on the fly to enable delivery over reduced bandwidths. Scalability mechanisms currently implemented include transcoding from MJPEG to intra-H.261, downward variation of the encoding quality, and frame dropping. Other scalability mechanisms are targeted, e.g. colour reduction, re-quantisation, etc., as well as support for MPEG and H.263 and transcoding between other formats. The scalable video server presents a CORBA interface which allows the client to control the video and the QoS; this allows for more complex applications. News-on-demand, video library and distance learning applications have been developed on top of the SuperNOVA architecture.
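
As a purely hypothetical illustration of how user-level QoS preferences might drive the choice of scalability mechanism in a Java framework of this kind, consider the sketch below; none of the names correspond to the actual SuperNOVA class library.

    // Hypothetical sketch only: user QoS preferences driving adaptation.
    class QosPreferences {
        int preferredFrameRate = 25;             // frames per second
        int minimumFrameRate = 5;
        boolean preferQualityOverFrameRate = false;
    }

    interface ScalingMechanism {
        // Bandwidth (kbit/s) the stream would need after applying this mechanism,
        // e.g. transcoding to intra-H.261, reducing encoding quality, or frame dropping.
        double resultingBandwidthKbps(double currentKbps);
    }

    class Adaptor {
        // Try the mechanisms in the order implied by the user's preferences and
        // pick the first one that fits the measured available bandwidth.
        ScalingMechanism choose(ScalingMechanism[] preferredOrder,
                                double availableKbps, double currentKbps) {
            for (ScalingMechanism m : preferredOrder) {
                if (m.resultingBandwidthKbps(currentKbps) <= availableKbps) {
                    return m;
                }
            }
            return null;   // nothing fits: renegotiate QoS with the user
        }
    }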


10. Research Projects on Scalable Video Codecs

  Below are descriptions of some of the major research projects currently investigating the problem of adaptive video scaling.

Lancaster Filter System

  Software known as the Lancaster Filter System (LFS) exists to demonstrate some of the filtering mechanisms; the latest release is version 2.0. The LFS is implemented in C, with two interfaces written in Tcl/Tk. One interface allows explicit control of a specific filter's operations; the other is a mock client that incorporates a modified, network-aware Berkeley MPEG-1 software decoder. The filter system relies on an associated underlying QoS protocol suite. One component, the Continuous Media Protocol (CMP), provides an application-level framing service. The LFS operates on MPEG-1 compressed video streams and transcodes MJPEG to MPEG I-frames only. Tables 1 and 2 indicate the filter mechanisms implemented. Filter demonstrations are also available on the Web [15].

The filtering concept approaches the idea of intelligent network agents: each network node has a large repository of operations that can be deployed in the most appropriate place and combined when necessary, with manual intervention minimised. Support for multicasting is not stated. The transcoding, frame interleaving and intra-frame mixing filters, however, have only been implemented as file-based entities, i.e. they provide source-based services. Consequently, the implementation's efficiency may not be optimal, as the processing-time constraints applicable to network-based entities are no longer critical. A possible view of these filter types is that they perform format conversion and presentation rather than bandwidth adaptation. Intra-coded-only translation is another limitation of several filter types: Colour to Monochrome, DC Colour, Low Pass Re-quantisation and transcoding are only applicable to I-frames. Finally, it should be noted that the Limiting/Smoothing filter, in its quest for a constant bit rate, incurs a trade-off of varying resolution.

VIP Lab, UCB

  The VIP Lab are investigating scalable video coding using both MPEG-1 and subband coding. Chang and Zakhor [11] have implemented MPEG-1 scalability. By storing the frames within a GOP in a specific order, they can exploit scalability through a specific choice of frame dropping. Assuming a standard GOP pattern of IB1B2P1B3B4P2B5B6P3B7B8P4B9B10, the following four non-zero scalable bit rates can be extracted: I frames only = 12.7 blocks/sec; IP1P2P3P4 = 30.4 blocks/sec; IB1P1B3P2B5P3B7P4B9 = 34.6 blocks/sec; IB1B2P1B3B4P2B5B6P3B7B8P4B9B10 = 38.8 blocks/sec. The strategy is to group all of the frames for each rate together in each read unit. By doing this, the scalable storage pattern becomes IP1P2P3P4B1B3B5B7B9B2B4B6B8B10. Taubman and Zakhor [16] developed a scalable multirate video codec based on the wavelet transform and released a software implementation of it. It supports a range of playback rates (3.75, 7.5, 15, 30 fps) and resolutions (22x15, 44x30, 88x60, 176x120, 352x240) in either colour or monochrome. The performance of this codec, in terms of compression ratio at a specified quality level, was found to be equal to or better than that of MPEG-1.
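The frame-dropping scalability and storage reordering described above can be illustrated with a short sketch; the GOP pattern is the one quoted in the text, while the layer grouping and function names are ours.

    # Sketch of GOP-based scalability, assuming the standard 15-frame
    # pattern IB1B2P1B3B4P2B5B6P3B7B8P4B9B10 quoted above.

    GOP = ['I', 'B1', 'B2', 'P1', 'B3', 'B4', 'P2', 'B5', 'B6',
           'P3', 'B7', 'B8', 'P4', 'B9', 'B10']

    LAYERS = {
        1: {'I'},                                        # I frames only
        2: {'I', 'P1', 'P2', 'P3', 'P4'},                # I and P frames
        3: {'I', 'P1', 'P2', 'P3', 'P4',
            'B1', 'B3', 'B5', 'B7', 'B9'},               # plus half the B frames
        4: set(GOP),                                     # full rate
    }

    def extract(layer):
        """Frames belonging to the requested scalable rate, in display order."""
        wanted = LAYERS[layer]
        return [f for f in GOP if f in wanted]

    # Storage order groups each layer's frames contiguously, so serving a
    # given rate is a single sequential read:
    STORAGE_ORDER = (extract(2)
                     + ['B1', 'B3', 'B5', 'B7', 'B9']
                     + ['B2', 'B4', 'B6', 'B8', 'B10'])

    print(extract(2))        # ['I', 'P1', 'P2', 'P3', 'P4']
    print(STORAGE_ORDER)     # IP1P2P3P4 B1B3B5B7B9 B2B4B6B8B10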

Wavelet Strategic Research Programme, NUS

  Tham et al. [17] introduce a highly scalable video compression system for very low bit rate videoconferencing and telephony applications at around 10-30 kbps. They incorporate a high degree of video scalability into the codec by combining the layered/progressive coding strategy with the concept of embedded resolution block coding. With scalable algorithms, only one original compressed video bit stream is generated; different subsets of the bit stream can then be selected at the decoder to support a multitude of display specifications such as bit rate, quality level, spatial resolution, frame rate, decoding hardware complexity and end-to-end coding delay. The proposed video codec also allows precise bit rate control at both the encoder and decoder, and this can be achieved independently of the other video scaling parameters. Such a scheme is very useful for both constant and variable bit rate transmission over mobile communication channels, as well as video distribution over heterogeneous multicast networks. Simulations demonstrated comparable objective and subjective performance to the ITU-T H.263 video coding standard, while providing both multirate and multiresolution video scalability.
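The key property of such an embedded, layered bit stream is that a decoder (or an intermediate network node) can obtain a lower rate simply by taking a prefix of the stream. The sketch below shows only that selection step; the layer structure and field names are hypothetical and do not describe the NUS codec itself.

    # Illustrative prefix selection from an embedded, coarse-to-fine
    # layered bit stream.

    def select_subset(layers, max_bytes, max_width):
        """layers: dicts like {'width': 88, 'data': b'...'} ordered from the
        coarsest layer to the finest.  Returns the prefix that fits both the
        byte budget and the target spatial resolution."""
        chosen, used = [], 0
        for layer in layers:
            if layer['width'] > max_width:
                break
            if used + len(layer['data']) > max_bytes:
                break
            chosen.append(layer)
            used += len(layer['data'])
        return chosen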

VIPER

  The VIPER group are studying a control scheme for a rate scalable video codec. They are investigating a wavelet based video codec with motion compensation used to reduce temporal redundancy. The prediction error frames are encoded using an embedded zerotree wavelet (EZW) approach which allows data rate scalability. Since motion compensation is used in the algorithm, the quality of the decoded video may decay due to the propagation of errors in the temporal domain. An adaptive motion compensation scheme has been developed to address this problem. They show that by using a control scheme the quality of the decoded video can be maintained at any data rate.

Telecommunications Institute, Uni Erlangen-Nuremberg

  ScalVico is a software-only SCALable VIdeo COdec based on a spatio-temporal resolution pyramid with lattice vector quantization (VQ) for efficient compression. It is not DCT-based and was developed by Uwe Horn at the Telecommunications Institute, University of Erlangen-Nuremberg [19].

Video Communication Research Group, UWA

  Lee, Ngan and colleagues are looking at a range of video coding techniques, including scalable subband image coding [20] and region-based segmentation of images for video coding and transmission [21].

Bath Video Coding Group

  The Bath Video Coding Group have developed a scalable, adaptive, software-only codec based on a fractal compression method known as the Bath Fractal Transform (BFT). In addition, they have carried out extensive research on other DCT, wavelet and fractal compression techniques optimised for video compression [22,23].

Image Sequence Processing Group, UC Davis

  This group is looking at both joint spatiotemporal/spatiotemporal-frequency representations and video compression based on image segmentation into regions. Because many aspects of visual perception can be best understood in the frequency domain, and because visual perception is spatio-temporally local, the use of joint spatiotemporal/spatiotemporal-frequency representations is a promising approach. Examples include the Wigner distribution, the Gabor transform, and members of the class of representations based on wavelets. They are also investigating a second approach (following the contour-texture theory of vision), which is to identify uniform regions in the image or sequence via segmentation, forming a representation based on the boundaries and interiors of these regions [24].


11. Conclusions

11.1 General Conclusions

Streaming video (and audio) across networks is an effort that is attracting many participants, as evidenced by the eight primary commercial and thirteen research organisations involved with this technology in various ways. A key characteristic of both the commercial products and the research demonstrators is the diversity of their technological infrastructure, e.g. the networks, protocols and compression standards supported.

All the commercial video products reviewed in this report are optimised for low bandwidth modem or ISDN connections and are not designed to scale to higher bandwidth networks. The video needs to be pre-encoded with the target audience in mind.

The commercial products have either adopted or developed their own proprietary standards, embraced the currently accepted standards (e.g. MPEG), or implemented a combination of the two. Compatibility between the commercial products has been limited because of these proprietary standards. However, recent products such as Sun's MediaFramework API and Microsoft's NetShow have been designed to enable new and varied codecs to be easily incorporated into their frameworks.

H.263 and MPEG-4 are going to become the de facto standards for video delivery over low bandwidths. Broadband standards such as MPEG-1 and MPEG-2, which are useful for many types of broadcast and CD-ROM applications, are by contrast unsuitable for the Internet. Although MPEG-2 has had scalability enhancements, these will not be exploitable until reasonably priced hardware encoders and decoders which support scalable MPEG-2 become available.

Codecs designed for the Internet require greater bandwidth scalability, lower computational complexity, greater resilience to network losses, and lower encode/decode latency for interactive applications. These requirements imply codecs designed specifically for the diversity and heterogeneity of Internet delivery. Research on Internet codecs has broadly taken two directions: DCT-based and non-DCT-based. DCT-based video, except for MPEG-2, possesses no inherent scalability; to achieve adaptivity, various operations can be applied to the (semi-)compressed data stream to reduce its bit rate. Among these operations is transcoding, the conversion of one compression standard to another. The beauty of the DCT-based approach is that it is compatible with current and imminent draft compression standards. Furthermore, it allows re-use of existing audio and video archives without explicitly re-coding them to cater for all possible formats, and existing viewers remain usable.

Non-DCT-based compression techniques, e.g. layered, subband, wavelet, etc., are intrinsically scalable, which is their great attraction. Unfortunately, although several codecs exist, they are still experimental in nature and often suffer from performance problems. In addition, existing movie libraries would need to be re-coded, by no means a trivial task.

The research projects reviewed in this report broadly fall into two categories: one group is developing scalable video codecs, mainly using subband coding; the other is looking at scalable video in the context of QoS. There is consensus in the research community that the key to efficient delivery of continuous media over heterogeneous networks is dynamic bandwidth adaptation. Of these groups, the research carried out at Columbia on the video-on-demand testbed seems the most significant work in this area; it is similar to SuperNOVA in some respects and complementary in others.

11.2 How SuperNOVA Compares

  So how does SuperNOVA [25] compare with the other research projects going on in this arena and which issues should SuperNOVA be focussing on?

The objectives of the SuperNOVA project are two-fold: firstly, it serves as a test bed for conducting end-to-end QoS research. With this aim we are looking at an architecture for supporting multimedia applications over heterogeneous networks. There are many open research issues in the area of building a simple, implementable QoS architecture. Within DSTC, SuperNOVA concepts should be integrated into the Hector environment for distributed object programming [27]. Validating SuperNOVA by looking at more complex applications such as wOrlds [28] is also a potential direction.

When we consider the SuperNOVA VOD application, there appear to be a large number of existing video libraries whose users are reluctant to encode them into a different format. While there has been some work on DCT-based scaling, it has not been incorporated into any application; the only support for DCT-based scaling in commercial applications seems to be frame dropping. Therefore the application of DCT-based scaling to existing video, in order to generate low bandwidth streams, is an issue which should attract some interest. The prospect of real-time transcoding of video from a DCT-based to a non-DCT-based (and inherently more scalable) format is also an option that deserves further investigation.

The other research area associated with the SuperNOVA project is the actual indexing, storage, browsing and querying of the video information. The Resource Discovery Unit is looking at building tools for the automatic generation of video metadata, and video metadata standards, to enable querying across heterogeneous distributed video databases. This work is being developed on top of the SuperNOVA architecture to provide real-time delivery of the retrieved video.


References

[1] Jackson M. H., Baldeschwieler J. E. and Rowe L. A., "Berkeley CMT Media Toolkit API" , U. C. Berkeley, (submitted for publication).

[2] Mayer-Patel K., Simpson D., Wu D., and Rowe L. A., "Synchronized Continuous Media Playback Through the World Wide Web", U.C. Berkeley, Computer Science Division, Soda Hall, Berkeley, CA 94720.

[3] Rowe L. A., "Continuous Media Applications", Multipoint Workshop held in conjunction with ACM Multimedia 1994, San Francisco, CA, November 1994.

[4] Long, A. C. "Full-motion Video for Portable Multimedia Terminals", A project report submitted in partial satisfaction of the requirements for the degree of Master of Science in Computer Science, University of California, Berkeley, 1996.

[5] Amir E., "RTPGW: An Application Level RTP Gateway", July 1997.

[6] Amir E., McCanne S., and Zhang H., "An Application Level Video Gateway", Proceedings of the ACM Multimedia Conference `95, San Francisco, California, November 1995.

[7] McCanne S., and Jacobson V., "vic: A Flexible Framework for Packet Video", Proceedings of ACM Multimedia `95, November 1995.

[8] Cen S., Pu C., Staehli R., Cowan C. and Walpole J., "A Distributed Real-Time MPEG Video Audio Player", Fifth International Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV'95), Durham, New Hampshire, USA, April 18-21, 1995.

[9] Yeadon N., "Quality of Service Filters for Multimedia Communications", PhD Thesis, Lancaster University, Lancaster, May 1996.

[10] S.-F. Chang, A. Eleftheriadis, D. Anastassiou, S. Jacobs, H. Kalva, and J. Zamora, "Columbia's VOD and Multimedia Research Testbed With Heterogeneous Network Support", Intern. J. Multimedia Tools and Applications, Kluwer Academic Publishers, Sept. 1997, to appear (special issue on "Video on Demand Systems: Technology, Interoperability, and Trials").

[11] E. Chang and A. Zakhor, "Variable Bit Rate MPEG Video Storage on Parallel Disk Arrays", First International Workshop on Community Networking: Integrated Multimedia Services to the Home, San Francisco, July 1994, pp 127-137.

[12] S. Paek, P. Bochek, S-F. Chang, "Scalable MPEG-2 Video Servers with Heterogeneous QoS on Parallel Disk Arrays", 5th IEEE Workshop on Network & Operating System Support for Digital Audio & Video, New Hampshire, USA, April, 1995.

[13] Lazar, M.S. and Bruton, L.T., "Fractal block coding of digital video", IEEE Trans. on Circuits and Systems for Video Technology (Special Issue on Very-Low Bit Rate Video Coding), Vol. 4, No. 3, pp.297-308, June 1994.

[14] A. Bogdan, "Multiscale (inter/intra-frame) fractal video coding", Proc. ICIP-94 IEEE International Conference on Image Processing, Austin, Texas, Nov. 1994.

[15] Yeadon N., Garcia F., Shepherd D. and Hutchison D., "Filtering for Multipeer Communications DEMO" , July 1997.

[16] Taubman, D. & Zakhor, A. (1994), "Multirate 3-D subband coding of video", IEEE Transactions on Image Processing 3(5), 572-588.

[17] J. Y. Tham, S. Ranganath, A. K. Kassim, "Highly Scalable Wavelet-Based Video Codec for Very Low Bit Rate Environment", to be published in IEEE Journal on Selected Areas in Communications -- Very Low Bit-rate Video Coding, 1997.

[18] K. Shen and E. J. Delp,"A Control Scheme for a Data Rate Scalable Video Codec", Proceedings of the IEEE International Conference on Image Processing, September 16-19, 1996, Lausanne, Switzerland, pp. 69-72.

[19] U. Horn, B. Girod, "A Scalable Codec for Internet Video Streaming", DSP'97, July 1997, Santorini.

[20] M.H. Lee, K.N. Ngan, G. Crebbin, "Scalable coding of subband images with quadtree-based classified vector quantization", IEEE TENCON'96, Perth, Australia, November 1996, pp 788-792.

[21] H.J.Kim, M. Chan and K.N.Ngan, "Region-based segmentation and motion estimation in object-oriented analysis-synthesis coding", Picture Coding Symposium (PCS'96), Melbourne, Australia, April 1996, pp 589-594.

[22] Nicholls, J.A. and Monro, D.M., "Scalable Video By Software", Proc. ICASSP 1996, Atlanta, May 1996.

[23] D.M. Monro, H. Li and J.A. Nicholls, "Object Based Video With Progressive Foreground", to appear: Proc. ICIP 1997.

[24] G.K. Wu and T.R. Reed, "3-D segmentation-based video processing", to appear in Proceedings of the Thirtieth Annual Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, November 3-6, 1996.

[25] Witana, V. and Richards, A., "A QoS Framework for Heterogeneous Environments", DSTC Symposium, Australia, 1997.

[26] Mitchell, J. L., Pennebaker, W. B., Fogg, C.E., and LeGall D. J., (eds), "MPEG Video Compression Standard", Chapman and Hall, New York, 1997, p177.

[27] Hector

[28] The wOrlds Project

