High-Speed Video Encoding System Based on H.264 and DSP

In this paper, a high-speed H.264 encoder based on the ADI BF548 Blackfin DSP is proposed. To speed up the motion estimation (ME) module in H.264, we propose a two-step bit-transform-based normalized partial distortion search (TSB-NPDS) algorithm for fast ME that uses the characteristics of pattern similarity and matching errors. An initial standard-compliant raw-C encoder has been optimized for speed on the target processor. In addition, the parallelism between algorithm execution and data movement has been fully exploited using DMA. Experimental results demonstrate that the encoding rate can exceed 30 fps for QCIF video.


Introduction
In an H.264 encoding system, the motion estimation (ME) module is a core process of the video coding scheme because it enables the transmission of video signals at a lower bit rate. Efficient block matching algorithms (BMAs) have been adopted in modern video coding standards such as MPEG-1/2/4 and H.261/263/264. The experimental results in [1] show that the ME module consumes 60% (1 reference frame) to 80% (5 reference frames) of the total encoding time of an H.264 codec. To speed up the H.264 encoder, we adopt a two-step bit-transform-based NPDS (TSB-NPDS) algorithm to improve the search efficiency of the ME module.
Non-programmable encoders offer enough computational power at a low cost but cannot keep pace with the rapid evolution of video encoding algorithms. On the other hand, the latest generation of digital signal processors (DSPs) [2][3][4] can support very flexible encoders, like the multi-format one proposed in [5], at a relatively low cost. On such programmable platforms, however, applications must be optimized to the maximum possible extent.

Fast ME modules
The full search algorithm (FSA) is the most straightforward BMA for the ME module; it provides an optimal solution by matching all the candidate blocks inside a search window. However, the computational complexity of FSA is too high for real-time applications. Therefore, many fast ME algorithms have been proposed to reduce the computation of FSA [6][7]. Among these, the normalized partial distortion search (NPDS) [6] is a good approach, but its search speed is still unsatisfactory in practical applications. To overcome this problem, we adopt a two-step bit-transform-based NPDS (TSB-NPDS) algorithm to improve the search efficiency of the ME module.
NPDS. The ME module obtains a motion vector (MV) for a target macroblock (MB) by block matching, which minimizes a measure of matching distortion between the target MB in the current frame and a candidate MB within a search window in a reference frame. The displacement between the candidate MB with the smallest distortion and the target MB is selected as the resulting MV. One of the most frequently used criteria for measuring the matching distortion is the sum of absolute differences (SAD). The SAD between a target MB and a candidate MB is the sum of the absolute differences between their corresponding pixels. In NPDS, the SAD is split into partial distortions; the position of the pth partial distortion is given by the offset of its upper-left corner point from the upper-left corner point of the candidate block. The pth accumulated partial distortion is defined as the sum of the first p partial distortions. The NPDS matches all the search points inside the search window, as the FSA does.
Proposed TSB-NPDS. From the study of NPDS, we find that a proper matching scan order is an important factor in how quickly impossible candidate MVs are eliminated. However, the NPDS uses an evenly dithered order, which only ensures that the pixels of each partial distortion are evenly distributed over the block. Therefore, the speedup in finding MVs is limited by this uniform calculation order.
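The NPDS procedure described above can be sketched as follows. This is a minimal Python illustration, not the encoder's DSP code. The function names, the split of a 16×16 MB into 16 partial distortions sampled on a 4-pixel grid, the caller-supplied `order` of offsets, the use of the zero-MV SAD as the initial minimum, and the normalized rejection test are all assumptions made for illustration; the exact rule is given in [6].

```python
def sad(cur, ref, x, y, u, v, n=16):
    """Full SAD between the n x n block at (x, y) in cur and the block
    displaced by the candidate motion vector (u, v) in ref."""
    return sum(abs(cur[y + j][x + i] - ref[y + v + j][x + u + i])
               for j in range(n) for i in range(n))


def partial_distortion(cur, ref, x, y, u, v, s, t, n=16, step=4):
    """Distortion over the pixels at offset (s, t) on a step-spaced grid."""
    return sum(abs(cur[y + t + j][x + s + i] - ref[y + v + t + j][x + u + s + i])
               for j in range(0, n, step) for i in range(0, n, step))


def npds_search(cur, ref, x, y, search_range, order, n=16, step=4):
    """Return ((u, v), distortion) of the best candidate found.

    order lists the (s, t) offsets of the partial distortions in the
    sequence in which they are accumulated (assumed to cover the block).
    """
    height, width = len(ref), len(ref[0])
    groups = len(order)
    best_mv, best = (0, 0), sad(cur, ref, x, y, 0, 0, n)  # assumed start point
    for v in range(-search_range, search_range + 1):
        for u in range(-search_range, search_range + 1):
            if (u, v) == (0, 0):
                continue
            # Skip candidates that fall outside the reference frame.
            if x + u < 0 or y + v < 0 or x + u + n > width or y + v + n > height:
                continue
            acc, rejected = 0, False
            for p, (s, t) in enumerate(order, start=1):
                acc += partial_distortion(cur, ref, x, y, u, v, s, t, n, step)
                # Assumed normalized test: compare the accumulated partial
                # distortion with the matching fraction of the current minimum.
                if groups * acc > p * best:
                    rejected = True
                    break
            if not rejected and acc < best:
                best_mv, best = (u, v), acc
    return best_mv, best
```

Because a candidate can be rejected after only a few partial distortions, the order in which the offsets are visited determines how much computation is saved, which is the point the proposed scan order addresses.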
To speed up NPDS, we make use of the fact that image blocks with lower pattern similarity have larger matching errors on average. The motivation of the proposed algorithm is therefore to use two significant features of the image block pattern, extracted from an MB by the two-bit transform (2BT), to find the impossible candidates faster. In addition, the key idea of the proposed method is to design a simple, low-complexity hardware mechanism that calculates an approximate distortion order for each candidate block. For simplicity, after several experiments, we take the mean value as the threshold variable in our work. The 2BT produces two bit-planes, $B_1$ and $B_2$: $B_1$ is obtained by thresholding at the mean, and $B_2$ uses a second threshold, offset further by half of the mean, to pick out the binary contrast pattern. If the bit-plane pattern of a sub-block is dissimilar to that of another sub-block, the two blocks are unlikely to have similar image characteristics, which results in a large matching error.
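As a rough illustration of the bit-plane extraction, the sketch below derives two bit-planes (called `b1` and `b2` here) from a block and compares bit-planes by Hamming distance. The exact thresholding rule is not recoverable from the text above, so the rule used here is an assumption: `b1` marks pixels at or above the block mean, and `b2` marks pixels that deviate from the mean by at least half of the mean. The function names and the choice to compute the mean over the block passed in are likewise hypothetical.

```python
def two_bit_transform(block):
    """Return two bit-planes (b1, b2) for a 2-D list of gray levels.

    Assumed rule (not taken from the paper's equations):
      b1 = 1 where pixel >= mean              (direction pattern)
      b2 = 1 where |pixel - mean| >= mean / 2 (contrast pattern)
    Integer arithmetic is used so that no division is needed.
    """
    pixels = [p for row in block for p in row]
    total, count = sum(pixels), len(pixels)
    b1 = [[1 if p * count >= total else 0 for p in row] for row in block]
    b2 = [[1 if 2 * abs(p * count - total) >= total else 0 for p in row]
          for row in block]
    return b1, b2


def hamming(plane_a, plane_b):
    """Number of positions where two bit-planes differ (XOR, then count)."""
    return sum(a ^ b
               for row_a, row_b in zip(plane_a, plane_b)
               for a, b in zip(row_a, row_b))
```

For the 2×2 block `[[28, 32], [10, 50]]` (mean 30), this gives `b1 = [[0, 1], [0, 1]]` and `b2 = [[0, 0], [1, 1]]`: the top row differs from the mean in direction only, while the bottom row also shows strong contrast.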

Advanced Engineering Forum Vols. 6-7 417
Based on the above observation, we propose a two-step binary pattern matching scan algorithm to achieve a further computational reduction over NPDS. In the first step, the Hamming distance $H_1$ between the two $B_1$ bit-planes of the sub-blocks in the target MB and in the candidate MB at the initial search point is used as the measure of binary direction pattern similarity. $H_1$ is first used to determine the order of matching priority. Note, however, that blocks with different gray-level contrast may have the same direction pattern, and thus the same bit-plane and the same $H_1$. To further separate the gray-level pattern similarity of two sub-blocks with the same $H_1$, the binary contrast pattern $B_2$ is adopted in the second step. The Hamming distance $H_2$ of the second step is defined in the same way on the $B_2$ bit-planes. Then, the values of $H_2$ for the same $H_1$ are re-sorted in descending order.
In summary, the first step computes edge directional differences using binary pattern matching for all pixels in the center block of the search window, that is, the candidate with the null motion vector. The Hamming distances of the sub-blocks in an MB are sorted in descending order. In the second step, sub-blocks with the same $H_1$ are re-sorted in descending order of $H_2$. The sub-block positions sorted by this two-step matching scan then determine the order of matching priority for all the candidate blocks in the search window. Following this sub-block order, we find the best MV using the same search procedure as the NPDS.
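The two-step ordering can be illustrated with a short sketch. It assumes that the bit-planes of the target MB and of the zero-MV candidate MB are already available, that sub-blocks are contiguous squares (4×4 by default) indexed in row-major order, and that ties in both distances keep their original order; the function names are hypothetical.

```python
def subblock_hamming(plane_a, plane_b, size=4):
    """Hamming distance of each size x size sub-block, in row-major order."""
    n = len(plane_a)
    distances = []
    for top in range(0, n, size):
        for left in range(0, n, size):
            distances.append(sum(
                plane_a[top + j][left + i] ^ plane_b[top + j][left + i]
                for j in range(size) for i in range(size)))
    return distances


def matching_order(h1, h2):
    """Sub-block indices sorted by the first-step distance (descending),
    with ties broken by the second-step distance (descending)."""
    return sorted(range(len(h1)), key=lambda k: (-h1[k], -h2[k]))
```

For example, with first-step distances `[3, 5, 3, 0]` and second-step distances `[1, 0, 4, 2]`, the resulting order is `[1, 2, 0, 3]`: sub-block 1 comes first, and the tie between sub-blocks 0 and 2 is resolved in favor of sub-block 2 because its second-step distance is larger.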
Figure 1 shows the average partial distortions versus the calculation order determined by the proposed two-step bit-transform-based matching scan. The matching errors increase with the Hamming distance between two blocks. This indicates that larger matching errors are obtained by calculating the matching distortions of image areas with large Hamming distances. The figure thus confirms that, in our method, the earlier a sub-block appears in the calculation order, the higher its matching distortion. The first partial distortions are much larger than those for the evenly dithered order.
Fig. 1 The SAD versus calculation order determined by the TSB-NPDS matching scan.

Information Technology for Manufacturing Systems III

The architecture of ADSP-BF548
A simplified block diagram of the DSP internal architecture is shown in Fig. 2. The CPU is a Blackfin processor with a performance of up to 4800 MIPS at 600 MHz. The ADI BF548 is a 16/32-bit processor that delivers high performance at low power consumption, which makes it attractive for embedded applications. The 192 KB internal SRAM can be divided into a 64 KB level-1 cache for program (L1P) and 64 KB for data (L1D). The level-2 cache (L2), unified for data and program, is 128 KB. A 64 MB external DDR RAM (level 3, L3) can be accessed through a dedicated 64-bit data interface. The other peripherals are a direct memory access (DMA) controller, video ports, an Ethernet port (EMAC), an audio output interface and several general-purpose I/O pins. The DMA controller moves data between memory and peripherals.
The BF548 is based on RISC principles, so its regular instruction set and the related decode mechanism are comparatively simple. This simplicity results in a high instruction throughput and an impressive real-time interrupt response from a small and cost-effective chip. Pipelining is employed so that all parts of the processing and memory systems can operate continuously. Typically, while one instruction is being executed, its successor is being decoded, and a third instruction is being fetched from memory. The memory interface has been designed to allow the performance potential to be realized without incurring high costs in the memory system. Unlike traditional embedded video solutions that use two processor cores to provide video functionality, the BF548 provides a convergent solution in a unified core architecture that allows voice and video signal processing to run concurrently with RISC MCU processing that handles network and user-interface demands. This ability to offer full functionality on a single convergent processor provides a unified software development environment, faster system debugging and deployment, and lower overall system cost.

H.264 Encoding Implementation
The encoder implements the Baseline Profile (BP) of the H.264 video coding standard [8]. The starting point was a standard-compliant C encoder that was first fully tested in a PC environment and then moved to the DSP environment. This initial code was optimized to increase the execution speed by about two orders of magnitude.
× sub-block in the current frame and the reference frame, respectively; and ⊕ denotes the modulo-2 addition (XOR logic operation). The Hamming distance for each sub-block between the target MB and the candidate MB of the initial searching point is carried out, and the values are sorted in descending order. Therefore, the 1

Table 1: The BP profile information of the encoding process.