

Will the FPGA Be the Main Chip in the Robot Industry?

The author last revised this article on January 13, 2020.

Abstract

Ten years ago, Bill Gates, the founder of Microsoft, set out his vision for the future in the article "A Robot in Every Home": robots would enter every household, much as the personal computer did, and serve people. After the rapid development of artificial intelligence and intelligent hardware over the past few years, I firmly believe that by 2016 the technology had matured. Intelligent robots are quickly entering the era of commercialization, and Gates's vision is very likely to come true within the next 5 to 10 years.

To make a robot intelligent, we must first give it the ability to perceive. Perceptual computing, especially vision and deep learning, is usually computationally intensive and demands high performance. However, a robot's battery capacity is limited, so the energy that can be allocated to computation is relatively small. In addition, because perception algorithms keep evolving, the robot's perception processor must be updated continually. Compared with other processors, the FPGA offers low power consumption, high performance, and programmability, which makes it well suited to perceptual computing. This article first analyzes the characteristics of the FPGA, then describes how FPGAs accelerate perception algorithms and save energy, and finally discusses FPGA support in the Robot Operating System (ROS).



Catalog

I FPGA: High Performance, Low Power, Programmable

II FPGA High Performance

III Advantage of FPGA--Low Energy Consumption

IV Hardware Programming

V Perceptual Computing Acceleration on the FPGA

VI Feature Extraction and Location Tracking

VII Deep Learning on the FPGA

VIII FPGA and Robot Operating System (ROS) Combination

IX Future of FPGA

X Book Recommendation


I FPGA: High Performance, Low Power, Programmable

Compared with other computing platforms such as CPUs and GPUs, the FPGA offers high performance, low power consumption, and hardware programmability. Figure 1 shows the FPGA hardware architecture. Each FPGA consists of three main components. The input/output logic handles communication between the FPGA and external components such as sensors. The computing logic elements are used to build computing modules. The programmable interconnect network connects different computing logic elements to form a computing unit. During programming, we map the computation onto the hardware and connect the logic elements by configuring the interconnect so that together they complete a computational task. For example, to perform image feature extraction, we connect the FPGA's input logic to the camera's output so that images can enter the FPGA. We then connect the input logic to multiple computing logic elements, which extract the feature points of each image region in parallel. Finally, we connect these computing logic elements to the FPGA's output logic, which outputs the combined feature points. In this way, an FPGA typically implements an algorithm's data flow and operations directly in hardware logic, which avoids the instruction fetch and instruction decode stages of a CPU.


Figure 1. FPGA hardware architecture
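The image feature extraction example above can be mimicked in software to make the data flow concrete. The following is only a minimal sketch, not how an FPGA is actually programmed: the function names, the brightness threshold, and the "bright pixel" detector are all hypothetical stand-ins, and Python threads only imitate the structure of parallel logic elements rather than their speed.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_bright_points(top, region, threshold=200):
    """Compute stage (hypothetical detector): return the (row, col) of every
    pixel in this region whose value exceeds the threshold."""
    return [
        (top + r, c)
        for r, row in enumerate(region)
        for c, value in enumerate(row)
        if value > threshold
    ]


def pipeline(image, rows_per_region=2):
    """Input stage -> parallel compute stage -> output stage."""
    # Input stage: cut the image (a list of pixel rows) into horizontal bands.
    tops = range(0, len(image), rows_per_region)
    regions = [image[top:top + rows_per_region] for top in tops]

    # Compute stage: one worker per band, standing in for one logic element.
    with ThreadPoolExecutor() as pool:
        per_region = list(pool.map(extract_bright_points, tops, regions))

    # Output stage: merge the per-region results into one summary list.
    return [point for points in per_region for point in points]
```

Calling `pipeline` on a small image returns the merged list of points in top-to-bottom order, which corresponds to the "summary of the feature points" leaving the output logic.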


II FPGA High Performance

Although an FPGA's clock frequency is generally lower than a CPU's, an FPGA can implement computing hardware with a very high degree of parallelism. For example, a typical CPU can handle only 4 to 8 instructions at a time, whereas data parallelism on an FPGA can handle 256 or more operations at a time, so the FPGA can process much more data than the CPU in the same period. In addition, as noted above, FPGAs generally do not need instruction fetch or instruction decode, which saves the time these pipeline stages would take.
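A back-of-the-envelope calculation shows how wide parallelism can outweigh a lower clock. The sketch below uses the widths quoted above (8 and 256 operations at a time); the clock values of 3 GHz and 300 MHz are assumptions chosen for illustration, and the result is a peak figure that ignores memory bandwidth and other real-world limits.

```python
def peak_ops_per_second(clock_hz, ops_per_cycle):
    """Idealized peak throughput: clock frequency times operations per cycle."""
    return clock_hz * ops_per_cycle


# Assumed clocks for illustration only; widths are the figures from the text.
cpu_peak = peak_ops_per_second(3_000_000_000, 8)     # 3 GHz, 8-wide
fpga_peak = peak_ops_per_second(300_000_000, 256)    # 300 MHz, 256-wide

# Ratio of FPGA peak to CPU peak under these assumptions.
fpga_advantage = fpga_peak / cpu_peak
```

Under these assumed numbers the FPGA runs at one tenth of the CPU clock yet has about 3.2 times the peak throughput, because it is 32 times wider.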

To give readers a better understanding of FPGA acceleration, we summarize a 2010 Microsoft Research study on accelerating BLAS with FPGAs. BLAS (Basic Linear Algebra Subprograms) is a low-level library of matrix operations that is widely used in high-performance computing, machine learning, and other fields. In this study, Microsoft researchers analyzed the speedup and energy consumption of BLAS on CPUs, GPUs, and FPGAs. Figure 2 compares the time taken by each iteration of the GaxPy algorithm on the FPGA, CPU, and GPU. Both the GPU and the FPGA run about 60% faster than the CPU. The figure shows a small matrix operation; as the matrix grows, the speedup of the GPU and FPGA over the CPU becomes more pronounced.


Figure 2. GaxPy Algorithm Performance Comparison (unit: microseconds)
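For readers unfamiliar with GaxPy, it is the matrix-vector update y = y + A x. The sketch below is a plain reference version written for clarity, not the implementation benchmarked in the study; the function name and the column-oriented loop order are our own choices. Note that the inner loop performs independent multiply-adds on different elements of the result, which is the kind of work an FPGA can spread across many logic elements.

```python
def gaxpy(A, x, y):
    """Return y + A @ x for a matrix A (list of rows) and vectors x, y.

    Column-oriented form: each column of A is scaled by the matching
    element of x and accumulated into the result.
    """
    m, n = len(A), len(x)
    result = list(y)  # copy so the caller's y is left untouched
    for j in range(n):
        for i in range(m):
            # The m updates for a fixed j are independent of one another.
            result[i] += A[i][j] * x[j]
    return result
```

For example, with A = [[1, 2], [3, 4]], x = [1, 1] and y = [10, 20], the result is [13, 27].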


III Advantage of FPGA--Low Energy Consumption

There are two main reasons why FPGAs have a significant power advantage over CPUs and GPUs. First, an FPGA has no instruction fetch or instruction decode. In Intel CPUs, the decoder alone accounts for about 50% of the chip's energy consumption because of the CISC architecture. In GPUs, fetch and decode also consume 10% to 20% of the energy. Second, the clock frequency of an FPGA is much lower than that of a CPU or GPU. CPUs and GPUs usually run between 1 GHz and 3 GHz, while FPGAs run below 500 MHz. This large difference in frequency means that the FPGA consumes much less energy than the CPU and GPU.

Figure 3 compares the energy consumption of each iteration of the GaxPy algorithm on the FPGA, CPU, and GPU. The CPU and GPU consume similar amounts of energy, while the FPGA consumes only about 8% as much. Thus the FPGA computes about 60% faster than the CPU while using only about 1/12 of the energy. This is a considerable advantage: where energy is constrained, using an FPGA will extend battery life substantially.


Figure 3. GaxPy algorithm energy consumption comparison (unit: millijoules)


IV Hardware Programming

Because FPGAs are hardware-programmable, their hardware logic can be updated iteratively, which is not possible with ASICs. However, FPGAs are also criticized because mapping algorithms onto FPGA hardware is not easy: the technical barrier is much higher than for programming CPUs and GPUs, and the development cycle is much longer.


Figure 4. Traditional FPGA development process and C-to-FPGA development process

Figure 4 compares the traditional FPGA development process with the C-to-FPGA development process. In the traditional process, we have to translate algorithms written in C/C++ piece by piece into a hardware description language such as Verilog, then compile the Verilog and write the logic to the hardware. With the progress of FPGA technology in recent years, compiling directly from C to the FPGA has gradually matured and is widely used at Baidu. In the C-to-FPGA process, we add pragmas to the C/C++ code to mark which compute kernels should be accelerated, and the C-to-FPGA engine then compiles that code into hardware automatically. In our experience, a project takes about six months with the traditional process but only about two weeks with the C-to-FPGA process, an efficiency gain of more than 10 times.


V Perceptual Computing Acceleration on the FPGA

This part of the article focuses on accelerating robot perception computing on FPGAs, specifically feature extraction and location tracking (which can be thought of as the robot's eyes) and deep learning (which can be thought of as the robot's brain). Once a robot has eyes and a brain, it can move through space, locate itself, and recognize what it sees as it moves.


VI Feature Extraction and Location Tracking

The main algorithms for feature extraction and location tracking include SIFT, SURF, and SLAM. SIFT (Scale-Invariant Feature Transform) is an algorithm for detecting local features. It finds the feature points in an image together with descriptors of their scale and orientation, and uses them to match feature points between images. SIFT feature matching can handle matching between two images under translation, rotation, and affine transformation, and has strong matching ability. The SIFT algorithm has three main stages: 1. extract the key points; 2. attach detailed information (local features), also known as descriptors, to the key points; 3. compare the feature points (key points with their feature vectors) of the two images pairwise to find pairs that match each other, which establishes the correspondence between the scenes.

The SURF (Speeded-Up Robust Features) algorithm is an improvement on SIFT that raises execution efficiency mainly by using integral images and Haar wavelet responses.

SLAM stands for simultaneous localization and mapping. Its purpose is to build a map of the route while the robot is moving and, at the same time, determine the robot's position in that map. With this technology, robots can be located without the aid of external signals (Wi-Fi, beacons, GPS), which is particularly useful for indoor positioning. A principal localization method is to use a Kalman filter to fuse information from different sensors (images, gyroscopes) and infer the robot's current position.

To help readers understand the acceleration and energy savings that FPGAs bring to feature extraction and location tracking, let's look at a UCLA study that accelerated feature extraction and SLAM algorithms on FPGAs. Figure 5 shows the speedup of the FPGA relative to the CPU for SIFT feature matching, SURF feature matching, and SLAM. With the FPGA, SIFT and SURF feature matching ran 30 times and 9 times faster, and SLAM ran 15 times faster. If images arrive at 30 FPS, the perception and localization algorithms must process each image within 33 milliseconds; that is, one round of feature extraction and SLAM must finish within 33 milliseconds, which puts great pressure on the CPU. With the FPGA, the whole pipeline runs more than 10 times faster, which makes high-frame-rate data processing possible.
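To illustrate the sensor fusion idea, here is a minimal one-dimensional Kalman filter. It is a sketch under strong simplifying assumptions: the robot's position is a single number, a motion sensor such as a gyroscope or odometer supplies the movement in the predict step, and a camera-derived position supplies the measurement in the update step. The function names are our own, and real SLAM systems track multi-dimensional state with matrix forms of these equations.

```python
def kalman_predict(estimate, variance, motion, motion_variance):
    """Predict step: shift the estimate by the measured motion and grow
    the uncertainty by the motion sensor's noise."""
    return estimate + motion, variance + motion_variance


def kalman_update(estimate, variance, measurement, measurement_variance):
    """Update step: blend the prediction with a position measurement,
    weighting each by how certain it is."""
    gain = variance / (variance + measurement_variance)
    new_estimate = estimate + gain * (measurement - estimate)
    new_variance = (1 - gain) * variance
    return new_estimate, new_variance


# One fusion cycle with made-up numbers.
position, uncertainty = 0.0, 1.0
position, uncertainty = kalman_update(position, uncertainty, 2.0, 1.0)
position, uncertainty = kalman_predict(position, uncertainty, 1.0, 0.25)
```

When the prediction and the measurement are equally uncertain, the update lands halfway between them and halves the variance, which is why fusing two noisy sensors gives a better estimate than either alone.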


Figure 5. Perception algorithm performance comparison (unit: speedup)

Figure 6 shows the energy-saving ratio of the FPGA relative to the CPU when running the SIFT, SURF, and SLAM algorithms. With the FPGA, SIFT and SURF achieved energy-saving ratios of 1.5 times and 1.9 times, while SLAM achieved 14 times. In our experience, if a robot runs this set of perception algorithms on a multi-core mobile CPU powered by a cell phone battery, the battery runs out in about 40 minutes. If the same computation runs on an FPGA, the battery lasts more than six hours, an overall energy saving of about 10 times. The overall figure is close to the SLAM ratio because SLAM is much more computationally intensive than feature extraction and therefore dominates the total energy use.


Figure 6. Perception algorithm energy consumption comparison (unit: energy-saving ratio)
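The way per-algorithm ratios combine into one overall saving can be sketched as a weighted calculation. In the sketch below, the energy-saving ratios (1.5, 1.9, 14) and the battery times (40 minutes, six hours) come from the text, but the shares of CPU energy spent on each algorithm are hypothetical values picked only to show the effect of SLAM dominating; the source does not report them.

```python
def overall_energy_saving(cpu_energy_shares, saving_ratios):
    """Overall saving = total CPU energy / total FPGA energy, where each
    algorithm's FPGA energy is its CPU energy divided by its saving ratio."""
    fpga_energy = sum(
        share / ratio for share, ratio in zip(cpu_energy_shares, saving_ratios)
    )
    return sum(cpu_energy_shares) / fpga_energy


def battery_life_ratio(cpu_minutes, fpga_minutes):
    """How many times longer the same battery lasts."""
    return fpga_minutes / cpu_minutes


ratios = [1.5, 1.9, 14.0]           # SIFT, SURF, SLAM (from Figure 6)
assumed_shares = [0.02, 0.02, 0.96]  # hypothetical split of CPU energy

overall = overall_energy_saving(assumed_shares, ratios)
life = battery_life_ratio(40, 6 * 60)  # 40 minutes versus six hours
```

With these assumed shares the overall saving comes out at about 10.8 times, and the battery-life ratio for 40 minutes versus six hours is 9, both in line with the "about 10 times" quoted above. If feature extraction took a larger share of the energy, the overall saving would drop toward the smaller SIFT and SURF ratios.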

To summarize the data, using an FPGA for perception and localization not only raises the frame rate at which the robot can perceive, which makes sensing more accurate, but also saves energy, so that computation can continue for several hours. Once the perception algorithms have stabilized and demand for the chip reaches sufficient volume, the FPGA design can also be turned into an ASIC to further improve performance and reduce power consumption.


VII Deep Learning on the FPGA

A deep neural network is a neural network with at least one hidden layer. Like shallow neural networks, deep neural networks can model complex nonlinear systems, but the extra layers give the model higher levels of abstraction and so increase its capability. In the past few years, convolutional neural networks (CNNs) have made great progress in computer vision and automatic speech recognition. In vision, Google, Microsoft, and Facebook keep setting new recognition-rate records in the ImageNet contests. In speech recognition, Baidu's DeepSpeech 2 system significantly improved on earlier systems, reducing the word error rate to about 7%.

To show readers how FPGAs accelerate deep learning and save energy, let's look at a joint study by Peking University and the University of California on FPGA-accelerated CNNs. Figure 7 compares the time the FPGA and the CPU take to run a CNN. One iteration takes 375 milliseconds on the CPU but only 21 milliseconds on the FPGA, a speedup of about 18 times. If the CNN must run in real time, for example keeping up with the camera frame rate (33 ms per frame), the CPU cannot meet the requirement, but after FPGA acceleration the CNN can keep up with the camera and analyze every frame.


Figure 7. CNN performance comparison (unit: milliseconds)
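The real-time argument can be checked with a few lines of arithmetic. This sketch uses the latencies quoted above (375 ms and 21 ms) and assumes a 30 FPS camera, which is where the 33 ms per-frame budget comes from; the helper names are our own.

```python
def frame_budget_ms(fps):
    """Time available to process one frame at the given frame rate."""
    return 1000.0 / fps


def meets_budget(latency_ms, fps):
    """True if one iteration finishes before the next frame arrives."""
    return latency_ms <= frame_budget_ms(fps)


cpu_ms, fpga_ms = 375.0, 21.0        # per-iteration latencies from the text
speedup = cpu_ms / fpga_ms           # roughly 18 times

cpu_ok = meets_budget(cpu_ms, 30)    # CPU against the 30 FPS budget
fpga_ok = meets_budget(fpga_ms, 30)  # FPGA against the 30 FPS budget
```

At 30 FPS the budget is about 33.3 ms, so the CPU at 375 ms misses it by more than a factor of ten while the FPGA at 21 ms fits with room to spare.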

Figure 8 compares the energy consumption of the FPGA and the CPU when running a CNN. One CNN operation consumes 36 joules on the CPU but only 10 joules on the FPGA, an energy-saving ratio of about 3.5 times. As with SLAM, accelerating deep learning on an FPGA while saving energy makes real-time computation much easier to run on a mobile platform.


Figure 8. CNN energy consumption comparison (unit: joules)


VIII FPGA and Robot Operating System (ROS) Combination

The sections above described how FPGAs accelerate perception algorithms and save energy, and showed that FPGAs have a large advantage over CPUs and GPUs in perceptual computing. This section describes how FPGAs are used in today's robotics industry, in particular together with the Robot Operating System (ROS).

The Robot Operating System (ROS) is an operating system architecture designed specifically for robot software development. It provides OS-like services, including hardware abstraction, low-level driver management, implementations of common functions, message passing between programs, and package management. It also provides tools and libraries for obtaining, building, writing, and running programs across multiple machines. The primary design goal of ROS is to improve code reuse in robotics development. ROS is a distributed framework of processes (known as nodes), which allows executables to be designed separately and loosely coupled at run time. These processes can be grouped into packages and stacks for sharing and distribution. ROS also supports a federated system of code repositories, so that collaboration can be distributed as well. ROS is now widely used in many kinds of robots and is gradually becoming the standard robot operating system. In the 2015 DARPA Robotics Challenge, over half of the participating robots used ROS.

As FPGAs evolve, more and more robots are using them, and more and more voices in the ROS community are calling for ROS-compliant FPGAs. One example is the Sandia Hand, a robotic hand from Sandia National Laboratories in the United States. As shown in Figure 9, the Sandia Hand uses FPGAs to preprocess the data returned by the camera and by the robot's palm, and then passes the preprocessed results to other ROS compute nodes.


Figure 9. ROS support for FPGAs in Sandia Hand

The Sandia Hand uses the Rosbridge mechanism to connect ROS to the FPGA. Rosbridge connects ROS and non-ROS programs through a JSON API. For example, a ROS program can connect to a non-ROS web front end through the JSON API. In the Sandia Hand design, a ROS node connects to the FPGA computing unit through the JSON API, passes data to the FPGA and issues computation commands, and then retrieves the results from the FPGA.

Rosbridge provides a mechanism for communication between ROS and FPGAs, but with this mechanism the ROS node does not run on the FPGA, and the JSON API connection introduces some performance loss. To couple FPGAs and ROS more tightly, researchers in Japan recently proposed a ROS-compliant FPGA design in which a ROS node runs directly on the FPGA. As shown in Figure 10, the FPGA implements an input interface that subscribes directly to a ROS topic, so that data flows seamlessly into the FPGA's computing unit. The FPGA also implements an output interface through which the ROS node on the FPGA publishes data directly, so that other ROS nodes subscribed to that topic can consume data straight from the FPGA. With this design, developers simply plug their own FPGA computing unit into the ROS-compliant FPGA framework to connect seamlessly with other ROS nodes.
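To give a feel for what travels over the Rosbridge JSON API, the sketch below builds three messages in the style of the rosbridge protocol, using its "advertise", "publish", and "subscribe" operations. The topic names are hypothetical and are not taken from the Sandia Hand software, and the transport that carries these strings (typically a WebSocket) is left out.

```python
import json


def advertise(topic, msg_type):
    """Announce that this client will publish messages of msg_type on topic."""
    return json.dumps({"op": "advertise", "topic": topic, "type": msg_type})


def publish(topic, msg):
    """Send one message on a topic."""
    return json.dumps({"op": "publish", "topic": topic, "msg": msg})


def subscribe(topic):
    """Ask to receive messages published on a topic."""
    return json.dumps({"op": "subscribe", "topic": topic})


# Hypothetical topics: send work to the FPGA side, listen for its results.
requests = [
    advertise("/fpga/command", "std_msgs/String"),
    publish("/fpga/command", {"data": "start_preprocessing"}),
    subscribe("/fpga/result"),
]
```

Every message is serialized to text and parsed again on the other side, which is one source of the overhead that a tighter ROS and FPGA coupling aims to remove.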


Figure 10. FPGA as part of ROS
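The subscribe-compute-publish pattern of a ROS-compliant FPGA component can be modeled with a toy in-process message bus. This is purely an illustration of the structure: the `Bus` and `FpgaNode` classes and the topic names are hypothetical, no ROS library is used, and the `kernel` function merely stands in for the developer's FPGA computing unit.

```python
from collections import defaultdict


class Bus:
    """Toy stand-in for ROS topics: delivers each message to all subscribers."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in list(self._subscribers[topic]):
            callback(message)


class FpgaNode:
    """Wraps a compute kernel with an input interface (subscribe) and an
    output interface (publish), as in the design described above."""

    def __init__(self, bus, in_topic, out_topic, kernel):
        self._bus = bus
        self._out_topic = out_topic
        self._kernel = kernel
        bus.subscribe(in_topic, self._on_input)

    def _on_input(self, message):
        # Data flows in, through the kernel, and straight back out.
        self._bus.publish(self._out_topic, self._kernel(message))
```

A downstream consumer subscribes to the output topic in the same way as it would to any other node, so from its point of view the FPGA-backed component is just one more publisher.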

According to a recent interview with the Open Source Robotics Foundation, the organization that maintains ROS, more and more robot developers are using FPGAs as sensor computing units and controllers, and there is a growing need to integrate FPGAs into ROS. I believe ROS will soon offer a solution that is tightly coupled with the FPGA.


IX Future of FPGA

With its low power consumption, high performance, and programmability, the FPGA is very well suited to perceptual computing. Especially in energy-constrained situations, FPGAs have significant performance and power advantages over CPUs and GPUs. In addition, because perception algorithms keep evolving, the robot's perception processor needs to be updated constantly, and here FPGAs have an advantage over ASICs: their hardware can be upgraded and iterated. For these reasons, I firmly believe that the FPGA will be one of the most important chips of the robot era. Because of their low power consumption, FPGAs are well suited to sensor data preprocessing, and it is foreseeable that tight integration of FPGAs and sensors will soon be widespread. Then, as vision, speech, and deep learning algorithms on the FPGA are continually optimized, the FPGA will gradually replace the GPU and CPU as the main chip in the robot.


X Book Recommendation

(1) FPGA Based Visual Robot Control, by Barış Çelik, Ayça Ak, Vedat Topuz

Summary: Robot control that involves image processing, a workload that microprocessors handle inefficiently, is implemented using FPGA-based systems.

(2) Verilog by Example: A Concise Introduction for FPGA Design, by Blaine Readler

Summary: Starting with a simple but workable design sample, increasingly complex fundamentals of the language are introduced until all major features of Verilog are brought to light. Coverage includes state machines, modular design, FPGA-based memories, clock management, specialized I/O, and an introduction to simulation techniques.

