
Showing posts from February, 2022

RNN (6) - Results

In Pynq, the SW driver can be developed in a Python Jupyter notebook environment, and it can be implemented in relatively few lines, as in the linked source below. To understand how the SW operates, you need to understand the driving principle of the RTL block, which will be explained in detail in a separate post.  Looking at the results:  - Function: As the graph below shows, there is no significant difference between the results of the System Simulator and the results of the RTL, and the design runs correctly. However, the System Simulator and the RTL use different number systems: when rescaling the precision after multiplication, the RTL simply uses a bit-shift operation, while the System Simulator uses Python's division. That can account for the small differences in the results.  - Performance: Performance is compared by the total time the RNN takes to run. There may be limits to the precision of time measurement in Python, but since the order o...
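As an aside, the bit-shift vs. division mismatch described above can be reproduced in a few lines of plain Python. This is only a sketch: it assumes a fixed-point format with 7 fractional bits, and the widths and rounding rules in the actual design may differ.

```python
# Sketch of why the RTL and the System Simulator can disagree by 1 LSB.
# Assumes 7 fractional bits; the actual design's format may differ.

FRAC_BITS = 7

def to_fixed(x):
    """Quantize a float to a fixed-point integer with FRAC_BITS fractional bits."""
    return int(round(x * (1 << FRAC_BITS)))

def mul_rtl(a, b):
    """RTL-style multiply: rescale with an arithmetic right shift (floors the result)."""
    return (a * b) >> FRAC_BITS

def mul_sim(a, b):
    """Simulator-style multiply: rescale with Python division and rounding."""
    return round((a * b) / (1 << FRAC_BITS))

a, b = to_fixed(0.05), to_fixed(0.1)
print(mul_rtl(a, b), mul_sim(a, b))  # → 0 1 — the two rules differ by 1 LSB
```

The shift floors while the division rounds, so individual products can differ by one least-significant bit, and those differences accumulate across the network.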

RNN (5) - FPGA System Design

This is the last post on hardware implementation. The RTL code in the previous posts was designed as an HW accelerator IP that performs the RNN, and this post describes how to design the whole system targeting an FPGA. The FPGA board used here is Pynq-z2, an open-hardware board based on the Xilinx ZYNQ-7000. (For Pynq-z2, please refer to the links below.) -  http://www.pynq.io/ -  https://www.tulembedded.com/FPGA/ProductsPYNQ-Z2.html -  https://pynq.readthedocs.io/en/latest/ First, the RNN accelerator must open a channel to communicate with the ARM Cortex-A9 where the SW runs. IP packaging is performed using Xilinx Vivado, and I used the AXI interface template that Vivado provides. The RTL code in the code window on the right of the figure below shows how to combine the RNN IP with the AXI interface. IP packaging can be done easily in a few steps in Vivado. (Details will be posted separately.) After IP packaging, you need to create a bit file to downl...
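From the SW side, talking to such an accelerator over AXI-Lite typically follows a write–start–poll–read sequence. The sketch below models that handshake in stdlib-only Python, with a mock register file standing in for Pynq's MMIO class; the register offsets and bit positions here are hypothetical, not the ones in the actual design.

```python
# Stdlib-only sketch of an AXI-Lite driver handshake. A mock register file
# stands in for Pynq's MMIO; all offsets and bit assignments are hypothetical.

CTRL, STATUS, DATA_IN, DATA_OUT = 0x00, 0x04, 0x08, 0x0C
START_BIT, DONE_BIT = 0x1, 0x2

class MockMMIO:
    """Minimal stand-in for a memory-mapped register window: word read/write."""
    def __init__(self):
        self.regs = {}
    def write(self, offset, value):
        self.regs[offset] = value
        if offset == CTRL and value & START_BIT:
            # Mock accelerator: "compute" immediately and raise DONE,
            # so the polling loop below terminates without real hardware.
            self.regs[DATA_OUT] = self.regs.get(DATA_IN, 0) + 1
            self.regs[STATUS] = DONE_BIT
    def read(self, offset):
        return self.regs.get(offset, 0)

def run_accel(mmio, sample):
    """Driver sequence: load input, pulse START, poll DONE, read result."""
    mmio.write(DATA_IN, sample)
    mmio.write(CTRL, START_BIT)
    while not (mmio.read(STATUS) & DONE_BIT):
        pass  # on real hardware this loop polls the status register
    return mmio.read(DATA_OUT)

print(run_accel(MockMMIO(), 41))  # → 42
```

On the real board the same sequence would go through Pynq's overlay/MMIO objects instead of the mock, but the control flow of the driver is the same.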

RNN (4) - RTL Design

I designed the RTL in the same way as the System Simulator from the previous post. The major considerations for the RTL design are as follows.  (A detailed code description will come in a separate post. There are also many good tutorials on how to use Intel Quartus and ModelSim, so I recommend searching for one; I will post an explanation of simulating this code separately.) - I/O method: The most common method is to save the input in memory, drive the RNN, save the predicted values to an output memory, and then raise a completion signal. The memory can be chosen in various ways: FIFO, single-port, dual-port, etc. Here, single-port memory is selected as the simplest.  - Timing: RTL stands for Register Transfer Level. I don't know who coined the term, but it describes the abstraction well. A register is a component whose output changes at the edge of the clock. Since there is some logic ...
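The register-transfer idea above can be illustrated in a few lines of Python: a register has a next-state input and a current output, and the output only updates at the clock edge, so two registers in series form a pipeline with one cycle of latency per stage. This is a toy illustration, not the actual design.

```python
# Toy register-transfer sketch: state updates only on the clock "edge",
# mirroring how RTL registers behave. Illustration only, not the design.

class Reg:
    """A register: `d` is the next-state input, `q` the current output.
    `q` takes the value of `d` only when clock() is called (the clock edge)."""
    def __init__(self, init=0):
        self.q = init
        self.d = init
    def clock(self):
        self.q = self.d

# Two registers in series: the combinational logic between edges computes
# next-state inputs from the registers' *current* outputs.
r1, r2 = Reg(), Reg()
trace = []
for x in [1, 2, 3, 4]:
    r1.d = x                 # combinational: stage-1 input
    r2.d = r1.q + 10         # combinational: stage 2 sees r1's current output
    r1.clock(); r2.clock()   # clock edge: both registers update together
    trace.append(r2.q)
print(trace)  # → [10, 11, 12, 13]
```

Note that `r2` lags the input by one cycle: that per-stage latency is exactly what RTL timing design has to account for.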

RNN (3) - System Simulator

The Pytorch reference in the previous post uses optimized libraries and floating-point numbers, so its performance can be considered the most ideal. The system simulator for the Verilog design is subject to some restrictions.  - Fixed-point numbers: Because the range of representable numbers is limited, performance restrictions follow.  - Suboptimal matrix operations: The existing Pytorch model has been optimized for efficient and accurate operation in the Python environment; the System Simulator that is actually implemented differs in performance.  - Non-linear activation function: Here, tanh (hyperbolic tangent) is used, and a linear approximation is made for ease of implementation.  A 16-bit fixed-point number is used. In fact, not all 16 bits are used: only 7 fractional bits are used for easy calculation. Unused redundant bits should be optimized away in the future, but truncation to prevent overflow during every Addition/Multiplicati...
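The first and third restrictions can be sketched together in a few lines of Python: quantize to a 16-bit fixed-point value with 7 fractional bits, and replace tanh with a piecewise-linear approximation. The single-segment breakpoints below are illustrative, not the ones used in the actual design.

```python
# Sketch of the fixed-point and activation restrictions: 16-bit values with
# 7 fractional bits, and a piecewise-linear tanh. Breakpoints are illustrative.

FRAC_BITS = 7
SCALE = 1 << FRAC_BITS

def quantize(x):
    """Map a float to a fixed-point integer, saturating to the 16-bit range."""
    v = int(round(x * SCALE))
    return max(-(1 << 15), min((1 << 15) - 1, v))

def tanh_pwl(x):
    """Piecewise-linear tanh: identity near zero, clipped to ±1 outside.
    A real design would use more segments for better accuracy."""
    if x > 1.0:
        return 1.0
    if x < -1.0:
        return -1.0
    return x

for x in (-2.0, -0.5, 0.25, 3.0):
    print(x, quantize(tanh_pwl(x)))  # e.g. 0.25 maps to 32 (= 0.25 * 128)
```

The gap between `tanh_pwl` and the true tanh, plus the quantization step, is the source of the small functional differences reported against the Pytorch reference.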

RNN (2) - Reference model

RNNs have seen increasing use in recent years, in applications such as language translation, automatic sentence completion, card-fraud prevention, and prediction of time-series data such as stocks and weather. Here, we will use a model that predicts time-series data. Before RNNs, such data were predicted using a statistical method called ARIMA. Using a kind of filtering technique, the relatively near future could be predicted, and with the advent of RNNs, more accurate predictions became possible by learning from big data. In particular, periodic data can be predicted with ARIMA because it has a temporal autocorrelation structure.  Anyway, we'll implement this model on a relatively simple and easy-to-check example. Of course, a CNN could also be used to build such a model, but it does not show optimal performance because it can only take fixed-length input. If there is an opportunity in the future, I will try ...
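Whatever model is used, a time-series prediction task first needs the series cut into input windows, each paired with the sample that follows it. The stdlib-only sketch below prepares a sine wave that way; the window length and series period are arbitrary choices for illustration, not values from the reference model.

```python
# Stdlib sketch of preparing a sine-wave series for next-step prediction:
# slice the sequence into fixed-length windows, each paired with the sample
# that follows it. Window length and period are arbitrary illustrative values.

import math

def make_windows(series, window=8):
    """Return (inputs, targets): inputs[i] is a window, targets[i] the next sample."""
    inputs = [series[i:i + window] for i in range(len(series) - window)]
    targets = [series[i + window] for i in range(len(series) - window)]
    return inputs, targets

series = [math.sin(2 * math.pi * t / 32) for t in range(128)]
X, y = make_windows(series, window=8)
print(len(X), len(X[0]))  # → 120 8
```

In the Pytorch reference these windows become the training batches; in the hardware flow the same windowing decides how many samples must be loaded into the input memory per inference.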

RNN (Recurrent Neural Network) (1) - Example, Source repository

CNN already has too many tutorials, and it consists of various layers such as convolution and max-pooling in the middle. When you think of RNNs, you think of language translators, and then of LSTMs (Long Short-Term Memory) or GRUs (Gated Recurrent Unit). Since implementing those in hardware can take considerable time and effort, I selected an RNN example for sine-wave prediction that is as simple as possible yet shows the characteristics of an RNN well. Through this example, I would like to explain the whole process of how an ML model built in Python can be implemented as a hardware accelerator.  The development outline is as below:
1. Reference model development for wave prediction (Pytorch)
2. System Simulator development for RTL design from the Pytorch reference model (Python)
3. RTL IP development (ModelSim simulation)
4. Complete system design and FPGA bit-file creation using Xilinx Vivado
5. Write an overlay driver in Python on the Pynq-Z2 board and verify...

What is Machine Learning? (From a hardware engineer's point of view)

There are too many articles and blogs on this topic, so I think explaining it from scratch would add unnecessary traffic to the Internet. The most referenced resources are Google's Tensorflow tutorial and the Pytorch tutorial, and those who want to dig deeper will meet Andrew Ng's Stanford course (Coursera offers Andrew Ng's excellent lectures, and I strongly recommend taking the course). Most tutorials start by saying that anyone can use AI very easily without knowing the mathematical details. I think that is partly right and partly wrong. You should know at least basic linear algebra, such as matrix addition/multiplication and derivatives, to go one step further. Even though I completed tutorials (like MNIST) without any hurdle, I felt so lost that I never grasped what to do next. Most hardware engineers might feel the same. So, I decided to dig into what's inside Machine Learning to turn it into a hardware implementa...