Posts

CFU Playground (3) - ML on a board (Arduino Nano 33 BLE)

I am going through this step without running TensorFlow Lite Micro in the RISC-V environment of the CFU Playground. If you are already familiar with TensorFlow Lite Micro this may not matter, but if you are not, like me, you need to port your own model and check that it works properly. The reason for choosing the Arduino Nano 33 BLE is that it has generous specifications compared to the Arduino Uno: a Cortex-M4 at 64 MHz, 256 KB SRAM, 1 MB flash memory, and built-in sensors such as a gyroscope. So you can do interesting experiments with it in the future. Starting with a board that has too little memory and poor CPU performance can cause many complicated errors, and you have to spend a lot of effort optimizing your code to fit into memory. https://store-usa.arduino.cc/products/arduino-nano-33-ble?selectedStore=us When it comes to price, the ESP32 DevKit can also be an alternative, but TensorFlow Lite Micro on the ESP32 is supported only with ESP-IDF, which is less convenient than the Arduino IDE. Above all, the advantage...
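As a rough illustration of the "port your own model" step, here is a minimal host-side Python sketch (not the post's actual code) that converts a trained Keras model to a .tflite flatbuffer and dumps it as a C array that a microcontroller sketch can compile in. The file and array names are placeholders.

```python
# Host-side sketch: convert a Keras model to .tflite and emit a C array,
# similar in spirit to `xxd -i model.tflite`. File names are hypothetical.
import tensorflow as tf

model = tf.keras.models.load_model("wave_predictor.h5")  # hypothetical model file

converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)

# Emit a header so the model can be linked into the firmware image.
with open("model_data.h", "w") as f:
    f.write("const unsigned char g_model_data[] = {\n")
    f.write(", ".join(str(b) for b in tflite_model))
    f.write("\n};\n")
    f.write(f"const unsigned int g_model_data_len = {len(tflite_model)};\n")
```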

CFU Playground (2) - CNN - TensorFlow Lite model

It is not very difficult to convert the code from PyTorch to TensorFlow one-to-one, but there are restrictions on using TensorFlow Lite Micro. TensorFlow Lite Micro provides various libraries, utilities, and examples in C++ so that models can run on a microcontroller with an ARM core or Tensilica core. However, not all neural networks are supported. CNNs definitely work, and a basic fully connected network is also available. An LSTM kernel exists, but I confirmed that errors occur when converting with the TensorFlow Lite converter. Separately, I have seen the GitHub repository of someone who earned a degree by implementing an LSTM in C. ( https://github.com/lephong/lstm-rnn ) It is worth looking at as a reference, but to minimize uncertainty I chose to convert the RNN model into a CNN model. Source code: https://github.com/bxk218/RNN_wave_predictor_verilog/blob/main/tf_2D_CNN_wavePrediction.ipynb After changing the time series sequence to a 6x6 array, it loo...
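To make the 6x6 idea concrete, here is a small illustrative sketch under assumed settings (a 36-sample window reshaped to 6x6 and a toy Conv2D regressor); the linked notebook contains the actual model and data handling.

```python
# Illustrative sketch: reshape a 1-D time series into 6x6 "images" and
# convert a small CNN with the TensorFlow Lite converter. Window length,
# layer sizes, and the synthetic data are assumptions, not the post's model.
import numpy as np
import tensorflow as tf

def make_windows(series, window=36):
    """Slice a 1-D series into (N, 6, 6, 1) windows with next-sample targets."""
    xs, ys = [], []
    for i in range(len(series) - window - 1):
        xs.append(series[i:i + window].reshape(6, 6, 1))
        ys.append(series[i + window])
    return np.array(xs, dtype=np.float32), np.array(ys, dtype=np.float32)

series = np.sin(np.linspace(0, 40 * np.pi, 4000)).astype(np.float32)
x, y = make_windows(series)

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=(6, 6, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=2, batch_size=32, verbose=0)

# Conv/Dense ops like these are supported by TensorFlow Lite Micro,
# unlike the LSTM path mentioned above.
tflite_model = tf.lite.TFLiteConverter.from_keras_model(model).convert()
open("cnn_wave_predictor.tflite", "wb").write(tflite_model)
```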

CFU Playground (1) - Introduction

It is not an official Google open-source project, but a very interesting project is in progress. (Refer to the site below for details.) https://cfu-playground.readthedocs.io/en/latest/index.html To explain the concept briefly, it puts a RISC-V soft core in an FPGA and allows ML to run on that soft core. The interesting point is that we can easily customize the core with accelerators on the FPGA and also add custom instructions. The CFU Playground framework is composed of many good open-source products, so it's free. We can practice it through the many projects of Pete Warden, who devised TinyML using the TensorFlow Lite Micro framework a few years ago, and we can refer to a few more interesting projects. One of the important enablers for the CFU Playground would be TensorFlow Lite Micro in conjunction with TinyML. The purpose of TinyML is to implement ML using only a few hundred kilobytes of memory on an ARM Cortex-M-class microcontroller. TensorFlow Lite is already used a lot by people w...

RNN (6) - Results

On Pynq, the SW driver can be developed in a Python Jupyter notebook environment and can be implemented in relatively few lines, as in the source linked below. In order to understand how the SW operates, it is necessary to understand the operating principle of the RTL block, which will be explained in detail in a separate post. Checking the results:
- Function: If you look at the graph below, there is no big difference between the results of the System Simulator and the results of the RTL, and it runs correctly. However, the System Simulator and the RTL differ in how they handle numbers: when reducing the precision after a multiplication, the RTL simply uses a bit-shift operation while the System Simulator uses Python's division, and that can produce differences in the results (a small numeric sketch follows after this list).
- Performance: The performance is compared using the total time the RNN runs. Of course, there may be a problem with the precision of time measurement in Python, but since the order o...
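The bit-shift versus division point can be seen with a tiny numeric sketch, assuming the 7-fractional-bit fixed-point format described in the System Simulator post; the input values here are made up for illustration.

```python
# Sketch of the rounding difference described above, assuming 7 fractional bits.
FRAC_BITS = 7
SCALE = 1 << FRAC_BITS

def to_fixed(x):
    return int(round(x * SCALE))

a, b = to_fixed(0.8031), to_fixed(-0.377)
product = a * b                      # carries 2 * FRAC_BITS fractional bits

rtl_style = product >> FRAC_BITS     # arithmetic shift: rounds toward -inf
sim_style = int(product / SCALE)     # float division then int(): rounds toward 0

print(rtl_style / SCALE, sim_style / SCALE)  # differ slightly for negative products
```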

RNN (5) - FPGA System Design

This is the last post on the hardware implementation. The RTL code in the previous posts was designed as a HW accelerator IP that performs the RNN, and this post describes how to design the whole system targeting an FPGA. The FPGA board used here is the Pynq-Z2, an open-hardware board based on the Xilinx ZYNQ-7000. (For the Pynq-Z2, please refer to the links below.)
- http://www.pynq.io/
- https://www.tulembedded.com/FPGA/ProductsPYNQ-Z2.html
- https://pynq.readthedocs.io/en/latest/
First, the RNN accelerator must open a channel to communicate with the ARM Cortex-A9 where the SW runs. IP packaging is performed using Xilinx Vivado, and I used the AXI interface template provided by Vivado. The RTL code in the code window on the right of the figure below shows how to combine the RNN IP with the AXI interface. IP packaging is easily done through a few steps in Vivado. (Details will be posted separately.) After IP packaging, you need to create a bit file to downl...
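For the SW side of that channel, a minimal sketch with the pynq library might look like the following, assuming an AXI-Lite style register interface; the bitstream name, IP instance name, and register offsets are placeholders, not the actual register map of this design.

```python
# Minimal sketch of talking to an AXI-wrapped IP from the PS side with pynq.
# Bitstream name, IP name, and register offsets below are hypothetical.
from pynq import Overlay, MMIO

overlay = Overlay("rnn_accel.bit")          # loads the bit file onto the PL
ip_info = overlay.ip_dict["rnn_accel_0"]    # hypothetical IP instance name

mmio = MMIO(ip_info["phys_addr"], ip_info["addr_range"])

CTRL_REG = 0x00   # assumed: write 1 to start
STAT_REG = 0x04   # assumed: reads 1 when done

mmio.write(CTRL_REG, 1)
while mmio.read(STAT_REG) != 1:
    pass          # poll until the accelerator signals completion
```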

RNN (4) - RTL Design

I designed the RTL in the same way as the System Simulator from the previous post. The major considerations for the RTL design are as follows. (A detailed code description will be posted separately later. Also, there are many good tutorials on how to use Intel Quartus and ModelSim, so you'd better search for them; I will post an explanation of simulating this code separately.)
- I/O method: The most common method is to store the input in memory, run the RNN, store the predicted values in an output memory, and then raise a completion signal. The memory can be chosen in various ways: FIFO, single-port, dual-port, etc. Here, a single-port memory is chosen as the simplest option (a behavioral sketch of this flow follows after this list).
- Timing: RTL stands for Register Transfer Level. I don't know who coined the name, but it fits relatively well. A register is a component whose output changes at the clock edge. Since there is some logic ...
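As a behavioral sketch of that I/O method (no real hardware involved): the host fills an input memory, starts the block, waits for the completion signal, and reads the output memory. All names, sizes, and the stand-in computation below are illustrative only.

```python
# Behavioral model of the memory-plus-done-signal I/O protocol described above.
class RnnAccelModel:
    def __init__(self, n_in=36, n_out=1):
        self.input_mem = [0] * n_in     # single-port input memory
        self.output_mem = [0] * n_out   # single-port output memory
        self.done = False

    def write_input(self, addr, word):
        self.input_mem[addr] = word

    def start(self):
        # Stand-in for the RNN datapath; the real RTL computes the prediction.
        self.output_mem[0] = sum(self.input_mem) // len(self.input_mem)
        self.done = True                # completion signal

    def read_output(self, addr):
        return self.output_mem[addr]

accel = RnnAccelModel()
for i in range(36):
    accel.write_input(i, i)             # fill the input memory
accel.start()
assert accel.done                       # wait-for-done step in real HW
print(accel.read_output(0))
```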

RNN (3) - System Simulator

The PyTorch reference in the previous post uses optimized libraries and floating-point numbers, so its performance can be considered the most ideal. In the system simulator for the Verilog design, some restrictions follow:
- Fixed-point numbers: because the range of representable numbers is limited, performance restrictions follow.
- Suboptimal matrix operations: the existing PyTorch model has been optimized for efficient and accurate operation in the Python environment, so there is a performance difference in the System Simulator as actually implemented.
- Non-linear activation function: here, tanh (hyperbolic tangent) is used, and a linear approximation is applied for convenience of implementation.
A 16-bit fixed-point number is used. In fact, not all 16 bits are used; only 7 fractional bits are used for easy calculation. The unused redundant bits should be optimized away in the future, but the truncation that prevents overflow during every addition/multiplicati...
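A small sketch of this number format and activation, assuming 16-bit values with 7 fractional bits and a crude three-segment piecewise-linear tanh (identity near zero, clamped outside); the actual breakpoints used in the simulator may differ.

```python
# Sketch of 16-bit fixed point with 7 fractional bits and a piecewise-linear tanh.
FRAC_BITS = 7
SCALE = 1 << FRAC_BITS

def quantize(x):
    """Convert a float to a 16-bit fixed-point integer with 7 fractional bits."""
    q = int(round(x * SCALE))
    return max(-(1 << 15), min((1 << 15) - 1, q))   # saturate to 16 bits

def dequantize(q):
    return q / SCALE

def tanh_pwl(q):
    """Piecewise-linear tanh: y = x near zero, clamped to +/-1 outside."""
    one = 1 * SCALE
    if q > one:
        return one
    if q < -one:
        return -one
    return q   # identity segment around zero

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, dequantize(tanh_pwl(quantize(x))))
```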