Posts

CFU Playground (3) - ML on a board (Arduino Nano 33 BLE)

I am going through this step without running TensorFlow Lite Micro in the RISC-V environment of the CFU Playground. If you are already familiar with TensorFlow Lite Micro this may not matter, but if you are not, like me, you need to port your own model and check that it works properly. The reason for choosing the Arduino Nano 33 BLE is that it has generous specifications compared to the Arduino Uno: a Cortex-M4 at 64 MHz, 256 KB SRAM, 1 MB flash memory, and built-in sensors such as a gyroscope. So you can do interesting experiments with it in the future. Starting with a board that has too little memory and poor CPU performance can cause many complicated errors, and you have to spend a lot of effort optimizing your code to fit into memory. https://store-usa.arduino.cc/products/arduino-nano-33-ble?selectedStore=us When it comes to price, the ESP32 DevKit can also be an alternative, but TensorFlow Lite Micro on the ESP32 is supported only with ESP-IDF, which is less convenient than the Arduino IDE. Above all, the advantage...
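As a rough illustration of the "port your own model" step, here is a minimal host-side Python sketch (not the post's actual code) that converts a trained Keras model to a .tflite flatbuffer and dumps it as a C array that a microcontroller sketch can compile in. The file and array names are placeholders.

```python
# Host-side sketch: convert a Keras model to .tflite and emit a C array,
# similar in spirit to `xxd -i model.tflite`. File names are hypothetical.
import tensorflow as tf

model = tf.keras.models.load_model("wave_predictor.h5")  # hypothetical model file

converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)

# Emit a header so the model can be linked into the firmware image.
with open("model_data.h", "w") as f:
    f.write("const unsigned char g_model_data[] = {\n")
    f.write(", ".join(str(b) for b in tflite_model))
    f.write("\n};\n")
    f.write(f"const unsigned int g_model_data_len = {len(tflite_model)};\n")
```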

CFU Playground (2) - CNN - TensorFlow Lite model

It is not very difficult to convert the code from PyTorch to TensorFlow one-to-one, but there are restrictions on using TensorFlow Lite Micro. TensorFlow Lite Micro provides various libraries, utilities, and examples in C++ so that models can run on a microcontroller with an ARM core or Tensilica core. However, not all neural networks are supported. CNNs definitely work, and a basic fully connected network is also available. An LSTM kernel exists, but I confirmed that errors occur when converting with the TensorFlow Lite converter. Separately, I have seen the GitHub repository of someone who earned a degree by implementing an LSTM in C. ( https://github.com/lephong/lstm-rnn ) It is worth looking at as a reference, but to minimize uncertainty I chose to convert the RNN model into a CNN model. Source code: https://github.com/bxk218/RNN_wave_predictor_verilog/blob/main/tf_2D_CNN_wavePrediction.ipynb After changing the time series sequence to a 6x6 array, it loo...
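To make the 6x6 idea concrete, here is a small illustrative sketch under assumed settings (a 36-sample window reshaped to 6x6 and a toy Conv2D regressor); the linked notebook contains the actual model and data handling.

```python
# Illustrative sketch: reshape a 1-D time series into 6x6 "images" and
# convert a small CNN with the TensorFlow Lite converter. Window length,
# layer sizes, and the synthetic data are assumptions, not the post's model.
import numpy as np
import tensorflow as tf

def make_windows(series, window=36):
    """Slice a 1-D series into (N, 6, 6, 1) windows with next-sample targets."""
    xs, ys = [], []
    for i in range(len(series) - window - 1):
        xs.append(series[i:i + window].reshape(6, 6, 1))
        ys.append(series[i + window])
    return np.array(xs, dtype=np.float32), np.array(ys, dtype=np.float32)

series = np.sin(np.linspace(0, 40 * np.pi, 4000)).astype(np.float32)
x, y = make_windows(series)

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=(6, 6, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=2, batch_size=32, verbose=0)

# Conv/Dense ops like these are supported by TensorFlow Lite Micro,
# unlike the LSTM path mentioned above.
tflite_model = tf.lite.TFLiteConverter.from_keras_model(model).convert()
open("cnn_wave_predictor.tflite", "wb").write(tflite_model)
```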

CFU Playground (1) - Introduction

It is not an official Google open-source project, but a very interesting project is in progress. (Refer to the site below for details.) https://cfu-playground.readthedocs.io/en/latest/index.html To explain the concept briefly, it puts a RISC-V soft core in an FPGA and allows ML to run on that soft core. The interesting point is that we can easily customize the core with accelerators on the FPGA and also add custom instructions. The CFU Playground framework is composed of many good open-source products, so it's free. We can practice it through the many projects of Pete Warden, who devised TinyML using the TensorFlow Lite Micro framework a few years ago, and we can refer to a few more interesting projects. One of the important enablers for the CFU Playground would be TensorFlow Lite Micro in conjunction with TinyML. The purpose of TinyML is to implement ML using only a few hundred kilobytes of memory on an ARM Cortex-M-class microcontroller. TensorFlow Lite is already used a lot by people w...

RNN (6) - Results

On Pynq, the SW driver can be developed in a Python Jupyter notebook environment and can be implemented in relatively few lines, as in the source linked below. In order to understand how the SW operates, it is necessary to understand the operating principle of the RTL block, which will be explained in detail in a separate post. Checking the results:
- Function: If you look at the graph below, there is no big difference between the results of the System Simulator and the results of the RTL, and it runs correctly. However, the System Simulator and the RTL differ in how they handle numbers: when reducing the precision after a multiplication, the RTL simply uses a bit-shift operation while the System Simulator uses Python's division, and that can produce differences in the results (a small numeric sketch follows after this list).
- Performance: The performance is compared using the total time the RNN runs. Of course, there may be a problem with the precision of time measurement in Python, but since the order o...
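The bit-shift versus division point can be seen with a tiny numeric sketch, assuming the 7-fractional-bit fixed-point format described in the System Simulator post; the input values here are made up for illustration.

```python
# Sketch of the rounding difference described above, assuming 7 fractional bits.
FRAC_BITS = 7
SCALE = 1 << FRAC_BITS

def to_fixed(x):
    return int(round(x * SCALE))

a, b = to_fixed(0.8031), to_fixed(-0.377)
product = a * b                      # carries 2 * FRAC_BITS fractional bits

rtl_style = product >> FRAC_BITS     # arithmetic shift: rounds toward -inf
sim_style = int(product / SCALE)     # float division then int(): rounds toward 0

print(rtl_style / SCALE, sim_style / SCALE)  # differ slightly for negative products
```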

RNN (5) - FPGA System Design

This is the last post on the hardware implementation. The RTL code in the previous posts was designed as a HW accelerator IP that performs the RNN, and this post describes how to design the whole system targeting an FPGA. The FPGA board used here is the Pynq-Z2, an open-hardware board based on the Xilinx ZYNQ-7000. (For the Pynq-Z2, please refer to the links below.)
- http://www.pynq.io/
- https://www.tulembedded.com/FPGA/ProductsPYNQ-Z2.html
- https://pynq.readthedocs.io/en/latest/
First, the RNN accelerator must open a channel to communicate with the ARM Cortex-A9 where the SW runs. IP packaging is performed using Xilinx Vivado, and I used the AXI interface template provided by Vivado. The RTL code in the code window on the right of the figure below shows how to combine the RNN IP with the AXI interface. IP packaging is easily done through a few steps in Vivado. (Details will be posted separately.) After IP packaging, you need to create a bit file to downl...
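For the SW side of that channel, a minimal sketch with the pynq library might look like the following, assuming an AXI-Lite style register interface; the bitstream name, IP instance name, and register offsets are placeholders, not the actual register map of this design.

```python
# Minimal sketch of talking to an AXI-wrapped IP from the PS side with pynq.
# Bitstream name, IP name, and register offsets below are hypothetical.
from pynq import Overlay, MMIO

overlay = Overlay("rnn_accel.bit")          # loads the bit file onto the PL
ip_info = overlay.ip_dict["rnn_accel_0"]    # hypothetical IP instance name

mmio = MMIO(ip_info["phys_addr"], ip_info["addr_range"])

CTRL_REG = 0x00   # assumed: write 1 to start
STAT_REG = 0x04   # assumed: reads 1 when done

mmio.write(CTRL_REG, 1)
while mmio.read(STAT_REG) != 1:
    pass          # poll until the accelerator signals completion
```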

RNN (4) - RTL Design

I designed the RTL in the same way as the System Simulator from the previous post. The major considerations for the RTL design are as follows. (A detailed code description will be posted separately later. Also, there are many good tutorials on how to use Intel Quartus and ModelSim, so you'd better search for them; I will post an explanation of simulating this code separately.)
- I/O method: The most common method is to store the input in memory, run the RNN, store the predicted values in an output memory, and then raise a completion signal. The memory can be chosen in various ways: FIFO, single-port, dual-port, etc. Here, a single-port memory is chosen as the simplest option (a behavioral sketch of this flow follows after this list).
- Timing: RTL stands for Register Transfer Level. I don't know who coined the name, but it fits relatively well. A register is a component whose output changes at the clock edge. Since there is some logic ...
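As a behavioral sketch of that I/O method (no real hardware involved): the host fills an input memory, starts the block, waits for the completion signal, and reads the output memory. All names, sizes, and the stand-in computation below are illustrative only.

```python
# Behavioral model of the memory-plus-done-signal I/O protocol described above.
class RnnAccelModel:
    def __init__(self, n_in=36, n_out=1):
        self.input_mem = [0] * n_in     # single-port input memory
        self.output_mem = [0] * n_out   # single-port output memory
        self.done = False

    def write_input(self, addr, word):
        self.input_mem[addr] = word

    def start(self):
        # Stand-in for the RNN datapath; the real RTL computes the prediction.
        self.output_mem[0] = sum(self.input_mem) // len(self.input_mem)
        self.done = True                # completion signal

    def read_output(self, addr):
        return self.output_mem[addr]

accel = RnnAccelModel()
for i in range(36):
    accel.write_input(i, i)             # fill the input memory
accel.start()
assert accel.done                       # wait-for-done step in real HW
print(accel.read_output(0))
```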

RNN (3) - System Simulator

The PyTorch reference in the previous post uses optimized libraries and floating-point numbers, so its performance can be considered the most ideal. In the system simulator for the Verilog design, some restrictions follow:
- Fixed-point numbers: because the range of representable numbers is limited, performance restrictions follow.
- Suboptimal matrix operations: the existing PyTorch model has been optimized for efficient and accurate operation in the Python environment, so there is a performance difference in the System Simulator as actually implemented.
- Non-linear activation function: here, tanh (hyperbolic tangent) is used, and a linear approximation is applied for convenience of implementation.
A 16-bit fixed-point number is used. In fact, not all 16 bits are used; only 7 fractional bits are used for easy calculation. The unused redundant bits should be optimized away in the future, but the truncation that prevents overflow during every addition/multiplicati...
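A small sketch of this number format and activation, assuming 16-bit values with 7 fractional bits and a crude three-segment piecewise-linear tanh (identity near zero, clamped outside); the actual breakpoints used in the simulator may differ.

```python
# Sketch of 16-bit fixed point with 7 fractional bits and a piecewise-linear tanh.
FRAC_BITS = 7
SCALE = 1 << FRAC_BITS

def quantize(x):
    """Convert a float to a 16-bit fixed-point integer with 7 fractional bits."""
    q = int(round(x * SCALE))
    return max(-(1 << 15), min((1 << 15) - 1, q))   # saturate to 16 bits

def dequantize(q):
    return q / SCALE

def tanh_pwl(q):
    """Piecewise-linear tanh: y = x near zero, clamped to +/-1 outside."""
    one = 1 * SCALE
    if q > one:
        return one
    if q < -one:
        return -one
    return q   # identity segment around zero

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, dequantize(tanh_pwl(quantize(x))))
```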