RNN (3) - System Simulator

The PyTorch reference in the previous post uses an optimized library and floating-point numbers, so its performance can be considered the ideal case. The system simulator for the Verilog design comes with several restrictions:

- Fixed-point numbers: the representable range of values is limited, which constrains performance.

- Suboptimal matrix operations: the PyTorch model is optimized for efficient and accurate computation in the Python environment, so the system simulator that is actually implemented performs differently.

- Non-linear activation function: tanh (hyperbolic tangent) is used here, and it is linearly approximated for ease of implementation.
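
The post does not give the exact breakpoints of the linear approximation, so the following is only a hypothetical sketch of how tanh can be approximated piecewise-linearly (the actual Verilog design may use different segments):

```python
def tanh_pwl(x):
    """Piecewise-linear tanh: identity near zero, one connecting segment,
    and saturation at +/-1. The breakpoints (0.5 and 2.0) are illustrative,
    not necessarily the ones used in the actual design."""
    sign = 1.0 if x >= 0 else -1.0
    a = abs(x)
    if a >= 2.0:
        y = 1.0                    # saturated region
    elif a >= 0.5:
        y = a / 3.0 + 1.0 / 3.0    # line from (0.5, 0.5) to (2.0, 1.0)
    else:
        y = a                      # tanh(x) ~ x near zero
    return sign * y
```

In hardware, each segment costs only a multiply-add (or just a shift-add if the slopes are chosen as powers of two; the 1/3 slope above is purely for illustration).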


Here, a 16-bit fixed-point number is used. In fact, not all 16 bits are used; only 7 fractional bits are kept for easy calculation. The unused redundant bits should be optimized away in the future, but truncating after every addition/multiplication to prevent overflow is time-consuming. A simple hack is to multiply each parameter by 128 so that it becomes an integer (with 7 fractional bits implied), and to divide the final result by 128 again at the end. This reduces the truncation work needed per operation. Since this is a bit complicated to explain in words, let's take a simple example.

A = 0.35, B = -0.20 -> AxB = ? 

Suppose we want to solve this problem. Then:

1) Convert A and B to fixed point 

A = 0.0101100110011001101 B = -0.00110011001100110011 

2) Multiply each by 128 and truncate to an integer. (This is the same as a 7-bit left shift.)

A = 16'b0000000000101100 (44), B = -16'b0000000000011001 (-25) 

3) AxB = -10001001100 (-1100)

4) Divide the above result by 128x128 (a 14-bit right shift)

AxB = -0.00010001001100 (-0.067)
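
The four steps above can be reproduced in a few lines of Python (a sketch; Python's `int()` truncates toward zero, matching 44.8 → 44 and -25.6 → -25 above):

```python
PRECISION = 128  # 7 fractional bits

def to_fixed(x):
    # real -> integer with 7 fractional bits (truncated toward zero)
    return int(x * PRECISION)

A, B = 0.35, -0.20
a, b = to_fixed(A), to_fixed(B)            # 44, -25
prod = a * b                               # -1100, now carrying 14 fractional bits
result = prod / (PRECISION * PRECISION)    # divide by 128 x 128 (14-bit shift)
print(a, b, prod, round(result, 3))        # 44 -25 -1100 -0.067
```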

The reason such a complicated process is necessary is that floating-point operations are rarely used in Verilog, considering the implementation area. The result we obtain is -0.067 rather than the exact -0.07; this can be seen as a kind of quantization error.

The system simulator defines a constant called "Precision" that determines how many fractional bits are kept. It is set to 128 (a 7-bit shift), the point at which system performance was judged adequate: with a smaller value, too many fractional bits of the parameters are truncated and performance deteriorates, while raising it further brought no significant improvement. Fixing the number format this way also lets us use Verilog's 2's complement, so calculations are convenient without worrying about sign extension at every step. (For the concepts of 2's complement and sign extension, please check the wiki.)

Once the number system is established, it is enough to figure out which operations are performed in what order. The parameters of the reference model are summarized as follows. (If you run the last line of the reference, you can get each parameter and its size.)


- rnn.weight_ih_l0 param.shape: torch.Size([9, 1]) => Wih
- rnn.weight_hh_l0 param.shape: torch.Size([9, 9]) => Whh
- rnn.bias_ih_l0 param.shape: torch.Size([9]) => bih
- rnn.bias_hh_l0 param.shape: torch.Size([9]) => bhh
- out.weight param.shape: torch.Size([1, 9]) => Who
- out.bias param.shape: torch.Size([1]) => bho
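
The list above can be reproduced with a loop like the following (a sketch: the module and attribute names `rnn` and `out` are chosen to match the list, not taken from the actual notebook):

```python
import torch

class WavePredictor(torch.nn.Module):
    """Hypothetical module with the same layer names and sizes as the
    parameter list above (1 input, 9 hidden units, 1 output)."""
    def __init__(self):
        super().__init__()
        self.rnn = torch.nn.RNN(input_size=1, hidden_size=9)
        self.out = torch.nn.Linear(9, 1)

model = WavePredictor()
for name, param in model.named_parameters():
    print(name, "param.shape:", param.shape)
```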

Now, if we perform the operation in the following order, we can obtain the desired inference.

i)   Xo = Wih x Xi + bih
ii)  Hi = Whh x H(i-1) + bhh (H0 = 0)
iii) Ho = tanh(Xo + Hi)
iv)  Ypred = Who x Ho + bho
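
One time step of this fixed-point recurrence can be sketched in Python as follows (integer matrices with 7 fractional bits; the floating-point `math.tanh` stands in for the linearly approximated version, and the helper names are mine, not the simulator's):

```python
import math

P = 128  # "Precision": 7 fractional bits

def q(x):
    # real -> integer with 7 fractional bits (truncated toward zero)
    return int(x * P)

def matvec(W, x):
    # integer matrix-vector product; the result carries 14 fractional bits
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def rnn_step(Wih, bih, Whh, bhh, Who, bho, x, h_prev):
    """One inference pass i)-iv). `// P` floors, which matches an
    arithmetic right shift (>>>) on 2's-complement hardware."""
    xo = [v // P + b for v, b in zip(matvec(Wih, x), bih)]       # i)
    hi = [v // P + b for v, b in zip(matvec(Whh, h_prev), bhh)]  # ii)
    ho = [q(math.tanh((a + b) / P)) for a, b in zip(xo, hi)]     # iii)
    y = [v // P + b for v, b in zip(matvec(Who, ho), bho)]       # iv)
    return ho, [v / P for v in y]
```

With small made-up weights, the fixed-point output tracks the floating-point reference to within the quantization error discussed above.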

If you execute the code in the link below, you can see the performance of the reference model and the system simulator as in the figures below. There is a clear performance gap due to the three factors mentioned above; still, since the trend is followed well, scaling the result can compensate the prediction to some extent. So the system simulator is frozen here. It is also used to generate the test vectors needed for Verilog development, and to compare against the final RTL result. (* I am a hardware engineer, so the Python code may not look nice from a SW engineer's point of view; it was written with an algorithm kept as close to Verilog as possible. I would appreciate it if someone could verify whether there is any cause other than quantization error. Smarter truncation could be applied while monitoring that performance does not degrade, but that is out of scope.)

- data = torch.sin(time + torch.sin(time))

- data = torch.sin(time + torch.sin(time) + torch.cos(time))



source: https://github.com/bxk218/RNN_wave_predictor_verilog/blob/main/systemSimulator.ipynb
