RNN (3) - System Simulator
The PyTorch reference in the previous post uses an optimized library and floating-point numbers, so its performance can be considered the ideal case. The system simulator for the Verilog design comes with several restrictions:
- Fixed-point numbers: since the representable range of numbers is limited, some performance loss follows.
- Suboptimal matrix operations: the existing PyTorch model is optimized for efficient and accurate computation in the Python environment, so performance differs in the System Simulator that is actually implemented.
- Non-linear activation function: tanh (hyperbolic tangent) is used here, with a linear approximation applied for ease of implementation (see the sketch after this list).
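The post only says that tanh is linearly approximated, without giving the segments, so the three-segment scheme below is an assumption for illustration: near-linear around zero, a fixed slope in between, and saturation at ±1.

```python
import math

def tanh_pwl(x: float) -> float:
    """Piecewise-linear tanh approximation (breakpoints are assumed).

    - |x| <= 0.5 : tanh(x) ~= x        (near-linear region)
    - 0.5 < |x| < 2.0 : straight line through (0.5, 0.5) and (2.0, 1.0)
    - |x| >= 2.0 : saturate at +/-1
    """
    sign = 1.0 if x >= 0 else -1.0
    ax = abs(x)
    if ax <= 0.5:
        return x
    if ax >= 2.0:
        return sign
    return sign * (0.5 + (ax - 0.5) / 3.0)  # slope 1/3 on the middle segment

for x in (-3.0, -1.0, -0.3, 0.3, 1.0, 3.0):
    print(f"x={x:+.1f}  approx={tanh_pwl(x):+.4f}  tanh={math.tanh(x):+.4f}")
```

A piecewise-linear curve like this replaces the exponentials of the true tanh with a comparison and one multiply, which is much cheaper to realize in hardware.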
Here, a 16-bit fixed-point number is used. In fact, not all 16 bits are used: only 7 fractional bits are kept to keep the calculation simple. The unused redundant bits should be optimized away in the future, but truncating after every addition/multiplication to prevent overflow is time-consuming. A simple hack is to multiply each actual parameter by 128 up front, turning it into an integer, and to divide the final result by 128 again at the end. By doing so, the truncation work for each operation can be reduced. Since this is a bit complicated to explain in words, let's take a simple example.
A = 0.35, B = -0.20 -> AxB = ?
Suppose we solve this problem. Then:
1) Convert A and B to fixed point
A = 0.0101100110011001101 B = -0.00110011001100110011
2) Multiply each by 128 to turn A and B into integers, truncating the leftover fraction. (This is actually the same as a 7-bit left shift.)
A = 16'b0000000000101100 (44), B = -16'b0000000000011001 (-25)
3) AxB = -10001001100 (-1100)
4) Divide the above result by 128x128 (14-bit right shift)
AxB = -0.00010001001100 (-0.067)
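The same four steps can be reproduced with plain integer arithmetic in Python; here is a minimal sketch (the names `PRECISION` and `to_fixed` are mine):

```python
# The scaled-integer trick from the example: scale by 128 (7-bit left
# shift), multiply as plain integers, then divide by 128x128 (14-bit
# right shift) to undo the doubled scale.
PRECISION = 128

def to_fixed(x: float) -> int:
    return int(x * PRECISION)  # int() truncates toward zero

a = to_fixed(0.35)   # 44
b = to_fixed(-0.20)  # -25
print(a * b)                              # -1100
print((a * b) / (PRECISION * PRECISION))  # -0.067138671875, ~ -0.067
```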
The reason such a complicated process is necessary is that floating-point operations are rarely used in Verilog, given the implementation area they require. The result we get is -0.067, slightly smaller in magnitude than the -0.07 we expected; it can be seen as a kind of quantization error.

The System Simulator defines a constant called "Precision" that determines how many fractional bits are kept. It is set to 128, the point at which system performance was judged to be adequate: if a smaller value is specified, many fractional bits of the parameters are truncated and performance deteriorates, while raising it above 128 brought no significant performance improvement, so it was fixed at 128 (a 7-bit shift). Representing the numbers this way also lets us use Verilog's 2's complement arithmetic, so calculations are convenient without worrying about sign extension at every step. (For the concepts of 2's complement and sign extension, please check the wiki.)

Once the number system is established, it is enough to figure out which operations are performed in what order. A summary of the parameters of the reference model is as follows. (Running the last line of the Reference prints each parameter and its size; see the snippet after the list.)
- rnn.weight_ih_l0 param.shape: torch.Size([9, 1]) => Wih
- rnn.weight_hh_l0 param.shape: torch.Size([9, 9]) => Whh
- rnn.bias_ih_l0 param.shape: torch.Size([9]) => bih
- rnn.bias_hh_l0 param.shape: torch.Size([9]) => bhh
- out.weight param.shape: torch.Size([1, 9]) => Who
- out.bias param.shape: torch.Size([1]) => bho
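For reference, the standard way to print these in PyTorch is to iterate over `named_parameters()`. The model below is a hedged reconstruction: the layer names `rnn` and `out` are taken from the parameter names above, and the shapes imply an `nn.RNN(input_size=1, hidden_size=9)` followed by `nn.Linear(9, 1)`.

```python
import torch.nn as nn

class RefModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.RNN(input_size=1, hidden_size=9)
        self.out = nn.Linear(9, 1)

model = RefModel()
for name, param in model.named_parameters():
    print(name, "param.shape:", param.shape)
# rnn.weight_ih_l0 param.shape: torch.Size([9, 1])
# rnn.weight_hh_l0 param.shape: torch.Size([9, 9])
# ... and so on for the biases and the output layer
```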
Now, if we perform the operations in the following order, we can obtain the desired inference (a fixed-point sketch of the full sequence follows this list):
i) Xo = Wih x X + bih
ii) Hi = Whh x Hi-1 + bhh (H0 = 0)
iii) Ho = tanh(Xo + Hi)
iv) Ypred = Who x Ho + bho
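Putting the 128-scaled arithmetic and the step order together, a minimal NumPy sketch of one inference pass might look like the following. The parameter values are random placeholders (only the shapes match the reference model), and `np.tanh` stands in for the simulator's piecewise-linear approximation.

```python
import numpy as np

PRECISION = 128  # 7 fractional bits, as chosen above

def to_fixed(x):
    """Scale by 128 and truncate toward zero (7-bit left shift)."""
    return np.trunc(np.asarray(x, dtype=np.float64) * PRECISION).astype(np.int64)

def rnn_step(x, h_prev, Wih, Whh, bih, bhh, Who, bho):
    """One inference step; every argument is a 128-scaled integer array."""
    # i)   Xo = Wih x X + bih  (the matmul result carries scale 128*128,
    #      so the bias is multiplied up by 128 to match)
    xo = Wih @ x + bih * PRECISION
    # ii)  Hi = Whh x H(prev) + bhh
    hi = Whh @ h_prev + bhh * PRECISION
    # iii) Ho = tanh(Xo + Hi): rescale to real, activate, re-quantize.
    ho = to_fixed(np.tanh((xo + hi) / PRECISION**2))
    # iv)  Ypred = Who x Ho + bho
    y = (Who @ ho + bho * PRECISION) / PRECISION**2
    return ho, float(y[0])

# Placeholder parameters with the reference shapes (values are random).
rng = np.random.default_rng(0)
Wih, Whh = to_fixed(rng.uniform(-1, 1, (9, 1))), to_fixed(rng.uniform(-1, 1, (9, 9)))
bih, bhh = to_fixed(rng.uniform(-1, 1, 9)), to_fixed(rng.uniform(-1, 1, 9))
Who, bho = to_fixed(rng.uniform(-1, 1, (1, 9))), to_fixed(rng.uniform(-1, 1, 1))

h = np.zeros(9, dtype=np.int64)  # H0 = 0
for t, xt in enumerate([0.35, -0.20, 0.10]):
    h, y = rnn_step(to_fixed([xt]), h, Wih, Whh, bih, bhh, Who, bho)
    print(f"t={t}  Ypred={y:+.4f}")
```

Note how every intermediate is kept as a scaled integer and only divided back by 128x128 at the boundaries (the tanh input and the final output), which is exactly the trick that spares the Verilog design a truncation after every operation.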