Notes from the original paper “RoFormer: Enhanced Transformer with Rotary Position Embedding” (Su et al., 2021) and its mathematical derivation.
The mathematical derivation of Rotary Position Embedding (RoPE) begins with the basic rotary matrix definition. The 2D rotation matrix is:
\[R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}\]For position $m$ and base frequency $\theta$, the rotation angle is \(\theta_m = m\theta\).
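As a minimal sketch of this step (using NumPy, with an arbitrary example vector and base frequency, not values from the paper), the rotation of a single 2D pair to position $m$ looks like:

```python
import numpy as np

def rotation_matrix(angle):
    """Return the 2D rotation matrix R_angle."""
    return np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])

theta = 0.1               # base frequency (example value)
m = 3                     # token position
q = np.array([1.0, 2.0])  # one 2D pair of a query vector

q_rotated = rotation_matrix(m * theta) @ q  # rotate by theta_m = m * theta
print(q_rotated)
```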
RoPE can equivalently be expressed with complex numbers. Writing a 2D query as $q = q_r + iq_i$ and rotating it to position $m$:
\[q\cdot e^{im\theta} = (q_r + iq_i)(\cos(m\theta) + i\sin(m\theta))\]This expands to: \((q_r\cos(m\theta) - q_i\sin(m\theta)) + i(q_r\sin(m\theta) + q_i\cos(m\theta))\)
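A quick numerical check (same hypothetical values as above) that the complex multiplication reproduces exactly the rotated components:

```python
import numpy as np

theta, m = 0.1, 3
q_r, q_i = 1.0, 2.0

# (q_r + i*q_i) * e^{i*m*theta}
rotated = (q_r + 1j * q_i) * np.exp(1j * m * theta)

# Real and imaginary parts match the expansion above:
# (q_r*cos - q_i*sin, q_r*sin + q_i*cos)
print(rotated.real, rotated.imag)
```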
The rotary embedding is applied to both the query and the key. For a vector pair $(q, k)$ at position $m$:
\(\Phi(q, m) = q \cdot e^{im\theta}\) \(\Phi(k, m) = k \cdot e^{im\theta}\)
When both vectors sit at the same position, the dot product is unchanged, because multiplying by $e^{im\theta}$ corresponds to rotating both by the same orthogonal matrix $R_{m\theta}$: \(\Phi(q,m)^\top\Phi(k,m) = (R_{m\theta}q)^\top(R_{m\theta}k) = q^\top R_{m\theta}^\top R_{m\theta}k = q^\top k\)
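A small sketch verifying this invariance numerically (arbitrary example vectors and frequency):

```python
import numpy as np

def rotate(x, angle):
    """Rotate a 2D vector x by the given angle (R_angle @ x)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * x[0] - s * x[1], s * x[0] + c * x[1]])

theta, m = 0.1, 5
q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])

# Both vectors rotated to the same position m: the dot product is unchanged.
print(np.dot(rotate(q, m * theta), rotate(k, m * theta)))  # ~= q . k
print(np.dot(q, k))
```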
Furthermore, each dimension pair gets its own frequency band. For pair index $d$, the frequency is:
\[\theta_d = 10000^{-2d/D}\]where D is the total embedding dimension.
The position-dependent rotation angle then becomes: \(f_d(m) = m\theta_d = m \cdot 10000^{-2d/D}\)
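A short sketch of the frequency bands and per-position angles, assuming an example embedding dimension of $D = 64$:

```python
import numpy as np

D = 64                    # total embedding dimension (example value)
d = np.arange(D // 2)     # one frequency per dimension pair

theta_d = 10000.0 ** (-2.0 * d / D)   # theta_d = 10000^(-2d/D)

m = 7                     # example position
angles = m * theta_d      # f_d(m) = m * theta_d
print(theta_d[:4])
print(angles[:4])
```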
In the full rotary embedding, for a vector $x$ of dimension $D$, RoPE pairs adjacent dimensions $(2d, 2d+1)$ and rotates each pair by $m\theta_d$:
\[\begin{align*} y_{2d} &= x_{2d}\cos(m\theta_d) - x_{2d+1}\sin(m\theta_d) \\ y_{2d+1} &= x_{2d}\sin(m\theta_d) + x_{2d+1}\cos(m\theta_d) \end{align*}\]
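Putting the pieces together, a minimal reference sketch for a single vector is shown below. It pairs adjacent dimensions exactly as in the equations above; the helper name `apply_rope` and the example input are assumptions for illustration, not from the paper.

```python
import numpy as np

def apply_rope(x, m, base=10000.0):
    """Apply RoPE to a vector x of even dimension D at position m.

    Adjacent dimensions (2d, 2d+1) form one rotating pair with angle m * theta_d.
    """
    D = x.shape[-1]
    d = np.arange(D // 2)
    theta_d = base ** (-2.0 * d / D)
    angle = m * theta_d
    cos, sin = np.cos(angle), np.sin(angle)

    x_even, x_odd = x[0::2], x[1::2]
    y = np.empty_like(x)
    y[0::2] = x_even * cos - x_odd * sin
    y[1::2] = x_even * sin + x_odd * cos
    return y

x = np.arange(8, dtype=np.float64)
print(apply_rope(x, m=3))
```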
Turning to the relative position property: for a query at position $m$ and a key at position $n$, we get
\[\Phi(q,m)^\top\Phi(k,n) = (R_{m\theta}q)^\top(R_{n\theta}k) = q^\top\begin{pmatrix} \cos((n-m)\theta) & -\sin((n-m)\theta) \\ \sin((n-m)\theta) & \cos((n-m)\theta) \end{pmatrix}k\]This shows that the attention score depends only on the relative position $m-n$, not on the absolute positions.
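A quick check of this property using the `apply_rope` sketch above (arbitrary example vectors): two query/key pairs with the same positional offset yield the same score.

```python
import numpy as np

# Reuses apply_rope from the sketch above.
q = np.array([0.3, -1.2, 0.7, 0.5])
k = np.array([1.1, 0.2, -0.4, 0.9])

score_a = np.dot(apply_rope(q, m=10), apply_rope(k, m=4))   # offset m - n = 6
score_b = np.dot(apply_rope(q, m=25), apply_rope(k, m=19))  # offset m - n = 6
print(np.isclose(score_a, score_b))  # True: only the offset matters
```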
Finally, considering key properties and advantages: for sequence length $L$ and embedding dimension $D$, applying RoPE takes $O(LD)$ time.
Because the rotation angles can be computed on the fly, no position-embedding table needs to be stored, so the extra memory is $O(1)$ with respect to sequence length.
This mathematical formulation shows that RoPE:
- Encodes absolute positions
- Naturally captures relative positions in attention computation
- Has linear complexity
- Requires no additional memory for position tables
- Maintains rotational invariance in the embedding space
The implementation combines rotary matrices with frequency-based position encoding, providing an efficient way to encode positional information in transformer architectures.
Last updated: Sep 2023