Implement the GELU (Gaussian Error Linear Unit) activation from “Gaussian Error Linear Units (GELUs)” (Hendrycks & Gimpel, 2016).
GELU is the default activation in BERT, GPT-2, and most modern transformers.
The exact formula is: $$\text{GELU}(x) = x \cdot \Phi(x)$$
where $\Phi(x)$ is the CDF of the standard normal distribution.
Implement the widely used tanh approximation: $$\text{GELU}(x) \approx 0.5 \cdot x \cdot \left(1 + \tanh\left(\sqrt{\frac{2}{\pi}} \cdot (x + 0.044715 \cdot x^3)\right)\right)$$
Input: Tensor x of any shape.
Output: Tensor of same shape.
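A minimal NumPy sketch of both forms, assuming the input is array-like; `gelu_exact` uses the erf identity $\Phi(x) = \tfrac{1}{2}\left(1 + \operatorname{erf}(x/\sqrt{2})\right)$, and `gelu_tanh` is the approximation above. Function names are illustrative, not part of the problem statement.

```python
import numpy as np
from numpy import vectorize
from math import erf, sqrt

# Exact GELU: x * Phi(x), with Phi written via the error function.
_erf = vectorize(erf)  # math.erf is scalar-only; vectorize for arrays

def gelu_exact(x):
    x = np.asarray(x, dtype=np.float64)
    return x * 0.5 * (1.0 + _erf(x / sqrt(2.0)))

# Tanh approximation from the problem statement.
def gelu_tanh(x):
    x = np.asarray(x, dtype=np.float64)
    inner = np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + np.tanh(inner))
```

Both are elementwise, so the output shape matches the input for any shape; for ordinary inputs the two versions agree to roughly three decimal places, e.g. `gelu_tanh(1.0)` is about `0.8412`.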