Implement a simple linear forward pass using mixed precision: perform the matrix multiplication in float16 (half precision) for speed, then cast the result back to float32.
Compute: output = (x_fp16 @ W_fp16 + bias_fp32) cast to float32.
Input:
- x: a 2D input tensor of shape (batch, in_features) in float32
- W: a 2D weight tensor of shape (in_features, out_features) in float32
- bias: a 1D bias tensor of shape (out_features,) in float32
Output: A 2D tensor of shape (batch, out_features) in float32.
Steps:
1. Cast x and W to float16.
2. Perform the matrix multiplication in float16.
3. Add the float32 bias and cast the result to float32.

API Reference:
- PyTorch: .half(), .float(), torch.float16
- JAX: .astype(jnp.float16), .astype(jnp.float32)