Flax

Modules, layers from scratch, attention, transformer architectures, training loops, lifted transforms. Production model code in JAX (Linen API).

1. Module with @nn.compact
2. Module with setup() (alternative to compact)
3. Custom Parameter Initializer
4. init() and apply() Round-Trip
5. nn.Sequential Composition
6. Module with Multiple Named Sub-Modules
7. Module That Branches on a Config Flag
8. Three Levels of Module Nesting
9. Multiple PRNG Streams (params, dropout)
10. Train vs Eval Branches via train Flag
11. Implement Dense from Scratch
12. Implement Conv1D from Scratch
13. Implement Conv2D with Stride and Padding
14. Implement Transposed Convolution
15. Implement Depthwise-Separable Convolution
16. Implement LayerNorm with γ/β
17. Implement BatchNorm with Mutable batch_stats
18. Implement GroupNorm
19. Implement RMSNorm (Modern LLM Norm)
20. Implement Dropout with RNG Threading
21. Scaled Dot-Product Attention
22. Multi-Head Self-Attention with Flax
23. Causal Multi-Head Self-Attention
24. Cross-Attention with Flax MHA
25. Multi-Head Attention with KV Cache
26. Grouped-Query Attention (GQA)
27. Multi-Query Attention (MQA)
28. Sliding-Window Attention (Mistral-style)
29. ALiBi: Attention with Linear Biases
30. Block-Diagonal Attention Mask
31. Token Embedding with Flax
32. Sinusoidal Position Encoding
33. Learned Position Embedding
34. Rotary Position Embedding (RoPE)
35. ALiBi Bias Matrix
36. T5 Relative Position Bucketing
37. Tied Input/Output Embedding
38. ViT Patch Embedding
39. Transformer Encoder Block (Pre-LN)
40. Transformer Decoder Block (Pre-LN)
41. Pre-LN vs Post-LN Residual Pattern
42. Mini GPT — Decoder-Only Language Model
43. Mini BERT — Encoder-Only Hidden States
44. Mini T5 — Encoder-Decoder with RMSNorm and Tied Embeddings
45. Vision Transformer (Mean-Pool Variant)
46. Vision Transformer with [CLS] Token
47. DeiT — Data-Efficient Image Transformer
48. SwiGLU Feed-Forward Network
49. ResNet Basic Block
50. ResNet Bottleneck Block
51. Tiny ResNet Classifier
52. Tiny U-Net
53. GRU Cell Step
54. LSTM Cell Step
55. Bidirectional RNN
56. Mixture-of-Experts FFN
57. Squeeze-and-Excitation Block
58. Vision-Language Fusion
59. TrainState — One Step
60. train_step with value_and_grad
61. eval_step — Forward + Metrics
62. Label-Smoothed Cross-Entropy
63. Mixed-Precision Training Step
64. Train with Mutable batch_stats
65. Multi-Task Two-Head Loss
66. Sharded Eval Loss
67. Warmup-Cosine LR at Step
68. Gradient Accumulation Step
69. EMA of Parameters
70. Orbax Save (Tree-Leaf Count)
71. Orbax Load (Restore via Template)
72. HF Weight Load (Kernel Transpose)
73. Pre-train then Fine-tune (Frozen Trunk)
74. Per-Param Weight Decay Mask
75. Per-Param Learning Rate Multipliers
76. Param Freezing via Grad Zeroing
77. Test-Time Augmentation Aggregation
78. Distributed Checkpoint (Sharding Math)
79. nn.scan over an RNN cell
80. nn.scan over layers
81. nn.vmap with shared params
82. nn.checkpoint (gradient checkpointing)
83. nn.jit (Flax-aware JIT lift)
84. nn.remat with checkpoint policies
85. Composed lifts: nn.scan + nn.vmap
86. Batched init via jax.vmap
87. jax.lax.scan inside a Flax Module
88. Custom lift: roll your own ensemble
89. PartitionSpec Layout
90. with_sharding_constraint Annotation
91. nn.with_partitioning Annotation
92. flax.struct.dataclass — Pytree-Friendly State
93. Param Surgery — Kernel Replace
94. Param Surgery — Zero Last Layer
95. Param Surgery — Freeze First Dense
96. Partial Init — Warm-Start From Smaller Checkpoint
97. Multiple Mutable Collections
98. Param Sharing — One Module, Two Call Sites
99. shard_map Simulation — Manual SPMD
100. Mini LM Capstone — Putting It All Together