Vision-Language Fusion
Why this matters
Multimodal models — CLIP, Flamingo, BLIP, GPT-4o — combine representations from at least two modalities (vision, language, audio). The simplest, most pervasive trick they share is early fusion: project each modality into a shared space, mix them with a non-linearity, and let downstream layers do the heavy lifting.
Before joint Transformers became standard for VLMs, simple fusion modules like this were the ENTIRE multimodal head — used in VQA (visual question answering), image captioning, retrieval. They’re still the building block at the boundary where modalities first meet.
The recipe
Two 1-D inputs (e.g., a CLIP image embedding and a text encoder output):
image_features ∈ R^{D_v}
text_features ∈ R^{D_t}
The fusion module:
- Project each modality to a shared hidden dim: v = Dense_v(image_features) → (hidden,), t = Dense_t(text_features) → (hidden,).
- Mix with element-wise add and a non-linearity: fused = tanh(v + t).
- Reproject with one more Dense(hidden) so the mixed representation has a chance to recombine its features.
Element-wise add is the simplest fusion. More sophisticated methods (concatenation, gating, bilinear pooling, cross-attention) all build on the same idea: “project to shared space, combine.”
Why tanh, not relu?
tanh maps its input to (-1, 1), so it lets a “negative agreement”
between modalities propagate: if v says +0.5 and t says -0.5, the
sum is 0; if both say -0.5, the sum is -1 and fused carries a
strong negative signal (tanh(-1) ≈ -0.76). relu would zero out
anything negative, which is fine, but it loses sign information at
the fusion point. Many classic early-fusion modules use tanh.
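To make the sign argument concrete, here is a small sketch comparing the two activations on the sums from the example above. The three-unit vectors v and t are made-up values for illustration, not outputs of any real encoder.

```python
import jax
import jax.numpy as jnp

# Hypothetical projected features for three hidden units (illustrative values).
v = jnp.array([0.5, -0.5, 0.5])
t = jnp.array([-0.5, -0.5, 0.5])
s = v + t  # element-wise sum: disagreement, negative agreement, positive agreement

tanh_out = jnp.tanh(s)     # keeps the sign of each sum
relu_out = jax.nn.relu(s)  # clips every negative sum to zero

print(tanh_out)
print(relu_out)
```

With relu, the "both modalities say no" unit and the "modalities disagree" unit both come out as 0, so downstream layers cannot tell them apart; with tanh they stay distinct.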
Worked walk-through
image_features shape (6,), text_features shape (4,),
hidden = 4:
- v = Dense(hidden=4)(image_features) → (4,).
- t = Dense(hidden=4)(text_features) → (4,).
- fused = tanh(v + t) → (4,).
- out = Dense(hidden=4)(fused) → (4,).
Note that D_v and D_t can differ; the projections handle the
width mismatch. After fusion, downstream layers see a uniform
hidden-dim vector.
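The walk-through can be checked for shapes with a few lines of plain JAX. This is only a sketch: the dense helper below is a hypothetical stand-in for a Dense layer (random weights, zero bias), and the all-ones inputs are placeholders.

```python
import jax
import jax.numpy as jnp

def dense(key, x, out_dim):
    """Hypothetical stand-in for a Dense layer: random weights, zero bias."""
    w = jax.random.normal(key, (x.shape[-1], out_dim)) * 0.1
    b = jnp.zeros((out_dim,))
    return x @ w + b

k_v, k_t, k_o = jax.random.split(jax.random.PRNGKey(0), 3)
image_features = jnp.ones((6,))  # D_v = 6
text_features = jnp.ones((4,))   # D_t = 4
hidden = 4

v = dense(k_v, image_features, hidden)  # (6,) -> (4,)
t = dense(k_t, text_features, hidden)   # (4,) -> (4,)
fused = jnp.tanh(v + t)                 # (4,)
out = dense(k_o, fused, hidden)         # (4,)

print(v.shape, t.shape, fused.shape, out.shape)  # (4,) (4,) (4,) (4,)
```

The two input widths only appear in the first axis of each projection's weight matrix, which is why v and t can be added directly.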
Common pitfalls
- Sharing one Dense for both modalities: each needs its OWN projection. The point of early fusion is to learn modality-specific maps into the shared space.
- No non-linearity at fusion: out = Dense(v + t) is just a linear map of (v, t). Without tanh (or some other non-linear activation), the whole module can be folded into a single linear layer, so the extra layers add no capacity.
- Forgetting the final Dense(hidden): the fused vector goes BACK through one more learned projection. This is the “reproject after mixing” step that lets the network shape the fused representation.
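The second pitfall can be verified numerically. The sketch below uses random weight matrices W_v, W_t and W_o (hypothetical names, biases omitted for brevity) and shows that fusion without an activation equals one linear layer applied to the concatenated inputs.

```python
import jax
import jax.numpy as jnp

k = jax.random.split(jax.random.PRNGKey(0), 5)
W_v = jax.random.normal(k[0], (6, 4))  # image projection
W_t = jax.random.normal(k[1], (4, 4))  # text projection
W_o = jax.random.normal(k[2], (4, 4))  # final reprojection
x = jax.random.normal(k[3], (6,))      # stand-in image features
y = jax.random.normal(k[4], (4,))      # stand-in text features

# Fusion with NO non-linearity between the sum and the final projection.
linear_fused = (x @ W_v + y @ W_t) @ W_o

# The same computation as a single (10, 4) linear layer on [x; y].
W_single = jnp.concatenate([W_v @ W_o, W_t @ W_o], axis=0)
collapsed = jnp.concatenate([x, y]) @ W_single

same = bool(jnp.allclose(linear_fused, collapsed, rtol=1e-3, atol=1e-3))
print(same)

# With tanh in between, no such single weight matrix exists in general.
with_tanh = jnp.tanh(x @ W_v + y @ W_t) @ W_o
```

Adding biases does not change the conclusion: they fold into a single bias on the collapsed layer.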
Problem
Implement vision_language_fusion_forward(seed, image_features, text_features, hidden):
- VLFusion(nn.Module) with hidden field.
- Inside @nn.compact:
  - v = nn.Dense(hidden)(image_features),
  - t = nn.Dense(hidden)(text_features),
  - fused = jnp.tanh(v + t),
  - out = nn.Dense(hidden)(fused).
- Return out flattened.
Inputs:
- seed: int.
- image_features: 1-D (D_v,).
- text_features: 1-D (D_t,).
- hidden: int.
Output: 1-D, length hidden.