See real DistilBERT attention weights — all 6 transformer layers and the 12 specialized heads within each layer.
Each layer reveals a different stage of how the model parses language, from surface word order to long-range syntax and coreference.
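If you want to reproduce weights like these yourself, here is a minimal sketch using the Hugging Face transformers library with the distilbert-base-uncased checkpoint. The checkpoint, the sample sentence, and the chosen layer/head are illustrative assumptions, not necessarily what this demo uses under the hood.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# distilbert-base-uncased: 6 layers, 12 attention heads per layer.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased", output_attentions=True)
model.eval()

# Illustrative sentence with a coreference ("it") for the heads to resolve.
sentence = "The trophy didn't fit in the suitcase because it was too big."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer (6 here);
# each tensor has shape (batch, num_heads, seq_len, seq_len) = (1, 12, L, L).
attentions = outputs.attentions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

layer, head = 4, 7  # arbitrary picks: any layer 0-5 and head 0-11
weights = attentions[layer][0, head]  # (seq_len, seq_len) attention matrix

# For each token, print the token it attends to most strongly.
for i, tok in enumerate(tokens):
    j = weights[i].argmax().item()
    print(f"{tok:>12} -> {tokens[j]}  ({weights[i, j]:.2f})")
```

Rows of the attention matrix sum to 1 (softmax over the sequence), so each row can be read as a distribution over which tokens a given position is looking at — exactly what the heatmaps in this demo visualize.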
Select an example to explore.
Curated examples — each showcases a different linguistic phenomenon