Preprint, follow-up to NeuroSymbolicDG. We factor visual recognition into visual primitives plus their relational composition: a CNN backbone, a concept-bottleneck layer that maps features to primitive heatmaps with spatial coordinates, and a structural scoring layer that evaluates candidate spatial relations among detected primitives. Class probability comes from joint evidence over relational compositions, giving a domain-invariant classifier head for fine-grained recognition. arXiv.
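A minimal sketch of the pipeline described above, not the preprint's actual implementation. It assumes the backbone and concept-bottleneck layer have already produced one heatmap per primitive, and it covers only the later steps: reading a presence score and coordinates from each heatmap, scoring candidate spatial relations, and combining that evidence into class probabilities. The names `soft_argmax`, `relation_scores`, `classify`, the relations `left_of` and `above`, the primitives `beak` and `eye`, and the per-class relation templates are all hypothetical, chosen for illustration. Combining relation scores by multiplication (a sum of logs) is likewise an assumption about what "joint evidence" means.

```python
import math


def _sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def soft_argmax(heatmap):
    """Return (presence, x, y) for one primitive heatmap (list of rows).

    presence is the sigmoid of the peak activation; (x, y) is the
    softmax-weighted mean cell coordinate, with y increasing downward.
    """
    width = len(heatmap[0])
    flat = [v for row in heatmap for v in row]
    peak = max(flat)
    weights = [math.exp(v - peak) for v in flat]  # numerically stable softmax
    total = sum(weights)
    x = sum(w * (i % width) for i, w in enumerate(weights)) / total
    y = sum(w * (i // width) for i, w in enumerate(weights)) / total
    return _sigmoid(peak), x, y


def relation_scores(a, b, scale=1.0):
    """Soft scores in (0, 1) for two hypothetical relations between a and b."""
    pa, xa, ya = a
    pb, xb, yb = b
    joint_presence = pa * pb  # both primitives must be detected
    return {
        "left_of": joint_presence * _sigmoid(scale * (xb - xa)),
        "above": joint_presence * _sigmoid(scale * (yb - ya)),
    }


def classify(heatmaps, class_templates):
    """Map primitive heatmaps to class probabilities.

    class_templates maps each class to a list of (primitive_a, relation,
    primitive_b) triples; a class logit is the summed log-score of its triples.
    """
    prims = {name: soft_argmax(hm) for name, hm in heatmaps.items()}
    logits = {}
    for cls, template in class_templates.items():
        logits[cls] = sum(
            math.log(relation_scores(prims[a], prims[b])[rel] + 1e-9)
            for a, rel, b in template
        )
    top = max(logits.values())
    exps = {c: math.exp(v - top) for c, v in logits.items()}
    norm = sum(exps.values())
    return {c: e / norm for c, e in exps.items()}


# Toy input: "beak" peaks at the left of a 3x3 grid, "eye" at the right.
heatmaps = {
    "beak": [[0, 0, 0], [5, 0, 0], [0, 0, 0]],
    "eye": [[0, 0, 0], [0, 0, 5], [0, 0, 0]],
}
templates = {
    "class_A": [("beak", "left_of", "eye")],
    "class_B": [("eye", "left_of", "beak")],
}
probs = classify(heatmaps, templates)
```

In this toy setup `class_A` gets the higher probability because its template matches the detected layout. The head sees only primitive presence and geometry, never raw backbone features, which is one plausible reading of why such a head could transfer across domains.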