Classically, the perception of a visual target is explained by small local circuits that sample information from a narrow region around the target. For example, in visual crowding it has been proposed that flankers can interfere with the target only when they are presented within Bouma’s window. Here, we first show that the perception of an element in the visual field depends on almost all other elements in the visual field as well as on their specific configuration. Further, we show that the perception of such an element depends not only on the large-scale spatial context but also on the temporal context within a window of up to half a second. We argue that processing such large-scale spatio-temporal information is necessary to solve the ill-posed problems of vision and to establish perceptual identity across space and time. Finally, we show that these observations can be modeled by the concatenation of small visual circuits when allowing for time-consuming computations.