Techniques for text-based reference image generation are described that support generating reference digital images of a three-dimensional representation of a digital environment. In an example, a processing device receives a text-based input that describes a feature of the three-dimensional representation of the digital environment. Based on a perceptual similarity between a reference digital image and semantic properties of the text-based input, the processing device generates the reference digital image for output, the reference digital image depicting a view of the feature. The processing device is further operable to apply one or more edits to the reference digital image based on features of the digital environment as well as on additional user inputs.
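
As an illustration only, the selection of a view by perceptual similarity might be sketched as follows. The abstract does not state how perceptual similarity is computed, so this sketch assumes that the text-based input and each candidate view have already been mapped into a shared embedding space by some encoder, and it scores candidates by cosine similarity. The names `CandidateView`, `cosine_similarity`, and `select_reference_view`, as well as the embedding values, are hypothetical and are not taken from the source.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class CandidateView:
    """A candidate view of the 3D representation (hypothetical structure)."""
    name: str
    embedding: List[float]  # assumed: produced by an image encoder, not shown here


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_reference_view(
    text_embedding: Sequence[float], candidates: Sequence[CandidateView]
) -> CandidateView:
    """Pick the candidate whose embedding is most similar to the text embedding."""
    if not candidates:
        raise ValueError("at least one candidate view is required")
    return max(
        candidates,
        key=lambda view: cosine_similarity(text_embedding, view.embedding),
    )


# Made-up embeddings standing in for encoder output (assumption).
text_embedding = [0.9, 0.1, 0.0]
candidates = [
    CandidateView("front_door", [0.8, 0.2, 0.1]),
    CandidateView("rooftop", [0.0, 0.1, 0.9]),
]
best = select_reference_view(text_embedding, candidates)
print(best.name)
```

Cosine similarity is used here only as a simple stand-in for a perceptual similarity measure; a learned metric could be substituted without changing the selection step.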