Structurally Controllable Text-to-Image Generation for Architectural Images Using Structural Consistency Loss
Abstract
Interior scene design is a comprehensive field that spans space planning, color matching, furniture arrangement, and texture expression, with the aim of creating interior environments that are both aesthetically pleasing and functional. Existing text-to-image models often produce chaotic spatial layouts and disproportionate components when generating architectural images, which have strict structural requirements. In this study, we propose a structural consistency loss that achieves implicit structural control by constraining the spatial distribution and semantic alignment of the cross-attention maps of the Qwen-Image model. Specifically, the loss comprises 1) a spatial concentration loss, which encourages the attention regions corresponding to architectural components to be more compact and focused, and 2) a semantic alignment loss, which reduces the similarity between the attention maps of different components and thereby strengthens the discriminability of the visual-semantic correspondence. The structural consistency loss is optimized jointly with the base loss, so that the model learns from text alone to follow the underlying structural regularities of buildings. Experiments on the MMIS dataset show that the final model achieves the best performance, with an FID of 11.06 and an IS of 34.82, and also scores highest on CLIPScore (0.869), a measure of image-text alignment. The generated architectural images show marked improvements in scale coordination, plausibility of component placement, and overall structural realism.
Unlike methods that rely on external conditions, this method relies only on textual cues and thus offers a scalable route to structure-aware image generation. It advances the practical use of generative AI in professional design and provides technical ideas and a methodological reference for applying text-to-image generation in fields with strong structural requirements, such as architecture and design.
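The two loss terms described above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: compactness is measured as the spatial variance of attention mass around its centroid, similarity between components as the mean pairwise cosine similarity of their attention maps, and the weights `w_conc` and `w_align` are hypothetical. Plain Python lists stand in for the differentiable cross-attention tensors that an actual training setup would use.

```python
import math
from itertools import combinations


def normalize(attn):
    """Scale a 2D attention map (list of rows) so its entries sum to 1."""
    total = sum(sum(row) for row in attn)
    if total <= 0:
        raise ValueError("attention map must have positive mass")
    return [[v / total for v in row] for row in attn]


def spatial_concentration(attn):
    """Assumed form: variance of attention mass around its centroid.

    Lower values mean the attention for a component is more compact.
    """
    p = normalize(attn)
    cy = sum(y * v for y, row in enumerate(p) for v in row)
    cx = sum(x * v for row in p for x, v in enumerate(row))
    return sum(
        v * ((y - cy) ** 2 + (x - cx) ** 2)
        for y, row in enumerate(p)
        for x, v in enumerate(row)
    )


def cosine_similarity(a, b):
    """Cosine similarity between two flattened attention maps."""
    fa = [v for row in a for v in row]
    fb = [v for row in b for v in row]
    dot = sum(x * y for x, y in zip(fa, fb))
    na = math.sqrt(sum(x * x for x in fa))
    nb = math.sqrt(sum(y * y for y in fb))
    return dot / (na * nb + 1e-12)  # epsilon guards against all-zero maps


def semantic_alignment(maps):
    """Assumed form: mean pairwise similarity between different components.

    Minimizing it pushes the maps of different components apart.
    """
    pairs = list(combinations(maps, 2))
    if not pairs:
        return 0.0
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


def structural_consistency_loss(base_loss, maps, w_conc=0.1, w_align=0.1):
    """Base loss plus the two weighted structural terms (weights are hypothetical)."""
    conc = sum(spatial_concentration(m) for m in maps) / len(maps)
    return base_loss + w_conc * conc + w_align * semantic_alignment(maps)
```

Under these assumptions, a map whose mass sits on a single cell has zero concentration loss, and two components whose maps do not overlap contribute zero to the alignment term, so the total reduces to the base loss.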
Keywords

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.

Journal of Computing and Information Technology
