Towards Robust Urban Spatial Recognition and Dynamic Optimization: A Multimodal Spatiotemporal Neural Network Approach
Abstract
Urban environments remain challenging to manage due to noisy sensor streams, incomplete multimodal coverage, and the need for rapid responses under dynamic conditions. To address these issues, we propose the Multimodal Spatiotemporal Neural Network (MSTN), a unified end-to-end framework that integrates data preprocessing, modality-specific feature extraction, adaptive multimodal fusion, and differentiable optimization. MSTN employs hybrid attention mechanisms and dynamic gating to balance heterogeneous inputs, ensuring temporal consistency and robustness to missing or corrupted data. Evaluated on two established urban perception benchmarks, Cityscapes for dense scene understanding and nuScenes for multimodal trajectory prediction, MSTN achieves an average 18.7% improvement in recognition accuracy over Faster R-CNN and a 23.5% reduction in pose estimation error compared to ST-GCN, while exhibiting
faster convergence and lower computational overhead. Robustness tests show stable performance under up to 30% sensor corruption and improved generalization across city environments. While MSTN demonstrates strong empirical performance, its reliance on synchronized multimodal inputs and quadratic attention complexity may limit deployment in highly resource-constrained settings. Nonetheless, MSTN offers a practical and scalable architecture for real-world applications in intelligent transportation, emergency response, and adaptive urban management.
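The abstract's description of adaptive multimodal fusion via dynamic gating can be illustrated with a minimal sketch. The actual MSTN gating mechanism is not specified here, so the projection shape, softmax gating, and two-modality setup below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_fusion(feats, W, b):
    """Fuse per-modality features with softmax gate weights.

    feats: (M, D) stacked modality features
    W:     (M*D, M) gate projection (hypothetical learned weights)
    b:     (M,) gate bias
    Returns the (D,) fused feature and the (M,) gate weights.
    """
    logits = feats.reshape(-1) @ W + b   # one gate logit per modality
    gates = softmax(logits)              # convex combination weights
    fused = gates @ feats                # (M,) @ (M, D) -> (D,)
    return fused, gates

M, D = 2, 4  # e.g. camera and LiDAR features, 4-dim each
feats = rng.normal(size=(M, D))
W = rng.normal(size=(M * D, M))
b = np.zeros(M)
fused, gates = gated_fusion(feats, W, b)
```

Because the gate weights form a convex combination, a modality whose features are missing or corrupted can be down-weighted at inference time, which is one way such an architecture stays robust to partial sensor failure.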
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.

Journal of Computing and Information Technology
