Pedestrian behavior prediction is one of the main challenges facing the decision and planning systems of intelligent vehicles in urban environments, and improving the accuracy of pedestrian crossing intention prediction is important for driving safety. Existing methods rely too heavily on the location of the pedestrian bounding box and rarely consider the environmental information in traffic scenes or the interactions between traffic objects. To address these problems, this paper proposes a pedestrian crossing intention prediction method based on multi-modal feature fusion. A new global scene context extraction module and a local scene spatiotemporal feature extraction module are constructed by combining multiple attention mechanisms, which enhances the method's ability to extract spatiotemporal features of the scene around the vehicle. These modules rely on the semantic analysis results of the scene to capture the interaction between pedestrians and their surroundings, addressing the insufficient use of traffic environment context and of interaction information between traffic objects. In addition, a multi-modal feature fusion module based on a hybrid fusion strategy is designed; it performs joint reasoning over visual and motion features according to the complexity of the different information sources and provides reliable information for the pedestrian crossing intention prediction module. Tests on the JAAD dataset show that the proposed method achieves a prediction accuracy of 0.84, which is 10.5% higher than that of the baseline method. Compared with existing models of the same type, the proposed method has the best overall performance and is applicable to a wider range of scenarios.