Recognizing non-driving behaviors is an important way to improve driving safety. Existing recognition methods that fuse skeleton sequences with images suffer from high computational cost and difficult feature fusion. To address these problems, this paper proposes the skeleton-image based behavior recognition network (SIBBR-Net), which builds on a multi-scale skeleton graph and local visual context. SIBBR-Net extracts motion and appearance features through a graph convolutional network based on multi-scale skeleton graphs and a convolutional neural network based on local vision and attention mechanisms, striking a better balance between representation capability and computational cost. A bidirectional feature-guided learning strategy based on hand motion, an adaptive feature fusion module, and an auxiliary loss on the static feature space let motion and appearance features guide and update each other, achieving adaptive fusion. Evaluated on the Drive & Act dataset, SIBBR-Net achieves an average accuracy of 61.78% on dynamic labels and 80.42% on static labels. Its computational cost is 25.92G floating-point operations (FLOPs), which is 76.96% lower than that of the best-performing method.
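As a quick sanity check on the efficiency claim, the two reported figures can be combined to estimate the cost of the method being compared against. The sketch below assumes that "76.96% lower" means a relative reduction measured against that method's cost. The helper name `implied_reference_cost` is illustrative, and the resulting value is derived here rather than reported in the paper.

```python
def implied_reference_cost(cost: float, reduction_pct: float) -> float:
    """Return the reference cost implied by `cost` being `reduction_pct` percent lower.

    Assumes the reduction is relative to the reference:
        cost = reference * (1 - reduction_pct / 100)
    """
    if not 0.0 <= reduction_pct < 100.0:
        raise ValueError("reduction_pct must be in [0, 100)")
    return cost / (1.0 - reduction_pct / 100.0)


# Figures taken from the abstract.
sibbr_gflops = 25.92   # cost of SIBBR-Net, in G floating-point operations
reduction_pct = 76.96  # reported reduction relative to the compared method

# Derived value (an inference from the two figures, not a reported result).
reference_gflops = implied_reference_cost(sibbr_gflops, reduction_pct)
print(f"{reference_gflops:.2f}")  # prints 112.50
```

Under that assumption, the compared method would require roughly 112.5G operations, about 4.3 times the cost of SIBBR-Net.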