Recent advances in artificial intelligence (AI) and deep learning have enabled smart products, such as smart toys and robotic dogs, to interact with humans more intelligently and to express emotions. As a result, such products are becoming densely sensorized and integrate multi-modal interaction techniques to detect and infer emotions from spoken utterances, motions, pointing gestures, and observed objects, and to plan their actions accordingly. However, even for inference alone, a practical challenge for these smart products is that deep learning algorithms typically require high computing power, especially when a multi-modal method is applied. Moreover, the memory footprint of deep learning models often exceeds the limits of many low-end mobile computing devices as model complexity grows. In this study, we explore the application of lightweight deep neural networks, the SqueezeDet model and the Single Shot MultiBox Detector (SSD) model with MobileNet as the backbone, to detect a dog's favorite objects. These lightweight models are intended to be integrated into a multi-modal emotional support robotics system designed for a smart robot dog. We also outline our future research directions.