Delivering information across sensory modalities is often justified by the presumed independence of multi-modal information processing, which assumes that tasks in different modalities do not interfere with one another and therefore do not degrade performance. However, research in cognitive psychology shows that visual and auditory perceptual processing are closely linked. Current applications with voice-based interfaces exhibit problems related to memory and cognitive workload. For instance, mentally integrating disparate information tends to impose a heavy memory load, and switching attention between modalities can be slow and costly. This study focuses on how to design a visual-auditory information presentation that (1) minimizes interference in information processing between the visual and auditory channels and (2) improves the effectiveness of mental integration of information from different modalities. Baddeley's working memory model suggests that visuospatial imagery and verbal information can be held concurrently in separate subsystems of human working memory. Based on this model and research on human attention, this study proposes a method to convert textual information into a “graphics + voice” representation and hypothesizes that this dual-modal presentation will result in better comprehension performance and higher satisfaction than a purely textual display. The hypotheses will be tested with simple t-tests. The results of this study will benefit the interface design of generic computer systems by alleviating information overload in the visual display. The findings may also help to address usability problems associated with hand-held devices.