Abstract

Identifying cultural and societal factors in large-scale consumer reviews (e.g., on Amazon, Yelp, and eBay) is critical for understanding customer behavior (Kim et al., 2023) and for informing business strategies based on user-generated content. However, manually identifying and labeling such factors for machine learning models is challenging because of the large volume of review data (Kaplan et al., 2020). Large language models (LLMs) are considered effective tools for accelerating the labeling of unstructured text (Ding et al., 2023), but their consistency in capturing culturally grounded patterns in consumer reviews remains insufficiently understood. Existing approaches often rely on single-model annotations, which limits the ability to observe variation across systems and may introduce model-specific biases in interpretation (Ding et al., 2023). This study seeks to evaluate multiple generative models, including widely used systems such as ChatGPT, Gemini, and Claude, under a unified prompting framework for assigning structured cultural labels to consumer reviews. Differences will be analyzed in terms of inter-model agreement, labeling consistency, and the ability to capture multilingual opinions. In particular, the study aims to examine whether these models differ in interpreting indirectly expressed opinions and contextual cues embedded in consumer reviews. The findings are expected to provide insights into the reliability and limitations of LLM-based annotation, highlighting the need for multi-model validation and careful evaluation when deploying LLMs in socio-cultural annotation pipelines.
