Paper Type

Research-in-Progress Paper

Description

Datasets used in IS research are growing quickly as digital technologies such as mobile devices, RFID, sensors, and online markets make more data available. Studies using tens or hundreds of thousands, or even millions, of records are no longer uncommon.

Linear regression is among the most popular statistical models in social science research. Linear probability models (LPMs), which are linear regression models applied to a binary outcome, are commonly used in many social science disciplines, despite criticisms of such usage. Surprisingly, LPMs are rare in the IS literature, where logit and probit regression models are typically used for binary outcomes.

Whether LPMs provide value or constitute an abuse has been discussed only for specific aspects. A thorough and broad evaluation of their pros and cons for different goals in different scenarios is missing. We carry out an extensive study to evaluate the advantages and dangers of LPMs, especially in the realm of Big Data that now affects IS research, where large samples and many variables are available. We evaluate performance in terms of coefficient estimation as well as predictive power, and we compare it to that of alternatives suggested in the literature. We find that the LPM is beneficial for explanatory modeling when the outcome is naturally binary, whereas it is beneficial for predictive modeling when the outcome is binary by dichotomization. In large-sample studies, IS researchers should consider LPMs for coefficient estimation if the outcome is naturally binary, but not if it is dichotomized. For predictive purposes, the LPM should be considered even with small samples. We motivate and illustrate our study using a large dataset on online auctions from eBay.
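To make the definition of an LPM concrete, the following is a minimal sketch of ordinary least squares applied to a 0/1 outcome with a single predictor. It is not the authors' code and does not use the eBay data: the synthetic data, the true coefficients (0.2 and 0.5), and the helper names `fit_lpm` and `predict_lpm` are assumptions chosen purely for illustration.

```python
import random


def fit_lpm(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    When y is coded 0/1, this is a one-predictor linear probability model:
    the fitted value is read as an estimate of P(y = 1 | x).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_lpm(intercept, slope, x_new):
    """Return the fitted 'probability' at x_new (not constrained to [0, 1])."""
    return intercept + slope * x_new


# Synthetic data (an assumption for illustration): P(y = 1 | x) = 0.2 + 0.5 * x
rng = random.Random(0)
x = [rng.uniform(0.0, 1.0) for _ in range(5000)]
y = [1 if rng.random() < 0.2 + 0.5 * xi else 0 for xi in x]

intercept, slope = fit_lpm(x, y)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")

# Extrapolating beyond the observed range of x can give a fitted value above 1.
print(f"fitted value at x = 3: {predict_lpm(intercept, slope, 3.0):.3f}")
```

In this sketch the slope is read directly as the change in the probability of the outcome per unit change in x. The last line shows one commonly cited criticism of LPMs: fitted values are not constrained to the interval [0, 1].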


LINEAR PROBABILITY MODELS IN INFORMATION SYSTEMS RESEARCH
