Abstract
In the field of choice modeling, the availability of ever-larger datasets has the potential to significantly expand our understanding of human behavior, but this prospect is limited by the poor scalability of discrete choice models: as sample sizes increase, the computational cost of maximum likelihood estimation quickly becomes intractable for all but trivial model structures. Efforts to tackle this issue have mainly focused on improving the optimization methods used to estimate discrete choice models, but an equally promising approach consists in subsampling datasets so as to reduce their size. This paper proposes a simple dataset reduction method specifically designed to preserve the diversity of observations present in the original dataset. Our approach leverages locality-sensitive hashing to create clusters of similar observations, from which representative observations are then sampled. We demonstrate the efficacy of our approach by applying it to a real-world mode choice dataset; the preliminary results suggest that a carefully selected subsample of observations can provide close-to-identical estimation results while being, by definition, less computationally demanding.
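The abstract describes clustering observations with locality-sensitive hashing and then sampling representatives from each cluster. As a rough illustration of that idea (not the authors' actual implementation), the sketch below uses random-hyperplane LSH to bucket observations and keeps one representative per bucket; the function name, the number of hyperplanes, and the choice of representative are all assumptions made for this example.

```python
import numpy as np

def lsh_subsample(X, n_planes=8, seed=0):
    """Illustrative sketch: bucket rows of X via random-hyperplane LSH,
    then keep one representative observation per bucket.

    X        : (n_obs, n_features) array of observations
    n_planes : number of random hyperplanes (assumed parameter)
    Returns sorted row indices of the retained representatives.
    """
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], n_planes))
    # The sign pattern of the projections gives each row a binary hash code;
    # rows with the same code land in the same bucket (similar observations).
    codes = X @ planes > 0
    keys = codes @ (1 << np.arange(n_planes))  # pack the bits into one integer key
    reps = {}
    for i, k in enumerate(keys):
        reps.setdefault(int(k), i)  # keep the first row seen in each bucket
    return sorted(reps.values())

# Usage: reduce a synthetic dataset of 1000 observations to at most 2**8 buckets.
X = np.random.default_rng(1).standard_normal((1000, 5))
idx = lsh_subsample(X)
subset = X[idx]
```

A real pipeline would presumably sample more than one representative per bucket (or weight them) so that the subsample preserves the original distribution of observation types, but the bucketing step is the core of the LSH idea.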