DOI: 10.1145/2939672.2939785
Research article · Public Access

XGBoost: A Scalable Tree Boosting System

Published: 13 August 2016

ABSTRACT

Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.
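The released xgboost package exposes these ideas through a small training API. Below is a minimal, illustrative sketch (not taken from the paper) showing how the Python interface is typically used on data with missing values, which exercises the sparsity-aware split finding the abstract describes; the toy matrix and parameter values are hypothetical, and "tree_method": "approx" selects the quantile-sketch-based approximate algorithm.

```python
# Minimal sketch, assuming the xgboost Python package is installed.
import numpy as np
import xgboost as xgb

# Toy feature matrix where np.nan marks missing entries; XGBoost's
# sparsity-aware split finding learns a default direction for them.
X = np.array([[1.0, np.nan, 3.0],
              [0.5, 2.0, np.nan],
              [np.nan, 1.5, 0.2],
              [2.0, 0.3, 1.0]])
y = np.array([1, 0, 1, 0])

dtrain = xgb.DMatrix(X, label=y, missing=np.nan)

params = {
    "objective": "binary:logistic",  # logistic loss for binary labels
    "max_depth": 3,                  # depth of each regression tree
    "eta": 0.1,                      # shrinkage (learning rate)
    "tree_method": "approx",         # approximate split finding via quantile sketch
}
bst = xgb.train(params, dtrain, num_boost_round=20)
preds = bst.predict(dtrain)  # predicted probabilities
```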


Supplemental Material

kdd2016_chen_boosting_system_01-acm.mp4 (MP4, 396.1 MB)


Published in

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2016, 2176 pages
ISBN: 9781450342322
DOI: 10.1145/2939672

        Copyright © 2016 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 13 August 2016


Acceptance Rates

KDD '16 paper acceptance rate: 66 of 1,115 submissions (6%). Overall acceptance rate: 1,133 of 8,635 submissions (13%).
