NVIDIA has announced an important milestone in scalable machine learning: XGBoost 3.0 can now train gradient-boosted decision tree (GBDT) models on datasets ranging from gigabytes up to 1 terabyte (TB) on a single GH200 Grace Hopper Superchip. This breakthrough allows companies to handle huge datasets for applications such as fraud detection, credit risk modeling, and algorithmic trading, while simplifying the once-complex machine learning (ML) pipeline.
Breaking the terabyte barrier
At the core of this progress is the new external-memory quantile DMatrix in XGBoost 3.0. Traditionally, GPU training has been limited by available GPU memory, which caps the dataset size that can be handled or forces teams onto complex multi-node frameworks. The new release takes advantage of the Grace Hopper Superchip's coherent memory architecture and its 900 GB/s NVLink-C2C bandwidth. This allows data to be streamed directly from host RAM to the GPU, removing the bottlenecks and memory constraints that previously required servers with massive RAM or large GPU clusters.
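As a rough illustration of what that host-to-GPU streaming setup can look like in code, the sketch below pairs XGBoost with the RAPIDS Memory Manager (RMM) so GPU allocations use CUDA managed (unified) memory, which on Grace Hopper is backed by host RAM over NVLink-C2C. This is a minimal sketch under assumptions of our own; the memory-resource choice and context usage are illustrative, not a configuration prescribed by NVIDIA or the article.

```python
import rmm
import xgboost as xgb

# Route GPU allocations through RMM using CUDA managed (unified) memory,
# so data larger than the GPU's HBM3 can spill into the Grace CPU's LPDDR5X
# and migrate over NVLink-C2C on demand. (Illustrative choice, not from
# the article.)
rmm.reinitialize(managed_memory=True)

# Tell XGBoost to allocate through RMM for the duration of training.
with xgb.config_context(use_rmm=True):
    pass  # build the external-memory DMatrix and train here (see the later sketches)
```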
Real-world benefits: speed, simplicity and cost savings
The Royal Bank of Canada (RBC) and other institutions have reported 16x faster model training and a 94% reduction in total cost of ownership (TCO) after moving their predictive analytics pipelines to GPU-powered XGBoost. This leap in efficiency matters most for workflows with continuous model tuning and rapidly changing data volumes, letting banks and businesses optimize their features faster as data grows.
How it works: external memory meets XGBoost
The new external-memory approach introduces several innovations:
- External-memory quantile DMatrix: pre-bins all features into quantile buckets, compresses the data in host RAM, and streams it to the GPU as needed, maintaining accuracy while reducing GPU memory load.
- Scalability on a single chip: a GH200 Superchip, with 80 GB of HBM3 GPU memory plus 480 GB of LPDDR5X system RAM, can now handle a full terabyte-scale dataset, a task that previously required multi-GPU clusters.
- Simpler integration: for data science teams using RAPIDS, enabling the new method is straightforward and requires only minimal code changes (see the sketch below).
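To make the integration concrete, here is a minimal sketch following the DataIter pattern from the XGBoost external-memory tutorial, assuming XGBoost 3.0's ExtMemQuantileDMatrix API: a custom iterator yields batches from hypothetical Parquet shards kept in host RAM, and an external-memory quantile DMatrix is built from it for GPU hist training. The file paths, the "label" column name, and the parameter values are placeholders, not details from the article.

```python
import os
import pandas as pd
import xgboost as xgb

class ParquetBatchIter(xgb.DataIter):
    """Yield pre-sharded Parquet files one batch at a time (hypothetical layout)."""

    def __init__(self, shard_paths):
        self._paths = shard_paths
        self._idx = 0
        # cache_prefix tells XGBoost where to keep its external-memory cache files.
        super().__init__(cache_prefix=os.path.join(".", "cache"))

    def next(self, input_data):
        if self._idx == len(self._paths):
            return False  # signal: no more batches
        df = pd.read_parquet(self._paths[self._idx])  # batch stays in host RAM
        input_data(data=df.drop(columns=["label"]), label=df["label"])
        self._idx += 1
        return True

    def reset(self):
        self._idx = 0

it = ParquetBatchIter([f"data/part_{i:03d}.parquet" for i in range(100)])

# Features are pre-binned into quantiles and compressed on the host,
# then streamed to the GPU during training.
Xy = xgb.ExtMemQuantileDMatrix(it, max_bin=256)

booster = xgb.train({"device": "cuda", "tree_method": "hist"}, Xy, num_boost_round=500)
```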
Technical best practices
- Use grow_policy='depthwise' for tree construction; it performs best with external memory (a parameter sketch follows this list).
- Use CUDA 12.8+ and an HMM-enabled driver for full Grace Hopper support.
- Data shape matters: the number of rows (labels) is the main limiter for scaling; wider or taller tables deliver comparable performance on the GPU.
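A quick sketch of how those recommendations translate into training parameters; Xy stands for an external-memory DMatrix like the one built in the earlier sketch, and the numeric values are placeholders rather than recommendations from the article.

```python
import xgboost as xgb

params = {
    "device": "cuda",            # train on the GPU
    "tree_method": "hist",       # histogram-based tree construction
    "grow_policy": "depthwise",  # recommended for external-memory training
    "max_depth": 8,              # placeholder value
    "learning_rate": 0.1,        # placeholder value
}

# Xy is an external-memory (quantile) DMatrix built from a DataIter,
# as in the earlier sketch.
booster = xgb.train(params, Xy, num_boost_round=300)
```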
Additional upgrades
Other highlights in XGBoost 3.0 include:
- Experimental support for distributed external memory across GPU clusters (a rough Dask sketch follows this list).
- Reduced memory requirements and faster initialization, especially for mostly dense data.
- Support for categorical features, quantile regression, and model explainability under external-memory mode.
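For the distributed path, the sketch below shows what multi-GPU training with Dask and RAPIDS generally looks like; the article does not give the exact switches for the experimental distributed external-memory mode, so this only illustrates the surrounding cluster setup, with hypothetical paths and column names.

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf
from xgboost import dask as dxgb

# One Dask worker per local GPU; a real deployment would point Client at a
# multi-node scheduler instead.
with LocalCUDACluster() as cluster, Client(cluster) as client:
    df = dask_cudf.read_parquet("data/*.parquet")  # hypothetical dataset
    X, y = df.drop(columns=["label"]), df["label"]

    # Quantile DMatrix with partitions spread across the workers' GPUs.
    dtrain = dxgb.DaskQuantileDMatrix(client, X, y)

    result = dxgb.train(
        client,
        {"device": "cuda", "tree_method": "hist"},
        dtrain,
        num_boost_round=200,
    )
    booster = result["booster"]
```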
Industry impact
By bringing terabyte-scale GBDT training to a single chip, NVIDIA democratizes access to large-scale machine learning for financial and enterprise users, paving the way for faster iteration, lower costs, and reduced IT complexity.
Together, XGBoost 3.0 and the Grace Hopper Superchip mark a significant leap in scalable, accelerated machine learning.
Michal Sutter is a data science professional with a master’s degree in data science from the University of Padua. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels in transforming complex data sets into actionable insights.