
This AI Paper Introduces PARSCALE (Parallel Scaling): A Parallel Computation Method for Efficient and Scalable Language Model Deployment

The pursuit of better performance in language models has pushed researchers to scale them up, typically by increasing the number of parameters or expanding the computation spent at inference. As a result, developing and deploying language models now depends heavily on the availability of large amounts of compute and memory.

Despite this progress, enlarging the model or generating more tokens to enhance reasoning poses significant challenges. Parameter scaling approaches, such as dense scaling and Mixture-of-Experts (MoE) scaling, increase the number of trainable weights and demand larger memory resources. Inference-time scaling, on the other hand, requires the model to generate longer sequences or perform multiple inference steps, which adds latency and slows deployment. While effective, these approaches do not suit every situation and fail to address deployment efficiency in low-resource settings such as mobile devices or embedded systems.

Researchers from Zhejiang University and Alibaba Group have proposed a new method called PARSCALE, short for parallel scaling. Rather than increasing the model size or the output length, this method increases the model's parallel computation during training and inference. By applying multiple learnable transformations to the input, the model performs several forward passes in parallel and dynamically aggregates their outputs. PARSCALE keeps the model's original parameter count while increasing computational diversity, making it an adaptable solution for a variety of tasks and model architectures without requiring specialized datasets or changes to the training protocol.
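The overall flow can be illustrated with a short sketch. The PyTorch-style code below is a simplified illustration rather than the authors' implementation: the class name `ParallelScalingWrapper`, the additive per-stream embeddings, and the `base_model` interface are assumptions made for brevity, while the paper itself differentiates streams with learnable prefixes (described in the next paragraph).

```python
import torch
import torch.nn as nn

class ParallelScalingWrapper(nn.Module):
    """Illustrative sketch: run P parallel streams of the same base model
    on one input and aggregate their outputs with dynamically learned weights."""

    def __init__(self, base_model, hidden_size, num_streams=4):
        super().__init__()
        self.base_model = base_model          # assumed: maps (batch, seq, hidden) -> (batch, seq, hidden)
        self.num_streams = num_streams        # P, the number of parallel streams
        # One learnable transformation per stream (here a simple additive offset).
        self.stream_embeddings = nn.Parameter(torch.randn(num_streams, hidden_size) * 0.02)
        # Small MLP that scores each stream's output for the dynamic weighted sum.
        self.aggregation_mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 4),
            nn.ReLU(),
            nn.Linear(hidden_size // 4, 1),
        )

    def forward(self, input_embeds):
        # input_embeds: (batch, seq_len, hidden)
        batch, seq_len, hidden = input_embeds.shape
        # Replicate the input P times, applying a different learnable transformation to each copy.
        streams = input_embeds.unsqueeze(1) + self.stream_embeddings.view(1, -1, 1, hidden)
        streams = streams.reshape(batch * self.num_streams, seq_len, hidden)
        # One batched forward pass covers all P streams in parallel on the GPU.
        hidden_states = self.base_model(streams)
        hidden_states = hidden_states.view(batch, self.num_streams, seq_len, hidden)
        # Dynamic weighted sum: the MLP assigns a softmax weight to each stream per token.
        weights = torch.softmax(self.aggregation_mlp(hidden_states), dim=1)  # (batch, P, seq, 1)
        return (weights * hidden_states).sum(dim=1)                          # (batch, seq, hidden)
```

The key design point the sketch captures is that all P streams share the same base-model weights; only the per-stream transformations and the aggregation MLP are new parameters.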

On a technical level, PARSCALE prepends several distinct learnable prefixes to the same input, producing multiple parallel versions of it. The model processes these versions simultaneously and aggregates their outputs using a dynamic weighted sum computed by a multi-layer perceptron. This structure adds only about 0.2% extra parameters per stream, a far smaller addition than full parameter scaling. Prefix tuning distinguishes each parallel stream through its own unique key-value (KV) cache, enabling efficient memory reuse. The approach also benefits from GPU-friendly parallelization, which keeps latency low despite the additional computation. The design scales without modifying the core architecture and can even be applied by training only the new prefixes and aggregation parameters.
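As a rough illustration of how prefix tuning can separate the streams, the sketch below gives each stream its own learnable key/value prefix inside a single attention layer. It assumes single-head attention without a causal mask, flattens the streams into the batch dimension, and uses hypothetical names such as `PrefixDifferentiatedAttention` and a prefix length of 16; none of these specifics are taken from the paper.

```python
import torch
import torch.nn as nn

class PrefixDifferentiatedAttention(nn.Module):
    """Illustrative attention layer in which the P parallel streams differ
    only by their private learnable key/value prefixes (prefix tuning)."""

    def __init__(self, hidden_size, num_streams=4, prefix_len=16):
        super().__init__()
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.k_proj = nn.Linear(hidden_size, hidden_size)
        self.v_proj = nn.Linear(hidden_size, hidden_size)
        # Learnable prefix keys/values, one set per stream: the only new
        # per-stream parameters (roughly 0.2% extra per stream in the paper's setup).
        self.prefix_k = nn.Parameter(torch.randn(num_streams, prefix_len, hidden_size) * 0.02)
        self.prefix_v = nn.Parameter(torch.randn(num_streams, prefix_len, hidden_size) * 0.02)
        self.num_streams = num_streams

    def forward(self, x):
        # x: (batch * num_streams, seq_len, hidden), streams flattened into the batch dimension
        bp, seq_len, hidden = x.shape
        batch = bp // self.num_streams
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Prepend each stream's private prefix to its keys and values.
        pk = self.prefix_k.repeat(batch, 1, 1)   # (batch * P, prefix_len, hidden)
        pv = self.prefix_v.repeat(batch, 1, 1)
        k = torch.cat([pk, k], dim=1)
        v = torch.cat([pv, v], dim=1)
        # Standard scaled dot-product attention (single head, no causal mask, for brevity).
        attn = torch.softmax(q @ k.transpose(-2, -1) / hidden ** 0.5, dim=-1)
        return attn @ v                           # (batch * P, seq_len, hidden)
```

Because only the prefix tensors differ across streams, the shared weights and the input's activations can be reused across all P forward passes, which is what keeps the memory and latency overhead small.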

The researchers conducted extensive experiments on models ranging from 0.5B to 4.4B parameters, with the number of parallel streams P set between 1 and 8. When trained on 42 billion tokens, models with P = 8 matched the performance of substantially larger parameter-scaled models while requiring far less memory and latency. Specifically, to reach the same performance on a 1.6B model, PARSCALE incurred a 22× smaller memory increase and a 6× smaller latency increase than parameter scaling. On downstream tasks, PARSCALE improved GSM8K by 34% and MMLU by 23%. Coding performance improved markedly as well: the 1.6B-parameter model with P = 8 achieved results comparable to a 4.4B-parameter model. The method also proved effective in post-training and parameter-efficient fine-tuning, maintaining high performance even when the core model parameters remained unchanged.

This work presents a strategy that rethinks how language models are scaled. Instead of inflating model size or multiplying reasoning steps, it reuses existing computation more effectively through parallel streams. The researchers' approach addresses time and memory inefficiencies while maintaining or improving performance. It marks a compelling shift in scaling methodology and points toward efficiently deploying advanced models in constrained environments using parallel computation.


Check out the paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, join our 95k+ ML SubReddit, and subscribe to our newsletter.


Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who researches applications in fields such as biomaterials and biomedical science. With a strong background in materials science, he explores new advancements and creates opportunities to contribute.
