Kirill Solodskih, Co-Founder and CEO of TheStage AI – Interview Series

Dr. Kirill Solodskih is the co-founder and CEO of TheStage AI, an experienced AI researcher, and an entrepreneur with more than a decade of experience optimizing neural networks for real-world deployment. In 2024, he co-founded TheStage AI, which secured $4.5 million in funding to fully automate neural network acceleration across any hardware platform.
Previously, as a team lead at Huawei, Kirill drove the acceleration of AI camera applications on Qualcomm NPUs, contributed to the performance of the P50 and P60 smartphones, and earned multiple patents for his innovations. His research has won awards at leading conferences such as CVPR and ECCV and has received industry-wide recognition. He also hosts a podcast on AI optimization and inference.
What inspired you to co-found TheStage AI, and how did you transition from academic research to entrepreneurship in inference optimization?
What ultimately became the foundation of TheStage AI started with my work at Huawei, where I worked deeply on automating and optimizing neural networks. Those projects became the basis for some of our core innovations, and that is where I saw the real challenge. Training a model is one thing; getting it to run efficiently in the real world and making it accessible to users is another. Deployment is the bottleneck that keeps many great ideas from coming to life. Making something as easy to use as ChatGPT involves a great deal of backend work. From a technical point of view, neural network optimization is about minimizing parameters while maintaining high performance, and that is a tricky mathematical problem with plenty of room for innovation.
Manual inference optimization has long been a bottleneck in AI. Can you explain how TheStage AI automates this process and why it is a game-changer?
TheStage AI tackles a major bottleneck in AI: the manual compression and acceleration of neural networks. Neural networks have billions of parameters, and figuring out by hand which ones to remove for better performance is nearly impossible. ANNA (Automated Neural Networks Analyzer) automates this process, determining which layers should be excluded from optimization, much as ZIP compression automated file compression.
This changes the game by making AI adoption faster and more affordable. Instead of relying on expensive manual processes, startups can optimize models automatically. The technology gives businesses a clear view of performance and cost, ensuring efficiency and scalability without guesswork.
TheStage AI claims to reduce inference costs by up to 5x – what makes your optimization techniques so effective compared to traditional methods?
TheStage AI goes beyond traditional optimization methods to reduce inference costs by up to 5x. Instead of applying the same algorithm to the entire neural network, ANNA breaks it down into smaller layers and decides which algorithm to apply to each part, delivering the required compression while maximizing model quality. By combining smart mathematical heuristics with efficient approximations, our approach is highly scalable and makes AI adoption easier for businesses of all sizes. We also integrate flexible compiler settings to optimize networks for specific hardware, such as iPhones or NVIDIA GPUs. This gives us greater control to fine-tune performance, increasing speed without sacrificing quality.
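ANNA's actual selection logic is proprietary, but the general idea of assigning a compression method to each layer under a global size budget can be sketched with a toy greedy heuristic. All layer names, parameter counts, sensitivity scores, and method ratios below are made up for illustration:

```python
# Toy sketch of per-layer method selection under a global size budget.
# Layer names, sizes, and sensitivity scores are illustrative, not ANNA's.

# (method name, fraction of original size kept after compression)
METHODS = [("int8-quant", 0.25), ("int4-quant", 0.125)]

def choose_methods(layers, budget_ratio):
    """layers: list of (name, param_count, sensitivity in [0, 1]).
    Greedily apply stronger compression to the least sensitive layers
    until the model fits within budget_ratio of its original size."""
    total = sum(params for _, params, _ in layers)
    keep = {name: 1.0 for name, _, _ in layers}   # fraction of size kept
    plan = {name: "none" for name, _, _ in layers}

    def size():
        return sum(params * keep[name] for name, params, _ in layers)

    for method, frac in METHODS:                  # weakest method first
        for name, _, _ in sorted(layers, key=lambda l: l[2]):
            if size() <= budget_ratio * total:
                return plan
            keep[name], plan[name] = frac, method
    return plan

layers = [
    ("stem",   1_000_000, 0.9),   # sensitive: compress last, if at all
    ("block1", 4_000_000, 0.2),
    ("block2", 4_000_000, 0.3),
    ("head",   1_000_000, 0.8),
]
plan = choose_methods(layers, budget_ratio=0.5)
```

With these numbers, quantizing only the two insensitive blocks to int8 already meets the 50% budget, so the sensitive stem and head are left untouched.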
How much does TheStage AI accelerate inference compared to PyTorch’s native compiler, and what advantages does that give AI developers?
TheStage AI accelerates inference well beyond PyTorch’s native compiler. PyTorch uses a just-in-time compilation approach, compiling the model each time it runs. This can lead to long startup times, sometimes several minutes or more. In scaled environments, it creates inefficiencies, especially when new GPUs must be spun up to handle increased user load, producing latency that degrades the user experience.
By contrast, TheStage AI allows models to be precompiled, so they can be deployed the moment they are ready. This means faster rollouts, better serving efficiency, and cost savings. Developers can deploy and scale AI models quickly without the bottlenecks of traditional compilation, making services more efficient and responsive in high-demand use cases.
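The ahead-of-time idea described above can be sketched in a few lines: compile once offline, serialize the resulting artifact, and have each serving replica load it instead of recompiling. The `compile_model` function and its 0.1-second delay are stand-ins invented for this sketch, not TheStage AI's actual pipeline:

```python
import pickle
import time

def compile_model(graph):
    """Stand-in for an expensive compilation step (kernel selection,
    operator fusion, ...). The sleep simulates compile latency."""
    time.sleep(0.1)
    return {"ops": graph["ops"], "fused": True}

# Ahead of time (once, offline): compile and serialize the artifact.
graph = {"ops": ["matmul", "relu", "matmul"]}
artifact = pickle.dumps(compile_model(graph))

# At serving time, each new replica loads the precompiled artifact
# instead of recompiling, so cold starts are near-instant.
t0 = time.perf_counter()
plan = pickle.loads(artifact)
cold_start = time.perf_counter() - t0
```

The one-time compile cost is paid offline; every replica thereafter starts in the time it takes to deserialize the artifact.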
Can you share more about TheStage AI’s QLIP toolkit and how it enhances model performance while maintaining quality?
QLIP, TheStage AI’s toolkit, is a Python library that provides the primitives needed to quickly build new optimization algorithms tailored to different hardware, such as GPUs and NPUs. The toolkit includes components such as quantization, pruning, sparsification, compilation, and serving, all of which are critical for developing efficient, scalable AI systems.
What makes QLIP different is its flexibility. It lets AI engineers implement new algorithms with just a few lines of code. For example, within minutes you can use QLIP’s primitives to turn an AI conference paper on quantized neural networks into a working algorithm. This makes it easy for developers to integrate the latest research into their models without being blocked by a rigid framework.
Unlike traditional open-source frameworks that limit you to a fixed set of algorithms, QLIP allows anyone to add new optimization techniques. This adaptability helps teams stay ahead in a fast-moving AI landscape, improving performance while remaining open to future innovations.
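QLIP's actual API is not shown in the interview, so as a neutral illustration of the kind of primitive such a toolkit composes, here is a minimal 8-bit affine quantize/dequantize pair in NumPy. The function names and shapes are this sketch's own, not QLIP's:

```python
import numpy as np

def quantize_int8(w):
    """Affine (asymmetric) 8-bit quantization of a weight tensor.
    Returns integer codes plus the scale/zero-point needed to decode."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = round(-lo / scale)
    q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float weights from the integer codes."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
q, s, z = quantize_int8(w)
w_hat = dequantize(q, s, z)
err = float(np.abs(w - w_hat).max())  # reconstruction error, on the order of the scale
```

Storing `q` instead of `w` cuts memory 4x (uint8 vs float32) at the cost of a small, bounded reconstruction error; a toolkit like QLIP layers algorithm design and hardware-specific code generation on top of primitives of roughly this shape.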
You contributed to the AI quantization framework used in the Huawei P50 and P60 cameras. How did that experience shape your approach to AI optimization?
My work on the AI quantization framework for the Huawei P50 and P60 gave me valuable insight into how optimization can be simplified and scaled. When I first started, PyTorch’s handling of a neural network’s full execution graph was rigid, and quantization algorithms had to be implemented manually, layer by layer. At Huawei, I built a framework that automated the process: you simply feed in the model, and it automatically generates the quantized code, eliminating the manual work.
This made me realize that automation in AI optimization is about achieving speed without sacrificing quality. One of the algorithms I developed and patented became crucial for Huawei, especially when the company had to transition from Kirin processors to Qualcomm due to sanctions. It enabled teams to quickly adapt neural networks to Qualcomm’s architecture without losing performance or accuracy.
By simplifying and automating the process, we cut development time from over a year to just a few months. That had a huge impact on products used by millions of people, and it shaped my approach to optimization: focus on speed, efficiency, and quality with minimal losses. That is the mindset I brought to ANNA.
Your research has been presented at CVPR and ECCV – what are the key breakthroughs in AI efficiency that you are most proud of?
When asked about my achievements in AI efficiency, I always come back to the paper selected for an oral presentation at CVPR 2023. Being chosen for an oral at that conference is rare; only 12 papers were selected. Add to that the fact that generative AI usually dominates the spotlight, while our paper took a different approach, focusing on the mathematical side: the analysis and compression of neural networks.
We developed a method that helps us understand how many parameters a neural network really needs to operate effectively. By applying techniques from functional analysis and moving from discrete to continuous formulations, we achieved strong compression results while retaining the ability to integrate those changes back into the model. The paper also introduced several new algorithms that the community had not used before and that have since found further application.
It was one of my first papers in the field of AI and, importantly, it was the result of a collective effort by our team, including my co-founder. It was an important milestone for all of us.
Can you explain how Integral Neural Networks (INNs) work and why they are an important innovation in deep learning?
Traditional neural networks use fixed matrices, similar to Excel tables, where the sizes and parameters are predetermined. INNs, by contrast, describe the network as a continuous function, which provides far greater flexibility. Think of it as a blanket draped over pins of different heights, tracing out a continuous surface.
What makes INNs exciting is their ability to be dynamically “compressed” or “rescaled” based on available resources, similar to how an analog signal is digitized into sound. You can shrink the network without sacrificing quality and scale it back up when needed, without retraining.
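The resampling intuition behind INNs can be sketched very simply: treat a row of discrete weights as samples of a continuous function, then re-sample that function at whatever resolution the hardware allows. The real method involves training the continuous representation; this sketch uses plain linear interpolation purely to illustrate the idea:

```python
import numpy as np

def resample_row(weights, new_size):
    """Treat a discrete weight row as samples of a continuous function
    on [0, 1] and re-sample it at a different resolution."""
    old_x = np.linspace(0.0, 1.0, num=len(weights))
    new_x = np.linspace(0.0, 1.0, num=new_size)
    return np.interp(new_x, old_x, weights)

row = np.array([0.0, 1.0, 0.0, -1.0, 0.0])   # one "wave" of weights
small = resample_row(row, 3)   # "compress": fewer samples, same shape
big = resample_row(row, 9)     # "rescale up": more samples, no retraining
```

The same underlying function yields a 3-weight or a 9-weight row on demand, which is the flexibility a fixed matrix cannot offer.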
We tested this, and while traditional compression methods caused significant quality loss, INNs maintained quality close to the original even under extreme compression. The math behind it is unconventional for the AI community, but the real value is that it delivers solid practical results with minimal effort.
TheStage AI is already working on quantum annealing algorithms – how do you see quantum computing playing a role in AI optimization in the near future?
When it comes to quantum computing and its role in AI optimization, the key point is that quantum systems offer a completely different way of solving problems such as optimization. We did not invent quantum annealing algorithms from scratch; companies like D-Wave offer Python libraries for building quantum algorithms for discrete optimization tasks, which are a natural fit for quantum computers.
The idea is not to load neural networks directly onto quantum computers; current architectures make that impossible. Instead, we approximate how neural networks behave under different kinds of degradation so that the problem fits what quantum chips can handle.
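Annealers of the kind D-Wave offers expect discrete problems expressed as a QUBO (quadratic unconstrained binary optimization). As a hedged illustration of that framing, and not TheStage AI's actual formulation, here is a tiny QUBO that decides which layers to compress, with the benefit and conflict numbers invented for the example and solved by brute force as a classical stand-in for an annealer:

```python
import itertools

# Binary variable x_i = 1 means "compress layer i".
# Linear terms reward the memory saved; quadratic terms penalize
# compressing certain layer pairs together (a made-up interaction).
savings = [3.0, 5.0, 2.0, 4.0]          # per-layer benefit of compressing
penalty = {(0, 1): 4.0, (2, 3): 3.0}    # pairwise conflict costs

def energy(x):
    """QUBO convention: lower energy = better, so benefit is negated."""
    e = -sum(s * xi for s, xi in zip(savings, x))
    e += sum(p * x[i] * x[j] for (i, j), p in penalty.items())
    return e

# An annealer would sample low-energy states; with 4 variables we can
# simply enumerate all 2^4 assignments.
best = min(itertools.product([0, 1], repeat=len(savings)), key=energy)
```

Here the optimum compresses layers 1 and 3 only, since compressing either adjacent partner would incur a conflict penalty larger than the extra saving. A real annealer explores such energy landscapes in parallel rather than by enumeration, which is where the speedup would come from.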
In the future, quantum systems could scale and optimize networks with a precision that classical systems struggle to match. The advantage of quantum systems is their built-in parallelism, which classical systems can only simulate at the cost of additional resources. This means quantum computing could significantly speed up the optimization process, especially once we figure out how to effectively model larger, more complex networks.
The real potential lies in using quantum computing to solve enormous, complex optimization tasks by breaking parameters down into smaller, more manageable groups. With technologies such as quantum and optical computing, the possibilities for optimizing AI go far beyond what traditional computing can offer.
What is your long-term vision for TheStage AI? Where do you see inference optimization in the next 5-10 years?
In the long run, our goal is for TheStage AI to become a global model hub where anyone can easily access an optimized neural network with the characteristics they need, whether for a smartphone or any other device. The aim is a drag-and-drop experience: the user enters their parameters and the system automatically assembles the network. If the network does not exist yet, it is created automatically using ANNA.
Our goal is to enable neural networks to run directly on user devices, cutting costs by 20 to 30 times. In the future, this could eliminate those costs almost entirely, since the user’s device handles the computation instead of cloud servers. Combined with advances in model compression and hardware acceleration, this will make AI deployment far more efficient.
We also plan to combine our technology with hardware such as sensors, chips, and robotics for applications in areas like autonomous driving. For example, we aim to build AI cameras that can operate in any environment, including space or extreme conditions such as darkness or dust. This would make AI usable in a far wider range of applications and allow us to create custom solutions for specific hardware and use cases.
Thank you for the great interview; readers who wish to learn more should visit TheStage AI.