Phantom: Multi-modal reasoning in VLMs without rendering images
Although vision-language models (VLMs) understand both text and images well, they typically reason in text alone, which limits their ability to solve tasks that require visual thinking, such as spatial puzzles. People will naturally...