Multimodal LLMs can be used to read and solve CAPTCHAs in agentic setups, where the model is one component of an automated system that interacts with external environments. With native image input or a paired vision model, an LLM can process a CAPTCHA image, extract the text, and in some cases defeat challenges designed to distinguish humans from bots. In an agentic setup, the model coordinates with other tools, such as browser automation scripts or APIs, to submit the solved response dynamically, which enables persistent, automated access to restricted systems. More advanced setups may add reinforcement learning or an external OCR (Optical Character Recognition) model to improve accuracy over time. This capability raises security concerns: it weakens CAPTCHA’s effectiveness as a bot-mitigation technique and allows AI-driven agents to interact with websites and services intended for human-only access.
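The control flow of such an agentic setup can be sketched as a simple retry loop: fetch a challenge, pass the image to a reader, submit the result, and repeat on failure. The sketch below is purely illustrative and runs entirely against stand-in components. `Challenge`, `run_agent_loop`, `FakeSite`, and `make_fake_reader` are hypothetical names introduced here, and no real model, browser automation library, or website is involved.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Challenge:
    """A challenge as the agent sees it: raw image bytes only."""
    image: bytes


def run_agent_loop(
    fetch_challenge: Callable[[], Challenge],
    read_image: Callable[[bytes], str],
    submit_answer: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[int]:
    """Return the 1-based attempt that succeeded, or None if all attempts failed."""
    for attempt in range(1, max_attempts + 1):
        challenge = fetch_challenge()               # tool call: observe the environment
        guess = read_image(challenge.image).strip()  # model call: image -> text
        if submit_answer(guess):                    # tool call: act on the environment
            return attempt
    return None


class FakeSite:
    """Stand-in for the external environment (assumption: no real site is contacted)."""

    def __init__(self, answers: List[str]) -> None:
        self._answers = list(answers)
        self._current: Optional[str] = None

    def fetch_challenge(self) -> Challenge:
        self._current = self._answers.pop(0)
        # The "image" is just the answer encoded as bytes, purely for the demo.
        return Challenge(image=self._current.encode())

    def submit_answer(self, guess: str) -> bool:
        return guess == self._current


def make_fake_reader() -> Callable[[bytes], str]:
    """Stand-in for the model call: misreads the first image, then reads correctly."""
    calls = {"n": 0}

    def read_image(image: bytes) -> str:
        calls["n"] += 1
        text = image.decode()
        # Reverse the text on the first call to simulate a misread.
        return text[::-1] if calls["n"] == 1 else text

    return read_image


site = FakeSite(["abc12", "xk9qz"])
result = run_agent_loop(site.fetch_challenge, make_fake_reader(), site.submit_answer)
print(result)  # 2: the first attempt is misread, the second succeeds
```

Passing the three steps in as plain callables keeps the loop independent of any particular model or automation tool, which mirrors the coordinating role the text describes: the model only reads, and the surrounding system decides when to fetch, submit, and retry.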