Multimodal Models Only: Image processing requires models with vision capabilities. Currently, only Qwen3-VL 30B supports image and video inputs. Other models (DeepSeek, Llama, Qwen Coder) are text-only and cannot process images.
See the Model Catalog for complete model specifications and multimodal capabilities.
Image processing works through the chat/completions endpoint using base64-encoded images. Images are sent as data URLs in the message content alongside your text prompt.
There are several ways to convert your images to base64 format:
Copy
Ask AI
# Convert image to base64base64 -i image.jpg -o image_base64.txt# Or use it directly in your terminalbase64 image.jpg | pbcopy # Copies to clipboard on macOS