Open to PhD · Fall 2027

Satwik Pandey

AI Research Engineer at VFS Global. Researching reliability, reasoning, and interpretability of large models.


Hi, I’m Satwik.

At VFS Global, I work on production document intelligence for multi-country visa processing. My work spans large-scale extraction and verification pipelines, multimodal validation, and uncertainty-aware systems for workflows that process 4M+ documents daily. More broadly, I’m interested in building AI systems that are not just accurate, but reliable, monitorable, and robust under real-world constraints.

My research lives at the intersection of reliability, reasoning, and interpretability in large models. I’m interested in how language and multimodal systems behave once they leave clean benchmark settings: how they reason, how they fail, how they express uncertainty, and how we can build better verification, correction, and evaluation mechanisms around them. A lot of my recent work has focused on uncertainty quantification, but more broadly I care about making large models more interpretable, measurable, and trustworthy enough to deploy in settings where “usually right” is not good enough.

Previously, I worked on trustworthy reasoning for LLMs at UCSC’s AIEA Lab, built applied RAG and automation systems at Mesha, and developed LLM-based carbon estimation pipelines at CleanTech Mart.

In Review · COLM '26

S. Pandey et al.

SELFDOUBT: Uncertainty Quantification for Reasoning LLMs via the Hedge-to-Verify Ratio

Proposed an O(1) black-box uncertainty framework that extracts behavioral hedge/verify signals from reasoning traces; it significantly outperforms Semantic Entropy on discrimination (p = 0.001) at 10× lower cost, and a zero-hedge gate achieves 96.1% precision across 7 models and 3 benchmarks.

In Review · UAI '26

S. Raghu, S. Pandey

Don't Blink: Evidence Collapse during Multimodal Reasoning

Identified a universal evidence-collapse phenomenon in reasoning VLMs, with visual attention dropping by up to 90.8% during generation, and characterized a task-conditional failure regime in which confident but visually disengaged predictions are hazardous on sustained visual-reference tasks but benign on symbolic tasks.

In Review · JSS

S. Pandey et al.

Repair of Thought: Advancing Automated Program Repair through a Dual-Model Reasoning Framework

Introduced a function-level APR framework that achieves a state-of-the-art 83.1% plausible-repair rate on Defects4J, with an automated verification pipeline combining AST alignment, control-flow symbolic analysis, and semantic checks.