Abstract
Binary analysis is central to cybersecurity research and practice, powering tasks such as malware detection, vulnerability discovery, exploit generation, and reverse engineering. Yet despite decades of innovation, binary analysis remains a specialized craft, hindered by steep learning curves, fragmented tools, and limited scalability.
This paper introduces Agentic Binary Analysis, an emerging paradigm that positions Large Language Models (LLMs) as autonomous agents capable of reasoning, planning, and interacting with binary-analysis toolchains. We explore how reasoning models—when coupled with structured interfaces and orchestration protocols such as the Model Context Protocol (MCP)—can autonomously perform complex analysis workflows. We demonstrate this concept through Dr.Binary, a practical agentic system that integrates AI reasoning with state-of-the-art binary-analysis tools.
1 Introduction
Binary analysis aims to understand compiled executables without source code access. It underlies malware forensics, firmware auditing, plagiarism detection, and universal binary hardening. Typical challenges include stripped symbols, compiler optimizations, and intentional obfuscation.
Despite remarkable academic progress—dynamic taint analysis, symbolic execution, hybrid fuzzing, and learning-based diffing—real-world adoption remains limited. Most tools are research prototypes requiring expert configuration and significant computational resources. Analysts often specialize in one sub-discipline, relying on ad-hoc scripting to combine heterogeneous techniques.
Meanwhile, LLMs have demonstrated reasoning, planning, and programming abilities across domains. They can interpret disassembly, generate scripts, and interface with external systems. These capabilities motivate a new question:
Can a large language model act as an autonomous binary-analysis agent?
2 Related Work
Our founder and CEO, Heng Yin, and his collaborators have published extensively on binary-analysis techniques; representative work is summarized below.
2.1 Traditional Static and Dynamic Analysis
Binary analysis has produced numerous specialized tools:
Dynamic Taint Analysis:
DECAF (ISSTA 2014) — “Building a Platform-Neutral Whole-System Dynamic Binary Analysis Platform”
DroidScope (USENIX Security 2012) — “Seamlessly Reconstructing the OS and Dalvik Semantic Views for Dynamic Android Malware Analysis”
DECAF++ (RAID 2019) — “Elastic Whole-System Dynamic Taint Analysis”
Fuzzing:
AFL-sensitive (RAID 2019) — “Be Sensitive and Collaborative: Analyzing Impact of Coverage Metrics in Greybox Fuzzing”
AFL-hier (NDSS 2021) — hierarchical seed scheduling for greybox fuzzing
Firm-AFL (USENIX Security 2019) — “High-Throughput Greybox Fuzzing of IoT Firmware via Augmented Process Emulation”
Concolic Execution:
SymFit (USENIX Security 2024) — “Making the Common (Concrete) Case Fast for Binary-Code Concolic Execution”
Marco (ICSE 2024) — “A Stochastic and Asynchronous Concolic Explorer”
JIGSAW (IEEE S&P 2022) — “Efficient and Scalable Path Constraints Fuzzing”
Hybrid Fuzzing:
DigFuzz (NDSS 2019) — hybrid, coverage-guided fuzzing
Pointer Analysis:
BinDSA (ISSTA 2025, Distinguished Paper Award) — “Efficient, Precise Binary-Level Pointer Analysis with Context-Sensitive Heap Reconstruction”
Each technique advances a narrow objective but rarely integrates seamlessly with others.
2.2 Learning-Based Binary Representation
Recent AI-driven approaches focus on code embeddings and similarity learning:
Genius (CCS 2016) — vector quantization for binary code embedding
Gemini (CCS 2017) — graph neural network–based function embedding
Asm2Vec (Oakland 2019) — treats instructions as “words” in a Doc2Vec-style representation model
PalmTree (CCS 2021) — a BERT-style pretrained assembly language model
StateFormer (FSE 2021) — fine-grained type recovery from binaries via generative state modeling
jTrans (ISSTA 2022) — a jump-aware Transformer for binary code similarity detection
CLAP (ISSTA 2024) — Learning Transferable Binary Code Representations with Natural Language Supervision
2.3 AI-Assisted Binary Diffing
DeepBinDiff (NDSS 2020) — “Learning Program-Wide Code Representations for Binary Diffing”
SigmaDiff (NDSS 2024) — “Semantics-Aware Deep Graph Matching for Pseudocode Diffing”
These systems combine neural embeddings with symbolic reasoning to compare binaries at scale. While effective, they remain tool-centric—requiring manual orchestration by experts.
3 Motivation
Binary analysis is powerful but inaccessible. The process involves complex configuration, limited interoperability, and long learning curves. Furthermore, despite automation, human analysts still perform most planning: selecting tools, defining objectives, interpreting results, and adjusting strategies.
At the same time, LLMs such as GPT-4/5 exhibit sophisticated reasoning, domain understanding, and code-generation capabilities. They can:
Comprehend disassembly and decompiled logic.
Write and execute analysis scripts (e.g., pwntools, angr).
Chain tools via natural-language reasoning and external API calls.
Explain intermediate results conversationally.
Agentic Binary Analysis leverages these traits to automate what was once an expert-only process.
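To make the scripting capability above concrete, the following is a minimal sketch (not taken from Dr.Binary) of the kind of angr script an agent might generate for a CTF-style task; the binary name and the find/avoid addresses are hypothetical placeholders.

```python
# Hedged sketch: an agent-generated angr script that searches for an input
# reaching a "success" branch. "./challenge" and the addresses are placeholders.
import angr

proj = angr.Project("./challenge", auto_load_libs=False)
simgr = proj.factory.simulation_manager(proj.factory.entry_state())

# Symbolically explore toward the hypothetical "win" block, avoiding the failure block.
simgr.explore(find=0x401337, avoid=0x401360)

if simgr.found:
    print("candidate stdin:", simgr.found[0].posix.dumps(0))  # concrete input bytes
```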
4 Methodology: From Assisted to Agentic
We define Agentic Binary Analysis as:
An LLM-centric approach in which the model autonomously plans, queries, interprets, and reasons over structured binary-analysis tasks through tool integration with minimal human intervention.
4.1 Core Concepts
Autonomous Planning: The LLM decomposes a high-level question (“Is this binary ransomware?”) into sub-tasks—disassembly, entropy check, API string scan, signature match—and executes them sequentially.
Tool Invocation: Using MCP, the agent calls external analysis tools (disassemblers, decompilers, symbolic engines).
Iterative Reasoning: The model interprets results, updates its plan, and re-invokes tools as needed.
Explainability: The entire process remains transparent via chat-style logs or reports.
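The loop below is a minimal, self-contained sketch of these four concepts working together. It is not Dr.Binary's implementation: the planner stub and the tool registry are hypothetical stand-ins for the LLM and an MCP tool server.

```python
# Hedged sketch of the plan -> invoke -> interpret loop described above.
from typing import Any, Callable

# Hypothetical tool registry; in a real system these would be MCP tool endpoints
# backed by a disassembler, decompiler, or symbolic engine.
TOOLS: dict[str, Callable[..., Any]] = {
    "strings":     lambda path: ["CryptEncrypt", "DeleteShadowCopies"],  # stub result
    "disassemble": lambda path, addr: "mov rax, rdi ...",                # stub result
}

def plan_next_step(question: str, history: list[dict]) -> dict:
    """Stand-in for the LLM planner: decide the next tool call, or finish."""
    if not history:
        return {"tool": "strings", "args": {"path": "sample.bin"}}
    return {"tool": "finish", "report": f"Observed {history[-1]['result']}"}

def analyze(question: str, max_steps: int = 10) -> str:
    history: list[dict] = []                      # transcript kept for iterative reasoning
    for _ in range(max_steps):
        step = plan_next_step(question, history)  # autonomous planning
        if step["tool"] == "finish":
            return step["report"]                 # explainable, step-by-step answer
        result = TOOLS[step["tool"]](**step["args"])  # tool invocation (MCP in practice)
        history.append({"step": step, "result": result})
    return "step budget exhausted"

print(analyze("Is this binary ransomware?"))
```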
5 System Overview: Dr.Binary
To validate the concept, we developed Dr.Binary, an interactive agentic analysis system integrating LLM reasoning with established research tools.
5.1 Demonstrated Applications
Ransomware Analysis: Identify encryption routines and classify malicious binaries.
ECU Firmware Diffing: Compare automotive Electronic Control Unit binaries to detect behavioral changes.
Backdoor Detection: Diff binary versions to isolate injected functions or altered control flows.
CTF Challenge Solving: Autonomously decompile, reason about logic, and generate exploit scripts.
5.2 Observations
LLMs exhibit strong comprehension of assembly and decompiled code.
They possess extensive cybersecurity domain knowledge.
They write complex scripts and chain tool outputs effectively.
They adapt plans based on runtime feedback, forming long multi-tool pipelines.
6 Design Considerations: AI Tool Interfaces
6.1 The Interface Challenge
A fundamental research question arises: how should LLMs interact with binary-analysis tools?
This resembles historical interface design debates:
Hardware ↔ Software → Instruction Sets (RISC vs CISC)
Kernel ↔ Userspace → System Calls (UNIX vs Windows)
6.2 Abstraction Trade-Offs
Low Abstraction: Expose raw disassembly and let the LLM perform semantic reasoning directly.
Flexible but costly and context-limited.
High Abstraction: Rely on tools to build data-flow graphs and pointer analyses (e.g., BinDSA).
Efficient and less prone to hallucination, but inherits the tools' limitations.
Middle Ground: Provide intermediate representations (e.g., structured CFGs, symbol tables) for LLM reasoning.
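To illustrate the middle-ground option, the snippet below shows a hypothetical function summary that a tool might hand the agent; the field names and values are illustrative, not a fixed schema.

```python
# Hedged sketch of a "middle ground" interface: a compact, structured summary of
# one function, cheap to place in the LLM context. All values are illustrative.
import json

function_summary = {
    "name": "sub_4012a0",
    "callees": ["CryptAcquireContextA", "CryptEncrypt", "WriteFile"],
    "strings": [".locked", "README_RESTORE_FILES.txt"],
    "cfg": {
        "blocks": ["0x4012a0", "0x401300", "0x401355"],
        "edges": [["0x4012a0", "0x401300"], ["0x401300", "0x401355"]],
    },
}

# The semantic judgment ("is this an encryption loop?") is left to the model,
# while the heavy lifting of recovering structure is done by the tool.
print(json.dumps(function_summary, indent=2))
```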
The optimal design may evolve into a distinct discipline—AI Tool Interface Design—akin to HCI but focused on machine-to-machine collaboration.
7 Evaluation and Preliminary Findings
7.1 Qualitative Assessment
Experiments with Dr.Binary show:
Rapid identification of malware signatures and behavioral patterns.
Automated generation of decompilation summaries and Python/angr scripts.
Comparable accuracy to human analysts on routine reverse-engineering tasks.
7.2 Limitations
Dynamic Behavior Reasoning: Reasoning about runtime behavior currently falls back on slow symbolic tools (e.g., angr); scriptable concolic engines such as SymFit or lightweight emulators such as Unicorn/Qiling may improve scalability (a minimal emulation sketch follows this list).
Context Overhead: Large functions and extended chats inflate LLM costs. Context optimization remains essential.
Scalability and Monetary Cost: Balancing LLM inference expenses with analysis depth is ongoing work.
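As a hedged example of the lightweight emulation mentioned above, the sketch below uses Unicorn to execute two hand-assembled x86-64 instructions and read back register state; the code bytes and addresses are illustrative, not drawn from any real sample.

```python
# Hedged sketch: concrete execution of a tiny code snippet with Unicorn, so an
# agent can observe register state without a full symbolic engine.
from unicorn import Uc, UC_ARCH_X86, UC_MODE_64
from unicorn.x86_const import UC_X86_REG_RAX

CODE = b"\x48\xc7\xc0\x05\x00\x00\x00"   # mov rax, 5
CODE += b"\x48\xff\xc0"                   # inc rax
BASE = 0x1000                             # illustrative load address (page-aligned)

mu = Uc(UC_ARCH_X86, UC_MODE_64)
mu.mem_map(BASE, 0x1000)                  # map one page for the code
mu.mem_write(BASE, CODE)
mu.emu_start(BASE, BASE + len(CODE))      # run until the end of the snippet
print("rax =", mu.reg_read(UC_X86_REG_RAX))   # expected: 6
```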
8 Future Work
Key directions for future research include:
Dynamic Integration: Linking LLMs with runtime emulation frameworks.
Benchmark Development: Creating standard datasets for agentic binary analysis evaluation.
Explainable AI in Security: Quantifying trust and interpretability of LLM-based decisions.
Scalable Tool Interfaces: Defining standardized schemas for AI–Tool communication.
AI Agent Testing: Developing methods to verify and evaluate autonomous analysis agents.
9 Conclusion
Agentic Binary Analysis marks a paradigm shift in reverse engineering and malware research. By elevating LLMs from assistive tools to autonomous agents, it bridges the gap between cutting-edge academic methods and practical security operations.
Through systems like Dr.Binary, analysts can interact with binaries conversationally while the AI coordinates complex static and dynamic analyses behind the scenes. Though challenges remain in scalability and interface design, the vision is clear: future binary analysis will be not only automated but agentic — adaptive, explainable, and continuously learning.