Abstract
Binary analysis is central to cybersecurity research and practice, powering tasks such as malware detection, vulnerability discovery, exploit generation, and reverse engineering. Yet despite decades of innovation, binary analysis remains a specialized craft, hindered by steep learning curves, fragmented tools, and limited scalability.
This paper introduces Agentic Binary Analysis, an emerging paradigm that positions Large Language Models (LLMs) as autonomous agents capable of reasoning, planning, and interacting with binary-analysis toolchains. We explore how reasoning models—when coupled with structured interfaces and orchestration protocols such as the Model Context Protocol (MCP)—can autonomously perform complex analysis workflows. We demonstrate this concept through Dr.Binary, a practical agentic system that integrates AI reasoning with state-of-the-art binary-analysis tools.
1 Introduction
Binary analysis aims to understand compiled executables without source code access. It underlies malware forensics, firmware auditing, plagiarism detection, and universal binary hardening. Typical challenges include stripped symbols, compiler optimizations, and intentional obfuscation.
Despite remarkable academic progress—dynamic taint analysis, symbolic execution, hybrid fuzzing, and learning-based diffing—real-world adoption remains limited. Most tools are research prototypes requiring expert configuration and significant computational resources. Analysts often specialize in one sub-discipline, relying on ad-hoc scripting to combine heterogeneous techniques.
Meanwhile, LLMs have demonstrated reasoning, planning, and programming abilities across domains. They can interpret disassembly, generate scripts, and interface with external systems. These capabilities motivate a new question:
Can a large language model act as an autonomous binary-analysis agent?
2 Related Work
Our founder and CEO, Heng Yin, and his collaborators have published extensively on binary-analysis techniques; representative work is summarized below.
2.1 Traditional Static and Dynamic Analysis
Binary analysis has produced numerous specialized tools:
Dynamic Taint Analysis:
DECAF (ISSTA 2014) — “Building a Platform-Neutral Whole-System Dynamic Binary Analysis Platform”
DroidScope (USENIX Security 2012) — “Seamlessly Reconstructing the OS and Dalvik Semantic Views for Dynamic Android Malware Analysis”
DECAF++ (RAID 2019) — “Elastic Whole-System Dynamic Taint Analysis”
Fuzzing:
AFL-sensitive (RAID 2019) — “Be Sensitive and Collaborative: Analyzing Impact of Coverage Metrics in Greybox Fuzzing”
AFL-hier (NDSS 2021) — hierarchical seed scheduling for greybox fuzzing
Firm-AFL (USENIX Security 2019) — “High-Throughput Greybox Fuzzing of IoT Firmware via Augmented Process Emulation”
Concolic Execution:
SymFit (USENIX Security 2024) — “Making the Common (Concrete) Case Fast for Binary-Code Concolic Execution”
Marco (ICSE 2024) — “A Stochastic and Asynchronous Concolic Explorer”
JIGSAW (IEEE S&P 2022) — “Efficient and Scalable Path Constraints Fuzzing”
Hybrid Fuzzing:
DigFuzz (NDSS 2019) — hybrid, coverage-guided fuzzing
Pointer Analysis:
BinDSA (ISSTA 2025, Distinguished Paper Award) — “Efficient, Precise Binary-Level Pointer Analysis with Context-Sensitive Heap Reconstruction”
Each technique advances a narrow objective but rarely integrates seamlessly with others.
2.2 Learning-Based Binary Representation
Recent AI-driven approaches focus on code embeddings and similarity learning:
Genius (CCS 2016) — vector quantization for binary code embedding
Gemini (CCS 2017) — graph neural network–based function embedding
Asm2Vec (Oakland 2019) — treats instructions as “words” in a Doc2Vec-style representation model
PalmTree (CCS 2021) — a BERT-style pretrained assembly language model
StateFormer (FSE 2021) — fine-grained type recovery from binaries via generative state modeling
jTrans (ISSTA 2022) — a jump-aware Transformer for binary code similarity detection
CLAP (ISSTA 2024) — Learning Transferable Binary Code Representations with Natural Language Supervision
2.3 AI-Assisted Binary Diffing
DeepBinDiff (NDSS 2020) — “Learning Program-Wide Code Representations for Binary Diffing”
SigmaDiff (NDSS 2024) — “Semantics-Aware Deep Graph Matching for Pseudocode Diffing”
These systems combine neural embeddings with symbolic reasoning to compare binaries at scale. While effective, they remain tool-centric—requiring manual orchestration by experts.
3 Motivation
Binary analysis is powerful but inaccessible. The process involves complex configuration, limited interoperability, and long learning curves. Furthermore, despite automation, human analysts still perform most planning: selecting tools, defining objectives, interpreting results, and adjusting strategies.
At the same time, LLMs such as GPT-4/5 exhibit sophisticated reasoning, domain understanding, and code-generation capabilities. They can:
Comprehend disassembly and decompiled logic.
Write and execute analysis scripts (e.g., pwntools, angr).
Chain tools via natural-language reasoning and external API calls.
Explain intermediate results conversationally.
Agentic Binary Analysis leverages these traits to automate what was once an expert-only process.
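To make the scripting capability above concrete, the following is a minimal sketch (not taken from Dr.Binary) of the kind of angr script an agent might generate for a CTF-style task; the binary name and the find/avoid addresses are hypothetical placeholders.

```python
# Hedged sketch: an agent-generated angr script that searches for an input
# reaching a "success" branch. "./challenge" and the addresses are placeholders.
import angr

proj = angr.Project("./challenge", auto_load_libs=False)
simgr = proj.factory.simulation_manager(proj.factory.entry_state())

# Symbolically explore toward the hypothetical "win" block, avoiding the failure block.
simgr.explore(find=0x401337, avoid=0x401360)

if simgr.found:
    print("candidate stdin:", simgr.found[0].posix.dumps(0))  # concrete input bytes
```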
4 Methodology: From Assisted to Agentic
We define Agentic Binary Analysis as:
An LLM-centric approach in which the model autonomously plans, queries, interprets, and reasons over structured binary-analysis tasks through tool integration with minimal human intervention.
4.1 Core Concepts
Autonomous Planning: The LLM decomposes a high-level question (“Is this binary ransomware?”) into sub-tasks—disassembly, entropy check, API string scan, signature match—and executes them sequentially.
Tool Invocation: Using MCP, the agent calls external analysis tools (disassemblers, decompilers, symbolic engines).
Iterative Reasoning: The model interprets results, updates its plan, and re-invokes tools as needed.
Explainability: The entire process remains transparent via chat-style logs or reports.
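The loop below is a minimal, self-contained sketch of these four concepts working together. It is not Dr.Binary's implementation: the planner stub and the tool registry are hypothetical stand-ins for the LLM and an MCP tool server.

```python
# Hedged sketch of the plan -> invoke -> interpret loop described above.
from typing import Any, Callable

# Hypothetical tool registry; in a real system these would be MCP tool endpoints
# backed by a disassembler, decompiler, or symbolic engine.
TOOLS: dict[str, Callable[..., Any]] = {
    "strings":     lambda path: ["CryptEncrypt", "DeleteShadowCopies"],  # stub result
    "disassemble": lambda path, addr: "mov rax, rdi ...",                # stub result
}

def plan_next_step(question: str, history: list[dict]) -> dict:
    """Stand-in for the LLM planner: decide the next tool call, or finish."""
    if not history:
        return {"tool": "strings", "args": {"path": "sample.bin"}}
    return {"tool": "finish", "report": f"Observed {history[-1]['result']}"}

def analyze(question: str, max_steps: int = 10) -> str:
    history: list[dict] = []                      # transcript kept for iterative reasoning
    for _ in range(max_steps):
        step = plan_next_step(question, history)  # autonomous planning
        if step["tool"] == "finish":
            return step["report"]                 # explainable, step-by-step answer
        result = TOOLS[step["tool"]](**step["args"])  # tool invocation (MCP in practice)
        history.append({"step": step, "result": result})
    return "step budget exhausted"

print(analyze("Is this binary ransomware?"))
```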
5 System Overview: Dr.Binary
To validate the concept, we developed Dr.Binary, an interactive agentic analysis system integrating LLM reasoning with established research tools.
5.1 Demonstrated Applications
Ransomware Analysis: Identify encryption routines and classify malicious binaries.
ECU Firmware Diffing: Compare automotive Electronic Control Unit binaries to detect behavioral changes.
Backdoor Detection: Diff binary versions to isolate injected functions or altered control flows.
CTF Challenge Solving: Autonomously decompile, reason about logic, and generate exploit scripts.
5.2 Observations
LLMs exhibit strong comprehension of assembly and decompiled code.
They possess extensive cybersecurity domain knowledge.
They write complex scripts and chain tool outputs effectively.
They adapt plans based on runtime feedback, forming long multi-tool pipelines.
6 Design Considerations: AI Tool Interfaces
6.1 The Interface Challenge
A fundamental research question arises: how should LLMs interact with binary-analysis tools?
This resembles historical interface design debates:
Hardware ↔ Software → Instruction Sets (RISC vs CISC)
Kernel ↔ Userspace → System Calls (UNIX vs Windows)
6.2 Abstraction Trade-Offs
Low Abstraction: Expose raw disassembly and let the LLM perform semantic reasoning directly.
Flexible but costly and context-limited.
High Abstraction: Rely on tools to build data-flow graphs and pointer analyses (e.g., BinDSA).
Efficient and less prone to hallucination, but inherits the tools' limitations.
Middle Ground: Provide intermediate representations (e.g., structured CFGs, symbol tables) for LLM reasoning.
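To illustrate the middle-ground option, the snippet below shows a hypothetical function summary that a tool might hand the agent; the field names and values are illustrative, not a fixed schema.

```python
# Hedged sketch of a "middle ground" interface: a compact, structured summary of
# one function, cheap to place in the LLM context. All values are illustrative.
import json

function_summary = {
    "name": "sub_4012a0",
    "callees": ["CryptAcquireContextA", "CryptEncrypt", "WriteFile"],
    "strings": [".locked", "README_RESTORE_FILES.txt"],
    "cfg": {
        "blocks": ["0x4012a0", "0x401300", "0x401355"],
        "edges": [["0x4012a0", "0x401300"], ["0x401300", "0x401355"]],
    },
}

# The semantic judgment ("is this an encryption loop?") is left to the model,
# while the heavy lifting of recovering structure is done by the tool.
print(json.dumps(function_summary, indent=2))
```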
The optimal design may evolve into a distinct discipline—AI Tool Interface Design—akin to HCI but focused on machine-to-machine collaboration.
7 Evaluation and Preliminary Findings
7.1 Qualitative Assessment
Experiments with Dr.Binary show:
Rapid identification of malware signatures and behavioral patterns.
Automated generation of decompilation summaries and Python/angr scripts.
Comparable accuracy to human analysts on routine reverse-engineering tasks.
7.2 Limitations
Dynamic Behavior Reasoning: Reasoning about runtime behavior currently falls back on slow symbolic tools (e.g., angr); scriptable concolic engines such as SymFit or lightweight emulators such as Unicorn/Qiling may improve scalability (a minimal emulation sketch follows this list).
Context Overhead: Large functions and extended chats inflate LLM costs. Context optimization remains essential.
Scalability and Monetary Cost: Balancing LLM inference expenses with analysis depth is ongoing work.
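As a hedged example of the lightweight emulation mentioned above, the sketch below uses Unicorn to execute two hand-assembled x86-64 instructions and read back register state; the code bytes and addresses are illustrative, not drawn from any real sample.

```python
# Hedged sketch: concrete execution of a tiny code snippet with Unicorn, so an
# agent can observe register state without a full symbolic engine.
from unicorn import Uc, UC_ARCH_X86, UC_MODE_64
from unicorn.x86_const import UC_X86_REG_RAX

CODE = b"\x48\xc7\xc0\x05\x00\x00\x00"   # mov rax, 5
CODE += b"\x48\xff\xc0"                   # inc rax
BASE = 0x1000                             # illustrative load address (page-aligned)

mu = Uc(UC_ARCH_X86, UC_MODE_64)
mu.mem_map(BASE, 0x1000)                  # map one page for the code
mu.mem_write(BASE, CODE)
mu.emu_start(BASE, BASE + len(CODE))      # run until the end of the snippet
print("rax =", mu.reg_read(UC_X86_REG_RAX))   # expected: 6
```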
8 Future Work
Key directions for future research include:
Dynamic Integration: Linking LLMs with runtime emulation frameworks.
Benchmark Development: Creating standard datasets for agentic binary analysis evaluation.
Explainable AI in Security: Quantifying trust and interpretability of LLM-based decisions.
Scalable Tool Interfaces: Defining standardized schemas for AI–Tool communication.
AI Agent Testing: Developing methods to verify and evaluate autonomous analysis agents.
9 Conclusion
Agentic Binary Analysis marks a paradigm shift in reverse engineering and malware research. By elevating LLMs from assistive tools to autonomous agents, it bridges the gap between cutting-edge academic methods and practical security operations.
Through systems like Dr.Binary, analysts can interact with binaries conversationally while the AI coordinates complex static and dynamic analyses behind the scenes. Though challenges remain in scalability and interface design, the vision is clear: future binary analysis will be not only automated but agentic — adaptive, explainable, and continuously learning.