What Is MarkItDown? Guide to microsoft/markitdown for LLM and RAG Workflows
Learn what Microsoft MarkItDown is, how it converts PDFs, Office files, HTML, images, audio, ZIP files and more into Markdown, and how it fits into LLM, RAG and MCP workflows.
Image: GitHub’s Open Graph preview for the microsoft/markitdown repository.1
Quick summary
MarkItDown is a Microsoft Python tool for converting many file types into Markdown. The official repository describes it as a lightweight Python utility for converting files to Markdown for use with LLMs and text-analysis pipelines.2
In plain terms: if you have PDFs, Word documents, Excel spreadsheets, PowerPoint decks, HTML pages, CSV/JSON/XML files, EPUBs, ZIP files, images, audio files or YouTube URLs, MarkItDown helps turn them into Markdown that can be read by ChatGPT, Claude, RAG systems, search indexes, vector databases or document-processing scripts.
MarkItDown is not primarily a high-fidelity document-layout converter. The README says the output can be reasonably presentable and human-friendly, but it is meant to be consumed by text-analysis tools rather than used as a perfect visual reproduction of the original document.2
What problem does MarkItDown solve?
LLMs and RAG systems work best with clean, structured text. But real documents are stored in many formats: PDFs, DOCX, PPTX, XLSX, HTML, CSV, JSON, XML, Outlook messages, images, audio files, ZIP archives, YouTube transcripts and EPUB books.
Without a common converter, every pipeline needs custom extraction logic. MarkItDown provides one tool that converts many of those formats into Markdown.
Why Markdown?
The README explains that Markdown is close to plain text, has minimal markup, and still represents useful document structure such as headings, lists, tables and links. It also notes that mainstream LLMs such as GPT-4o naturally understand Markdown and often produce Markdown in their answers; Markdown is also token-efficient.2
| Source format | Problem for LLMs | How Markdown helps |
|---|---|---|
| PDF | complex layout, line breaks, hidden structure | extracts readable text with headings/tables/lists |
| DOCX | Word-specific structure and styles | keeps the main content in text form |
| PPTX | content spread across slides | turns slide text into readable output |
| XLSX | sheets and tables | outputs tables/text |
| HTML | tags, layout, ads, navigation | keeps content in a cleaner format |
| JSON/XML | deeply nested structure | makes data easier to inspect |
| ZIP | many nested files | iterates through contents |
Supported formats
The README lists support for converting from:2
- PDF
- PowerPoint
- Word
- Excel
- Images: EXIF metadata and OCR
- Audio: EXIF metadata and speech transcription
- HTML
- Text-based formats such as CSV, JSON and XML
- ZIP files, by iterating over contents
- YouTube URLs
- EPUBs
- and more
Important detail: not every converter dependency is installed by default. MarkItDown uses optional dependencies, so you can install only the format families you need.
What MarkItDown is not
| MarkItDown is | MarkItDown is not |
|---|---|
| A file-to-Markdown converter | A Markdown editor |
| A document-prep tool for LLM/RAG pipelines | A perfect layout-preserving PDF converter |
| A Python library and CLI | A mandatory cloud service |
| Useful offline for many formats | Guaranteed to understand every complex layout |
| Extensible with plugins and MCP | A universal OCR engine for every case |
If your goal is “convert a document into Markdown that an LLM can read,” MarkItDown is a good fit. If your goal is exact layout preservation, test carefully because that is not the project’s main goal.
Installation
MarkItDown requires Python 3.10 or higher.2
Create a virtual environment:
python -m venv .venv
source .venv/bin/activate
Windows PowerShell:
python -m venv .venv
.\.venv\Scripts\Activate.ps1
Install all features:
pip install "markitdown[all]"
Install selected formats:
pip install "markitdown[pdf,docx,pptx]"
The README lists extras including [pptx], [docx], [xlsx], [xls], [pdf], [outlook], [az-doc-intel], [az-content-understanding], [audio-transcription], [youtube-transcription] and [all].2
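As a rough illustration of installing only what you need, the sketch below derives a pip command from the files you plan to convert. The suffix-to-extra mapping and the `install_command` helper are assumptions made for this example (in particular, mapping `.msg` to the `[outlook]` extra); they are not part of MarkItDown.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to the extras named above.
# Only extras whose target format is unambiguous are included.
EXTRA_FOR_SUFFIX = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".pptx": "pptx",
    ".xlsx": "xlsx",
    ".xls": "xls",
    ".msg": "outlook",  # assumption: Outlook messages use the .msg suffix
}


def install_command(paths):
    """Build a pip command covering the extras needed for the given files."""
    extras = set()
    for p in paths:
        suffix = Path(p).suffix.lower()
        if suffix in EXTRA_FOR_SUFFIX:
            extras.add(EXTRA_FOR_SUFFIX[suffix])
    if not extras:
        # Nothing in the list needs an optional extra.
        return "pip install markitdown"
    joined = ",".join(sorted(extras))
    return f'pip install "markitdown[{joined}]"'


print(install_command(["report.pdf", "slides.pptx", "notes.txt"]))
```

Formats without an entry in the mapping simply do not contribute an extra, so check the README's extras list before relying on the result.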
Install from source:
git clone https://github.com/microsoft/markitdown.git
cd markitdown
pip install -e "packages/markitdown[all]"
Command-line usage
Basic conversion:
markitdown report.pdf > report.md
Write directly to a file:
markitdown report.pdf -o report.md
Pipe input:
cat report.pdf | markitdown > report.md
Examples:
markitdown document.docx -o document.md
markitdown slides.pptx -o slides.md
markitdown workbook.xlsx -o workbook.md
markitdown page.html -o page.md
markitdown data.json -o data.md
markitdown archive.zip -o archive.md
YouTube URL, if the right extra is installed:
markitdown "https://www.youtube.com/watch?v=VIDEO_ID" -o video.md
Python API
Basic usage from the README:2
from markitdown import MarkItDown
md = MarkItDown(enable_plugins=False)
result = md.convert("test.xlsx")
print(result.text_content)
Save output:
from pathlib import Path
from markitdown import MarkItDown
converter = MarkItDown()
result = converter.convert("report.pdf")
Path("report.md").write_text(result.text_content, encoding="utf-8")
Convert a folder:
from pathlib import Path
from markitdown import MarkItDown
converter = MarkItDown()
input_dir = Path("documents")
output_dir = Path("markdown")
output_dir.mkdir(exist_ok=True)
for path in input_dir.iterdir():
    if path.is_file():
        try:
            result = converter.convert(str(path))
            out = output_dir / f"{path.stem}.md"
            out.write_text(result.text_content, encoding="utf-8")
            print("Converted:", path)
        except Exception as exc:
            print("Failed:", path, exc)
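One thing to watch in the folder loop above: naming outputs by `path.stem` means `report.pdf` and `report.docx` would both write to `report.md`. A minimal sketch of one way around that, using a hypothetical `output_path` helper that keeps the original suffix in the output name:

```python
from pathlib import Path


def output_path(source: Path, output_dir: Path) -> Path:
    """Return an output name that keeps the source suffix, e.g. report.pdf.md."""
    return output_dir / f"{source.name}.md"


print(output_path(Path("documents/report.pdf"), Path("markdown")))
print(output_path(Path("documents/report.docx"), Path("markdown")))
```

Swapping this in for the `out = ...` line keeps one Markdown file per source file even when stems repeat.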
Image descriptions with LLMs
The README says MarkItDown can use LLMs for image descriptions, currently for image files and PowerPoint files, by providing llm_client and llm_model.2
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe the image briefly and focus on information useful for search.",
)
result = md.convert("example.jpg")
print(result.text_content)
Use this when the image contains important visual content that normal text extraction cannot read.
OCR plugin for PDF/DOCX/PPTX/XLSX
The repository includes markitdown-ocr. Its README says it uses LLM Vision to extract text from images embedded in PDFs, DOCX, PPTX and XLSX files, using the same llm_client / llm_model pattern as MarkItDown’s image descriptions.3
Install:
pip install markitdown-ocr
pip install openai
CLI:
markitdown document.pdf --use-plugins --llm-client openai --llm-model gpt-4o
Python:
from markitdown import MarkItDown
from openai import OpenAI
md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)
result = md.convert("document_with_images.pdf")
print(result.text_content)
The OCR plugin README says that if no llm_client is provided, the plugin still loads but OCR is skipped and the standard built-in converter is used.3
Plugin system
MarkItDown supports third-party plugins. Plugins are disabled by default.2
List installed plugins:
markitdown --list-plugins
Use plugins:
markitdown --use-plugins path-to-file.pdf
The sample plugin README shows that a plugin implements a custom DocumentConverter, registers it through register_converters(), and exposes it through the markitdown.plugin entry point group in pyproject.toml.4
This lets teams build converters for internal report formats, invoice exports or custom file bundles.
Azure Document Intelligence and Azure Content Understanding
MarkItDown can use Azure cloud services when higher-quality extraction is needed.
Azure Document Intelligence:
markitdown path-to-file.pdf -o document.md -d -e "<document_intelligence_endpoint>"
Python:
from markitdown import MarkItDown
md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
result = md.convert("test.pdf")
print(result.text_content)
Azure Content Understanding:
markitdown path-to-file.pdf --use-cu --cu-endpoint "<content_understanding_endpoint>"
Python:
from markitdown import MarkItDown
md = MarkItDown(cu_endpoint="<content_understanding_endpoint>")
result = md.convert("report.pdf")
print(result.markdown)
Custom analyzer:
md = MarkItDown(
    cu_endpoint="<content_understanding_endpoint>",
    cu_analyzer_id="my-invoice-analyzer",
)
result = md.convert("invoice.pdf")
print(result.markdown)
The README warns that each Content Understanding-routed convert() call is a billable Azure API call, so restrict file types when needed.2
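One simple way to restrict file types is to keep two converter instances and route only selected suffixes to the billable one. The sketch below is an assumption about how you might structure that; `pick_converter` and `CLOUD_SUFFIXES` are hypothetical names, and the two converters would be a plain `MarkItDown()` and a `MarkItDown(cu_endpoint=...)` instance like those shown above.

```python
from pathlib import Path

# Assumption for this example: only PDFs are worth a billable cloud call.
CLOUD_SUFFIXES = frozenset({".pdf"})


def pick_converter(path, local_converter, cloud_converter, cloud_suffixes=CLOUD_SUFFIXES):
    """Return the cloud converter only for allow-listed suffixes."""
    if Path(path).suffix.lower() in cloud_suffixes:
        return cloud_converter
    return local_converter


# Placeholder strings stand in for real converter instances here.
print(pick_converter("invoice.pdf", "local", "cloud"))
print(pick_converter("notes.docx", "local", "cloud"))
```

Everything outside the allow-list stays on the local converter, so an unexpected file type does not generate an Azure charge.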
What is MarkItDown MCP?
markitdown-mcp is an MCP server for MarkItDown. Its README says it provides a lightweight STDIO, Streamable HTTP and SSE MCP server, exposing one tool: convert_to_markdown(uri), where uri can be http:, https:, file: or data:.5
Install:
pip install markitdown-mcp
Run STDIO:
markitdown-mcp
Run local HTTP/SSE:
markitdown-mcp --http --host 127.0.0.1 --port 3001
In simple terms: if Claude Desktop, Cursor or another MCP-compatible agent needs to read a document, MarkItDown can become the tool that converts it to Markdown first.
MarkItDown MCP with Claude Desktop and Docker
The markitdown-mcp README recommends Docker for Claude Desktop.5
Build:
docker build -t markitdown-mcp:latest .
Claude Desktop config:
{
  "mcpServers": {
    "markitdown": {
      "command": "docker",
      "args": ["run", "--rm", "-i", "markitdown-mcp:latest"]
    }
  }
}
Mount a local folder:
{
  "mcpServers": {
    "markitdown": {
      "command": "docker",
      "args": [
        "run",
        "--rm",
        "-i",
        "-v",
        "/home/user/data:/workdir",
        "markitdown-mcp:latest"
      ]
    }
  }
}
Then an agent can call:
convert_to_markdown("file:///workdir/report.pdf")
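Because the container sees the mounted folder under `/workdir`, host paths have to be translated before they are passed as `file:` URIs. The sketch below generates the mounted configuration shown above and maps a host path to its in-container URI. The helper names `claude_config` and `container_uri` are hypothetical, and POSIX-style paths are assumed.

```python
import json
from pathlib import PurePosixPath
from urllib.parse import quote


def claude_config(host_dir=None, container_dir="/workdir", image="markitdown-mcp:latest"):
    """Build the mcpServers entry, optionally mounting one host folder."""
    args = ["run", "--rm", "-i"]
    if host_dir is not None:
        args += ["-v", f"{host_dir}:{container_dir}"]
    args.append(image)
    return {"mcpServers": {"markitdown": {"command": "docker", "args": args}}}


def container_uri(host_path, host_dir, container_dir="/workdir"):
    """Translate a host path under host_dir into a file: URI inside the container."""
    # relative_to raises ValueError when host_path is outside host_dir.
    relative = PurePosixPath(host_path).relative_to(host_dir)
    if ".." in relative.parts:
        raise ValueError("path escapes the mounted folder")
    inside = PurePosixPath(container_dir) / relative
    # Percent-encode spaces and other special characters in the path.
    return "file://" + quote(str(inside))


print(json.dumps(claude_config("/home/user/data"), indent=2))
print(container_uri("/home/user/data/report.pdf", "/home/user/data"))
```

Rejecting paths outside the mounted folder up front gives a clearer error than letting the conversion fail inside the container.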
Docker for the MarkItDown CLI
The main README includes this Docker example:2
docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
The Dockerfile uses python:3.13-slim-bullseye, installs ffmpeg and exiftool, installs markitdown[all] and the sample plugin, and sets markitdown as the entrypoint.6
Use Docker when you want isolation, repeatable dependencies, batch conversion jobs or better control over file and network access.
MarkItDown in a RAG pipeline
Basic pipeline:
Original file
↓
MarkItDown
↓
Markdown text
↓
Chunking
↓
Embedding
↓
Vector database
↓
RAG answer
Simple script:
from pathlib import Path
from markitdown import MarkItDown
converter = MarkItDown()
docs_dir = Path("raw_docs")
md_dir = Path("markdown_docs")
md_dir.mkdir(exist_ok=True)
for file in docs_dir.glob("*"):
    if not file.is_file():
        continue
    result = converter.convert(str(file))
    md_file = md_dir / f"{file.stem}.md"
    md_file.write_text(result.text_content, encoding="utf-8")
Then send the Markdown through your chunking, embedding and indexing pipeline.
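For the chunking step, one common approach is to split on Markdown headings so each chunk carries its section title. The sketch below is a minimal, standard-library version of that idea; `chunk_markdown` and its `max_chars` default are illustrative choices, not part of MarkItDown, and the splitter does not special-case `#` lines inside code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(text: str, max_chars: int = 1000) -> list[dict]:
    """Split Markdown into chunks at headings, flushing early when a chunk grows too large."""
    chunks = []
    heading = ""
    buffer: list[str] = []

    def flush():
        body = "\n".join(buffer).strip()
        if body:
            chunks.append({"heading": heading, "text": body})
        buffer.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section under its own heading
            heading = match.group(2).strip()
            continue
        buffer.append(line)
        if sum(len(item) + 1 for item in buffer) > max_chars:
            flush()  # oversized section: emit what we have and keep going
    flush()
    return chunks


sample = "# Intro\nMarkItDown converts files.\n\n## Install\nUse pip.\n"
for chunk in chunk_markdown(sample):
    print(chunk)
```

Keeping the heading alongside each chunk gives the embedding and retrieval steps some section context without re-reading the whole document.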
Server-side security
The README warns that MarkItDown performs I/O with the privileges of the current process. Like open() or requests.get(), it can access resources available to that process. It recommends sanitizing untrusted inputs and using the narrowest conversion API needed, such as convert_stream() or convert_local().7
If you deploy MarkItDown for user uploads or server-side conversion:
- do not allow arbitrary file paths from users;
- do not fetch arbitrary URLs without filtering;
- block private IPs, loopback, link-local and metadata-service addresses;
- limit file size;
- restrict file types;
- run inside a container or sandbox;
- run as a user that cannot read secrets;
- disable network access if remote fetching is not needed;
- scan uploaded files where appropriate;
- avoid logging sensitive content.
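The URL-filtering items in the checklist above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not a complete SSRF defence: `is_safe_url` is a hypothetical helper, it allows only `http` and `https`, and it rejects any host that is, or resolves to, a private, loopback, link-local, multicast, reserved or unspecified address.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def _is_blocked(ip):
    """Return True for addresses that should never be fetched server-side."""
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped  # treat ::ffff:a.b.c.d like the IPv4 address it wraps
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local  # includes 169.254.169.254 metadata endpoints
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def is_safe_url(url):
    """Allow only http(s) URLs whose host maps exclusively to public addresses."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    host = parsed.hostname
    try:
        addresses = [ipaddress.ip_address(host)]
    except ValueError:
        # Not a literal IP: resolve the name and check every address returned.
        try:
            infos = socket.getaddrinfo(host, None)
        except socket.gaierror:
            return False
        addresses = [ipaddress.ip_address(info[4][0]) for info in infos]
    return bool(addresses) and not any(_is_blocked(ip) for ip in addresses)


print(is_safe_url("http://169.254.169.254/latest/meta-data/"))
print(is_safe_url("file:///etc/passwd"))
```

A check like this only covers the moment it runs. DNS answers can change between the check and the fetch, so it works best combined with the container, sandbox and network restrictions listed above.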
Which convert method should you use?
The README recommends calling only the conversion method needed for the use case.7
| Method | Use when |
|---|---|
| convert() | Quick, flexible, but permissive |
| convert_local() | You only need local files |
| convert_stream() | You open the stream yourself and want control |
| convert_response() | You fetch a URL yourself and pass the response |
Rule of thumb: in production, use the narrowest method possible.
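As an illustration of the narrow-API idea, the sketch below reads an upload into memory with a size cap and hands only the bytes to `convert_stream()`, so the converter never sees a path or URL. The `convert_upload` wrapper and the 20 MB limit are assumptions for this example; `converter` is expected to be a `MarkItDown` instance.

```python
import io

MAX_BYTES = 20 * 1024 * 1024  # hypothetical limit; tune for your workload


def convert_upload(converter, stream, max_bytes=MAX_BYTES):
    """Convert an already-open binary stream, refusing oversized input."""
    # Read one byte past the limit so oversized uploads can be detected.
    data = stream.read(max_bytes + 1)
    if len(data) > max_bytes:
        raise ValueError("upload exceeds size limit")
    result = converter.convert_stream(io.BytesIO(data))
    return result.text_content
```

Typical use would look like `convert_upload(MarkItDown(), open("report.pdf", "rb"))`. Opening the file yourself keeps decisions about which paths are readable in your own code rather than in the converter.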
When should you use MarkItDown?
Use it when:
- you need Markdown from common document formats;
- you are preparing data for LLM or RAG pipelines;
- you need batch conversion for PDF/DOCX/PPTX/XLSX;
- you want a fast CLI for extracting text;
- you want an MCP tool for agents;
- you need Azure Document Intelligence or Content Understanding integration;
- you want to write converters for internal formats.
Avoid it when:
- exact layout preservation is required;
- documents are low-quality scans and no OCR/cloud converter is used;
- input is untrusted and not sandboxed;
- you need a Markdown editor;
- you need a polished end-user PDF/HTML renderer.
MarkItDown vs Pandoc vs OCR
| Tool | Best for | Note |
|---|---|---|
| MarkItDown | Converting many files to Markdown for LLM/RAG | Optimized for text-analysis pipelines |
| Pandoc | Academic/document markup conversion | Strong for document formats, not specifically LLM pipelines |
| OCR-only tools | Extracting text from images/scanned PDFs | May not preserve Markdown structure well |
| Azure Document Intelligence | Complex layout, scanned documents, tables | Cloud service, billable |
| Azure Content Understanding | Multimodal and structured field extraction | Cloud service, billable, enterprise use cases |
Practical checklist
- Use a virtual environment.
- Install only the extras you need if you want a lighter setup.
- Test the CLI on sample files before building a pipeline.
- Store intermediate Markdown for debugging RAG.
- Use OCR plugin or Azure for scanned PDFs.
- Do not allow arbitrary user-supplied paths or URLs.
- Keep MCP HTTP/SSE on localhost unless you have security controls.
- In Docker, mount only the folders needed.
- Enable plugins explicitly with --use-plugins.
- Control Azure Content Understanding cost.
FAQ
What is MarkItDown?
MarkItDown is a Microsoft Python tool for converting files and Office documents into Markdown for LLMs and text-analysis pipelines.2
Does MarkItDown support PDF?
Yes. The README lists PDF support; install PDF dependencies with pip install "markitdown[pdf]" or install all extras with markitdown[all].2
Does MarkItDown support Word and PowerPoint?
Yes. It supports Word and PowerPoint; the related extras are [docx] and [pptx].2
Does MarkItDown support OCR?
Yes. The README lists image OCR support, and markitdown-ocr adds LLM Vision OCR for embedded images in PDF, DOCX, PPTX and XLSX files.2, 3
Can MarkItDown be used with Claude Desktop or MCP clients?
Yes. markitdown-mcp provides an MCP server with a convert_to_markdown(uri) tool.5
Should MarkItDown MCP be exposed to the internet?
No. The MCP README warns that the server has no authentication and binds to localhost by default; do not bind it to other interfaces unless you understand the security implications.5
Conclusion
microsoft/markitdown is useful for anyone building LLM, RAG or document-analysis workflows. Its goal is not perfect visual conversion. Its goal is to turn many real-world file types into Markdown that models and text-processing tools can use effectively.
The simplest way to start is pip install "markitdown[all]", then run markitdown file.pdf -o file.md. For production, the most important part is safety: control file paths, URLs, process permissions, sandboxing and the exact conversion API you call.
References
1. GitHub Open Graph preview image for microsoft/markitdown. https://opengraph.githubassets.com/markitdown-guide/microsoft/markitdown
2. Microsoft MarkItDown README. https://raw.githubusercontent.com/microsoft/markitdown/main/README.md
3. packages/markitdown-ocr/README.md. https://raw.githubusercontent.com/microsoft/markitdown/main/packages/markitdown-ocr/README.md
4. packages/markitdown-sample-plugin/README.md. https://raw.githubusercontent.com/microsoft/markitdown/main/packages/markitdown-sample-plugin/README.md
5. packages/markitdown-mcp/README.md. https://raw.githubusercontent.com/microsoft/markitdown/main/packages/markitdown-mcp/README.md
6. MarkItDown Dockerfile. https://raw.githubusercontent.com/microsoft/markitdown/main/Dockerfile
7. Microsoft MarkItDown README, Security Considerations. https://raw.githubusercontent.com/microsoft/markitdown/main/README.md
Written by PixelRouter Editorial Team