# Firecrawl Usage Guide

A complete tutorial for the Firecrawl API.

## API Overview
| Endpoint | Method | Description |
|----------|--------|-------------|
| /v1/scrape | POST | Scrape a single page |
| /v1/crawl | POST | Crawl an entire site |
| /v1/crawl/:id | GET | Check crawl status |
| /v1/map | POST | Map a site's links |
## Basic Scraping (Scrape)

### The simplest scrape

```shell
curl -X POST http://localhost:3002/v1/scrape \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com"}'
```
### Choosing output formats

```shell
curl -X POST http://localhost:3002/v1/scrape \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown", "html", "links"]
  }'
```
### Available formats

| Format | Description |
|--------|-------------|
| markdown | Markdown (the default) |
| html | Cleaned HTML |
| rawHtml | Raw HTML |
| links | All links on the page |
| screenshot | A screenshot of the page |
## Scrape Options in Detail

### Full parameter example

```shell
curl -X POST http://localhost:3002/v1/scrape \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown", "links"],
    "onlyMainContent": true,
    "includeTags": ["article", ".content", "#main"],
    "excludeTags": ["nav", "footer", ".ad", "#sidebar"],
    "waitFor": 2000,
    "timeout": 30000
  }'
```
### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| url | string | required | The URL to scrape |
| formats | array | ["markdown"] | Output formats |
| onlyMainContent | boolean | true | Return only the main content |
| includeTags | array | - | Only include these tags/classes/IDs |
| excludeTags | array | - | Exclude these tags/classes/IDs |
| waitFor | integer | 0 | Extra wait time in milliseconds |
| timeout | integer | 30000 | Timeout in milliseconds |
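For scripting, the same request can be wrapped in a small Python helper. This is a minimal sketch, not part of any SDK: the function names are our own, and it assumes a self-hosted instance at `http://localhost:3002`. The keyword options mirror the parameter names in the table above.

```python
import json
import urllib.request

FIRECRAWL_BASE = "http://localhost:3002"  # assumed self-hosted base URL

def build_scrape_payload(url, **options):
    """Assemble the /v1/scrape request body from keyword options."""
    return {"url": url, **options}

def scrape(url, **options):
    """POST /v1/scrape and return the parsed JSON response."""
    payload = build_scrape_payload(url, **options)
    req = urllib.request.Request(
        f"{FIRECRAWL_BASE}/v1/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Requires a running Firecrawl instance at FIRECRAWL_BASE.
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Usage would look like `scrape("https://example.com", formats=["markdown"], onlyMainContent=True, waitFor=2000)`.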
## Browser Actions

Run browser actions before the page is scraped:

```shell
curl -X POST http://localhost:3002/v1/scrape \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "actions": [
      {"type": "wait", "milliseconds": 2000},
      {"type": "click", "selector": "#load-more"},
      {"type": "scroll", "direction": "down"},
      {"type": "wait", "milliseconds": 1000}
    ]
  }'
```
### Supported actions

| Action | Parameters | Description |
|--------|------------|-------------|
| wait | milliseconds | Wait for the given time |
| click | selector | Click an element |
| scroll | direction: "up"/"down" | Scroll the page |
| write | selector, text | Type text into an element |
| press | key | Press a key (e.g. Enter) |
### Example: scrape after logging in

```shell
curl -X POST http://localhost:3002/v1/scrape \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com/login",
    "formats": ["markdown"],
    "actions": [
      {"type": "write", "selector": "#username", "text": "user"},
      {"type": "write", "selector": "#password", "text": "pass"},
      {"type": "click", "selector": "#submit"},
      {"type": "wait", "milliseconds": 3000}
    ]
  }'
```
## Crawling a Site (Crawl)

### Start a crawl job

```shell
curl -X POST http://localhost:3002/v1/crawl \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "limit": 100,
    "maxDiscoveryDepth": 2
  }'
```

Response:

```json
{"id": "crawl-job-id-123"}
```
### Check crawl status

```shell
curl http://localhost:3002/v1/crawl/crawl-job-id-123
```
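Crawls run asynchronously, so the status endpoint is typically polled until the job finishes. A sketch of that loop, assuming the status body carries a `status` field that becomes `"completed"` (an assumption about the response shape); the `fetch` callable is injected so any HTTP client can be plugged in:

```python
import time

def poll_crawl(job_id, fetch, interval=2.0, max_polls=30):
    """Poll GET /v1/crawl/<id> until the job reports completion.

    `fetch(url)` must return the parsed JSON status body. The
    "completed" status value is assumed, not taken from a spec.
    """
    url = f"http://localhost:3002/v1/crawl/{job_id}"
    for _ in range(max_polls):
        body = fetch(url)
        if body.get("status") == "completed":
            return body
        time.sleep(interval)
    raise TimeoutError(f"crawl {job_id} did not finish after {max_polls} polls")
```

In real use, `fetch` would be something like `lambda url: json.load(urllib.request.urlopen(url))`.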
### Crawl options

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| limit | integer | 10000 | Maximum number of pages |
| maxDiscoveryDepth | integer | - | Maximum discovery depth |
| includePaths | array | - | Regexes for paths to include |
| excludePaths | array | - | Regexes for paths to exclude |
| allowExternalLinks | boolean | false | Follow external links |
| allowSubdomains | boolean | false | Follow subdomains |
### Example: crawl only blog posts

```shell
curl -X POST http://localhost:3002/v1/crawl \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "includePaths": ["^/blog/.*$", "^/posts/.*$"],
    "excludePaths": ["^/admin/.*$"],
    "limit": 50
  }'
```
## Site Mapping (Map)

Fetch all of a site's links:

```shell
curl -X POST http://localhost:3002/v1/map \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "limit": 200
  }'
```
### Map options

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| search | string | - | Only return links containing this text |
| limit | integer | 100 | Maximum number of links returned |
| includeSubdomains | boolean | true | Include subdomains |
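The `search` option filters on the server, but mapped links can also be narrowed down locally after the fact. A small sketch, assuming the response exposes the URLs under `data.links` the way the scrape `links` format does (the exact envelope may differ):

```python
def filter_links(map_response, substring):
    """Keep only mapped links whose URL contains `substring`.

    Assumes the link list lives at map_response["data"]["links"].
    """
    links = map_response.get("data", {}).get("links", [])
    return [link for link in links if substring in link]

sample = {"data": {"links": [
    "https://example.com/blog/a",
    "https://example.com/about",
    "https://example.com/blog/b",
]}}
print(filter_links(sample, "/blog/"))
```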
## Screenshots

### Viewport screenshot

```shell
curl -X POST http://localhost:3002/v1/scrape \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "formats": [{"type": "screenshot"}]
  }'
```

### Full-page screenshot

```shell
curl -X POST http://localhost:3002/v1/scrape \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "formats": [{
      "type": "screenshot",
      "fullPage": true,
      "quality": 80
    }]
  }'
```
## Processing Responses with jq

### Extract the Markdown

```shell
curl -s ... | jq -r '.data.markdown'
```

### Extract the links

```shell
curl -s ... | jq '.data.links'
```

### Filter for specific links

```shell
curl -s ... | jq '[.data.links[] | select(contains("/article/"))]'
```

### Save to a file

```shell
curl -s ... | jq -r '.data.markdown' > output.md
```
## Python SDK Example

```python
from firecrawl import Firecrawl

# Initialize (a self-hosted instance needs no API key)
firecrawl = Firecrawl(api_url="http://localhost:3002")

# Scrape a page
doc = firecrawl.scrape("https://example.com", {
    "formats": ["markdown", "links"],
    "onlyMainContent": True
})
print(doc.markdown)
print(doc.links)
```
## JavaScript SDK Example

```javascript
import { Firecrawl } from 'firecrawl-js';

const firecrawl = new Firecrawl({
  apiUrl: 'http://localhost:3002'
});

const doc = await firecrawl.scrape('https://example.com', {
  formats: ['markdown', 'links'],
  onlyMainContent: true
});
console.log(doc.markdown);
console.log(doc.links);
```
## Common Use Cases

### 1. Extract an article

```shell
curl -X POST http://localhost:3002/v1/scrape \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://blog.example.com/post/123",
    "formats": ["markdown"],
    "onlyMainContent": true,
    "excludeTags": ["nav", "footer", ".comments", ".sidebar"]
  }'
```

### 2. Collect product links

```shell
curl -X POST http://localhost:3002/v1/scrape \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://shop.example.com/products",
    "formats": ["links"],
    "includeTags": [".product-list"]
  }' | jq '[.data.links[] | select(contains("/product/"))]'
```
### 3. Monitor a page for changes

```shell
# Scrape and save a dated snapshot
curl -s -X POST http://localhost:3002/v1/scrape \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com", "formats": ["markdown"]}' \
  | jq -r '.data.markdown' > page_$(date +%Y%m%d).md
```

### 4. Scrape dynamically loaded content

```shell
curl -X POST http://localhost:3002/v1/scrape \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "actions": [
      {"type": "scroll", "direction": "down"},
      {"type": "wait", "milliseconds": 2000},
      {"type": "scroll", "direction": "down"},
      {"type": "wait", "milliseconds": 2000}
    ]
  }'
```
## Error Handling

### Common errors

| Error | Cause | Fix |
|-------|-------|-----|
| timeout | The page took too long to load | Increase the timeout value |
| blocked | The site blocked the request | Add a waitFor delay |
| not found | The page does not exist | Check the URL |
### Retrying

```shell
for i in {1..3}; do
  result=$(curl -s -X POST http://localhost:3002/v1/scrape \
    -H 'Content-Type: application/json' \
    -d '{"url": "https://example.com"}')
  if echo "$result" | jq -e '.data.markdown' > /dev/null 2>&1; then
    echo "$result"
    break
  fi
  echo "retry $i/3..." >&2
  sleep 2
done
```
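The same retry idea can be expressed in Python with exponential backoff instead of a fixed sleep. This is a sketch: `scrape_once` is a hypothetical callable standing in for whatever issues the HTTP request, and success is defined as `data.markdown` being present, mirroring the `jq -e` check in the loop above:

```python
import time

def scrape_with_retry(scrape_once, attempts=3, base_delay=2.0):
    """Retry a scrape call with exponential backoff (2s, 4s, 8s, ...).

    `scrape_once()` must return the parsed JSON response; a response
    counts as success when data.markdown is present.
    """
    for i in range(attempts):
        result = scrape_once()
        if result.get("data", {}).get("markdown") is not None:
            return result
        if i < attempts - 1:
            time.sleep(base_delay * (2 ** i))
    raise RuntimeError(f"scrape failed after {attempts} attempts")
```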
For more information, see the official Firecrawl documentation.