Firecrawl 使用指南

Firecrawl 完整 API 使用教程

API 概览

| 端点 | 方法 | 说明 |

|-----|------|------|

| /v1/scrape | POST | 抓取单个页面 |

| /v1/crawl | POST | 爬取整个网站 |

| /v1/crawl/:id | GET | 查询爬取状态 |

| /v1/map | POST | 获取网站链接地图 |

基础抓取 (Scrape)

最简单的抓取


curl -X POST http://localhost:3002/v1/scrape \

-H 'Content-Type: application/json' \

-d '{"url": "https://example.com"}'

指定输出格式


curl -X POST http://localhost:3002/v1/scrape \

-H 'Content-Type: application/json' \

-d '{

"url": "https://example.com",

"formats": ["markdown", "html", "links"]

}'

可用格式

| 格式 | 说明 |

|-----|------|

| markdown | Markdown 格式(默认) |

| html | 清理后的 HTML |

| rawHtml | 原始 HTML |

| links | 页面所有链接 |

| screenshot | 页面截图 |

抓取选项详解

完整参数示例


curl -X POST http://localhost:3002/v1/scrape \

-H 'Content-Type: application/json' \

-d '{

"url": "https://example.com",

"formats": ["markdown", "links"],

"onlyMainContent": true,

"includeTags": ["article", ".content", "#main"],

"excludeTags": ["nav", "footer", ".ad", "#sidebar"],

"waitFor": 2000,

"timeout": 30000

}'

参数说明

| 参数 | 类型 | 默认值 | 说明 |

|-----|------|--------|------|

| url | string | 必填 | 要抓取的 URL |

| formats | array | ["markdown"] | 输出格式 |

| onlyMainContent | boolean | true | 只返回主要内容 |

| includeTags | array | - | 只包含这些标签/类/ID |

| excludeTags | array | - | 排除这些标签/类/ID |

| waitFor | integer | 0 | 额外等待时间(毫秒) |

| timeout | integer | 30000 | 超时时间(毫秒) |

浏览器操作 (Actions)

在抓取前执行浏览器操作:


curl -X POST http://localhost:3002/v1/scrape \

-H 'Content-Type: application/json' \

-d '{

"url": "https://example.com",

"formats": ["markdown"],

"actions": [

{"type": "wait", "milliseconds": 2000},

{"type": "click", "selector": "#load-more"},

{"type": "scroll", "direction": "down"},

{"type": "wait", "milliseconds": 1000}

]

}'

支持的操作

| 操作 | 参数 | 说明 |

|-----|------|------|

| wait | milliseconds | 等待指定时间 |

| click | selector | 点击元素 |

| scroll | direction: "up"/"down" | 滚动页面 |

| write | selector, text | 输入文本 |

| press | key | 按键(如 Enter) |

示例:登录后抓取


curl -X POST http://localhost:3002/v1/scrape \

-H 'Content-Type: application/json' \

-d '{

"url": "https://example.com/login",

"formats": ["markdown"],

"actions": [

{"type": "write", "selector": "#username", "text": "user"},

{"type": "write", "selector": "#password", "text": "pass"},

{"type": "click", "selector": "#submit"},

{"type": "wait", "milliseconds": 3000}

]

}'

网站爬取 (Crawl)

启动爬取任务


curl -X POST http://localhost:3002/v1/crawl \

-H 'Content-Type: application/json' \

-d '{

"url": "https://example.com",

"limit": 100,

"maxDiscoveryDepth": 2

}'

返回:


{"id": "crawl-job-id-123"}

查询爬取状态


curl http://localhost:3002/v1/crawl/crawl-job-id-123

爬取选项

| 参数 | 类型 | 默认值 | 说明 |

|-----|------|--------|------|

| limit | integer | 10000 | 最大页面数 |

| maxDiscoveryDepth | integer | - | 最大发现深度 |

| includePaths | array | - | 包含的路径正则 |

| excludePaths | array | - | 排除的路径正则 |

| allowExternalLinks | boolean | false | 允许外部链接 |

| allowSubdomains | boolean | false | 允许子域名 |

示例:只爬取博客文章


curl -X POST http://localhost:3002/v1/crawl \

-H 'Content-Type: application/json' \

-d '{

"url": "https://example.com",

"includePaths": ["^/blog/.*$", "^/posts/.*$"],

"excludePaths": ["^/admin/.*$"],

"limit": 50

}'

网站地图 (Map)

获取网站所有链接:


curl -X POST http://localhost:3002/v1/map \

-H 'Content-Type: application/json' \

-d '{

"url": "https://example.com",

"limit": 200

}'

Map 选项

| 参数 | 类型 | 默认值 | 说明 |

|-----|------|--------|------|

| search | string | - | 过滤包含指定文本的链接 |

| limit | integer | 100 | 最大返回数量 |

| includeSubdomains | boolean | true | 包含子域名 |

截图功能

普通截图


curl -X POST http://localhost:3002/v1/scrape \

-H 'Content-Type: application/json' \

-d '{

"url": "https://example.com",

"formats": [{"type": "screenshot"}]

}'

全页截图


curl -X POST http://localhost:3002/v1/scrape \

-H 'Content-Type: application/json' \

-d '{

"url": "https://example.com",

"formats": [{

"type": "screenshot",

"fullPage": true,

"quality": 80

}]

}'

使用 jq 处理响应

提取 Markdown


curl -s ... | jq -r '.data.markdown'

提取链接


curl -s ... | jq '.data.links'

过滤特定链接


curl -s ... | jq '[.data.links[] | select(contains("/article/"))]'

保存到文件


curl -s ... | jq -r '.data.markdown' > output.md

Python SDK 示例


from firecrawl import Firecrawl

  

# 初始化(自托管不需要 API key)

firecrawl = Firecrawl(api_url="http://localhost:3002")

  

# 抓取页面

doc = firecrawl.scrape("https://example.com", {

"formats": ["markdown", "links"],

"onlyMainContent": True

})

  

print(doc.markdown)

print(doc.links)

JavaScript SDK 示例


import { Firecrawl } from 'firecrawl-js';

  

const firecrawl = new Firecrawl({

apiUrl: 'http://localhost:3002'

});

  

const doc = await firecrawl.scrape('https://example.com', {

formats: ['markdown', 'links'],

onlyMainContent: true

});

  

console.log(doc.markdown);

console.log(doc.links);

常见用例

1. 提取文章内容


curl -X POST http://localhost:3002/v1/scrape \

-H 'Content-Type: application/json' \

-d '{

"url": "https://blog.example.com/post/123",

"formats": ["markdown"],

"onlyMainContent": true,

"excludeTags": ["nav", "footer", ".comments", ".sidebar"]

}'

2. 获取产品列表


curl -X POST http://localhost:3002/v1/scrape \

-H 'Content-Type: application/json' \

-d '{

"url": "https://shop.example.com/products",

"formats": ["links"],

"includeTags": [".product-list"]

}' | jq '[.data.links[] | select(contains("/product/"))]'

3. 监控页面变化


# 抓取并保存

curl -s -X POST http://localhost:3002/v1/scrape \

-H 'Content-Type: application/json' \

-d '{"url": "https://example.com", "formats": ["markdown"]}' \

| jq -r '.data.markdown' > page_$(date +%Y%m%d).md

4. 抓取动态加载内容


curl -X POST http://localhost:3002/v1/scrape \

-H 'Content-Type: application/json' \

-d '{

"url": "https://example.com",

"formats": ["markdown"],

"actions": [

{"type": "scroll", "direction": "down"},

{"type": "wait", "milliseconds": 2000},

{"type": "scroll", "direction": "down"},

{"type": "wait", "milliseconds": 2000}

]

}'

错误处理

常见错误

| 错误 | 原因 | 解决方案 |

|-----|------|---------|

| timeout | 页面加载超时 | 增加 timeout 值 |

| blocked | 被网站拦截 | 添加 waitFor 延迟 |

| not found | 页面不存在 | 检查 URL |

重试机制


for i in {1..3}; do

result=$(curl -s -X POST http://localhost:3002/v1/scrape \

-H 'Content-Type: application/json' \

-d '{"url": "https://example.com"}')

if echo "$result" | jq -e '.data.markdown' > /dev/null 2>&1; then

echo "$result"

break

fi

echo "重试 $i/3..."

sleep 2

done


更多信息请参考 Firecrawl 官方文档

写文章用