Novel Scraping | gwozai

Scraping the Novel 捞尸人 with Python

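This tutorial shows how to build an article scraper (ArticleScraper) and a chapter exporter (ChapterExporter) in Python. It uses Playwright to scrape pages, SQLite to store the data, and Redis to cache the most recently scraped link, then exports the scraped content to a text file. The code is meant for scraping chapters from novel sites (piaotia.com in this example) and saving them.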

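Goals

Scrape page content: fetch chapter titles and chapter text from the target novel site.
Store the data: save the scraped content to a SQLite database and cache the most recently scraped link in Redis.
Export the data: read the chapters back from the database and write them to a text file.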

Part 1: Environment Setup

Installing dependencies

Before running the code, install the following Python libraries:

pip install playwright redis
playwright install  # install the browsers that Playwright drives

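playwright: drives a real browser so that dynamically rendered pages can be scraped.
redis: caches the most recently scraped link.
sqlite3: built-in Python module for working with SQLite databases; it does not need to be installed with pip.
logging: built-in Python module for writing log messages.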

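You also need a Redis server running at the configured address (the tutorial's example uses 116.198.253.144:6379). For local testing, install Redis and change the configuration to point at it.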

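Part 2: The Article Scraper (`ArticleScraper`)

Code structure

The ArticleScraper class scrapes novel chapters from web pages and saves them to the database. It is walked through step by step below.

1. Initialization and configuration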

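import logging
import sqlite3

import redis
from playwright.sync_api import Playwright, TimeoutError as PlaywrightTimeoutError, sync_playwright

class ArticleScraper:
    def __init__(self, db_path='articles6.db', start_url='https://www.piaotia.com/html/15/15679/11397663.html'):
        self.db_path = db_path  # path to the SQLite database file
        self.start_url = start_url  # URL where scraping starts
        self.redis_key = 'laoshiren'  # Redis key holding the latest scraped URL
        self.content_selectors = ['#content', '.content', '#BookText', 'div.contentbox', '#htmlContent']  # candidate CSS selectors for the chapter text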

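db_path: path to the SQLite database file.
start_url: URL of the page where scraping starts.
redis_key: name of the Redis key that stores the most recently scraped link.
content_selectors: a list of CSS selectors tried in order to locate the chapter text on the page.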

2. Database initialization

def _init_database(self):
    self.db_conn = sqlite3.connect(self.db_path)
    self.db_cursor = self.db_conn.cursor()
    self.db_cursor.execute('''
        CREATE TABLE IF NOT EXISTS articles (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT UNIQUE,
            content TEXT,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    ''')
    self.db_conn.commit()
    logging.info("SQLite database initialized")

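This creates a table named articles with the columns id (auto-incrementing primary key), title (unique), content, and created_at (creation time).
IF NOT EXISTS makes the statement safe to run on every start: the table is only created the first time.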
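
The run loop shown later calls `_chapter_exists` and `_save_chapter`, which this tutorial does not list. Below is a minimal sketch of what such helpers might look like against the schema above. The standalone function names, the in-memory database, and the use of INSERT OR IGNORE are assumptions made for illustration, not the author's code.

```python
import sqlite3

# Same schema as in _init_database above
SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT UNIQUE,
    content TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
"""


def chapter_exists(conn, title):
    """Return True if a row with this exact title is already stored."""
    row = conn.execute(
        "SELECT 1 FROM articles WHERE title = ? LIMIT 1", (title,)
    ).fetchone()
    return row is not None


def save_chapter(conn, title, content):
    """Insert a chapter; return False if the UNIQUE title already exists."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO articles (title, content) VALUES (?, ?)",
        (title, content),
    )
    conn.commit()
    return cur.rowcount == 1


# Hypothetical usage with an in-memory database instead of articles6.db
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
first = save_chapter(conn, "第1章 Example", "text")
second = save_chapter(conn, "第1章 Example", "other text")
print(first, second, chapter_exists(conn, "第1章 Example"))
```

Because title is UNIQUE, a plain INSERT of a duplicate would raise sqlite3.IntegrityError; INSERT OR IGNORE is one way to make re-runs harmless, and checking existence first (as the run loop does) is another.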

3. Extracting page content

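def _get_page_content(self, page):
    try:
        title_element = page.locator('h1')
        # .first avoids a strict-mode error when the page has more than one <h1>
        # "未找到标题" / "未找到内容" are placeholder values ("title not found" / "content not found")
        title = title_element.first.inner_text(timeout=5000) if title_element.count() > 0 else "未找到标题"

        content = "未找到内容"
        for selector in self.content_selectors:
            content_element = page.locator(selector)
            if content_element.count() > 0:
                # Take the first match; a class selector such as .content may match several nodes
                content = content_element.first.inner_text(timeout=5000)
                logging.info(f"Got content using selector {selector}")
                break
        return title, content
    except Exception as e:
        logging.error(f"Failed to extract content: {str(e)}")
        return "未找到标题", "未找到内容"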

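The title is read from the <h1> tag using Playwright's locator method.
The selectors in content_selectors are tried in order, and the first one that matches an element supplies the chapter text.
Each inner_text call is capped at 5 seconds so that a slow page cannot stall the scraper. If anything raises, the method logs the error and returns the two placeholder strings.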

4. Navigating to the next page

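def _navigate_to_next_page(self, page):
    # The link name must match the site's own text exactly ("next chapter (shortcut →)")
    next_button = page.get_by_role("link", name="下一章（快捷键 →）")
    try:
        # is_visible() returns immediately and ignores its timeout argument, so wait explicitly
        next_button.wait_for(state="visible", timeout=5000)
    except PlaywrightTimeoutError:  # from playwright.sync_api import TimeoutError as PlaywrightTimeoutError
        logging.warning("Next-chapter link not found, stopping")
        return False
    next_button.click()
    page.wait_for_load_state('networkidle', timeout=15000)
    return True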

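get_by_role looks up the "next chapter" link by its accessible name. If the link is not visible, the method logs a warning and returns False, which ends the scrape.
Otherwise it clicks the link and waits for the page to finish loading (the networkidle state), with a 15-second timeout.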

5. Main run loop

def run(self, playwright: Playwright):
    self._init_database()
    browser = playwright.chromium.launch(headless=True, proxy={"server": "http://127.0.0.1:7897"})
    context = browser.new_context()
    page = context.new_page()

    latest_url = self._get_latest_url()
    page.goto(latest_url or self.start_url)

    while True:
        current_url = page.url
        title, content = self._get_page_content(page)

        if self._is_valid_chapter(title):
            if not self._chapter_exists(title):
                self._save_chapter(title, content)
                self._save_latest_url(current_url)
            if "最新章节" in title:
                break
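        if not self._navigate_to_next_page(page):
            break

    # Release the browser and the database connection once the loop ends
    context.close()
    browser.close()
    self.db_conn.close()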

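Chromium is launched headless (headless=True) and routed through a proxy. The address http://127.0.0.1:7897 is a local proxy; change or remove it to suit your own setup.
The latest link is read from Redis; if there is none, scraping begins at start_url. This lets an interrupted run resume where it stopped.
The loop then extracts the content, validates the chapter title, saves new chapters, and updates Redis, until a title containing "最新章节" ("latest chapter") appears or there is no next page. The helper methods called here (_get_latest_url, _is_valid_chapter, _chapter_exists, _save_chapter, _save_latest_url) are not listed in this tutorial.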
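
Since `_get_latest_url` and `_save_latest_url` are not listed, here is a minimal sketch of how the Redis side might work. The function names and the FakeRedis stand-in are hypothetical; the helpers only assume a client object with get and set methods, which a redis-py client created with redis.Redis(host=..., port=6379) provides.

```python
class FakeRedis:
    """In-memory stand-in exposing the two client methods used below (get/set)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value
        return True


def get_latest_url(client, key):
    """Return the cached URL as a str, or None when nothing is cached yet."""
    value = client.get(key)
    if value is None:
        return None
    # redis-py returns bytes unless the client was created with decode_responses=True
    return value.decode("utf-8") if isinstance(value, bytes) else value


def save_latest_url(client, key, url):
    """Overwrite the cached URL with the page that was just saved."""
    client.set(key, url)


# Hypothetical usage; swap FakeRedis() for a real redis.Redis(...) client
client = FakeRedis()
before = get_latest_url(client, "laoshiren")
save_latest_url(client, "laoshiren", "https://www.piaotia.com/html/15/15679/11397663.html")
after = get_latest_url(client, "laoshiren")
print(before, after)
```

Storing only the last successfully saved URL keeps the resume logic simple: on restart the scraper reopens that page, finds its chapter already in the database, and moves on to the next one.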

Part 3: The Chapter Exporter (`ChapterExporter`)

Code structure

The ChapterExporter class exports chapters from the database to a text file.

1. Initialization

class ChapterExporter:
    def __init__(self, db_path: str = 'articles6.db', output_dir: str = '.', default_encoding: str = 'utf-8'):
        self.db_path = Path(db_path)
        self.output_dir = Path(output_dir)
        self.default_encoding = default_encoding
        self.output_dir.mkdir(parents=True, exist_ok=True)

output_dir: the output directory, wrapped in Path for cross-platform path handling.
The output directory is created if it does not already exist.

2. Fetching chapters

def _fetch_chapters(self):
    self.db_cursor.execute('SELECT title, content FROM articles ORDER BY id ASC')
    chapters = self.db_cursor.fetchall()
    chapters_with_numbers = []
    for title, content in chapters:
        match = re.search(r'第(\d+)章', title)
        if match:
            chapter_num = int(match.group(1))
            chapters_with_numbers.append((chapter_num, title, content or ""))
    return sorted(chapters_with_numbers, key=lambda x: x[0])

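All chapters are read from the database in ascending id order.
A regular expression extracts the chapter number (for example "第5章", chapter 5), and the result is sorted by that number. Titles that do not contain a number in this form are skipped and do not appear in the export.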
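
To make the matching behaviour concrete, the sketch below runs the same pattern over a few made-up titles. The helper name chapter_number and the sample titles are illustrative assumptions.

```python
import re

# Same pattern as in _fetch_chapters
CHAPTER_RE = re.compile(r"第(\d+)章")


def chapter_number(title):
    """Return the chapter number found in the title, or None if there is none."""
    match = CHAPTER_RE.search(title)
    return int(match.group(1)) if match else None


# Made-up titles: two with digits, one with a Chinese numeral, one with no number
titles = ["第12章 Example B", "第2章 Example A", "第三章 Example C", "Author's note"]
numbered = sorted(
    (chapter_number(t), t) for t in titles if chapter_number(t) is not None
)
print(numbered)
# [(2, '第2章 Example A'), (12, '第12章 Example B')]
```

Sorting on the integer puts chapter 2 before chapter 12, which sorting on the title string would not. The title written with a Chinese numeral (第三章) is dropped because \d only matches decimal digits, so a site that numbers chapters that way would need a different pattern.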

3. Exporting to a file

def export(self, start_chapter: Optional[int] = None, last_n_chapters: Optional[int] = None, encoding: Optional[str] = None):
    encoding = encoding or self.default_encoding
    self._connect_db()
    chapters = self._fetch_chapters()
    chapters_to_export = self._filter_chapters(chapters, start_chapter, last_n_chapters)

    filename = self._generate_filename()
    filepath = self.output_dir / filename
    with filepath.open('w', encoding=encoding, errors='replace') as f:
        for _, title, content in chapters_to_export:
            f.write(f"{title}\n\n{content.strip()}\n\n{'='*50}\n\n")
    logging.info(f"Exported {len(chapters_to_export)} chapters to file: {filepath}")

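Chapters can be filtered by a starting chapter number (start_chapter) or limited to the last N chapters (last_n_chapters).
The output file name carries a timestamp (for example 捞尸人_20250305_123456.txt).
The file is written in the requested encoding, with a line of 50 = characters separating the chapters.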
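
The `_filter_chapters` and `_generate_filename` helpers are not listed in this tutorial. A minimal sketch of how they might behave is given below; the function names, the default prefix, and the rule that start_chapter is applied before last_n_chapters are all assumptions.

```python
from datetime import datetime


def filter_chapters(chapters, start_chapter=None, last_n_chapters=None):
    """Filter (number, title, content) tuples that are already sorted by number."""
    if start_chapter is not None:
        chapters = [c for c in chapters if c[0] >= start_chapter]
    # Guard against 0: chapters[-0:] would return the whole list
    if last_n_chapters is not None and last_n_chapters > 0:
        chapters = chapters[-last_n_chapters:]
    return list(chapters)


def generate_filename(prefix="捞尸人", now=None):
    """Build a name such as 捞尸人_20250305_123456.txt from the current time."""
    now = now or datetime.now()
    return f"{prefix}_{now:%Y%m%d_%H%M%S}.txt"


# Hypothetical data: four chapters with empty bodies
chapters = [(1, "第1章", ""), (2, "第2章", ""), (3, "第3章", ""), (4, "第4章", "")]
last_three = [num for num, _, _ in filter_chapters(chapters, last_n_chapters=3)]
from_three = [num for num, _, _ in filter_chapters(chapters, start_chapter=3)]
name = generate_filename(now=datetime(2025, 3, 5, 12, 34, 56))
print(last_three, from_three, name)
```

Passing the timestamp in as a parameter keeps the file name deterministic in tests while still defaulting to the current time in normal use.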

Part 4: Running and Debugging

Example run

if __name__ == "__main__":
    # Scrape
    scraper = ArticleScraper()
    with sync_playwright() as playwright:
        scraper.run(playwright)

    # Export
    exporter = ChapterExporter(db_path='articles6.db', output_dir='exports')
    exporter.export(last_n_chapters=3, encoding='gbk')
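
The example exports with encoding='gbk', and export opens the file with errors='replace'. The sketch below shows what that combination does to a character that GBK cannot represent. The sample text and the temporary file are made up for illustration.

```python
import tempfile
from pathlib import Path

# Made-up text: the Chinese characters exist in GBK, the emoji does not
text = "第1章 章节 😀"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.txt"
    # Same open() arguments as in ChapterExporter.export
    with path.open("w", encoding="gbk", errors="replace") as f:
        f.write(text)
    round_trip = path.read_text(encoding="gbk")

print(round_trip)
# 第1章 章节 ?
```

With errors='replace' the export never fails on an unencodable character, but that character is silently written as ?. Exporting with the default utf-8 avoids the loss if the reader application supports it.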