CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

This is an EPUB translation project that converts English books to Chinese. The system extracts HTML files from EPUB archives, processes and translates the content using various AI translation APIs (DeepSeek, SiliconFlow, etc.), and maintains translation progress in a SQLite database.

Architecture

Core Components

  • Translation Engine: Main translation scripts in the /code/ directory that handle batch translation with progress tracking
  • HTML Processing: HTML content extraction and cleaning utilities for EPUB structure
  • Configuration Management: YAML-based configuration system for API keys, translation settings, and database paths
  • Progress Tracking: SQLite database system to track translation progress across files and translation groups
  • EPUB Structure: Organized directories (001/, 003/, 004/) containing different EPUB extracts with HTML files in /Ops/ subdirectories

Key Translation Scripts

  • translate_epub_v4(单线程版本)V3.py - Original stable single-threaded translation script (the Chinese part of the filename means "single-threaded version")
  • translate_epub_v4_optimized.py - Recommended. Optimized version with batched database operations and translation caching
  • translate_epub.py - Multi-threaded version with async processing
  • main.py - Simple API test script

HTML Processing

  • Files are organized in EPUB structure: META-INF/, Ops/ (HTML content), images/
  • HTML files contain structured content with CSS classes (primarily p34 for paragraphs)
  • Processing scripts clean and prepare HTML for translation while preserving structure
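
As a rough illustration of the extraction step, the sketch below pulls the text of paragraphs with the p34 class out of an HTML string using only the standard library. The names ParagraphExtractor and extract_paragraphs are hypothetical and are not taken from the scripts in this repository, which may use a different parser.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of every <p> element carrying a given CSS class."""

    def __init__(self, css_class="p34"):
        super().__init__(convert_charrefs=True)
        self.css_class = css_class
        self.paragraphs = []   # finished paragraph texts, in document order
        self._buffer = None    # text chunks while inside a matching <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p" and self._buffer is None:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Collapse runs of whitespace so line breaks inside a paragraph vanish.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        # Text inside nested inline tags (<i>, <b>, ...) is kept as well.
        if self._buffer is not None:
            self._buffer.append(data)


def extract_paragraphs(html, css_class="p34"):
    """Return the text of all <p class="..."> elements matching css_class."""
    parser = ParagraphExtractor(css_class)
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

This version returns plain text only. A translator that must preserve inline markup would need to keep the tags and re-insert them after translation.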

Common Commands

Running Translation

# RECOMMENDED: Optimized version with caching and batch operations
python code/translate_epub_v4_optimized.py

# Original stable version
python "code/translate_epub_v4(单线程版本)V3.py"

# Multi-threaded version
python code/translate_epub.py

# Performance comparison test
python code/performance_test.py

Progress Monitoring

# Check translation progress
sqlite3 translation_progress.db "SELECT * FROM translation_progress;"

# Detailed file progress
sqlite3 translation_progress.db "SELECT file_path, ROUND(processed_lines * 100.0 / total_lines, 2) as progress_percent, status, last_updated FROM file_progress;"

# Translation group progress
sqlite3 translation_progress.db "SELECT file_path, group_index, status, updated_at FROM group_progress ORDER BY file_path, group_index;"

# Check translation cache effectiveness (optimized version)
sqlite3 translation_progress.db "SELECT COUNT(*) as cached_translations, AVG(access_count) as avg_reuse FROM translation_cache;"
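
The per-file query above can also be run from Python. This is a minimal sketch that assumes the file_progress columns named in that query; the function name file_progress_report is hypothetical.

```python
import sqlite3


def file_progress_report(conn):
    """Return (file_path, percent_done, status) rows from file_progress.

    SQLite evaluates division by zero to NULL, so a file whose
    total_lines is 0 comes back with percent_done set to None.
    """
    query = """
        SELECT file_path,
               ROUND(processed_lines * 100.0 / total_lines, 2),
               status
        FROM file_progress
        ORDER BY file_path
    """
    return conn.execute(query).fetchall()


def print_report(db_path="translation_progress.db"):
    """Print one line per file; the default path mirrors the commands above."""
    conn = sqlite3.connect(db_path)
    try:
        for path, percent, status in file_progress_report(conn):
            shown = "n/a" if percent is None else f"{percent:.2f}%"
            print(f"{path}: {shown} ({status})")
    finally:
        conn.close()
```

Calling print_report() from the directory that holds translation_progress.db prints the same information as the sqlite3 command.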

Dependencies

# Install required packages
pip install -r code/requirements.txt

Configuration

API Configuration

The system supports multiple AI translation providers configured in code/config.yaml:

  • DeepSeek API (default: deepseek-chat model)
  • SiliconFlow API
  • Zhipu API (GLM models)

Translation Settings

  • Configurable batch sizes (min: 3 lines, max: 10 lines per request)
  • Error handling with retry logic and cooldown periods
  • Concurrent request limits and timeout settings
  • Translation caching to avoid re-translating identical content
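
The batching and retry settings above can be sketched as follows. This is an illustration under assumptions, not the logic of the actual scripts: group_lines and call_with_retry are hypothetical names, the linearly growing cooldown is one possible policy, and the rebalancing of a short final batch assumes max_size is at least 2 * min_size - 1 (true for 3 and 10).

```python
import time


def group_lines(lines, min_size=3, max_size=10):
    """Split lines into batches of at most max_size, in order.

    If the last batch would be shorter than min_size, lines are moved
    over from the batch before it so that both stay within bounds.
    A single short batch (fewer than min_size lines in total) is kept.
    """
    batches = [lines[i:i + max_size] for i in range(0, len(lines), max_size)]
    if len(batches) >= 2 and len(batches[-1]) < min_size:
        need = min_size - len(batches[-1])
        batches[-1] = batches[-2][-need:] + batches[-1]
        batches[-2] = batches[-2][:-need]
    return batches


def call_with_retry(func, *args, retries=3, cooldown=2.0, sleep=time.sleep):
    """Call func(*args), retrying on any exception.

    Waits cooldown * attempt seconds between attempts and re-raises
    the last error once all attempts are used up.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return func(*args)
        except Exception as exc:  # a real script would catch narrower errors
            last_error = exc
            if attempt < retries:
                sleep(cooldown * attempt)
    raise last_error
```

With the defaults, 21 lines are split into batches of 10, 8, and 3 lines instead of leaving a single-line final request.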

Database Schema

  • translation_progress: Overall progress tracking
  • file_progress: Per-file translation status
  • group_progress: Translation group/batch status
  • translation_cache: Cached translations (optimized version only)
  • line_progress: Individual line translation tracking
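
For orientation, here is a hedged sketch of what three of these tables might look like. The file_progress and group_progress columns are the ones used by the monitoring queases in this file; column types, defaults, keys, and the translation_cache columns other than access_count are assumptions. The real schema is created by the translation scripts and may differ.

```python
import sqlite3

# Illustrative schema only; see the lead-in for which parts are assumed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS file_progress (
    file_path       TEXT PRIMARY KEY,
    total_lines     INTEGER NOT NULL DEFAULT 0,
    processed_lines INTEGER NOT NULL DEFAULT 0,
    status          TEXT NOT NULL DEFAULT 'pending',
    last_updated    TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS group_progress (
    file_path   TEXT NOT NULL,
    group_index INTEGER NOT NULL,
    status      TEXT NOT NULL DEFAULT 'pending',
    updated_at  TEXT DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (file_path, group_index)
);
CREATE TABLE IF NOT EXISTS translation_cache (
    source_hash     TEXT PRIMARY KEY,   -- hypothetical column
    translated_text TEXT NOT NULL,      -- hypothetical column
    access_count    INTEGER NOT NULL DEFAULT 0
);
"""


def init_db(conn):
    """Create the illustrative tables if they do not exist yet."""
    conn.executescript(SCHEMA)
```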

Optimization Features (translate_epub_v4_optimized.py)

Performance Enhancements

  • Batch Database Operations: Groups multiple DB writes into single transactions (20x faster DB operations)
  • Multi-level Translation Caching:
    • Memory cache (LRU) for immediate access
    • Persistent database cache for session recovery
    • File-based cache backup
  • Smart Progress Recovery: Efficiently resumes from interruption points
  • Optimized SQLite Settings: WAL mode, increased cache size, memory temp storage
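
The SQLite settings and the batched writes listed above can be illustrated roughly as below. The PRAGMA statements are standard SQLite, but the cache size value, the batch size, and the names open_tuned and BatchWriter are assumptions made for this sketch. It also assumes a group_progress table with a unique key on (file_path, group_index).

```python
import sqlite3


def open_tuned(db_path):
    """Open a connection with WAL journaling and in-memory temp storage."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA cache_size=-65536")  # negative value = size in KiB (about 64 MiB)
    conn.execute("PRAGMA temp_store=MEMORY")
    return conn


class BatchWriter:
    """Buffer group status updates and write them in one transaction."""

    def __init__(self, conn, batch_size=50):
        self.conn = conn
        self.batch_size = batch_size
        self.pending = []

    def mark_group(self, file_path, group_index, status):
        self.pending.append((file_path, group_index, status))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        # "with conn" commits once on success and rolls back on error,
        # so the whole batch costs a single transaction.
        with self.conn:
            self.conn.executemany(
                "INSERT OR REPLACE INTO group_progress "
                "(file_path, group_index, status) VALUES (?, ?, ?)",
                self.pending,
            )
        self.pending.clear()
```

A caller would invoke flush() once more before exiting, so that a partially filled batch is not lost on shutdown.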

Cache System Benefits

  • Avoids re-translating identical content across files
  • Persistent cache survives script restarts
  • Cache hit rates typically 30-60% for similar content
  • Automatic cache cleanup for old unused entries
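
A two-level cache of this kind could look roughly like the sketch below: an in-memory LRU layer in front of a translation_cache table. Only the table name and the access_count column come from this file. The class name, the source_hash and translated_text columns, and the use of a SHA-256 hash as the key are assumptions.

```python
import hashlib
import sqlite3
from collections import OrderedDict


class TranslationCache:
    """In-memory LRU layer backed by a persistent SQLite table."""

    def __init__(self, conn, memory_size=1000):
        self.conn = conn
        self.memory_size = memory_size
        self.memory = OrderedDict()  # key -> translation, oldest entry first
        conn.execute(
            "CREATE TABLE IF NOT EXISTS translation_cache ("
            "source_hash TEXT PRIMARY KEY, translated_text TEXT NOT NULL, "
            "access_count INTEGER NOT NULL DEFAULT 0)"
        )

    @staticmethod
    def _key(text):
        return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()

    def _remember(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)
        while len(self.memory) > self.memory_size:
            self.memory.popitem(last=False)  # evict the least recently used entry

    def get(self, text):
        """Return the cached translation, or None on a miss."""
        key = self._key(text)
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key]
        row = self.conn.execute(
            "SELECT translated_text FROM translation_cache WHERE source_hash = ?",
            (key,),
        ).fetchone()
        if row is None:
            return None
        with self.conn:  # count reuse served from the database layer
            self.conn.execute(
                "UPDATE translation_cache SET access_count = access_count + 1 "
                "WHERE source_hash = ?",
                (key,),
            )
        self._remember(key, row[0])
        return row[0]

    def put(self, text, translation):
        """Store a translation in both layers, keeping any access_count."""
        key = self._key(text)
        with self.conn:
            self.conn.execute(
                "INSERT OR IGNORE INTO translation_cache "
                "(source_hash, translated_text) VALUES (?, ?)",
                (key, translation),
            )
            self.conn.execute(
                "UPDATE translation_cache SET translated_text = ? WHERE source_hash = ?",
                (translation, key),
            )
        self._remember(key, translation)
```

In this sketch access_count only counts hits served from the database layer, so the average reuse figure from the monitoring query would understate reuse that the memory layer absorbs.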

Monitoring & Statistics

  • Real-time performance metrics
  • Cache effectiveness reporting
  • Database operation statistics
  • Translation speed tracking

File Organization

  • Input HTML files: Located in numbered directories (001/, 003/, 004/) under /Ops/ subdirectories
  • Translated output: Typically saved to *_translated/ directories
  • Archive versions: Historical translation script versions stored in /code/归档/ (the directory name means "archive")
  • Configuration: YAML files in the /code/ directory
  • Cache files: Stored in the /cache/ directory (optimized version)