# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is an EPUB translation project that converts English books to Chinese. The system extracts HTML files from EPUB archives, processes and translates the content using various AI translation APIs (DeepSeek, SiliconFlow, etc.), and maintains translation progress in a SQLite database.

## Architecture

### Core Components

- **Translation Engine**: Main translation scripts in `/code/` directory that handle batch translation with progress tracking
- **HTML Processing**: HTML content extraction and cleaning utilities for EPUB structure
- **Configuration Management**: YAML-based configuration system for API keys, translation settings, and database paths
- **Progress Tracking**: SQLite database system to track translation progress across files and translation groups
- **EPUB Structure**: Organized directories (001/, 003/, 004/) containing different EPUB extracts with HTML files in `/Ops/` subdirectories

### Key Translation Scripts

- `translate_epub_v4(单线程版本)V3.py` - Original stable single-threaded translation script
- `translate_epub_v4_optimized.py` - **RECOMMENDED** Optimized version with batch DB operations and translation caching
- `translate_epub.py` - Multi-threaded version with async processing
- `main.py` - Simple API test script

### HTML Processing

- Files are organized in EPUB structure: `META-INF/`, `Ops/` (HTML content), `images/`
- HTML files contain structured content with CSS classes (primarily `p34` for paragraphs)
- Processing scripts clean and prepare HTML for translation while preserving structure

## Common Commands

### Running Translation

```bash
# RECOMMENDED: Optimized version with caching and batch operations
python code/translate_epub_v4_optimized.py

# Original stable version (quoted because the filename contains parentheses)
python "code/translate_epub_v4(单线程版本)V3.py"

# Multi-threaded version
python code/translate_epub.py

# Performance comparison test
python code/performance_test.py
```

### Progress Monitoring

```bash
# Check translation progress
sqlite3 translation_progress.db "SELECT * FROM translation_progress;"

# Detailed file progress
sqlite3 translation_progress.db "SELECT file_path, ROUND(processed_lines * 100.0 / total_lines, 2) as progress_percent, status, last_updated FROM file_progress;"

# Translation group progress
sqlite3 translation_progress.db "SELECT file_path, group_index, status, updated_at FROM group_progress ORDER BY file_path, group_index;"

# Check translation cache effectiveness (optimized version)
sqlite3 translation_progress.db "SELECT COUNT(*) as cached_translations, AVG(access_count) as avg_reuse FROM translation_cache;"
```

### Dependencies

```bash
# Install required packages
pip install -r code/requirements.txt
```

## Configuration

### API Configuration

The system supports multiple AI translation providers configured in `code/config.yaml`:

- DeepSeek API (default: `deepseek-chat` model)
- SiliconFlow API
- Zhipu API (GLM models)

### Translation Settings

- Configurable batch sizes (min: 3 lines, max: 10 lines per request)
- Error handling with retry logic and cooldown periods
- Concurrent request limits and timeout settings
- Translation caching to avoid re-translating identical content

## Database Schema

- `translation_progress`: Overall progress tracking
- `file_progress`: Per-file translation status
- `group_progress`: Translation group/batch status
- `translation_cache`: Cached translations (optimized version only)
- `line_progress`: Individual line translation tracking

## Optimization Features (translate_epub_v4_optimized.py)

### Performance Enhancements

- **Batch Database Operations**: Groups multiple DB writes into single transactions (20x faster DB operations)
- **Multi-level Translation Caching**:
  - Memory cache (LRU) for immediate access
  - Persistent database cache for session recovery
  - File-based cache backup
- **Smart Progress Recovery**: Efficiently resumes from interruption points
- **Optimized SQLite Settings**: WAL mode, increased cache size, memory temp storage

### Cache System Benefits

- Avoids re-translating identical content across files
- Persistent cache survives script restarts
- Cache hit rates typically 30-60% for similar content
- Automatic cache cleanup for old unused entries

### Monitoring & Statistics

- Real-time performance metrics
- Cache effectiveness reporting
- Database operation statistics
- Translation speed tracking

## File Organization

- Input HTML files: Located in numbered directories (001/, 003/, 004/) under `/Ops/` subdirectories
- Translated output: Typically saved to `*_translated/` directories
- Archive versions: Historical translation script versions stored in `/code/归档/`
- Configuration: YAML files in `/code/` directory
- Cache files: Stored in `/cache/` directory (optimized version)
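
## Example Sketch: Cache and Batched Writes

The optimization features listed for `translate_epub_v4_optimized.py` (SQLite tuning, an in-memory LRU layer in front of a persistent cache, and batched writes) can be illustrated with a minimal sketch. This is not the script's actual code: `open_db`, `TranslationCache`, the `source_hash` and `translated` columns, and the cache size value are hypothetical names and numbers chosen for illustration. Only the `translation_cache` table name and its `access_count` column come from this document.

```python
import hashlib
import sqlite3
from collections import OrderedDict


def open_db(path=":memory:"):
    """Open a connection with the SQLite settings named in this document."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")   # WAL mode (ignored for in-memory DBs)
    conn.execute("PRAGMA cache_size=-65536")  # ~64 MiB page cache; size is an assumption
    conn.execute("PRAGMA temp_store=MEMORY")  # keep temporary tables in memory
    # Hypothetical schema: the real columns may differ.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS translation_cache ("
        "source_hash TEXT PRIMARY KEY, translated TEXT NOT NULL, "
        "access_count INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


class TranslationCache:
    """In-memory LRU layer in front of a persistent SQLite table."""

    def __init__(self, conn, max_items=1000):
        self.conn = conn
        self.max_items = max_items
        self.memory = OrderedDict()  # key -> translation, oldest entry first
        self.pending = []            # rows waiting for the next batched write

    @staticmethod
    def _key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def _remember(self, key, translated):
        self.memory[key] = translated
        self.memory.move_to_end(key)
        while len(self.memory) > self.max_items:
            self.memory.popitem(last=False)  # evict the least recently used entry

    def get(self, text):
        key = self._key(text)
        if key in self.memory:  # level 1: memory
            self.memory.move_to_end(key)
            return self.memory[key]
        row = self.conn.execute(  # level 2: database
            "SELECT translated FROM translation_cache WHERE source_hash = ?",
            (key,),
        ).fetchone()
        if row is None:
            return None
        with self.conn:  # commit the reuse counter immediately
            self.conn.execute(
                "UPDATE translation_cache SET access_count = access_count + 1 "
                "WHERE source_hash = ?",
                (key,),
            )
        self._remember(key, row[0])
        return row[0]

    def put(self, text, translated):
        key = self._key(text)
        self._remember(key, translated)
        self.pending.append((key, translated))  # defer the DB write

    def flush(self):
        """Write all pending rows in a single transaction; return the row count."""
        with self.conn:
            self.conn.executemany(
                "INSERT OR REPLACE INTO translation_cache "
                "(source_hash, translated) VALUES (?, ?)",
                self.pending,
            )
        count = len(self.pending)
        self.pending.clear()
        return count
```

A caller would check `cache.get(line)` before sending a line to the translation API, call `cache.put(line, result)` after each successful translation, and call `cache.flush()` once per translation group so that many inserts share one commit. Keying on a hash of the source text is what lets identical content in different files reuse a single translation.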