Swift-Search-Rs

# ⚡ Swift Search RS

### The Fastest Open-Source Meta-Search Engine, Built in Pure Rust

[![Version](https://img.shields.io/badge/version-5.0.1-blue?style=flat-square)](https://github.com/SandeepAi369/Swift-Search-Rs)
[![Rust](https://img.shields.io/badge/rust-100%25-orange?style=flat-square&logo=rust)](https://www.rust-lang.org/)
[![Engines](https://img.shields.io/badge/search%20engines-90+-brightgreen?style=flat-square)](https://github.com/SandeepAi369/Swift-Search-Rs)
[![License](https://img.shields.io/badge/license-Apache%202.0-purple?style=flat-square)](./LICENSE)

**90+ search engines** · **Stealth WAF bypass** · **BM25 ranking** · **5-tier content extraction** · **BYOK LLM synthesis**

*A single Rust binary that queries 90+ search engines simultaneously, extracts clean article text from every result, ranks them with BM25, and optionally synthesizes answers using any LLM provider, all without a single line of Python, Node, or Java.*

[Quick Start](#-quick-start) · [How It's Different](#-how-its-different) · [Architecture](#-architecture) · [API Reference](#-api-reference) · [Configuration](#%EF%B8%8F-configuration)

## 🏆 How It's Different

Most meta-search tools (SearXNG, Searx, etc.) simply proxy queries and return URLs. Swift Search RS goes further, adding content extraction, relevance ranking, stealth, and optional LLM synthesis on top:

| Capability | SearXNG | Perplexity | Swift Search RS |
|---|---|---|---|
| Meta-search across engines | ✅ ~70 | ❌ proprietary | ✅ 90+ engines |
| Full article text extraction | ❌ | ❌ (summary only) | ✅ 5-tier extractor |
| BM25 relevance ranking | ❌ | ❌ | ✅ paragraph-level |
| Anti-bot / WAF stealth | ❌ | N/A | ✅ 18 browser profiles |
| Iterative deep research | ❌ | ✅ (paid) | ✅ multi-batch |
| Domain-specialized search | ❌ | ❌ | ✅ 5 domain modes |
| Self-hosted / no API keys | ✅ | ❌ | ✅ zero dependencies |
| LLM provider (BYOK) | ❌ | ✅ (locked) | ✅ 15+ providers |
| SSE real-time streaming | ❌ | ✅ | ✅ native SSE |
| Smart engine fallback | ❌ | N/A | ✅ 2-phase dispatch |
| Proxy pool + Tor support | partial | ❌ | ✅ round-robin + cooldown |
| Pure Rust / single binary | ❌ (Python) | ❌ (cloud) | ✅ ~15MB binary |

## What Makes This Unique

### 🔍 90+ Search Engines: The Widest Coverage in Any Open-Source Tool

Not just "supports": every engine is actually **queried in parallel**, with query snowballing.

- **Major**: Google (14 regional variants), Bing (14 regional), DuckDuckGo, Brave, Yahoo
- **Privacy**: Startpage, Qwant, Mojeek, Swisscows, MetaGer, Search Encrypt, Presearch
- **Academic**: Google Scholar, Wikipedia (API-native)
- **Regional**: Yandex, Baidu, Sogou, Naver, Daum, Seznam, Rambler
- **Independent**: Wiby, Marginalia, Stract, Right DAO, Mwmbl, Yep
- **Aggregators**: Dogpile, WebCrawler, Info, Excite, Lycos, AOL
- **Vertical**: Google News, Bing News, Yahoo News, Brave News, DDG News/Images/Videos, Bing Images/Videos, Google Images/Videos

Each engine has a **dedicated HTML parser**: no API keys needed, no rate-limit dependencies.

### 🛡️ Military-Grade Stealth: 18 Browser Fingerprints

Every request rotates through **18 real browser profiles** with:

- Realistic `User-Agent` strings (Chrome 127–131, Firefox 128–133, Edge 131, Safari 18.2)
- Full `Sec-CH-UA` client hint suite (Arch, Bitness, Platform-Version, Full-Version-List)
- Randomized `Accept` header variants to evade fingerprint correlation
- Per-request cookie isolation (no cross-request state leaks)
- Configurable jitter timing between requests (50–200ms default)
- Optional proxy pool with health tracking and auto-cooldown
- Tor SOCKS5 proxy integration (multi-port)

This bypasses Cloudflare, Akamai, and Imperva WAFs consistently, something no other open-source search engine even attempts.
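
The rotation itself can be sketched in a few lines. This is a minimal illustration, not the code in `config.rs`: it assumes plain round-robin selection (the project may pick profiles randomly instead), and `BrowserProfile`, `ProfileRotator`, and the two sample profiles are hypothetical stand-ins for the 18 real ones.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// A small subset of what a browser profile would carry (hypothetical type).
struct BrowserProfile {
    name: &'static str,
    user_agent: &'static str,
}

/// Two illustrative profiles; the project describes 18.
static PROFILES: [BrowserProfile; 2] = [
    BrowserProfile {
        name: "chrome-131-windows",
        user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    },
    BrowserProfile {
        name: "firefox-133-windows",
        user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0",
    },
];

/// Hands out profiles in round-robin order; safe to share across threads.
struct ProfileRotator {
    next: AtomicUsize,
}

impl ProfileRotator {
    fn new() -> Self {
        Self { next: AtomicUsize::new(0) }
    }

    /// Pick the profile for the next outgoing request.
    fn pick(&self) -> &'static BrowserProfile {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % PROFILES.len();
        &PROFILES[i]
    }
}
```

An atomic counter keeps `pick` lock-free, which matters when many engine requests are dispatched concurrently.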

### 📖 5-Tier Content Extraction: Not Just URLs

While other search tools give you links, Swift Search RS **scrapes and extracts the actual article text**:

1. **Structured Selectors**: `.entry-content`, `.article-body`, `#main-content` (35+ CMS patterns)
2. **Semantic HTML5**: semantic container elements, `[role="main"]`, `[itemprop="articleBody"]`
3. **Scored Container**: text-density scoring with link-ratio penalty (trafilatura-inspired)
4. **Content Elements**: fallback collection of individual content elements
5. **Full Body**: last-resort visible text extraction with boilerplate filtering
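
The tiers above behave like a fallback chain: try the most precise strategy first and fall through when it yields too little text. The sketch below shows only that control flow. The `Tier` alias, `extract_tiered`, and the `min_chars` acceptance threshold are assumptions made for illustration; the actual acceptance criteria in `extractor.rs` are not documented here.

```rust
/// A tier takes raw HTML and returns extracted text, or None if it found nothing.
type Tier = fn(&str) -> Option<String>;

/// Run tiers in order and return the name and output of the first one
/// that produces at least `min_chars` characters of text.
fn extract_tiered(
    html: &str,
    tiers: &[(&'static str, Tier)],
    min_chars: usize,
) -> Option<(&'static str, String)> {
    tiers.iter().find_map(|(name, tier)| {
        tier(html)
            .filter(|text| text.trim().chars().count() >= min_chars)
            .map(|text| (*name, text))
    })
}
```

Returning the winning tier's name alongside the text makes it easy to log which strategy succeeded for a given page.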

Plus: paragraph deduplication, boilerplate line regex filtering, and per-paragraph fingerprinting.
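
Paragraph deduplication by fingerprint could look roughly like this. It is a sketch under assumptions: the normalization (lowercasing and whitespace collapsing) and the use of the standard library's `DefaultHasher` are illustrative choices, and `fingerprint` and `dedup_paragraphs` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hash a paragraph after collapsing whitespace and lowercasing it,
/// so trivially different copies map to the same fingerprint.
fn fingerprint(paragraph: &str) -> u64 {
    let normalized = paragraph
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
        .to_lowercase();
    let mut hasher = DefaultHasher::new();
    normalized.hash(&mut hasher);
    hasher.finish()
}

/// Keep the first occurrence of each paragraph, dropping blanks and repeats.
fn dedup_paragraphs(paragraphs: &[&str]) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut unique = Vec::new();
    for p in paragraphs {
        let trimmed = p.trim();
        if !trimmed.is_empty() && seen.insert(fingerprint(trimmed)) {
            unique.push(trimmed.to_string());
        }
    }
    unique
}
```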


### 🧠 BM25 Paragraph-Level Ranking

Raw results aren't enough — relevance matters. Swift Search RS breaks every scraped article into paragraph-sized chunks and scores them using the **Okapi BM25 algorithm** (the same ranking model underlying Elasticsearch):

- Term frequency (TF) analysis per chunk
- Inverse document frequency (IDF) across all chunks
- Document length normalization
- Exact phrase match bonus (+1.25 score)
- Configurable K1 (1.2) and B (0.75) parameters

The top-K most relevant chunks are passed to the LLM — not raw pages — giving dramatically better synthesis quality.
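
To make the scoring concrete, here is a compact BM25 sketch using the K1, B, and phrase-bonus values listed above. It is not the code in `ranking.rs`: the tokenizer, the Lucene-style IDF formula `ln(1 + (N - df + 0.5) / (df + 0.5))`, and the function names are assumptions chosen for illustration.

```rust
const K1: f64 = 1.2;
const B: f64 = 0.75;
/// Flat bonus for chunks that contain the whole query as a phrase.
const PHRASE_BONUS: f64 = 1.25;

/// Lowercase and split on non-alphanumeric characters (hypothetical tokenizer).
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Score every chunk against the query; returns (chunk index, score), best first.
fn rank_chunks(query: &str, chunks: &[&str]) -> Vec<(usize, f64)> {
    let docs: Vec<Vec<String>> = chunks.iter().map(|c| tokenize(c)).collect();
    let n = docs.len() as f64;
    let total_len: usize = docs.iter().map(|d| d.len()).sum();
    let avg_len = (total_len as f64 / n.max(1.0)).max(1.0);
    let terms = tokenize(query);
    let phrase = query.to_lowercase();

    // Document frequency: how many chunks contain each query term.
    let dfs: Vec<f64> = terms
        .iter()
        .map(|t| docs.iter().filter(|d| d.contains(t)).count() as f64)
        .collect();

    let mut scored: Vec<(usize, f64)> = docs
        .iter()
        .enumerate()
        .map(|(i, doc)| {
            // Length normalization: long chunks need more hits for the same score.
            let len_norm = 1.0 - B + B * doc.len() as f64 / avg_len;
            let mut score = 0.0;
            for (term, df) in terms.iter().zip(&dfs) {
                let tf = doc.iter().filter(|t| *t == term).count() as f64;
                let idf = ((n - df + 0.5) / (df + 0.5) + 1.0).ln();
                score += idf * tf * (K1 + 1.0) / (tf + K1 * len_norm);
            }
            if score > 0.0 && chunks[i].to_lowercase().contains(&phrase) {
                score += PHRASE_BONUS;
            }
            (i, score)
        })
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored
}
```

Taking the first K entries of the returned vector gives the top-K chunks that would be handed to the LLM.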



### 🔬 Deep Research Mode: Multi-Batch Iterative Synthesis

Unlike simple "search and summarize" tools, Deep Research mode:
1. Queries **all 90+ engines** with 3 query variations (snowballing)
2. Scrapes **200+ sources** concurrently
3. Splits results into **batches of 50**
4. Synthesizes each batch iteratively — each batch builds on the previous report
5. Produces a **comprehensive research paper** with proper source citations

This is the open-source equivalent of Perplexity Pro Search — without the subscription.
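
Steps 3 and 4 amount to a fold over batches, where each LLM call sees the previous report plus the next batch. The sketch below shows that shape only; `iterative_synthesis` is a hypothetical name, and the `synthesize` closure stands in for the real LLM request in `llm.rs`.

```rust
/// Fold batches of sources into one running report.
/// `synthesize` receives (previous report, current batch) and returns the new report.
fn iterative_synthesis<F>(sources: &[String], batch_size: usize, mut synthesize: F) -> String
where
    F: FnMut(&str, &[String]) -> String,
{
    let mut report = String::new();
    // `chunks` panics on a size of zero, so clamp to at least one.
    for batch in sources.chunks(batch_size.max(1)) {
        report = synthesize(&report, batch);
    }
    report
}
```

With 120 sources and a batch size of 50, the closure runs three times, on batches of 50, 50, and 20.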



### 🎯 Domain-Specialized Search

5 curated domain modes with optimized engine sets and query handling:

| Domain | Focus Engines | Use Case |
|---|---|---|
| 💻 **Tech** | Stack Overflow, GitHub, HN, dev blogs | Programming, APIs, DevOps |
| 🧬 **Science** | Google Scholar, PubMed, arXiv, Nature | Research papers, studies |
| 📊 **Finance** | Bloomberg, Reuters, Yahoo Finance | Markets, earnings, macro |
| 🏥 **Health** | NIH, WHO, Mayo Clinic, medical journals | Medical, clinical data |
| 📰 **News** | All news-specific engine variants | Breaking news, current events |

Each mode uses a separate **Category pill** in the UI — composable with any search mode (Lite + Tech, Research + Science, etc.)


---

## 🏗️ Architecture

```
  Client Request           Swift Search RS v5.0.1
  ┌──────────┐        ┌────────────────────────────────────────────────┐
  │ POST     │        │                                                │
  │ /search  │───────►│  1. Query Snowballing (3 variations)           │
  │          │        │  2. 90+ Engine Dispatch (semaphore-bounded)    │
  │          │        │  3. Smart Fallback (primary → backup engines)  │
  │          │        │  4. URL Dedup (single-parse pipeline)          │
  │          │        │  5. Concurrent Scrape (24 workers)             │
  │          │        │  6. 5-Tier Content Extraction                  │
  │          │        │  7. BM25 Paragraph Ranking                     │
  │          │        │  8. Optional LLM Synthesis (BYOK)              │
  │          │        │                                                │
  │  ◄───────┤────────│  Response: sources + extracted text + answer   │
  └──────────┘        └────────────────────────────────────────────────┘
```

### Smart Engine Fallback (2-Phase Dispatch)

```
Phase 1: Primary engines (fast, reliable)
    │
    ├── Results >= 8? ──► Continue to scraping
    │
    └── Results < 8? ──► Phase 2: Backup engines (18 alternatives)
                              └──► Guarantees data even when top engines fail
```
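
In code, the dispatch rule in the diagram reduces to a single threshold check. This is a simplified, synchronous sketch: `two_phase_dispatch` and its closure parameters are hypothetical, and the real implementation in `search.rs` presumably queries engines concurrently.

```rust
/// Minimum number of phase-1 results before backup engines are skipped.
const MIN_PRIMARY_RESULTS: usize = 8;

/// Run the primary engines, and only call the backups when too few results came back.
fn two_phase_dispatch<P, Q>(query_primary: P, query_backup: Q) -> Vec<String>
where
    P: FnOnce() -> Vec<String>,
    Q: FnOnce() -> Vec<String>,
{
    let mut results = query_primary();
    if results.len() < MIN_PRIMARY_RESULTS {
        // Phase 2: top up with whatever the backup engines return.
        results.extend(query_backup());
    }
    results
}
```

Passing the backup query as a closure means it costs nothing when phase 1 already returned enough results.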

---

## 📁 Project Structure

```
Swift-Search-Rs/
├── Cargo.toml               # Dependencies & release optimizations (LTO, strip)
├── Dockerfile               # Multi-stage Docker build (~15MB final image)
├── LICENSE                   # Apache 2.0
├── README.md
├── ui.html                  # Perplexity-style search interface (embedded at compile time)
├── scripts/
│   ├── ram_monitor.sh       # Memory usage monitoring utility
│   └── test_fallback.py     # Engine fallback integration test
└── src/
    ├── main.rs              # Axum HTTP server — routes, middleware, SSE
    ├── config.rs            # 18 browser profiles, WAF bypass, env config
    ├── models.rs            # Request/Response types (serde JSON)
    ├── search.rs            # Search orchestration + 2-phase engine dispatch
    ├── stream.rs            # SSE streaming pipeline (/search/stream)
    ├── ranking.rs           # BM25 paragraph chunking & relevance ranking
    ├── llm.rs               # BYOK LLM: 15+ providers, iterative research synthesis
    ├── extractor.rs         # 5-tier content extraction (LazyLock optimized)
    ├── url_utils.rs         # URL normalization, single-parse dedup pipeline
    ├── cache.rs             # TempDb (in-memory) + HistoryDb (persistent JSON)
    ├── copilot.rs           # LLM-powered query rewriter
    ├── proxy_pool.rs        # Round-robin proxy rotation with health tracking
    └── engines/
        ├── mod.rs           # SearchEngine trait + engine factory + domain modes
        ├── generic.rs       # Template engine for 60+ regional variants
        ├── duckduckgo.rs    # DuckDuckGo HTML scraper
        ├── brave.rs         # Brave Search scraper
        ├── yahoo.rs         # Yahoo Search scraper
        ├── qwant.rs         # Qwant scraper
        ├── mojeek.rs        # Mojeek scraper
        ├── startpage.rs     # Startpage scraper
        ├── wikipedia.rs     # Wikipedia JSON API engine
        └── wiby.rs          # Wiby indie search engine
```

**Total**: ~6,200 lines of pure Rust · Zero Python · Zero Node · Zero Java

---

## ⚡ Quick Start

### Build from Source

```bash
git clone https://github.com/SandeepAi369/Swift-Search-Rs.git
cd Swift-Search-Rs

# Build optimized release binary
cargo build --release

# Run (starts on http://localhost:8000)
./target/release/swift-search-rs
```

### Docker

```bash
docker build -t swift-search-rs .
docker run -p 8000:8000 swift-search-rs
```

### Verify

```bash
# Health check
curl http://localhost:8000/health

# Basic search (returns sources + extracted text)
curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"query": "quantum computing breakthroughs 2026"}'

# Search with LLM answer (BYOK — bring your own key)
curl -X POST http://localhost:8000/search/lite-llm \
  -H "Content-Type: application/json" \
  -d '{
    "query": "explain transformer architecture",
    "llm": {
      "provider": "groq",
      "api_key": "YOUR_KEY",
      "model": "llama-3.3-70b-versatile",
      "base_url": "https://api.groq.com/openai/v1"
    }
  }'
```

### Open the UI

Navigate to `http://localhost:8000` for the built-in Perplexity-style search interface with:
- Mode selector (Lite / Deep Research / Academic / Reddit / YouTube)
- Category selector (Tech / Science / Finance / Health / News)
- Real-time searching animation with orbital spinner
- Word-by-word typewriter LLM response rendering
- Text-to-Speech with chunk pre-loading
- Settings panel for LLM provider configuration

---

## 📡 API Reference

### `POST /search`

**Standard search** — returns sources with extracted article text.

```json
// Request
{
  "query": "artificial intelligence trends 2026",
  "max_results": 30,
  "focus_mode": "lite"
}

// Response
{
  "query": "artificial intelligence trends 2026",
  "sources_found": 287,
  "sources_processed": 142,
  "search_results": [
    {
      "url": "https://www.nature.com/articles/...",
      "title": "AI breakthroughs reshape scientific discovery",
      "extracted_text": "Full article text extracted via 5-tier heuristics...",
      "char_count": 7270,
      "engine": "google_scholar"
    }
  ],
  "elapsed_seconds": 4.28,
  "engine_stats": {
    "engines_queried": ["wikipedia", "duckduckgo", "brave", "google", "..."],
    "total_raw_results": 322,
    "deduplicated_urls": 142
  }
}
```

### `POST /search/lite-llm`

Search + LLM synthesis (fast, single-pass). Requires `llm` config in request body.

### `POST /search/research-llm`

Deep Research — iterative multi-batch LLM synthesis over 200+ sources.

### `POST /search/stream`

SSE streaming endpoint — real-time source delivery + LLM token streaming.
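
For readers new to SSE, each message on the wire is a small text frame. The helper below formats one frame according to the event-stream format; the function name and the `source` and `token` event names used with it are assumptions, since the event names emitted by `/search/stream` are not listed here.

```rust
/// Format one Server-Sent Events frame.
/// Multi-line payloads are split into one `data:` line per line of input.
fn sse_frame(event: &str, data: &str) -> String {
    let mut frame = format!("event: {event}\n");
    for line in data.split('\n') {
        frame.push_str("data: ");
        frame.push_str(line);
        frame.push('\n');
    }
    // A blank line terminates the frame.
    frame.push('\n');
    frame
}
```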

### `GET /health`

```json
{
  "status": "ok",
  "version": "5.0.1",
  "engines": ["wikipedia", "duckduckgo", "brave", "...90 total..."],
  "uptime_seconds": 3600
}
```

### `GET /config`

Returns current runtime configuration (concurrency, timeouts, engine list, proxy status).

### `POST /api/tts`

Text-to-Speech synthesis via external TTS provider.

### `POST /api/models`

Dynamic model discovery — fetches available models from any OpenAI-compatible endpoint.

---

## 🔌 Supported LLM Providers

| Provider | Default Model | Notes |
|---|---|---|
| **Cerebras** | `llama-3.3-70b` | Fastest inference |
| **Groq** | `llama-3.3-70b-versatile` | Free tier available |
| **OpenAI** | `gpt-4o-mini` | GPT family |
| **Anthropic** | `claude-3-5-haiku-latest` | Claude family |
| **Google Gemini** | `gemini-2.0-flash` | Gemini family |
| **xAI** | `grok-2-latest` | Grok family |
| **DeepSeek** | `deepseek-chat` | Cost-effective |
| **Ollama** | `llama3` | Local / self-hosted |
| **OpenRouter** | `openai/gpt-4o-mini` | Multi-model router |
| **Together AI** | `llama-3.1-70B-Instruct-Turbo` | Open-source models |
| **Fireworks AI** | `llama-v3p1-70b-instruct` | Fast open-source |
| **SambaNova** | `Meta-Llama-3.1-70B-Instruct` | Enterprise |
| **NVIDIA NIM** | `llama-3.1-70b-instruct` | GPU-optimized |
| **Any OpenAI-compatible** | Custom | Any `/v1/chat/completions` endpoint |

---

## ⚙️ Configuration

All environment variables are optional — sensible defaults built-in.

### Search & Scraping

| Variable | Default | Description |
|---|---|---|
| `ENGINES` | 90 engines (curated) | Comma-separated engine names to enable |
| `MAX_URLS` | `420` | Maximum URLs to scrape per query |
| `CONCURRENCY` | `24` | Concurrent scrape workers |
| `ENGINE_CONCURRENCY` | `10` | Concurrent engine-query workers |
| `JITTER_MIN_MS` | `50` | Min random delay between engine requests (stealth) |
| `JITTER_MAX_MS` | `200` | Max random delay between engine requests (stealth) |
| `SCRAPE_TIMEOUT` | `0` | Per-URL scrape timeout in seconds |
| `MAX_HTML_BYTES` | `1500000` | Max HTML download size per page |
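
A minimal sketch of how such optional variables can be read with defaults follows. It assumes that unset or unparsable values silently fall back to the defaults in the table; `ScrapeConfig`, `parse_or_default`, and `from_lookup` are hypothetical names, not the types in `config.rs`.

```rust
use std::str::FromStr;

/// Parse an optional raw value, falling back to `default` when it is missing or invalid.
fn parse_or_default<T: FromStr>(raw: Option<String>, default: T) -> T {
    raw.and_then(|v| v.trim().parse().ok()).unwrap_or(default)
}

/// A subset of the settings from the table above.
#[derive(Debug, PartialEq)]
struct ScrapeConfig {
    max_urls: usize,
    concurrency: usize,
    jitter_min_ms: u64,
    jitter_max_ms: u64,
}

impl ScrapeConfig {
    /// Build the config from any name-to-value lookup (handy for tests).
    fn from_lookup(lookup: impl Fn(&str) -> Option<String>) -> Self {
        Self {
            max_urls: parse_or_default(lookup("MAX_URLS"), 420),
            concurrency: parse_or_default(lookup("CONCURRENCY"), 24),
            jitter_min_ms: parse_or_default(lookup("JITTER_MIN_MS"), 50),
            jitter_max_ms: parse_or_default(lookup("JITTER_MAX_MS"), 200),
        }
    }

    /// Build the config from the process environment.
    fn from_env() -> Self {
        Self::from_lookup(|name| std::env::var(name).ok())
    }
}
```

Routing everything through `from_lookup` keeps the parsing logic testable without mutating the real environment.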

### Proxy & Stealth

| Variable | Default | Description |
|---|---|---|
| `PROXY_POOL` | *(empty)* | Comma-separated proxy URLs |
| `PROXY_POOL_FILE` | *(empty)* | File path with one proxy URL per line |
| `TOR_PROXY_PORTS` | *(empty)* | Comma-separated local Tor SOCKS5 ports |
| `PROXY_COOLDOWN_SECS` | `120` | Cooldown window after proxy failure |
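
The interaction between round-robin rotation and the cooldown window can be sketched as follows. This is an illustrative model, not the code in `proxy_pool.rs`: `ProxyPool`, `ProxyEntry`, `acquire`, and `report_failure` are hypothetical names, and the current time is passed in explicitly to keep the example deterministic.

```rust
use std::time::{Duration, Instant};

/// One proxy plus the instant until which it is benched (None means healthy).
struct ProxyEntry {
    url: String,
    cooling_until: Option<Instant>,
}

/// Round-robin pool; `cooldown` mirrors `PROXY_COOLDOWN_SECS`.
struct ProxyPool {
    entries: Vec<ProxyEntry>,
    cursor: usize,
    cooldown: Duration,
}

impl ProxyPool {
    fn new(urls: &[&str], cooldown: Duration) -> Self {
        let entries = urls
            .iter()
            .map(|u| ProxyEntry { url: u.to_string(), cooling_until: None })
            .collect();
        Self { entries, cursor: 0, cooldown }
    }

    /// Return the next proxy that is not cooling down at `now`, if any.
    fn acquire(&mut self, now: Instant) -> Option<String> {
        let len = self.entries.len();
        for offset in 0..len {
            let idx = (self.cursor + offset) % len;
            let healthy = self.entries[idx]
                .cooling_until
                .map_or(true, |until| now >= until);
            if healthy {
                self.cursor = (idx + 1) % len;
                return Some(self.entries[idx].url.clone());
            }
        }
        None
    }

    /// Bench a proxy for the cooldown window after a failed request.
    fn report_failure(&mut self, url: &str, now: Instant) {
        if let Some(entry) = self.entries.iter_mut().find(|e| e.url == url) {
            entry.cooling_until = Some(now + self.cooldown);
        }
    }
}
```

When every proxy is cooling down, `acquire` returns `None`, and the caller can decide whether to wait or fall back to a direct connection.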

### Server

| Variable | Default | Description |
|---|---|---|
| `PORT` | `8000` | HTTP server listen port |
| `RUST_LOG` | `swift_search_rs=info` | Log verbosity level |

---

## 🔒 Privacy & Security

- **Zero telemetry** — no tracking, no analytics, no phone-home
- **No cloud dependencies** — runs entirely on your hardware
- **No API keys required** — all 90 engines work without any API registration
- **Cookie isolation** — every request uses a fresh HTTP client (no cross-request state)
- **Tracking param removal** — strips 30+ UTM/analytics parameters from every URL
- **Domain blocklist** — auto-skips social media feeds, app stores, and binary file URLs
- **Optional BYOK LLM** — AI synthesis is opt-in; raw results always available
- **No data persistence** — search history is optional and local-only
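
As an illustration of the tracking-parameter removal mentioned above, here is a dependency-free sketch. The parameter list is a small assumed subset (the full list of 30+ lives in the project), and `strip_tracking_params` is a hypothetical name; a production version would more likely build on a proper URL parser.

```rust
/// Illustrative subset of click-tracking parameters; `utm_*` is matched by prefix.
const TRACKING_PARAMS: [&str; 3] = ["fbclid", "gclid", "msclkid"];

/// Remove tracking parameters from a URL's query string, keeping path and fragment.
fn strip_tracking_params(url: &str) -> String {
    let (without_fragment, fragment) = match url.split_once('#') {
        Some((head, frag)) => (head, Some(frag)),
        None => (url, None),
    };
    let (base, query) = match without_fragment.split_once('?') {
        Some((base, query)) => (base, query),
        None => (without_fragment, ""),
    };
    let kept: Vec<&str> = query
        .split('&')
        .filter(|pair| !pair.is_empty())
        .filter(|pair| {
            let key = pair.split('=').next().unwrap_or("").to_ascii_lowercase();
            !(key.starts_with("utm_") || TRACKING_PARAMS.contains(&key.as_str()))
        })
        .collect();

    let mut cleaned = base.to_string();
    if !kept.is_empty() {
        cleaned.push('?');
        cleaned.push_str(&kept.join("&"));
    }
    if let Some(frag) = fragment {
        cleaned.push('#');
        cleaned.push_str(frag);
    }
    cleaned
}
```

Stripping these parameters before deduplication also helps two links to the same article collapse into one URL.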

---

## 📊 Performance Characteristics

| Metric | Lite Mode | Deep Research Mode |
|---|---|---|
| Engines queried | 11 primary + fallback | 90+ (3 query variations) |
| Sources scraped | 30–80 | 200–400+ |
| Time to results | 3–8 seconds | 15–45 seconds |
| LLM context quality | Top 25 BM25 chunks | Full iterative batches |
| Memory footprint | ~30MB RSS | ~80MB RSS peak |
| Binary size | ~15MB (stripped, LTO) | Same binary |

---

## 🗺️ Roadmap

- [ ] Response compression (gzip/brotli for API responses)
- [ ] Built-in caching layer with TTL
- [ ] Citation graph visualization
- [ ] Plugin system for custom engines
- [ ] WebSocket streaming support
- [ ] Multi-language query support

---

## 📄 License

Copyright 2026 [Sandeep](https://xel-studio.vercel.app/)

Licensed under the [Apache License, Version 2.0](./LICENSE).

---


Built with 🦀 Rust by Sandeep

6,200 lines of pure Rust · Zero external runtime dependencies · One binary to rule them all
