@zetagoaurum-dev/straw

v1.3.1 (published 5 months ago)

Enterprise-grade unified JS/TS and Python scraping library for Web, YouTube, and Media (Images, Audio, Video, Documents)

zetagoaurum

scraping scraper youtube-scraper media-extractor anti-cors

🌟 Why Choose Straw?

If you're building data-mining tools, scraping content, or parsing media at scale, you need a solution that is anti-blocking, lightweight, and universal. Straw delivers exactly that. It is written natively in both JavaScript/TypeScript and Python, which avoids the overhead of bridging from one language to the other.

✨ Key Features

Anti-CORS & Anti-Blocking: Built-in User-Agent rotation, retries with exponential backoff, and strict TLS circumvention.
Unified DX: The exact same API semantics in both Python and Node.js. Learn once, scrape anywhere.
Zero Bloatware: No heavy dependencies (like ytdl-core or Puppeteer). Uses raw inner DOM and JSON extraction for blazing speed.
Deep Extraction:
- WebScraper: Extracts metadata, OpenGraph tags, semantic texts, and internal/external links.
- YouTubeScraper: Bypasses EU consent blocks and natively extracts stream formats (Audio/Video) directly from ytInitialPlayerResponse.
- MediaScraper: Sniffs pages for deeply embedded media including Images (.png, .webp, .svg), Documents (.pdf, .docx, .xls), Audio (.mp3, .ogg), and Video (.mp4, .webm).
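
The User-Agent rotation and exponential backoff named in the feature list can be illustrated with a small standalone sketch. This is not Straw's internal code: the `USER_AGENTS` pool, `backoff_delays`, and `fetch_with_retry` are hypothetical names chosen for the example, and the `send` callable stands in for whatever HTTP client is in use.

```python
import asyncio
import itertools
from typing import Awaitable, Callable, List

# Hypothetical pool; Straw's real User-Agent list is not shown in this README.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
]
_ua_cycle = itertools.cycle(USER_AGENTS)


def backoff_delays(retries: int, base: float = 0.5, cap: float = 8.0) -> List[float]:
    """Delay before each retry: base * 2**attempt, capped at `cap` seconds."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]


async def fetch_with_retry(
    send: Callable[[dict], Awaitable[str]],
    retries: int = 3,
    base: float = 0.5,
) -> str:
    """Call `send(headers)` with a rotated User-Agent, retrying on any exception."""
    delays = backoff_delays(retries, base)
    for attempt in range(retries + 1):
        headers = {"User-Agent": next(_ua_cycle)}  # a different UA on every attempt
        try:
            return await send(headers)
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            await asyncio.sleep(delays[attempt])
    raise RuntimeError("retries must be >= 0")
```

In practice `send` would wrap a real request (for example an httpx call) and the retry condition would likely be narrowed to network errors and specific status codes rather than every exception.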

🏗️ Architecture Tree

straw/
│
├── src/                          # TypeScript Source Code (Node.js)
│   ├── core/client.ts            # Undici-based HTTP client
│   ├── scrapers/web.ts           # General Web HTML parser (Cheerio)
│   ├── scrapers/youtube.ts       # YouTube innerTube JSON parser
│   └── scrapers/media.ts         # Generic Media & Document Sniffer
│
├── straw/                        # Python Source Code (Python 3.8+)
│   ├── client.py                 # Async HTTP client (httpx)
│   ├── web.py                    # BeautifulSoup4 HTML parser
│   ├── youtube.py                # YouTube RegExp & JSON extraction
│   └── media.py                  # Generic Media & Document Sniffer
│
├── package.json                  # NPM Metadata & Build commands
├── pyproject.toml                # PyPI Metadata & Configuration
├── README.md                     # This documentation
└── CHANGELOG.md                  # Release Version History
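
The tree describes `youtube.py` as "RegExp & JSON extraction". A rough sketch of that idea, assuming the watch page embeds a `ytInitialPlayerResponse = {...}` assignment in a script tag, is shown below. The function names and the sample HTML are illustrative only, and the field names (`videoDetails.lengthSeconds`, `streamingData.formats`, `streamingData.adaptiveFormats`) are assumptions about the page payload, not a documented Straw interface.

```python
import json
import re
from typing import Any, Dict, Optional

# Locate the start of the embedded JSON object; the regex only finds the marker.
_MARKER = re.compile(r"ytInitialPlayerResponse\s*=\s*")


def extract_player_response(html: str) -> Optional[Dict[str, Any]]:
    """Return the embedded player JSON as a dict, or None if it is absent or invalid."""
    match = _MARKER.search(html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value and ignores whatever follows it,
        # which avoids fragile brace matching with a regex.
        obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


def summarize(player: Dict[str, Any]) -> Dict[str, Any]:
    """Collect duration and stream formats (field names are assumptions)."""
    streaming = player.get("streamingData", {})
    formats = streaming.get("formats", []) + streaming.get("adaptiveFormats", [])
    duration = int(player.get("videoDetails", {}).get("lengthSeconds", 0))
    return {"durationSeconds": duration, "formats": formats}


# Hand-written stand-in for a watch page, not real YouTube output.
SAMPLE_HTML = """
<script>var ytInitialPlayerResponse = {"videoDetails": {"lengthSeconds": "635"},
"streamingData": {"formats": [{"itag": 18, "mimeType": "video/mp4"}],
"adaptiveFormats": [{"itag": 140, "mimeType": "audio/mp4"}]}};</script>
"""

if __name__ == "__main__":
    player = extract_player_response(SAMPLE_HTML)
    if player is not None:
        print(summarize(player))
```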

📦 Installation

Node.js (TypeScript/JavaScript)

npm install @zetagoaurum-dev/straw

Python

pip install httpx beautifulsoup4 lxml
# Since this is a unified repository, you can copy the `straw` Python module directly into your codebase.
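
Because the Python module is copied rather than installed, nothing checks its dependencies for you. A small pre-flight check along these lines can confirm that the three packages above are importable. The helper name `missing_modules` is made up for this example; the import names `httpx`, `bs4`, and `lxml` are the ones those packages install.

```python
import importlib.util
from typing import Iterable, List

# Import names provided by `pip install httpx beautifulsoup4 lxml`.
REQUIRED_MODULES = ("httpx", "bs4", "lxml")


def missing_modules(names: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the module names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing dependencies:", ", ".join(missing))
    else:
        print("All Straw dependencies are importable.")
```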

💻 Usage

🚀 Node.js Example

import straw from '@zetagoaurum-dev/straw';

async function main() {
    // 1. Scraping Generic Webpages
    const web = straw.web();
    const data = await web.scrape('https://example.com');
    console.log("Title:", data.title);
    console.log("Links found:", data.links.length);

    // 2. Scraping YouTube Video Streams (Without API Keys)
    const yt = straw.youtube();
    const videoInfo = await yt.scrapeVideo('https://www.youtube.com/watch?v=aqz-KE-bpKQ');
    console.log("Duration:", videoInfo.durationSeconds);
    console.log("Stream Formats Available:", videoInfo.formats.length);

    // 3. Extracting Media (Images, PDFs, MP4s) from a page
    const media = straw.media();
    const mediaLinks = await media.extractMedia('https://en.wikipedia.org/wiki/File:Big_Buck_Bunny_4K.webm');
    console.log("Media Files Found:", mediaLinks.mediaLinks);
}

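// Report failures instead of leaving an unhandled promise rejection.
main().catch((err) => {
    console.error("Scrape failed:", err);
    process.exitCode = 1;
});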

🐍 Python Example

import asyncio
from straw import WebScraper, YouTubeScraper, MediaScraper

async def main():
    # 1. Scraping Generic Webpages
    web = WebScraper()
    data = await web.scrape('https://example.com')
    print("Title:", data['title'])
    await web.client.close()

    # 2. Scraping YouTube Video Streams
    yt = YouTubeScraper()
    video_info = await yt.scrape_video('https://www.youtube.com/watch?v=aqz-KE-bpKQ')
    print("Duration:", video_info['durationSeconds'])
    await yt.client.close()

    # 3. Extracting Media
    media = MediaScraper()
    media_links = await media.extract_media('https://en.wikipedia.org/wiki/File:Big_Buck_Bunny_4K.webm')
    print("Media Found:", media_links['mediaLinks'])
    await media.client.close()

if __name__ == "__main__":
    asyncio.run(main())
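
Step 3 of the example returns links to media files. How MediaScraper finds them is not documented here. One plausible approach, sketched below with the standard library only, is to collect every `src` and `href` attribute and group the resolved URLs by file extension. The extension groups come from the feature list above; `sniff_media` and the result keys are hypothetical and do not mirror Straw's actual return shape.

```python
from html.parser import HTMLParser
from pathlib import PurePosixPath
from typing import Dict, List
from urllib.parse import urljoin, urlparse

# Extension groups taken from the feature list; the key names are illustrative.
MEDIA_TYPES = {
    "images": {".png", ".webp", ".svg"},
    "documents": {".pdf", ".docx", ".xls"},
    "audio": {".mp3", ".ogg"},
    "video": {".mp4", ".webm"},
}


class _LinkCollector(HTMLParser):
    """Collects the value of every src/href attribute in the document."""

    def __init__(self) -> None:
        super().__init__()
        self.urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.urls.append(value)


def sniff_media(html: str, base_url: str) -> Dict[str, List[str]]:
    """Group absolute media URLs found in `html` by type, without duplicates."""
    collector = _LinkCollector()
    collector.feed(html)
    found: Dict[str, List[str]] = {kind: [] for kind in MEDIA_TYPES}
    for raw in collector.urls:
        absolute = urljoin(base_url, raw)  # resolve relative links
        # Use only the URL path so query strings do not hide the extension.
        suffix = PurePosixPath(urlparse(absolute).path).suffix.lower()
        for kind, extensions in MEDIA_TYPES.items():
            if suffix in extensions and absolute not in found[kind]:
                found[kind].append(absolute)
    return found
```

Matching on extensions alone misses media served from extensionless URLs, so a fuller implementation would probably also inspect `Content-Type` headers or tag names such as `video` and `audio`.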

🛡️ Stability & Security

Quality Score: 100/100
Vulnerabilities: 0 (Checked via npm audit)
License: MIT License

👨‍💻 Credits

Authored and Maintained by ZetaGo-Aurum.
Built for the community. Designed for enterprise.