@zetagoaurum-dev/straw
v1.3.1
Enterprise-grade unified JS/TS and Python scraping library for Web, YouTube, and Media (Images, Audio, Video, Documents)
🌟 Why Choose Straw?
If you're building data-mining tools, scraping content, or parsing media at scale, you need a solution that is anti-blocking, lightweight, and universal. Straw delivers exactly that. It is implemented natively in both JavaScript/TypeScript and Python, so neither version wraps the other or adds cross-language overhead.
✨ Key Features
- Anti-CORS & Anti-Blocking: Built-in User-Agent rotation, exponential retry backoffs, and strict TLS circumvention.
- Unified DX: The exact same API semantics in both Python and Node.js. Learn once, scrape anywhere.
- Zero Bloatware: No heavy dependencies (like ytdl-core or Puppeteer). Uses raw inner DOM and JSON extraction for blazing speed.
- Deep Extraction:
  - WebScraper: Extracts metadata, OpenGraph tags, semantic texts, and internal/external links.
  - YouTubeScraper: Bypasses EU consent blocks and natively extracts stream formats (Audio/Video) directly from ytInitialPlayerResponse.
  - MediaScraper: Sniffs pages for deeply embedded media including Images (.png, .webp, .svg), Documents (.pdf, .docx, .xls), Audio (.mp3, .ogg), and Video (.mp4, .webm).
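The User-Agent rotation and exponential retry backoff named in the first bullet could look roughly like the sketch below. This is not Straw's actual client code: `USER_AGENTS`, `backoff_delays`, and `fetch_with_retry` are hypothetical names, the delay parameters are assumed values, and `fetch` stands in for any callable that takes a URL and a headers dict.

```python
import itertools
import time
from typing import Callable, Dict, List

# Hypothetical pool of User-Agent strings (an assumption, not Straw's list).
USER_AGENTS: List[str] = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]


def backoff_delays(retries: int, base: float = 0.5, cap: float = 8.0) -> List[float]:
    """Return the wait before each retry: base * 2**n, never above cap."""
    return [min(cap, base * 2 ** n) for n in range(retries)]


def fetch_with_retry(
    fetch: Callable[[str, Dict[str, str]], str],
    url: str,
    retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url, headers), rotating the User-Agent on every attempt
    and sleeping for an exponentially growing delay after each failure."""
    if retries < 0:
        raise ValueError("retries must be >= 0")
    agents = itertools.cycle(USER_AGENTS)
    delays = backoff_delays(retries)
    for attempt in range(retries + 1):
        headers = {"User-Agent": next(agents)}
        try:
            return fetch(url, headers)
        except Exception:
            # Simplification: a real client would retry only transient errors.
            if attempt == retries:
                raise  # out of retries, surface the last error
            sleep(delays[attempt])
    raise RuntimeError("unreachable")
```

Passing `sleep` and `fetch` as parameters keeps the sketch testable without network access. Adding random jitter to each delay is a common refinement when many clients retry at once.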
🏗️ Architecture Tree
straw/
│
├── src/ # TypeScript Source Code (Node.js)
│ ├── core/client.ts # Undici-based HTTP client
│ ├── scrapers/web.ts # General Web HTML parser (Cheerio)
│ ├── scrapers/youtube.ts # YouTube innerTube JSON parser
│ └── scrapers/media.ts # Generic Media & Document Sniffer
│
├── straw/ # Python Source Code (Python 3.8+)
│ ├── client.py # Async HTTP client (httpx)
│ ├── web.py # BeautifulSoup4 HTML parser
│ ├── youtube.py # YouTube RegExp & JSON extraction
│ └── media.py # Generic Media & Document Sniffer
│
├── package.json # NPM Metadata & Build commands
├── pyproject.toml # PyPI Metadata & Configuration
├── README.md # This documentation
└── CHANGELOG.md # Release Version History
📦 Installation
Node.js (TypeScript/JavaScript)
npm install @zetagoaurum-dev/straw
Python
pip install httpx beautifulsoup4 lxml
# Since this is a unified repository, you can copy the `straw` Python module directly into your codebase.
💻 Usage
🚀 Node.js Example
import straw from '@zetagoaurum-dev/straw';

async function main() {
    // 1. Scraping Generic Webpages
    const web = straw.web();
    const data = await web.scrape('https://example.com');
    console.log("Title:", data.title);
    console.log("Links found:", data.links.length);

    // 2. Scraping YouTube Video Streams (Without API Keys)
    const yt = straw.youtube();
    const videoInfo = await yt.scrapeVideo('https://www.youtube.com/watch?v=aqz-KE-bpKQ');
    console.log("Duration:", videoInfo.durationSeconds);
    console.log("Stream Formats Available:", videoInfo.formats.length);

    // 3. Extracting Media (Images, PDFs, MP4s) from a page
    const media = straw.media();
    const mediaLinks = await media.extractMedia('https://en.wikipedia.org/wiki/File:Big_Buck_Bunny_4K.webm');
    console.log("Media Files Found:", mediaLinks.mediaLinks);
}

main();
🐍 Python Example
import asyncio
from straw import WebScraper, YouTubeScraper, MediaScraper

async def main():
    # 1. Scraping Generic Webpages
    web = WebScraper()
    data = await web.scrape('https://example.com')
    print("Title:", data['title'])
    await web.client.close()

    # 2. Scraping YouTube Video Streams
    yt = YouTubeScraper()
    video_info = await yt.scrape_video('https://www.youtube.com/watch?v=aqz-KE-bpKQ')
    print("Duration:", video_info['durationSeconds'])
    await yt.client.close()

    # 3. Extracting Media
    media = MediaScraper()
    media_links = await media.extract_media('https://en.wikipedia.org/wiki/File:Big_Buck_Bunny_4K.webm')
    print("Media Found:", media_links['mediaLinks'])
    await media.client.close()

if __name__ == "__main__":
    asyncio.run(main())
🛡️ Stability & Security
- Quality Score: 100/100
- Vulnerabilities: 0 (Checked via npm audit)
- License: MIT License
👨‍💻 Credits
Authored and Maintained by ZetaGo-Aurum.
Built for the community. Designed for enterprise.
