huanqiu-news-crawler

Version: v1.0.0
Published: 2 years ago

Crawls news data from Huanqiu (环球网).

Downloads: 1


Keywords: huanqiu-news-crawler, huanqiu-crawler, huanqiu, crawler, news-crawler, cyq, peter

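Huanqiu News Crawler Script (Node.js)

Directory layout

start.js  Executable entry file
config.json  Configuration file
data  Folder where data is saved (created automatically once configured)
- news_detail.json  List data for the crawled news
- news_item.json  Detail data for the crawled news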

Usage

// This module exports a callable function
const hc = require("huanqiu-news-crawler")
hc()

On first use, turn off the isAutoAssignTime property in config, and set assignTime, maxScrollTop, and maxNewsNumber.
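
The first-run advice above can be sketched as a small helper. This is an illustration only: `prepareFirstRunConfig` is a hypothetical function that is not part of huanqiu-news-crawler, the sample values are placeholders, and the format expected for assignTime is an assumption.

```javascript
// Hypothetical helper, not part of huanqiu-news-crawler.
// Returns a copy of a config object prepared for a first run:
// isAutoAssignTime is switched off and the three data settings must be supplied.
function prepareFirstRunConfig(config, { assignTime, maxScrollTop, maxNewsNumber }) {
  const required = { assignTime, maxScrollTop, maxNewsNumber };
  for (const [key, value] of Object.entries(required)) {
    if (value === undefined || value === null) {
      throw new Error(`First run requires "${key}" to be set`);
    }
  }
  return { ...config, isAutoAssignTime: false, ...required };
}

// Placeholder values; the assignTime format shown here is an assumption.
const firstRun = prepareFirstRunConfig(
  { isAutoAssignTime: true, isPrintInfoToConsole: true },
  { assignTime: "2021-01-01 00:00:00", maxScrollTop: 5000, maxNewsNumber: 50 }
);
console.log(firstRun.isAutoAssignTime); // false
```

The helper copies the config rather than mutating it, so the values shipped in config.json stay untouched until the result is written back deliberately.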

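Configuration file

Basic settings
"rootUrl" Root URL of the crawl target (must not be changed)
"newsType" News types and their corresponding links
"isPrintInfoToConsole" Whether to print crawler information in real time
"isAutoAssignTime" Whether to enable automatic time matching
"isScrollAwait" Whether to enable the scroll delay
"isPageAwait" Whether to enable the page-close delay
"scrollAwaitTime" Page scroll delay (default unit: ms)
"pageAwaitTime" Page-close delay (default unit: ms)
"isSavaToFile" Whether to save the data to a file
"isSavaToDataBase" Whether to save the data to a MySQL database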

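One way to picture how isScrollAwait and scrollAwaitTime could interact with maxScrollTop (listed under the data settings) is a capped scroll loop that optionally pauses after each step. This is a rough sketch, not the package's actual code: the `scrollTo` callback, the `step` size, and the function name are all hypothetical.

```javascript
// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical sketch: `scrollTo` stands in for whatever scrolls the page,
// and `step` (pixels per scroll) is an invented parameter.
async function scrollWithDelay(scrollTo, { maxScrollTop, isScrollAwait, scrollAwaitTime, step = 500 }) {
  const positions = [];
  for (let top = step; top <= maxScrollTop; top += step) {
    await scrollTo(top);
    positions.push(top);
    // Pause between scrolls only when the delay is enabled.
    if (isScrollAwait) {
      await sleep(scrollAwaitTime);
    }
  }
  return positions;
}

// Usage with a no-op scroller.
scrollWithDelay(async () => {}, { maxScrollTop: 1500, isScrollAwait: true, scrollAwaitTime: 10 })
  .then((positions) => console.log(positions)); // [ 500, 1000, 1500 ]
```

A larger maxScrollTop means more loop iterations, which matches the note that a greater scroll distance yields more data under the same conditions.
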
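Data settings
"assignTime" News is filtered by this time
"maxScrollTop" Maximum scroll distance; under the same conditions, a larger scroll distance yields more data (default unit: px)
"maxNewsNumber" Maximum number of items to crawl per news type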

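The effect of assignTime and maxNewsNumber can be sketched as a filter followed by a per-type cap. The item shape (`title`, `type`, `publishTime`) and the rule "keep items published at or after assignTime" are assumptions made for this illustration; the README does not state the direction of the time filter.

```javascript
// Hypothetical sketch, not the package's implementation.
// Assumed item shape: { title, type, publishTime }.
function selectNews(items, { assignTime, maxNewsNumber }) {
  const cutoff = new Date(assignTime).getTime();
  const countByType = new Map();
  const selected = [];
  for (const item of items) {
    // Assumption: items older than assignTime are dropped.
    if (new Date(item.publishTime).getTime() < cutoff) continue;
    // Enforce the per-type cap.
    const count = countByType.get(item.type) || 0;
    if (count >= maxNewsNumber) continue;
    countByType.set(item.type, count + 1);
    selected.push(item);
  }
  return selected;
}

const sample = [
  { title: "a", type: "world", publishTime: "2021-01-02T08:00:00Z" },
  { title: "b", type: "world", publishTime: "2020-12-31T08:00:00Z" },
  { title: "c", type: "world", publishTime: "2021-01-03T08:00:00Z" },
  { title: "d", type: "tech", publishTime: "2021-01-02T09:00:00Z" },
];
const picked = selectNews(sample, { assignTime: "2021-01-01T00:00:00Z", maxNewsNumber: 1 });
console.log(picked.map((item) => item.title)); // [ 'a', 'd' ]
```
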
Database settings
"mysql_host" MySQL host address
"mysql_user" MySQL login user name
"mysql_password" MySQL login password
"mysql_database" Database to connect to
"mysql_port" Port number
"mysql_timezone" Time zone

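As an illustration of how these keys might map onto a MySQL client, the sketch below converts them into a connection options object. The option names (`host`, `user`, `password`, `database`, `port`, `timezone`) are the ones accepted by `createConnection` in the `mysql` package; whether this crawler uses that driver is an assumption, and `toConnectionOptions` plus all sample values are hypothetical.

```javascript
// Hypothetical helper, not part of huanqiu-news-crawler.
// Maps the mysql_* config keys onto driver-style connection options.
function toConnectionOptions(config) {
  return {
    host: config.mysql_host,
    user: config.mysql_user,
    password: config.mysql_password,
    database: config.mysql_database,
    port: Number(config.mysql_port), // tolerate "3306" as well as 3306
    timezone: config.mysql_timezone,
  };
}

// Placeholder values for illustration only.
const options = toConnectionOptions({
  mysql_host: "localhost",
  mysql_user: "crawler",
  mysql_password: "change-me",
  mysql_database: "news",
  mysql_port: "3306",
  mysql_timezone: "+08:00",
});
console.log(options.port); // 3306
```
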
Database table structure

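Improvements

Improved the crawler's asynchronous operations and control flow
Added an error-handling mechanism
Added automatic recognition of news timestamps, so that only the needed items are crawled
Fixed a bug where the crawl occasionally ended before all data had been fetched
Added an option to print live crawler information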