weibo-crawler

v0.1.3

Published

4 years ago

A weibo crawler

fytriht

Weibo Crawler / 微博爬虫

A simple weibo crawler

Features

Crawl all the weibos of a user.

Result

http://weibo.com/p/1005051736338681/home?from=page_100505_profile&wvr=6&mod=data#place

[
  {
    "scheme": "http://m.weibo.cn/status/EquqEyH0v?mblogid=EquqEyH0v&luicode=10000011&lfid=1076031736338681",
    "createdAt": "01-12 16:38",
    "id": "4063135008268475",
    "text": "想玩物丧志，请问玩什么物能比较轻松愉快无负担地丧志 ",
    "repostsCount": 3,
    "commentsCount": 47,
    "likesCount": 59
  },
  {
    "scheme": "http://m.weibo.cn/status/EpMbwy2dU?mblogid=EpMbwy2dU&luicode=10000011&lfid=1076031736338681",
    "createdAt": "01-08 00:00",
    "id": "4061434268111702",
    "text": "分享图片 ",
    "repostsCount": 5,
    "commentsCount": 7,
    "likesCount": 71,
    "pics": [
      "http://wx4.sinaimg.cn/orj360/677e6cf9ly1fbiie692r0j20ku09ywg7.jpg",
      "http://wx3.sinaimg.cn/orj360/677e6cf9ly1fbiie60h1nj20ku0a9ab9.jpg",
      "http://wx2.sinaimg.cn/orj360/677e6cf9ly1fbiie6hl5aj20ku04pdgb.jpg"
    ]
  },
  {
    "scheme": "http://m.weibo.cn/status/EpLXkAI8Q?mblogid=EpLXkAI8Q&luicode=10000011&lfid=1076031736338681",
    "createdAt": "01-07 23:25",
    "id": "4061425468749492",
    "text": "Repost",
    "repostsCount": 14,
    "commentsCount": 3,
    "likesCount": 29,
    "retweetedStatus": {
      "id": "4061419093386276",
      "text": "如果你讨厌文青，那这是一个好时代。我们有能者多劳的大大，有始终关怀引导年轻人文化思想的团团。文艺行业腊月十八，而你的孩子，再也不用看那些把文青们脑子弄乱的东西了。 ",
      "userName": "alitha",
      "userId": 3171360847
    }
  },
  {
    "...": "...."
  }
]
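
Since the result is a plain array of objects, it can be post-processed with ordinary JavaScript. The following is a minimal sketch of one way to summarize it. It assumes only the field names shown in the sample above (`likesCount`, `pics`, `retweetedStatus`); the `summarize` helper is hypothetical and not part of weibo-crawler, and the sample entries are abbreviated copies of the ones above.

```javascript
// Hypothetical helper: counts weibos, reposts, picture posts and total likes.
function summarize(weibos) {
  const summary = { total: 0, reposts: 0, withPics: 0, likes: 0 }
  for (const weibo of weibos) {
    summary.total += 1
    // A weibo is treated as a repost when it carries a retweetedStatus object.
    if (weibo.retweetedStatus) summary.reposts += 1
    // "pics" is only present on weibos that include pictures.
    if (Array.isArray(weibo.pics) && weibo.pics.length > 0) summary.withPics += 1
    summary.likes += weibo.likesCount || 0
  }
  return summary
}

// Abbreviated entries taken from the sample result above.
const sample = [
  { id: '4063135008268475', likesCount: 59 },
  {
    id: '4061434268111702',
    likesCount: 71,
    pics: ['http://wx4.sinaimg.cn/orj360/677e6cf9ly1fbiie692r0j20ku09ywg7.jpg']
  },
  {
    id: '4061425468749492',
    likesCount: 29,
    retweetedStatus: { id: '4061419093386276', userName: 'alitha' }
  }
]

console.log(summarize(sample))
```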

Quick Start

git clone https://github.com/fytriht/weibo-crawler.git
cd weibo-crawler
npm install
npm run test

Basic usage

Installing

npm i weibo-crawler

Get URL & Start crawling

Go to the home page of the user you want to crawl.
Click the "他/她的主页" ("His/Her homepage") button.
Copy the URL.

const weiboCrawler = require('weibo-crawler')

const url = 'http://weibo.com/p/1005051736338681/home?from=page_100505_profile&wvr=6&mod=data#place'

weiboCrawler(url)
  .then(data => {
    console.log(JSON.stringify(data, null, 2))

    /*
     * or you can, for example:
     *
     * const fs = require('fs')
     *
     * fs.writeFile('data.json', JSON.stringify(data, null, 2), (err) => {
     *   if (err) console.error(err)
     * })
     * 
     */
  })
  .catch(err => console.error('Something went wrong:', err))

You can also set the concurrency:

...
weiboCrawler(url, concurrency) // concurrency defaults to 5
...
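
To illustrate what a concurrency limit means in general, here is a minimal sketch of running asynchronous tasks with at most `limit` of them in flight at once. This is not the package's actual implementation; `mapWithConcurrency`, `fakeFetch`, and the page numbers are hypothetical names and values used only for illustration.

```javascript
// Hypothetical helper: runs worker(item) for every item,
// keeping at most `limit` calls in flight at any time.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length)
  let next = 0

  // Each runner repeatedly takes the next unprocessed index until none remain.
  async function runner() {
    while (next < items.length) {
      const index = next++
      results[index] = await worker(items[index], index)
    }
  }

  const runnerCount = Math.min(limit, items.length)
  await Promise.all(Array.from({ length: runnerCount }, runner))
  return results
}

// Stand-in for a network request; resolves after a short delay.
function fakeFetch(page) {
  return new Promise(resolve => setTimeout(() => resolve(`page ${page}`), 10))
}

mapWithConcurrency([1, 2, 3, 4, 5, 6, 7], 5, fakeFetch)
  .then(pages => console.log(pages.length, 'pages fetched'))
```

A lower limit sends fewer simultaneous requests, at the cost of a longer total run time.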

Limitation

Because of Sina Weibo's anti-crawling measures, the default concurrency is recommended. If the concurrency is too high, you might get an error similar to the following. If that happens, wait a few minutes and try again.

node index.js

undefined:1
<!DOCTYPE html>
^

SyntaxError: Unexpected token < in JSON at position 0
    at JSON.parse (<anonymous>)
    at superagent.get.end (/*****************************/node_modules/weibo-crawler/getApiUrls.js:21:32)
    at Request.callback (/*****************************/node_modules/superagent/lib/node/index.js:672:3)
    at Stream.<anonymous> (/*****************************/node_modules/superagent/lib/node/index.js:866:18)
    at emitNone (events.js:86:13)
    at Stream.emit (events.js:185:7)
    at Unzip.<anonymous> (/*****************************/node_modules/superagent/lib/node/unzip.js:53:12)
    at emitNone (events.js:91:20)
    at Unzip.emit (events.js:185:7)
    at endReadableNT (_stream_readable.js:974:12)
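
The `SyntaxError` above is what `JSON.parse` throws when it is given an HTML page instead of JSON. As a minimal sketch of how such a response could be detected before parsing, consider the guard below. The `parseApiResponse` function and its return shape are hypothetical and not part of weibo-crawler.

```javascript
// Hypothetical guard: reports an HTML or malformed body instead of throwing.
function parseApiResponse(body) {
  const trimmed = String(body).trim()

  // An HTML page (for example a block or login page) starts with "<".
  if (trimmed.startsWith('<')) {
    return { ok: false, reason: 'html', data: null }
  }

  try {
    return { ok: true, reason: null, data: JSON.parse(trimmed) }
  } catch (err) {
    // Any other body that is not valid JSON.
    return { ok: false, reason: 'invalid-json', data: null }
  }
}

const blocked = parseApiResponse('<!DOCTYPE html>')
if (!blocked.ok) {
  console.error(`Response was not JSON (${blocked.reason}); wait a few minutes and try again.`)
}
```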

If the user has a lot of weibos (3000+), you might get unexpected results. I am working on this issue.

TODO

[x] test cases
[ ] error handling
[x] unescaped text content

Support

If you have any problems or suggestions, please open an issue on the GitHub repository.
