
@raphaellcs/data-cleaner

v2.0.0

Published

Data cleaning tool - validation, grouping, pivot tables


@raphaellcs/data-cleaner


A data cleaning tool for quickly cleaning and transforming data files

🚀 Features

  • Remove empty rows: filter out empty records
  • Deduplicate: by a key field or the whole row
  • Trim whitespace: trim string fields
  • Case conversion: upper/lower/title
  • Column selection: keep only the specified columns
  • Filtering: condition-based row filtering
  • Sorting: sort by column
  • Format conversion: JSON ↔ CSV
  • Statistics: view a summary of the data
  • Validation: built-in and custom rules (new)
  • Grouping & aggregation: group by field or time interval (new)
  • Pivot tables: build pivot tables (new)

📦 Installation

npx @raphaellcs/data-cleaner

📖 Quick Start

1. View statistics

data-cleaner stats data.csv

Output:

📊 Data statistics

Type: array
Total: 1523

Fields:
  - name
  - email
  - age

Null values: 45
Empty strings: 23
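The null and empty-string counts in this summary amount to a single pass over every field of every row. A minimal sketch of that counting (illustrative only, not the tool's actual internals):

```javascript
// Count null/undefined values and empty strings across all fields of all rows.
function countMissing(rows) {
  let nulls = 0;
  let emptyStrings = 0;
  for (const row of rows) {
    for (const value of Object.values(row)) {
      if (value === null || value === undefined) nulls++;
      else if (value === '') emptyStrings++;
    }
  }
  return { nulls, emptyStrings };
}
```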

2. Remove empty rows and whitespace

data-cleaner clean data.csv cleaned.csv --remove-empty --trim

3. Deduplicate

data-cleaner clean data.csv cleaned.csv --deduplicate

Deduplicate by a specific field:

data-cleaner clean data.csv cleaned.csv --deduplicate --key email
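Key-based deduplication like this is typically "keep the first occurrence of each key value". A sketch of that behavior (hypothetical helper, not the tool's source):

```javascript
// Deduplicate rows, keeping the first occurrence of each key value.
// Without a key, deduplicate on the JSON serialization of the whole row.
function deduplicate(rows, key) {
  const seen = new Set();
  return rows.filter((row) => {
    const id = key ? String(row[key]) : JSON.stringify(row);
    if (seen.has(id)) return false;
    seen.add(id);
    return true;
  });
}
```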

4. Select columns

data-cleaner clean data.csv cleaned.csv --columns "name,email"

5. Filter

# age greater than 18
data-cleaner clean data.csv cleaned.csv -F "age:gt:18"

# email contains @gmail.com
data-cleaner clean data.csv cleaned.csv -F "email:contains:@gmail.com"

# equals a specific value
data-cleaner clean data.csv cleaned.csv -F "status:eq:active"

6. Sort

# ascending by age
data-cleaner clean data.csv cleaned.csv -S age

# descending by age
data-cleaner clean data.csv cleaned.csv -S age --order desc

7. Case conversion

# all uppercase
data-cleaner clean data.csv cleaned.csv --case upper

# all lowercase
data-cleaner clean data.csv cleaned.csv --case lower

# title case
data-cleaner clean data.csv cleaned.csv --case title
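A plausible implementation of the three case modes, applied to every string field of every row (a sketch, not the tool's actual code):

```javascript
// Apply a case transform ('upper' | 'lower' | 'title') to every string field.
function transformCase(rows, mode) {
  const title = (s) =>
    s.replace(/\w\S*/g, (w) => w[0].toUpperCase() + w.slice(1).toLowerCase());
  const fn = mode === 'upper' ? (s) => s.toUpperCase()
           : mode === 'lower' ? (s) => s.toLowerCase()
           : title;
  return rows.map((row) => {
    const out = {};
    for (const [k, v] of Object.entries(row)) {
      out[k] = typeof v === 'string' ? fn(v) : v; // non-strings pass through
    }
    return out;
  });
}
```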

8. Format conversion

# CSV to JSON
data-cleaner clean data.csv output.json -f json

# JSON to CSV
data-cleaner clean data.json output.csv -f csv
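Conceptually, the conversion maps a header row to object keys and back. A deliberately minimal sketch, assuming a header row and that no field contains commas, quotes, or newlines (real CSV needs a proper parser):

```javascript
// Minimal CSV -> array-of-objects conversion (no quoting support).
function csvToJson(text) {
  const [header, ...lines] = text.trim().split('\n');
  const fields = header.split(',');
  return lines.map((line) => {
    const values = line.split(',');
    return Object.fromEntries(fields.map((f, i) => [f, values[i]]));
  });
}

// Minimal array-of-objects -> CSV conversion, using the first row's keys.
function jsonToCsv(rows) {
  const fields = Object.keys(rows[0]);
  const lines = rows.map((row) => fields.map((f) => row[f]).join(','));
  return [fields.join(','), ...lines].join('\n');
}
```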

📋 Filter operators

| Operator | Meaning | Example |
|----------|---------|---------|
| eq | equals | status:eq:active |
| neq | not equal to | status:neq:deleted |
| gt | greater than | age:gt:18 |
| lt | less than | age:lt:65 |
| gte | greater than or equal to | age:gte:18 |
| lte | less than or equal to | age:lte:65 |
| contains | contains | email:contains:@gmail.com |
| startsWith | starts with | name:startsWith:A |
| endsWith | ends with | email:endsWith:.com |
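A `field:op:value` spec like those above can be parsed and applied with a small operator table. This sketch mirrors the operators listed here (the function and coercion choices are illustrative assumptions):

```javascript
// Parse a "field:op:value" filter spec and apply it to an array of rows.
// Numeric operators coerce with Number(); string operators compare as strings.
function applyFilter(rows, spec) {
  const [field, op, ...rest] = spec.split(':');
  const value = rest.join(':'); // allow ':' inside the value itself
  const ops = {
    eq:  (a, b) => String(a) === b,
    neq: (a, b) => String(a) !== b,
    gt:  (a, b) => Number(a) > Number(b),
    lt:  (a, b) => Number(a) < Number(b),
    gte: (a, b) => Number(a) >= Number(b),
    lte: (a, b) => Number(a) <= Number(b),
    contains:   (a, b) => String(a).includes(b),
    startsWith: (a, b) => String(a).startsWith(b),
    endsWith:   (a, b) => String(a).endsWith(b),
  };
  return rows.filter((row) => ops[op](row[field], value));
}
```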

🎯 Use cases

1. Clean user data

data-cleaner clean users.csv users_cleaned.csv \
  --remove-empty \
  --deduplicate --key email \
  --trim \
  -F "status:eq:active"

Removes empty rows, deduplicates by email, trims whitespace, and keeps only active users.

2. Extract specific columns

data-cleaner clean products.csv products_simple.csv \
  --columns "id,name,price"

Keeps only the product ID, name, and price.

3. Format conversion

data-cleaner clean data.json data.csv -f csv
data-cleaner clean data.csv data.json -f json

Converts between JSON and CSV.

4. Sort and limit

data-cleaner clean products.csv top10.csv \
  -S price --order desc \
  --limit 10

Sorts by price in descending order and keeps only the top 10.

5. Normalize data

data-cleaner clean emails.csv emails_cleaned.csv \
  --trim \
  --case lower

Trims whitespace and converts to lowercase.

💡 Combining options

Multiple options can be combined:

data-cleaner clean data.csv cleaned.csv \
  --remove-empty \
  --deduplicate --key id \
  --trim \
  --case lower \
  -F "status:eq:active" \
  -S created_at --order desc \
  --limit 1000

This will:

  1. Remove empty rows
  2. Deduplicate by ID
  3. Trim whitespace
  4. Convert to lowercase
  5. Keep only records whose status is active
  6. Sort by creation time in descending order
  7. Keep only the first 1000 rows
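Internally, a combined command like this amounts to applying each transform to the row array in sequence. A sketch of that pipeline idea, with a few hypothetical step functions standing in for the flags:

```javascript
// Compose row-array transforms; each step takes and returns an array of rows.
const pipeline = (...steps) => (rows) =>
  steps.reduce((acc, step) => step(acc), rows);

// Hypothetical steps corresponding to --remove-empty, -F "status:eq:active", --limit.
const removeEmpty = (rows) =>
  rows.filter((r) => Object.values(r).some((v) => v !== null && v !== ''));
const keepActive = (rows) => rows.filter((r) => r.status === 'active');
const limit = (n) => (rows) => rows.slice(0, n);

const clean = pipeline(removeEmpty, keepActive, limit(1000));
```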

📊 Statistics

Use --stats to compare the data before and after cleaning:

data-cleaner clean data.csv cleaned.csv --stats

Output:

🔧 Cleaning data

Input: data.csv
Output: cleaned.csv

Original data:
📊 Data statistics

Type: array
Total: 1523

Fields:
  - id
  - name
  - email
  - age
  - status

Null values: 45
Empty strings: 23

Cleaned data:
📊 Data statistics

Type: array
Total: 1456

Fields:
  - id
  - name
  - email
  - age
  - status

✅ Saved to: cleaned.csv
   Reduced from 1523 rows to 1456

🔧 Advanced usage

1. Convert to uppercase and remove empty rows

data-cleaner clean data.csv cleaned.csv \
  --remove-empty \
  --trim \
  --case upper

2. Multi-step cleaning

Commands can be chained to clean the data step by step:

# Step 1: deduplicate
data-cleaner clean data.csv step1.csv --deduplicate --key id

# Step 2: filter
data-cleaner clean step1.csv step2.csv -F "age:gte:18"

# Step 3: sort
data-cleaner clean step2.csv final.csv -S created_at --order desc

3. Batch processing

Use a shell script to process files in bulk:

#!/bin/bash

mkdir -p cleaned
for file in data/*.csv; do
    output="cleaned/$(basename "$file")"
    data-cleaner clean "$file" "$output" --remove-empty --trim
done

🚧 Planned

  • [ ] Support more file formats (Excel, SQL)
  • [ ] Merge multiple files

✨ New in v2.0.0

Validation

Validate data against a set of rules:

data-cleaner validate data.csv --config rules.json

Create a rule configuration file, rules.json:

{
  "email": ["required", "email"],
  "age": [
    "required",
    {"name": "number", "message": "Age must be a number"},
    {"name": "min", "value": 0, "message": "Age cannot be negative"},
    {"name": "max", "value": 120, "message": "Age cannot exceed 120"}
  ],
  "phone": [
    {"name": "pattern", "value": "^\\d{11}$", "message": "Phone number must be 11 digits"}
  ],
  "status": [
    {"name": "enum", "value": ["active", "inactive", "pending"], "message": "Invalid status value"}
  ]
}

Built-in validation rules:

  • required - required field
  • email - email format
  • url - URL format
  • number - a number
  • integer - an integer
  • positive - a positive number
  • negative - a negative number
  • min:<value> - minimum value
  • max:<value> - maximum value
  • minLength:<length> - minimum length
  • maxLength:<length> - maximum length
  • pattern:<regex> - regular-expression match
  • enum:[values] - one of a set of values
  • date - a date
  • future - a future date
  • past - a past date
  • phone - a phone number

Output an error report:

data-cleaner validate data.csv --config rules.json --output errors.csv --format csv
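A validator driven by a config in the rules.json shape shown above can be sketched as a lookup table of checks. Only a few of the built-in rules are shown, and the check implementations are illustrative assumptions, not the tool's actual logic:

```javascript
// Validate rows against a config like rules.json above.
// Each rule is either a string name or an object { name, value, message }.
function validate(rows, config) {
  const checks = {
    required: (v) => v !== null && v !== undefined && v !== '',
    email: (v) => /^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(String(v)),
    number: (v) => !Number.isNaN(Number(v)),
    min: (v, limit) => Number(v) >= limit,
    max: (v, limit) => Number(v) <= limit,
    pattern: (v, re) => new RegExp(re).test(String(v)),
    enum: (v, values) => values.includes(v),
  };
  const errors = [];
  rows.forEach((row, index) => {
    for (const [field, rules] of Object.entries(config)) {
      for (const rule of rules) {
        const { name, value, message } =
          typeof rule === 'string' ? { name: rule } : rule;
        if (!checks[name](row[field], value)) {
          errors.push({ row: index, field, error: message || name });
        }
      }
    }
  });
  return errors;
}
```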

Grouping & aggregation

Group by a field and aggregate:

# group by department and compute the average salary
data-cleaner group employees.csv --group-by department --aggregate "salary:avg" --stats

# group by multiple fields
data-cleaner group sales.csv --group-by "region,category" --aggregate "revenue:sum,count" --output grouped.json

Time-based grouping:

# group by day
data-cleaner group orders.csv --time-field created_at --interval day --aggregate "amount:sum" --stats

# group by month
data-cleaner group orders.csv --time-field created_at --interval month --aggregate "amount:sum,count" --stats

# group by hour
data-cleaner group logs.csv --time-field timestamp --interval hour --aggregate "errors:sum" --stats

Aggregation types:

  • sum - sum
  • avg - average
  • min - minimum
  • max - maximum
  • count - count
  • count_distinct - distinct count
  • first - first value
  • last - last value
  • concat - concatenate values
  • array - collect values into an array
  • percentile:XX - percentile (e.g. percentile:95)
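Group-by with a `"field:agg"` spec boils down to bucketing rows by key and folding each bucket. A sketch supporting a handful of the aggregation types above (illustrative, not the tool's implementation):

```javascript
// Group rows by `field` and aggregate the numeric `valueField` in each group.
function groupBy(rows, field, valueField, agg) {
  const groups = new Map();
  for (const row of rows) {
    const key = row[field];
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(Number(row[valueField]));
  }
  const aggs = {
    sum: (vs) => vs.reduce((a, b) => a + b, 0),
    avg: (vs) => vs.reduce((a, b) => a + b, 0) / vs.length,
    min: (vs) => Math.min(...vs),
    max: (vs) => Math.max(...vs),
    count: (vs) => vs.length,
  };
  const result = {};
  for (const [key, vs] of groups) result[key] = aggs[agg](vs);
  return result;
}
```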

Pivot tables

Create a pivot table:

data-cleaner pivot sales.csv \
  --rows region \
  --columns product \
  --values revenue \
  --agg sum

Sample output:

    productA    productB    productC
region1      15000.00    23000.00    18000.00
region2      12000.00    25000.00    21000.00
region3      18000.00    20000.00    22000.00
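A pivot with a sum aggregation is a two-level accumulation: one bucket per row key, one cell per column key. A minimal sketch producing a nested object rather than the formatted grid above (names are illustrative):

```javascript
// Build a pivot table summing `valueField`, keyed by rowField then colField.
// Returns { rowKey: { colKey: total } }.
function pivot(data, rowField, colField, valueField) {
  const table = {};
  for (const rec of data) {
    const r = rec[rowField];
    const c = rec[colField];
    table[r] = table[r] || {};
    table[r][c] = (table[r][c] || 0) + Number(rec[valueField]);
  }
  return table;
}
```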

Save the pivot table:

data-cleaner pivot sales.csv \
  --rows region \
  --columns product \
  --values revenue \
  --agg sum \
  --output pivot.json

🤝 Contributing

Issues and PRs are welcome!

📄 License

MIT © 梦心


Made with 🌙 by 梦心