# lawyers
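Standalone scraping project for `common_sites`.
## Directory layout
- `common_sites/`: five scraper scripts, one each for 大律师, 找法网, 法律快车, 律图, and 华律
- `one_off_sites/`: scrapers for one-off or temporary sites (not included in the batch start of the common sites)
- `request/proxy_config.py`: proxy configuration loading logic
- `request/proxy_settings.json`: proxy configuration file
- `Db.py`: database connection and basic operations
- `config.py`: database and request header configuration
## Running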
```bash
cd /www/wwwroot/lawyers
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
./common_sites/start.sh
```
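## Area sync service (Python)
New service script: `services/area_sync_service.py`
Purpose:
- Replaces the core endpoints of the former `nas.nepiedg.site:9002`
- `GET /api/layer/get_area`: reads the area list from the `area_new` database table and returns it to `js/douyin.js`
- `POST /api/layer/index`: receives search data posted back by the script, saves the raw JSON locally first, then decides from the request parameters whether to insert it into the database
- `GET/POST /api/layer/progress`: scraping checkpoint shared across devices (the `layer_progress` table is created automatically)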
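The "save the raw JSON first, then decide whether to insert" flow of `POST /api/layer/index` can be pictured with the minimal sketch below. It is not the code in `services/area_sync_service.py`: the function names `raw_file_for`, `save_raw`, and `should_insert` are hypothetical, the file name follows the `douyin_index_YYYYMMDD.jsonl` pattern shown in the API examples, and the rule that the `save_only` query parameter overrides the `DOUYIN_SAVE_ONLY` default is an assumption.

```python
import json
import time
from pathlib import Path
from typing import Mapping, Optional


def raw_file_for(raw_dir: str, ts: float) -> Path:
    """Return the per-day JSONL path, e.g. douyin_index_20260309.jsonl."""
    day = time.strftime("%Y%m%d", time.localtime(ts))
    return Path(raw_dir) / f"douyin_index_{day}.jsonl"


def save_raw(payload: dict, raw_dir: str, ts: Optional[float] = None) -> Path:
    """Append one payload as a single JSON line to the file for that day."""
    ts = time.time() if ts is None else ts
    path = raw_file_for(raw_dir, ts)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(payload, ensure_ascii=False) + "\n")
    return path


def should_insert(query: Mapping[str, str], env_default: str = "1") -> bool:
    """Assumption: save_only=1 keeps raw data only; the query overrides the default."""
    return query.get("save_only", env_default) != "1"
```

Appending one JSON object per line keeps each day's file readable line by line without loading the whole file into memory.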
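Current database insertion rules for `/api/layer/index` (based on `payload.data.user_list[].user_info`):
- Phone numbers are extracted primarily from `signature` (the profile bio) with a regular expression
- If the bio yields no match, the phone number is extracted from WeChat-related markers (`微信/wx/vx/v`) and from `unique_id/versatile_display`
- A record must match a keyword (default: `律师,律所`) to be inserted; adjust the list via `DOUYIN_LAWYER_KEYWORDS`
- `url` is always written as `https://www.douyin.com/user/{sec_uid}` (records with an empty `sec_uid` are skipped and not inserted)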
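The rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the service's actual code: the phone pattern (11 digits starting with `1[3-9]`), the `nickname` field used for keyword matching, the choice to skip records without a phone number, and the function names are all hypothetical. The sketch also simplifies the fallback by scanning `unique_id` and `versatile_display` with the same pattern instead of modeling the WeChat markers separately.

```python
import re
from typing import Optional, Sequence

# Assumption: mainland China mobile numbers, not embedded in a longer digit run.
PHONE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")


def extract_phone(user_info: dict) -> Optional[str]:
    """Try the bio first, then fall back to the account ID fields."""
    for field in ("signature", "unique_id", "versatile_display"):
        match = PHONE_RE.search(user_info.get(field) or "")
        if match:
            return match.group(0)
    return None


def matches_keywords(user_info: dict, keywords: Sequence[str]) -> bool:
    """Assumption: keywords are checked against nickname plus bio."""
    text = f"{user_info.get('nickname') or ''} {user_info.get('signature') or ''}"
    return any(keyword in text for keyword in keywords)


def build_record(user_info: dict, keywords: Sequence[str] = ("律师", "律所")) -> Optional[dict]:
    """Return the fields to insert, or None when the record should be skipped."""
    sec_uid = user_info.get("sec_uid") or ""
    if not sec_uid:
        return None  # empty sec_uid: skipped, as the rules state
    if not matches_keywords(user_info, keywords):
        return None
    phone = extract_phone(user_info)
    if phone is None:
        return None  # assumption: records without a phone number are skipped
    return {"phone": phone, "url": f"https://www.douyin.com/user/{sec_uid}"}
```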
Start the service:
```bash
cd /www/wwwroot/lawyers
./.venv/bin/python ./services/area_sync_service.py
```
Common environment variables:
```bash
AREA_SERVICE_HOST=0.0.0.0
AREA_SERVICE_PORT=9002
AREA_TARGET_TABLE=area_new
AREA_DOMAIN=maxlaw
DOUYIN_DOMAIN=抖音
DOUYIN_RAW_DIR=/www/wwwroot/lawyers/data/douyin_raw
DOUYIN_SAVE_ONLY=1
DOUYIN_LAWYER_KEYWORDS=律师,律所
LAYER_PROGRESS_TABLE=layer_progress
LAYER_PROGRESS_DEFAULT_KEY=douyin_batch_default
```
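As a rough sketch of how a service might read these variables, the example below loads a subset of them into a typed config object. The `ServiceConfig` class and `load_config` function are hypothetical, and treating the values listed above as defaults is an assumption (only the keyword default `律师,律所` is stated in this README).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class ServiceConfig:
    host: str
    port: int
    target_table: str
    raw_dir: str
    save_only: bool
    keywords: Tuple[str, ...]


def load_config(env: Mapping[str, str] = os.environ) -> ServiceConfig:
    """Build the config from an environment-like mapping (assumed defaults)."""
    raw_keywords = env.get("DOUYIN_LAWYER_KEYWORDS", "律师,律所")
    # Split the comma-separated list and drop empty entries.
    keywords = tuple(k.strip() for k in raw_keywords.split(",") if k.strip())
    return ServiceConfig(
        host=env.get("AREA_SERVICE_HOST", "0.0.0.0"),
        port=int(env.get("AREA_SERVICE_PORT", "9002")),
        target_table=env.get("AREA_TARGET_TABLE", "area_new"),
        raw_dir=env.get("DOUYIN_RAW_DIR", "/www/wwwroot/lawyers/data/douyin_raw"),
        save_only=env.get("DOUYIN_SAVE_ONLY", "1") == "1",
        keywords=keywords,
    )
```

Passing a plain dict instead of `os.environ` makes the parsing easy to check without touching the real environment.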
API examples:
```bash
# Health check
curl 'http://127.0.0.1:9002/health'
# Read areas from the database (returns a bare array by default, compatible with js/douyin.js)
curl 'http://127.0.0.1:9002/api/layer/get_area?server=1'
# If statistics are needed as well
curl 'http://127.0.0.1:9002/api/layer/get_area?table=area_new&domain=maxlaw&meta=1'
# Receive results posted back by douyin.js and insert them into the database (writes lawyer.domain=抖音 by default)
curl -X POST 'http://127.0.0.1:9002/api/layer/index?server=1&save_only=0' \
-H 'Content-Type: application/json' \
-d '{"source":"xhr","url":"https://www.douyin.com/aweme/v1/web/discover/search/","ts":1772811111,"cityIndex":0,"data":{"desc":"联系方式 13812345678"}}'
# Optional: specify the domain to write (for testing)
curl -X POST 'http://127.0.0.1:9002/api/layer/index?save_domain=codex_test_douyin' \
-H 'Content-Type: application/json' \
-d '{"source":"xhr","url":"https://www.douyin.com/aweme/v1/web/discover/search/","ts":1772811111,"cityIndex":0,"data":{"desc":"联系方式 13812345678"}}'
# Save the raw payload only (no database insert)
curl -X POST 'http://127.0.0.1:9002/api/layer/index?save_only=1' \
-H 'Content-Type: application/json' \
-d '{"source":"xhr","url":"https://www.douyin.com/aweme/v1/web/discover/search/","ts":1772811111,"cityIndex":0,"data":{"desc":"联系方式 13812345678"}}'
# Raw data directory on disk (one file per day)
# /www/wwwroot/lawyers/data/douyin_raw/douyin_index_YYYYMMDD.jsonl
# Read the shared checkpoint (multi-device)
curl 'http://127.0.0.1:9002/api/layer/progress?server=1&progress_key=douyin_batch_default'
# Update the shared checkpoint
curl -X POST 'http://127.0.0.1:9002/api/layer/progress?server=1' \
-H 'Content-Type: application/json' \
-d '{"progress_key":"douyin_batch_default","device_id":"device-a","next_city_index":120,"area_signature":"xxxx","area_total":551,"current_city":"北京","reason":"city_done","status":"running"}'
# Clear the shared checkpoint
curl -X POST 'http://127.0.0.1:9002/api/layer/progress?server=1' \
-H 'Content-Type: application/json' \
-d '{"action":"clear","progress_key":"douyin_batch_default"}'
```
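For callers that prefer Python over `curl`, the checkpoint update can be built with the standard library as sketched below. The helper name `build_progress_update` is hypothetical; the endpoint, query string, and body fields are copied from the `curl` example for updating the shared checkpoint, and the sketch only constructs the request without sending it.

```python
import json
import urllib.parse
import urllib.request


def build_progress_update(base_url: str, progress_key: str, device_id: str,
                          next_city_index: int, current_city: str,
                          status: str = "running") -> urllib.request.Request:
    """Build (but do not send) a POST to /api/layer/progress."""
    query = urllib.parse.urlencode({"server": 1})
    body = {
        "progress_key": progress_key,
        "device_id": device_id,
        "next_city_index": next_city_index,
        "current_city": current_city,
        "reason": "city_done",
        "status": status,
    }
    return urllib.request.Request(
        f"{base_url}/api/layer/progress?{query}",
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_progress_update("http://127.0.0.1:9002", "douyin_batch_default",
                            "device-a", 120, "北京")
# To send it against a running service:
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(resp.read().decode("utf-8"))
```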
If port 9002 is still held by an old process, run the following first:
```bash
lsof -iTCP:9002 -sTCP:LISTEN -t
kill <PID>
```
## Startup parameters
`start.sh` launches the 5 site scrapers in parallel by default (大律师 uses `dls_fresh.py`).
- Log directory: `/www/wwwroot/lawyers/logs`
- 大律师 JSON output: `/www/wwwroot/lawyers/data/dls_records.jsonl`
Common environment variables:
```bash
# Run sequentially (default: parallel)
RUN_MODE=sequential ./common_sites/start.sh
# Limit the scraping scope for 大律师
DLS_CITY_FILTER=beijing DLS_MAX_CITIES=1 DLS_MAX_PAGES=1 ./common_sites/start.sh
# 大律师 direct connection (no proxy) / export JSON only, no database writes
DLS_DIRECT=1 DLS_NO_DB=1 ./common_sites/start.sh
```
## Export to Excel
New export script: `common_sites/export_lawyers_excel.py`
```bash
# No arguments: exports the last 7 days of data by default (phone number / name / law firm / province / city / site name)
# and parses the extended params info by default (email / address / license number / years of practice / practice areas, etc.)
./.venv/bin/python ./common_sites/export_lawyers_excel.py
# Export by create_time timestamp range
./.venv/bin/python ./common_sites/export_lawyers_excel.py \
--start-ts 1772380000 --end-ts 1772429999 \
--output ./data/lawyers_20260302.xlsx
# Export a single site only, including technical fields (url / domain / time, etc.)
./.venv/bin/python ./common_sites/export_lawyers_excel.py \
--domain 大律师 --include-extra
# If parsing the extended params info is not needed
./.venv/bin/python ./common_sites/export_lawyers_excel.py --no-parse-params
# Export the scraped Douyin data (domain=抖音), with extra fields such as sec_uid / Douyin ID / bio / API source
./.venv/bin/python ./common_sites/export_lawyers_excel.py \
--douyin-only --start-ts 0 --output ./data/douyin_lawyers_export.xlsx
```
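The `--start-ts` and `--end-ts` values are Unix timestamps in seconds. A small helper like the one below can compute them for a calendar day. The function name `day_range_ts` is hypothetical, the UTC+08:00 offset is an assumption about the server's time zone, and whether the script treats `--end-ts` as inclusive is not stated in this README.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# Assumption: timestamps are interpreted in UTC+08:00.
CST = timezone(timedelta(hours=8))


def day_range_ts(day: str, tz: timezone = CST) -> Tuple[int, int]:
    """Return (first second, last second) of a YYYY-MM-DD day as Unix timestamps."""
    start = datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=tz)
    end = start + timedelta(days=1) - timedelta(seconds=1)
    return int(start.timestamp()), int(end.timestamp())


start_ts, end_ts = day_range_ts("2026-03-02")
print(f"--start-ts {start_ts} --end-ts {end_ts}")
# prints "--start-ts 1772380800 --end-ts 1772467199"
```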
## One-off site (众法利)
Script: `one_off_sites/zhongfali_single.py`
```bash
# Scrape and write JSON only (output goes to data/one_off_sites/ by default)
./.venv/bin/python ./one_off_sites/zhongfali_single.py --direct --no-db
# Scrape and write to the lawyer table (domain=众法利单页)
./.venv/bin/python ./one_off_sites/zhongfali_single.py --direct
```