# lawyers

`common_sites` standalone scraping project.

## Directory layout

- `common_sites/`: five scraper scripts, one each for 大律师, 找法网, 法律快车, 律图, and 华律
- `one_off_sites/`: scraper scripts for one-off or temporary sites (not included in the batch launch of the regular sites)
- `request/proxy_config.py`: proxy configuration loading logic
- `request/proxy_settings.json`: proxy configuration file
- `Db.py`: database connection and basic operations
- `config.py`: database and request header configuration

## Running

```bash
cd /www/wwwroot/lawyers
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
./common_sites/start.sh
```

## Area sync service (Python)

New service script: `services/area_sync_service.py`

Purpose:

- Replaces the core endpoints of the former `nas.nepiedg.site:9002` service
- `GET /api/layer/get_area`: reads the area list from the `area_new` database table and returns it to `js/douyin.js`
- `POST /api/layer/index`: receives the search data posted back by the script, first saves the raw JSON locally, then decides from the request parameters whether to write it to the database
- `GET/POST /api/layer/progress`: scraping checkpoint shared across multiple devices (the `layer_progress` table is created automatically)

Current ingestion rules for `/api/layer/index` (based on `payload.data.user_list[].user_info`):

- Phone numbers are extracted primarily from `signature` (the profile bio) with a regular expression
- If the bio yields no match, phone numbers are extracted from WeChat-related markers (`微信/wx/vx/v`) and from `unique_id/versatile_display`
- A record is written only if it matches a keyword (default: `律师,律所`, i.e. "lawyer, law firm"); adjust the list with `DOUYIN_LAWYER_KEYWORDS`
- `url` is always written as `https://www.douyin.com/user/{sec_uid}` (records with an empty `sec_uid` are skipped and not written)

Start the service:

```bash
cd /www/wwwroot/lawyers
./.venv/bin/python ./services/area_sync_service.py
```

Common environment variables:

```bash
AREA_SERVICE_HOST=0.0.0.0
AREA_SERVICE_PORT=9002
AREA_TARGET_TABLE=area_new
AREA_DOMAIN=maxlaw
DOUYIN_DOMAIN=抖音
DOUYIN_RAW_DIR=/www/wwwroot/lawyers/data/douyin_raw
DOUYIN_SAVE_ONLY=1
DOUYIN_LAWYER_KEYWORDS=律师,律所
LAYER_PROGRESS_TABLE=layer_progress
LAYER_PROGRESS_DEFAULT_KEY=douyin_batch_default
```

API examples:

```bash
# Health check
curl 'http://127.0.0.1:9002/health'

# Read the areas stored in the database (returns a bare array by default, for compatibility with js/douyin.js)
curl 'http://127.0.0.1:9002/api/layer/get_area?server=1'

# Include statistics in the response
curl 'http://127.0.0.1:9002/api/layer/get_area?table=area_new&domain=maxlaw&meta=1'

# Receive results posted back by douyin.js and write them to the database (writes lawyer.domain=抖音 by default)
curl -X POST 'http://127.0.0.1:9002/api/layer/index?server=1&save_only=0' \
  -H 'Content-Type: application/json' \
  -d '{"source":"xhr","url":"https://www.douyin.com/aweme/v1/web/discover/search/","ts":1772811111,"cityIndex":0,"data":{"desc":"联系方式 13812345678"}}'

# Optional: specify the domain value to write (for testing)
curl -X POST 'http://127.0.0.1:9002/api/layer/index?save_domain=codex_test_douyin' \
  -H 'Content-Type: application/json' \
  -d '{"source":"xhr","url":"https://www.douyin.com/aweme/v1/web/discover/search/","ts":1772811111,"cityIndex":0,"data":{"desc":"联系方式 13812345678"}}'

# Save the raw payload only (no database write)
curl -X POST 'http://127.0.0.1:9002/api/layer/index?save_only=1' \
  -H 'Content-Type: application/json' \
  -d '{"source":"xhr","url":"https://www.douyin.com/aweme/v1/web/discover/search/","ts":1772811111,"cityIndex":0,"data":{"desc":"联系方式 13812345678"}}'

# Raw data directory on disk (one file per day)
# /www/wwwroot/lawyers/data/douyin_raw/douyin_index_YYYYMMDD.jsonl

# Read the shared checkpoint (multiple devices)
curl 'http://127.0.0.1:9002/api/layer/progress?server=1&progress_key=douyin_batch_default'

# Update the shared checkpoint
curl -X POST 'http://127.0.0.1:9002/api/layer/progress?server=1' \
  -H 'Content-Type: application/json' \
  -d '{"progress_key":"douyin_batch_default","device_id":"device-a","next_city_index":120,"area_signature":"xxxx","area_total":551,"current_city":"北京","reason":"city_done","status":"running"}'

# Clear the shared checkpoint
curl -X POST 'http://127.0.0.1:9002/api/layer/progress?server=1' \
  -H 'Content-Type: application/json' \
  -d '{"action":"clear","progress_key":"douyin_batch_default"}'
```

If an old process is still listening on port 9002, run the following first:

```bash
# Print the PID of the process listening on port 9002
lsof -iTCP:9002 -sTCP:LISTEN -t
# Stop it, substituting the PID printed above
kill <PID>
```

## Launch parameters

By default, `start.sh` launches the five site scrapers in parallel (大律师 uses `dls_fresh.py`).

- Log directory: `/www/wwwroot/lawyers/logs`
- 大律师 JSON output: `/www/wwwroot/lawyers/data/dls_records.jsonl`

Common environment variables:

```bash
# Run sequentially (default: parallel)
RUN_MODE=sequential ./common_sites/start.sh

# Limit the scraping scope for 大律师
DLS_CITY_FILTER=beijing DLS_MAX_CITIES=1 DLS_MAX_PAGES=1 ./common_sites/start.sh

# 大律师 direct connection (no proxy) / export JSON only, no database write
DLS_DIRECT=1 DLS_NO_DB=1 ./common_sites/start.sh
```

## Exporting to Excel

New export script: `common_sites/export_lawyers_excel.py`

```bash
# No arguments: exports the last 7 days of data (phone, name, law firm, province, city/district, site name)
# and parses the extended info in params by default (email, address, license number, years of practice, practice areas, etc.)
./.venv/bin/python ./common_sites/export_lawyers_excel.py

# Export by create_time timestamp range
./.venv/bin/python ./common_sites/export_lawyers_excel.py \
  --start-ts 1772380000 --end-ts 1772429999 \
  --output ./data/lawyers_20260302.xlsx

# Export a single site only, including technical fields (url, domain, timestamps, etc.)
./.venv/bin/python ./common_sites/export_lawyers_excel.py \
  --domain 大律师 --include-extra

# Skip parsing the extended info in params
./.venv/bin/python ./common_sites/export_lawyers_excel.py --no-parse-params
```

## One-off site (众法利)

Script: `one_off_sites/zhongfali_single.py`

```bash
# Scrape and write JSON only (output goes to data/one_off_sites/ by default)
./.venv/bin/python ./one_off_sites/zhongfali_single.py --direct --no-db

# Scrape and write to the lawyer table (domain=众法利单页)
./.venv/bin/python ./one_off_sites/zhongfali_single.py --direct
```
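
## Sketch: `/api/layer/index` ingestion rules

The ingestion rules listed for `/api/layer/index` in the area sync service section can be pictured with the minimal Python sketch below. It is not the code in `services/area_sync_service.py`. The function names (`parse_keywords`, `extract_phone`, `build_record`), the regular expressions, the use of a `nickname` field for keyword matching, the returned dict shape, and the choice to skip records without a phone number are all assumptions made for illustration; only the field names `signature`, `unique_id`, `versatile_display`, `sec_uid`, the default keywords, and the URL pattern come from the rules themselves.

```python
import os
import re
from typing import List, Optional

# Assumption: 11-digit mainland China mobile numbers starting with 1[3-9].
PHONE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")
# Assumption: a WeChat marker followed by digits that may contain spaces or hyphens.
WECHAT_RE = re.compile(r"(?:微信|wx|vx|v)\s*[::]?\s*([\d\s-]{11,20})", re.IGNORECASE)


def parse_keywords(raw: Optional[str] = None) -> List[str]:
    """Split the comma-separated keyword list; falls back to DOUYIN_LAWYER_KEYWORDS."""
    if raw is None:
        raw = os.environ.get("DOUYIN_LAWYER_KEYWORDS", "律师,律所")
    return [k.strip() for k in raw.split(",") if k.strip()]


def extract_phone(user_info: dict) -> Optional[str]:
    """Try the bio first, then WeChat markers, then unique_id/versatile_display."""
    signature = user_info.get("signature") or ""
    match = PHONE_RE.search(signature)
    if match:
        return match.group(0)
    # Fallback 1: digits written after a WeChat marker, e.g. "vx:138-1234-5678".
    for marker in WECHAT_RE.finditer(signature):
        digits = re.sub(r"\D", "", marker.group(1))
        if PHONE_RE.fullmatch(digits):
            return digits
    # Fallback 2: account identifiers that are themselves phone numbers.
    for field in ("unique_id", "versatile_display"):
        match = PHONE_RE.search(str(user_info.get(field) or ""))
        if match:
            return match.group(0)
    return None


def build_record(user_info: dict, keywords: List[str]) -> Optional[dict]:
    """Return a record to store, or None if the user should be skipped."""
    sec_uid = (user_info.get("sec_uid") or "").strip()
    if not sec_uid:
        return None  # an empty sec_uid is skipped
    # Assumption: keywords are matched against nickname plus bio.
    text = f"{user_info.get('nickname') or ''} {user_info.get('signature') or ''}"
    if not any(keyword in text for keyword in keywords):
        return None
    phone = extract_phone(user_info)
    if phone is None:
        return None  # assumption: records without a phone number are not stored
    return {"phone": phone, "url": f"https://www.douyin.com/user/{sec_uid}"}
```

A call such as `build_record(user_info, parse_keywords())` would be made once per entry of `payload.data.user_list[].user_info`. Keeping extraction and keyword gating in separate functions makes each rule testable on its own.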