# iCoLoc

中文 | [English](#english)

iCoLoc 是一个基于本地大语言模型（LLM）的交互式空间共现模式挖掘系统，支持自然语言查询理解、模式挖掘、反馈学习与多轮迭代优化。

## 项目亮点

- 自然语言到挖掘参数的自动转换
- Co-location 模式挖掘与关联规则生成
- Stage0：意图理解与冷启动偏好初始化
- Stage3：基于用户反馈的偏好学习与重排序
- Stage4：多轮迭代式交互优化
- 同时支持 CLI 与 Web 界面

## 技术栈

- Python 3.9+
- FastAPI + Uvicorn（Web 服务）
- PyTorch / Transformers（主 LLM 推理）
- Sentence Transformers、NumPy（Stage0 / Stage3 句向量；与主 LLM 独立，默认自 Hugging Face 拉取 MiniLM 类模型）

## 项目结构

下列为**纳入版本管理**的顶层布局（本地数据、缓存、日志、未跟踪脚本等通常写在 `.gitignore` 中，故不单独列在目录树里）：

```text
iCoLoc/
├── main.py
├── run_experiment.py
├── run_plotter.py
├── README.md
├── LICENSE
├── WEB_README.md
├── requirements.txt
├── config/
├── data/
├── document/
├── results/
├── memory/
├── models/
├── logs/
└── src/
    ├── llm/
    ├── preference/
    ├── core/
    ├── controller/
    ├── learning/
    ├── embedding/
    ├── memory/
    ├── web/
    ├── experiment/
    └── download/
```

若你当前仅在 `document/` 下维护 Web 说明，可复制或软链接为仓库根目录的 `WEB_README.md`，与上表一致。

## 快速开始

### 1) 安装依赖

```bash
cd /path/to/iCoLoc
pip install -r requirements.txt
```

### 2) 配置模型与数据

编辑 `config/config.yaml`：

```yaml
model:
  model_name_or_path: "/path/to/your/model"
  adapter_name_or_path: null
  template: "qwen"

data:
  data_path: "./data/beijing_poi.json"
```

#### 主 LLM 权重：用 Ollama 下载与 `config.yaml` 的关系

iCoLoc 通过 `transformers`（可选 LlamaFactory）从**本地目录**加载主模型，该目录必须是 **Hugging Face 布局**（含 `config.json`、分词器相关文件等）。`model_name_or_path` 应指向这一目录。

若你使用 **[Ollama](https://ollama.com/)** 在本机拉取模型，可先安装 Ollama，再按 [Ollama Library](https://ollama.com/library) 中的名称执行，例如：

```bash
# 示例（标签以官网为准，需与你在 HF 侧准备的底座一致或兼容）
ollama pull qwen2.5:7b
```

**说明**：Ollama 将模型放在自有目录（如 Linux 上多为 `~/.ollama/models`），格式服务于 `ollama run`，**不能**把该路径直接填进 `model_name_or_path`。

运行 iCoLoc 时，请**另外**准备一份 **Hugging Face 格式**权重，例如：

```bash
pip install -U huggingface_hub
huggingface-cli download Qwen/Qwen2.5-7B-Instruct --local-dir /你的路径/qwen2.5-7b-instruct
```

也可从 ModelScope、国内镜像等下载同名/兼容的 HF 模型并解压到本地。将 `model.model_name_or_path` 设为该目录（建议绝对路径），`template` 与模型家族一致（如 Qwen 使用 `qwen`）。Ollama 与 HF 两套可以并存：前者便于命令行试用，后者供本项目加载。

主 LLM 与 **Sentence Transformer 嵌入模型**仍是两套权重：后者用于 Stage0 意图向量与 Stage3 模式向量，默认在 `src/learning/embedder.py` 中使用 `sentence-transformers/all-MiniLM-L6-v2`（首次运行需联网，缓存一般在 `~/.cache/huggingface/hub`）。网络受限时可使用镜像（如 `HF_ENDPOINT`），或自行下载后修改 `embedder.py` 中的模型名为本地目录路径。

### 3) 运行系统

```bash
# CLI 交互模式
python main.py

# CLI 单次查询
python main.py --query "我更关注高置信度的三阶模式"

# Web 模式（默认端口 8000；监听地址默认为 0.0.0.0，局域网内可用本机 IP 访问）
python main.py --web
```

### 4) 数据集下载与转换（可选）

如果你想自己生成 POI 数据集，可以使用 `src/download` 下的脚本：

```bash
# 进入项目根目录
cd /path/to/iCoLoc

# 1) 安装下载所需依赖（如已安装可跳过）
pip install osmnx

# 2) 下载北京 POI GeoJSON（示例脚本）
python src/download/1.py

# 3) 转换为 iCoLoc 使用的 JSON 格式（id/type/x/y）
python src/download/convert_geojson.py src/download/beijing_poi.geojson data/beijing_poi.json
```

说明：
- `src/download/1.py` 会生成 `src/download/beijing_poi.geojson`
- `convert_geojson.py` 会将 GeoJSON 转为系统可直接读取的点数据 JSON
- 你也可以传入自己的 GeoJSON 文件路径与输出路径

## 常用命令

```bash
# 指定配置文件
python main.py --config config/config.yaml

# 手动触发训练（Stage3）
python main.py --train

# Stage4 指定迭代轮数
python main.py --query "推荐早餐店选址" --iter 3

# 自定义 Web 监听地址
python main.py --web --host 127.0.0.1 --port 8080
```

## 数据格式

输入数据为 JSON 数组，每条记录至少包含：

- `id`：实例 ID
- `type`：POI 类型
- `x`、`y`：空间坐标

```json
[
  {"id": 1, "type": "A", "x": 24, "y": 14},
  {"id": 2, "type": "B", "x": 13, "y": 3}
]
```

## 运行流程

1. 用户输入自然语言查询
2. LLM 解析查询并提取挖掘参数（必要时执行 Stage0）
3. 执行 Co-location 模式挖掘
4. 基于 Stage3/Stage4 进行反馈学习与迭代优化
5. 返回结果与解释，并持续收集反馈

## 实验与可视化

```bash
# 运行实验
python run_experiment.py

# 使用偏好加权配置运行实验
python run_experiment.py --preference-weighted-config config/config_preference_weighted.yaml

# 绘制图表
python run_plotter.py

# 根据指定指标文件绘图
python run_plotter.py --preference-weighted results/metrics_preference_weighted.json --output results/learning_curve.png
```

实验输出默认位于 `results/`。

## 文档

- Web 说明：`WEB_README.md`（与 `document/WEB_README.md` 内容可保持一致）
- 模块说明：`src/*/README.md`
- 总览：`document/项目说明.md`
- 其他：`document/执行流程文档.md`、`document/对比算法说明.md` 等

## 故障排查

- 主 LLM 加载失败：检查 `model_name_or_path`、显存与 `requirements.txt` 依赖
- Sentence Transformer 下载失败或 SSL 报错：检查网络、使用镜像，或改为本地路径（见上文「嵌入模型」）
- 可选：使用 LlamaFactory 加载时若提示 `datasets` 版本不符，可按日志调整或设置 `DISABLE_VERSION_CHECK=1`
- Web 启动失败：检查 `fastapi`、`uvicorn` 与端口占用
- 结果异常：检查 `data_path` 与 JSON 格式
- 日志位置：`logs/mvp.log`

## 许可

本项目以 [MIT License](LICENSE) 发布：允许自由使用、修改与商业使用，但须在分发中保留原始版权声明与许可全文。可将 `LICENSE` 中的 `iCoLoc authors` 替换为你的姓名或单位。

---

## English

iCoLoc is an interactive co-location pattern mining system powered by local LLMs. It supports natural-language query understanding, pattern mining, feedback-driven preference learning, and iterative optimization.

## Features

- Natural-language query to mining-parameter translation
- Co-location pattern mining and rule generation
- Stage0: intent understanding and cold-start preference initialization
- Stage3: feedback-based preference learning and re-ranking
- Stage4: multi-round interactive refinement
- Both CLI and Web interfaces

## Tech Stack

- Python 3.9+
- FastAPI + Uvicorn (Web service)
- PyTorch / Transformers (main LLM inference)
- Sentence Transformers and NumPy (Stage0/Stage3 sentence embeddings; separate from the main LLM; default MiniLM-style model from Hugging Face)

## Project Structure

The top-level layout below shows **what is tracked in version control**. Local data, caches, logs, and untracked scripts are usually listed in `.gitignore` and are therefore not itemized in the tree.

```text
iCoLoc/
├── main.py
├── run_experiment.py
├── run_plotter.py
├── README.md
├── LICENSE
├── WEB_README.md
├── requirements.txt
├── config/
├── data/
├── document/
├── results/
├── memory/
├── models/
├── logs/
└── src/
    ├── llm/
    ├── preference/
    ├── core/
    ├── controller/
    ├── learning/
    ├── embedding/
    ├── memory/
    ├── web/
    ├── experiment/
    └── download/
```

If you only maintain the Web guide under `document/`, copy or symlink it to `WEB_README.md` at the repo root to match the layout above.

## Quick Start

### 1) Install dependencies

```bash
cd /path/to/iCoLoc
pip install -r requirements.txt
```

### 2) Configure model and data

Edit `config/config.yaml`:

```yaml
model:
  model_name_or_path: "/path/to/your/model"
  adapter_name_or_path: null
  template: "qwen"

data:
  data_path: "./data/beijing_poi.json"
```

#### Main LLM weights: Ollama vs `config.yaml`

iCoLoc loads the main model from a **local directory** via `transformers` (optionally LlamaFactory). That directory must be in **Hugging Face layout** (`config.json`, tokenizer files, etc.). Set `model_name_or_path` to that folder.

If you use **[Ollama](https://ollama.com/)** to pull models locally, install Ollama and run (names follow [Ollama Library](https://ollama.com/library)):

```bash
# Example; check the Ollama Library for current tags, and pick one that matches or is compatible with the base model you prepare on the HF side
ollama pull qwen2.5:7b
```

**Note:** Ollama stores models under its own directory (e.g. `~/.ollama/models` on Linux) in a format meant for `ollama run`. That path **cannot** be used as `model_name_or_path`.

For iCoLoc, prepare a separate **Hugging Face–format** checkpoint, for example:

```bash
pip install -U huggingface_hub
huggingface-cli download Qwen/Qwen2.5-7B-Instruct --local-dir /path/to/qwen2.5-7b-instruct
```

You can also download from ModelScope or other mirrors. Set `model.model_name_or_path` to that directory (absolute path recommended) and match `template` to the family (e.g. `qwen` for Qwen). Ollama and HF copies can coexist: Ollama for CLI chats, HF folder for this project.
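Before starting iCoLoc, it can help to confirm that the directory you plan to use really has a Hugging Face layout. The following is a minimal sketch, not part of the project: the helper name `looks_like_hf_checkpoint` and the list of tokenizer file names are assumptions chosen for illustration, and the check only covers file presence, not whether the weights load.

```python
from pathlib import Path

# Common tokenizer file names in Hugging Face checkpoints (illustrative, not exhaustive).
TOKENIZER_FILES = ("tokenizer.json", "tokenizer_config.json", "tokenizer.model", "vocab.json")


def looks_like_hf_checkpoint(path):
    """Return True if `path` is a directory with config.json and at least one tokenizer file."""
    directory = Path(path)
    if not directory.is_dir():
        return False
    has_config = (directory / "config.json").is_file()
    has_tokenizer = any((directory / name).is_file() for name in TOKENIZER_FILES)
    return has_config and has_tokenizer


# Replace the placeholder with the value of model.model_name_or_path from config.yaml.
print(looks_like_hf_checkpoint("/path/to/your/model"))
```

An Ollama model directory would fail this check because it does not contain `config.json` at the top level.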

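The main LLM and the **Sentence Transformer** embedding model use separate weights. The latter produces the Stage0 intent vectors and Stage3 pattern vectors; `src/learning/embedder.py` defaults to `sentence-transformers/all-MiniLM-L6-v2`, which is downloaded on first run (network access required) and is typically cached under `~/.cache/huggingface/hub`. On restricted networks, use a mirror (e.g. via `HF_ENDPOINT`), or download the model yourself and change the model name in `embedder.py` to the local directory path.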

### 3) Run

```bash
# CLI interactive mode
python main.py

# One-shot CLI query
python main.py --query "Find 3-order patterns with high confidence"

# Web mode (default port 8000; binds to 0.0.0.0 by default, so it is reachable from the LAN via the host IP)
python main.py --web
```

### 4) Dataset Download and Conversion (Optional)

If you want to generate your own POI dataset, use scripts in `src/download`:

```bash
# Go to project root
cd /path/to/iCoLoc

# 1) Install dependency for downloading data (skip if already installed)
pip install osmnx

# 2) Download Beijing POI GeoJSON (example script)
python src/download/1.py

# 3) Convert GeoJSON to iCoLoc JSON format (id/type/x/y)
python src/download/convert_geojson.py src/download/beijing_poi.geojson data/beijing_poi.json
```

Notes:
- `src/download/1.py` generates `src/download/beijing_poi.geojson`
- `convert_geojson.py` converts GeoJSON into the point-based JSON used by iCoLoc
- You can also provide your own input GeoJSON and output JSON paths
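If you want to see roughly what such a conversion involves, here is a minimal sketch. It is not the code of `convert_geojson.py`. It assumes that the POI type comes from an OSM-style `amenity` property, that `x`/`y` are taken directly from longitude/latitude without projection, and that ids are assigned sequentially. The function name `geojson_points_to_records` is hypothetical.

```python
import json


def geojson_points_to_records(feature_collection, type_key="amenity"):
    """Turn GeoJSON Point features into iCoLoc-style records (id/type/x/y)."""
    records = []
    for feature in feature_collection.get("features", []):
        geometry = feature.get("geometry") or {}
        if geometry.get("type") != "Point":
            continue  # this sketch skips polygons and lines
        properties = feature.get("properties") or {}
        poi_type = properties.get(type_key)
        if poi_type is None:
            continue  # no usable type label for this feature
        lon, lat = geometry["coordinates"][:2]
        records.append({"id": len(records) + 1, "type": poi_type, "x": lon, "y": lat})
    return records


example = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [116.40, 39.90]},
         "properties": {"amenity": "cafe"}},
        {"type": "Feature",
         "geometry": {"type": "Polygon", "coordinates": []},
         "properties": {"amenity": "school"}},
    ],
}
print(json.dumps(geojson_points_to_records(example)))
# [{"id": 1, "type": "cafe", "x": 116.4, "y": 39.9}]
```

If your distance thresholds are expressed in meters, you would likely need to project the coordinates first; that step is left out here.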

## Common Commands

```bash
# Use custom config
python main.py --config config/config.yaml

# Trigger training manually (Stage3)
python main.py --train

# Set iteration rounds for Stage4
python main.py --query "Recommend locations for breakfast stores" --iter 3

# Customize Web host and port
python main.py --web --host 127.0.0.1 --port 8080
```
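For orientation, the flags used above could be declared with `argparse` roughly as follows. This is a sketch only, not the actual contents of `main.py`. The host and port defaults follow this README; the defaults for `--config` and `--iter` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the command-line flags shown in this README."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--config", default="config/config.yaml")  # default is assumed
    parser.add_argument("--query", help="run a single query instead of the interactive CLI")
    parser.add_argument("--iter", type=int, default=1)  # default is assumed
    parser.add_argument("--train", action="store_true", help="trigger Stage3 training")
    parser.add_argument("--web", action="store_true", help="start the Web service")
    parser.add_argument("--host", default="0.0.0.0")
    parser.add_argument("--port", type=int, default=8000)
    return parser


args = build_parser().parse_args(["--web", "--host", "127.0.0.1", "--port", "8080"])
print(args.web, args.host, args.port)  # True 127.0.0.1 8080
```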

## Input Data Format

Input must be a JSON array. Each record must include at least:

- `id`: instance ID
- `type`: POI type
- `x`, `y`: spatial coordinates

```json
[
  {"id": 1, "type": "A", "x": 24, "y": 14},
  {"id": 2, "type": "B", "x": 13, "y": 3}
]
```
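A quick way to catch malformed input before a mining run is to check each record for the required fields. The sketch below is a standalone helper, not part of iCoLoc. The function name `validate_records` is hypothetical, and the requirement that `x` and `y` be numeric is an assumption based on their description as spatial coordinates.

```python
import json

REQUIRED_FIELDS = ("id", "type", "x", "y")


def validate_records(records):
    """Return a list of problems found; an empty list means the data looks valid."""
    if not isinstance(records, list):
        return ["top-level JSON value must be an array"]
    problems = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"record {index}: expected an object")
            continue
        for field in REQUIRED_FIELDS:
            if field not in record:
                problems.append(f"record {index}: missing '{field}'")
        for axis in ("x", "y"):
            value = record.get(axis)
            # bool is a subclass of int in Python, so reject it explicitly
            if axis in record and (isinstance(value, bool) or not isinstance(value, (int, float))):
                problems.append(f"record {index}: '{axis}' must be a number")
    return problems


sample = json.loads('[{"id": 1, "type": "A", "x": 24, "y": 14}, {"id": 2, "type": "B", "x": 13}]')
print(validate_records(sample))  # ["record 1: missing 'y'"]
```

To check a real file, load it with `json.load` and pass the result to `validate_records`.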

## Pipeline Overview

1. User submits a natural-language query
2. LLM parses intent and mining parameters (Stage0 when needed)
3. Co-location mining is executed
4. Stage3/Stage4 updates ranking with user feedback and iterations
5. Results and explanations are returned, with continuous feedback collection
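To make step 3 more concrete, the sketch below computes the classic participation index for a two-type pattern. That measure is the standard prevalence measure in the co-location literature, but this README does not state which measure iCoLoc uses, so treat it as an illustration rather than the project's algorithm. The function name and the brute-force neighbor search are choices made for brevity.

```python
import math
from collections import defaultdict


def participation_index(points, type_a, type_b, max_distance):
    """Participation index of the pattern {type_a, type_b} under a distance threshold."""
    by_type = defaultdict(list)
    for point in points:
        by_type[point["type"]].append(point)
    a_points, b_points = by_type[type_a], by_type[type_b]
    if not a_points or not b_points:
        return 0.0
    a_hits, b_hits = set(), set()
    for a in a_points:
        for b in b_points:
            if math.dist((a["x"], a["y"]), (b["x"], b["y"])) <= max_distance:
                a_hits.add(a["id"])  # this A instance has a B neighbor
                b_hits.add(b["id"])  # this B instance has an A neighbor
    # Participation ratio per type; the index is the minimum of the two.
    return min(len(a_hits) / len(a_points), len(b_hits) / len(b_points))


points = [
    {"id": 1, "type": "A", "x": 0, "y": 0},
    {"id": 2, "type": "A", "x": 10, "y": 10},
    {"id": 3, "type": "B", "x": 1, "y": 0},
    {"id": 4, "type": "B", "x": 0, "y": 1},
]
print(participation_index(points, "A", "B", max_distance=2))  # 0.5
```

Here one of the two A instances has a B neighbor (ratio 0.5) and both B instances have an A neighbor (ratio 1.0), so the index is 0.5. A pattern would be reported as prevalent when this value meets a user-chosen threshold.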

## Experiments and Plots

```bash
# Run experiments
python run_experiment.py

# Run experiments with preference-weighted config
python run_experiment.py --preference-weighted-config config/config_preference_weighted.yaml

# Draw plots
python run_plotter.py

# Draw plots from a specific metrics file
python run_plotter.py --preference-weighted results/metrics_preference_weighted.json --output results/learning_curve.png
```

Outputs are saved to `results/` by default.

## Documentation

- Web guide: `WEB_README.md` (may mirror `document/WEB_README.md`)
- Module docs: `src/*/README.md`
- Overview: `document/项目说明.md`
- More: `document/执行流程文档.md`, `document/对比算法说明.md`, etc.

## Troubleshooting

- Main LLM load fails: verify `model_name_or_path`, GPU memory, and `requirements.txt`
- Sentence Transformer download or SSL errors: check the network, use a mirror, or switch to a local path in `embedder.py` (see the Sentence Transformer paragraph under “Configure model and data” above)
- Optional LlamaFactory: if `datasets` version warnings appear, adjust per logs or set `DISABLE_VERSION_CHECK=1`
- Web startup fails: verify `fastapi`, `uvicorn`, and port availability
- Unexpected results: verify `data_path` and JSON format
- Logs: `logs/mvp.log`
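
For the port-availability item above, a small standard-library check can tell you whether another process already holds the port. This is a sketch, not a tool shipped with iCoLoc, and the helper name `port_is_free` is hypothetical.

```python
import socket


def port_is_free(host, port):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False  # address already in use, or binding is not permitted
    return True


# 8000 is the default Web port mentioned in this README.
print(port_is_free("127.0.0.1", 8000))
```

If the check returns `False`, either stop the process that holds the port or start the Web service on another one with `--port`.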

## License

Released under the [MIT License](LICENSE). You may use, modify, and redistribute this software, including for commercial purposes, provided you retain the copyright notice and license text. Replace `iCoLoc authors` in `LICENSE` with your name or organization if you wish.