SKILL.md
---
name: hierarchical-taxonomy-clustering
description: Build a unified multi-level category taxonomy from hierarchical product category paths sourced from multiple e-commerce companies, using embedding-based recursive clustering and intelligent category naming via weighted word-frequency analysis.
---

Hierarchical Taxonomy Clustering

Create a unified multi-level taxonomy from hierarchical category paths by clustering similar paths and automatically generating meaningful category names.

Problem

Given category paths from multiple sources (e.g., "electronics -> computers -> laptops"), create a unified taxonomy that groups similar paths across sources, generates meaningful category names, and produces a clean N-level hierarchy (typically 5 levels). The unified taxonomy can then be used for analysis or metric tracking across products from different platforms.

Methodology

  1. Hierarchical Weighting: Convert each path to an embedding using exponentially decaying level weights (level i gets weight 0.6^(i-1)), so broader levels dominate and deeper, finer-grained levels contribute progressively less
  2. Recursive Clustering: Hierarchically cluster at each level (10-20 clusters at L1, 3-20 at L2-L5) using cosine distance
  3. Intelligent Naming: Generate category names via weighted word frequency + lemmatization + bundle word logic
  4. Quality Control: Exclude all ancestor words (parent, grandparent, etc.), avoid ancestor path duplicates, clean special characters
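
To make items 3 and 4 concrete, here is a minimal sketch of weighted word-frequency naming. The name_cluster helper is hypothetical; the real logic in step3_recursive_clustering_naming.py additionally applies the bundle-word rules and excludes ancestor words passed down from higher levels.

python
# Hypothetical naming helper: pick the most frequent lemmatized words in a
# cluster until >=70% of its paths contain at least one chosen word.
from collections import Counter
from nltk.stem import WordNetLemmatizer

def name_cluster(paths, ancestor_words=frozenset(), max_words=5, coverage=0.7):
    lemmatizer = WordNetLemmatizer()
    tokenized = [
        {lemmatizer.lemmatize(w) for w in path.replace(">", " ").split()}
        for path in paths
    ]
    counts = Counter(w for words in tokenized for w in words
                     if w not in ancestor_words)
    chosen = []
    for word, _ in counts.most_common():
        chosen.append(word)
        covered = sum(1 for words in tokenized if words & set(chosen))
        if covered / len(tokenized) >= coverage or len(chosen) == max_words:
            break
    return " | ".join(chosen)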

Output

DataFrame with added columns:

  • unified_level_1: Top-level category (e.g., "electronic | device")
  • unified_level_2: Second-level category (e.g., "computer | laptop")
  • unified_level_3 through unified_level_N: Deeper levels

Category names use the | separator, contain at most 5 words, and cover 70%+ of the records in each cluster.

Installation

bash
pip install pandas numpy scipy sentence-transformers nltk tqdm
python -c "import nltk; nltk.download('wordnet'); nltk.download('omw-1.4')"

4-Step Pipeline

Step 1: Load, Standardize, Filter and Merge (step1_preprocessing_and_merge.py)

  • Input: List of (DataFrame, source_name) tuples, each with a category_path column
  • Process: Per-source deduplication; text cleaning (strip special characters such as &, commas, apostrophes, hyphens, and quotes; drop connector words like "and"; lemmatize words as nouns); normalize the delimiter to >; depth filtering; prefix removal; then merge all sources. The source_level_N columns hold the cleaned version of each source level name (see the sketch below)
  • Output: Merged DataFrame with category_path, source, depth, source_level_1 through source_level_N
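
A condensed sketch of this step is shown below. The exact cleaning rules and the prefix-removal logic live in step1_preprocessing_and_merge.py; the regexes here are illustrative assumptions.

python
import re
import pandas as pd
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def clean_path(path):
    # Normalize "->", ">", or "/" to the canonical " > " delimiter first,
    # then strip special characters and connector words, lemmatizing as nouns.
    path = re.sub(r"\s*(->|>|/)\s*", " > ", path.lower())
    path = re.sub(r"[&,'\"-]", " ", path)
    levels = []
    for level in path.split(" > "):
        words = [lemmatizer.lemmatize(w) for w in level.split() if w != "and"]
        levels.append(" ".join(words))
    return " > ".join(levels)

def load_and_merge(frames, max_depth=5):
    # frames: list of (DataFrame, source_name) tuples with a category_path column.
    parts = []
    for df, source in frames:
        df = df.drop_duplicates("category_path").copy()
        df["category_path"] = df["category_path"].map(clean_path)
        df["source"] = source
        df["depth"] = df["category_path"].str.count(">") + 1
        parts.append(df[df["depth"] <= max_depth])
    merged = pd.concat(parts, ignore_index=True)
    levels = merged["category_path"].str.split(" > ", expand=True)
    levels.columns = [f"source_level_{i + 1}" for i in range(levels.shape[1])]
    return pd.concat([merged, levels], axis=1)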

Step 2: Weighted Embeddings (step2_weighted_embedding_generation.py)

  • Input: DataFrame from Step 1
  • Output: Numpy embedding matrix (n_records × 384)
  • Weights: L1=1.0, L2=0.6, L3=0.36, L4=0.216, L5=0.1296 (exponential decay 0.6^(n-1))
  • Performance: For ~10,000 records, expect 2-5 minutes. Progress bar will show encoding status.
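
A minimal sketch, assuming the 384-dimensional all-MiniLM-L6-v2 model (any SentenceTransformer with 384-dim output matches the shape above; the actual model choice is in step2_weighted_embedding_generation.py):

python
import numpy as np
from sentence_transformers import SentenceTransformer

def weighted_embeddings(df, max_depth=5, decay=0.6, dim=384):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    total = np.zeros((len(df), dim))
    weight_sum = np.zeros(len(df))
    for i in range(max_depth):
        col = f"source_level_{i + 1}"
        if col not in df.columns:
            break
        mask = df[col].notna().to_numpy()
        vecs = model.encode(df.loc[mask, col].tolist(), show_progress_bar=True)
        w = decay ** i  # L1=1.0, L2=0.6, L3=0.36, ...
        total[mask] += w * vecs
        weight_sum[mask] += w
    # Weighted average over the levels each record actually has.
    return total / weight_sum[:, None]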

Step 3: Recursive Clustering (step3_recursive_clustering_naming.py)

  • Input: DataFrame + embeddings from Step 2
  • Output: Assignments dict {index → {level_1: ..., level_5: ...}}
  • Average linkage + cosine distance, 10-20 clusters at L1, 3-20 at L2-L5
  • Word-based naming: weighted frequency + lemmatization + coverage ≥70%
  • Performance: For ~10,000 records, expect 1-3 minutes for hierarchical clustering and naming. Be patient - the system is working through recursive levels.
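
A sketch of a single clustering pass follows; the real step3_recursive_clustering_naming.py recurses, re-clustering each level-1 group at level 2 and so on, picking cluster counts within the ranges above and naming each cluster as it goes.

python
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_level(embeddings, n_clusters):
    # Average linkage over pairwise cosine distances; fcluster cuts the
    # dendrogram into exactly n_clusters flat clusters (labels 1..n_clusters).
    tree = linkage(pdist(embeddings, metric="cosine"), method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")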

Step 4: Export Results (step4_result_assignments.py)

  • Input: DataFrame + assignments from Step 3
  • Output:
    • unified_taxonomy_full.csv - all records with unified categories
    • unified_taxonomy_hierarchy.csv - unique taxonomy structure
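
A minimal sketch of the export, assuming assignment keys of the form level_1 through level_5 as described in Step 3:

python
import pandas as pd

def export_results(df, assignments, out_dir="."):
    # Turn the {index -> {level_1: name, ...}} dict into unified_level_* columns.
    unified = pd.DataFrame.from_dict(assignments, orient="index")
    unified.columns = [f"unified_{c}" for c in unified.columns]
    full = df.join(unified)
    full.to_csv(f"{out_dir}/unified_taxonomy_full.csv", index=False)
    full[list(unified.columns)].drop_duplicates().to_csv(
        f"{out_dir}/unified_taxonomy_hierarchy.csv", index=False)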

Usage

Use scripts/pipeline.py to run the complete 4-step workflow.

See scripts/pipeline.py for:

  • Complete implementation of all 4 steps
  • Example code for processing multiple sources
  • Command-line interface
  • Individual step usage (for advanced control)
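
A hypothetical invocation (the flag names are illustrative; run python scripts/pipeline.py --help for the actual interface):

bash
# Assumed flags for demonstration only -- not confirmed by the source.
python scripts/pipeline.py --inputs amazon.csv ebay.csv --output-dir results/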