##############################################################################
🚨🚨🚨 MANDATORY RULES - 必须严格遵循 🚨🚨🚨
##############################################################################
你已加载此 skill,现在必须像法律一样严格遵循以下所有规则!
⛔ 忽略这些规则将导致统计错误(如用错平均方法、忽略数据权重等) ✅ 严格遵循这些规则才能得出正确的分析结论
##############################################################################
⚠️ CRITICAL: 必须加载此 Skill 的场景
当用户提到以下任何词汇时,你必须先加载此 skill:
- •分析、查看、看看、统计、计算、汇总
- •平均、均值、占比、增长、趋势、变化
- •对比、比较、相关、影响因素
- •图表、报表、可视化
- •任何涉及数据处理、数值计算的任务
IMPORTANT: This skill provides data analysis best practices and guidelines. Use it alongside excel-python for actual data processing.
##############################################################################
🔴 STOP! 在执行任何计算之前,先检查以下关键点:
##############################################################################
- •分母是否相同? 不同分母的百分比不能简单平均!
- •权重是否相等? 不同规模的数据需要加权平均!
- •数据类型是什么? 百分比、绝对值、时间序列各有不同处理方法!
##############################################################################
Core Principles
When users ask to "analyze" (分析) data or request data insights, follow these professional standards:
1. Data Type Recognition
Identify Data Types Before Analysis
| Data Type | Description | Analysis Approach |
|---|---|---|
| Percentage/Rate | Ratios like completion rate, pass rate | Check if denominators are the same |
| Absolute Value | Amounts, quantities, revenue | Check for scale differences |
| Time Series | Annual, quarterly, monthly data | Use trend analysis |
| Categorical | Regions, product lines, departments | Use grouping/segmentation |
Chinese Unit Conversions
- •万 (wan) = 10,000
- •亿 (yi) = 100,000,000
- •Always convert to consistent units for analysis
2. Statistical Method Selection
When to Use Mean (Average)
✅ Use Simple Average:
- •Same denominator percentages (e.g., consecutive quarter growth rates)
- •Same population scores (e.g., same employee's quarterly performance)
- •Directly comparable values with same scale
❌ DO NOT Use Simple Average:
- •Different denominator percentages (e.g., different class pass rates)
- •Different scale absolute values (e.g., different department sizes)
Weighted Average Formula
def weighted_average(values, weights):
"""Calculate weighted average"""
if sum(weights) == 0:
return None
return sum(v * w for v, w in zip(values, weights)) / sum(weights)
# Example: Calculate weighted average pass rate
pass_rates = [0.80, 0.90, 0.75] # Pass rates
student_counts = [50, 20, 40] # Student counts (weights)
weighted_rate = weighted_average(pass_rates, student_counts)
print(f"全校及格率: {weighted_rate*100:.2f}%")
Statistical Measure Selection
| Measure | Best For | Not Suitable For |
|---|---|---|
| Mean | Normal distribution, no outliers | Skewed data, extreme values |
| Median | Outliers present, skewed | Calculating totals |
| Mode | Categorical data, most common | Numerical calculations |
| Std Dev | Measuring dispersion | Different scale comparison |
3. Business Analysis Dimensions
Trend Analysis
YoY Growth (同比):
增长率 = (本期 - 同期) / 同期 × 100%
- •Compare current period to same period last year
- •Used for year-over-year performance tracking
MoM Growth (环比):
增长率 = (本期 - 上期) / 上期 × 100%
- •Compare current month/quarter to previous month/quarter
- •Used for short-term trend tracking
Comparative Analysis
- •By Product Line: Compare performance across products
- •By Region: Compare regional performance
- •By Time Period: Compare Q1 vs Q2 vs Q3 vs Q4
- •By Customer Segment: Compare different customer groups
4. Data Quality Checks
Before analysis, ALWAYS check for:
import pandas as pd
df = pd.read_excel('/path/to/file.xlsx')
# 1. Data overview
print("数据形状:", df.shape)
print("数据类型:\n", df.dtypes)
print("前5行:\n", df.head())
# 2. Missing values
print("\n空值统计:\n", df.isnull().sum())
# 3. Basic statistics
print("\n数值列统计:\n", df.describe())
# 4. Check for anomalies
for col in df.select_dtypes(include=[np.number]).columns:
if (df[col] < 0).any():
print(f"警告: {col} 包含负数")
Data Quality Checklist
- • Check for null/missing values
- • Check for unreasonable values (e.g., negative numbers where not expected)
- • Check for duplicate records
- • Check for unit consistency
- • Check for date format consistency
5. Percentage/Rate Handling Rules
CRITICAL: The Percentage Trap
WRONG Example:
班级A: 80% 及格率 (40/50 = 80%) 班级B: 90% 及格率 (18/20 = 90%) 简单平均: (80% + 90%) / 2 = 85% ← WRONG!
CORRECT Example:
实际及格率 = (40 + 18) / (50 + 20) = 58/70 ≈ 82.86% ← CORRECT!
When to Use Which Method
Same Denominator → Simple Average OK:
Q1增长率: 10% Q2增长率: 15% Q3增长率: 12% 平均增长率: (10% + 15% + 12%) / 3 = 12.33% ← OK
Different Denominators → Must Use Weighted Average:
班级A: 80% 及格率 (50人) 班级B: 90% 及格率 (20人) 班级C: 75% 及格率 (40人) 全校及格率: (50×0.8 + 20×0.9 + 40×0.75) / (50+20+40) ≈ 80% ← CORRECT
6. Business Insight Framework
Four-Layer Output Structure
Layer 1 - Data Presentation:
- •Display tables/charts with raw data
- •Show key metrics clearly
Layer 2 - Basic Conclusions:
- •Rising/Falling/Stable trend
- •High/Low performance
- •Anomalies detected
Layer 3 - Business Insights:
- •Root cause analysis
- •Impact assessment
- •Pattern recognition
Layer 4 - Actionable Recommendations:
- •Specific next steps
- •Risk mitigation strategies
- •Opportunity identification
Example Output Format
📊 销售数据分析报告 【数据概览】 2024年总营收: 5000万元 同比增长: +15.2% 各季度分布: Q1=1200万, Q2=1100万, Q3=1300万, Q4=1400万 【基本结论】 ✅ 整体呈上升趋势 ✅ Q4达到全年峰值 ✅ 同比增长显著 【业务洞察】 🔍 Q4增长主要受节假日促销带动 🔍 Q2受市场影响略有下滑 🔍 线上渠道占比从40%提升至55% 【行动建议】 1. 加强Q1-Q3淡季营销力度 2. 继续扩大线上渠道投入 3. 提前备货应对Q4旺季
7. Common Analysis Templates
Excel Data Analysis Template
import pandas as pd
# Load data
df = pd.read_excel('/path/to/file.xlsx')
# Data quality check
print("=== 数据概览 ===")
print(df.info())
# Basic statistics
print("\n=== 基本统计 ===")
print(df.describe())
# Time series analysis
if '日期' in df.columns:
df['日期'] = pd.to_datetime(df['日期'])
df.sort_values('日期', inplace=True)
df['环比增速'] = df['金额'].pct_change() * 100
print("\n=== 环比增速 ===")
print(df[['日期', '金额', '环比增速']])
# Group analysis
if '产品线' in df.columns:
grouped = df.groupby('产品线')['金额'].agg(['sum', 'count', 'mean'])
print("\n=== 按产品线汇总 ===")
print(grouped)
Percentage Analysis Template
import pandas as pd
df = pd.read_excel('/path/to/file.xlsx')
# Calculate actual rate from numerator and denominator
if '及格人数' in df.columns and '总人数' in df.columns:
df['实际及格率'] = df['及格人数'] / df['总人数'] * 100
print("=== 各班及格率 ===")
print(df[['班级', '及格人数', '总人数', '实际及格率']])
# Overall rate (weighted average)
total_pass = df['及格人数'].sum()
total_students = df['总人数'].sum()
overall_rate = total_pass / total_students * 100
print(f"\n全校及格率: {overall_rate:.2f}%")
Trend Analysis Template
import pandas as pd
df = pd.read_excel('/path/to/file.xlsx')
# Ensure date sorting
df['日期'] = pd.to_datetime(df['日期'])
df.sort_values('日期', inplace=True)
# Calculate growth
df['同比增速'] = df.groupby(df['日期'].dt.month)['金额'].pct_change(periods=12) * 100
df['环比增速'] = df['金额'].pct_change() * 100
# Display trend
print("=== 趋势分析 ===")
print(df[['日期', '金额', '环比增速', '同比增速']])
# Identify patterns
latest_growth = df['环比增速'].iloc[-1]
if latest_growth > 10:
print("\n📈 近期增长强劲")
elif latest_growth > 0:
print("\n📊 稳步增长")
elif latest_growth > -10:
print("\n📉 小幅下滑")
else:
print("\n⚠️ 需关注下滑趋势")
8. Self-Check Before Output
Before presenting analysis results, ask yourself:
- • Did I use the correct statistical method for this data type?
- • Do percentages have the same denominator? If not, did I use weighted average?
- • Do my conclusions have data support?
- • Can the user understand my presentation?
- • Did I provide actionable insights?
- • Are Chinese terms used correctly (同比, 环比, 营收, 毛利)?
- • Did I check for data quality issues?
9. Common Pitfalls to Avoid
Wrong Analysis Examples
❌ WRONG: Averaging different denominator percentages
"各班及格率平均为 85%"
Why wrong: Different classes have different student counts.
✅ CORRECT:
"全校及格率为 82.86%(计算方式:总及格人数/总人数)"
❌ WRONG: Not explaining methodology
"平均增长率为 15%"
Why wrong: User doesn't know what was averaged.
✅ CORRECT:
"Q1-Q4平均环比增长率为 15%(基于4个季度的环比增速计算)"
❌ WRONG: No context
"销售额下降"
Why wrong: No timeframe or magnitude.
✅ CORRECT:
"2024年Q2销售额下降8.5%,环比Q1的5000万降至4575万"
10. Output Style Guidelines
Professional Communication
- •Data-Driven: State facts first, then analysis
- •Layered: Present → Analyze → Insight → Action
- •Clear: Use tables, bold text, and structure
- •Actionable: Provide concrete next steps
- •Contextual: Explain methodology when needed
Chinese Terminology
| Term | Usage |
|---|---|
| 同比 | Compare to same period last year |
| 环比 | Compare to previous period |
| 营收 | Revenue |
| 毛利 | Gross profit |
| 净利 | Net profit |
| 及格率 | Pass rate |
| 完成率 | Completion rate |
Formatting for Readability
# 📊 销售数据分析报告 ## 数据概览 | 指标 | 数值 | 同比 | |------|------|------| | 总营收 | 5000万 | +15.2% | ## 主要结论 ✅ 整体呈上升趋势 ⚠️ Q2略有下滑 ## 业务洞察 1. Q4受节假日带动增长显著 2. 线上渠道持续扩张 ## 行动建议 1. 加强淡季营销 2. 继续扩大线上投入
11. Analysis Workflow
When analyzing data, follow this sequence:
- •Read Data - Load and inspect structure
- •Quality Check - Identify nulls, outliers, issues
- •Select Methods - Choose appropriate statistical approach
- •Calculate - Perform analysis with Python
- •Verify - Check results for reasonableness
- •Interpret - Explain business meaning
- •Present - Structure output with insights and recommendations
文件操作安全规则
分析时的文件处理:
- •只读分析:直接读取原文件(
data_only=True),不做修改 - •需要修改时:先复制原文件 → 在副本上修改 → 原文件保持不变
- •保存分析结果:保存到新文件(如
原文件名_分析结果.xlsx),不覆盖原文件
# 只读分析(推荐)
wb = openpyxl.load_workbook('/path/to/原文件.xlsx', data_only=True)
# 需要修改/保存结果时
import shutil
shutil.copy('/path/to/原文件.xlsx', '/path/to/原文件_分析结果.xlsx')
wb = openpyxl.load_workbook('/path/to/原文件_分析结果.xlsx')
# ... 进行修改 ...
wb.save('/path/to/原文件_分析结果.xlsx')
汇报时说明文件位置:
- •分析结果保存在哪里
- •原文件是否保持不变
12. Advanced Analysis Topics
Outlier Detection
import pandas as pd
import numpy as np
df = pd.read_excel('/path/to/file.xlsx')
# IQR method for outliers
Q1 = df['金额'].quantile(0.25)
Q3 = df['金额'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['金额'] < lower_bound) | (df['金额'] > upper_bound)]
print("异常值:", outliers)
Correlation Analysis
# Correlation matrix
correlation_matrix = df.corr()
print("相关性分析:\n", correlation_matrix)
# Focus on high correlations
high_corr = correlation_matrix[correlation_matrix.abs() > 0.7]
print("高相关性 (>0.7):\n", high_corr)
Summary Checklist
Before finalizing any data analysis:
✅ Identified data types correctly ✅ Used appropriate statistical methods ✅ Handled percentages with correct weighting ✅ Checked data quality ✅ Provided clear methodology ✅ Structured output (Data → Conclusion → Insight → Action) ✅ Used professional Chinese terminology ✅ Included actionable recommendations ✅ Verified results are reasonable ✅ 原文件保持不变(如需修改,先复制再改) ✅ 告诉用户结果文件在哪里
Runtime
No additional dependencies needed. Use with:
- •
pandasfor data analysis - •
numpyfor statistical calculations - •Excel files via
openpyxl(excel-python skill)