<skill name="data-analyst"> # Professional Data Analyst Guidelines

##############################################################################

🚨🚨🚨 MANDATORY RULES - 必须严格遵循 🚨🚨🚨

##############################################################################

你已加载此 skill，现在必须像法律一样严格遵循以下所有规则！

⛔ 忽略这些规则将导致统计错误（如用错平均方法、忽略数据权重等） ✅ 严格遵循这些规则才能得出正确的分析结论

##############################################################################

⚠️ CRITICAL: 必须加载此 Skill 的场景

当用户提到以下任何词汇时，你必须先加载此 skill：

•分析、查看、看看、统计、计算、汇总
•平均、均值、占比、增长、趋势、变化
•对比、比较、相关、影响因素
•图表、报表、可视化
•任何涉及数据处理、数值计算的任务

IMPORTANT: This skill provides data analysis best practices and guidelines. Use it alongside excel-python for actual data processing.

##############################################################################

🔴 STOP! 在执行任何计算之前，先检查以下关键点：

##############################################################################

•分母是否相同？ 不同分母的百分比不能简单平均！
•权重是否相等？ 不同规模的数据需要加权平均！
•数据类型是什么？ 百分比、绝对值、时间序列各有不同处理方法！

##############################################################################

Core Principles

When users ask to "analyze" (分析) data or request data insights, follow these professional standards:

1. Data Type Recognition

Identify Data Types Before Analysis

Data Type	Description	Analysis Approach
Percentage/Rate	Ratios like completion rate, pass rate	Check if denominators are the same
Absolute Value	Amounts, quantities, revenue	Check for scale differences
Time Series	Annual, quarterly, monthly data	Use trend analysis
Categorical	Regions, product lines, departments	Use grouping/segmentation

Chinese Unit Conversions

•万 (wan) = 10,000
•亿 (yi) = 100,000,000
•Always convert to consistent units for analysis

2. Statistical Method Selection

When to Use Mean (Average)

✅ Use Simple Average:

•Same denominator percentages (e.g., consecutive quarter growth rates)
•Same population scores (e.g., same employee's quarterly performance)
•Directly comparable values with same scale

❌ DO NOT Use Simple Average:

•Different denominator percentages (e.g., different class pass rates)
•Different scale absolute values (e.g., different department sizes)

Weighted Average Formula

python

def weighted_average(values, weights):
    """Calculate weighted average"""
    if sum(weights) == 0:
        return None
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Example: Calculate weighted average pass rate
pass_rates = [0.80, 0.90, 0.75]  # Pass rates
student_counts = [50, 20, 40]     # Student counts (weights)

weighted_rate = weighted_average(pass_rates, student_counts)
print(f"全校及格率: {weighted_rate*100:.2f}%")

Statistical Measure Selection

Measure	Best For	Not Suitable For
Mean	Normal distribution, no outliers	Skewed data, extreme values
Median	Outliers present, skewed	Calculating totals
Mode	Categorical data, most common	Numerical calculations
Std Dev	Measuring dispersion	Different scale comparison

3. Business Analysis Dimensions

Trend Analysis

YoY Growth (同比):

code

增长率 = (本期 - 同期) / 同期 × 100%

•Compare current period to same period last year
•Used for year-over-year performance tracking

MoM Growth (环比):

code

增长率 = (本期 - 上期) / 上期 × 100%

•Compare current month/quarter to previous month/quarter
•Used for short-term trend tracking

Comparative Analysis

•By Product Line: Compare performance across products
•By Region: Compare regional performance
•By Time Period: Compare Q1 vs Q2 vs Q3 vs Q4
•By Customer Segment: Compare different customer groups

4. Data Quality Checks

Before analysis, ALWAYS check for:

python

import pandas as pd

df = pd.read_excel('/path/to/file.xlsx')

# 1. Data overview
print("数据形状:", df.shape)
print("数据类型:\n", df.dtypes)
print("前5行:\n", df.head())

# 2. Missing values
print("\n空值统计:\n", df.isnull().sum())

# 3. Basic statistics
print("\n数值列统计:\n", df.describe())

# 4. Check for anomalies
for col in df.select_dtypes(include=[np.number]).columns:
    if (df[col] < 0).any():
        print(f"警告: {col} 包含负数")

Data Quality Checklist

• Check for null/missing values
• Check for unreasonable values (e.g., negative numbers where not expected)
• Check for duplicate records
• Check for unit consistency
• Check for date format consistency

5. Percentage/Rate Handling Rules

CRITICAL: The Percentage Trap

WRONG Example:

code

班级A: 80% 及格率 (40/50 = 80%)
班级B: 90% 及格率 (18/20 = 90%)
简单平均: (80% + 90%) / 2 = 85%  ← WRONG!

CORRECT Example:

code

实际及格率 = (40 + 18) / (50 + 20) = 58/70 ≈ 82.86%  ← CORRECT!

When to Use Which Method

Same Denominator → Simple Average OK:

code

Q1增长率: 10%
Q2增长率: 15%
Q3增长率: 12%
平均增长率: (10% + 15% + 12%) / 3 = 12.33%  ← OK

Different Denominators → Must Use Weighted Average:

code

班级A: 80% 及格率 (50人)
班级B: 90% 及格率 (20人)
班级C: 75% 及格率 (40人)
全校及格率: (50×0.8 + 20×0.9 + 40×0.75) / (50+20+40) ≈ 80%  ← CORRECT

6. Business Insight Framework

Four-Layer Output Structure

Layer 1 - Data Presentation:

•Display tables/charts with raw data
•Show key metrics clearly

Layer 2 - Basic Conclusions:

•Rising/Falling/Stable trend
•High/Low performance
•Anomalies detected

Layer 3 - Business Insights:

•Root cause analysis
•Impact assessment
•Pattern recognition

Layer 4 - Actionable Recommendations:

•Specific next steps
•Risk mitigation strategies
•Opportunity identification

Example Output Format

code

📊 销售数据分析报告

【数据概览】
2024年总营收: 5000万元
同比增长: +15.2%
各季度分布: Q1=1200万, Q2=1100万, Q3=1300万, Q4=1400万

【基本结论】
✅ 整体呈上升趋势
✅ Q4达到全年峰值
✅ 同比增长显著

【业务洞察】
🔍 Q4增长主要受节假日促销带动
🔍 Q2受市场影响略有下滑
🔍 线上渠道占比从40%提升至55%

【行动建议】
1. 加强Q1-Q3淡季营销力度
2. 继续扩大线上渠道投入
3. 提前备货应对Q4旺季

7. Common Analysis Templates

Excel Data Analysis Template

python

import pandas as pd

# Load data
df = pd.read_excel('/path/to/file.xlsx')

# Data quality check
print("=== 数据概览 ===")
print(df.info())

# Basic statistics
print("\n=== 基本统计 ===")
print(df.describe())

# Time series analysis
if '日期' in df.columns:
    df['日期'] = pd.to_datetime(df['日期'])
    df.sort_values('日期', inplace=True)
    df['环比增速'] = df['金额'].pct_change() * 100
    print("\n=== 环比增速 ===")
    print(df[['日期', '金额', '环比增速']])

# Group analysis
if '产品线' in df.columns:
    grouped = df.groupby('产品线')['金额'].agg(['sum', 'count', 'mean'])
    print("\n=== 按产品线汇总 ===")
    print(grouped)

Percentage Analysis Template

python

import pandas as pd

df = pd.read_excel('/path/to/file.xlsx')

# Calculate actual rate from numerator and denominator
if '及格人数' in df.columns and '总人数' in df.columns:
    df['实际及格率'] = df['及格人数'] / df['总人数'] * 100
    print("=== 各班及格率 ===")
    print(df[['班级', '及格人数', '总人数', '实际及格率']])

    # Overall rate (weighted average)
    total_pass = df['及格人数'].sum()
    total_students = df['总人数'].sum()
    overall_rate = total_pass / total_students * 100
    print(f"\n全校及格率: {overall_rate:.2f}%")

Trend Analysis Template

python

import pandas as pd

df = pd.read_excel('/path/to/file.xlsx')

# Ensure date sorting
df['日期'] = pd.to_datetime(df['日期'])
df.sort_values('日期', inplace=True)

# Calculate growth
df['同比增速'] = df.groupby(df['日期'].dt.month)['金额'].pct_change(periods=12) * 100
df['环比增速'] = df['金额'].pct_change() * 100

# Display trend
print("=== 趋势分析 ===")
print(df[['日期', '金额', '环比增速', '同比增速']])

# Identify patterns
latest_growth = df['环比增速'].iloc[-1]
if latest_growth > 10:
    print("\n📈 近期增长强劲")
elif latest_growth > 0:
    print("\n📊 稳步增长")
elif latest_growth > -10:
    print("\n📉 小幅下滑")
else:
    print("\n⚠️ 需关注下滑趋势")

8. Self-Check Before Output

Before presenting analysis results, ask yourself:

• Did I use the correct statistical method for this data type?
• Do percentages have the same denominator? If not, did I use weighted average?
• Do my conclusions have data support?
• Can the user understand my presentation?
• Did I provide actionable insights?
• Are Chinese terms used correctly (同比, 环比, 营收, 毛利)?
• Did I check for data quality issues?

9. Common Pitfalls to Avoid

Wrong Analysis Examples

❌ WRONG: Averaging different denominator percentages

code

"各班及格率平均为 85%"

Why wrong: Different classes have different student counts.

✅ CORRECT:

code

"全校及格率为 82.86%（计算方式：总及格人数/总人数）"

❌ WRONG: Not explaining methodology

code

"平均增长率为 15%"

Why wrong: User doesn't know what was averaged.

✅ CORRECT:

code

"Q1-Q4平均环比增长率为 15%（基于4个季度的环比增速计算）"

❌ WRONG: No context

code

"销售额下降"

Why wrong: No timeframe or magnitude.

✅ CORRECT:

code

"2024年Q2销售额下降8.5%，环比Q1的5000万降至4575万"

10. Output Style Guidelines

Professional Communication

•Data-Driven: State facts first, then analysis
•Layered: Present → Analyze → Insight → Action
•Clear: Use tables, bold text, and structure
•Actionable: Provide concrete next steps
•Contextual: Explain methodology when needed

Chinese Terminology

Term	Usage
同比	Compare to same period last year
环比	Compare to previous period
营收	Revenue
毛利	Gross profit
净利	Net profit
及格率	Pass rate
完成率	Completion rate

Formatting for Readability

markdown

# 📊 销售数据分析报告

## 数据概览
| 指标 | 数值 | 同比 |
|------|------|------|
| 总营收 | 5000万 | +15.2% |

## 主要结论
✅ 整体呈上升趋势
⚠️ Q2略有下滑

## 业务洞察
1. Q4受节假日带动增长显著
2. 线上渠道持续扩张

## 行动建议
1. 加强淡季营销
2. 继续扩大线上投入

11. Analysis Workflow

When analyzing data, follow this sequence:

•Read Data - Load and inspect structure
•Quality Check - Identify nulls, outliers, issues
•Select Methods - Choose appropriate statistical approach
•Calculate - Perform analysis with Python
•Verify - Check results for reasonableness
•Interpret - Explain business meaning
•Present - Structure output with insights and recommendations

文件操作安全规则

分析时的文件处理：

•只读分析：直接读取原文件（data_only=True），不做修改
•需要修改时：先复制原文件 → 在副本上修改 → 原文件保持不变
•保存分析结果：保存到新文件（如 原文件名_分析结果.xlsx），不覆盖原文件

python

# 只读分析（推荐）
wb = openpyxl.load_workbook('/path/to/原文件.xlsx', data_only=True)

# 需要修改/保存结果时
import shutil
shutil.copy('/path/to/原文件.xlsx', '/path/to/原文件_分析结果.xlsx')
wb = openpyxl.load_workbook('/path/to/原文件_分析结果.xlsx')
# ... 进行修改 ...
wb.save('/path/to/原文件_分析结果.xlsx')

汇报时说明文件位置：

•分析结果保存在哪里
•原文件是否保持不变

12. Advanced Analysis Topics

Outlier Detection

python

import pandas as pd
import numpy as np

df = pd.read_excel('/path/to/file.xlsx')

# IQR method for outliers
Q1 = df['金额'].quantile(0.25)
Q3 = df['金额'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df['金额'] < lower_bound) | (df['金额'] > upper_bound)]
print("异常值:", outliers)

Correlation Analysis

python

# Correlation matrix
correlation_matrix = df.corr()
print("相关性分析:\n", correlation_matrix)

# Focus on high correlations
high_corr = correlation_matrix[correlation_matrix.abs() > 0.7]
print("高相关性 (>0.7):\n", high_corr)

Summary Checklist

Before finalizing any data analysis:

✅ Identified data types correctly ✅ Used appropriate statistical methods ✅ Handled percentages with correct weighting ✅ Checked data quality ✅ Provided clear methodology ✅ Structured output (Data → Conclusion → Insight → Action) ✅ Used professional Chinese terminology ✅ Included actionable recommendations ✅ Verified results are reasonable ✅ 原文件保持不变（如需修改，先复制再改） ✅ 告诉用户结果文件在哪里

Runtime

No additional dependencies needed. Use with:

•pandas for data analysis
•numpy for statistical calculations
•Excel files via openpyxl (excel-python skill)

</skill>