Systems Architect Skill
Purpose
Design robust, scalable architectures for bioinformatics software and pipelines.
When to Use This Skill
Use this skill when you need to:
- •Design software architecture for complex bioinformatics systems
- •Choose appropriate data structures (pandas, anndata, HDF5, databases)
- •Plan for scalability (memory, compute, storage)
- •Define APIs and interfaces between components
- •Design pipeline orchestration (Snakemake, Nextflow, custom)
- •Make technology stack decisions
Workflow Integration
Pattern: Requirements → Architecture Design → Implementation Spec
code
Biologist Commentator validates requirements
↓
Systems Architect designs architecture
↓
Produces technical specification
↓
Software Developer implements from spec
Core Responsibilities
1. System Design
- •Component architecture (modular, extensible)
- •Data flow design
- •Error handling strategy
- •Scalability planning
2. Technology Selection
- •Data structures (when to use what)
- •Storage formats (CSV, HDF5, Parquet, databases)
- •Execution environments (local, HPC, cloud)
- •Pipeline orchestration tools
3. Performance Planning
- •Memory requirements estimation
- •Compute resource allocation
- •I/O optimization strategies
- •Parallelization approach
4. Integration Strategy
- •How to wrap existing tools
- •Container strategy (Docker/Singularity)
- •Dependency management
- •Version pinning
Standard Architecture Template
Use assets/architecture_template.md:
code
# System Architecture: [Project Name] ## Overview [1-2 sentence system description] ## Components 1. [Component Name]: [Purpose] 2. [Component Name]: [Purpose] ## Data Flow [Input] → [Processing] → [Output] ## Technology Stack - Language: Python 3.11 - Key Libraries: pandas, numpy, scikit-learn - Storage: HDF5 for matrices, SQLite for metadata - Execution: Snakemake on HPC cluster ## Scalability - Dataset size: [Expected range] - Memory: [Requirements] - Compute: [CPU cores, time estimates] - Storage: [Space requirements] ## Error Handling [Strategy for failures, retries, logging] ## Deployment [Installation, configuration, execution]
Data Structure Selection Guide
See references/data_structure_guide.md for full details.
Quick Reference:
| Use Case | Structure | When |
|---|---|---|
| Tabular data <1GB | pandas DataFrame | General analysis |
| Tabular data >1GB | Dask DataFrame | Out-of-core processing |
| Single-cell data | AnnData | scRNA-seq analysis |
| Large matrices | HDF5 | Persistent storage |
| Relational queries | SQLite/PostgreSQL | Complex joins |
| Genomic intervals | BED/GFF files | Standard interchange |
| Time series | pandas with DatetimeIndex | Temporal data |
Scalability Considerations
Memory Estimation
code
RNA-seq count matrix: genes × samples × 8 bytes 20,000 genes × 1,000 samples × 8 = 160 MB (fits in RAM) 20,000 genes × 100,000 cells × 8 = 16 GB (need sparse or chunking)
Compute Planning
code
DESeq2 analysis: O(n_genes × n_samples²) 100 samples: ~5 minutes 1,000 samples: ~8 hours Strategy: Subset for testing, full run overnight
Storage Planning
code
FASTQ (compressed): 50-100 MB per million reads 50M reads = 5 GB 100 samples × 50M reads = 500 GB Strategy: Delete FASTQ after alignment, keep BAM
Integration Patterns
Wrapping External Tools
python
# Pattern 1: Subprocess call
import subprocess
result = subprocess.run(
['fastqc', input_file, '-o', output_dir],
capture_output=True, check=True
)
# Pattern 2: Python binding (preferred if available)
import pysam
bam = pysam.AlignmentFile(bam_file, 'rb')
Container Strategy
yaml
# Dockerfile approach for reproducibility FROM python:3.11-slim RUN pip install numpy pandas scikit-learn COPY pipeline.py /app/ ENTRYPOINT ["python", "/app/pipeline.py"]
Output: Technical Specification
Deliverable to Software Developer includes:
- •Architecture diagram (components + data flow)
- •Component specifications (inputs, outputs, responsibilities)
- •Technology stack (exact versions)
- •Data structures (schemas, formats)
- •Error handling (what to do when steps fail)
- •Performance requirements (memory, time, storage)
- •Testing strategy (unit, integration, validation)
References
For detailed guidance:
- •
references/architecture_patterns.md- Common patterns with pros/cons - •
references/data_structure_guide.md- When to use which data structure - •
references/scalability_considerations.md- Memory, compute, storage planning - •
references/integration_patterns.md- How to wrap tools, containers, dependencies
Example Architecture
Project: QC Pipeline for 1,000 RNA-seq Samples
code
## Architecture Specification ### Overview Parallel QC pipeline processing 1,000 bulk RNA-seq FASTQ files with automated report generation. ### Components 1. Validator: Check FASTQ integrity, format 2. QC Runner: Execute FastQC in parallel 3. Aggregator: Combine metrics with MultiQC 4. Reporter: Generate summary statistics and plots ### Data Flow FASTQ files → Validator → QC Runner (parallel) → Aggregator → HTML Report ### Technology Stack - Execution: Snakemake (manages dependencies, parallelization) - QC: FastQC 0.12.1 - Aggregation: MultiQC 1.14 - Custom code: Python 3.11, pandas, matplotlib - Storage: FASTQ (gzip), QC metrics (JSON), report (HTML) ### Scalability - Data: 1,000 samples × 50M reads × 100 bp = 500 GB FASTQ - Compute: 100 parallel jobs on HPC cluster - Time: 30 min per sample → 300 min total (5 hours) - Memory: 4 GB per FastQC job = 400 GB total (distributed) ### Error Handling - Retry failed jobs (3 attempts) - Continue pipeline if individual samples fail - Log all errors with sample ID - Final report includes QC pass/fail status per sample ### Deployment - Install: conda env from environment.yml - Config: samples.csv (list of FASTQ paths) - Execute: snakemake --cores 100 --cluster "sbatch -c 4 --mem=4GB" - Output: results/multiqc_report.html
Hands to Software Developer for implementation.
Success Criteria
Architecture is complete when:
- • All components clearly defined
- • Data flow unambiguous
- • Technology choices justified
- • Scalability analyzed (memory, compute, storage)
- • Error handling planned
- • Developer can implement without architecture questions