Joshua | Reliability Engineer & GPU Infrastructure Expert

Experience & Education

Speciality

Failure prediction innovator who upholds "stability before scale." Created GPU cluster health metrics that have since been adopted industry-wide.

Experience

1. Reliability Engineer, NVIDIA DGX Systems (4 years)
2. Lead architect for national supercomputing center disaster recovery
3. Core developer of WhaleFlux Self-Healing System

Education

1. MS High-Performance Computing, MIT
2. BSc Electrical Engineering, UC Berkeley

Posts

GPU Management: Slashing Costs in Gemini Fine-Tuning

Joshua Jul 17, 2025

Mastering PEFT Fine-Tuning: How PEFT & WhaleFlux Slash LLM Tuning Costs & Boost Performance

Joshua Jul 17, 2025

Cluster Model: Integrating Computational Management and Data Clustering

Joshua Jul 17, 2025

AI Inference: From Training to Practical Use

Joshua Jul 15, 2025

Optimize Your End-to-End ML Workflow: From Experimentation to Deployment

Joshua Jul 14, 2025

Quantization in Machine Learning: Shrink ML Models, Cut Costs, Boost Speed

Joshua Jul 14, 2025

Fine-Tuning LLMs Without Supercomputers: How GPU Optimization Unlocks Cost-Effective Customization

Joshua Jul 10, 2025

Real-Time Alerts for GPU Clusters: Stop Costly AI Downtime Before It Starts

Joshua Jul 10, 2025

Full-Stack Observability: The Secret Weapon for Efficient AI/GPU Operations

Joshua Jul 10, 2025