Joshua | Reliability Engineer & GPU Infrastructure Expert

Experience & Education

Speciality

Failure prediction innovator who upholds "stability before scale." Created GPU cluster health metrics that are now adopted industry-wide.

Experience

1. Reliability Engineer, NVIDIA DGX Systems (4 years)
2. Lead architect for national supercomputing center disaster recovery
3. Core developer of WhaleFlux Self-Healing System

Education

1. MS High-Performance Computing, MIT
2. BSc Electrical Engineering, UC Berkeley

Posts

AI Inference: From Training to Practical Use

Joshua Jul 15, 2025

Optimize Your End-to-End ML Workflow: From Experimentation to Deployment

Joshua Jul 14, 2025

Quantization in Machine Learning: Shrink ML Models, Cut Costs, Boost Speed

Joshua Jul 14, 2025

Fine-Tuning LLMs Without Supercomputers: How GPU Optimization Unlocks Cost-Effective Customization

Joshua Jul 10, 2025

Real-Time Alerts for GPU Clusters: Stop Costly AI Downtime Before It Starts

Joshua Jul 10, 2025

Full-Stack Observability: The Secret Weapon for Efficient AI/GPU Operations

Joshua Jul 10, 2025

CUDA Unchained: How WhaleFlux Turns CUDA GPU Potential into AI Profit

Joshua Jun 30, 2025

How GPU and CPU Bottlenecks Bleed Millions (and How WhaleFlux Fixes It)

Joshua Jun 30, 2025

Distributed Computing Decoded: From Theory to AI Scale with WhaleFlux

Joshua Jun 24, 2025