XPENG
Posted 3 days ago
$179,400 - $303,600/Yr
Full-time
Santa Clara, CA
Repair and Maintenance

About the position

XPENG is a leading smart technology company at the forefront of innovation, integrating advanced AI and autonomous driving technologies into its vehicles, including electric vehicles (EVs), electric vertical take-off and landing (eVTOL) aircraft, and robotics. With a strong focus on intelligent mobility, XPENG is dedicated to reshaping the future of transportation through cutting-edge R&D in AI, machine learning, and smart connectivity.

We are looking for a versatile Machine Learning Infrastructure Engineer to join XPENG's Fuyao AI Platform team, the core AI infrastructure powering autonomous driving, robotics, and intelligent cockpit teams with large-scale data processing, model training, and inference acceleration. You will begin by building and optimizing our next-generation DataLoader and Dataset Management System, and later expand to distributed training, large-scale inference, model pruning/quantization, and operator-level acceleration, improving AI model efficiency and scalability.

Responsibilities

  • Design, develop, and maintain high-performance DataLoader SDKs and Dataset Management Systems for multi-source, heterogeneous data (images, videos, point clouds, sensor streams, etc.).
  • Optimize multi-threaded/multi-process data pipelines for minimal I/O latency and preprocessing overhead, supporting large-scale model training and inference workloads.
  • Contribute to AI infrastructure projects beyond data loading, including distributed training and inference optimization.
  • Develop custom operators (CUDA kernels, TensorRT, ROCm) and implement hardware-specific acceleration for GPUs/TPUs.
  • Apply model optimization techniques such as pruning, quantization, distillation, sparsification, and mixed-precision training.
  • Collaborate with algorithm and platform teams to translate business needs into scalable, production-grade solutions.
  • Continuously identify and address performance bottlenecks across the AI training and inference stack.

Requirements

  • Master's degree in Computer Science or Software Engineering, or equivalent practical experience.
  • 5+ years of experience in large-scale data processing or ML infrastructure.
  • Proficient in Python with solid software engineering fundamentals, clean coding practices, and strong debugging skills.
  • Hands-on experience with relational databases and NoSQL systems, including metadata and cache management; prior experience with large-scale VectorDB is highly desirable.
  • Experience in at least one of the following areas:
      • Large-scale deep learning training or inference optimization focused on scalability and model acceleration (distributed training strategies, quantization, CUDA kernel development, and related optimizations).
      • Columnar storage formats (Parquet/ORC) and related ecosystems, including partitioning, compression, and vectorized I/O optimization.
      • Linux file system and network I/O optimization for NFS, high-performance distributed file systems, and object storage.
      • Large-scale data loading frameworks (PyTorch DataLoader, Hugging Face Datasets).
  • Strong communication skills and ability to work cross-functionally in fast-paced environments.
  • Strong ability to learn quickly, adapt to new challenges, and proactively explore and adopt new technologies.

Nice-to-haves

  • Familiarity with the autonomous driving industry and enthusiasm for its challenges.
  • Experience with distributed computing frameworks such as Apache Ray.
  • Experience in building and scaling ML infrastructure in cloud-native environments.

Benefits

  • Base salary range of $179,400-$303,600, in addition to bonus, equity, and benefits.