XPENG
Posted 3 days ago
$179,400 - $303,600/Yr
Full-time
Santa Clara, CA
Repair and Maintenance

About the position

XPENG is a leading smart technology company at the forefront of innovation, integrating advanced AI and autonomous driving technologies into its vehicles, including electric vehicles (EVs), electric vertical take-off and landing (eVTOL) aircraft, and robotics. With a strong focus on intelligent mobility, XPENG is dedicated to reshaping the future of transportation through cutting-edge R&D in AI, machine learning, and smart connectivity.

We are looking for a versatile Machine Learning Infrastructure Engineer to join XPENG's Fuyao AI Platform team, the core AI infrastructure powering autonomous driving, robotics, and intelligent cockpit teams with large-scale data processing, model training, and inference acceleration. You will begin by building and optimizing our next-generation DataLoader and Dataset Management System, and later expand to distributed training, large-scale inference, model pruning/quantization, and operator-level acceleration, improving AI model efficiency and scalability.

Responsibilities

  • Design, develop, and maintain high-performance DataLoader SDKs and Dataset Management Systems for multi-source, heterogeneous data (images, videos, point clouds, sensor streams, etc.).
  • Optimize multi-threaded/multi-process data pipelines for minimal I/O latency and preprocessing overhead, supporting large-scale model training and inference workloads.
  • Contribute to AI infrastructure projects beyond data loading, including distributed training and inference optimization.
  • Develop custom operators (CUDA kernels, TensorRT, ROCm) and implement hardware-specific acceleration for GPUs/TPUs.
  • Apply model optimization techniques such as pruning, quantization, distillation, sparsification, and mixed-precision training.
  • Collaborate with algorithm and platform teams to translate business needs into scalable, production-grade solutions.
  • Continuously identify and address performance bottlenecks across the AI training and inference stack.

Requirements

  • Master's degree in Computer Science or Software Engineering, or equivalent practical experience.
  • 5+ years of experience in large-scale data processing or ML infrastructure.
  • Proficient in Python with solid software engineering fundamentals, clean coding practices, and strong debugging skills.
  • Hands-on experience with relational databases and NoSQL systems, including metadata and cache management; prior experience with large-scale VectorDB is highly desirable.
  • Experience in at least one of the following areas:
      • Large-scale deep learning training or inference optimization focused on scalability and model acceleration (distributed training strategies, quantization, CUDA kernel development, and related optimizations).
      • Columnar storage formats (Parquet/ORC) and related ecosystems, including partitioning, compression, and vectorized I/O optimization.
      • Linux file system and network I/O optimization for NFS, high-performance distributed file systems, and object storage.
      • Large-scale data loading frameworks (PyTorch DataLoader, Hugging Face Datasets).
  • Strong communication skills and ability to work cross-functionally in fast-paced environments.
  • Strong ability to learn quickly, adapt to new challenges, and proactively explore and adopt new technologies.

Nice-to-haves

  • Familiarity with the autonomous driving industry and enthusiasm for its challenges.
  • Experience with distributed computing frameworks such as Apache Ray.
  • Experience in building and scaling ML infrastructure in cloud-native environments.

Benefits

  • Base salary range of $179,400-$303,600, in addition to bonus, equity, and benefits.