AWS Neuron is the complete software stack for the AWS Inferentia and Trainium cloud-scale machine learning accelerators and the servers that use them. This role is for a software engineer on the Machine Learning Inference Model Enablement team for AWS Neuron at Annapurna Labs. The role is responsible for the development, enablement, and performance tuning of a wide variety of model families, including massive-scale large language models such as Llama and DeepSeek, as well as Stable Diffusion, vision transformers, and many more. The Inference Model Enablement team works side by side with compiler and runtime engineers to create, build, and tune distributed inference solutions on Trainium and Inferentia. Experience optimizing inference performance for both latency and throughput on large models using Python, PyTorch, or JAX is a must. Experience with DeepSpeed and other distributed inference libraries is a bonus, as extending these techniques to Neuron-based systems is key to the role.