Hi, I’m Akhil
I started this page to write about software and machine learning.
I studied Electrical Engineering with minors in Math and Physics at Rutgers University, where I focused on signal processing, digital design, and embedded systems.
I then did my Master’s in Computer Science at Cornell University, specializing in machine learning. There, I developed and deployed both vision and language models while also studying representation learning, optimization techniques, and the theoretical foundations of modern AI systems.
I’ve previously interned at Johnson & Johnson (Risk Technology), JPMorgan Chase (Asset & Wealth Management), Crewcial Partners (Endowment Fund Management), Herbert J. Sims (Wealth Management), and Clifford Beers Clinic.
I’ve also done research in image noise reduction using variational autoencoders (VAEs), auto-labeling for supervised learning, and animal movement tracking in varied environments using computer vision.
I’m currently a new grad engineer @ a Fortune 500 company.
Feel free to reach out to me at akhilvaragamreddy@gmail.com or on LinkedIn.
I like explaining concepts with clarity and making hard ideas intuitive.
→ Read my posts
TL;DR: This post dives deeper into advanced CUDA performance techniques, from tiling and pipelining to occupancy tuning, loop unrolling, and warp-aware memory access. I explain how real-world matrix multiplies use shared memory and grid-stride loops to handle massive inputs efficiently, and how tricks like operator fusion and double buffering unlock GPU throughput. We also look at how OpenAI’s Triton gives you Python-first control over writing custom fused GPU kernels.
The repository with the corresponding PyTorch implementation is available here.
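As a taste of the tiling idea, here’s a minimal sketch of a shared-memory tiled matmul. This isn’t the post’s code; the kernel name, the fixed TILE size, and the square-matrix assumption are all illustrative:

```cuda
// Minimal sketch: each block computes one TILE x TILE tile of C,
// staging tiles of A and B through shared memory for reuse.
#define TILE 32

__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // March across the K dimension one tile at a time.
    for (int t = 0; t < N; t += TILE) {
        // Cooperative, coalesced loads: each thread fetches one element per tile,
        // padding with zeros past the matrix edge.
        As[threadIdx.y][threadIdx.x] = (row < N && t + threadIdx.x < N)
            ? A[row * N + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (t + threadIdx.y < N && col < N)
            ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();  // tile fully staged before anyone reads it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // done reading before the next tile overwrites
    }

    if (row < N && col < N)
        C[row * N + col] = acc;
}
```

Each tile gets loaded from global memory once and reused TILE times from shared memory, which is the data-reuse win tiling buys.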
...
TL;DR: This post demystifies the core concepts behind CUDA, walking through how GPU kernels work, how threads and memory hierarchies are structured, and how to write and launch a kernel. It’s a hands-on introduction to CUDA for engineers who want to understand how deep learning frameworks really run under the hood (and how writing a few lines of CUDA can unlock massive speedups).
The repository with the corresponding PyTorch implementation is available here.
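For a flavor of what the post covers, here’s roughly the smallest complete CUDA program: write a kernel, launch it over a grid of threads, and read back the result. Names and sizes here are illustrative, not taken from the post’s repo:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes one output element; the guard handles the ragged tail.
__global__ void vec_add(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) out[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *out;
    // Unified memory keeps the sketch short; production code typically
    // uses cudaMalloc plus explicit cudaMemcpy.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;    // enough blocks to cover n
    vec_add<<<blocks, threads>>>(a, b, out, n);  // <<<grid, block>>> launch syntax
    cudaDeviceSynchronize();                     // kernel launches are asynchronous

    printf("out[0] = %f\n", out[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(out);
    return 0;
}
```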
...
I’ve seen a lot of people use pre-made agentic frameworks to handle various tasks, but I wanted to create a framework from scratch to see the brain, actions, and tools behind these agents.
In the coming years, I believe Robotaxi will scale like Uber, while Waymo will scale like Lyft - if at all. This reflects a broader paradigm shift: lean, vision-based neural networks vs. bulky, sensor-heavy autonomy.
I’ve trained hundreds of models in school and on my own, but barely any of them were served or pushed to production - I wanted to document what usually happens after training.
How far can you compress adjacency matrices before everything falls apart? I pushed LoRA to its limits and watched adjacency melt into noise.
AlphaGo blew my mind, and AlphaGo Zero was the cherry on top. Recreating this in a simple environment was a long-term dream of mine.
What happens when you take a modern ML model, quantize it like crazy, and try to deploy it on decade-old hardware (a potato)? I tested EMNIST on the edge — literally.
I felt like I would never understand how LLMs really generate text until I built a transformer from scratch and coded up multi-head self-attention in non-modular, functional-style PyTorch.
Attention is really useful, but like any great invention, it was built out of necessity for the problems at hand. In this post, I dive into the issues researchers were dealing with between 2013 and 2017.