

How to Lead AI Engineering Teams

Have you ever wondered why some teams seem to effortlessly deliver value while others stay busy but make no real progress?

I recently had a conversation that completely changed how I think about leading teams. While discussing team performance with a VP of Engineering who was frustrated with their team's slow progress, I suggested focusing on better standups and more experiments.

That's when Skylar Payne dropped a truth bomb that made me rethink everything:

"Leaders are living and breathing the business strategy through their meetings and context, but the people on the ground don't have any fucking clue what that is. They're kind of trying to read the tea leaves to understand what it is."

That moment was a wake-up call.

I had been so focused on the mechanics of execution that I'd missed something fundamental: The best processes in the world won't help if your team doesn't understand how their work drives real value.

In less than an hour, I learned more about effective leadership than I had in the past year. Let me share what I discovered.

SWE vs AI Engineering Standups

When I talk to engineering leaders struggling with their AI teams, I often hear the same frustration: "Why is everything taking so long? Why can't we just ship features like our other teams?"

This frustration stems from a fundamental misunderstanding: AI development isn't just engineering; it's applied research. And that changes everything about how we need to think about progress, goals, and team management. In a previous article, I wrote about communication for AI teams. Today I want to talk about standups specifically.

The ticket is not the feature; the ticket is the experiment; the outcome is learning.

The right way to do AI engineering updates

Helping software engineers enhance their AI engineering processes through rigorous and insightful updates.


In AI engineering, effective communication is crucial to project success. Consider two scenarios:

Scenario A: "We made some improvements to the model. It seems better now."

Scenario B: "Our hypothesis was that fine-tuning on domain-specific data would improve accuracy. We implemented this change and observed a 15% increase in F1 score, from 0.72 to 0.83, on our test set. However, inference time increased by 20ms on average."

Scenario B clearly provides more value and allows for informed decision-making. After collaborating with numerous startups on their AI initiatives, I've witnessed the transformative power of precise, data-driven communication. It's not just about relaying information; it's about enabling action, fostering alignment, and driving progress.
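In practice, it helps to make the Scenario B shape explicit. Here is a minimal sketch of that structure as a Pydantic model; the schema and field names are my own illustration, not a standard:

from pydantic import BaseModel


class ExperimentUpdate(BaseModel):
    # A structured standup update for one experiment (illustrative schema).
    hypothesis: str   # what we believed and why
    change: str       # what we actually did
    metric: str       # the metric we measured
    baseline: float   # value before the change
    result: float     # value after the change
    tradeoffs: str    # regressions or costs observed
    decision: str     # ship, iterate, or abandon


update = ExperimentUpdate(
    hypothesis="Fine-tuning on domain-specific data improves accuracy",
    change="Fine-tuned on in-domain examples",
    metric="F1 on held-out test set",
    baseline=0.72,
    result=0.83,
    tradeoffs="Average inference time increased by 20ms",
    decision="Ship, monitor latency",
)

Writing the update as data forces the team to state the hypothesis and the decision, not just the activity.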

What is prompt optimization?

Prompt optimization is the process of improving the prompts used to generate content, often by including a few examples of the desired output as context, then refining the prompt until it reliably produces more outputs like them.
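As a rough sketch of that loop (the generate and score helpers below are hypothetical stand-ins for your model call and your eval metric):

# Minimal sketch of a few-shot prompt optimization loop.
# `generate` and `score` are hypothetical stand-ins for your
# model call and your evaluation metric.

def optimize_prompt(candidates, examples, generate, score):
    best_prompt, best_score = None, float("-inf")
    few_shot = "\n\n".join(
        f"Input: {x}\nOutput: {y}" for x, y in examples
    )
    for candidate in candidates:  # candidate instruction variants
        prompt = f"{candidate}\n\n{few_shot}"
        outputs = [generate(prompt, x) for x, _ in examples]
        s = score(outputs, [y for _, y in examples])  # compare to desired outputs
        if s > best_score:
            best_prompt, best_score = prompt, s
    return best_prompt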

Hiring MLEs at early stage companies

Build fast, hire slow! I hate seeing companies make dumb mistakes, especially regarding hiring, and I'm not against full-time employment. Still, as a consultant, part-time engagements are often more beneficial to me, which influences my perspective on hiring. That said, I've observed two notable patterns in startup hiring practices: hiring too early and not hiring for dedicated research. Unfortunately, these patterns lead startups to hire machine learning engineers to bolster their generative AI strengths, only to have them perform janitorial work for their first six months. It makes me wonder if startups are making easy-to-correct mistakes out of insecurity about capturing the current wave of AI optimism.

Companies hire machine learning engineers too early in their life cycle

Many startups hire machine learning engineers too early in the development process, when the primary focus should be app development and integration work. A full-stack AI engineer provides much greater value at this stage, since the role is closer to full-stack development than specialized machine learning. Consequently, these misplaced machine learning engineers often end up assisting with app development or DevOps tasks instead of focusing on their core competencies: training models and building ML solutions.

After all, my background is in mathematics and physics, not engineering. I would rather spend my days looking at data than trying to spend two or three hours debugging TypeScript build errors.

Format your own prompts

This is mostly to add on to Hamel's great post, "Fuck you, show me the prompt".

I think too many LLM libraries try to format your strings in weird ways that don't make sense. An OpenAI call, for the most part, accepts an array of messages:

from typing import Literal

from pydantic import BaseModel

class Messages(BaseModel):
    # One chat message: the text plus who said it.
    content: str
    role: Literal["user", "system", "assistant"]
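For example, assuming the openai v1 Python client (the model name below is only a placeholder), the call is just a list of these messages:

from openai import OpenAI

client = OpenAI()
messages = [
    Messages(role="system", content="You are a helpful assistant."),
    Messages(role="user", content="Summarize this ticket."),
]
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any chat model works
    messages=[m.model_dump() for m in messages],
)
print(response.choices[0].message.content)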

But so many libraries want you to submit a string block and offer syntactic sugar to dress it up. They also tend to map the docstring to the prompt, so instead of accessing a string variable, I have to reach for the docstring via __doc__.
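To illustrate the pattern (this is a hypothetical decorator, not any particular library's API):

import functools

# Hypothetical decorator-based library, for illustration only:
# the docstring silently becomes the prompt.
def llm_call(fn):
    @functools.wraps(fn)  # copies fn.__doc__ onto the wrapper
    def wrapper(**kwargs):
        prompt = fn.__doc__.format(**kwargs)  # the prompt hides in __doc__
        return prompt  # a real library would send this to a model
    return wrapper

@llm_call
def summarize(text: str):
    """Summarize the following text:

    {text}"""

print(summarize(text="LLM libraries should show you the prompt."))
print(summarize.__doc__)  # the only way to inspect the template

Nothing about this is clearer than a plain string variable, and now the prompt is hidden behind decorator machinery.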

Tips for probabilistic software

This writing stems from my experience advising a few startups, particularly smaller ones with plenty of junior software engineers trying to transition into machine learning and related fields. From this work, I've noticed five questions that I want to address. My aim is that, by the end of this article, these younger developers will be equipped with key questions they can ask themselves to improve their ability to make decisions under uncertainty (a sketch of how to write the answers down follows the list):

  1. Could an experiment just answer my questions?
  2. What specific improvements am I measuring?
  3. How will the result help me make a decision?
  4. Under what conditions will I reevaluate if results are not positive?
  5. Can I use the results to update my mental model and plan future work?
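One way to operationalize these questions is to write the answers down before running anything. A minimal sketch, with a schema of my own invention:

from dataclasses import dataclass


@dataclass
class ExperimentPlan:
    # Answers to the five questions above, written down before any code runs.
    question: str      # 1. what the experiment should answer
    metric: str        # 2. the specific improvement being measured
    decision: str      # 3. how the result maps to a decision
    reevaluation: str  # 4. when to reevaluate if results are not positive
    followup: str      # 5. how results update the mental model and future work


plan = ExperimentPlan(
    question="Does reranking retrieved chunks improve answer relevance?",
    metric="Relevance score on a held-out eval set",
    decision="Ship the reranker if relevance improves by at least 5 points",
    reevaluation="If flat, test whether retrieval recall is the bottleneck",
    followup="A positive result suggests investing further in retrieval quality",
)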

Recommendations with Flight at Stitch Fix

As a data scientist at Stitch Fix, I faced the challenge of adapting recommendation code for real-time systems. With the absence of standardization and proper performance testing, tracing, and logging, building reliable systems was a struggle.

To tackle these problems, I created Flight, a framework that acts as a semantic bridge and integrates multiple systems within Stitch Fix. It provides modular operator classes for data scientists to develop and offers three levels of user experience.
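Flight itself is internal to Stitch Fix, so purely as an illustration of the modular-operator idea (this is not Flight's actual API), the pattern looks something like this:

from abc import ABC, abstractmethod


class Operator(ABC):
    # A composable unit of recommendation logic.
    @abstractmethod
    def run(self, data: dict) -> dict:
        ...


class FilterInStock(Operator):
    def run(self, data: dict) -> dict:
        data["items"] = [i for i in data["items"] if i.get("in_stock")]
        return data


class RankByScore(Operator):
    def run(self, data: dict) -> dict:
        data["items"] = sorted(data["items"], key=lambda i: -i["score"])
        return data


class Pipeline:
    def __init__(self, operators: list[Operator]):
        self.operators = operators

    def run(self, data: dict) -> dict:
        for op in self.operators:  # each operator transforms and passes along
            data = op.run(data)
        return data


pipeline = Pipeline([FilterInStock(), RankByScore()])
result = pipeline.run({"items": [
    {"sku": "a", "score": 0.90, "in_stock": True},
    {"sku": "b", "score": 0.95, "in_stock": False},
]})

Because each operator has the same interface, data scientists can write and test a single step in isolation, then compose steps into pipelines that the platform can trace and performance-test uniformly.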