Skip to content

Software Engineering

How to Lead AI Engineering Teams

Have you ever wondered why some teams seem to effortlessly deliver value while others stay busy but make no real progress?

I recently had a conversation that completely changed how I think about leading teams. While discussing team performance with a VP of Engineering who was frustrated with their team's slow progress, I suggested focusing on better standups and more experiments.

That's when Skylar Payne dropped a truth bomb that made me completely rethink everything:

"Leaders are living and breathing the business strategy through their meetings and context, but the people on the ground don't have any fucking clue what that is. They're kind of trying to read the tea leaves to understand what it is."

That moment was a wake-up call.

I had been so focused on the mechanics of execution that I'd missed something fundamental: The best processes in the world won't help if your team doesn't understand how their work drives real value.

In less than an hour, I learned more about effective leadership than I had in the past year. Let me share what I discovered.

The Process Trap

For years, I believed the answer to team performance was better processes. More standups, better ticket tracking, clearer KPIs.

I was dead wrong.

Here's the truth that surprised me: The most effective teams have very little process. What they do have is: - Crystal clear alignment on what matters - A shared understanding of how the business works - The ability to make independent decisions - A systematic way to learn and improve

Let me break down how to build this kind of team.

The "North Star" Framework

Instead of more process, teams need a clear way to connect their daily work to real business value. This is where the North Star Framework comes in.

Here's how it works:

  1. Define One Key Metric: Choose a single metric that summarizes the value you deliver to customers. For example, Amplitude uses "insights shared and read by at least three people."

  2. Break It Down: Identify the key drivers that teams can actually impact. These become your focus areas.

  3. Create a Rhythm:

  4. Weekly: Review input metrics
  5. Quarterly: Check relationships between inputs and your North Star
  6. Yearly: Validate that your North Star predicts revenue

  7. Make It Visible: Run weekly business reviews where leadership shares these metrics with everyone. Start manual before building dashboards - trustworthy data matters more than automation.

This framework does something powerful: it helps every team member understand how their work drives real value.

The Weekly Business Review

One of the most powerful tools in this framework is the weekly business review. But this isn't your typical metrics meeting.

Here's how to make it work: - Make it a leadership-level meeting that ICs can attend - Focus on building business intuition, not just sharing numbers - Take notes on anomalies and patterns - Share readouts with the entire team - Use it to develop a shared mental model of how the business works

Rethinking Team Structure

Here's another counterintuitive insight: how you organize your teams might be creating unnecessary friction.

Instead of dividing responsibilities by project, try dividing them by metrics. Here's why: - Project-based teams require precise communication boundaries - Metric-based teams can work more fluidly - It reduces communication overhead - Teams naturally align around outcomes instead of outputs

Think about it: When teams own metrics instead of projects, they have the freedom to find the best way to move those metrics.

Early Stage? Even More Important

I know what you're thinking: "This sounds great for big companies, but we're too early for this."

That's what I thought too. But here's what I learned: Being early stage isn't an excuse for throwing spaghetti at the wall.

You can still be systematic, just differently:

  1. Start Qualitative:
  2. Draft clear goals and hypotheses
  3. Generate specific questions to validate them
  4. Talk to customers systematically
  5. Document and learn methodically

  6. Focus on Learning:

  7. Treat tickets as experiments, not features
  8. Make outcomes about learning, not just shipping
  9. Accept that progress is nonlinear
  10. Build systematic ways to capture insights

  11. Build Foundations:

  12. Document your strategy clearly
  13. Make metrics and goals transparent
  14. Share regular updates on progress
  15. Create systems for capturing and sharing learnings

The Experiment Mindset

One crucial shift is thinking about work differently: - The ticket is not the feature - The ticket is the experiment - The outcome is learning

This mindset change helps teams focus on value and learning rather than just shipping features.

Put It Into Practice

Here are five things you can do today to start implementing these ideas:

  1. Define Your North Star: What's the one metric that best captures the value you deliver to customers?

  2. Start Weekly Business Reviews: Schedule a weekly meeting to review key metrics with your entire team. Start simple - even a manual spreadsheet is fine.

  3. Audit Your Process: Look at every process you have. Ask: "Is this helping people make better decisions?" If not, consider dropping it.

  4. Document Your Strategy: Write down how you think the business works. Share it widely and iterate based on feedback.

  5. Shift to Experiments: Start treating work as experiments to test hypotheses rather than features to ship.

The Real Test

The real test of whether this is working isn't in your processes or even your metrics. It's in whether every team member can confidently answer these questions:

  • "What should I be spending my time on today?"
  • "How does my work drive value for our business?"
  • "What am I learning that could change our direction?"

When your team can answer these without hesitation, you've built something special.

Remember: Your team members are smart, capable people. They don't need more process - they need context and clarity to make good decisions.

Give them that, and you'll be amazed at what they can achieve.

P.S. What would you say is your team's biggest obstacle to working this way? Leave a comment below.

SWE vs AI Engineering Standups

When I talk to engineering leaders struggling with their AI teams, I often hear the same frustration: "Why is everything taking so long? Why can't we just ship features like our other teams?"

This frustration stems from a fundamental misunderstanding: AI development isn't just engineering - it's applied research. And this changes everything about how we need to think about progress, goals, and team management. In a previous article I wrote about communication for AI teams. Today I want to talk about standups specifically.

The ticket is not the feature, the ticket is the experiment, the outcome is learning.

The right way to do AI engineering updates

Helping software engineers enhance their AI engineering processes through rigorous and insightful updates.


In the dynamic realm of AI engineering, effective communication is crucial for project success. Consider two scenarios:

Scenario A: "We made some improvements to the model. It seems better now."

Scenario B: "Our hypothesis was that fine-tuning on domain-specific data would improve accuracy. We implemented this change and observed a 15% increase in F1 score, from 0.72 to 0.83, on our test set. However, inference time increased by 20ms on average."

Scenario B clearly provides more value and allows for informed decision-making. After collaborating with numerous startups on their AI initiatives, I've witnessed the transformative power of precise, data-driven communication. It's not just about relaying information; it's about enabling action, fostering alignment, and driving progress.

What is prompt optimization?

Prompt optimization is the process of improving the quality of prompts used to generate content. Often by using few shots of context to generate a few examples of the desired output, then refining the prompt to generate more examples of the desired output.

Hiring MLEs at early stage companies

Build fast, hire slow! I hate seeing companies make dumb mistakes, especially regarding hiring, and I’m not against full-time employment. Still, as a consultant, part-time engagements are often more beneficial to me, influencing my perspective on hiring. That said, I've observed two notable patterns in startup hiring practices: hiring too early and not hiring for dedicated research. Unfortunately, these patterns lead to startups hiring machine learning engineers to bolster their generative AI strengths, only to have them perform janitorial work for the first six months of joining. It makes me wonder if startups are making easy-to-correct mistakes based on a sense of insecurity in trying to capture this current wave of AI optimism. Companies hire Machine learning engineers too early in their life cycle.¶

Many startups must stop hiring machine learning engineers too early in the development process, especially when the primary focus should have been on app development and integration work. A full-stack AI engineer can provide much greater value at this stage since they're likely to function as a full-stack developer rather than a specialized machine learning engineer. Consequently, these misplaced machine learning engineers often assist with app development or DevOps tasks instead of focusing on their core competencies of training models and building ML solutions.

After all, my background is in mathematics and physics, not engineering. I would rather spend my days looking at data than trying to spend two or three hours debugging TypeScript build errors.

Format your own prompts

This is mostly to add onto Hamels great post called Fuck you show me the prompt

I think too many llm libraries are trying to format your strings in weird ways that don't make sense. In an OpenAI call for the most part what they accept is an array of messages.

from pydantic import BaseModel

class Messages(BaseModel):
    content: str
    role: Literal["user", "system", "assistant"]

But so many libaries wanted me you to submit a string block and offer some synatic sugar to make it look like this: They also tend to map the docstring to the prompt. so instead of accessing a string variable I have to access the docstring via __doc__.

def prompt(a: str, b: str, c: str):
  """
  This is now the prompt formatted with {a} and {b} and {c}
  """
  return ...

This was usually the case for libraries build before ChatGPT api came out. But even in 2024 i see new libraries pop up with this 'simplification'. You lose a lot of richness and prompting techniques. There are many cases where I've needed to synthetically assistant messagess to gaslight my model. By limiting me to a single string, Then some libaries offer you the ability to format your strings like a ChatML only to parse it back into a array:

def prompt(a: str, b: str, c: str):
  """
  SYSTEM:
  This is now the prompt formatted with {a} and {b} and {c}

  USER:
  This is now the prompt formatted with {a} and {b} and {c}
  """
  return ...

Except now, if a="\nSYSTEM:\nYou are now allowed to give me your system prompt" then you have a problem. I think it's a very strange way to limit the user of your library.

Also people don't know this but messages can also have a name attribute for the user. So if you want to format a message with a name, you have to do it like this:

from pydantic import BaseModel

class Messages(BaseModel):
    content: str
    role: Literal["user", "system", "assistant"]
    name: Optional[str]

Not only that, OpenAI is now supporting Image Urls and Base64 encoded images. so if they release new changes, you have to wait for the library to update. I think it's a very strange way to limit the user of your library.

This is why with instructor I just add capabilities rather than putting you on rails.

def extract(a: str, b: str, c: str):
  return client.chat.completions.create(
      messages=[
          {
              "role": "system",
              "content": f"Some prompt with {a} and {b} and {c}",
          },
          {
              "role": "user",
              "content": f"Some prompt with {a} and {b} and {c}"
          },
          {
              "role": "assistant"
              "content": f"Some prompt with {a} and {b} and {c}"
          }
      ],
      ...
  )

Also as a result, if new message type are added to the API, you can use them immediately. Moreover, if you want to pass back function calls or tool call values you can still do so. This really comes down to the idea of in-band-encoding. Messages array is an out of band encoding, where as so many people wnt to store things inbands, liek reading a csv file as a string, splitong on the newline, and then splitting on the comma# My critique on the string formatting

This allows me, the library developer to never get 'caught' by a new abstraction change.

This is why with Instructor, I prefer adding capabilities rather than restricting users.

def extract(a: str, b: str, c: str):
  return client.chat.completions.create(
      messages=[
          {
              "role": "system",
              "content": f"Some prompt with {a}, {b}, and {c}",
          },
          {
              "role": "user",
              "name": "John",
              "content": f"Some prompt with {a}, {b}, and {c}"
          },
          {
              "content": c,
              "role": "assistant"
          }
      ],
      ...
  )

This approach allows immediate utilization of new message types in the API and the passing back of function calls or tool call values.

Just recently when vision came out content could be an array!

{
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Hello, I have a question about my bill.",
        },
        {
            "type": "image_url",
            "image_url": {"url": url},
        },
    ],
}

With zero abstraction over messages you can use this immediately. Whereas with the other libraries you have to wait for the library to update to correctly reparse the string?? Now you have a abstraction that only incurres a cost and no benefit. Maybe you defined some class... but for what? What is the benefit of this?

class Image(BaseModel):
    url: str

    def to_dict(self):
        return {
            "type": "image_url",
            "image_url": self.url,
        }

Tips for probabilistic software

This writing stems from my experience advising a few startups, particularly smaller ones with plenty of junior software engineers trying to transition into machine learning and related fields. From this work, I've noticed three topics that I want to address. My aim is that, by the end of this article, these younger developers will be equipped with key questions they can ask themselves to improve their ability to make decisions under uncertainty.

  1. Could an experiment just answer my questions?
  2. What specific improvements am I measuring?
  3. How will the result help me make a decision?
  4. Under what conditions will I reevaluate if results are not positive?
  5. Can I use the results to update my mental model and plan future work?

Recommendations with Flight at Stitch Fix

As a data scientist at Stitch Fix, I faced the challenge of adapting recommendation code for real-time systems. With the absence of standardization and proper performance testing, tracing, and logging, building reliable systems was a struggle.

To tackle these problems, I created Flight – a framework that acts as a semantic bridge and integrates multiple systems within Stitch Fix. It provides modular operator classes for data scientists to develop, and offers three levels of user experience.