ML PM: 5 questions to ask to ramp up on a new machine learning product
Use these questions to quickly ramp up on your new product area
Welcome! This is a two-for-one deal of the “Product Craft” and “Maybe ML” bits of MarketMaker.
Getting to know your new product child
Yay, you did it! You snagged that new PM gig, product managing an ML product (or products), and now you need to… quickly learn everything about that model. Like, 15 minutes ago. Ayup.
Here are five questions I would ask the pod (engineers/design/etc.) to help me ramp up as quickly as possible on our new darling product child. Note that this might apply more to a standard predictive ML model (e.g. fraud prediction, search ranking); generative AI might be a bit of a different ball game.
1: What is the model’s objective function? What is it trying to predict, or do, or optimize for?
First and foremost, you should learn what the model’s objective is, and go really deep into this. “It predicts fraudulent transactions” is far too superficial. Is it a binary fraud or no fraud outcome that we’re trying to predict, or a continuous variable? What’s the natural occurrence rate or natural distribution of the outcome? Is the outcome data cleanly and automatically captured, or are we relying on human-in-the-loop cross-checks or manual data entry?
Knowing the exact thing the model is trying to predict has a huge influence on how to improve it and drive value. You’ll use different ML and statistical techniques - for example, a binary fraud-or-no-fraud model with a natural 2% occurrence rate will have different goodness-of-fit measures than a model that tries to predict a house’s sale price in 30 years.
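To make that concrete, here’s a tiny, hedged sketch (scikit-learn assumed, data entirely made up) of how the evaluation follows from the objective: a rare-event binary classifier gets judged on something like PR-AUC, while a continuous price model gets judged in dollars of error.

```python
# A minimal sketch (toy data, scikit-learn assumed) of why the objective shapes
# the evaluation: a rare-event binary classifier vs. a continuous-price regressor.
import numpy as np
from sklearn.metrics import average_precision_score, mean_absolute_error

rng = np.random.default_rng(0)

# Binary fraud model: ~2% natural occurrence rate.
y_true_fraud = rng.random(10_000) < 0.02   # 1 = fraud, 0 = not fraud
y_score_fraud = rng.random(10_000)         # model's predicted fraud probability

# "98% accuracy" is trivial here (just predict "not fraud" every time), so use
# a precision-recall style metric that respects the 2% base rate instead.
print("PR-AUC:", average_precision_score(y_true_fraud, y_score_fraud))

# House price model: a continuous outcome, so error is measured in dollars.
y_true_price = rng.normal(500_000, 100_000, size=1_000)
y_pred_price = y_true_price + rng.normal(0, 50_000, size=1_000)
print("MAE ($):", mean_absolute_error(y_true_price, y_pred_price))
```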
Understanding the objective function in great detail also sets you up to have better cross-functional discussions or spot misalignments. You may find that the model, say, is optimizing for “user makes a purchase”, but actually the business cares about revenue, so perhaps the objective should be changed to “maximize revenue earned per session”.
2: How do we know if a new version of the model is better than an old one?
Having a really rigorous approach to testing and comparing models is what separates good ML teams from bad ML teams.
Ideally, your team should have a clear and fixed set of metrics that they look at for each newly trained model.
Clear: There is some rigorous, numerical, statistical test for separation, e.g. a p-value of <0.05 indicates that NewModel outperforms ExistingModel. Gut feel is not acceptable. Eyeballing of charts is a bit of a gray area, but can be helpful.
Fixed: This is to avoid cherry-picking. It’s way too easy to produce 100 metrics, be like, “oh, these 8 look better,” and claim the new model is better. You’ll want to use the same core set of key metrics (preferably no more than 5) each time, so that all models are held to the same standard.
You should also ask whether the metrics used in offline training differ from how the model is evaluated in the wild. This is pretty common: for example, a team may use a statistical metric such as precision-recall area under the curve (PR-AUC) during training to judge whether a new model outperforms the existing one, but when the models are A/B tested against each other, a business metric such as fraud cost prevented may be used instead. This is especially true when counterfactuals can’t be simulated during training (e.g. because NewAlgo matched worker A to the job rather than ExistingAlgo’s worker B, we need to actually observe worker A working that job to know whether they were a better fit than worker B).
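To make “clear and fixed” a bit more concrete, here’s a minimal sketch (synthetic data, scikit-learn and NumPy assumed) of comparing NewModel against ExistingModel on one fixed metric, PR-AUC, with a paired bootstrap instead of gut feel:

```python
# Minimal sketch: compare NewModel vs. ExistingModel on ONE fixed metric (PR-AUC)
# using a paired bootstrap on a shared holdout set. All data here is synthetic.
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(42)

y_true = rng.random(5_000) < 0.02                  # 2% positive rate, e.g. fraud labels
scores_existing = rng.random(5_000)                # ExistingModel holdout scores (placeholder)
scores_new = np.clip(scores_existing + 0.1 * y_true, 0, 1)  # NewModel, slightly better on positives

def pr_auc_delta(idx):
    """PR-AUC(new) - PR-AUC(existing) on one bootstrap resample."""
    return (average_precision_score(y_true[idx], scores_new[idx])
            - average_precision_score(y_true[idx], scores_existing[idx]))

n = len(y_true)
deltas = np.array([pr_auc_delta(rng.integers(0, n, n)) for _ in range(1_000)])

# If the 95% bootstrap interval for the delta sits entirely above zero, that's
# the kind of numerical evidence (not gut feel) you want to see.
lo, hi = np.percentile(deltas, [2.5, 97.5])
print(f"PR-AUC delta 95% CI: [{lo:.4f}, {hi:.4f}]")
```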
3: What are the most important predictive variables, and how do they get updated for inference?
Many models exhibit an 80-20 or long-tail distribution of feature importance: the top 5 to 10 model features (i.e. input variables) will have an outsized impact on the model’s output and performance.
Learning what these are, and exactly how they are computed and piped in, is valuable for several reasons:
Product / user intuition: Are these the features you would have expected, or do you find them surprising? For example, it’s not surprising that an item’s price or sales in the last 30 days are important for e-commerce search ranking; but if you saw something like “whether or not the item description has emojis”, that might be worth a deeper dive with your team, or you might learn some strange new insight about your users.
Dependencies and performance threats: We would all love clean, stable data, but a common firedrill pattern I’ve seen is that an upstream team changes how something is computed, or renames a tracking event, and then boom: suddenly your Very Important Feature is zeros or nulls down the entire column, and very bad things happen to your model. If your ML systems are more mature, there should be hardening or automated testing against this, but especially in a startup environment you’ll want to keep an eye on these key inputs (see the sketch after this list).
Model refreshes: How the underlying data points change over time affects how often you’ll need to do a pure model update. (A pure model update is re-training a model just to counter model drift, or natural changes to what the model weights should be). For example, if a key feature is “number of bookings in the last 30 days”, and the data pipeline runs once a day, a certain hotel may have “7” for this value on Jan 1, and have “2” for this value on Feb 1. That hotel will then naturally sink lower in search results, even if you do not update your model weights. If you rely a lot on lifetime count features, however (e.g. lifetime number of subscribers to a newsletter), you may need to do model retrains more often as the weights need to be adjusted to account for higher possible values of that feature.
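Here’s the sketch promised above, a minimal, hypothetical example (scikit-learn and pandas assumed; the model, features, and column names are invented) of pulling the top features out of a trained model and running a crude all-null / all-zero guardrail on them:

```python
# Minimal sketch: surface the top features of a trained model, then run a crude
# data-quality check on them. Model, data, and column names are all hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

# Hypothetical feature table for a fraud model.
features = pd.DataFrame({
    "txn_amount": [20.0, 950.0, 5.0, 310.0] * 50,
    "bookings_last_30d": [7, 2, 0, 4] * 50,
    "account_age_days": [12, 800, 3, 95] * 50,
})
labels = [0, 1, 0, 0] * 50

model = GradientBoostingClassifier().fit(features, labels)

# 1) Which features actually drive the output? (The 80/20 pattern in practice.)
imp = permutation_importance(model, features, labels, n_repeats=5, random_state=0)
top = pd.Series(imp.importances_mean, index=features.columns).sort_values(ascending=False)
print(top.head(10))

# 2) Crude guardrail: alert if a key feature suddenly comes through all-null or
#    all-zero, e.g. because an upstream team renamed a tracking event.
key_features = top.head(5).index
for col in key_features:
    if features[col].isna().all() or (features[col] == 0).all():
        print(f"ALERT: key feature '{col}' is entirely null/zero - check the upstream pipeline")
```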
4: What other models interact with our model(s) before it reaches the end user?
For all the hand-wringing about models inside of models, realistically, most ML teams at mid-to-large sized companies operate alongside other ML teams. The two diagrams below show past set-ups I’ve worked in, where my team’s model either ingested the outputs of other models, and/or had its own output combined or modified in some way before reaching the end user.

Example Structure 1, where my team and a sibling product team both used the same model inputs from other ML teams

Example Structure 2, where my team’s output was combined with that of 3 other models
Knowing this flow is important for three reasons:
Root cause analysis: If something gets borked, having a good mental map of your product’s upstream inputs and downstream systems will help you quickly triage the root cause.
Partner cultivation: You’ll want to build a good relationship ASAP with the PMs of those other models, and potentially work with them on multi-team initiatives.
Roadmapping and value estimation: If the outputs of your model are heavily modified before reaching the end user, this may influence which items are most valuable on your roadmap. Small model accuracy improvements may be wiped out by the time the output reaches the end user, in which case you may want to focus on inference / model serving speed or on bigger step changes instead (a toy sketch of this dilution follows below).
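As a toy illustration of Example Structure 2 (the blending weights and model names below are entirely invented, not anyone’s real system), here’s how a downstream layer might combine several teams’ scores before anything reaches the user, and why a small win in one input gets diluted:

```python
# Toy sketch of "Example Structure 2": a downstream layer blends my team's score
# with three sibling models' scores before the result reaches the end user.
# Weights and model names are invented for illustration only.

def final_ranking_score(my_score: float,
                        pricing_score: float,
                        availability_score: float,
                        personalization_score: float) -> float:
    """Hypothetical downstream blend; only this combined number reaches the user."""
    return (0.4 * my_score
            + 0.3 * pricing_score
            + 0.2 * availability_score
            + 0.1 * personalization_score)

# A 2% relative improvement in my model's score moves the final score by only
# ~0.7% here, which is how small accuracy wins get diluted downstream.
before = final_ranking_score(0.50, 0.70, 0.60, 0.40)
after = final_ranking_score(0.51, 0.70, 0.60, 0.40)
print(before, after, after - before)
```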
5: How is our model actually used in real life? Do humans override it?
In some applications, the final system outcome is passed straight from the machines to the end user - for example, if the model says that Google’s search results should be displayed in the order {A, B, C, D}, we’re pretty sure the search results page will show them in the order {A, B, C, D}.
But in many other cases, you’ll discover humans being itchy-fingered, or trying to override the model outcomes somehow. In this NBER working paper, human bail judges overrode the algorithm’s recommendations 18% of the time (and, awkwardly, the paper finds that in 90% of those overrides, the judge underperformed the algorithm). Anecdotally, I’ve also had friends cancel Uber or Lyft rides if they get a driver with a low star rating or one too far away, which is a way that humans try to override the dispatch model.
Overrides like these introduce noise into your training data, and may cause model performance issues if not properly separated out and addressed.
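A minimal sketch of that “separate out and address” step (pandas assumed; the ride-matching columns are hypothetical), tagging overridden outcomes so they can be excluded from training or modeled explicitly:

```python
# Minimal sketch: tag human-overridden outcomes so they don't silently pollute
# the training set. Data and column names are hypothetical.
import pandas as pd

events = pd.DataFrame({
    "ride_id": [1, 2, 3, 4],
    "model_assigned_driver": ["A", "B", "C", "D"],
    "actual_driver": ["A", "E", "C", None],   # ride 2: rider cancelled and re-matched; ride 4: cancelled outright
    "completed": [True, True, True, False],
})

# Flag rows where the realized outcome differs from what the model chose.
events["human_override"] = events["actual_driver"].notna() & (
    events["actual_driver"] != events["model_assigned_driver"]
)

# Option 1: train only on non-overridden, completed events...
clean_training_set = events[events["completed"] & ~events["human_override"]]

# ...but also track the override rate itself - it's a product signal, not just noise.
print("override rate:", events["human_override"].mean())
print(clean_training_set)
```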
But from a PM perspective, this is also valuable: it’s a way that users are trying to show you that something about your model’s outputs is unexpected to them. This could be a great opportunity for an improvement or a new feature for you to build.