The Problem: Did it work?
Imagine you’re a Product Manager. Your job is to ship things (products, features, or marketing campaigns) that make an impact on the company’s bottom line. Knowing when something doesn’t work is just as important as knowing when it does, so you can course-correct quickly. But measuring impact is extremely difficult.
Existing impact analysis methods fall short. One method is to compare your key metrics before and after the launch. But there are too many confounding variables for this to be reliable. (Maybe marketing released a campaign that drove engagement, or maybe it’s the start of the busy season.) Another is to A/B test the feature. But tests are time-consuming and slow down development. More importantly, an A/B test withholds a valuable feature from half of your customers until results come back, which makes for a poor experience for those users. Apple famously shies away from A/B testing for this very reason.
Our challenge:
How can we help customers get a reliable assessment of impact without slowing the development process?
================
What is the Impact Report?
The Impact report is a powerful, one-of-a-kind report in the product analytics industry that uses propensity matching (driven by a machine learning model) to accurately measure the impact of both new and past launches. It does this by:
• Segmenting customers by used / not used (adopters vs non-adopters)
• Segmenting by days since first use (a zero-based timeline centered on each user’s first use of the new feature). This controls for people using the feature for the first time on different days.
• Segmenting by propensity to do the goal metric. This controls for self-selection bias: your most active users tend not only to perform your key events frequently, but also to try out new features more often. Without any correction for self-selection, the adopter group will tend to be filled with your most active users. Unsurprisingly, those adopters would have better numbers than the non-adopters, which is a misleading result.
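The zero-based timeline from the second bullet can be sketched in a few lines. This is a minimal illustration, not Mixpanel’s implementation: the `(user_id, event_name, date)` tuple shape, the event names, and the `align_to_first_use` helper are all hypothetical.

```python
from datetime import date

def align_to_first_use(events, launch_event):
    """Re-index each adopter's events by days since their first use of launch_event.

    events: list of (user_id, event_name, date) tuples (hypothetical shape).
    Returns {user_id: [(day_offset, event_name), ...]} for adopters only.
    """
    # Find the first day each user did the launch event.
    first_use = {}
    for user, name, day in events:
        if name == launch_event:
            if user not in first_use or day < first_use[user]:
                first_use[user] = day

    # Express every event of an adopter relative to that day (day 0).
    aligned = {user: [] for user in first_use}
    for user, name, day in events:
        if user in first_use:
            aligned[user].append(((day - first_use[user]).days, name))
    for offsets in aligned.values():
        offsets.sort()
    return aligned

events = [
    ("ann", "purchase", date(2020, 1, 1)),
    ("ann", "new_feature", date(2020, 1, 3)),
    ("ann", "purchase", date(2020, 1, 5)),
    ("bob", "new_feature", date(2020, 1, 10)),
    ("bob", "purchase", date(2020, 1, 11)),
    ("cat", "purchase", date(2020, 1, 2)),  # never adopts, so not on the timeline
]
timeline = align_to_first_use(events, "new_feature")
```

Ann and Bob adopted a week apart, but after alignment both have their first use at day 0, so their before and after behavior can be compared on the same axis.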
What is Propensity Modeling?
Propensity matching uses a machine learning model to identify sets of users who have a similar likelihood, or propensity, to use the new feature, and buckets them into groups. Within each propensity group, the model will compare the behavior of adopters and non-adopters to calculate the delta between them. Finally, it takes the average of the deltas from each propensity group, weighted by group size, to determine the overall impact of the feature. A 95% confidence interval serves as the cherry on top, to confirm whether or not the result is statistically significant. In short, propensity matching enables you to derive precise assessments of impact via the observational data you already gather — no cumbersome testing required.
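The bucket, compare, and weight steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the report’s actual algorithm: the `(propensity, adopted, goal_rate)` tuple shape, the equal-sized buckets, and the normal approximation behind the 95% interval (`z=1.96`) are all choices made for the example.

```python
import math
from statistics import mean, variance

def stratified_impact(users, n_buckets=2, z=1.96):
    """users: list of (propensity, adopted, goal_rate) tuples (hypothetical shape).

    Returns (impact, (ci_low, ci_high)).
    """
    # Bucket users with similar propensity together (equal-sized buckets assumed).
    ranked = sorted(users, key=lambda u: u[0])
    size = math.ceil(len(ranked) / n_buckets)
    buckets = [ranked[i:i + size] for i in range(0, len(ranked), size)]

    total, weighted_delta, weighted_var = 0, 0.0, 0.0
    for bucket in buckets:
        adopters = [u[2] for u in bucket if u[1]]
        non_adopters = [u[2] for u in bucket if not u[1]]
        if len(adopters) < 2 or len(non_adopters) < 2:
            continue  # too few users on one side to compare in this bucket
        # Delta between adopters and non-adopters within the bucket.
        delta = mean(adopters) - mean(non_adopters)
        var = (variance(adopters) / len(adopters)
               + variance(non_adopters) / len(non_adopters))
        n = len(bucket)
        total += n
        weighted_delta += n * delta
        weighted_var += n ** 2 * var

    if total == 0:
        raise ValueError("no bucket had enough adopters and non-adopters")
    # Average of the bucket deltas, weighted by bucket size.
    impact = weighted_delta / total
    std_error = math.sqrt(weighted_var) / total
    return impact, (impact - z * std_error, impact + z * std_error)

users = [
    # Low-propensity users
    (0.10, False, 1.0), (0.15, False, 2.0), (0.20, False, 3.0),
    (0.25, True, 3.0), (0.30, True, 5.0),
    # High-propensity users
    (0.70, False, 7.0), (0.75, False, 9.0),
    (0.80, True, 9.0), (0.85, True, 10.0), (0.90, True, 11.0),
]
impact, (ci_low, ci_high) = stratified_impact(users)

# Naive comparison that ignores propensity, for contrast.
naive = (mean(u[2] for u in users if u[1])
         - mean(u[2] for u in users if not u[1]))
```

In this toy data the naive adopter vs non-adopter gap is larger than the stratified impact, because the adopter group is dominated by high-propensity users. That is the self-selection effect the bucketing is meant to remove.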
Our Challenge:
How can we make the Impact report easy to understand while still fitting within the way Mixpanel currently works?
I inherited Impact from a senior designer who left the company.
Process
Discovery - Identify users, goals (gap data, stakeholder interviews, define personas)
Define scope (define & prioritize user stories)
Usability Testing, pain point synthesis
Design - Competitive & market research, design, gather feedback & iterate
Results & Takeaways - Success metrics, feedback, learnings, future work
Project Details
Role: Lead/Solo Designer - research, UI design
Environment: Desktop web
Duration: Oct 2019 - Feb 2020
Results
Impact Report Adoption: 52% adoption (averaged over the last 30 days)
Impact Queries per Day: 630 queries per day (averaged over the last 30 days)
Impact Report Views per Day: 283 views per day (averaged over the last 30 days)
Impact Report accounts for 74% of all report views in the last 30 days
“The Impact report has become one of my favorites in Mixpanel. It allows us to quickly assess how successful our product and marketing releases are and see exactly how each of them affects the actions we care about the most within our product.” —Talya Heller, Director of Product, Rêv Worldwide
===
Background Context:
Inherited the Impact report from a Sr. Designer who left
Take Impact over the finish line to be released to the general public (GA)
Goals needed for GA:
how to handle multiple metrics
add time window selector
how to handle propensity model layer *
rethink report, QB
The Challenge:
How can we help Product Managers learn whether their launch made an impact (even in the absence of an A/B test) using machine learning?
The Problem: Design an experience from the ground up to show how a launch affected core KPIs.
Who are we designing for? Who uses Impact?
PM, statistician, layperson?
Simplicity vs confidence:
HMW show the results of a complex statistical model (propensity modeling) in a simple way?
How does this experience fit into the rest of MP?
HMW design a simple experience while “showing our work” so people can follow how we arrived at the result?
What is Propensity Modeling?
Propensity modeling: technique used to infer causality. Can approximate an AB test by grouping similar users by their likelihood (propensity) to do X [launch event].
A true randomized controlled experiment is like comparing a set of twins (Jan 1 and Jan 2 are identical in every way), except that Jan 2 gets the treatment (uses the feature). Propensity modeling approximates this by grouping
Trains a probabilistic model to predict whether a user will
Drank Soylent (A)
Did not drink Soylent (NA)
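As a rough sketch of what “training a probabilistic model to predict adoption” can look like: the notes do not say which model the report uses, so the logistic regression below, the single `sessions_per_week` feature, and the helper names are all assumptions made for illustration.

```python
import math

def predict(weights, bias, x):
    """Predicted probability (propensity) that a user with features x adopts."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def train_propensity_model(features, adopted, lr=0.1, epochs=2000):
    """Fit a logistic regression with plain batch gradient descent."""
    n = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, adopted):
            err = predict(weights, bias, x) - y  # gradient of the log loss
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Hypothetical training data: one feature (sessions per week) per user,
# and whether that user did the launch event (1 = adopter, 0 = non-adopter).
sessions_per_week = [[1], [2], [3], [4], [5], [6], [7], [8]]
adopted = [0, 0, 1, 0, 1, 0, 1, 1]

weights, bias = train_propensity_model(sessions_per_week, adopted)
p_light_user = predict(weights, bias, [1])
p_heavy_user = predict(weights, bias, [8])
```

The model scores heavier users as more likely to adopt, and users with similar scores can then be grouped and compared as adopters (A) vs non-adopters (NA).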
(1/7 core analytics reports)
Context:
Scope/My Role: Designer working alongside one other designer to bring Launch Analysis report from initial concept to general release.
Context: Did it work?
Imagine you’re a PM and your job is to launch products/features/marketing campaigns that make an impact on your business KPIs. But in a company where every team is constantly shipping, it’s hard to know how much of what YOU did made an impact.
To measure whether X affected Y, the gold standard is a randomized controlled experiment (AB test: segment users into control/treatment groups, keeping all variables the same except for what you’re measuring). But AB tests aren’t always feasible. Cons: it takes time to reach stat sig, and customers who don’t get the new feature aren’t happy (why Apple famously doesn’t do AB tests).
The Impact report determines whether a launch made an impact in the absence of an AB test using statistical modeling and ML. (*Causal inference is a stat tool that lets ML algorithms predict the likelihood each user will do the launch event. It approximates an AB test by grouping like users by their likelihood to do the launch event X.)
How did behavior change before & after launch?
Looking at [goal metric] over time => if up & to right, doesn’t tell you much due to confounding variables.
To get around confounding variables, must:
-segment customers by used/not used
-segment by days since first use
-segment by propensity to do goal
What does the Impact report do? It aims to tell you whether your launch worked.
Goal: Learn whether there’s causality in the absence of an AB test (randomized controlled experiment).
AB test: assign 2 user groups (treatment vs control group), easy to measure impact
Launch Analysis: natural experiment, not AB test, hard to measure causality bc many variables are changed (correlation does not imply causation)
End:
QB:
Since date: MP can’t automatically determine launch date (defaults to 30d from today ← I determined from talking to PMs that analysis of a launch is usually within 30d of release).
How to handle multiple metrics?
Adopter chart:
What’s different between the adopters vs non-adopters?
Shows how adopters vs non-adopters changed over time (for all users that did the goal event).
Change time window
Users want to know:
What’s the user adoption of the launch?
How did the goal event change for users adopting vs not adopting?
What’s the sample size for the report? (Does it show only users who did the goal event, or also
What impact did the launch have on my goal event?
Positive / Neg impact? How much?
How sure are we that this observed impact was actually caused by Y and not some other factor? (= Confidence score >/= 95% means significant)
How can we control for self-selection bias to get true causality?
Impact calculates the user adoption of the launch, the impact of the launch on an important event, and the differences between users that adopt the launch and those that do not.
Adding flexibility to Impact Report:
How does it handle multiple goal metrics?
Of all users that did the goal event, how did adoption vs non-adoption
Explaining Impact Report: Determining causality
Of all users who did [Y = goal metric],
Impact includes confidence to indicate the statistical significance of report calculations. Interpret the confidence as the probability that the final delta is primarily caused by the launch event, as opposed to existing by chance.
(Generally, the launch was successful if the overall delta is positive and confidence is >/= 95%.)
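The success rule can be written as a small check. The notes do not say how the report derives its confidence score, so this sketch assumes one common choice: one minus the two-sided p-value of a z-test on the overall delta. The function names and the standard-error input are hypothetical.

```python
import math

def confidence(delta, std_error):
    """Confidence that delta differs from zero, under a normal approximation.

    Assumption for illustration: confidence = 1 - two-sided p-value of a z-test.
    """
    if std_error <= 0:
        raise ValueError("std_error must be positive")
    z = abs(delta) / std_error
    # P(|Z| <= z) for a standard normal Z, via the error function.
    return math.erf(z / math.sqrt(2))

def launch_succeeded(delta, std_error, threshold=0.95):
    """Positive overall delta and confidence at or above the threshold."""
    return delta > 0 and confidence(delta, std_error) >= threshold

clear_win = launch_succeeded(delta=2.0, std_error=1.0)      # about 2 standard errors above zero
too_noisy = launch_succeeded(delta=1.0, std_error=1.0)      # positive, but could be chance
negative_impact = launch_succeeded(delta=-2.0, std_error=1.0)
```

Keeping the sign check separate from the confidence check matters: a confident negative delta is statistically significant but is not a successful launch.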
ERF vs LA, Design Crit:
ERF: AB test = randomized controlled experiment, 2 user groups (treatment vs control group), easy to measure impact
Launch Analysis: natural experiment, not AB test, hard to measure causality bc many variables are changed (correlation does not imply causation)
Shows
1) avg rate of ppl who didn’t do (NA) & did the new event (A) before the launch & after the launch.
Avg rate of the metric
LA: WIP (6/27/19)
1:18 in vid:
How LA behaves in DB:
-Entry point is from a DB (Flow: Select DB > Click LA> view report)
-LA automatically picks up the reports within the selected DB (are the metrics) > select a launch event. By default, the report only counts all users who did the metric/goal/effect event (or any event). But can change to count all users who did any event (V2). Of that user base, how many DID DO the launch event (Adopters) vs did NOT do the launch event (NA)?
We show the average rate / freq
-CONS:
-Inflexible manipulation of impacted/goal metric: Can only segment a goal metric if the report has segment