Does Your Model Launch Have a SafetyNet?

Vera Dadok, April 4, 2018

In the adversarial and constantly changing world of fraud detection and trust enablement, keeping up with the most recent online patterns is crucial to success. Sift Science uses machine learning (ML) models to rate the riskiness of online events with Sift Scores from 0 (trustworthy) to 100 (high risk of fraud) in real time for thousands of online businesses. Although these models are equipped with real-time online learning to automatically adjust to rapid shifts, it’s necessary to re-train frequently. This allows us to fully capture the structure of new fraud patterns within our ensemble of models and provide the most accurate models possible given the data customers have sent to us.


Unfortunately, there’s a price to pay when you change things in machine learning models, even when that change is an improvement and reflects the freshest state of the world. That price is paid in stability: any improvement is, by definition, also a change. We’re a software-as-a-service (SaaS) company, and our customers use Sift Scores in a variety of ways to reduce the impact of fraud on their businesses and to enable rapid growth by reducing friction for trustworthy users. Most customer use cases are operationally sensitive to score distribution shifts.

During each launch, our models show improvements in fraud detection performance as measured by the area under the receiver operating characteristic curve (AUC-ROC). How could a better machine learning model harm customer operations? Let’s find out…

Disruptiveness of Score Distribution Shifts

Let’s consider one of the most typical customer use cases. In many cases, customers take an action on the events with the riskiest Sift Scores. Actions include queuing up these events for manual review or automatically blocking a certain percentage of the riskiest events. Customers use the value of the Sift Score to assess risk, so this process translates into picking a score threshold s* and performing an operation on all transactions with scores above that threshold. We make this easy to implement via Workflows.

See the illustration below of a typical use case for Sift Scores, where a customer blocks all events with scores above a threshold of 95.

Customers operating with a particular model have chosen that score threshold carefully. These customers will have observed that over time, under standard operations, a certain fraction of risky users exceeds the threshold: p_old = P(s_old > s*). When we launch a new model, the underlying score distribution may change. In particular, the fraction of scores above the customer threshold s*, p_new = P(s_new > s*), may significantly increase or significantly decrease from the fraction of events above s* under the old model. For example, the illustration below shows a case in which a sudden change in score distribution increases the fraction of traffic above the score threshold.

Illustration of score distribution change that leads to higher block rate
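
Both fractions can be estimated directly from samples of scores. The following is a minimal Python sketch of that estimate, not Sift's production code; the helper name `block_rate` and the score values are made up for illustration.

```python
def block_rate(scores, threshold):
    """Fraction of scores strictly above the threshold, an estimate of P(s > s*)."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(1 for s in scores if s > threshold) / len(scores)


# Hypothetical scores for comparable traffic under each model.
old_scores = [12, 40, 67, 88, 91, 96, 97, 30, 55, 99]
new_scores = [15, 42, 70, 96, 97, 98, 99, 35, 60, 99]

s_star = 95
p_old = block_rate(old_scores, s_star)  # 3 of 10 scores exceed 95
p_new = block_rate(new_scores, s_star)  # 5 of 10 scores exceed 95
```

In this toy sample, a customer blocking above 95 would see the blocked share of traffic rise from 30% to 50% without ever touching the threshold.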

Sudden increases in traffic above a score threshold may lead to:

  • Sudden jumps in the volume of automatically blocked traffic. While this may be due to an improved model catching more fraud at high score ranges, it also may be due to blocking more good users and thus losing revenue. See illustrative figures of these two cases for a new model with improved AUC-ROC.
  • Long review queue backups that overwhelm the fraud analyst team

Sudden decreases in traffic above a score threshold can lead to:

  • Increased traffic bypassing review queues or auto-block rules. This may indicate growing false negatives among the larger number of unreviewed events and a reduction in fraud prevention effectiveness.

These sudden shift scenarios are why we need a SafetyNet. 

Disruptive Model Shift Detection

What is SafetyNet? SafetyNet helps us detect significant score distribution shifts during model launch, shifts we expect could have a significant impact on customer operations. To do this we need

  1. A metric to measure customer impact
  2. A process to detect significant shifts leveraging this metric

The Metric

To choose a metric for customer impact, we return to the customer perspective. Let’s look at a single customer before a new model launch. We have two models: an old model which we’d like to replace and a freshly-trained new model. We’ve examined the potential consequences if we suddenly launch the new model over all customer traffic without warning. 

The operational consequences of the launch are due to sudden volume changes in blocking or queueing traffic. This leads to a metric based on the difference between the fractions of scores above a threshold under the new and old models, p_new and p_old.

The absolute change (p_new - p_old) doesn’t capture the operational realities. Consider two situations with very different operational impacts on a manual review queue managed by a team of fraud analysts.

  1. A queue that collected 20% of traffic suddenly collects 21% of traffic
  2. A queue that collected 0.5% of traffic suddenly collects 1.5% of traffic

In the first case, the review team only needs to do 5% more work. However, in the second scenario the team would need to review three times the expected volume!

The metric: the percentage change θ = (p_new - p_old) / p_old at relevant s* threshold locations for each customer.
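
To make the contrast between the two queue scenarios concrete, here is a small Python sketch of the relative change. The function name `relative_change` is an illustrative choice, not something from Sift's codebase.

```python
def relative_change(p_old, p_new):
    """Relative (percentage) change in the fraction of traffic above a threshold."""
    if p_old <= 0:
        raise ValueError("p_old must be positive for a relative change to be defined")
    return (p_new - p_old) / p_old


# Scenario 1: a queue collecting 20% of traffic now collects 21%.
theta_large_queue = relative_change(0.20, 0.21)    # roughly 0.05, i.e. 5% more work

# Scenario 2: a queue collecting 0.5% of traffic now collects 1.5%.
theta_small_queue = relative_change(0.005, 0.015)  # roughly 2.0, i.e. three times the volume
```

Both scenarios have the same absolute change of one percentage point, yet the relative change differs by a factor of about 40.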

Detecting significant shifts

To avoid disruptive score distribution changes during model deployments, we need a process to detect significant shifts using this metric θ at relevant s* threshold locations and assess if the magnitude of this change is outside of an acceptable interval around 0. In other words, we want to check whether θ(s*) falls outside an acceptable interval [b_min, b_max].

As we are always improving the model, we do expect some minor fluctuations, thus b_min < 0 < b_max. The detection process runs as follows:

  1. Collect a data sample
    • Take two samples of scores, n_old points from the old model and n_new points from the new model, over disjoint user events. This corresponds to a data collection process of running the models on different subsets of event traffic.
  2. Compute empirical block rate changes from the sampled data 
    • We only have a finite sample: n_old points, of which x_old are above s*, and n_new points, of which x_new are above s*. So the empirical block rate change is θ_emp(s*) = (x_new / n_new) / (x_old / n_old) - 1.
    • We compute this initially at all integer s* locations between 0 and 100, as customers may be operating at any threshold. In some cases, such as customers using Workflows, we have information on relevant s* locations. In other cases, we retain the set of all thresholds that would experience shifts.
  3. Compute confidence intervals around the empirical block rate changes
    • Comparing the empirical block rate change directly to [b_min, b_max] would create many false positives claiming the magnitude of change was outside of acceptable bounds due to the limited sample sizes. We highlight only statistically significant changes by computing a confidence interval (CI) around the empirical block rate change θ_emp(s*), and finding thresholds s* that have CIs entirely outside of this interval.
    • For small sample sizes, we use a combination of additive smoothing and bootstrapping to build CIs. For larger sample sizes we compute the confidence interval by viewing the empirical block rate change plus one (θ_emp + 1) as a risk ratio and using a normal approximation on the log of the ratio.
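
The two confidence interval approaches named in the last step can be sketched in Python as follows. This is an illustration under stated assumptions rather than Sift's implementation. The large-sample function uses the textbook standard error of a log risk ratio for two independent binomial samples. The small-sample function is one plausible way to combine additive smoothing with a percentile bootstrap: the pseudo-count `alpha`, the number of resamples, and the function names are all hypothetical choices.

```python
import math
import random
from statistics import NormalDist


def log_ratio_ci(x_old, n_old, x_new, n_new, confidence=0.95):
    """Large-sample CI for the block rate change via the log risk ratio.

    Returns (theta_emp, theta_low, theta_high).
    """
    if x_old <= 0 or x_new <= 0:
        raise ValueError("both counts must be positive for the log approximation")
    ratio = (x_new / n_new) / (x_old / n_old)  # this is theta_emp + 1
    # Standard error of log(ratio) for two independent binomial samples.
    se = math.sqrt(1 / x_new - 1 / n_new + 1 / x_old - 1 / n_old)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return ratio - 1, ratio * math.exp(-z * se) - 1, ratio * math.exp(z * se) - 1


def bootstrap_ci(x_old, n_old, x_new, n_new, confidence=0.95,
                 alpha=0.5, resamples=2000, seed=0):
    """Small-sample percentile-bootstrap CI with additive smoothing.

    alpha is a hypothetical pseudo-count that keeps resampled rates away from 0.
    Returns (theta_low, theta_high).
    """
    rng = random.Random(seed)

    def resampled_rate(x, n):
        # Resample n events with replacement; each lands above s* with probability x / n.
        hits = sum(1 for _ in range(n) if rng.random() < x / n)
        return (hits + alpha) / (n + 2 * alpha)

    changes = sorted(
        resampled_rate(x_new, n_new) / resampled_rate(x_old, n_old) - 1
        for _ in range(resamples)
    )
    low_idx = int((1 - confidence) / 2 * resamples)
    high_idx = int((1 + confidence) / 2 * resamples) - 1
    return changes[low_idx], changes[high_idx]


# Hypothetical counts: 0.5% of 100,000 old-model events vs. 1.5% of 100,000 new-model events.
theta_emp, theta_low, theta_high = log_ratio_ci(500, 100_000, 1_500, 100_000)

# Hypothetical small sample, where the normal approximation is less trustworthy.
small_low, small_high = bootstrap_ci(3, 200, 9, 200)
```

With the large hypothetical sample, the interval around the empirical change of 2.0 is fairly tight. With only a handful of events above the threshold, the bootstrap interval is much wider, which is why a large empirical change alone is not enough to flag a threshold.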

A note on trust: for most of this analysis we’ve described approaches for fraud prevention, focusing on the fraction of traffic above a particular score threshold. For customers more interested in reducing friction for trustworthy users, we can do an equivalent analysis on the fraction of traffic below score thresholds.

After this procedure, for each customer we’ve collected a set of potential score thresholds S_shift at which a customer is likely to experience a significant shift.
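
Collecting that set amounts to keeping the thresholds whose confidence interval lies entirely outside the acceptable band. A minimal Python sketch, with hypothetical bounds of ±20% and made-up intervals:

```python
def significant_shift_thresholds(ci_by_threshold, b_min, b_max):
    """Return thresholds whose CI for the block rate change lies entirely outside [b_min, b_max]."""
    if not b_min < 0 < b_max:
        raise ValueError("expected b_min < 0 < b_max")
    return sorted(
        s_star
        for s_star, (low, high) in ci_by_threshold.items()
        if high < b_min or low > b_max
    )


# Hypothetical confidence intervals (theta_low, theta_high) per threshold s*.
ci_by_threshold = {
    50: (-0.04, 0.06),   # well inside the band: not flagged
    80: (0.05, 0.30),    # overlaps the band: not statistically outside it
    90: (0.25, 0.60),    # entirely above b_max: flagged
    95: (-0.55, -0.30),  # entirely below b_min: flagged
}

s_shift = significant_shift_thresholds(ci_by_threshold, b_min=-0.2, b_max=0.2)
```

Requiring the whole interval to sit outside the band is what keeps noisy, small-sample thresholds such as 80 in this example from raising false alarms.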

Beyond Statistical Detection

Suppose we discover that our hypothetical customer has a non-empty S_shift and may experience a disruptive change during model deployment. What’s the next step? We have a couple of options. In certain cases, we can calibrate scores during launch to mitigate distribution changes directly via a remapping of scores. In other cases, we instead alert the customer to the upcoming launch. That customer conversation needs to include recommended actions and, ideally, information about why the shift is occurring.
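
The post does not say how the score remapping is done. One common way to align two score distributions is quantile mapping, sketched below in Python purely as an assumption about what such a calibration could look like. The function name `build_quantile_remap` and the sample scores are hypothetical.

```python
from bisect import bisect_right


def build_quantile_remap(old_scores, new_scores):
    """Build a function mapping a new-model score to the old-model score at the same quantile."""
    old_sorted = sorted(old_scores)
    new_sorted = sorted(new_scores)
    if not old_sorted or not new_sorted:
        raise ValueError("both reference samples must be non-empty")

    def remap(score):
        # Number of reference new-model scores at or below this score.
        rank = bisect_right(new_sorted, score)
        # Matching position in the old sample, using integer ceiling division.
        idx = (rank * len(old_sorted) + len(new_sorted) - 1) // len(new_sorted) - 1
        return old_sorted[max(0, min(len(old_sorted) - 1, idx))]

    return remap


# Hypothetical reference samples: the new model scores everything 10 points higher.
old_scores = [10, 20, 30, 40, 50]
new_scores = [20, 30, 40, 50, 60]

remap = build_quantile_remap(old_scores, new_scores)
remapped = [remap(s) for s in new_scores]
```

After remapping, the fraction of traffic above any threshold matches the old model on the reference sample, while the ordering of events still comes from the new model.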

Internal Review

We produce a variety of internal information around the expected score shift, potential causes, and recommendations to pass on to customers. This is also a first chance for manual inspection of any score shifts.

For visualizing the shifts, we’ve found the most useful internal figures are 

  • The reverse cumulative distribution function (reverse CDF), which corresponds to empirical block rates above a threshold at each score
  • Fractional change in the reverse CDF, including
    • empirical values θ_emp
    • confidence interval [θ_low, θ_high]
    • threshold lines indicating the magnitude at which we would consider a shift disruptive
    • red markers at score thresholds that could lead to statistically significant disruptions
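
The quantities behind these figures can be computed at every integer threshold in a few lines. The Python sketch below uses made-up scores and hypothetical function names, and it leaves the plotting itself out.

```python
def reverse_cdf(scores, thresholds=range(0, 101)):
    """Empirical block rate at each threshold: fraction of scores strictly above it."""
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one score")
    return {t: sum(1 for s in scores if s > t) / n for t in thresholds}


def fractional_change(old_scores, new_scores, thresholds=range(0, 101)):
    """Fractional change in the reverse CDF; None where the old rate is zero."""
    old_rates = reverse_cdf(old_scores, thresholds)
    new_rates = reverse_cdf(new_scores, thresholds)
    return {
        t: (new_rates[t] - old_rates[t]) / old_rates[t] if old_rates[t] > 0 else None
        for t in thresholds
    }


# Hypothetical score samples from each model.
old_scores = [10, 50, 90, 96]
new_scores = [10, 60, 96, 98]

old_curve = reverse_cdf(old_scores)
change_curve = fractional_change(old_scores, new_scores)
```

Here `change_curve[95]` is 1.0 (the block rate doubles from 25% to 50%), while thresholds with no old-model traffic above them are reported as `None` rather than dividing by zero.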

Note that examining the probability mass function (PMF) can be misleading as long-tailed score distributions can be difficult to compare visually.  A plot of confidence intervals against allowed bounds is very illustrative of the significance of shifts. Below we show a few examples of these figures.

Customer Communication

We finally transform the known significant changes in block rate into tables of operating point recommendations for updating relevant score thresholds from s* to s*_new. When possible, we like to pair these recommendations with insights into why the change occurred.
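
The post does not describe how the recommended thresholds are derived. One simple possibility, shown here only as an assumption, is to pick the new-model threshold whose block rate is closest to the block rate the customer had under the old model. The function name and sample scores are hypothetical.

```python
def recommend_threshold(old_scores, new_scores, s_star, candidates=range(0, 101)):
    """Pick the threshold whose new-model block rate best matches the old rate at s_star."""
    if not old_scores or not new_scores:
        raise ValueError("both score samples must be non-empty")

    def rate(scores, threshold):
        return sum(1 for s in scores if s > threshold) / len(scores)

    target = rate(old_scores, s_star)
    # Break ties in favor of the candidate closest to the current threshold.
    return min(
        candidates,
        key=lambda t: (abs(rate(new_scores, t) - target), abs(t - s_star)),
    )


# Hypothetical score samples from each model.
old_scores = [10, 50, 90, 96]
new_scores = [10, 60, 96, 98]

s_star_new = recommend_threshold(old_scores, new_scores, s_star=95)
```

In this toy example the customer's threshold of 95 blocks 25% of traffic under the old model, and moving to 96 keeps the same block rate under the new one.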

For insights, the most useful information comes from examining recent changes in data inputs, such as newly integrated data fields or changes in the volume of fraud vs. normal users. This can be paired with information about known modeling changes in the final messaging.

Ideally when we launch the new model, customers are prepared for any changes and can continue operating with minimal disruption while catching even more fraud than before.

What’s Next?

While SafetyNet provides peace of mind around detection of score shifts, there’s plenty of room to improve our systems around model launch. We can improve score shift detection accuracy and efficiency in a variety of ways such as expanding the set of metrics to cover a wider variety of use cases or further automating manual steps. We will continue to develop new strategies and systems to make model launches seamless for our customers.   
