When video ads fail, things can go from bad to worse pretty quickly. Operating at scale, the lost opportunity can rack up hundreds of thousands of missed impressions and video views in only a few minutes.

Scale that up to one or two days and you're looking at a serious missed revenue problem.

So it's critical that, once something does go wrong, you correct it as quickly as possible, and for good.

The customer success team here at Watching That has resolved many video ad failures across our global media network and, as a result, we have developed a pretty robust approach.

In this multi-part series we're going to go through that approach in detail.

So to kick off, here is how, in 5 steps, we approach troubleshooting the majority of video ad failures.

#1 Identify that something is actually wrong

This might sound like a no-brainer but be careful: something being wrong is different to something being imperfect.

A fill rate of 65% is an effect of an imperfect setup but it doesn't mean something is actually wrong.

So the first thing to do is understand that something is indeed wrong and can be fixed.

This requires two things:

  1. A baseline of what is normal; and
  2. A measurement of just how far away from normal things are right now (see the sketch below).
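
As a minimal sketch of what that looks like in practice, here is a quick baseline check in Python. The fill-rate figures, the 14-day window and the two-standard-deviation threshold are all illustrative assumptions, not a prescribed methodology.

```python
# A minimal sketch, assuming you track fill rate (or any other metric) per day.
from statistics import mean, stdev

def is_abnormal(history: list[float], current: float, num_stdevs: float = 2.0) -> bool:
    """Flag `current` when it sits further from the baseline than expected."""
    baseline = mean(history)        # 1. what is normal
    spread = stdev(history)
    return abs(current - baseline) > num_stdevs * spread   # 2. how far from normal are we now

fill_rate_history = [64.8, 65.2, 66.1, 64.5, 65.0, 65.7, 64.9,
                     65.3, 66.0, 64.7, 65.1, 65.5, 64.6, 65.2]
print(is_abnormal(fill_rate_history, 65.4))  # False: imperfect, but normal
print(is_abnormal(fill_rate_history, 41.0))  # True: something is actually wrong
```

The point isn't the maths; it's that "wrong" is defined relative to your own normal, not against an absolute figure like a 65% fill rate.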

The easiest place to start is with your ad server's VAST error reports, cross-referenced with your video playout reports. You're looking for clusters in the data, mapped by a key-value pair, line item ID (GAM) or placement ID (Freewheel) together with a video ID and/or player ID.

This will start giving you the context you need to describe the hotspot explicitly enough.

There are at least 3 levels of error reports available that you can inspect to start spotting these clusters. The trick is to add enough context to these events to link them back to the systems that are the cause.

Look for these reports across your ad server, video platform and client-side code frameworks. Ideally you'd have a unified, universal error list that you can pivot across many dimensions of your setup, like the one provided by the Watching That platform.
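
If you don't have a platform doing this for you, a few lines of code over an exported error list gets you surprisingly far. A minimal sketch, assuming each event carries a line item ID, a video ID and a VAST error code (the field names are assumptions; adapt them to your ad server's export):

```python
# A minimal sketch of spotting clusters in a consolidated error list.
from collections import Counter

events = [
    {"line_item_id": "LI-1001", "video_id": "vid-42", "error_code": 303},
    {"line_item_id": "LI-1001", "video_id": "vid-42", "error_code": 303},
    {"line_item_id": "LI-1001", "video_id": "vid-42", "error_code": 303},
    {"line_item_id": "LI-2002", "video_id": "vid-77", "error_code": 900},
]

# Pivot on the pair of identifiers that best describes *where* the errors live.
clusters = Counter((e["line_item_id"], e["video_id"], e["error_code"]) for e in events)

for (line_item, video, code), count in clusters.most_common(3):
    print(f"{count}x VAST error {code} on line item {line_item} / video {video}")
```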

Example of a consolidated error report

 

#2 Elaborate and describe where the trouble is coming from

With a detailed description of where the errors are being thrown you can now roll up your sleeves and get into the nuts and bolts of your setup.

Until you can isolate which component in your system is misbehaving you won't be able to do much about it.

The bad news is that in a typical setup there are many cogs in the engine, some of which you don't have any control over.

The good news is that, by completing the first step, you can start eliminating the systems that are not relevant and seriously narrow the search field, using identifiers like video IDs, creative URLs and other data signatures to start pointing the finger.

You can use the Watching That Video Ad Map to help you further pinpoint the source of the trouble.

You also have an ace you can play: 99 times out of 100, things break because of some change introduced into the system. This means that once you've identified which system is your prime suspect, you can cross-reference against its recent change logs.

Example of time series visualisation of video ad errors

For this to be truly conclusive you need:

  1. A good set of change logs; and
  2. Time series data that shows you exactly when the effect of the change was introduced (see the sketch below).
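
A minimal sketch of that cross-reference, assuming you can export hourly error counts and keep even a crude, timestamped change log (both structures below are illustrative, and the jump detection is a deliberate simplification):

```python
# Find when the errors jumped, then pick the most recent change released before that point.
from datetime import datetime

error_counts = {                               # errors per hour
    datetime(2019, 6, 26, 12): 40,
    datetime(2019, 6, 26, 13): 38,
    datetime(2019, 6, 26, 14): 950,
    datetime(2019, 6, 26, 15): 1020,
}
change_log = [
    (datetime(2019, 6, 26, 9, 15), "Player upgraded to v3.2"),
    (datetime(2019, 6, 26, 13, 50), "New ad tag deployed on article pages"),
]

hours = sorted(error_counts)
jumps = {hours[i]: error_counts[hours[i]] - error_counts[hours[i - 1]]
         for i in range(1, len(hours))}
jump_at = max(jumps, key=jumps.get)            # when things moved unexpectedly

suspects = [entry for entry in change_log if entry[0] <= jump_at]
print("Errors jumped at", jump_at)
print("Prime suspect:", max(suspects)[1] if suspects else "no logged change")
```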

The WT team uses the platform's real-time monitoring and time-based visualisations to see exactly when something has moved unexpectedly.

#3 Correct the problem

Generally speaking this is the easy part. All the hard work is what gets you to the point of issuing the fix.

Typically the problem sits in one of two buckets:

  1. Code and systems that you control; and
  2. Code and systems that you do not control.

But since you've arrived at this level of understanding, routing the fix request to the relevant party is straightforward.

What's more, rather than just assigning responsibility and walking away, you can now provide more than enough example data and a high-fidelity description of the problem, so that solving it is just a chore to be completed.

At Watching That, we send dev teams and suppliers detailed reports covering:

  • where the trouble is exactly,
  • the component that's causing it,
  • examples of the effects it's having; and
  • recommendations for a solution based on past experience

Example report for supplier optimisation

Instead of being met with suspicion and a defensive posture, this approach is usually welcomed because the hard work has already been done.
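
A minimal sketch of what such a report could look like as structured data, so it can be generated straight from your monitoring rather than written by hand. The field names simply mirror the bullet list above and are not a Watching That format; the values are illustrative.

```python
# A minimal sketch of a trouble report for a dev team or supplier.
from dataclasses import dataclass, field

@dataclass
class TroubleReport:
    location: str                                  # where the trouble is exactly
    component: str                                 # the component that's causing it
    example_events: list[dict] = field(default_factory=list)  # effects observed
    recommendation: str = ""                       # suggested fix from past experience

report = TroubleReport(
    location="Line item LI-1001, article pages, desktop",
    component="Outstream player ad request timeout",
    example_events=[{"error_code": 301, "count": 1243, "window": "last 6h"}],
    recommendation="Raise the VAST wrapper timeout from 4s to 8s.",
)
print(report)
```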

Three of the most common causes of errors we've seen are:

  1. Ad tag macros not being filled in properly - without them ad buyers can't get the right level of context (see the sketch after this list);
  2. Timeouts that are set too short - you need at least 8 seconds if you're operating a waterfall setup; and
  3. In-flight brand safety filters - autoplay, active tab, viewability and size are all measurable in real time. You should pre-clear requests before they are submitted so you don't waste users' time when you know the inventory won't pass the verification checks.
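
For the first cause on that list, a quick pre-flight check can catch unfilled macros before the tag ever fires. A minimal sketch; the patterns cover common macro conventions and the example tag is illustrative, so check your own ad server's macro syntax:

```python
# Flag macro placeholders that were never filled in (%%MACRO%%, [MACRO], {MACRO}).
import re

UNFILLED_MACRO = re.compile(r"%%[A-Z_]+%%|\[[A-Z_]+\]|\{[A-Z_]+\}")

def unfilled_macros(ad_tag_url: str) -> list[str]:
    return UNFILLED_MACRO.findall(ad_tag_url)

tag = ("https://pubads.g.doubleclick.net/gampad/ads?sz=640x480&iu=/1234/video"
       "&description_url=[DESCRIPTION_URL]&correlator=%%CACHEBUSTER%%")
print(unfilled_macros(tag))  # ['[DESCRIPTION_URL]', '%%CACHEBUSTER%%']
```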

#4 Verify the problem has, indeed, been fixed

It's not unusual for a problem to be reported as fixed when it turns out that it hasn't been.

After you've received confirmation that a fix has been released, your troubleshooting effort continues as you work your way back up.

The easiest way here to confirm the wrong has been righted is to revisit the first step of the process and compare the new system levels with your definition of normal.

Remember that trouble very usually starts when a change is introduced into the system.

Issuing a fix is essentially just another change. So to verify that the problem has indeed been addressed and solved, you need to compare the post-fix performance levels with your expectation of normal.
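
As a minimal sketch, reusing the is_abnormal helper and fill_rate_history from the step #1 sketch (the post-fix readings here are made up):

```python
# Compare post-fix readings against the pre-existing baseline.
post_fix_fill_rates = [64.9, 65.3, 65.1, 64.7]   # readings since the fix went out

if any(is_abnormal(fill_rate_history, reading) for reading in post_fix_fill_rates):
    print("Fix not confirmed: still outside normal - iterate through the steps again")
else:
    print("Fix verified: back within normal levels")
```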

If that all matches up then the job's done (almost); otherwise you know immediately that you need to iterate through the steps again.

Some reporting tools can take a few hours (if not days) to provide you with the data you need. This can make the whole process incredibly inefficient, so aim to deploy real-time data platforms so you can validate within minutes and move on quickly.

Example of a real time dashboard in Watching That

The Watching That platform updates in real time so the team here can validate almost immediately if a fix is suitable or if it needs to be refined. The time savings this provides are invaluable.

#5 Apply preventive measures

We live by the motto "everyone gets to make a mistake, as long as they only make it once".

The point is we learn from failing, but we need to make sure we're actually learning. Things break all the time, so preventing the recurrence of known trouble is a must.

At Watching That, we put in place Alerts that monitor for the conditions that lead to the trouble being introduced in the first place.
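
A minimal sketch of such an alert condition; the 5% threshold, the two-consecutive-intervals rule and the notify() stub are assumptions, not Watching That's actual alerting configuration.

```python
# A minimal sketch of a preventive alert rule.
def notify(message: str) -> None:
    print(f"ALERT: {message}")   # swap in email, Slack, PagerDuty, etc.

def check_error_rate(recent_rates: list[float], threshold: float = 0.05) -> None:
    """Fire when the error rate breaches the threshold two intervals in a row."""
    if len(recent_rates) >= 2 and all(r > threshold for r in recent_rates[-2:]):
        notify(f"VAST error rate at {recent_rates[-1]:.1%}, "
               f"above {threshold:.0%} for two consecutive intervals")

check_error_rate([0.02, 0.06, 0.08])   # fires
```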


You can also set up Google Alerts and other devops alerting platforms that keep an eye on things for you. You definitely shouldn't be the last to know something has gone wrong.

We also encourage our customers to document changes to their systems efficiently and effectively, as those change logs are the key when things go wrong. The humble spreadsheet is a mighty tool in this regard, but JIRA, GitHub or other release management software works too.

Sometimes that isn't practical and, in those cases, we add the conditions to our 'fault library', which we then build into our product so we can automatically detect and resolve these issues without our customers even knowing!

So think about creating your own list of issue resolutions, so you can always refer back to them rather than having to keep them in living memory.
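
A minimal sketch of such a list as a lookup table; the signatures and resolutions below are just examples of the kind of entries you might accumulate over time.

```python
# A home-grown fault library: error signature -> the resolution that worked last time.
FAULT_LIBRARY = {
    ("VAST 303", "waterfall setup"): "Raise the ad request timeout to at least 8 seconds.",
    ("VAST 900", "new ad tag"): "Check the player is filling in the tag's macros.",
}

def suggest_resolution(error: str, context: str) -> str:
    return FAULT_LIBRARY.get(
        (error, context),
        "No known resolution yet - document this one once it's solved.",
    )

print(suggest_resolution("VAST 303", "waterfall setup"))
```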

Magic.

If you, or your team, want to troubleshoot video ad failures like a pro, please feel free to get in touch!

Alternatively, as we progress through this series of posts, feel free to subscribe below to have them delivered direct.

 

Get Watching That's blog direct to your inbox