Avanan was the first company to offer pure-API email security, back in 2016. It is a very compelling approach: deployment is extremely easy, it protects internal email, and it gives superior visibility, all of which make for better security.
Back then there was very little documentation from Microsoft on the issue, but two weeks ago Microsoft published guidelines covering what we had to learn the hard way. We thought this would be a good opportunity to share our real-world experience.
We ran into a few problems with the API, and they all come down to one thing: it cannot scale. How we solved it is at the bottom, but first, here are the problems with an API-only approach.
Microsoft’s throttling is now well documented, and we’ll get to it soon. What Microsoft does not document is that the API starts experiencing delays when the tenant is under load. API calls fail, and events are even dropped - the API simply ignores certain messages. To overcome this, we backed up the event-based callback with a polling mechanism, which, as you would expect from polling, introduced bigger delays.
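To make the callback-plus-polling idea concrete, here is a minimal sketch of how the two paths can feed one deduplicating intake. It is an illustration only, not Avanan's implementation: the `MessageIntake` class, the `scan` hook, and the message IDs are all hypothetical names, and the timer that would drive the polling sweep is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Set


@dataclass
class MessageIntake:
    """Accepts message IDs from push notifications and from a polling sweep."""

    scan: Callable[[str], None]  # hypothetical hook that scans one message
    seen: Set[str] = field(default_factory=set)

    def _handle(self, message_id: str) -> bool:
        # Scan each message once, whichever path reports it first.
        if message_id in self.seen:
            return False
        self.seen.add(message_id)
        self.scan(message_id)
        return True

    def on_notification(self, message_id: str) -> bool:
        """Fast path: called when the API pushes a change event."""
        return self._handle(message_id)

    def on_poll(self, listed_ids: Iterable[str]) -> int:
        """Slow path: called on a timer with the IDs a mailbox listing returned.

        Returns how many messages the notification path had missed.
        """
        return sum(self._handle(mid) for mid in listed_ids)


scanned = []
intake = MessageIntake(scan=scanned.append)
intake.on_notification("msg-1")
# "msg-2" never produced a notification; the next sweep picks it up.
missed = intake.on_poll(["msg-1", "msg-2"])
```

The sketch also shows where the extra delay comes from: a message whose event was dropped is only scanned when the next sweep runs, so the worst-case latency is the polling interval.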
Here’s a reply that a customer of an API-based email filtering service received from his API vendor and shared with us. When he asked, “Is there a reason why sometimes the emails appear in the inbox and don't automatically go into the quarantine folder?”, here was their reply:
What Microsoft did document is a throttling mechanism for the API, designed to protect its servers from receiving too many requests. The exact implementation of the throttling is naturally known only to Microsoft, but they do acknowledge that throttling is a challenge for anyone planning to use their API (Microsoft Graph - Don't Get Throttled!).
There are generally three thresholds you could run into.
The first limit is per app across all tenants. It means that the vendor that provided your app has a limit of 2,000 calls per second across its entire install base. That might sound like a lot, but it means that if another customer of your vendor is overloading the API, for example because of email loops, you might get throttled too.
(From: https://docs.microsoft.com/en-us/graph/throttling)
The second limit is per app per mailbox, set at 10,000 calls per 10 minutes. Retrieving the information about an email (including its attachments, etc.) takes roughly 4-5 API calls, so this number translates to about 2,000 emails per 10 minutes, or about 3 per second. Again, most of the time that’s fine, but over time and across all mailboxes, you’ll get throttled.
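The arithmetic behind that estimate can be checked in a few lines. The limit and the 4-5 calls per email are the figures quoted above; the function names are only for illustration.

```python
LIMIT_CALLS = 10_000       # per app per mailbox, as quoted above
WINDOW_SECONDS = 10 * 60   # the limit applies to a 10-minute window


def emails_per_window(calls_per_email: int) -> int:
    """Whole emails that fit in one window at the given API cost per email."""
    return LIMIT_CALLS // calls_per_email


def emails_per_second(calls_per_email: int) -> float:
    """The same budget expressed as a sustained per-second rate."""
    return emails_per_window(calls_per_email) / WINDOW_SECONDS


for calls in (4, 5):
    print(calls, emails_per_window(calls), round(emails_per_second(calls), 1))
# 4 2500 4.2
# 5 2000 3.3
```

At 5 calls per email the budget is the 2,000 emails per 10 minutes mentioned above, a little over 3 per second for a single mailbox.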
But perhaps the most alarming throttling limit is the one with no published threshold: Microsoft doesn’t tell us when or how, but simply declares that it will throttle the API if the tenant is experiencing high load, or if its entire service is under load. As described in their documentation:
In other words, the API is second priority to email processing. If there’s load, the API is the first to suffer. This makes sense from an implementation standpoint, but relying on a second-priority API for email security is very risky.
The impact of throttling is devastating. Once it happens, it is unpredictable when the service will resume, and we have seen it take 24 hours or more. The failure reply from Microsoft suggests a delay before retrying, but it is only a recommendation, and failed retries will in fact extend the penalty time. While you wait to retrieve old messages, new emails continue to arrive. The queue, the recovery time, and the total delay only get longer. Across multiple requests, in a large environment, it basically means the service is no longer operational.
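For reference, a throttled Microsoft Graph request comes back as HTTP 429 with a `Retry-After` header giving the suggested wait in seconds. Below is a minimal sketch of a client that honors that hint and caps its attempts, since hammering the API with retries only makes things worse. The `send` callable, the `(status, headers)` response shape, and the default values are assumptions made for illustration, not part of any Microsoft SDK.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str]]  # (status code, headers); assumed shape


def call_with_retry_after(
    send: Callable[[], Response],          # hypothetical: performs one API call
    max_attempts: int = 5,
    default_wait: float = 5.0,             # used when the hint is missing or unparsable
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call `send`, waiting for the server-suggested delay after each 429."""
    for attempt in range(1, max_attempts + 1):
        status, headers = send()
        if status != 429:
            return status, headers
        if attempt == max_attempts:
            break  # give up rather than keep retrying into the penalty
        try:
            wait = float(headers.get("Retry-After", default_wait))
        except ValueError:
            wait = default_wait
        sleep(wait)
    return status, headers
```

Even a well-behaved client like this one only waits politely. It does nothing about the emails that keep arriving while it waits, which is why the backlog keeps growing.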
Avanan uses the Microsoft API across the system, but not for real-time email retrieval, which needs to work at wire speed.
For real-time emails, we use SMTP. It’s scalable and it can be inline.
Microsoft’s Graph API was not designed for implementing email security - it is not inline, it does not provide an SLA on delivery/update times, and as demonstrated in this blog, it will throttle the service to protect itself from load and will degrade performance if load does occur. In simple words - if it cannot guarantee performance under load, it cannot scale for services that require a timely response.
This is the reason Avanan implemented a unique mode that:
Applies all configurations via API - to deploy in a click
Uses the API to scan existing data (e.g. historical email) and contextual data (configs, users, groups, etc.), for full visibility
Gets real-time emails via SMTP - because it allows inline deployment and because it can scale