Not all adblockers are born equal
A deep dive into one of the fastest network filtering engines.
Since its inception in 2015, the Cliqz browser has been shipping with the most advanced anti-tracking protection available, which takes care of protecting the privacy of users. It is not uncommon for popular extensions or browsers out there to rely on adblocking mechanisms (aka content blocking) to offer privacy protection. This is not optimal for multiple reasons: not all third-party requests can be blocked, breakage of pages is higher, and numerous exceptions are used to fix said breakage, leading to a weakened protection for users. At Cliqz, we think that adblockers are complementary to anti-tracking technologies: one takes care of privacy protection, while the other reduces clutter and distractions, optimizes loading times of pages, and maximizes battery life of mobile devices. All of this is achieved while minimizing breakage, as the adblocker can afford to block requests less aggressively without putting users’ privacy at risk.
The performance of Cliqz’s adblocker has already been covered in the past. We published a comparative adblocker performance study in early 2019, showcasing huge disparities in decision times to block advertising or tracking requests (up to three orders of magnitude, in fact) between blockers with similar feature sets. At the time, our adblocker became the benchmark in terms of network filtering performance. The algorithms allowing us to reach this level of efficiency are the result of years of research and of development in the open, allowing fast iterations without the friction caused by final-product considerations.
Although many adblockers are only available as browser extensions, we decided early on to develop the core of ours in the open, as a self-contained TypeScript library making very few assumptions about how it would be integrated in user-facing projects. This choice paid off, as the library has already been successfully used in many different contexts, both by ourselves and the open-source community[1]. Finally, our work on a more efficient adblocker design enabled others to implement fast network request filtering engines.
In this post we detail some of the design choices and implementation details which made Cliqz’s adblocker more efficient than other contenders. But first, let us give a general overview about how adblockers work, to get an idea about the complexity involved. If you are already familiar with concepts such as network filtering and element hiding, feel free to skip the following two sections.
What’s in an adblocker?
Adblockers can be thought of as toolboxes: They offer a set of capabilities which can be programmed using lists of filters (sometimes referred to as blocklists). These capabilities can be divided into two categories: network filtering and cosmetics (also called element hiding).
Network filtering designates all capabilities which will alter network requests initiated by web pages. Things like blocking and redirecting requests or hardening Content Security Policy (CSP) headers fall into this category. They form the first line of defense against ads, as blocking requests before they are sent to a remote server brings the most benefits. It can be done early enough in the life-cycle of requests to prevent extra work for the user-agent, leading to performance improvements while browsing. There are cases where this mechanism does not suffice, though, which is why we also need cosmetics.
Cosmetics (or element hiding) make it possible to hide or defuse ads which were already loaded or partially loaded in a web page. Adblockers use such capabilities whenever blocking requests is not enough. In this category we can rely on: hiding DOM elements (based on CSS selectors), injecting custom scriptlets (small self-contained JavaScript snippets which will alter the web page to help prevent or remove ads), or adding custom stylesheets in the page (which can be useful to defuse some in-page popups).
Both categories are essential to block ads; they are complementary. It is important to keep in mind, though, that not all adblockers are able to leverage both network filtering and element hiding. For example, an adblocker implemented at the DNS level can only apply a subset of network filters (it can block based on domains only, to be specific, and it can do so across multiple apps and devices) but cannot apply cosmetic filters, or know about the URLs of network requests. This is problematic in many cases; consider domains which could appear both as first- and third-parties such as facebook.com or google.com: blanket blocking of those would break legitimate use cases, but white-listing them would expose users to tracking. A catch-22 situation! In contrast, a browser or extension can potentially do both network and cosmetic filtering, but it is restricted to the web pages it sees. This holds, at least, until Manifest v3 is deployed for all Chrome users, since it will seriously handicap the network filtering capabilities of any extension running in this browser; Firefox does not currently have any official plan to cripple its networking APIs.
What are blocklists?
At this point it should be clear that adblockers on their own are of little value, all the “intelligence” coming from blocklists (or filter lists)[2]. These are composed of tens of thousands of filters (aka rules), telling the adblocker how to behave. Cliqz supports the Adblock Plus (“ABP”) filter syntax, as well as extensions introduced by the uBlock Origin and AdGuard projects[3]. This format is the de facto standard that a vast majority of adblockers support.
Because there are two distinct categories of capabilities used to block ads, we also rely on two kinds of filters to program their respective behaviors:
Network Filters
Network filters are used to tell the adblocker which network requests should be blocked or redirected while users browse the Web. For this purpose a network request contains information about: (1) the URL of the request, (2) the URL of the page, and (3) the type of request. Filters can target requests based on this information; here are some of the most common triggering mechanisms:
- Plain Patterns: allow targeting a request based on a string of characters which should be found in its URL. The filter /ads.js would match any URL containing this specific string of characters (e.g. https://example.com/ads.js).
- Wildcard Patterns: extend plain patterns with globbing capabilities; the special symbol * can be used to match any number of characters. The filter /scripts/*/ads.js would match any request URL containing the substring /scripts/, followed by zero or more characters, then /ads.js (e.g. https://example.com/scripts/v1/ads.js).
- Hostname Anchors: can restrict a rule to a specific host, using the special prefix ||<hostname>. The filter ||ads.example.com would block any request made to hostname ads.example.com (or one of its subdomains).
- Left and Right Anchors: introduce a special | character which, when appearing at the start or the end of a pattern, forces a rule to match the beginning or end of a URL. The filters |https:// and /ads.js| would both match the URL https://example.com/ads.js.
- Regular Expressions: are a powerful addition to the adblocker arsenal, harnessing the flexibility of RegExps to match requests. The filter /^https://(sub1|sub2).example.com/ would match requests to either the sub1.example.com or sub2.example.com domains. This feature should be used with care as it is usually slower to evaluate.
These features can also be combined to form more complex filters. For example, the filter ||example.org/*/ads.js| would match the request URL https://example.org/scripts/ads.js.
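To make these pattern features more concrete, here is a minimal sketch which compiles a pattern into a JavaScript RegExp. The helper names `escapeRegExp` and `patternToRegExp` are hypothetical and this is not how the Cliqz library evaluates filters; the sketch only covers plain text, `*` wildcards, `||` hostname anchors and `|` left/right anchors, and it does not enforce a boundary at the end of an anchored hostname.

```typescript
// Hypothetical helpers for illustration only (not the Cliqz library API).

// Escape characters which have a special meaning in regular expressions.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Compile a simplified ABP-style pattern into a RegExp.
function patternToRegExp(pattern: string): RegExp {
  let prefix = '';
  let suffix = '';
  let body = pattern;

  if (body.startsWith('||')) {
    // Hostname anchor: match right after the scheme, on the host or a subdomain.
    prefix = '^[a-z][a-z0-9+.-]*://([^/?#]*\\.)?';
    body = body.slice(2);
  } else if (body.startsWith('|')) {
    // Left anchor: the pattern must match the beginning of the URL.
    prefix = '^';
    body = body.slice(1);
  }

  if (body.endsWith('|')) {
    // Right anchor: the pattern must match the end of the URL.
    suffix = '$';
    body = body.slice(0, -1);
  }

  // Each `*` matches any number of characters; everything else is literal.
  const source = body.split('*').map(escapeRegExp).join('.*');
  return new RegExp(prefix + source + suffix, 'i');
}
```

With this sketch, `patternToRegExp('||example.org/*/ads.js|')` produces an expression which matches https://example.org/scripts/ads.js. Compiling every filter to a regular expression is convenient for illustration, but it is not necessarily the fastest way to evaluate filters.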
Last but not least, each filter can be annotated with additional options (following the optional $ character, separated by commas) to further narrow down the requests to block, using information such as their type, the domain of the page, whether the request is first- or third-party, etc. The filter /ads.js$script,third-party,domain=example.com would match any third-party script request emitted from a page of domain example.com, if the URL contains /ads.js.
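As an illustration of how options attach to a pattern, the following sketch splits a filter into its pattern and a few of its options. The `ParsedNetworkFilter` shape, the `parseNetworkFilter` name and the short list of request types are assumptions made for this example; regular expression filters containing a `$` and the many other options supported by real blocklists are deliberately ignored.

```typescript
// Hypothetical, simplified parser for illustration only.
interface ParsedNetworkFilter {
  pattern: string;
  types: string[]; // request types the filter is restricted to
  thirdParty: boolean | undefined; // undefined means "no constraint"
  domains: string[]; // page domains the filter is restricted to
}

// A small, non-exhaustive subset of request types.
const KNOWN_REQUEST_TYPES = ['script', 'image', 'stylesheet', 'font', 'media'];

function parseNetworkFilter(line: string): ParsedNetworkFilter {
  // Options, when present, follow the last `$` character.
  const dollar = line.lastIndexOf('$');
  const pattern = dollar === -1 ? line : line.slice(0, dollar);
  const rawOptions = dollar === -1 ? [] : line.slice(dollar + 1).split(',');

  const parsed: ParsedNetworkFilter = {
    pattern,
    types: [],
    thirdParty: undefined,
    domains: [],
  };

  for (const option of rawOptions) {
    if (option === 'third-party' || option === '3p') {
      parsed.thirdParty = true;
    } else if (option === '~third-party' || option === 'first-party' || option === '1p') {
      parsed.thirdParty = false;
    } else if (option.startsWith('domain=')) {
      // Multiple domains are separated by `|`.
      parsed.domains = option.slice('domain='.length).split('|');
    } else if (KNOWN_REQUEST_TYPES.indexOf(option) !== -1) {
      parsed.types.push(option);
    }
  }

  return parsed;
}
```

Parsing `/ads.js$script,third-party,domain=example.com` with this sketch yields the pattern `/ads.js`, the type `script`, a third-party constraint, and the single page domain `example.com`.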
Cosmetic Filters
Cosmetic filters are not triggered by requests, but modify pages (or frames, to be precise). Their targeting logic is thus slightly simpler; the complexity comes mostly from the sheer number of filters which can be applied on a given page, as well as the implementation of some of the behaviors (see below). They can optionally specify a list of hosts on which they operate, followed by an expression describing the desired effect; these filters are of the form: [hostname]##[expression]. Let us briefly go through some of the most common ones:
- Element hiding: is the simplest form of cosmetics. It allows specifying CSS selectors targeting nodes from the DOM which should be hidden on a given page. The filter example.com##.ads would hide any element matching the selector .ads from pages of domain example.com. Multiple element hiding filters can target the same page, and the adblocker is in charge of applying them efficiently.
- Procedural filters, aka extended CSS selectors: are a more advanced form of element hiding. Instead of relying on declarative CSS selectors, they allow “scripting” the selection of nodes to be hidden using custom operators[4].
- Scriptlets: are small, self-contained JavaScript snippets which run in the context of the page and alter it to help prevent ads from loading, or reduce breakage. The filter example.com##+js(nobab) would inject bab-defuser in pages from the example.com domain to defuse “BlockAdblocker”.
- HTML filters: allow specifying elements from the HTML to be removed before it is parsed by the browser. This capability can be implemented using the StreamFilter API available in Firefox only, which enables an extension to filter the raw response of network requests in a streaming fashion. The filter example.com##^script:has-text(pattern) would drop <script> tags from the HTML if they contain pattern.
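The [hostname]##[expression] form described above can be illustrated with a small parsing sketch. The `ParsedCosmeticFilter` type and `parseCosmeticFilter` function are hypothetical names chosen for this example; exceptions, procedural operators and other syntax extensions are not handled.

```typescript
// Hypothetical, simplified parser for illustration only.
interface ParsedCosmeticFilter {
  hostnames: string[]; // empty means "applies to every page"
  kind: 'hide' | 'scriptlet' | 'html';
  expression: string;
}

function parseCosmeticFilter(line: string): ParsedCosmeticFilter | undefined {
  const separator = line.indexOf('##');
  if (separator === -1) {
    return undefined; // not a cosmetic filter in this simplified model
  }

  // The optional hostname list precedes `##`, separated by commas.
  const hostPart = line.slice(0, separator);
  const hostnames = hostPart === '' ? [] : hostPart.split(',');
  const rest = line.slice(separator + 2);

  if (rest.startsWith('+js(') && rest.endsWith(')')) {
    // Scriptlet injection, e.g. example.com##+js(nobab)
    return { hostnames, kind: 'scriptlet', expression: rest.slice(4, -1) };
  }
  if (rest.startsWith('^')) {
    // HTML filter, e.g. example.com##^script:has-text(pattern)
    return { hostnames, kind: 'html', expression: rest.slice(1) };
  }
  // Plain element hiding, e.g. example.com##.ads
  return { hostnames, kind: 'hide', expression: rest };
}
```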
It is important to note that cosmetics will not actually improve the privacy protection or provide a speed benefit; they are mostly used to hide ads, remove clutter from pages or circumvent counter-measures deployed by websites to prevent adblocking from taking place.
In this article, we will focus on the network filtering aspect of adblocking, for it is both the primary measure involved in blocking ads and the most CPU intensive one. Efficient element hiding would deserve its own separate article. Let us now dig deeper into the realm of network filtering engines…
What does ‘efficient’ mean?
When it comes to qualifying adblockers, adjectives such as fast, efficient or advanced are often used. Unfortunately, these are not metrics which can be employed to compare adblockers. Instead, we need to define objective dimensions along which contenders can be measured. We already proposed a few of these in our adblockers performance study:
- Time to match a network request.
- Time to initialize the adblocker (cold start).
- Amount of memory used while operating.
These are some of the most relevant and widely used metrics to measure the performance of network filtering[5]. However, there are other important aspects with the potential to impact the user experience in some ways:
- The size of cached data structures used for faster startup—when supported by the adblocker—needs to be considered when optimizing startup time, especially on mobile devices with poor IO performance.
- Peak memory allocation during initialization is also often ignored, but is very relevant. It represents the maximum amount of memory used while initializing the adblocker[6]; if it is too high, it could get dangerously close or even exceed the memory available on some low-end mobile devices—the most common amount of RAM for an Android phone in 2017 was 1GB, some devices even running on 512MB or less.
It is important to understand at this point that any adblocker design will strike a balance between these metrics. The goal is to be as good as possible in all of them, but being optimal everywhere is extremely difficult.
To illustrate this trade-off, consider that it is possible to trade extra memory usage against increased speed—by implementing a caching layer, for example (though this is only one way). As a consequence, it would be fairly easy to create a fast adblocker requiring a high amount of memory to operate; this would not be desirable. A good adblocker should be both fast and memory efficient.
Challenges of Network Filtering
For the purpose of blocking ads, network filtering means deciding whether to block or not to block each request triggered while loading web pages. Any of the tens of thousands of filters could potentially match a request. Page loads involve a median of 75 requests, which means that matching needs to be implemented efficiently to limit its overhead.
Also consider that while loading a page, many components of the browser are competing for resources: parsing HTML, evaluating JavaScript, rendering the page, but also other privacy protections such as anti-tracking or anti-phishing, and of course, the adblocker.
Finally, the engine powering Cliqz’s adblocker is required to run on a wide variety of devices, from the slowest mobile phone[7] to the most high-end desktop computer.
To solve the challenge of network filtering, multiple approaches can be explored. Let us briefly describe two of them, before detailing our current design.
Linear Scan
For the sake of completeness, but also to get a sense of the scale of this problem, let us first consider the most naive solution: a linear scan through all filters, for every single request.
For each request, we would need to iterate through ~60k filters—this is close to the amount of network filters enabled in the Cliqz browser by default, and happens to be the number of filters used in our benchmarks as well. Given our statistic of 75 requests per page, we now need to evaluate 4.5 million rules for each page on average (sometimes much more).
To evaluate this approach, we integrated it into our benchmarks and obtained an average decision time of 5.7 milliseconds per request—which already required that the evaluation of each rule be heavily optimized. This would translate into half a second on our average web page on a fairly high-end laptop—in an ideal scenario where all CPU resources are available for filtering requests—, but up to a few seconds for a mobile phone. This alone is enough of a reason to discard the approach completely.
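The cost of the naive approach can be sketched as follows. The `FilterPredicate` type and `shouldBlockLinear` function are hypothetical names, and each filter is modeled as an opaque predicate on the URL, which is a simplification; the constants are the figures quoted above.

```typescript
// Hypothetical model: a filter is any function deciding if a URL matches.
type FilterPredicate = (url: string) => boolean;

// Naive matching: evaluate filters one by one until one of them matches.
function shouldBlockLinear(filters: FilterPredicate[], url: string): boolean {
  for (const matches of filters) {
    if (matches(url)) {
      return true; // a single matching filter is enough to block
    }
  }
  return false; // worst case: every single filter was evaluated
}

// Back-of-the-envelope cost, using the numbers from the text.
const FILTER_COUNT = 60000;
const REQUESTS_PER_PAGE = 75;
const MS_PER_REQUEST = 5.7;

// 60,000 filters x 75 requests = 4.5 million evaluations per page.
const evaluationsPerPage = FILTER_COUNT * REQUESTS_PER_PAGE;

// 5.7 ms x 75 requests is roughly 430 ms, i.e. close to half a second.
const millisecondsPerPage = MS_PER_REQUEST * REQUESTS_PER_PAGE;
```

Since most requests are not blocked, the worst case (scanning every filter without finding a match) is also the common case for this approach.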
Hierarchical Index
There are two key observations to be made when looking at network filtering: (1) most requests are not blocked—only 20% were blocked in our study—, and (2) it is enough to find a single filter that triggers[8] for a request to be blocked. This means that a good approach needs to quickly discard filters with no chance to match a given request and only inspect a small set of promising candidates.
One way to achieve that is to rely on a hierarchical data structure which allows quickly finding a good set of filter candidates sharing some similarity with an input request. In our context this idea can be applied by leveraging different dispatching criteria such as type of the request (image, script, etc.), domain of page, domain of request and finally, tokens (alphanumeric substrings extracted from both request URLs and filter patterns which are used to create a reverse index). Using all of these, we end up with a tree where each level is used to dispatch according to one particular dimension (e.g. type of requests), and where each leaf contains a subset of the filters which match the conditions of this particular “path” in the tree (from root to leaf). Given a network request, we can then traverse the tree to collect “compatible” candidate filters.
Let us give an example to illustrate what this data structure would look like with a few hand-picked filters. For simplicity’s sake, we will only make use of the domain of the page, domain of the request, type of the request, and a single token from the filter to dispatch. For now let us assume that we have a way to select a good token for each filter; we will explain in the next section how this can be done optimally. Given the following filters and their associated token:
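- (1) /ads/ (token: ads)
- (2) /ads- (token: ads)
- (3) /pixel.$image (token: pixel)
- (4) /1x1.png|$image (token: png)
- (5) ||example.com/ads.js$script (token: example)
- (6) ||tracker.com$domain=example.com (token: tracker)
- (7) /tracking/$script,domain=foo.com (token: tracking)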
We visualize the resulting tree as a table where each line is a path from root to leaf. The left-most column is the root of the tree, where we start from to match a network request. Matching is then performed by visiting each level of the tree (i.e. column) from left to right, and branching into the correct sub-tree each time, based on the request information. The result from a lookup is composed of all the filters found in any leaf (i.e. right-most column) of the tree which are compatible with the request. Note that the special value any means that we need to visit the sub-tree regardless of the value from this same attribute in the request; consequently, a lookup will potentially traverse several parallel sub-trees and consider the filters from multiple leaves.
Root | Page Domain | Request Domain | Type | Token | Filters |
---|---|---|---|---|---|
* | any | any | any | ads | (1) and (2) |
* | any | any | image | pixel | (3) |
* | any | any | image | png | (4) |
* | any | example.com | script | example | (5) |
* | example.com | tracker.com | any | tracker | (6) |
* | foo.com | any | script | tracking | (7) |
Let us take an example request and walk through the lookup algorithm quickly using the above tree (i.e. table). Given a request of type script with URL https://example.com/scripts/foo.js on a page from domain example.com, we first extract tokens from its URL which we will use at the Token stage of the dispatching table: https, example, com, scripts, foo, and js. We then start at the root and iteratively recurse through the tree to narrow down the list of candidate filters. To visualize this process, we list the filters which are considered at each step:
- Root: all filters are considered: (1), (2), …, (7).
- Page Domain: we branch into all lines except the foo.com line, which discards (7). Remaining: (1) to (6).
- Request Domain: we branch into all remaining lines except the tracker.com line, which discards (6). Remaining: (1) to (5).
- Type: we branch into the first line (any) and the fourth line (script), which discards (3) and (4). Remaining: (1), (2) and (5).
- Token: we branch only into example, which discards (1) and (2). Remaining: (5).
It is important to realise that up to this point, we did not have to look at the filters themselves; the information contained in the index was enough to identify a very small subset of filters which have a chance to match the request (i.e. in this case, only one filter). These candidates are then individually evaluated[9] against the request to find out if they match.
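The lookup walked through above can be sketched with nested maps, one level per dispatching criterion. All names below (`IndexPath`, `NetworkRequest`, `buildTree`, `lookupCandidates`) are hypothetical and the structure is a simplified illustration of the idea rather than the original implementation; filters are represented by their number from the list above.

```typescript
// Hypothetical sketch of the hierarchical index (not the original code).
const ANY = 'any';

type TokenLevel = Map<string, number[]>; // token -> filter numbers
type TypeLevel = Map<string, TokenLevel>; // request type -> tokens
type RequestDomainLevel = Map<string, TypeLevel>; // request domain -> types
type Tree = Map<string, RequestDomainLevel>; // page domain -> request domains

interface IndexPath {
  pageDomain: string;
  requestDomain: string;
  type: string;
  token: string;
  filters: number[];
}

interface NetworkRequest {
  url: string;
  type: string;
  pageDomain: string;
  requestDomain: string;
}

function getOrCreate<V>(map: Map<string, V>, key: string, make: () => V): V {
  const existing = map.get(key);
  if (existing !== undefined) {
    return existing;
  }
  const created = make();
  map.set(key, created);
  return created;
}

// Each path corresponds to one line of the table (root to leaf).
function buildTree(paths: IndexPath[]): Tree {
  const tree: Tree = new Map();
  for (const path of paths) {
    const requestLevel = getOrCreate(tree, path.pageDomain, () => new Map<string, TypeLevel>());
    const typeLevel = getOrCreate(requestLevel, path.requestDomain, () => new Map<string, TokenLevel>());
    const tokenLevel = getOrCreate(typeLevel, path.type, () => new Map<string, number[]>());
    getOrCreate(tokenLevel, path.token, () => []).push(...path.filters);
  }
  return tree;
}

// Extract unique alphanumeric tokens from a URL.
function tokenizeUrl(url: string): string[] {
  const tokens = url.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
  return tokens.filter((t, i) => tokens.indexOf(t) === i);
}

// At every level, visit both the `any` branch and the branch of the request.
function lookupCandidates(tree: Tree, request: NetworkRequest): number[] {
  const tokens = tokenizeUrl(request.url);
  const candidates: number[] = [];
  for (const pageKey of [ANY, request.pageDomain]) {
    const requestLevel = tree.get(pageKey);
    if (requestLevel === undefined) continue;
    for (const requestKey of [ANY, request.requestDomain]) {
      const typeLevel = requestLevel.get(requestKey);
      if (typeLevel === undefined) continue;
      for (const typeKey of [ANY, request.type]) {
        const tokenLevel = typeLevel.get(typeKey);
        if (tokenLevel === undefined) continue;
        for (const token of tokens) {
          const leaf = tokenLevel.get(token);
          if (leaf !== undefined) candidates.push(...leaf);
        }
      }
    }
  }
  return candidates;
}

// The dispatching table from the example.
const exampleTree = buildTree([
  { pageDomain: ANY, requestDomain: ANY, type: ANY, token: 'ads', filters: [1, 2] },
  { pageDomain: ANY, requestDomain: ANY, type: 'image', token: 'pixel', filters: [3] },
  { pageDomain: ANY, requestDomain: ANY, type: 'image', token: 'png', filters: [4] },
  { pageDomain: ANY, requestDomain: 'example.com', type: 'script', token: 'example', filters: [5] },
  { pageDomain: 'example.com', requestDomain: 'tracker.com', type: ANY, token: 'tracker', filters: [6] },
  { pageDomain: 'foo.com', requestDomain: ANY, type: 'script', token: 'tracking', filters: [7] },
]);
```

For the example request (type script, URL https://example.com/scripts/foo.js, page domain example.com), `lookupCandidates` returns only filter (5), which is then the single filter left to evaluate against the request.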
In practice, this approach is able to discard more than 99% of all filters which would not have any chance to match the input request. This is the single most effective mechanism we found to speed up adblocking: doing less work.
But we are not quite done yet! This approach still has a few drawbacks. In particular, although it is very fast, it also uses a lot of memory due to the complexity of the dispatching data structure. It is also harder to efficiently serialize and deserialize. The first implementation of Cliqz’s adblocker was based on a variation of this architecture. At the time we relied on the built-in JSON.parse and JSON.stringify for caching (it was the fastest way we found), which led to a serialized size of 30-40MB (un-compressed). The memory usage was also in the ballpark of 30-40MB, which was not suitable for mobile devices. We needed something else, something more… compact…
Current Design—Flattening the tree
To go further, we wanted to keep the runtime speed of the hierarchical approach, but improve both its memory usage and speed of serialization (i.e. the time it takes to create a cached version of the engine as a string or bytes array from its in-memory representation) and deserialization of the engine back from said cache. This was critical for the success of our new Android browser, which we wanted to run fast even on low-end devices.
Firstly, we realized that most of the dispatching could be done using tokens only, and that removing all other layers (page domain, request domain and type) did not hurt the performance too much. Secondly, instead of indexing each filter once for each token (e.g. /foo/bar- would be indexed using both foo and bar), it was enough to use one good token for each filter; this assumption holds because, to get a match between a filter and a URL, all tokens from the filter need to also appear in the URL. This would simplify the lookup logic since we now had the guarantee that each filter could only appear once in the final set of candidates, whereas we previously had to rely on a set to avoid evaluating the same filter multiple times for a given request.
Picking the best token for each filter
Since this approach now relied exclusively on tokens to identify candidate filters, it was crucial to find a good way to pick the best token for each filter. In our case, this translates into finding the most discriminative, or rarest token. In other words, we always index a filter using the token that was least seen in other filters, leading to most “buckets” containing a single filter (i.e. indexed with a token which was not seen in any other filter). To implement this approach, we first tokenize all filters, then compute a histogram of frequencies where we keep track of the number of occurrences of each, and finally, use this information to pick the token for each filter.
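The selection procedure described above (tokenize, build a histogram, pick the rarest token) can be sketched as follows. The function names are hypothetical, and the tokenizer is deliberately naive: it ignores subtleties such as tokens adjacent to wildcards, anchors and filter options, which a real implementation has to handle.

```typescript
// Hypothetical sketch: pick the rarest token for each filter pattern.
function tokenizePattern(pattern: string): string[] {
  return pattern.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

function pickRarestTokens(patterns: string[]): Map<string, string> {
  // Step 1: tokenize all filters.
  const tokenized = patterns.map((pattern) => ({
    pattern,
    tokens: tokenizePattern(pattern),
  }));

  // Step 2: count the number of occurrences of each token.
  const histogram = new Map<string, number>();
  for (const { tokens } of tokenized) {
    for (const token of tokens) {
      histogram.set(token, (histogram.get(token) ?? 0) + 1);
    }
  }

  // Step 3: for each filter, keep the token seen least often overall.
  const chosen = new Map<string, string>();
  for (const { pattern, tokens } of tokenized) {
    let best = ''; // stays empty when the filter has no token at all
    let bestCount = Infinity;
    for (const token of tokens) {
      const count = histogram.get(token) ?? 0;
      if (count < bestCount) {
        best = token;
        bestCount = count;
      }
    }
    chosen.set(pattern, best);
  }
  return chosen;
}
```

Given the patterns `/ads/`, `/ads-` and `/ads/pixel.`, the token `ads` is seen three times and `pixel` once, so the third pattern ends up indexed under `pixel`, alone in its bucket.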
We now had a much simpler data structure: basically a map or hash table, which we implemented using JavaScript’s Map, where keys were tokens and values were buckets of filters containing this particular token. Unfortunately, removing the dispatch on domains and request types introduced a few corner cases, with some buckets containing a lot of filters. In particular:
- Many filters did not have any token at all, e.g. $script,domain=zdnet.fr,3p.
- Many filters were indexed with very common tokens, e.g. |https://$domain=Y, indexed using https.
This meant that each request still had to be evaluated against a fairly high number of filters (around 100-200), which was not optimal. To circumvent this issue, we had to perform a conceptual shift. Instead of only restricting tokens to be substrings of filter patterns, we could use any information attached to a filter, e.g. domain of the page or type of targeted request. In this sense, we were getting back to the kind of dispatch that our previous hierarchical data structure was capable of, while keeping it flat. The only requirement was that while matching requests, we also had to perform lookups with the same “virtual tokens”: when matching a request from a page of domain example.com, we would use tokens from the URL, as well as example.com itself, to find filters in our index. Using this trick, we were able to reduce the size of our biggest buckets drastically, and matching a request would now only require inspecting around 10 candidate filters on average, instead of the tens of thousands present in the index.
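A minimal sketch of such a flat index with “virtual tokens” could look like the following. The `SimpleFilter` shape and function names are assumptions made for this example; each filter carries at most one page domain, and subdomain handling is left out, so this only illustrates the indexing idea.

```typescript
// Hypothetical sketch of a flat index using "virtual tokens".
interface SimpleFilter {
  id: number;
  pattern: string;
  pageDomain?: string; // from a `domain=` option, when present
}

function tokenizeText(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Keys usable to index a filter: pattern tokens plus its page domain.
function indexKeys(filter: SimpleFilter): string[] {
  const keys = tokenizeText(filter.pattern);
  if (filter.pageDomain !== undefined) {
    keys.push(filter.pageDomain); // the "virtual token"
  }
  return keys;
}

function buildFlatIndex(filters: SimpleFilter[]): Map<string, number[]> {
  const histogram = new Map<string, number>();
  for (const filter of filters) {
    for (const key of indexKeys(filter)) {
      histogram.set(key, (histogram.get(key) ?? 0) + 1);
    }
  }

  const index = new Map<string, number[]>();
  for (const filter of filters) {
    let best = ''; // filters without any key land in a catch-all bucket
    let bestCount = Infinity;
    for (const key of indexKeys(filter)) {
      const count = histogram.get(key) ?? 0;
      if (count < bestCount) {
        best = key;
        bestCount = count;
      }
    }
    const bucket = index.get(best);
    if (bucket === undefined) {
      index.set(best, [filter.id]);
    } else {
      bucket.push(filter.id);
    }
  }
  return index;
}

// Look up with URL tokens, the page domain, and the catch-all bucket.
function findCandidates(index: Map<string, number[]>, url: string, pageDomain: string): number[] {
  const keys = tokenizeText(url).concat([pageDomain, '']);
  const found: number[] = [];
  keys.forEach((key, i) => {
    if (keys.indexOf(key) !== i) return; // skip duplicated keys
    const bucket = index.get(key);
    if (bucket !== undefined) found.push(...bucket);
  });
  return found;
}
```

In this sketch a filter such as `|https://$domain=example.com` is no longer stored in the crowded `https` bucket: its page domain is rarer, so it is indexed under `example.com` and only considered for requests emitted from that page domain.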
Although our new approach was much simpler than before (which is always good), serializing to cache and initializing the engine back to its in-memory representation still required a lot of work and memory copies. Serialization consisted of “stringifying” our big instance of Map by creating an intermediary object and then using JSON.stringify; loading it back required performing the inverse process. This was still expensive…
Going low-level with typed arrays
What if we could use the same memory representation for both the engine and the cache? This would allow us to not have to copy any data or rely on the expensive “stringification” process. As an added benefit, we would not have to pay the cost of deserializing all filters, while only a small fraction are needed in practice[10]. If this sounds familiar, it is because this technique is already used in performance-critical code, mostly in low-level languages like C, C++ or Rust—it is much less common for JavaScript projects, which cannot use mechanisms like mmap.
In particular, we had to find a new data structure which would be both representable as a single typed array (i.e. Uint8Array in JavaScript), and allow matching requests without having to load the data contained in the array into another in-memory representation beforehand (which was the case with the previous algorithm using a Map). Such a data structure would make it possible to start matching requests using the same representation used for caching, and to lazily load the small subset of filters required. This shift required a few drastic changes in the code base:
- All classes (i.e. ReverseIndex and NetworkFilter) had to become “serializable” to a typed array. To ease this process, we implemented our own DataView abstraction which made it possible to quickly dump any data to an array and load it back in its original form.
- Using this DataView abstraction, we came up with a new compact data structure to represent the index, stored as a single Uint8Array, containing all filters organized into buckets, grouped by token (i.e. the buckets presented above); conceptually mimicking the instance of Map that we aimed to replace.
- We also had to come up with a new way to perform fast lookups on this custom compact data structure to replace Map#get. In essence, we were implementing a custom hash table using typed arrays as memory storage. This algorithm would probably deserve its own article, but let us describe it quickly. In a nutshell, the array is divided into three areas: (1) a “filters store”, where all filters are represented in their serialized, compact format, (2) a “buckets store”, containing pairs of <token, filter>, grouped by token, and associating each token to the offsets of serialized filters from the “filters store”, and (3) a “jump table” used to quickly find the area of the array where buckets associated with a given token are located. Looking a token up in the index now consists of using the jump table to identify the list of buckets associated with the token, iterating through said buckets with a linear scan to collect the list of indices from the “filters store” where candidates are located, then deserializing the subset of filters from their compact representation into instances of the NetworkFilter class, for matching the request. This description purposefully omits most of the low-level details of the data structure and we encourage the curious reader to check the implementation directly.
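To give a feel for the three areas described above, here is a heavily simplified sketch of a hash table stored in a single Uint8Array. The byte layout, the hash function and all names (`IndexEntry`, `hashToken`, `buildCompactIndex`, `lookupCompactIndex`) are assumptions made for this illustration and do not reflect the actual format used by the library; filters are stored as raw ASCII strings shorter than 65,536 characters.

```typescript
// Hypothetical layout (all integers big-endian, as written by DataView):
//   [tableSize: u32][pairCount: u32]
//   [jump table: (tableSize + 1) x u32]
//   [buckets store: pairCount x (tokenHash: u32, filterOffset: u32)]
//   [filters store: for each filter, (length: u16, ASCII bytes)]
interface IndexEntry {
  token: string;
  filter: string;
}

const HEADER_BYTES = 8;

function hashToken(token: string): number {
  let h = 5381;
  for (let i = 0; i < token.length; i++) {
    h = ((h * 33) ^ token.charCodeAt(i)) >>> 0;
  }
  return h;
}

function buildCompactIndex(entries: IndexEntry[], tableSize: number): Uint8Array {
  // Group pairs by slot so that each slot is a contiguous range.
  const sorted = entries
    .map((e) => ({ hash: hashToken(e.token), filter: e.filter }))
    .sort((a, b) => (a.hash % tableSize) - (b.hash % tableSize));

  // All sizes are known upfront: a single allocation is enough.
  const jumpBytes = (tableSize + 1) * 4;
  const bucketBytes = sorted.length * 8;
  let filterBytes = 0;
  for (const e of sorted) filterBytes += 2 + e.filter.length;

  const bytes = new Uint8Array(HEADER_BYTES + jumpBytes + bucketBytes + filterBytes);
  const view = new DataView(bytes.buffer);
  view.setUint32(0, tableSize);
  view.setUint32(4, sorted.length);

  const bucketsStart = HEADER_BYTES + jumpBytes;
  let filterOffset = bucketsStart + bucketBytes;
  const counts = new Uint32Array(tableSize + 1);

  sorted.forEach((e, i) => {
    view.setUint32(bucketsStart + i * 8, e.hash);
    view.setUint32(bucketsStart + i * 8 + 4, filterOffset);
    view.setUint16(filterOffset, e.filter.length);
    for (let c = 0; c < e.filter.length; c++) {
      view.setUint8(filterOffset + 2 + c, e.filter.charCodeAt(c));
    }
    filterOffset += 2 + e.filter.length;
    counts[(e.hash % tableSize) + 1] += 1;
  });

  // Jump table: entry `s` is the index of the first pair stored in slot `s`.
  let running = 0;
  for (let slot = 0; slot <= tableSize; slot++) {
    running += counts[slot];
    view.setUint32(HEADER_BYTES + slot * 4, running);
  }
  return bytes;
}

function lookupCompactIndex(bytes: Uint8Array, token: string): string[] {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const tableSize = view.getUint32(0);
  const bucketsStart = HEADER_BYTES + (tableSize + 1) * 4;
  const hash = hashToken(token);
  const slot = hash % tableSize;
  const first = view.getUint32(HEADER_BYTES + slot * 4);
  const last = view.getUint32(HEADER_BYTES + (slot + 1) * 4);

  const found: string[] = [];
  for (let i = first; i < last; i++) {
    // Linear scan of the slot; only matching pairs are deserialized.
    if (view.getUint32(bucketsStart + i * 8) !== hash) continue;
    const offset = view.getUint32(bucketsStart + i * 8 + 4);
    const length = view.getUint16(offset);
    let filter = '';
    for (let c = 0; c < length; c++) {
      filter += String.fromCharCode(view.getUint8(offset + 2 + c));
    }
    found.push(filter);
  }
  return found;
}
```

Lookups read directly from the byte array and only decode the filters of the requested bucket. Two different tokens sharing the same 32-bit hash would produce extra candidates in this sketch, which is harmless as long as candidates are still evaluated against the request afterwards.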
This new compact data structure allowed us to only ever deserialize the “useful subset of filters”, needed to match requests from visited domains. This implementation was slightly slower than JavaScript’s Map, though, and to make sure we only pay the cost once, we introduced a small caching layer on top to keep deserialized filters ready for future use; making the overhead negligible.
There was one last catch with this approach, though, as we had to find a way to allocate the right amount of memory for the array while initializing the adblocker engine for the first time. Given a list of filters, we do not know a priori the size of the final typed array (it depends on the serialized size of all filters as well as the size of the index metadata needed to implement the “compact map”). We have used three approaches over time:
- Allocate a big array upfront (of a size bigger than the maximum expected), use it to store the compact index, then resize it to the exact size in the end, once this value is known. This approach has two downsides: (1) it wastes memory because we need to overallocate, and (2) it is not able to recover if more memory than expected is needed (e.g. if more filters are used).
- Dynamically resize the typed array when the current size is exceeded. This is a classic approach to solve this problem (e.g. used for the dynamic vector data structure). The downside is that we need to check if size is exceeded for each “store operation”, which can become expensive.
- The current approach consists in estimating the exact size of the compact data structure upfront, based on the size of each filter indexed as well as the expected size of the index metadata itself. The downside of this approach is that it is more complex, but it performs exactly one allocation and no copy, which is great for performance, and it can handle an arbitrarily high number of filters.
Our adblocker engine now consists of a single typed array (of around 3-7 MB depending on the number of filters enabled), as well as a few lightweight views on top of it to allow performing lookups with a high-level API. A nice benefit of this architecture is that the total amount of memory used by the adblocker is fairly predictable, as it roughly equals the size of the typed array (memory usage can increase slightly while matching due to the small caching layer keeping used filters in-memory, but this is optional). Serializing the engine to cache now consists in storing the inner array on disk without having to perform any copy beforehand, and initializing the adblocker back from this cache only requires instantiating a few views (i.e. instantiating some classes, which is very cheap) on top of the array. This leads to a virtually instantaneous loading of the engine; we like to think of this tactic as “mmap for JavaScript”.
Reducing size further with string compression. More recently, we realized that half of the memory used by the adblocker could be attributed to raw strings from filter patterns. We need to keep these around to be able to match them against requests. To reduce this overhead, we implemented a custom “small strings compressions” algorithm, inspired by smaz, as part of the adblocker. It was trained on real data to be able to learn how to compress adblocker filters efficiently. This method allowed us to reduce the size of the typed array (hence, memory usage) by an extra 25%.
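The general idea behind codebook-based compression of small strings can be sketched as follows. This is not the algorithm shipped in the adblocker: the codebook below is hand-picked for the example rather than trained on real filters, the names are hypothetical, and input is assumed to be ASCII.

```typescript
// Hypothetical codebook: frequent substrings are replaced by a single byte.
const CODEBOOK = ['https://', 'example', '.com', '/ads', 'script', 'domain=', '$', '||', '.js', '/'];

// Escape byte: the next byte is a literal character, not a codebook index.
const LITERAL = 255;

function compressSmallString(text: string): Uint8Array {
  const out: number[] = [];
  let i = 0;
  while (i < text.length) {
    // Greedily pick the longest codebook entry matching at position `i`.
    let bestIndex = -1;
    let bestLength = 0;
    for (let index = 0; index < CODEBOOK.length; index++) {
      const entry = CODEBOOK[index];
      if (entry.length > bestLength && text.startsWith(entry, i)) {
        bestIndex = index;
        bestLength = entry.length;
      }
    }
    if (bestIndex !== -1) {
      out.push(bestIndex);
      i += bestLength;
    } else {
      out.push(LITERAL, text.charCodeAt(i) & 0xff);
      i += 1;
    }
  }
  return Uint8Array.from(out);
}

function decompressSmallString(bytes: Uint8Array): string {
  let text = '';
  for (let i = 0; i < bytes.length; i++) {
    const byte = bytes[i];
    if (byte === LITERAL) {
      i += 1; // the following byte is a literal character
      text += String.fromCharCode(bytes[i]);
    } else {
      text += CODEBOOK[byte];
    }
  }
  return text;
}
```

The approach pays off when the codebook covers substrings which are frequent in the data, which is why training it on real filters matters; characters not covered by the codebook cost two bytes instead of one in this sketch.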
How you can use it in your own project
Cliqz’s adblocker employs some complex tricks under the hood to achieve state of the art performance. Yet, we put a lot of care into making sure that using the project is as easy as possible. The adblocker is distributed via platform-specific packages allowing an easy integration in any browser extension, Electron app or Puppeteer script. If you do not know which blocklists to use, we offer three presets to get started and cover most use cases:
- Blocking ads only.
- Blocking ads and trackers.
- Blocking ads, trackers and ‘annoyances’.
For full documentation, feel free to check the GitHub page of the project. We summarize below how to get started for each platform with minimal code snippets.
Browser Extension
To integrate Cliqz’s adblocker in a browser extension, install @cliqz/adblocker-webextension from NPM then include this snippet in your background.js script:

import { WebExtensionBlocker } from '@cliqz/adblocker-webextension';

WebExtensionBlocker.fromPrebuiltAdsAndTracking().then((blocker) => {
  blocker.enableBlockingInBrowser();
});
Electron
For Electron, we use defaultSession to attach the adblocker. You also need an implementation of fetch to download blocklists. We recommend installing cross-fetch alongside @cliqz/adblocker-electron then:

// Assuming global 'session' from Electron is available.
import { ElectronBlocker } from '@cliqz/adblocker-electron';
import fetch from 'cross-fetch'; // required 'fetch'

ElectronBlocker.fromPrebuiltAdsAndTracking(fetch).then((blocker) => {
  blocker.enableBlockingInSession(session.defaultSession);
});
Puppeteer
Getting started in Puppeteer is also a breeze, with only a few lines needed to enable the full-blown adblocker. We assume that a page was already created and can be used in this scope. Install cross-fetch and @cliqz/adblocker-puppeteer, then add the following to your script:
// Assuming global 'page' from Puppeteer is available.
import { PuppeteerBlocker } from '@cliqz/adblocker-puppeteer';
import fetch from 'cross-fetch'; // required 'fetch'
PuppeteerBlocker.fromPrebuiltAdsAndTracking(fetch).then((blocker) => {
  blocker.enableBlockingInPage(page);
});
Conclusion & Future Perspectives
Cliqz’s adblocker has now been running in production for three years. During this time, we evolved it from a desktop-browser-only feature to a core part of our products, for both mobile and desktop. Its efficiency has been continuously improved, up to the point where it became the benchmark in terms of speed. This could look like an “end of the road”, but there are many challenges ahead…
Firstly, the advertising industry is fighting back and continuously finding new ways to circumvent blocking: by increasingly relying on first parties to serve tracking or advertisement scripts, randomizing HTML attributes to obfuscate the DOM, or using CNAME cloaking to hide third-party domains. To combat each of these, adblockers need to invent new ways to protect users, and they can only do so by leveraging the tools and APIs provided by the platforms they operate on (usually a browser or an operating system).
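To make the CNAME cloaking case more concrete, here is a small sketch of how a blocker could "uncloak" a hostname when the platform lets it perform DNS lookups. This is an illustration only, not how any particular adblocker does it. The helper names (`uncloak`, `isCloakedThirdParty`, `naiveDomain`) are hypothetical, and the resolver is injected as a function so that the logic does not depend on a specific platform API; in Node.js, `dns.promises.resolveCname` has a compatible shape.

```typescript
// Resolver returning the CNAME targets of a hostname (empty or rejected if none).
type ResolveCname = (hostname: string) => Promise<string[]>;

// Naive "registrable domain": the last two labels. This is a simplification;
// real code should rely on the Public Suffix List.
function naiveDomain(hostname: string): string {
  return hostname.split('.').slice(-2).join('.');
}

// Follows the CNAME chain (up to 'maxDepth' hops) and returns the final name.
async function uncloak(
  hostname: string,
  resolveCname: ResolveCname,
  maxDepth: number = 5,
): Promise<string> {
  let current = hostname;
  for (let depth = 0; depth < maxDepth; depth += 1) {
    const targets = await resolveCname(current).catch((): string[] => []);
    if (targets.length === 0) break;
    current = targets[0];
  }
  return current;
}

// A request is "cloaked" if it looks first-party to the page, but its
// canonical name actually belongs to another domain.
async function isCloakedThirdParty(
  requestHostname: string,
  pageHostname: string,
  resolveCname: ResolveCname,
): Promise<boolean> {
  const pageDomain = naiveDomain(pageHostname);
  if (naiveDomain(requestHostname) !== pageDomain) return false;
  const canonical = await uncloak(requestHostname, resolveCname);
  return naiveDomain(canonical) !== pageDomain;
}
```

Once the canonical name is known, it can be matched against the usual network filters as if the request had been made to it directly. Note that this only works where the platform exposes DNS resolution to extensions, which is exactly the kind of dependency on platform APIs described above.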
Secondly, platforms are becoming increasingly hostile to adblockers, and to privacy protection in general. The fast pace of development required to keep up with the advertising industry is only possible if platforms provide the necessary flexibility. Unfortunately, the trend is going in the opposite direction, with changes like Manifest v3, or closed ecosystems like Apple’s operating systems and browser, which offer only limited and unsatisfactory extensibility. At the time of writing, it seems that only Mozilla is willing to empower developers to build the tools necessary for a safer Web experience.
Not everything looks bad, though! We see awareness rising around the topics of privacy and data collection. Initiatives like the Ad-Blocker Developer Summit, frequent collaborations between adblocker developers, and new companies deciding to ride the wave of privacy are all reasons to look ahead with hope. No matter what, we will keep fighting for a better Internet.
Footnotes
[1] We did not keep track of all projects using the library, but to illustrate their diversity, here are a few examples: implementing the adblocker in Cliqz browser for desktop and Android, targeting WebExtension or React Native environments, Firefox and Chrome browsers, but also Puppeteer (e.g. to speed up crawlers), Electron applications as well as backend processing jobs running in Node.js.
[2] These blocklists are operated by hundreds of volunteers, in communities such as Easylist, uBlock Origin or AdGuard. They are the undercover heroes who make blocking ads at scale possible. It should be noted, though, that due to the great complexity of the task at hand, they are subject to noise. Our ‘cliqz.com’ domain is still part of some minor blocklists, an issue that we are currently trying to address.
[3] Based on the original filter syntax created by Adblock Plus, other adblockers have since implemented extended capabilities to make filtering more flexible and powerful. Feel free to read the AdGuard knowledge base, uBlock Origin wiki or the Cliqz compatibility matrix for further information.
[4] The implementation of procedural filters in Cliqz is currently a work in progress and will be deployed in early 2020. It should be noted, though, that they represent a very small fraction of all cosmetic filters (about 0.6%) and are mostly used to fine-tune the behavior of the adblocker on specific websites.
[5] Since the study was published, both our requests dataset and benchmarking code have been used extensively by the most popular adblockers (e.g. uBlock Origin, Adblock Plus, Brave) to measure and improve their own performance.
[6] We showed at the Adblocker Dev Summit last September that peak memory usage while initializing an adblocker from blocklists could be up to 25x higher than when loading from cache (see the figure on page 5 of the slides).
[7] The latest iteration of our efficient filtering engine design was motivated by the development of a new Android browser, which should run smoothly even on the slowest phone we could find; internally, we called it “potato phone”.
[8] This is a slight over-simplification of reality for the purpose of the explanation. One thing we did not mention is that there are special filters such as exceptions or important rules which have different priorities and need to be handled accordingly.
[9] Although it is not covered in detail in this article, the implementation of the matching logic for each filter type is also difficult. In particular, there are many corner cases to handle and the semantics are not always well defined, sometimes depending on how maintainers use the features. Lastly, this matching needs to be as fast as possible. The interested reader can find the implementation on GitHub.
[10] While working on our performance study, we realized that only around 10% of the filters were used once or more on the top 500 most popular domains. This means that only a very small subset of all filters applies to a lot of network requests, and that there is a long tail of filters that apply either rarely or never. Consequently, minimizing the memory and performance overhead of unused filters is critical.