What is the OpenDNS Security Graph?
Security Graph is OpenDNS’s technology that automates protection against both known and emergent threats. It analyzes a cross-section of the world’s Internet activity to observe attacks being staged before they are launched. This predictive intelligence powers our business security product, Umbrella.
Where and how we collect data
As one of the largest global recursive DNS providers, OpenDNS resolves and routes over 60 billion daily Internet requests from 50 million active users. Examining two percent of the world’s Internet activity reveals billions of unique combinations of domain names and origin and destination IP addresses.
Traditional databases cannot store and process the millions of events per second and terabytes of data streaming through our systems. Our infrastructure engineers use Apache Hadoop, Hive, Pig, and other cutting-edge technologies to overcome big-data challenges, just as Google, Netflix, and Amazon do.
But no security provider can do it all. So we cultivated a community of 200 security partners—not just public feeds. For example, some partners share attack attribution data that we correlate with domain names to uncover a deeper understanding of threats.
- What data do we collect at the DNS and BGP layer? The list is very long, but here are some key elements that we collect: domain names requested, IP addresses stored per DNS record, authoritative nameservers per DNS record, other DNS record attributes (e.g. registered owner), the geographic distribution and traffic volume per DNS request, and the BGP routes between IP networks. We can even route the traffic for some domain names through our own proxies to discover malicious content. We track this data historically to observe where attack infrastructures are staged.
- Why do we exchange data with security partners? Our partnerships enable us to correlate attacks with infrastructures that are being mobilized for a developing attack that has not yet been launched. We do not reverse engineer malware to uncover exactly what it does on devices or networks. We also do not create signatures. But other vendors do, which is why we have over 200 partnerships to layer on security intelligence.
Why we use machine learning
We constantly observe new unusual DNS request patterns, atypical domain names, and suspicious DNS record or BGP route changes. The volume is too large for even an army of researchers to analyze by hand. So we hire data scientists to train machines to identify malware, botnets, phishing, and advanced threats linked to this live activity. Machine learning scales the analysis of huge volumes of data by removing humans from the process.
Our customers benefit by discovering and often predicting domains and IP addresses related to threats before antivirus vendors, reputation systems, or sandboxes flag them as malicious. Ultimately, this enables your security team to stay ahead of attacks.
How we use algorithmic classifiers to analyze Big Data sets
To discover patterns and detect anomalies, we design algorithmic classifiers, which are mathematical formulas that categorize and score data. Many classifiers analyze spatial relationships, such as graphing the associations of and connections between networks and systems across the Internet. Some classifiers analyze time-based relationships, such as discovering domain co-occurrences as a result of consecutive DNS requests over very short timeframes, repeated by thousands of users. Other classifiers analyze statistical deviations from normal activity, such as measuring the geographic distribution of IP networks requesting a domain name.
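The time-based idea above can be illustrated with a small sketch. This is not OpenDNS's implementation: the function name `cooccurrence_counts`, the `(client, timestamp, domain)` tuple layout, and the 2-second window are assumptions made purely for illustration. The sketch counts, for each pair of domains, how many distinct clients requested both within the window.

```python
from collections import defaultdict


def cooccurrence_counts(requests, window_seconds=2.0):
    """Count distinct clients that requested both domains of a pair
    within `window_seconds` of each other.

    `requests` is an iterable of (client, timestamp, domain) tuples;
    this layout is a hypothetical one chosen for the example.
    """
    by_client = defaultdict(list)
    for client, timestamp, domain in requests:
        by_client[client].append((timestamp, domain))

    pair_clients = defaultdict(set)
    for client, events in by_client.items():
        events.sort()  # order each client's requests by time
        for i, (t_i, d_i) in enumerate(events):
            for t_j, d_j in events[i + 1:]:
                if t_j - t_i > window_seconds:
                    break  # later requests fall outside the window
                if d_i != d_j:
                    pair_clients[frozenset((d_i, d_j))].add(client)

    return {pair: len(clients) for pair, clients in pair_clients.items()}


requests = [
    ("c1", 0.0, "a.example"), ("c1", 0.5, "b.example"),
    ("c2", 10.0, "a.example"), ("c2", 10.4, "b.example"),
    ("c3", 0.0, "a.example"), ("c3", 60.0, "b.example"),  # too far apart
]
counts = cooccurrence_counts(requests)
```

Here two of the three clients request both domains within the window. A real system would also need a significance test to decide when such a count is high enough to matter, which this sketch leaves out.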
To prevent or contain a threat, our machine learning systems automatically combine and correlate the output of these classifiers to accurately predict whether a domain name or IP address should be blocked. These are some of the algorithmic classifiers we use:
- SecureRank: a classifier based on graph theory that effectively finds domains guilty by association. It creates a large bipartite graph of the Internet, and examines where known compromised systems are going, as well as other systems that are hitting malicious locations. Then this classifier finds other common locations that those systems are visiting and determines which of those locations are malicious. This classifier ranks the security risks of all domain names by applying an iterative process similar to Google’s PageRank algorithm.
- C-Rank: a classifier derived from co-occurrence patterns among domains. Co-occurrence of domains means that a statistically significant number of clients have requested both domains consecutively in a short timeframe.
- Domain Name Bigrams / Trigrams: to identify whether a domain has been created by a domain generation algorithm (DGA), we analyze features like domain name length and character entropy. A DGA-generated domain name typically appears as a random string of letters and numbers.
- Geo-Diversity & Geo-Distance Scores: these classifiers look at where DNS requests originate, and how far away they are from the geolocation of the resolved IP address. They then compare the geographic distribution of DNS requests with the predicted one for the top-level domain (e.g. RU, CN, COM).
- Lexical Feature Scores: these classifiers are used to detect fast flux. Fast flux is a DNS technique that botnet command and control infrastructures use to hide behind a compromised system, which acts like a proxy.
- RIP / Prefix / ASN Reputation Scores: these classifiers compute a score for domains based on the IP addresses in their DNS records. Unlike most reputation systems that are only based on the IP or the domain, we combine them into pairs.
- Popularity & PageRank Scores: we compute a “popularity score” based on the number of distinct origin IP addresses having visited a domain name. This is a Bayesian average, similar to reputation scores.
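As a rough illustration of the guilt-by-association idea described for SecureRank, here is a simplified label-propagation sketch on a client-to-domain bipartite graph. It is not the SecureRank algorithm itself, whose details are not given here. The function name `propagate_risk`, the 0-to-1 score range, the `known_good` seed set, and the averaging update are all assumptions chosen to keep the example small.

```python
from collections import defaultdict


def propagate_risk(visits, known_bad, known_good, iterations=50):
    """Spread risk scores across a bipartite client/domain graph.

    `visits` is an iterable of (client, domain) edges. Known-bad domains
    are pinned at 1.0 and known-good domains at 0.0; every other domain
    takes the average score of the clients that visit it, and every
    client takes the average score of the domains it visits.
    This is an illustrative stand-in, not OpenDNS's actual method.
    """
    clients_of = defaultdict(set)
    domains_of = defaultdict(set)
    for client, domain in visits:
        clients_of[domain].add(client)
        domains_of[client].add(domain)

    def seed(domain):
        if domain in known_bad:
            return 1.0
        if domain in known_good:
            return 0.0
        return None  # unknown domain, score is computed

    domain_score = {d: seed(d) or 0.0 for d in clients_of}
    for _ in range(iterations):
        client_score = {
            c: sum(domain_score[d] for d in ds) / len(ds)
            for c, ds in domains_of.items()
        }
        domain_score = {
            d: seed(d) if seed(d) is not None
            else sum(client_score[c] for c in cs) / len(cs)
            for d, cs in clients_of.items()
        }
    return domain_score


visits = [
    ("infected", "bad.example"), ("infected", "suspect.example"),
    ("clean", "good.example"), ("clean", "news.example"),
    ("mixed", "suspect.example"), ("mixed", "news.example"),
]
scores = propagate_risk(visits, {"bad.example"}, {"good.example"})
```

In this toy graph, `suspect.example` ends up with a higher score than `news.example` because it shares a client with the known-bad domain. Pinning both bad and good seeds keeps the scores from drifting to a single value.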
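The character-entropy feature mentioned for DGA detection can be sketched as Shannon entropy over the characters of a domain label. The function name `char_entropy` and the sample labels are hypothetical, and entropy alone is only one signal; the text above also mentions name length and other features.

```python
import math
from collections import Counter


def char_entropy(label):
    """Shannon entropy, in bits per character, of the characters in `label`."""
    if not label:
        return 0.0
    total = len(label)
    return -sum(
        (n / total) * math.log2(n / total)
        for n in Counter(label).values()
    )


# Hypothetical labels: a dictionary-like name and a random-looking one.
readable = char_entropy("google")
random_looking = char_entropy("xj4k9qz2vb7w")
```

A random-looking label such as the second one tends to score higher than a readable one, though short or unusual legitimate names can also score high, so a threshold would need tuning against real data.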
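For readers unfamiliar with the term, a Bayesian average pulls an observed mean toward a prior value when few observations are available. The text above does not say which quantities are averaged for the popularity score, so the sketch below shows only the generic formula; `bayesian_average`, `prior_mean`, `prior_weight`, and the daily distinct-IP counts are illustrative assumptions.

```python
def bayesian_average(observations, prior_mean, prior_weight):
    """Generic Bayesian average.

    Treats the prior as `prior_weight` pseudo-observations, each equal to
    `prior_mean`, and averages them together with the real observations.
    `prior_weight` must be greater than zero.
    """
    return (prior_weight * prior_mean + sum(observations)) / (
        prior_weight + len(observations)
    )


# Hypothetical example: a domain seen on a single day with 100 distinct
# origin IPs, against a prior of 10 distinct IPs weighted as 9 days.
new_domain = bayesian_average([100], prior_mean=10, prior_weight=9)
no_data = bayesian_average([], prior_mean=10, prior_weight=9)
```

With a single observation the score stays close to the prior, which keeps a brief spike from making a brand-new domain look established.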