What is the OpenDNS Security Graph?
Security Graph is the OpenDNS technology that automates detection of both known and emergent threats. Security Graph analyzes a cross-section of the world’s Internet activity to observe infrastructure being staged before an attack is launched. This predictive intelligence powers Investigate, our threat intelligence product, and Umbrella, our security enforcement product.
How we collect data
As one of the largest global recursive DNS providers, OpenDNS resolves and routes over 70 billion Internet requests daily, from 65 million active consumer and enterprise users across 160+ countries. This diverse data set reveals billions of combinations of domain names and origin and destination IP addresses. We use these combinations to find where attacks are staged and launched, to gauge how widespread an attack is, and even to predict future threats.
Traditional databases cannot store and process the millions of events per second and terabytes of data streaming through our systems in real time. Our infrastructure engineers use Apache Hadoop, Hive, Pig, and other cutting-edge technologies to overcome these big data challenges.
But no security provider can do it all. So we cultivated a community of over 200 security partners, not just public feeds. For example, some partners share attack attribution data that we correlate with domain names to gain a deeper understanding of threats.
- What data do we collect at the DNS and BGP layer? The list is very long, but here are some key elements that we collect: domain names requested, IP addresses stored per DNS record, authoritative nameservers per DNS record, other DNS record attributes (e.g. registered owner), the geographic distribution and traffic volume per DNS request, and the BGP routes between IP networks. We can even route the traffic for some domain names through our own proxies to discover malicious content. We track this data historically to observe where attack infrastructures are staged.
- Why do we exchange data with security partners? Our partnerships enable us to correlate attacks with infrastructures that are being mobilized for a developing attack that has not yet been launched. We do not reverse engineer malware to uncover exactly what it does on devices or networks. We also do not create signatures. But other vendors do, which is why we have over 200 partnerships to layer on security intelligence.
How we analyze data
We constantly observe new and unusual DNS request patterns, atypical domain names, and suspicious DNS records or BGP route changes. The volume of information makes it impossible for even an army of researchers to analyze it all. So we hire data scientists to train machines to identify malware, botnets, phishing, and advanced threats based on this real-time and historical activity.
OpenDNS Security Labs is our team of data scientists, engineers, mathematicians, and security researchers that is constantly innovating on security. Our Security Labs team uses visualization, advanced data mining techniques, and security domain expertise to develop algorithmic classifiers to categorize and score data. These classifiers are used to automatically reveal patterns, detect anomalies, classify malicious domains, and predict future malicious sites. These are some of the techniques used by our Security Labs team:
- Visualization: It’s hard to find patterns by looking through millions of rows of log data. We take a different approach by applying 3D modeling and other visualization techniques to our data. OpenGraphiti is an interactive open-source data visualization engine we created to enable security analysts, researchers, and data scientists to pair visualization with big data to create 3D representations of threats. This helps our Security Labs team quickly identify patterns and relationships that would otherwise be hard to find. Visualization helps us see how the Internet changes over time and see where to dig deeper in our research.
- Data mining: The Security Labs team builds algorithmic classifiers using data mining methods including graph theory, machine learning, and artificial intelligence. We train and tune the classifiers over time, so they automatically analyze and score all of our data.
- Domain expertise: The Security Labs team supplements our machine learning with human intelligence. Not only do our researchers have years of security domain expertise, but we also collaborate with over 200 partners in the security community.
Intelligence gained from Big Data analysis
To discover patterns and detect anomalies, we design algorithmic classifiers to categorize and score data. Many classifiers analyze spatial relationships, such as graphing the associations of and connections between networks and systems across the Internet. Some classifiers analyze time-based relationships, such as discovering domain co-occurrences as a result of consecutive DNS requests over very short timeframes, repeated by thousands of users. Other classifiers analyze statistical deviations from normal activity, such as measuring the geographic distribution of IP networks requesting a domain name.
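As a rough illustration of the time-based relationships described above, the following Python sketch counts how many distinct clients requested two domains close together in time. It is not the production classifier: the `(client_id, timestamp, domain)` event format, the function name `co_occurrence_counts`, and the 5-second window are all assumptions made for this example, and it uses a simple time window rather than strictly consecutive requests.

```python
from collections import defaultdict


def co_occurrence_counts(events, window_seconds=5.0):
    """Count, per domain pair, the distinct clients that requested both
    domains within `window_seconds` of each other.

    `events` is an iterable of (client_id, timestamp_seconds, domain)
    tuples. This layout is hypothetical, chosen only for the sketch.
    """
    # Group each client's requests so pairs are only formed per client.
    by_client = defaultdict(list)
    for client, ts, domain in events:
        by_client[client].append((ts, domain))

    pair_clients = defaultdict(set)
    for client, requests in by_client.items():
        requests.sort()  # order by timestamp
        for i, (ts_i, dom_i) in enumerate(requests):
            for ts_j, dom_j in requests[i + 1:]:
                if ts_j - ts_i > window_seconds:
                    break  # later requests are even further away
                if dom_i != dom_j:
                    pair = tuple(sorted((dom_i, dom_j)))
                    pair_clients[pair].add(client)

    return {pair: len(clients) for pair, clients in pair_clients.items()}


events = [
    ("c1", 0.0, "a.example"), ("c1", 1.0, "b.example"),
    ("c2", 10.0, "a.example"), ("c2", 11.5, "b.example"),
    ("c3", 0.0, "a.example"), ("c3", 60.0, "b.example"),  # too far apart
    ("c1", 100.0, "c.example"),
]
print(co_occurrence_counts(events))
# {('a.example', 'b.example'): 2}
```

Counting distinct clients rather than raw request pairs keeps a single noisy client from inflating a pair's count. Deciding when a count is statistically significant would need a baseline that this sketch does not model.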
Our systems automatically combine and correlate the output of these classifiers to accurately predict whether a domain name or IP address should be blocked. These are some of the algorithmic classifiers we use:
- SecureRank: a classifier based on graph theory that finds domains guilty by association. It builds a large bipartite graph of the Internet and examines where known compromised systems are going, as well as which other systems are hitting malicious locations. It then finds other common locations that those systems visit and determines which of those locations are malicious. This classifier ranks the security risk of all domain names by applying an iterative process similar to Google’s PageRank algorithm.
- C-Rank: a classifier derived from co-occurrence patterns among domains. Two domains co-occur when a statistically significant number of clients have requested both domains consecutively in a short timeframe.
- DGA Score: to identify whether a domain has been created by a Domain Generation Algorithm (DGA), we analyze features like domain name length and character entropy. A DGA-generated domain name typically looks like a random string of letters and numbers.
- Geo-Diversity & Geo-Distance Scores: these classifiers look at where DNS requests originate, and how far away they are from the IP address’s geolocation. They then compare the geographic distribution of DNS requests with the predicted one for the top-level domain (e.g. RU, CN, COM).
- Lexical Feature Scores: these classifiers are used to detect fast flux. Fast flux is a DNS technique that botnet command and control infrastructures use to hide behind a compromised system, which acts like a proxy.
- RIP / Prefix / ASN Reputation Scores: these classifiers compute a score for domains based on the IP addresses in their DNS records. Unlike most reputation systems that are only based on the IP or the domain, we combine them into pairs.
- Popularity & PageRank Scores: we compute a “popularity score” based on the number of distinct origin IP addresses that have visited a domain name. This is a Bayesian average, similar to reputation scores.
- NLPRank: this is one of our newest classifiers that leverages natural language processing (NLP) techniques to detect cyber-squatting and targeted phishing domains. For example, NLPRank detects fraudulent branded domains, such as paypai.com, which may serve as a malicious domain for an attack against paypal.com.
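The DGA Score item above names two features, domain name length and character entropy. The Python sketch below shows one common way to compute them, using Shannon entropy over the characters of a label. The function names, the choice to look only at the leftmost label, and the sample domains are assumptions for illustration. The source does not say how these features are weighted or thresholded, so no cut-off is shown.

```python
import math
from collections import Counter


def shannon_entropy(label: str) -> float:
    """Shannon entropy, in bits per character, of the characters in `label`."""
    if not label:
        return 0.0
    total = len(label)
    return sum(
        (n / total) * math.log2(total / n)
        for n in Counter(label).values()
    )


def dga_features(domain: str) -> dict:
    """Length and character entropy of the leftmost label of `domain`.

    Assumes `domain` is already a registered domain such as "example.com",
    so the leftmost label is the part an algorithm would have generated.
    """
    label = domain.lower().rstrip(".").split(".")[0]
    return {"length": len(label), "entropy": shannon_entropy(label)}


# Made-up sample names: a dictionary-like label and a random-looking one.
for name in ("google.com", "x7k2qv9zt4mw1p.com"):
    print(name, dga_features(name))
```

A random-looking label tends to be longer and to spread its characters more evenly than a human-chosen name, so both features come out higher. On their own they are weak signals, which fits the source's point that classifier outputs are combined.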
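The source does not describe how NLPRank works internally, so the following Python sketch is not that classifier. It only illustrates one simple lexical signal that could flag a lookalike such as paypai.com: the Levenshtein edit distance between a domain's label and a list of brand names. The brand list, the function names, and the distance limit of 1 are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between `a` and `b` (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def lookalike_brands(domain: str, brands, max_distance: int = 1):
    """Return the brands within `max_distance` edits of the domain's
    leftmost label, excluding exact matches (the brand's own domain)."""
    label = domain.lower().split(".")[0]
    return [b for b in brands if 0 < edit_distance(label, b) <= max_distance]


brands = ["paypal", "google"]  # hypothetical watch list
print(lookalike_brands("paypai.com", brands))  # ['paypal']
print(lookalike_brands("paypal.com", brands))  # []
```

Edit distance alone would also flag unrelated legitimate names that happen to sit one character away from a brand, so a check like this would need further context before it could justify blocking a domain.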