Fuzzy Matching to Find Phishy Domains

Summary

  • Spoofed domains and brand impersonators remain a prevalent problem, and one of the difficulties is finding the impersonators quickly.
  • By using “fuzzy matching” with Censys data and BigQuery, organizations can proactively find and block domain impersonators, thus protecting their users.

The Internet is a vast place with plenty of pitfalls for users. Technology has made it easy for malicious actors to spin up fraudulent websites quickly, and threat actors can use these spoofed domains or brand impersonations to trick users into giving up sensitive information. Threat actors also often target specific organizations by impersonating the organization itself, tricking unsuspecting employees and gaining access to internal infrastructure.

Organizations often have tools that enable them to proactively protect their employees from this sort of attack, but that requires both knowing the domains and then blocking them as quickly as possible. In other words, time and knowledge are both critical to reducing harm from domain/brand impersonators.

With Censys, BigQuery, and a bit of help from the Levenshtein distance, however, this problem becomes as simple as running a query, which lets you blocklist suspicious domains faster.

At Censys we constantly scan the Internet, which means we are able to find a lot of information quickly, including potential impersonators. Examining all of this data through Search can be challenging, though, especially if you are trying to filter on multiple different data fields. As such, for this use case we’ll utilize BigQuery, Google’s serverless data warehouse, to find suspicious domains. A primer on how to search through Censys data via BigQuery is linked here.

By using BigQuery’s JavaScript user-defined functions (UDFs), we can implement the Levenshtein distance algorithm and “fuzzy match” every hostname in our dataset against a specific domain. The Levenshtein distance between two strings is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other, which makes it a simple and well-suited calculation for finding impersonator domains that look similar to a legitimate domain. In this write-up, we use bankofamerica[.]com, the domain of a large US financial institution, as an example.
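To make the definition concrete, here is a minimal JavaScript sketch of the raw edit distance, separate from the BigQuery UDF used later in this post. The function name `editDistance` and the lookalike strings are illustrative assumptions, not taken from Censys data.

```javascript
// Levenshtein distance: the minimum number of single-character insertions,
// deletions, or substitutions needed to turn string a into string b.
function editDistance(a, b) {
  // prev[j] = distance between the first (i - 1) chars of a and the first j chars of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // turning i chars of a into an empty string takes i deletions
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // delete a char from a
        curr[j - 1] + 1,    // insert a char into a
        prev[j - 1] + cost  // substitute (or keep, when the chars match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical lookalikes of the example brand name.
console.log(editDistance("bankofamerica", "bankofamerlca"));  // 1: "i" swapped for "l"
console.log(editDistance("bankofamerica", "bankofamericas")); // 1: one extra character
console.log(editDistance("bankofamerica", "bnkofamerca"));    // 2: two characters dropped
```

A small distance relative to the length of the name is what makes a string a plausible impersonation candidate.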

Using Levenshtein’s to Examine Different Aspects of a URL in BigQuery

Since Levenshtein distance is sensitive to small changes in the strings, we’ll tokenize each name and examine two parts of it: the full URL and the domain. It is possible to break these queries down further to look at subdomains specifically, but for simplicity we look only at these two forms. We also remove the TLD, because it is trivial for an attacker to purchase an alternate TLD and keeping it in the comparison can drastically change the results.
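The tokenization described above can be sketched in JavaScript as follows. This is an illustration under stated assumptions rather than the logic the query itself runs: the helper name `tokenize` is hypothetical, the hostnames are made up, and the public suffix is passed in by the caller because deriving it correctly requires the Public Suffix List.

```javascript
// Split a hostname into the two strings that get compared:
// the full name without its suffix, and the registered domain label alone.
function tokenize(hostname, publicSuffix) {
  const host = hostname.toLowerCase();
  const tail = "." + publicSuffix.toLowerCase();
  // Strip the suffix only when it appears at the very end of the name.
  const urlNoTld = host.endsWith(tail) ? host.slice(0, -tail.length) : host;
  // After the suffix is gone, the last dot-separated label is the registered domain.
  const labels = urlNoTld.split(".");
  const domainNoTld = labels[labels.length - 1];
  return { urlNoTld, domainNoTld };
}

// Hypothetical hostnames for illustration.
const a = tokenize("login.bankofamerica.com", "com");
console.log(a.urlNoTld, a.domainNoTld); // login.bankofamerica bankofamerica

const b = tokenize("secure.bankofamerica.co.uk", "co.uk");
console.log(b.urlNoTld, b.domainNoTld); // secure.bankofamerica bankofamerica
```

Note that this sketch removes only a trailing suffix. A plain string replacement would also remove the same text if it happened to appear earlier in the name.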

Thus, the following query selects the IPv4/IPv6 address of a host and all of its dns.names[], splits each name into a URL without the TLD and a domain without the TLD, and then computes a normalized Levenshtein score for each on a scale of 0 to 1 (0 is no match, 1 is an exact match). The query keeps only rows where either score is above 0.8 and the domain is not an exact match, but this threshold can be modified for your use.
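To get a feel for what the 0-to-1 scale and the 0.8 cutoff mean in practice, the sketch below converts a raw edit distance into a score and works out how many edits a given threshold tolerates. The helper names `score` and `maxEdits` are assumptions made for this illustration.

```javascript
// Convert a raw edit distance into a 0-to-1 similarity score:
// 1 minus the distance divided by the length of the longer string.
function score(distance, lenA, lenB) {
  const longest = Math.max(lenA, lenB);
  if (longest === 0) return 1; // two empty strings are identical
  return 1 - distance / longest;
}

// Largest edit distance whose score is still strictly above the threshold,
// for two strings whose longer length is `longest`.
function maxEdits(longest, threshold) {
  let d = 0;
  while (d + 1 <= longest && score(d + 1, longest, longest) > threshold) {
    d++;
  }
  return d;
}

const target = "bankofamerica"; // 13 characters
console.log(score(1, target.length, 13).toFixed(3)); // 0.923
console.log(score(2, target.length, 13).toFixed(3)); // 0.846
console.log(score(3, target.length, 13).toFixed(3)); // 0.769
console.log(maxEdits(target.length, 0.8));           // 2
```

For a 13-character name, a cutoff of 0.8 admits candidates that are at most two edits away. Raising the threshold to 0.9 would admit only single-edit lookalikes.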

Combining Levenshtein's in a BigQuery UDF with Censys Data
CREATE TEMPORARY FUNCTION levenshtein(a STRING, b STRING)
RETURNS FLOAT64
LANGUAGE js AS """
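// Null input: nothing to compare, so score it as no match.
if(a == null) return 0;
if(b == null) return 0;
// Empty input: return a 0-to-1 score like every other path (1 only if both are empty).
if(a.length == 0 || b.length == 0) return (a.length == b.length) ? 1 : 0;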
var matrix = [];
var i;
for(i = 0; i <= b.length; i++){
 matrix[i] = [i];
}
var j;
for(j = 0; j <= a.length; j++){
 matrix[0][j] = j;
}
for(i = 1; i <= b.length; i++){
 for(j = 1; j <= a.length; j++){
   if(b.charAt(i-1) == a.charAt(j-1)){
     matrix[i][j] = matrix[i-1][j-1];
   } else {
     matrix[i][j] = Math.min(matrix[i-1][j-1] + 1,
                             Math.min(matrix[i][j-1] + 1,
                                      matrix[i-1][j] + 1));
   }
 }
}
var strlen = Math.max(a.length, b.length);
return 1.0 - (matrix[b.length][a.length] / strlen);
""";
WITH
uids_token AS (
SELECT
  host_identifier.ipv4,
  host_identifier.ipv6,
  dns_name as url,
  NET.PUBLIC_SUFFIX(dns_name) as tld,
  SPLIT(NET.REG_DOMAIN(dns_name), ".")[SAFE_OFFSET(0)] AS domain_no_tld,
FROM
  `censys-io.universal_internet_dataset_v2.base`,
  UNNEST(dns.names) dns_name
WHERE
  DATE(snapshot_date) = '2023-10-31'
),  uids_token_no_tlds AS (
SELECT
  *,
  REPLACE(url, CONCAT('.', tld), '') as url_no_tld
FROM uids_token
), uids_score AS (
SELECT
  *,
  levenshtein(LOWER(domain_no_tld),"bankofamerica") domain_no_tld_lev_score,
  levenshtein(LOWER(url_no_tld), "bankofamerica") url_no_tld_lev_score,
FROM
  uids_token_no_tlds )
SELECT
*
FROM
uids_score
WHERE
domain_no_tld_lev_score != 1 AND
(domain_no_tld_lev_score > 0.8 OR url_no_tld_lev_score > 0.8)
ORDER BY
domain_no_tld_lev_score DESC,
url_no_tld_lev_score DESC

BigQuery Output

A screenshot showing the output of the first BigQuery query.

As we can see, there are a number of interesting URLs and domains that are worth further investigation or blocking. However, the analysis does not need to stop here.

We can append additional Censys data about these hosts to filter even further. A slight modification to the query appends the autonomous system name, location data, and certificate issuer to the results, which could allow faster identification of suspicious infrastructure. For example, the output of this query shows a number of hosts in the Proofpoint autonomous system, which may not be notable because Bank of America could be a Proofpoint customer. However, there are also a number of results in other autonomous systems whose certificate issuers differ from the issuer for Bank of America’s homepage (Entrust, Inc.). These results could be worth blocking or digging into further.

Combining Levenshtein's in a BigQuery UDF with Additional Censys Data
CREATE TEMPORARY FUNCTION levenshtein(a STRING, b STRING)
RETURNS FLOAT64
LANGUAGE js AS """
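// Null input: nothing to compare, so score it as no match.
if(a == null) return 0;
if(b == null) return 0;
// Empty input: return a 0-to-1 score like every other path (1 only if both are empty).
if(a.length == 0 || b.length == 0) return (a.length == b.length) ? 1 : 0;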
var matrix = [];
var i;
for(i = 0; i <= b.length; i++){
 matrix[i] = [i];
}
var j;
for(j = 0; j <= a.length; j++){
 matrix[0][j] = j;
}
for(i = 1; i <= b.length; i++){
 for(j = 1; j <= a.length; j++){
   if(b.charAt(i-1) == a.charAt(j-1)){
     matrix[i][j] = matrix[i-1][j-1];
   } else {
     matrix[i][j] = Math.min(matrix[i-1][j-1] + 1,
                             Math.min(matrix[i][j-1] + 1,
                                      matrix[i-1][j] + 1));
   }
 }
}
var strlen = Math.max(a.length, b.length);
return 1.0 - (matrix[b.length][a.length] / strlen);
""";
WITH
uids_token AS (
SELECT
  host_identifier.ipv4,
  dns_name as url,
  NET.PUBLIC_SUFFIX(dns_name) as tld,
  SPLIT(NET.REG_DOMAIN(dns_name), ".")[SAFE_OFFSET(0)] AS domain,
  autonomous_system.name as asn_name,
  location.country as country,
  certificate_issuer_org,
FROM
  `censys-io.universal_internet_dataset_v2.base`,
  UNNEST(dns.names) dns_name,
  UNNEST(services) svc
  LEFT JOIN UNNEST(svc.tls.certificates.leaf_data.issuer.organization) certificate_issuer_org
WHERE
  DATE(snapshot_date) = '2023-10-31' AND
  svc.tls IS NOT NULL
),  uids_token_no_tlds AS (
SELECT
  *,
  REPLACE(url, CONCAT('.', tld), '') as url_no_tld
FROM uids_token
), uids_score AS (
SELECT
  *,
  levenshtein(LOWER(domain),"bankofamerica") domain_lev_score,
  levenshtein(LOWER(url_no_tld), "bankofamerica") url_no_tld_lev_score,
FROM
  uids_token_no_tlds )
SELECT
*
FROM
uids_score
WHERE
domain_lev_score != 1 AND
(domain_lev_score > 0.8 OR url_no_tld_lev_score > 0.8)
ORDER BY
domain_lev_score DESC,
url_no_tld_lev_score DESC

A screenshot showing the output of the second BigQuery query.

This screenshot shows how additional metadata from Censys can be added to more quickly filter out legitimate use cases.
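Once the results are exported, the triage described above (set aside hosts in an autonomous system you can explain, then look at certificate issuers that differ from the expected one) can be sketched in a few lines of JavaScript. The rows, hostnames, and ASN names below are invented for illustration, and `flagSuspicious` is a hypothetical helper. Only the column names and the Entrust, Inc. issuer come from the write-up.

```javascript
// Hypothetical rows using column names from the query's SELECT list.
// Hostnames, ASN names, and scores are made up for this example.
const rows = [
  { url: "mail.bankofamerlca.example", asn_name: "EXAMPLE-EMAIL-SECURITY",
    certificate_issuer_org: "Example CA", domain_lev_score: 0.92 },
  { url: "bankofamericas.example", asn_name: "EXAMPLE-HOSTING",
    certificate_issuer_org: "Example CA", domain_lev_score: 0.93 },
  { url: "www.bankofamerca.example", asn_name: "EXAMPLE-HOSTING",
    certificate_issuer_org: "Entrust, Inc.", domain_lev_score: 0.92 },
];

// Keep rows whose issuer differs from the expected one and whose ASN
// is not on a list of networks you have already explained.
function flagSuspicious(rows, expectedIssuer, explainedAsns) {
  const explained = new Set(explainedAsns);
  return rows.filter(
    (r) => r.certificate_issuer_org !== expectedIssuer && !explained.has(r.asn_name)
  );
}

const flagged = flagSuspicious(rows, "Entrust, Inc.", ["EXAMPLE-EMAIL-SECURITY"]);
console.log(flagged.map((r) => r.url)); // only bankofamericas.example remains
```

Treat the output as a shortlist for review rather than a verdict: a matching issuer or a familiar ASN lowers priority but does not prove a host is legitimate.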

This write-up is meant to be a jumping-off point for your own investigations, and the queries can be modified further to fit your organization’s needs. To find out more about how to use BigQuery with Censys, check out our help docs, and also check out more about BigQuery. We hope that by showing how to combine BigQuery and Censys to fuzzy match the phishers, we can empower your organization to protect users more quickly!

About the Author

Ariana Mirian
Senior Security Researcher
Ariana Mirian is a Senior Security Researcher at Censys. She works at the intersection of Internet measurement and security, and uses large-scale measurement to answer interesting questions about the Internet and improve Censys' data.
