Efficient, Offline Access for Neustar IP Intelligence Data
IP intelligence is useful for applications such as localization and enforcing security policy. Duo uses such information to power parts of our recently released Platform Edition. Two popular vendors in this space are Neustar and MaxMind. MaxMind’s GeoIP services tend to cost less, and some are entirely free, which has contributed to a wider range of open-source tools. Neustar’s GeoPoint service provides additional and different data; however, it isn’t as widely used, so its open-source community isn’t as strong. Additionally, GeoPoint is delivered in CSV format, which is computationally expensive to query in real time.
We have developed and released a tool that addresses both problems (lack of community-maintained tools and computational query cost) by converting Neustar GeoPoint data into the database format used by MaxMind GeoIP. This allows Neustar customers to take advantage of the speed of MaxMind’s offline database format and the collection of community-supported tools for MaxMind databases.
The GeoPoint data is a CSV file that contains 30+ fields, such as:
- Start IP (int)
- End IP (int)
- ASN (int)
- Country Code (str)
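As a rough illustration of working with these fields, the sketch below parses one row and turns the integer start and end addresses into address objects. The column names and the sample row are hypothetical stand-ins, not the real GeoPoint header or data.

```python
import csv
import io
import ipaddress

# Hypothetical header and row: the real GeoPoint file has 30+ columns
# and its own column names.
SAMPLE = (
    "start_ip_int,end_ip_int,asn,country_code\n"
    "3221225984,3221226239,64500,us\n"
)


def parse_rows(handle):
    """Yield one dict per CSV row, converting integer IPs to address objects."""
    for row in csv.DictReader(handle):
        yield {
            "start": ipaddress.IPv4Address(int(row["start_ip_int"])),
            "end": ipaddress.IPv4Address(int(row["end_ip_int"])),
            "asn": int(row["asn"]),
            "country_code": row["country_code"],
        }


records = list(parse_rows(io.StringIO(SAMPLE)))
print(records[0]["start"], records[0]["end"])
```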
One access method suggested by Neustar is loading the data into a database. Dumping the data into an RDBMS would result in substantially increased load on production database servers, particularly since some of our Platform Edition features query the GeoPoint data for every authentication attempt. On the other hand, the naïve offline alternative, grepping a 14 GB CSV file, would be orders of magnitude slower than we could tolerate. Given the goal of achieving the best of both worlds (efficient and offline), I turned to a popular offline single-file database format: MaxMind DB.
MaxMind GeoIP offers API bindings in seven languages; community-maintained bindings are available for at least 13 other languages. Given the broad support and wide usage, including within our products, I built a tool to convert the GeoPoint data to the MaxMind database format. A MaxMind database stores IP networks (IPv4 or IPv6) in a search tree and uses longest-prefix matching to return a JSON-like record, which we fill with data from the GeoPoint CSV file.
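To make longest-prefix matching concrete, here is a toy binary trie in Python. It is only a sketch of the idea, not the MMDB on-disk format or MaxMind's implementation; the `PrefixTree` class and the `foo` records are made up for illustration, and it assumes every entry is IPv4.

```python
import ipaddress


class PrefixTree:
    """Toy binary trie keyed on address bits (illustrative only)."""

    def __init__(self):
        self.root = {}

    def insert(self, cidr, record):
        """Store a record at the node reached by the network's prefix bits."""
        network = ipaddress.ip_network(cidr)
        bits = int(network.network_address)
        width = network.max_prefixlen
        node = self.root
        for i in range(network.prefixlen):
            bit = (bits >> (width - 1 - i)) & 1
            node = node.setdefault(bit, {})
        node["record"] = record

    def lookup(self, address):
        """Walk the address bits and return the deepest record seen, if any."""
        addr = ipaddress.ip_address(address)
        bits = int(addr)
        width = addr.max_prefixlen
        node = self.root
        best = node.get("record")
        for i in range(width):
            bit = (bits >> (width - 1 - i)) & 1
            node = node.get(bit)
            if node is None:
                break
            best = node.get("record", best)
        return best


tree = PrefixTree()
tree.insert("192.0.2.0/24", {"foo": "a"})
tree.insert("192.0.2.128/25", {"foo": "b"})
print(tree.lookup("192.0.2.200"))
```

The more specific /25 wins for addresses inside it, while the rest of the /24 falls back to the shorter prefix. A lookup costs at most one step per address bit, regardless of how many records are stored.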
The flow of data is similar to MapReduce. To illustrate the process, we’ll walk through the diagram below:
preprocess.py: Express the non-CIDR ranges (e.g., 18.104.22.168--22.214.171.124 in the first record) as the corresponding CIDR blocks.
reduce.py: Condense equivalent adjacent blocks into the fewest possible CIDR blocks. Equivalency depends on which fields from the GeoPoint file you’re interested in. In the diagram below, the assumption is that we’re only interested in the value of “foo”, which enables merging the three CIDR blocks into one record.
generate_mmdb.pl: Feed the output into a Perl program that uses MaxMind’s writer library to construct an MMDB file.
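The first two steps can be sketched with Python's standard `ipaddress` module. This is a minimal illustration of the technique, not the code in preprocess.py or reduce.py; the function names, the sample ranges, and the `bar`/`baz` values are assumptions made for the example.

```python
import ipaddress
from itertools import groupby


def range_to_cidrs(start, end):
    """Express an inclusive start-end range as the CIDR blocks that cover it."""
    first = ipaddress.ip_address(start)
    last = ipaddress.ip_address(end)
    return list(ipaddress.summarize_address_range(first, last))


def merge_equivalent(blocks):
    """Merge adjacent blocks that carry the same value.

    `blocks` is an iterable of (network, value) pairs, all of one IP version.
    """
    ordered = sorted(blocks, key=lambda item: item[0])
    merged = []
    # groupby only joins consecutive items, so blocks with equal values that
    # are separated by a differently valued block stay in separate groups.
    for value, group in groupby(ordered, key=lambda item: item[1]):
        for network in ipaddress.collapse_addresses(net for net, _ in group):
            merged.append((network, value))
    return merged


# A non-CIDR range becomes two blocks: 192.0.2.0/25 and 192.0.2.128/26.
blocks = [(net, "bar") for net in range_to_cidrs("192.0.2.0", "192.0.2.191")]
# A neighbouring block with the same value, and an unrelated one.
blocks.append((ipaddress.ip_network("192.0.2.192/26"), "bar"))
blocks.append((ipaddress.ip_network("198.51.100.0/24"), "baz"))

merged = merge_equivalent(blocks)
for network, value in merged:
    print(network, value)
```

Here three blocks that share the value of interest collapse into a single /24, while the block with a different value is left alone. Comparing more fields makes fewer blocks equivalent, so the merged output grows accordingly.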
Downloading and Contributing
neustar2mmdb v1.0 is available as open-source software (under the MIT license) on GitHub: https://github.com/duo-labs/neustar2mmdb. We’re actively working on improving performance, and pull requests are welcome. Usage information is in the README file.