
First step: Geo database

wlanboy

Content Contributor
After some thinking and coding I am now building up my own geo database service.

The goal - if reachable - would be a self-made geolocation-based service to redirect website calls to a closer server.

I want to talk about my first steps and hopefully get some input on the open items.

First step is to create a list of countries and (US) states and their geo location (using midpoints).

I have finished that part - but hopefully someone can point me to an open source list of geo locations for all cities.

The workflow for the application is:

  • Use GeoLite2 from MaxMind to get the longitude and latitude of the IP (alternatives?) - a minimal lookup sketch follows this list
  • Search for the closest country (later city) to the geo point returned by MaxMind
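
For illustration, a minimal lookup with MaxMind's geoip2 Python reader could look like this (the database path and example IP are placeholders, not part of my setup yet):

Code:
import geoip2.database  # pip install geoip2

# Path to the free GeoLite2 City database is an assumption - adjust to your setup.
reader = geoip2.database.Reader("/usr/share/GeoIP/GeoLite2-City.mmdb")
response = reader.city("8.8.8.8")  # example IP
print(response.country.iso_code, response.location.latitude, response.location.longitude)
reader.close()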

Distance calculation can be done with the haversine formula:

d = 2R * arcsin( sqrt( sin²((φ2 - φ1)/2) + cos(φ1) * cos(φ2) * sin²((λ2 - λ1)/2) ) )

Parameters are:

  • φ latitude
  • λ longitude
  • R radius of earth
The script has to calculate the distance of each geo point (of the country and state list) to the returned geo location and sort the result by distance. Next step is to check the list of servers with the sorted list of distances to find the closest server.
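
A minimal Python sketch of that calculate-and-sort step, assuming a made-up server list:

Code:
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    # great-circle distance via the haversine formula
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# made-up server list: (name, lat, lon)
servers = [
    ("frankfurt", 50.1109, 8.6821),
    ("newark", 40.7357, -74.1724),
    ("dallas", 32.7767, -96.7970),
]

visitor = (42.3314, -83.0458)  # e.g. what the IP lookup returned for a Detroit address

by_distance = sorted(servers, key=lambda s: haversine_km(visitor[0], visitor[1], s[1], s[2]))
print(by_distance[0][0])  # closest server first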

A perfect match for the map/reduce functionality of MongoDB.
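
For what it's worth, MongoDB's geospatial index plus $geoNear can do the same nearest-point lookup without hand-written map/reduce functions. A rough pymongo sketch (connection string, database and collection names are assumptions):

Code:
from pymongo import MongoClient, GEOSPHERE

# connection string, database and collection names are assumptions
servers = MongoClient("mongodb://localhost:27017")["geodb"]["servers"]
servers.create_index([("loc", GEOSPHERE)])  # 2dsphere index

servers.insert_many([
    {"name": "us-east", "loc": {"type": "Point", "coordinates": [-74.48473, 40.55690]}},
    {"name": "eu-west", "loc": {"type": "Point", "coordinates": [8.68210, 50.11090]}},
])

# visitor position from the IP lookup (GeoJSON wants [lon, lat])
pipeline = [
    {"$geoNear": {
        "near": {"type": "Point", "coordinates": [-83.0458, 42.3314]},
        "distanceField": "distance_m",
        "spherical": True,
    }},
    {"$limit": 1},
]
closest = next(servers.aggregate(pipeline))
print(closest["name"], round(closest["distance_m"] / 1000), "km")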

MaxMind sometimes has quite a high miss rate on geo locations, like IPs in Detroit that are located in Kansas... maybe because I am using the free edition.

To provide a first look at my project I have created a form where you can enter one IP and three server locations - the closest server location will be returned.
 

bauhaus

Member
You lost me at the Haversine formula :p

Interesting stuff as always. In the past I have used this database http://www.geonames.org/export/

It is free with a Creative Commons license.

This one for the USA http://geonames.usgs.gov/domestic/download_data.htm is at least free.

And this one http://earth-info.nga.mil/gns/html/namefiles.htm worldwide.

The only issue I have had with the geonames.org and nga.mil databases in my projects (not DNS related) is that they are huge - pretty much every little place is already geotagged - so maybe you can cross-reference with a list of major or important cities of the world and trim them down.

Good luck, pretty nice stuff you are doing.
 

wlanboy

Content Contributor
bauhaus said:

Interesting stuff as always. In the past I have used this database http://www.geonames.org/export/

It is free with a Creative Commons license.

And this one http://earth-info.nga.mil/gns/html/namefiles.htm worldwide.

The only issue I have had with the geonames.org and nga.mil databases in my projects (not DNS related) is that they are huge - pretty much every little place is already geotagged - so maybe you can cross-reference with a list of major or important cities of the world and trim them down.

Good luck, pretty nice stuff you are doing.

Thanks for the kind words.

Geonames.org is the project I am looking for.

For me that database would be the source pool to get geo information.

The system itself only needs to check places that are "in use" - it does not have to check places where no server is located. A rough end-to-end sketch follows the list below.

  1. Get all locations tagged by a domain
  2. Translate location into coordinates
    Like Choopa:

    101 Possumtown Road

    Piscataway Township, NJ 08854, United States

    ->

    40.55690 -74.48473
  3. Calculate distance between the IP and all servers
  4. Choose closest location
  5. Generate redirect URI for that location
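
A rough end-to-end sketch of that flow in Python (server names, coordinates and the redirect scheme are made up; haversine_km is the helper from the earlier sketch):

Code:
# server names, coordinates and URL scheme are made up; haversine_km is defined above
SERVERS = {
    "us-east": (40.55690, -74.48473),  # e.g. Piscataway, NJ
    "eu-west": (50.11090, 8.68210),    # e.g. Frankfurt
}

def closest_server(lat, lon):
    return min(SERVERS, key=lambda name: haversine_km(lat, lon, *SERVERS[name]))

def redirect_uri(lat, lon, path="/"):
    # hypothetical scheme: one subdomain per location
    return "http://%s.example.com%s" % (closest_server(lat, lon), path)

print(redirect_uri(40.7357, -74.1724))  # -> http://us-east.example.com/
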
or use something like SQLite or Postgres with geo extensions.
If I look back at my last tutorials, a lot of people do not like to install extra stuff.

I got a lot of requests e.g. for a bash version of my failover DNS script.

No Ruby, no Python, bare minimum stuff.

SQLite might be step 2 ;)

Ya, use anything really that has built-in geohashing. Map/reduce isn't known to be... um... fast or efficient.

Since wlanboy is an astute tinkerer who actually takes time to understand stuff (which I wholly encourage), here's a primer link to geohashing

http://www.bigfastblog.com/geohash-intro
Map/reduce can be fast - if properly used.

I already looked into geohashing because I like the idea of simplifying problems.

It is a way to reduce the problem using bounding boxes.

You do not need to search for points but for the box that contains the point.

If you compare two geohashes and the first 5 characters match, the distance between them is lower than 3803 meters, because that is the maximum distance between the center points of two adjacent bounding boxes (at precision 5).
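
A self-contained sketch of that comparison (a minimal geohash encoder plus a prefix check; the example coordinates are arbitrary):

Code:
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=5):
    # minimal geohash encoder: interleave longitude/latitude bits, 5 bits per character
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    result, even, bit_count, ch = "", True, 0, 0
    while len(result) < precision:
        rng, value = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            ch = ch * 2 + 1
            rng[0] = mid
        else:
            ch = ch * 2
            rng[1] = mid
        even = not even
        bit_count += 1
        if bit_count == 5:
            result += BASE32[ch]
            bit_count, ch = 0, 0
    return result

def common_prefix(a, b):
    # number of leading characters two geohashes share
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# arbitrary example: two points in New Jersey
h1 = geohash(40.55690, -74.48473)
h2 = geohash(40.73570, -74.17240)
print(h1, h2, common_prefix(h1, h2))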

So looking at the task it might be quite an easy-to-compute problem:

  • Translation service for city -> geohash / IP -> geohash
  • Small lib doing the compare stuff
  • Small lib doing the redirect stuff
 

drmike

100% Tier-1 Gogent
Seems somewhat overly complicated. When the officially heady crazy math formulas arrive, most of us tune out ;)

Let's divide things up, the known and unknown.

Known ---> your server locations and prior lat/lon geocoding.

Unknown ---> requesting party's location and lat/lon info.

The known should be accompanied by a heap of other known information, namely MaxMind or similar IP-to-geo data used to geocode your front end nodes and to query over and over for inbound requesting party info.

To determine the unknown info you need to compare that inbound IP against the MaxMind data (stored in MySQL usually), looking for a range that it falls in. Depending on the dataset you are working with, it should include country, state and locality as well as the corresponding geo-code information.

This route requires no big calculations and is reasonably fast when optimized. The match should be a single record providing a lat/lon pair (not everything will return one; those misses fall out and require a special fallback, like to country level for instance).

At this point, post lookup comparison, you should have the requesting party's IP address plus a structured query result that provides lat/lon or equivalent.

The next step is a query of your front end nodes to calculate the distance between the requesting party and each node and sort them by distance ascending (ASC).

This requires no map/reduce or other proprietary features... so your SQL should be portable and fast.
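
Something like this rough SQLite-from-Python sketch illustrates the idea - the schema and the loading step are placeholders, the point is the range lookup plus an ORDER BY on distance:

Code:
import math
import socket
import sqlite3
import struct

def ip_to_int(ip):
    # IPv4 dotted quad -> 32-bit integer, same representation IP-to-geo CSV dumps use
    return struct.unpack("!I", socket.inet_aton(ip))[0]

def haversine_km(lat1, lon1, lat2, lon2, r=6371.0):
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

conn = sqlite3.connect(":memory:")
conn.create_function("haversine_km", 4, haversine_km)
conn.executescript("""
    CREATE TABLE ip_blocks (ip_start INTEGER, ip_end INTEGER, lat REAL, lon REAL);
    CREATE TABLE nodes (name TEXT, lat REAL, lon REAL);
""")
# ... load ip_blocks from your IP-to-geo CSV and nodes from your server list ...

visitor = conn.execute(
    "SELECT lat, lon FROM ip_blocks WHERE ? BETWEEN ip_start AND ip_end LIMIT 1",
    (ip_to_int("1.2.3.4"),),
).fetchone()
if visitor:
    nodes = conn.execute(
        "SELECT name, haversine_km(?, ?, lat, lon) AS dist FROM nodes ORDER BY dist ASC",
        visitor,
    ).fetchall()
    print(nodes)  # closest node first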
 

dcdan

New Member
Verified Provider
Wouldn't it be better to tie not to country/state but to actual ping/network performance?

Example:

1) Visitor comes from IP 1.2.3.4 to your "default" server (A-record has a small TTL)

2) Your server shows website, as usual

3) Meanwhile, your server measures latency to the customer's IP from multiple locations

4) If the customer's IP is not pingable - try pinging the next hop

5) Whatever location "wins" -> BGP lookup -> put their whole subnet into the database and tie it to the "fastest" server

6) Next time, return A record of the "fastest" server + good TTL

You could also "pre-scan" all IP ranges and populate the database beforehand to avoid a slower initial connection. Or use MaxMind to determine the "best" server for the initial connection.

This way you are actually serving content faster based on routing and not distance/country/etc.
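
A rough sketch of steps 3-5 in Python - the probe locations are made up, the ping parsing is only illustrative, and a real setup would run the measurements from each remote location and do the BGP lookup separately:

Code:
import ipaddress
import subprocess
import time

PROBE_LOCATIONS = ["newark", "dallas", "frankfurt"]  # made-up probe nodes
fastest_by_subnet = {}  # "1.2.3.0/24" -> ("newark", timestamp)

def measure_rtt_ms(location, target_ip):
    # Placeholder: a real setup would run the ping from the remote probe node
    # (or fall back to the next hop); here we just shell out to ping locally.
    out = subprocess.run(["ping", "-c", "3", "-q", target_ip],
                         capture_output=True, text=True, timeout=15)
    for line in out.stdout.splitlines():
        if "min/avg/max" in line:  # "rtt min/avg/max/mdev = a/b/c/d ms"
            return float(line.split("=")[1].split("/")[1])
    return float("inf")  # unreachable

def record_winner(target_ip):
    # tie the visitor's /24 to the location with the lowest measured RTT
    subnet = str(ipaddress.ip_network(target_ip + "/24", strict=False))
    rtts = {loc: measure_rtt_ms(loc, target_ip) for loc in PROBE_LOCATIONS}
    fastest_by_subnet[subnet] = (min(rtts, key=rtts.get), time.time())
    return fastest_by_subnet[subnet]

# e.g. record_winner("1.2.3.4") -> ("newark", 1700000000.0); serve that A record next time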
 

manacit

New Member
This strikes me as a waste of time and tinkering (no offense) as a practical tool to speed a website up. It seems like you're basically planning on letting the end user hit your web server, then wait until you do all of this processing and send them, via HTTP redirect, to a server closer to them. The only thing that's going to get you is longer page load times for the initial visit, and a (negligible) decrease in load times after that. This also means you'll need to manage cookies across multiple subdomains and a lot of other stuff that's going to make a simple website significantly more complicated.

This is best solved with anycast, which is prohibitively expensive to roll out for personal use of course - but it's also bikeshedding that's entirely unnecessary for someone who's not running the kind of operation that can't afford to do it. 
 

Aldryic C'boas

The Pony
Although to be fair, there are better qualifications than great circle distance to consider when choosing a location to download from.
 

drmike

100% Tier-1 Gogent
Even with Anycast, geo-DNS functionality is calculated in such a manner (i.e. often utilizing MaxMind or similar data to correlate an IP to lat/lon and then make comparisons to geocoded or regionally labeled A records).

Talking about 100-200ms lookup time on such records, when optimized - and there are lots of assumptions in such an approach, often working from a reduced data set to cut lookup times.

I've used this base approach for all sorts of stuff over the years.  Download servers and CDN ends being the most common.
 

manacit

New Member
... download server.
I suppose, but why not just give the user a list of servers and their location? That way they can choose for themselves, instead of relying on MaxMind (which is often wrong) and your calculations (which could easily result in choosing the wrong mirror). 
 

tchen

New Member
manacit said:

I suppose, but why not just give the user a list of servers and their location? That way they can choose for themselves, instead of relying on MaxMind (which is often wrong) and your calculations (which could easily result in choosing the wrong mirror).

{ insert joke about Americans and geography }
 

Aldryic C'boas

The Pony
Sites such as geoiptool.com.  If you're just running the occasional IP though, most don't mind if you whip up a bash/curl script to do so - just ask them first.
 