First step: Geo database

Discussion in 'Coding, Scripting & Programming' started by wlanboy, Feb 8, 2014.

  1. wlanboy

    wlanboy Content Contributer

    2,126
    1,169
    May 16, 2013
    After thinking about and coding a I am now building up my own geo database service.

    The goal, if reachable, would be a self-made geolocation-based service that redirects website requests to a closer server.

    I want to talk about my first steps and hope to get some input on several open items.

    First step is to create a list of countries and (US) states and their geo location (using midpoints).

    I have finished that step, but hopefully someone can point me to an open-source list of geolocations for all cities.

    The workflow for the application is:

    • Use GeoLite2 from MaxMind to get the longitude and latitude of an IP (alternatives?)
    • Search for the closest country (later city) to the geo point returned by MaxMind

    The distance can be calculated with the haversine formula:

    haversine.jpg

    Parameters are:

    • φ latitude
    • λ longitude
    • R radius of earth
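For reference, the haversine formula is usually written as below. The subscripts 1 and 2 (for the two points being compared) and the symbol d (for the resulting great-circle distance) are notation assumed here, not taken from the post.

```latex
d = 2R \arcsin\left( \sqrt{ \sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \, \cos\varphi_2 \, \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right) } \right)
```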
    The script has to calculate the distance from each geo point (in the country and state list) to the returned geolocation and sort the results by distance. The next step is to check the list of servers against the sorted list of distances to find the closest server.
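The distance-and-sort step described above could be sketched roughly as follows. This is only an illustration in plain Python, not the MongoDB map/reduce version; the midpoints, the server hostnames, and the function names are all made up for the example, and R is assumed to be the mean Earth radius of 6371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (the R parameter)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # min() guards against rounding pushing the square root slightly above 1
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def sort_by_distance(origin, places):
    """Return (distance_km, name) pairs, nearest first."""
    lat, lon = origin
    return sorted(
        (haversine_km(lat, lon, p_lat, p_lon), name)
        for name, (p_lat, p_lon) in places.items()
    )

def closest_server(ranked, servers):
    """Walk the sorted list and return the first place that hosts a server."""
    for _, name in ranked:
        if name in servers:
            return servers[name]
    return None

# Hypothetical midpoints and servers, rough values for illustration only
PLACES = {"Germany": (51.0, 9.0), "France": (46.0, 2.0), "Kansas": (38.5, -98.0)}
SERVERS = {"France": "fr.example.com", "Kansas": "us.example.com"}

ranked = sort_by_distance((52.5, 13.4), PLACES)  # a visitor near Berlin
print(closest_server(ranked, SERVERS))
```

With these sample values the nearest midpoint is Germany, which has no server, so the walk continues to France.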

    Perfect match for map/reduce functionality of MongoDB.

    MaxMind sometimes has quite a high miss rate for geolocations, like IPs in Detroit that are located in Kansas... maybe because I am using the free edition.

    To give a first look at my project, I have created a form where you can enter one IP and three server locations; the closest server location is returned.
     
  2. Nett

    Nett Article Submitter Verified Provider

    761
    189
    Nov 27, 2013
    That's some extreme rocket science!!
     
  3. bauhaus

    bauhaus Member

    67
    20
    Sep 8, 2013
    You lost me at the Haversine formula :p

    Interesting stuff as always. In the past I have used this database http://www.geonames.org/export/

    It is free under a Creative Commons license.

    This one for USA http://geonames.usgs.gov/domestic/download_data.htm at least is free.

    And this one http://earth-info.nga.mil/gns/html/namefiles.htm worldwide.

    The only issue I have had with the geonames.org and nga.mil databases in my projects (not DNS related) is that they are huge; pretty much every little place is already geotagged. Maybe you can cross-reference them with a list of major or important cities of the world and trim them down.

    Good luck, pretty nice stuff you are doing.
     
    Last edited by a moderator: Feb 8, 2014
  4. tchen

    tchen New Member

    276
    124
    Sep 1, 2013
  5. zzrok

    zzrok New Member

    104
    41
    Jul 14, 2013
  6. tchen

    tchen New Member

    276
    124
    Sep 1, 2013
    Ya, use anything really that has built-in geohashing.  Map-reduce isn't known to be... um... fast or efficient.

    Since wlanboy is an astute tinkerer who actually takes time to understand stuff (which I wholly encourage), here's a primer link to geohashing

    http://www.bigfastblog.com/geohash-intro
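For readers who do not want to follow the link, here is a minimal sketch of geohash encoding in Python. It follows the commonly described scheme (alternating longitude and latitude bits, longitude first, five bits per base32 character); the function name and default precision are choices made for this example.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=5):
    """Encode a latitude/longitude pair (degrees) as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid  # keep the upper half of the interval
            else:
                bits <<= 1
                lon_hi = mid  # keep the lower half
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)

print(geohash_encode(57.64911, 10.40744, 5))
```

Each extra character narrows the bounding box further, so nearby points tend to share a prefix. Points on opposite sides of a cell boundary can still get different prefixes even when they are close together.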
     
  7. wlanboy

    wlanboy Content Contributer

    2,126
    1,169
    May 16, 2013
    Thanks for the kind words.

    Geonames.org is the project I am looking for.

    For me that database would be the source pool to get geo information.

    The system itself only needs to check places that are "in use"; it does not have to check places where no server is located.

    1. Get all locations tagged by a domain
    2. Translate location into coordinates
      Like Choopa:

      101 Possumtown Road

      Piscataway Township, NJ 08854, United States

      ->

      40.55690 -74.48473
    3. Calculate distance between ip and all servers
    4. Choose closest location
    5. Generate redirect URI for that location
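Steps 3 to 5 of the list above might look roughly like this in Python. Only the Choopa coordinates come from the post; the hostnames, the other two server locations, and the function names are hypothetical, and the visitor location is assumed to have been looked up already.

```python
import math
from urllib.parse import urlsplit, urlunsplit

# Hostnames are hypothetical; only the first coordinate pair is from the post.
SERVERS = {
    "nj.example.com": (40.55690, -74.48473),  # Choopa, Piscataway Township
    "ams.example.com": (52.37, 4.90),         # assumed location
    "la.example.com": (34.05, -118.24),       # assumed location
}

def distance_km(a, b, radius_km=6371.0):
    """Haversine distance between two (lat, lon) pairs in degrees."""
    phi1, phi2 = math.radians(a[0]), math.radians(b[0])
    dlam = math.radians(b[1] - a[1])
    h = math.sin((phi2 - phi1) / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(h)))

def redirect_uri(request_url, visitor_location, servers=SERVERS):
    """Pick the closest server and rebuild the URL with its hostname."""
    host = min(servers, key=lambda name: distance_km(visitor_location, servers[name]))
    parts = urlsplit(request_url)
    # Keep scheme, path, query and fragment; swap only the host part.
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))

print(redirect_uri("http://www.example.com/file.iso?x=1", (42.33, -83.05)))
```

For a visitor near Detroit this returns the New Jersey host. The caller would send the result in an HTTP redirect.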
    Looking back at my last tutorials, a lot of people do not like to install extra stuff.

    I got a lot of requests e.g. for a bash version of my failover DNS script.

    No Ruby, no Python, bare minimum stuff.

    Sqlite might be Step 2 ;)

    Map/Reduce can be fast, if properly used.

    I already looked into geohashing because I like the idea of simplifying problems.

    It is a way to minimize the problem by bounding boxes.

    You do not need to search for points but for the box that contains the point.

    If you compare two geohashes and the first 5 characters match, the distance between them is lower than 3803 meters, because that is the maximum distance between the center points of two adjacent bounding boxes (for a precision of 5).
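As a rough cross-check on cell sizes, the dimensions of a bounding box can be derived from the bit layout alone (5 bits per character, alternating longitude and latitude, longitude first). The sketch below assumes a spherical Earth with a radius of 6371 km; the helper names are invented for this example. The east-west extent shrinks with the cosine of the latitude, so any single distance bound for a given prefix length only holds approximately.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in metres

def cell_size_degrees(precision):
    """Width and height (in degrees) of a geohash cell with `precision` characters."""
    total_bits = 5 * precision
    lon_bits = (total_bits + 1) // 2  # longitude gets the extra bit when the total is odd
    lat_bits = total_bits // 2
    return 360.0 / 2 ** lon_bits, 180.0 / 2 ** lat_bits

def cell_size_metres(precision, latitude=0.0):
    """Approximate width and height (in metres) of a cell at the given latitude."""
    width_deg, height_deg = cell_size_degrees(precision)
    metres_per_degree = math.pi * EARTH_RADIUS_M / 180
    width = width_deg * metres_per_degree * math.cos(math.radians(latitude))
    height = height_deg * metres_per_degree
    return width, height

def common_prefix_length(a, b):
    """Number of leading characters two geohashes share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

print(cell_size_metres(5))                 # at the equator
print(cell_size_metres(5, latitude=52.5))  # narrower further north
print(common_prefix_length("u4pruy", "u4prvz"))
```

At the equator a precision-5 cell comes out at roughly 4.9 km on each side under these assumptions, which can be compared with the figure quoted above.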

    So looking at the task, it might be a problem that is quite easy to compute:

    • Translation service for city -> geohash / ip -> geohash
    • Small lib doing the compare stuff
    • Small lib doing the redirect stuff
     
  8. drmike

    drmike 100% Tier-1 Gogent

    8,573
    2,717
    May 13, 2013
    Seems somewhat overly complicated.  When the officially heady crazy math formula arrives most of us tune out ;)

    Let's divide things up, the known and unknown.

    Known ---> your server locations and prior lat/lon geocoding.

    Unknown ---> requesting party's location and lat/lon info.

    The known should be accompanied by a heap of other known information, namely MaxMind or similar IP-to-geo data used to geocode your front end nodes and to query over and over for inbound requesting party info.

    To determine the unknown info, you need to compare the inbound IP against the MaxMind data (usually stored in MySQL), looking for the range that it falls in. Depending on the dataset you are working with, the result should include country, state, and locality as well as the corresponding geocode information.

    This route requires no big calculations and is reasonably fast when optimized. The match should be one record with a lat/lon pair (not every lookup will return one; those that fall out require a special fallback, to the country for instance).
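The range lookup described above can be sketched without a database. The example below swaps MySQL for an in-memory sorted list with a binary search, purely to show the idea; the rows use documentation address ranges with made-up coordinates, and `locate` is a hypothetical helper name.

```python
import bisect
import ipaddress

# Hypothetical rows: (first IP, last IP, latitude, longitude)
ROWS = [
    ("192.0.2.0", "192.0.2.255", 42.33, -83.05),
    ("198.51.100.0", "198.51.100.255", 52.52, 13.40),
    ("203.0.113.0", "203.0.113.255", 35.68, 139.69),
]

# Store ranges as integers, sorted by start address, so bisect can be used.
RANGES = sorted(
    (int(ipaddress.ip_address(first)), int(ipaddress.ip_address(last)), lat, lon)
    for first, last, lat, lon in ROWS
)
STARTS = [r[0] for r in RANGES]

def locate(ip):
    """Return (lat, lon) for the range containing `ip`, or None if no range matches."""
    n = int(ipaddress.ip_address(ip))
    i = bisect.bisect_right(STARTS, n) - 1  # last range starting at or before n
    if i >= 0 and n <= RANGES[i][1]:
        return RANGES[i][2], RANGES[i][3]
    return None  # caller falls back, e.g. to a country-level default

print(locate("198.51.100.7"))
print(locate("8.8.8.8"))
```

A `None` result is where the special fallback handling mentioned above would kick in.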

    At this point, after the lookup comparison, you should have the requesting party's IP address plus a structured query result that provides lat/lon or an equivalent.

    Next step is a query of your front end nodes to calculate the distance between the requesting party and each node and sort them by distance ASCending.

    This requires no map reduce or other proprietary features.... So your SQL should be portable and fast.
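A sketch of such a query, run here against an in-memory SQLite database from Python. It uses the spherical law of cosines rather than the haversine formula because it fits in one SQL expression. The node table and its contents are made up. The trigonometric functions are registered by hand because not every SQLite build ships them; MySQL provides RADIANS, SIN, COS, and ACOS natively, though the named-parameter syntax would differ there.

```python
import math
import sqlite3

def _safe_acos(x):
    # Clamp to [-1, 1] so rounding noise cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, x)))

conn = sqlite3.connect(":memory:")
for name, func in (("radians", math.radians), ("sin", math.sin),
                   ("cos", math.cos), ("acos", _safe_acos)):
    conn.create_function(name, 1, func)

# Hypothetical front end nodes; the table name and columns are assumptions.
conn.execute("CREATE TABLE nodes (name TEXT, lat REAL, lon REAL)")
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?)", [
    ("newjersey", 40.55690, -74.48473),
    ("amsterdam", 52.37, 4.90),
    ("losangeles", 34.05, -118.24),
])

QUERY = """
SELECT name,
       6371.0 * ACOS(
           SIN(RADIANS(:lat)) * SIN(RADIANS(lat)) +
           COS(RADIANS(:lat)) * COS(RADIANS(lat)) * COS(RADIANS(lon) - RADIANS(:lon))
       ) AS km
FROM nodes
ORDER BY km ASC
"""

def nodes_by_distance(lat, lon):
    """Return (name, km) rows for all nodes, nearest first."""
    return conn.execute(QUERY, {"lat": lat, "lon": lon}).fetchall()

for name, km in nodes_by_distance(42.33, -83.05):  # requester near Detroit
    print(name, round(km))
```

With only a handful of nodes the full scan is cheap; the first row is the node to send the requester to.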
     
  9. dcdan

    dcdan New Member Verified Provider

    171
    54
    Aug 18, 2013
    Wouldn't it be better to tie not to country/state but to actual ping/network performance?

    Example:

    1) Visitor comes from IP 1.2.3.4 to your "default" server (A-record has a small TTL)

    2) Your server shows website, as usual

    3) Meanwhile, your server measures latency to the customer's ip from multiple locations

    4) If customer's ip is not pingable - try pinging next hop

    5) Whatever location "wins" -> bgp lookup -> put their whole subnet into the database and tie to the "fastest" server

    6) Next time, return A record of the "fastest" server + good TTL
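Steps 5 and 6 could be sketched as below. The latency measurements and the BGP prefix lookup are assumed to happen elsewhere and are passed in as plain values; `fastest_location` and `SubnetTable` are hypothetical names, and the addresses are documentation ranges.

```python
import ipaddress

def fastest_location(latencies_ms):
    """Pick the location with the lowest round-trip time; None means unreachable."""
    reachable = {loc: ms for loc, ms in latencies_ms.items() if ms is not None}
    return min(reachable, key=reachable.get) if reachable else None

class SubnetTable:
    """Remembers which location won for a whole announced prefix."""

    def __init__(self):
        self._entries = {}  # ip_network -> location name

    def remember(self, prefix, location):
        self._entries[ipaddress.ip_network(prefix)] = location

    def lookup(self, ip, default):
        addr = ipaddress.ip_address(ip)
        matches = [net for net in self._entries if addr in net]
        if not matches:
            return default  # unknown visitor: use the default server
        # Prefer the most specific (longest) matching prefix.
        best = max(matches, key=lambda net: net.prefixlen)
        return self._entries[best]

table = SubnetTable()
winner = fastest_location({"nj": 32.0, "ams": 110.5, "la": None})
table.remember("192.0.2.0/24", winner)  # prefix as returned by a BGP lookup (assumed)

print(table.lookup("192.0.2.77", default="default"))
print(table.lookup("198.51.100.1", default="default"))
```

The linear scan over stored prefixes is fine for a sketch; a real table with many prefixes would want a trie or a sorted structure.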

    You could also "pre-scan" all IP ranges and populate the database beforehand to avoid slower initial connection. Or, use Maxmind to determine "best" server for initial connection.

    This way you are actually serving content faster based on routing and not distance/country/etc.
     
  10. manacit

    manacit New Member

    108
    43
    May 17, 2013
    This strikes me as a waste of time and tinkering (no offense) as a practical tool to speed a website up. It seems like you're basically planning on letting the end user hit your web server, then wait until you do all of this processing and send them, via HTTP redirect, to a server closer to them. The only thing that's going to get you is longer page load times for the initial visit, and a (negligible) decrease in load times after that. This also means you'll need to manage cookies across multiple subdomains and a lot of other stuff that's going to make a simple website significantly more complicated.

    This is best solved with anycast, which is prohibitively expensive to roll out for personal use of course - but it's also bikeshedding that's entirely unnecessary for someone who's not running the kind of operation that can't afford to do it. 
     
  11. tchen

    tchen New Member

    276
    124
    Sep 1, 2013
    ... download server.
     
  12. Aldryic C'boas

    Aldryic C'boas The Pony

    2,313
    2,652
    Apr 18, 2013
    Aldryic
    Although to be fair, there are better qualifications than great circle distance to consider when choosing a location to download from.
     
  13. drmike

    drmike 100% Tier-1 Gogent

    8,573
    2,717
    May 13, 2013
    Even with anycast, geo-DNS functionality is often calculated in this manner (i.e. using MaxMind or similar data to correlate an IP to lat/lon and then comparing against geocoded or regionally labeled A records).

    We are talking about 100-200 ms lookup time on such records when optimized, and there are lots of assumptions in such an approach; it often works from a reduced data set to cut lookup times.

    I've used this base approach for all sorts of stuff over the years.  Download servers and CDN ends being the most common.
     
  14. manacit

    manacit New Member

    108
    43
    May 17, 2013
    I suppose, but why not just give the user a list of servers and their location? That way they can choose for themselves, instead of relying on MaxMind (which is often wrong) and your calculations (which could easily result in choosing the wrong mirror). 
     
  15. tchen

    tchen New Member

    276
    124
    Sep 1, 2013
    { insert joke about Americans and geography }
     
  16. peterw

    peterw New Member

    800
    189
    Jun 14, 2013
    What is the best way to locate a single IP? Any free sources?
     
  17. Aldryic C'boas

    Aldryic C'boas The Pony

    2,313
    2,652
    Apr 18, 2013
    Aldryic
    Sites such as geoiptool.com.  If you're just running the occasional IP though, most don't mind if you whip up a bash/curl script to do so - just ask them first.
     
  18. sv01

    sv01 Slow but sure

    426
    87
    May 17, 2013
    freegeoip.net