WhoisRefresh
DevelopmentTeam (10-9, 8-6, 4-2, 4)
This project brings together several smaller projects into one. It involves rewriting the page creation bot to use several new data sources.
- (10) WhoisParsing (Arif Iqbal)
- (7-10) RewritePageCreationBot
- MediaWiki:InitialDomainPage
- RewritePageScrapeBot
- (4) WhoisRefreshRunRefresh (Jason and Ali)
Steps to DoneDone
- Branch whois-refresh
- Get it running on our local machines
- Assess where we currently are
- Ruthlessly prune ... spin off any tasks that aren't absolutely essential onto their own pages that will be considered later
Synopsis
A new domain page is scraped and populated in either of these cases:
- When a red-link is clicked
- When a search for a domain doesn't return a page
AcceptanceTest
Look at 50 pages that we know don't have contact info and verify that the contact info is coming in.
Background
- RFC 3912, the current WHOIS protocol specification
- Domain Statistics by TLD
- 100 oldest dot com domains
- Registrar Stats
The ranking and percentages come from http://populicio.us/toptlds.html and are at least as stale as November 2006.
Rank | TLD | Purpose | Percentage | Whois Server |
---|---|---|---|---|
1 | .com | commercial organizations | 58.3973 | whois.verisign-grs.com |
2 | .org | organizations | 12.8734 | whois.pir.org |
3 | .net | network infrastructures | 7.3600 | whois.verisign-grs.com |
4 | .uk | United Kingdom | 3.2604 | whois.nic.uk |
5 | .edu | educational establishments accredited in the United States | 2.7008 | whois.educause.edu |
6 | .jp | Japan | 2.6159 | whois.jprs.jp |
7 | .de | Germany | 2.1484 | whois.denic.de |
8 | .br | Brazil | 0.8066 | whois.registro.br |
9 | .ca | Canada | 0.7208 | whois.cira.ca |
10 | .gov | governments and their agencies in the United States | 0.6832 | whois.dotgov.gov |
11 | .au | Australia | 0.6463 | whois.aunic.net |
12 | .info | informational sites | 0.5674 | whois.afilias.net |
13 | .nl | Netherlands | 0.5380 | whois.domain-registry.nl |
14 | .fr | France | 0.5108 | whois.nic.fr |
15 | .us | United States | 0.5030 | whois.nic.us |
16 | .ru | Russian Federation | 0.4610 | whois.ripn.net |
17 | .it | Italy | 0.3527 | whois.nic.it |
18 | .cn | China | 0.3480 | whois.cnnic.net.cn |
19 | .ch | Switzerland | 0.2761 | whois.nic.ch |
20 | .tw | Taiwan | 0.2727 | whois.twnic.net.tw |
21 | .es | Spain | 0.2699 | |
22 | .se | Sweden | 0.2493 | whois.iis.se |
23 | .dk | Denmark | 0.1957 | whois.dk-hostmaster.dk |
24 | .be | Belgium | 0.1956 | whois.dns.be |
25 | .pl | Poland | 0.1816 | whois.dns.pl |
26 | .at | Austria | 0.1659 | whois.nic.at |
27 | .il | Israel | 0.1559 | whois.isoc.org.il |
28 | .tv | Tuvalu | 0.1553 | |
29 | .nz | New Zealand | 0.1233 | whois.srs.net.nz |
30 | .biz | business use | 0.1188 | whois.biz |
?? | .eu | European Union | ??? | whois.eu |
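The table above can drive a simple lookup. The following is a minimal sketch, not the bot's actual code: it assumes a hypothetical `WHOIS_SERVERS` constant holding a few rows from the table, and hypothetical helpers `whois_server_for` and `whois_query`. The query follows RFC 3912: open a TCP connection to port 43, send the query terminated by CR LF, and read until the server closes the connection. The hostnames are copied from the table and may have changed since it was compiled.

```ruby
require "socket"

# Hypothetical constant: a subset of the TLD table above (TLD => whois server).
WHOIS_SERVERS = {
  "com" => "whois.verisign-grs.com",
  "org" => "whois.pir.org",
  "net" => "whois.verisign-grs.com",
  "uk"  => "whois.nic.uk",
  "edu" => "whois.educause.edu"
}.freeze

# Returns the whois server for a domain's TLD, or nil if the TLD is not listed.
def whois_server_for(domain)
  tld = domain.downcase.split(".").last
  WHOIS_SERVERS[tld]
end

# Sends one RFC 3912 query and returns the raw response text.
def whois_query(domain, server: whois_server_for(domain), port: 43)
  raise ArgumentError, "no whois server known for #{domain}" if server.nil?

  TCPSocket.open(server, port) do |sock|
    sock.write("#{domain}\r\n") # the query line must end with CR LF
    sock.read                   # the server closes the connection when done
  end
end
```

Calling `whois_query("example.org")` would contact `whois.pir.org`. Some registries expect extra keywords in the query line, so the raw domain name may not be enough for every server.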
Information We Need
- Date of lookup
- Registrant Address
- Admin Address
- Phone
- Question: do we need both the registrant and admin addresses or is one enough? In the past we've only used one. - Ray | talk
Next
- Get gpMakeImage working so that tests pass
- Send email
- Given an address, get a latitude/longitude
- Obfuscate address
<address>fe565342385cbcce7cb35b486876b8d5</address>
. . .
<address asgraphic="HASH" latitude="" longitude="" error="This is where an error message goes">
...This address is displayed as a graphic to make it more difficult for web robots to harvest it. If you would like to change the address that is displayed, simply replace these instructions with the new address and then save the page...
</address>
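One way the hash in the example might be produced is sketched below. This is an assumption, not the project's confirmed method: the sample value is 32 hex digits, which is consistent with MD5, but the page does not say which hash is used. The helper names `address_hash` and `obfuscated_address_tag` are hypothetical.

```ruby
require "digest/md5"

# Assumption: the address is identified by the MD5 hex digest of its text.
def address_hash(address)
  Digest::MD5.hexdigest(address.strip)
end

# Escapes the characters that would break a double-quoted attribute value.
def escape_attr(value)
  value.to_s.gsub(/[&<>"]/, "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;")
end

# Builds an <address> tag carrying the hash instead of the plain-text address.
def obfuscated_address_tag(address, latitude: "", longitude: "", error: "")
  attrs = {
    "asgraphic" => address_hash(address),
    "latitude"  => latitude,
    "longitude" => longitude,
    "error"     => error
  }
  attr_text = attrs.map { |name, value| %(#{name}="#{escape_attr(value)}") }.join(" ")
  "<address #{attr_text}></address>"
end
```

The plain-text address never appears in the generated tag, so a robot reading the page source sees only the hash.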
Latitude/Longitude
Need a table so that we can associate one or more lat/long pairs with a page.
Some neat regular expressions
These are from the book "The Ruby Way".
- The following regex matches a phone number in the NANP (North American Numbering Plan) format. It allows three common ways of writing such a phone number:
phone = /^((\(\d{3}\) |\d{3}-)\d{3}-\d{4}|\d{3}\.\d{3}\.\d{4})$/
"(512) 555-1234" =~ phone # true
"512.555.1234" =~ phone # true
"512-555-1234" =~ phone # true
"(512)-555-1234" =~ phone # false
"512-555.1234" =~ phone # false
- Here is a regex to match a U.S. ZIP Code (which may be five or nine digits):
zip = /^\d{5}(-\d{4})?$/
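- The following regex matches all 51 of the usual codes (50 states and the District of Columbia). The alternatives are wrapped in a group so that ^ and $ anchor the whole pattern; without the group, the anchors apply only to the first and last alternatives, and a string such as "XCA" would match:
state = /^(?: A[LKZR] | C[AOT] | D[EC] | FL | GA | HI | I[DLNA] | K[SY] | LA | M[EDAINSOT] | N[EVHJMYCD] | O[HKR] | PA | RI | S[CD] | T[NX] | UT | V[TA] | W[AVIY] )$/x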
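As a sketch of how the patterns above might be combined to check fields pulled from a whois record, the following uses hypothetical constant names (`PHONE`, `ZIP`, `STATE`) and a hypothetical helper `valid_us_contact?`. It makes two deliberate changes: the alternatives are grouped so the anchors cover the whole pattern, and `\A` and `\z` replace `^` and `$`, because in Ruby `^` and `$` match at line boundaries and would accept a valid line embedded in multi-line input.

```ruby
# Hypothetical constants adapted from the patterns above.
PHONE = /\A(?:(?:\(\d{3}\) |\d{3}-)\d{3}-\d{4}|\d{3}\.\d{3}\.\d{4})\z/
ZIP   = /\A\d{5}(?:-\d{4})?\z/
STATE = /\A(?: A[LKZR] | C[AOT] | D[EC] | FL | GA | HI | I[DLNA] | K[SY] | LA |
               M[EDAINSOT] | N[EVHJMYCD] | O[HKR] | PA | RI | S[CD] | T[NX] |
               UT | V[TA] | W[AVIY] )\z/x

# Returns true only when all three fields match their patterns exactly.
def valid_us_contact?(phone:, state:, zip:)
  PHONE.match?(phone) && STATE.match?(state) && ZIP.match?(zip)
end
```

For example, `valid_us_contact?(phone: "(512) 555-1234", state: "TX", zip: "78701")` passes, while a phone number written as "512-555.1234" fails the check.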
Whois Records We Need
We need 50 more whois records covering the range of formats for each of the following registrars:
- public domain
- ONLINENIC
- STRATO
- BASICFUSION.COM
- DOMAINNAMESALES | domain name sales
- core
- METAPREDICT.COM
We have a few but need 50 whois records covering the range of formats for each of the following registrars:
- ascio
- beijing Innovative
- belgium Domains
- capitol_domain
- discount_domain
- domaindiscover
- domain_doorman
- domainsite
- dotregistrar
- dotster
- fabulous.com
- gandi
- hichina
- innerwise
- joker.com
- Key systems
- Mark Monitor
- Melbourne IT
- Moniker
- NameKing
- Name.Net
- Names4ever
- namesdirect
- nameview
- nicline
- ovh
- psi-usa
- registerfly
- schlund
- srsplus
- wild west domains