Difference between revisions of "Rewrite PageCreationBot"
Arif Iqbal (Added stuff from RewritePageCreationScraper so we can nuke RewritePageCreationScraper)
Revision as of 07:34, 16 August 2007
What (summary)
- New page-building bot
- Still relies on Java/Tomcat to do crawling (for now)
- Carefully tested
- This is essentially a one-to-one rewrite of the bot's scraping pieces in Ruby instead of Java.
Why this is important
- We need to have control over the pages that are created on our site.
- The old bot was known to pollute the database; we need control over all the access points that could screw up our data.
- We gain mastery over the code so that we can add new features easily.
DoneDone
- Creates news pages based on a template
- Monitoring and logging have been added (they test whether or not the bot succeeds)
- Hooked in to all the points the old bot was
- Checks robots.txt before spidering the website.
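The robots.txt check in the last item could look roughly like the sketch below. This is not the bot's actual code: the class name RobotsRules, its methods, and the user-agent string "PageCreationBot" are all hypothetical, and the parsing is deliberately simplified. It only honors Disallow prefixes, ignores Allow lines and wildcards, and treats the "*" group and any matching named group as both applying. Fetching the file over HTTP is left out so the example stays self-contained.

```ruby
# Illustrative sketch only: RobotsRules is a hypothetical helper,
# not part of the real bot. It handles User-agent groups and
# Disallow prefixes, and ignores Allow lines and wildcards.
class RobotsRules
  def initialize(robots_txt, user_agent)
    @disallowed = []
    agent = user_agent.downcase
    applies = false       # does the current group apply to our agent?
    in_agent_run = false  # are we inside consecutive User-agent lines?

    robots_txt.each_line do |raw|
      line = raw.sub(/#.*/, '').strip # drop comments and whitespace
      next if line.empty?

      field, value = line.split(':', 2).map(&:strip)
      next if value.nil? # not a "field: value" line

      case field.downcase
      when 'user-agent'
        matches = value == '*' ||
                  (!value.empty? && agent.include?(value.downcase))
        # Consecutive User-agent lines share one group of rules.
        applies = in_agent_run ? (applies || matches) : matches
        in_agent_run = true
      when 'disallow'
        in_agent_run = false
        # An empty Disallow value means "nothing is disallowed".
        @disallowed << value if applies && !value.empty?
      else
        in_agent_run = false
      end
    end
  end

  # True when no applicable Disallow prefix matches the path.
  def allowed?(path)
    @disallowed.none? { |prefix| path.start_with?(prefix) }
  end
end

# Usage: check a path before spidering it.
sample = "User-agent: *\nDisallow: /private/\n"
rules = RobotsRules.new(sample, 'PageCreationBot')
puts rules.allowed?('/news/today.html')
puts rules.allowed?('/private/data.html')
```

A crawler would build one such object per host and skip any URL whose path is not allowed.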