StructuredDataFromWikiPages
What (summary)
Extract data that users have entered onto wiki pages and turn it into structured data for easier manipulation.
We need to extract
- Contact info
  - Address
  - Phone #
Open questions
- What is structured data, concretely, in our context?
- What else besides address is important for us?
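One way to picture the target ("structured data") is a simple record type per page. This is only a sketch; the field names below are illustrative and do not come from any existing schema of ours:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical target record: one per wiki page.
# Field names are placeholders, not an agreed-upon schema.
@dataclass
class ContactInfo:
    name: Optional[str] = None
    address: Optional[str] = None
    phone: Optional[str] = None

# Example of what a successful extraction might produce:
record = ContactInfo(
    name="Example Cafe",
    address="123 Main St, Springfield",
    phone="206-555-0100",
)
print(record.phone)  # → 206-555-0100
```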
Why this is important
We are moving toward using more highly structured data, but we need to leverage the large quantity of data users have already entered on our site.
DoneDone
- Easily identify which data has been added to or changed on a wiki page by human edits. (A standard diff may work?)
- Apply heuristics (section, regular expression, machine learning, something else) to determine if a piece of data should belong in a structured field.
- If all human-added data can be extracted, indicate that the entire wiki page should be deleted.
- If human-added data remains that can't be identified and extracted, return wikitext containing only the unidentified human data, with all bot-created data removed.
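The heuristic step could start with simple regular expressions before graduating to section-based rules or machine learning. The patterns below are rough illustrations only; real pages will use many formats these would miss:

```python
import re

# Illustrative heuristics only. A production classifier would need
# far more robust rules (or a trained model, as the spec suggests).
PHONE_RE = re.compile(r'\(?\b\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b')
ADDRESS_RE = re.compile(
    r'\b\d+\s+\w+(\s\w+)*\s(St|Ave|Rd|Blvd|Dr)\.?\b', re.IGNORECASE
)

def classify(line):
    """Return a structured-field name if the line looks like data, else None."""
    if PHONE_RE.search(line):
        return "phone"
    if ADDRESS_RE.search(line):
        return "address"
    return None

print(classify("Call us at 206-555-0100"))  # → phone
print(classify("123 Main St"))              # → address
print(classify("A lovely little shop"))     # → None
```

Lines classified this way can be routed into structured fields; anything returning None stays behind as unidentified human data.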
Steps to get to DoneDone
- Build many test cases: pick many random human-edited pages.
- Pull out revision histories (or at least diffs to compare to original bot scrape)
- Identify and extract human-edited data yourself. Great fun!
- Make the test cases pass, roughly in this order:
  - First identify all human-edited data.
  - Then classify and extract that data.
  - Then determine whether a page should be deleted and, if not, which data should be left behind.
- Throw a wild and crazy party
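The "identify human-edited data" step might be sketched with Python's standard difflib, comparing the original bot scrape against the current revision. The page texts below are made-up placeholders; a real pipeline would pull them from revision history:

```python
import difflib

# Placeholder page texts; in practice these come from revision history.
bot_version = [
    "== Contact ==",
    "Phone: 206-555-0100",
]
current_version = [
    "== Contact ==",
    "Phone: 206-555-0100",
    "Open until midnight on weekends!",
]

# ndiff prefixes added lines with "+ "; collect those as human-added data.
human_added = [
    line[2:]  # strip the "+ " prefix
    for line in difflib.ndiff(bot_version, current_version)
    if line.startswith("+ ")
]
print(human_added)  # → ['Open until midnight on weekends!']
```

A plain line diff like this won't catch in-line edits to bot-created lines, which is why the spec leaves the door open to other approaches.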
Discussion
Zoetrope is a research project at the University of Washington that carries this idea further by adding history and a drag-and-drop interface. Watch the video: