Difference between revisions of "IncreasePagesWithAdvertisements"

Revision as of 12:44, 22 November 2007

What (summary)

Improve the AdSidebar adult content filter so that fewer non-adult pages are flagged and advertisements show up on more pages. Even more importantly, catch more of the pages that actually do have adult content on them so that Google doesn't get angry at us for showing ads on adult pages.

Why this is important

Revenue determines how many resources we have to spend on developing cool features and tools for our community. We need to improve our revenue.

DoneDone

  • Advertisements show up on at least 90% of the pages that are not adult content
  • Advertisements show up on at most 5% of the pages that are adult content

Steps to get to DoneDone

  • Collect a sample of pages
    • hand-audit for adult pages in sample
  • Partition the adult content keywords into levels of suggestiveness
  • If 1 of these words is detected ... label as adult
  • If 2 or more of these words show up at least once each in the page ... label as adult
  • If 3 or more of these words show up at least once each in the page ... label as adult
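The threshold rules above can be sketched roughly as follows. This is a minimal illustration, not the AdSidebar code: it assumes each rule applies to a different suggestiveness level, and the keyword sets, `TIERS`, and `is_adult` are hypothetical placeholders.

```python
import re

# Hypothetical keyword tiers: placeholder words, not the real filter lists.
# Each tier pairs a keyword set with the number of distinct keywords that
# must each appear at least once before the page is labeled adult.
TIERS = [
    ({"explicitword"}, 1),                           # most suggestive
    ({"suggestive1", "suggestive2", "suggestive3"}, 2),
    ({"mild1", "mild2", "mild3", "mild4"}, 3),       # least suggestive
]


def is_adult(text):
    """Return True if any tier reaches its distinct-keyword threshold."""
    # Work with the set of distinct words so repeats of one word count once.
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    for keywords, threshold in TIERS:
        if len(words & keywords) >= threshold:
            return True
    return False
```

Counting distinct words rather than total occurrences matches the "at least once each" wording: a page repeating one mildly suggestive word many times stays unlabeled.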

Current Status

Today we worked on two different approaches. The first was to experiment with a scheme that uses different regular expressions (REs), grouped by how suggestive their keywords are. Using simple match counts, this method improved the numbers, but not significantly.

The second approach we tried was a statistical one. We wrote code to train and then calculate probabilities of specific features from documents.

As a next step, we will refine how we estimate the probabilities and implement a basic structure that decides whether a given document is adult or not based on the probabilities of the words it contains.
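One common way to classify a document from per-word probabilities is a naive Bayes model. The page does not say which model is being used, so the sketch below is only an assumption of what such a structure could look like; `tokenize`, `train`, `classify`, and the label names are hypothetical.

```python
import math
import re
from collections import Counter

LABELS = ("adult", "notadult")


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def train(labeled_docs):
    """Count words per label from (text, label) pairs.

    Assumes every label in LABELS has at least one training document.
    """
    word_counts = {label: Counter() for label in LABELS}
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in LABELS:
        # Log prior from the share of training documents with this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # skip words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)
```

Summing log probabilities instead of multiplying raw probabilities keeps long documents from underflowing to zero.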

I'm glad to hear that you guys are trying lots of different things and learning what works and what doesn't. I look forward to an improvement on the percentages today! --Brandon 21:42, 21 November 2007 (PST)

Current Numbers

Numbers and percentages as of today (November 22, 2007)

  • Correctly marked Adult: 25495
  • Correctly marked as NotAdult: 1023
  • Incorrectly marked as Adult: 395
  • Incorrectly marked as NotAdult: 10613
  • Non-adult pages showing advertisements: 1023 / (395 + 1023) = 72.1 %
  • Adult pages showing advertisements: 10613 / (25495 + 10613) = 29.3 %
  • Total: 37,526
  • Correctly Marked as Adult: 33,447
  • Correctly Marked as Not Adult: 1,097
  • Incorrectly Marked as Adult: 321
  • Incorrectly Marked as Not Adult: 2,661
  • Non-adult pages showing advertisements: 1097 / (321 + 1097) = 77.3 %
  • Adult pages showing advertisements: 2661 / (33447 + 2661) = 7.3 %
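The two percentages in each set follow from the four counts, and can be checked against the DoneDone targets with a small helper. The function names `ad_rates` and `meets_donedone` are hypothetical, introduced only for this sketch.

```python
def ad_rates(correct_adult, correct_notadult, wrong_adult, wrong_notadult):
    """Return (% of non-adult pages showing ads, % of adult pages showing ads)."""
    # Pages that really are non-adult: marked NotAdult correctly,
    # or marked Adult incorrectly.
    notadult_total = correct_notadult + wrong_adult
    # Pages that really are adult: marked Adult correctly,
    # or marked NotAdult incorrectly.
    adult_total = correct_adult + wrong_notadult
    notadult_with_ads = 100.0 * correct_notadult / notadult_total
    adult_with_ads = 100.0 * wrong_notadult / adult_total
    return notadult_with_ads, adult_with_ads


def meets_donedone(notadult_with_ads, adult_with_ads):
    """DoneDone: ads on at least 90% of non-adult and at most 5% of adult pages."""
    return notadult_with_ads >= 90.0 and adult_with_ads <= 5.0


first = ad_rates(25495, 1023, 395, 10613)
second = ad_rates(33447, 1097, 321, 2661)
print(meets_donedone(*first), meets_donedone(*second))
```

Neither set of numbers reaches both DoneDone targets yet, although the second set is much closer on the adult-page side.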

Interesting Cases

  • NiceRoundAsses.com ... because all of the naughty words are buried in domain names ... but there are a lot of them
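A case like this slips past whole-word matching because the keywords are run together inside domain names. One possible workaround, sketched here purely as an assumption and not as what the filter does, is a substring scan restricted to domain names found in the page; `KEYWORDS`, `DOMAIN_RE`, and `keywords_in_domains` are hypothetical.

```python
import re

# Placeholder keyword, not taken from the real filter lists.
KEYWORDS = {"naughtyword"}

# Rough pattern for domain-like strings such as "example.com".
DOMAIN_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b")


def keywords_in_domains(text, keywords=KEYWORDS):
    """Return the keywords that occur as substrings of any domain in text."""
    hits = set()
    for domain in DOMAIN_RE.findall(text.lower()):
        for keyword in keywords:
            if keyword in domain:
                hits.add(keyword)
    return hits
```

Substring matching is prone to false positives when a keyword happens to sit inside an innocent word, so limiting it to domain names and to the most suggestive keywords would probably be safer than applying it to the whole page.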
