August 25, 2006

Aggregated Content: Spam or Service?

Yesterday, Tim Converse wrote a very interesting article discussing the challenges of telling quality aggregation of content apart from spam. This task can be a major challenge for search engines - what baseline separates a resource like Google News from your average feed scraper?

Google News provides excellent, high-quality results and has stringent requirements for news providers. Average feed scrapers scrape, well, whatever they can find. But can an algorithm tell the difference?

Tim notes, in particular, the interesting recursive nature of searching aggregators. Since many aggregators scrape results from other search engines, a search result can easily turn out to be another engine’s search results page, which in turn points to yet another.

As an in-between case ask yourself this: if you’re doing a websearch (on Google, Yahoo!, MSN, …) do you want any of the results to be … search-result pages themselves (from Google, Yahoo!, MSN)? That is, if you search for “snorklewacker” on MSN web search, and you click on result #4, do you want to find yourself looking at a websearch results page for “snorklewacker” on Yahoo! Search, which in turn has (as result #3) the Google search results page for “snorklewacker”?

Altogether, an interesting question - though one without any really conclusive answers.

Filed under: Spam, Web Services

June 16, 2006

6 Degrees of Web page linking

The six degrees of separation meme theorizes that anyone on earth can be connected to any other person on the planet through a chain of acquaintances containing at most five intermediaries.

Does this theory also apply to websites?

There are a number of problems in determining how to apply this kind of theory. When dealing with people, it’s clear that each person knows a relatively small and finite group of acquaintances. With websites, there are certain sites (directories and search engines, for example) that link to enormous numbers of other sites. Thus, it is ineffective to treat the six degrees meme in web terms at the site level - instead, it must be treated at the page level. At the page level, a link to Google.com’s index page means only a link to a handful of other Google properties.

This idea occurred to me as a means of identifying your proximity to any given spam website resource. It would be intriguing to analyze particular pages and trace the number of connections necessary to get from your page to a known spam or banned website.

There are a number of technological barriers to this, of course. Google contains all the information one would need to do this - but is it possible to access it? I’m not sure it is - you’d need to be able to determine whether Google had flagged a site and you’d need to be able to identify the tracks of Google’s crawler to determine the site path. It’s certainly possible to program your own crawler, as well - but well beyond my puny programming skills!
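For the curious, the tracing step itself is small once the link data exists. Here is a rough sketch in Python, assuming you have already crawled pages into a simple dictionary of page-to-outgoing-links and that you have your own list of flagged pages. The function name `degrees_to_flagged`, the example URLs, and the six-hop cutoff are all made up for illustration; none of this reflects how Google stores or flags anything.

```python
from collections import deque


def degrees_to_flagged(links, start, flagged, max_degrees=6):
    """Return the fewest link hops from `start` to any flagged page.

    `links` maps a page to the pages it links to (a hypothetical,
    pre-crawled link graph). Returns None when no flagged page is
    reachable within `max_degrees` hops.
    """
    if start in flagged:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    # Breadth-first search: pages are visited in order of hop count,
    # so the first flagged page found is also the closest one.
    while queue:
        page, hops = queue.popleft()
        if hops == max_degrees:
            continue  # do not follow links beyond the cutoff
        for target in links.get(page, ()):
            if target in seen:
                continue
            if target in flagged:
                return hops + 1
            seen.add(target)
            queue.append((target, hops + 1))
    return None


# Toy data, invented for this example.
links = {
    "mysite.example/index": ["partner.example/links"],
    "partner.example/links": ["directory.example/page7", "mysite.example/index"],
    "directory.example/page7": ["spam.example/pills"],
}
flagged = {"spam.example/pills"}

print(degrees_to_flagged(links, "mysite.example/index", flagged))  # prints 3
```

The hard part remains the one described above: getting a trustworthy link graph and a trustworthy list of flagged pages in the first place.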

Is the good link/bad link dichotomy accurate? I wouldn’t think so. The scale of link quality, in my opinion, is really made up of shades of gray - one link may be the darkest pits of Hades, but another may be more like a slightly off-color joke. Not something you’d say in front of your parents, but only really inappropriate in certain company.

The six degrees meme could be one way of determining the potential quality of a linking site, or of identifying the existence of Googlebowling (if such a thing is indeed possible). It could perhaps also be a way of determining the risk of a site becoming associated with questionable sites - if a site has a number of pages which are currently only 2 degrees separated from highly questionable sites, this suggests that its existing link partners are making poor choices, which may come to reflect on the site itself.
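The "how many of my pages are within 2 degrees" check could be sketched along the same lines. This version, again in Python and again built on an invented link graph, walks backwards from the questionable pages so that every page's distance is found in one pass. The name `pages_near_flagged` and all of the example URLs are hypothetical.

```python
from collections import deque


def pages_near_flagged(links, flagged, site_pages, max_degrees=2):
    """Map each page in `site_pages` to its hop distance from a flagged
    page, keeping only pages within `max_degrees` hops.

    `links` maps a page to the pages it links to (assumed to be
    crawled already).
    """
    # Reverse the graph so we can ask "who links to this page?"
    reverse = {}
    for source, targets in links.items():
        for target in targets:
            reverse.setdefault(target, set()).add(source)

    # Breadth-first search starting from all flagged pages at once.
    distance = {page: 0 for page in flagged}
    queue = deque(flagged)
    while queue:
        page = queue.popleft()
        if distance[page] == max_degrees:
            continue  # pages further out are not of interest
        for source in reverse.get(page, ()):
            if source not in distance:
                distance[source] = distance[page] + 1
                queue.append(source)

    return {page: distance[page] for page in site_pages if page in distance}


# Toy data, invented for this example.
links = {
    "mysite.example/a": ["partner.example/links"],
    "mysite.example/b": ["blog.example/post"],
    "mysite.example/c": ["partner.example/links", "blog.example/post"],
    "partner.example/links": ["spam.example/pills"],
    "blog.example/post": ["news.example/story"],
    "news.example/story": ["spam.example/pills"],
}
flagged = {"spam.example/pills"}
site_pages = ["mysite.example/a", "mysite.example/b", "mysite.example/c"]

near = pages_near_flagged(links, flagged, site_pages)
print(sorted(near.items()))
```

In this toy graph, pages `a` and `c` sit 2 degrees from the flagged page because of one shared link partner, while page `b` is 3 degrees away and is left out. A real tool would probably want a graded score per page rather than a flat flagged list, given the shades of gray mentioned above.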

If you engage in link exchange - be careful. You never know who else your link partners might be exchanging with. (I guess this could apply to many other things, as well!)

Filed under: Links, Spam
