Friday, March 5, 2004

Mining the Tagged Web: "Several years ago, researchers at the IBM Almaden Research Center in San Jose, Calif., began an effort to study the Web as a mathematical graph—a collection of nodes (representing Web pages) and lines (representing hyperlinks). They were interested in studying various properties of this graph, including its diameter and connectedness, to obtain insights into algorithms for crawling and searching the Web and to characterize the Web's sociological evolution.

"To obtain data, the researchers conducted Web crawls that encompassed 200 million pages and 1.5 billion hyperlinks. They confirmed that the distribution of pages and link number follows a simple mathematical relationship known as a power law. In essence, most pages incorporate just a few outgoing links, whereas a few pages have a huge number.

"'In a sense, the Web is much like a complicated organism, in which the local structure on a microscopic scale looks very regular (like a biological cell), but the global structure exhibits interesting morphological structures (body and limbs) that are not obviously evident in the local structure,' Ravi Kumar of IBM and his coworkers concluded in a paper presented in 2000 at the Ninth World Wide Web Conference.

"The effort to amass data about the structure and content of the rapidly growing Web didn't end there. It continued and now encompasses about half of the Web and includes much 'informal' communication, such as Web logs, newsgroups, and chat rooms. The resulting panoply of data has become the basis of an ambitious commercial service that IBM recently launched called WebFountain."

"Both Google and WebFountain stemmed from academic research about text mining and the insight that the best way to find information is to focus on the biggest and most popular sites and Web pages. WebFountain goes one step further in trying to make sense of the pages themselves by tagging the information in a clear, consistent way. Any data miner that comes along now has a vast playing field on which to test its skill and prove its value."
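
The power-law remark in the quoted passage (most pages have just a few outgoing links, a few pages have a huge number) is easy to see on a toy example. The Python sketch below is only an illustration, not the IBM researchers' method: the link list is made-up data, and the names `toy_links`, `out_degree_histogram`, and `log_log_slope` are hypothetical helpers invented here. It counts how many pages have each number of outgoing links, then fits a straight line to the counts on log-log axes; a roughly straight line with negative slope is the usual informal sign of a power law.

```python
import math
from collections import Counter


def toy_links():
    """Build a made-up list of (source, target) links: 8 pages with 1 link,
    4 pages with 2 links, 2 pages with 4 links, and 1 page with 8 links."""
    links = []
    page_id = 0
    for degree, page_count in [(1, 8), (2, 4), (4, 2), (8, 1)]:
        for _ in range(page_count):
            source = f"page{page_id}"
            page_id += 1
            links.extend((source, f"target{k}") for k in range(degree))
    return links


def out_degree_histogram(links):
    """Map each out-degree to the number of pages that have it.
    Pages with no outgoing links never appear as a source, so they are not counted."""
    out_degree = Counter(source for source, _ in links)
    return Counter(out_degree.values())


def log_log_slope(histogram):
    """Least-squares slope of log(page count) against log(out-degree)."""
    points = [(math.log(d), math.log(c)) for d, c in histogram.items()]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


histogram = out_degree_histogram(toy_links())
for degree in sorted(histogram):
    print(degree, histogram[degree])
print("log-log slope:", log_log_slope(histogram))
```

The toy data is built so that the page count halves each time the number of links doubles, which makes the fitted slope come out at about -1. That number says nothing about the real Web; a least-squares fit on log-log axes is also only a rough check, and a crawl-sized data set would call for a more careful estimate.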
