Thursday, November 12, 2009

Reading Notes for 11/17 Class

"Web Search Engines," Parts 1 and 2
In the first part of this series, the author makes the point that there is simply far too much data on the web for every page to be indexed; automatically generated pages and constant updates make the number of pages effectively infinite. He goes on to explain that all the major search engines share a similar infrastructure and ranking approach, though the specifics are closely guarded secrets. Part Two explains how web pages are indexed for retrieval; pages are typically scored by link popularity and by the words and phrases they contain. Since searching the full index can be slow, search engines take shortcuts to return results quickly, such as skipping parts of the data set and caching results for the most popular sites (like Wikipedia) for quick returns.
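
To make the indexing idea a little more concrete, here is a toy sketch in Python of an inverted index with a simple result cache. This is only my own illustration of the general concept described in the reading, not how any real search engine is implemented; the page text and URLs are made up for the example.

    # Toy sketch: an inverted index maps each word to the pages containing it,
    # and popular queries can be answered from a cache instead of the index.
    # This is an illustration only, not any real engine's implementation.
    from collections import defaultdict

    pages = {
        "wikipedia.org/Search_engine": "search engines index pages by the words and links they contain",
        "example.com/updates": "automatically generated pages and updates make the web effectively infinite",
    }

    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)

    cache = {}  # popular queries answered from the cache for quick returns

    def search(query):
        """Return pages containing every word of the query, using the cache when possible."""
        if query in cache:
            return cache[query]
        results = set(pages)
        for word in query.lower().split():
            results &= index.get(word, set())
        cache[query] = results
        return results

    print(search("index pages"))  # -> {'wikipedia.org/Search_engine'}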

"Current Developments and Future Trends for the OAI Protocol for Metadata Harvesting"
The goal of the Open Archives Initiative (OAI) is to promote standards that support interoperability. To this end, its Protocol for Metadata Harvesting (OAI-PMH) sets standards for metadata built on XML, HTTP, and Dublin Core, making it easier for institutions to share information without conflicts between differing metadata systems. The protocol has developed fairly effectively: it sets clear standards, supports good search services over the harvested records, and exposes data in a form that harvesters can process efficiently. I think this is a good program for libraries to understand and work with, since cooperation between institutions is a way to save resources while promoting efficiency.
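
As a small example of what an OAI-PMH exchange looks like, here is a Python sketch that issues a ListRecords request for unqualified Dublin Core records and prints the titles. The verb and metadataPrefix come from the protocol itself, but the repository URL is a hypothetical placeholder, not a real endpoint.

    # Sketch of an OAI-PMH harvest: request records in unqualified Dublin Core
    # (oai_dc) and print their titles. The ListRecords verb and metadataPrefix
    # parameter are defined by the protocol; the base URL is hypothetical.
    from urllib.request import urlopen
    from urllib.parse import urlencode
    import xml.etree.ElementTree as ET

    BASE_URL = "https://example.org/oai"  # hypothetical repository endpoint
    params = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})

    with urlopen(BASE_URL + "?" + params) as response:
        tree = ET.parse(response)

    # Titles are plain Dublin Core elements inside each harvested record.
    DC = "{http://purl.org/dc/elements/1.1/}"
    for title in tree.getroot().iter(DC + "title"):
        print(title.text)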

"The Deep Web: Surfacing Hidden Value"
The essential point of this article is that traditional search engines (Yahoo, Google, etc.) only skim the surface of the web's content. Their algorithms favor pages with the most links pointing to them, which cover a relatively small portion of the internet. The author shows that the web is much, MUCH larger (up to 500 times larger) than we believe it to be, but a huge portion of the content sits in "deep web" sites that can only be reached through direct queries. The author then makes the case for BrightPlanet's search technology, which is specifically designed to find documents in the deep web. This is an interesting technology, since, as the author says, valuable information is being ignored simply because it cannot be easily accessed. These search methods may be of interest to libraries because they can help extend and deepen results, giving more thorough answers to patrons' queries.
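
The "direct query" point is easier to see with a small sketch: deep-web content usually sits behind a search form backed by a database, so a page only comes into existence when a query is submitted. The endpoint and parameter names below are hypothetical, chosen just to illustrate the mechanism.

    # Sketch of a direct query against a database-backed search form. A crawler
    # that only follows static links never sees this page, because it is
    # generated on the fly in response to the query. URL and field names are
    # hypothetical.
    from urllib.request import urlopen
    from urllib.parse import urlencode

    FORM_URL = "https://example.org/catalog/search"  # hypothetical deep-web form
    query = urlencode({"q": "local history photographs"})

    with urlopen(FORM_URL + "?" + query) as response:
        html = response.read().decode("utf-8", errors="replace")

    print(html[:500])  # the dynamically generated result page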

4 comments:

  1. I'm willing to bet that most people aren't aware of the size of the web. Many probably think that Google can find everything!!

  2. I think that it would be fun to search through the deep web in your spare time. I don't know if people coming into the library are looking for anything more than a quick, simple answer unless they are doing intense research.
    I tend to think of public libraries whenever I read these articles, so it might be helpful from a different perspective.

  3. I wonder how you go about searching the Deep Web, and if the BrightPlanet search engine is available to use. So often, if people want to do serious research, they will go to the library where they can have access to journals and databases. I wonder if the average person even cares what they are missing as long as they get the information they are looking for (the recipe they wanted, or that great quote to use in a speech...).

  4. I don't think too many people really do care what information is missing. The important thing is that they get the results they are looking for. I wonder though what else is out there on the web. There have been times that I was unable to find what I was looking for on the web. I am curious to find out if some of that info is out there on the deep web.
