Webmasters May Shape Search Results


STATE COLLEGE — Not all Internet search engines yield the same results for the same query, but that may have as much to do with how Web sites are managed as how search engines work.

Web site administrators increasingly are barring some search engines from all or part of their sites, while granting others more access, according to a recent study by Penn State University researchers.

C. Lee Giles, an information sciences and technology professor and the lead author of the study, said site administrators may allow crawlers from Google Inc. the most access among search engines because they know Google produces a lot of traffic.

"When they first grew, did administrators say, 'Hmmm, this is really good, let's give them better access?'" he said. "And as a consequence, they're getting even (more)."

Search engines comb the Internet using programs known as "robots," "spiders," "crawlers" or "bots." A programmer can use a "robots.txt" file to police the crawlers trying to access a site.

"It's not that the search engines are better, but it's the people out there making policy and making decisions on what to crawl to," Giles said.

Robots.txt files are not mandatory, though they are becoming more popular. More than one-third of the 7,600 sites Giles and his team studied between December 2005 and October 2006 had such files.

The vast majority of robots.txt files, nearly 94 percent, managed overall access to a given site, the study found. The remaining files singled out crawlers from specific search engines. They welcomed crawlers from Google most often, followed by the engines at Microsoft Corp.'s MSN and Yahoo Inc.
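
To illustrate how a robots.txt file can favor one crawler over others, here is a minimal sketch using Python's standard urllib.robotparser module. The file contents, the "ExampleBot" crawler name and the example.com addresses are hypothetical and are not taken from the study; the sketch only shows the general mechanism of naming a specific crawler and applying a stricter default rule to everyone else.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt (not from the study): Googlebot may crawl
# everything, while every other crawler is kept out of /archive/.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /archive/
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given crawler may fetch the URL under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://www.example.com/archive/story.html"
    for bot in ("Googlebot", "ExampleBot"):
        print(bot, "allowed" if allowed(bot, page) else "blocked")
```

Compliance is voluntary: the file only states the site's wishes, and a crawler has to choose to honor them.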

"There's definitely a bias toward the traffic benefits Google offers vis a vis other search engines," Kevin Heisler, executive editor of Search Engine Watch, wrote in an e-mail.

But Rahul Lahiri, vice president of search product management for Ask.com, owned by IAC/InterActiveCorp, said Google's advantage makes little difference to consumers because other search engines can get to blocked sites through other links.

Ask doesn't use robots.txt files in marketing or product strategies but does use them to contact sites that may block it, Lahiri said.

Google Inc. spokeswoman Jessica Powell said the company has "worked with many Web publishers (and) done a lot of outreach to make their content discoverable," but sites are ultimately the ones to decide which search engines to let in.

Representatives of Yahoo Inc. and Microsoft Corp. did not immediately respond to requests for comment.

___

On the Net:

Read the study at: http://botseer.ist.psu.edu/

Comment from rallyrulz:

This is more likely to do with webmasters being idiots and not having a clue how to use a robots.txt file.
There is absolutely no reason to block another search engine, it's just dumb!

    Reply#1 - Wed Nov 28, 2007 6:57 PM EST
    Comment from Borisz:

    I think the authors haven't thought enough about why website owners might "ban" other search bots.

    While we were developing the search tool at Zuula (www.zuula.com), our site was regularly "crawled" by bots from many different search engines. Some of those bots were incredibly aggressive, with one of them almost crashing one of our servers due to the volume of requests sent by the bot to the server.

    At no time, however, was the bot misbehavior due to bots from one of the major search engines. It was always due to bots from lesser known engines.

    And, frankly, Google's bot was the best behaved, always adhering to all the requests we put in our robots.txt file.

    And I've found that our experience is not unusual. Other website owners -- in forums and other online discussions -- have echoed these sentiments. Yes, website owners are more open to crawling by Google due to the traffic Google can bring. But they'd be happy to let other bots crawl their sites, as well, if those bots would simply behave themselves better.

      Reply#2 - Fri Nov 30, 2007 2:00 PM EST