How Search Engines Detect and Handle 404 Error Pages and Login Pages


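The same sentence can mean different things to different people. If you are in the eastern or central United States and did not catch what someone said, you can say "excuse me" to ask them to repeat it. Say "excuse me" in the same situation in a southern region, though, and you may get surprised looks from people who have no idea what you mean.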

This kind of mismatch is not limited to spoken language. It shows up in the search engine world too.

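Many websites are not set up correctly. When a visitor or a search engine crawler requests a page that does not exist, the site displays a 404, 403 or 5XX error page, but the HTTP status code sent with that page is 200, as if it were a normal page. The search engine is told that the page is fine and has no problems. Often the page is only temporarily unavailable, for example because of a database failure. Even so, strictly speaking, the server should not return 200 at that moment.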
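As a rough illustration of returning an honest status code, here is a minimal sketch using Python's standard `http.server`. The `PAGES` dictionary, the `choose_status` helper and the `backend_ok` flag are hypothetical names made up for this example; a real site would plug the same decision into its own framework.

```python
from http import HTTPStatus
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory "site": path -> HTML body.
PAGES = {"/": "<h1>Home</h1>", "/about": "<h1>About</h1>"}


def choose_status(path, pages, backend_ok=True):
    """Pick the status code that matches what the visitor will actually see."""
    if not backend_ok:
        # Temporary failure (e.g. database down): report it instead of 200.
        return HTTPStatus.SERVICE_UNAVAILABLE
    if path not in pages:
        return HTTPStatus.NOT_FOUND
    return HTTPStatus.OK


class Handler(BaseHTTPRequestHandler):
    backend_ok = True  # set to False to simulate a database outage

    def do_GET(self):
        status = choose_status(self.path, PAGES, self.backend_ok)
        if status == HTTPStatus.OK:
            body = PAGES[self.path]
        else:
            # A friendly error page is fine as long as the status code is honest.
            body = f"<h1>{status.value} {status.phrase}</h1>"
        data = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def run(port=8000):
    """Start the demo server on localhost (blocks until interrupted)."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

Calling `run()` and requesting `/missing` should show an error page with a 404 status rather than 200. The sketch matches `self.path` literally, so query strings are not handled.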

Redirecting visitors from an unreachable page straight to the site's home page is another setting many webmasters use.

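Very often this setting confuses search engine crawlers. The crawler concludes that the page is normal, so the search engine keeps its references to the page, keeps it indexed, and never removes it from the index. Pages like this do not belong in a search engine: they lower the quality of its results and make users wade through a lot of useless information. So we should return the correct 404, 403 or 5XX status whenever possible, which makes the site friendlier to search engines.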

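On the other hand, our articles often cite or link to external URLs. We may have been logged in when we visited them, so they work for us, but a visitor who clicks the same URL cannot get in and is sent to a login page or shown a "permission denied" message. This is a problem too. We should keep search engines from picking up such URLs from our pages and indexing them, because that can also lower how search engines rate our site.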

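When a visitor opens an unreachable page and sees a 404 error message while the server actually returns a 200 status, the search engine is misled. Search engines have a specific name for this behavior: "soft 404". The search engine treats the 404 page as a normal 200 page, so in effect the site is deceiving the crawler.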
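One simple way a site owner might check their own site for soft 404s is to request a URL that almost certainly does not exist and look at the status code that comes back. The sketch below assumes that approach; it is not the method from Yahoo's filing, and the function names `http_status` and `serves_soft_404` are made up for illustration.

```python
import uuid
from urllib.error import HTTPError
from urllib.parse import urljoin
from urllib.request import urlopen


def http_status(url):
    """Return the final HTTP status code for a GET request to url."""
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.status
    except HTTPError as err:
        # urllib raises for 4xx/5xx responses; the code is what we want.
        return err.code


def serves_soft_404(base_url, get_status=http_status):
    """Probe a random path that should not exist on the site.

    A 200 answer suggests the site's error pages are soft 404s.
    """
    probe = urljoin(base_url, "/" + uuid.uuid4().hex + ".html")
    return get_status(probe) == 200
```

Because `urlopen` follows redirects, a site that redirects missing pages to its home page also ends up reporting 200 here, which is the same situation described above. The `get_status` parameter exists only so the check can be exercised without network access.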

Yahoo's report states clearly that many such deceptive 404 pages exist, and Yahoo has written dedicated software to detect them.

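On a healthy World Wide Web, wrong page information should not spread from one site's pages to another site. Site administrators should check their own pages to make sure this does not happen. That benefits site owners and search engines alike, and keeps web resources useful.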

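At present, a search engine can handle this by finding the login page directly and then replacing the bad link with the login page or the home page, but this costs the search servers a great deal of computation. To stay fast, the servers will often simply discard such pages.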

Below is an excerpt from a write-up of Yahoo's soft 404 patent filing, for readers who are interested:

  

The patent application is:

Unsupervised Detection of Web Pages Corresponding to a Similarity Class
Invented by Mahesh Tiyyagura
Assigned to Yahoo
US Patent Application 20090157607
Published June 18, 2009
Filed December 12, 2007

In addition to a class for soft 404 error pages, other classes might also be determined, such as for pages that indicate:

  • Out of stock
  • Program exception
  • Permission denied and
  • Login required

The crawling of web pages usually happens independently of the indexing of content on those pages. Before the pages are indexed, some analysis of the content and URLs found on a site may take place, including a process like the one described in this patent filing, which may determine similarity classes of the web pages.

Why a Search Engine Might Want to Identify Soft 404s

Some of the reasons why a search engine might want to determine if there are soft 404 pages on web sites can include:

1) A recognition that the soft 404 pages and their URLs do not pertain to useful information, which means that a search engine wouldn’t need to index those pages.

2) Reducing (or decaying) a “freshness” value for pages linking to those soft 404 pages, which those pages might have gained based upon a link-based ranking algorithm. In other words, pages with dead links may rank less highly in terms of “freshness.” If a search engine doesn’t recognize that one or more links on a page point to soft 404 pages, it might rank that page more highly based upon a freshness factor. Identifying soft 404s means that a search engine won’t give a page a ranking boost based upon freshness.

3) For pages on sites that might show advertising from search engines, where a soft 404 is shown or a requirement to login, or another similarity class that doesn’t provide useful information, the patent filing tells us that it is assumed that visitors are likely to want to navigate quickly away from such pages. We’re also told that more generic advertising might be shown on those pages, or ads that occupy more screen real estate than for other pages on a site.

The patent filing provides some details on how pages might be clustered together based upon their content, and how URLs might be determined to be similar. The paper “Syntactic Clustering of the Web” is mentioned as an example of a clustering and shingling technique that could be used, as is the process described in the patent “Method for Clustering Closely Resembling Data Objects”.
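To give a feel for the shingling idea, here is a minimal sketch of word-level shingles compared with Jaccard resemblance, which is the general technique associated with that paper. It is only an illustration: the filing's actual clustering procedure is not reproduced here, and the function names and the default window size `w=4` are assumptions.

```python
def shingles(text, w=4):
    """Return the set of contiguous w-word sequences (shingles) in text."""
    words = text.lower().split()
    if len(words) < w:
        # Text shorter than one window: treat it as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)
```

Two error pages that differ only in a few words share most of their shingles and score close to 1.0, while an error page and a real article score close to 0.0, so pages could be grouped by thresholding this value. Large-scale systems typically compare compact sketches of the shingle sets rather than the full sets.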

Conclusion

This patent application from Yahoo describes a process that might be used when a site isn’t set up to send a proper 404 (not found) status: a visitor sees a 404 message on the page they view, but their browser and search engine crawling programs receive a 200 (ok) status instead.

It’s recommended that site owners fix problems like soft 404s rather than relying upon processes like the ones described in this patent filing. It’s to the benefit of the search engine and site owners to recognize when miscommunications like soft 404s happen, but it’s even better if the wrong messages aren’t sent in the first place.

 
