Official Google Blog: Introduction to Google Search Quality

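My domain was offline for nearly a month and the site lost ranking as a result, so I am quickly posting an article by an expert and treating it as English practice. The original: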

Search Quality is the name of the team responsible for the ranking of Google search results. Our job is clear: A few hundred million times a day people will ask Google questions, and within a fraction of a second Google needs to decide which among the billions of pages on the web to show them, and in what order. Lately, we have been doing other things as well. But more on that later.

For something that is used so often by so many people, surprisingly little is known about ranking at Google. This is entirely our fault, and it is by design. We are, to be honest, quite secretive about what we do. There are two reasons for it: competition and abuse. Competition is pretty straightforward. No company wants to share its secret recipes with its competitors. As for abuse, if we make our ranking formulas too accessible, we make it easier for people to game the system. Security by obscurity is never the strongest measure, and we do not rely on it exclusively, but it does prevent a lot of abuse.

The details of the ranking algorithms are in many ways Google’s crown jewels. We are very proud of them and very protective of them. By some estimates, more than one thousand programmer/scientist years have gone directly into their development, and the rate of innovation has not slowed down.

But being completely secretive isn’t ideal, and this blog post is part of a renewed effort to open up a bit more than we have in the past. We will try to periodically tell you about new things, explain old things, give advice, spread news, and engage in conversations. Let me start with some general pieces of information about our group. More blog posts will follow.

I should take a moment to introduce myself. My name is Udi Manber, and I am a VP of engineering at Google in charge of Search Quality. I have been at Google for over two years, and I have been working on search technologies for almost 20 years.

The heart of the group is the team that works on core ranking. Ranking is hard, much harder than most people realize. One reason for this is that languages are inherently ambiguous, and documents do not follow any set of rules. There are really no standards for how to convey information, so we need to be able to understand all web pages, written by anyone, for any reason. And that’s just half of the problem. We also need to understand the queries people pose, which are on average fewer than three words, and map them to our understanding of all documents. Not to mention that different people have different needs. And we have to do all of that in a few milliseconds.

The most famous part of our ranking algorithm is PageRank, an algorithm developed by Larry Page and Sergey Brin, who founded Google. PageRank is still in use today, but it is now a part of a much larger system. Other parts include language models (the ability to handle phrases, synonyms, diacritics, spelling mistakes, and so on), query models (it’s not just the language, it’s how people use it today), time models (some queries are best answered with a 30-minute-old page, and some are better answered with a page that has stood the test of time), and personalized models (not all people want the same thing).
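To make the idea behind PageRank a little more concrete, here is a minimal sketch of the textbook power-iteration formulation. It is an illustration only: the damping factor of 0.85, the handling of pages without outgoing links, the function name `pagerank`, and the four-page `toy_web` graph are all assumptions made for this example, and nothing here reflects how the much larger production system described in this post actually works.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration (illustrative sketch).

    links maps each page to the list of pages it links to.
    Returns a dict mapping each page to its rank; ranks sum to 1.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start from a uniform distribution.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {page: base for page in pages}

        # Each page passes its rank on in equal shares to the pages it links to.
        for page in pages:
            targets = links.get(page, [])
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" receives links from "a", "b", and "d".
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the page with the most incoming links ("c") ends up with the highest rank, and the page nobody links to ("d") ends up with the lowest, which is the intuition the algorithm is built on.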

Another team in our group is responsible for evaluating how well we’re doing. This is done in many different ways, but the goal is always the same: improve the user experience. This is not the main goal, it is the only goal. There are automated evaluations every minute (to make sure nothing goes wrong), periodic evaluations of our overall quality, and, most importantly, evaluations of specific algorithmic improvements. When an engineer gets a new idea and develops a new algorithm, we test the idea thoroughly. We have a team of statisticians who look at all the data and determine the value of the new idea. We meet weekly (sometimes twice a week) to go over those new ideas and approve new launches. In 2007, we launched more than 450 new improvements, about 9 per week on average. Some of these improvements are simple and obvious: for example, we fixed the way Hebrew acronym queries are handled (in Hebrew an acronym is denoted by a (“) before the last character, so IBM will be IB”M). Others are very complicated: for example, we made significant changes to the PageRank algorithm in January. Most of the time we look for improvements in relevancy, but we also work on projects where the sole purpose is to simplify the algorithms. Simple is good.
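The Hebrew acronym notation mentioned above (IBM written as IB”M) can be illustrated with a small sketch. The post does not say how Google actually handles such queries, so the helper names `mark_acronym`, `looks_like_acronym`, and `strip_acronym_mark`, and the set of quote characters accepted as the mark, are assumptions chosen only to show the notation.

```python
GERSHAYIM = "\u05F4"          # Hebrew punctuation gershayim
ASCII_QUOTE = '"'             # what most keyboards produce
RIGHT_DOUBLE_QUOTE = "\u201D" # typographic quote, as printed in this post
ACRONYM_MARKS = (GERSHAYIM, ASCII_QUOTE, RIGHT_DOUBLE_QUOTE)


def mark_acronym(letters, mark=ASCII_QUOTE):
    """Write an acronym in the Hebrew style: mark before the last character."""
    if len(letters) < 2:
        return letters
    return letters[:-1] + mark + letters[-1]


def looks_like_acronym(token):
    """True if a quote-like mark sits immediately before the last character."""
    return len(token) >= 3 and token[-2] in ACRONYM_MARKS


def strip_acronym_mark(token):
    """Remove the acronym mark so differently typed variants compare equal."""
    if looks_like_acronym(token):
        return token[:-2] + token[-1]
    return token


print(mark_acronym("IBM"))                # IB"M
print(strip_acronym_mark("IB\u201DM"))    # IBM
```

A query parser that treated the quotation mark as the start of a quoted phrase would split such a token in the wrong place, which is presumably why acronym queries needed dedicated handling.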

International search has been one of our key focus areas in the past two years. This means all spoken languages, not just the major ones. Last year, for example, we made major improvements in Azerbaijani, a language spoken by about 8 million people. In the past few months, we launched spell checking in Estonian, Catalan, Serbian, Serbo-Croatian, Ukrainian, Bosnian, Latvian, Filipino Tagalog, Slovenian and Farsi. We organized a network of people all over the world who provide us with feedback, and we have a large set of volunteers from all parts of Google who speak different languages and help us improve search.

Another team is dedicated to new features and new user interfaces. Having a great engine is necessary for a great car, but it is not sufficient. The car has to be comfortable and easy to drive. The Google search user interface is quite simple. Very few of our users ever read our help pages, and they can do very well without them (but they’re good reading nevertheless, and we’re working to improve them). When we add new features we try to ensure that they will be intuitive and easy to use for everyone. One of the most visible changes we made in the past year was Universal Search. Others include the Google Notebook, Custom Search Engines, and of course, many improvements to iGoogle. The UI team is helped by a team of usability experts who conduct user studies and evaluate new features. They travel all over the world, and they even go to people’s homes to see users in their natural habitat. (Don’t worry, they do not come unannounced or uninvited!)

There is a whole team that concentrates on fighting webspam and other types of abuse. That team works on a variety of issues, from hidden text to off-topic pages stuffed with gibberish keywords, plus many other schemes that people use in an attempt to rank higher in our search results. The team spots new spam trends and works to counter those trends in scalable ways; like all other teams, they do it internationally. The webspam group works closely with the Google Webmaster Central team, so they can share insights with everyone and also listen to site owners.

There are other teams devoted to particular projects. In general, our organizational structure is quite informal. People move around, and new projects start all the time.

One of the key things about search is that users’ expectations grow rapidly. Tomorrow’s queries will be much harder than today’s queries. Just as Moore’s law governs the doubling of computing speed every 18 months, there is a hidden unwritten law that doubles the complexity of our most difficult queries in a short time. This is impossible to measure precisely, but we all feel it. We know we cannot rest on our laurels; we have to work hard to meet the challenge. As I mentioned earlier, we will continue providing you with updates on search quality in the coming months, so stay tuned.

Because blogspot.com is blocked, you can search for the article title and read the cached copy.

Below I also repost a translation published in China:

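I call it the most authoritative because it is a post on the official Google blog by Udi Manber, the Google VP of engineering in charge of the ranking algorithms. Below I translate the main points; for the complete original, see "Introduction to Google Search Quality" on the official Google blog.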

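The Search Quality group is the team inside Google responsible for ranking search results. Google handles countless queries every day, and in less than a second it has to choose which of hundreds of millions of pages to return, and in what order to show them.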

Google has always been fairly secretive about its ranking algorithms, for two main reasons: competition and preventing abuse.

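The details of Google's ranking algorithms are Google's crown jewels. We are proud of them and take great care to protect them. But complete secrecy is not always ideal, so Udi Manber and his colleagues have decided to communicate more with webmasters: talk about what is new, explain older things, give advice, join conversations, and so on. This post is the first, and more will follow.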

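The heart of this group is the core ranking team. Ranking is quite hard, harder than most people imagine. One reason is that languages are ambiguous, documents follow no rules, and there is no standard for how to interpret information. So we need to understand any web page, written by anyone, for any reason. That is only part of it. We also need to understand users' queries and map those queries onto the documents we have understood. Not to mention that different people have different needs. And we need to do all of this within a few milliseconds.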

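The best-known part of Google's ranking algorithm is PageRank. PR is still in use today, but it is now one part of a much larger system. Other parts include language models (the ability to handle phrases, synonyms, diacritics, spelling mistakes, and so on), query models (not just the language, but also how people use it), time models (some queries are best answered with a page created 30 minutes ago, while others are better answered with a page that has existed for a long time), and personalization models (not everyone needs the same thing).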

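Another team is responsible for evaluating how well we are doing. The goal is to improve the user experience; this is not the main goal, it is the only goal. There are automated evaluations every minute, periodic evaluations of overall quality, and, more importantly, evaluations of individual algorithm changes. When an engineer has a good idea and develops a new algorithm, we test the idea. A team of statisticians examines the data and determines the value of the new idea.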

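In 2007 we made more than 450 adjustments, roughly 9 per week. For example, in January of this year we made major changes to the PR algorithm. Most of the time we look for ways to improve relevance, but sometimes we also work on simplifying the algorithms; simple is good. (Zac's note: the sentence about simplifying the algorithms comes right after the mention of the PR algorithm changes, so I am not sure whether he means that the PR algorithm was simplified or is speaking about simplifying algorithms in general. My feeling is that the PR algorithm really has changed a lot. My impression is that the toolbar PR shown to us is basically inaccurate; in particular, many inner pages that should have a PR value show a PR of zero. Perhaps this is a result of the algorithm simplification Udi Manber mentions.)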

In the past two years, international search has been one of our main focus areas, covering all languages and not only the major ones.

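Another team is dedicated to new features and the user interface. Google's search interface is quite simple, and when we add new features we try to make sure they are simple and easy to use. In the past year, the main changes included Universal Search, Google Notebook, Custom Search Engines, and improvements to iGoogle. The UI team is supported by a group of usability experts who help study users and evaluate new features.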

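There is also a team that concentrates on fighting spam and other kinds of abuse; this is the team that Matt Cutts leads. The team identifies new spam techniques and counters them in scalable ways. Like the other teams, it works across many languages internationally. The webspam team works closely with the Google Webmaster Central team.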

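There are other teams devoted to particular projects. Overall, our organizational structure is quite informal: people move around often, and new projects start all the time.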



  Repost notice: reposted from 居者鸿儒. Article URL: http://www.zhuhong.org/archives/official-google-blog-introduction-to-google-search-quality.html

  This blog is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 license.



2 replies


  1. 卢松松 says:

    What an expert. Let's visit each other's blogs more often.

  2. @卢松松, I don't deserve such praise from you, Brother Song!

