Google States Case for Online News in WSJ
December 4, 2009
Update: The Wall Street Journal is running a piece from Google CEO Eric Schmidt on how Google can help newspapers. It's an interesting read.
Original Article: Google has created a new web crawler specifically for Google News. This gives publishers who do not want Google News to index their content an easier way to control that. The same goes for publishers who don't want to cut out indexing completely, but do want to restrict certain elements of their content from being indexed.
Google offers this new crawler at a time when its relationship with online news is a heavy focus of discussion throughout the industry, with the FTC's meeting of the media minds taking place. Earlier this week, Google announced changes to how it handles paid content (offering a five-article limit for the "first click free" plan). Now the company appears to be extending its olive branch further to concerned publishers, though whether that will be enough is another discussion.
In the past, publishers have been able to block Google from content via robots.txt and the Robots Exclusion Protocol (REP). They have also been able to keep content out of Google News while staying in Google Search by using a contact form provided by Google. Now, Google is making it so publishers don't even have to contact the company.
"Now, with the news-specific crawler, if a publisher wants to opt out of Google News, they don't even have to contact us - they can put instructions just for user-agent Googlebot-News in the same robots.txt file they have today," says Google News Senior Business Product Manager Josh Cohen. "In addition, once this change is fully in place, it will allow publishers to do more than just allow/disallow access to Google News. They'll also be able to apply the full range of REP directives just to Google News. Want to block images from Google News, but not from Web Search? Go ahead. Want to include snippets in Google News, but not in Web Search? Feel free. All this will soon be possible with the same standard protocol that is REP."
"While this means even more control for publishers, the effect of opting out of News is the same as it's always been," says Cohen. "It means that content won't be in Google News or in the parts of Google that are powered by the News index. For example, if a publisher opts out of Google News, but stays in Web Search, their content will still show up as natural web search results, but they won't appear in the block of news results that sometimes shows up in Web Search, called Universal search, since those come from the Google News index."
Cohen says Google News users shouldn't notice any difference in their experience with the service. It will be interesting to see the reaction from disgruntled publishers, and whether or not this will make any significant difference in how they view Google News.
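As a rough illustration of the opt-out Cohen describes, the sketch below checks a robots.txt file that blocks only the Googlebot-News user-agent. The robots.txt contents, the example.com URL, and the `allowed` helper are assumptions made up for this example, and Python's standard-library parser only approximates how Google's crawlers read the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of Google News only,
# while leaving every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: Googlebot-News
Disallow: /

User-agent: *
Disallow:
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "https://www.example.com/2009/12/story.html"  # placeholder URL
    for agent in ("Googlebot-News", "Googlebot"):
        print(agent, "may fetch:", allowed(agent, story))
```

With this file, the news crawler is turned away while the regular web crawler is still let in, which matches the "out of Google News, still in Web Search" case described above.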
Google May Change Your Page Titles
November 12, 2009
In case you were not aware, Google "reserves the right" to change the titles of your pages in search results. Google's Matt Cutts has released a video discussing why and how they go about doing this.
Cutts says Google wants to show the titles that it thinks are most useful. "For example, suppose the title of your page is 'Untitled' or if there is no title. If that's the case, we try to show a relevant, useful title."
"We reserve the right to try to figure out what's a better title, what's a more descriptive title or snippet to show the users," he continues.
According to Cutts, if you have a title that's really long, they may still use that in their scoring, but in the snippet, they might try to find a "better title." This is presumably based on what the user is looking for.
As Cutts has said in the past, sometimes Google will use snippets right from the Open Directory Project (DMOZ). Sometimes, they'll simply use snippets from the page or the meta description tag. "We do a bunch of different things to find the best description that we can," he says.
"If you have a bad title or a title that we don't think helps users as much, we can try to find a better title, and one we think will be an informative result so that users will know whether that's a good result for them to click on," he says.
Have you noticed Google changing your titles? Did they find better ones?
Why Your Robots.txt Blocked URLs May Show up in Google
October 7, 2009
Matt Cutts has appeared in yet another Google Webmaster Video, and this time he has a whiteboard with him so he can illustrate what he's talking about. The topic this time is uncrawled URLs in search results.
Cutts says Google gets a lot of complaints from webmasters who say the search engine is violating their robots.txt files, with which they intend to keep Google from crawling certain pages. Sometimes those URLs still end up in search results.
According to Matt, in most cases where a webmaster says they blocked example.com/go in robots.txt, the result Google returns turns out to be just a URL with no text for the snippet. The reason for this is that Google didn't actually crawl the page.
"It did abide by robots.txt. You told us this page is blocked, so we did not fetch this page," says Matt. It is a URL reference. "We saw a link to it, but we didn't fetch the page itself," he explains.
That's why there's no text snippet. As for why Google shows such URLs at all, Cutts uses the example of the California DMV, whose site is www.dmv.ca.gov.
Cutts notes that at one point the California Department of Motor Vehicles had a robots.txt that blocked all search engines. "Now these days pretty much every site is savvy enough, you know, at one point the New York Times and eBay and a whole bunch of different sites would use robots.txt," he says.
If someone searches for "California DMV" in Google, there's pretty much only one answer, he says. So that is the answer that Google wants to return. Luckily for Google a lot of people were linking to that page with the anchor text "California DMV". That helps Google be able to return the result without having to crawl the page.
Cutts also says that they can get descriptions from a directory like the Open Directory Project (DMOZ). He cites Nissan and Metallica.com as examples of sites that used to block Google with robots.txt. They had been listed in the Open Directory Project, however, and Google went and got the information from there to include as the snippet.
When this type of thing happens, it looks like the page was crawled, when in fact it wasn't. "So we are able to return something that can be very helpful to users without violating robots.txt by not crawling that page," says Cutts.
He also notes that when you don't want pages to show up, you can use the "noindex" meta tag at the top of the page. When Google sees this tag, it drops the page from its search results completely. Another option is the URL removal tool.
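To make the "noindex" option concrete, here is a minimal sketch that checks whether a page carries a robots meta tag with a noindex directive. The `RobotsMetaParser` class, the `has_noindex` helper, and the sample HTML are hypothetical names and data for illustration; this is not how Google itself parses pages.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        # "robots" addresses all crawlers; "googlebot" addresses Google only.
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for directive in attr.get("content", "").split(","):
                directive = directive.strip().lower()
                if directive:
                    self.directives.add(directive)


def has_noindex(html: str) -> bool:
    """Return True if the page asks crawlers not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return "noindex" in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex"></head></html>'
    print("noindex present:", has_noindex(page))
```

Note that a crawler can only see this tag if it is allowed to fetch the page, so a page that is also blocked in robots.txt never gets the chance to show its noindex directive.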
Google, Bing Reps Talk Best Practices
October 7, 2009
Google and Bing employees know a lot about search engines, and many people find their tips useful. But let's admit it: a lot of SEO experts have heard the same advice repeated at conference after conference for years. So at SMX East, two speakers went a little off the beaten path into the subjects of speed and security.
(Coverage of the SMX East conference will continue at WebProNews Videos. Keep an eye on WebProNews for more notes and videos from the event this week.)
Maile Ohye, a developer programs tech lead at Google, started by talking about things site owners should disallow via robots.txt or noindex/nofollow tags. Login pages and calendars are two examples she gave, and the idea extends to anything that lacks meaningful content you would want indexed.
Ohye then got into some tips about improving page speed. In order to reduce load time, she suggested ordering the CSS and external scripts on your pages efficiently. Placing style sheets above external scripts allows for parallel loading, she noted, which is much faster than the alternative.
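The ordering tip can be checked mechanically. The sketch below lists any style sheets that appear after an external script in a page's markup; the `ResourceOrderParser` class, the `stylesheets_after_scripts` helper, and the sample markup are assumptions for illustration, and the check covers only the ordering rule mentioned above, not page speed in general.

```python
from html.parser import HTMLParser


class ResourceOrderParser(HTMLParser):
    """Records style sheets and external scripts in document order."""

    def __init__(self):
        super().__init__()
        self.resources = []  # list of ("css" | "js", url) tuples

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        if tag == "link" and "stylesheet" in attr.get("rel", "").lower().split():
            self.resources.append(("css", attr.get("href", "")))
        elif tag == "script" and attr.get("src"):
            # Inline scripts have no src attribute and are ignored here.
            self.resources.append(("js", attr["src"]))


def stylesheets_after_scripts(html: str) -> list:
    """Return URLs of style sheets that come after an external script."""
    parser = ResourceOrderParser()
    parser.feed(html)
    parser.close()
    seen_script = False
    late = []
    for kind, url in parser.resources:
        if kind == "js":
            seen_script = True
        elif seen_script:
            late.append(url)
    return late


if __name__ == "__main__":
    page = (
        '<head><script src="app.js"></script>'
        '<link rel="stylesheet" href="site.css"></head>'
    )
    print("Style sheets to move up:", stylesheets_after_scripts(page))
```

An empty result means every style sheet already sits above the external scripts.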
Next, Sasi Parthasarathy, a program manager at Bing, swung things in a different direction by stating, "Security is the biggest problem plaguing the Web right now." Spammers and malware providers are always looking for sites to hack in order to get link popularity, and he said that if you fail to take security seriously, you may lose credibility among both users and search engines.
So stick to good practices like creating strong passwords and keeping your software up to date. Parthasarathy suggested using anti-malware tools to look for security vulnerabilities, as well.
Finally, try to ensure that config files are not too accessible, and carefully monitor forums and blogs.
