Search the web through Google without an API Key

In a previous article, I detailed the anatomy of a Google web search result page. In this article, I will talk about an open source Java API (developed by me) to use Google’s web search without an API key. The open source license used is the Apache 2.0 License.

As you most likely know, Google searches are driven by query strings. For example, to search for the term “test”, I would usually go to http://www.google.com, enter the search term in the text field, and hit submit. The browser then lands on the URL of the search result page, which might look like http://www.google.com/#hl=en&source=hp&q=test&btnG=Google+Search&aq=f&aqi=g10&oq=test&fp=2755c6b3e9b2e9. We do not really need to worry about what all the parameters in this query string mean, but if you are curious (and I bet you are), this site (Tony 2009) is a great place to start learning. Alternatively, you may bypass the Google search form and enter the address http://www.google.com/search?q=test directly into the web browser. Either approach should give nearly identical results (differences depend on whether you are logged into Google, your saved search preferences, locally installed toolbars, etc.).
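To make the direct-URL approach concrete, here is a minimal sketch that builds such an address from a search term. The class name GoogleQueryUrl and the method build are my own illustrative names and are not part of the API described in this article. The q parameter comes from the address above and num is the per-page limit used in the example later in this article. The sketch assumes Java 10 or later for the URLEncoder overload that takes a Charset.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of the article's API): builds a direct search URL.
final class GoogleQueryUrl {
    private static final String BASE = "http://www.google.com/search";

    // Encodes the search term as a form value, so spaces become '+'.
    static String build(String term, int resultsPerPage) {
        String encoded = URLEncoder.encode(term, StandardCharsets.UTF_8);
        return BASE + "?q=" + encoded + "&num=" + resultsPerPage;
    }

    public static void main(String[] args) {
        // prints http://www.google.com/search?q=test&num=10
        System.out.println(build("test", 10));
        // prints http://www.google.com/search?q=java+http+client&num=10
        System.out.println(build("java http client", 10));
    }
}
```

Encoding the term matters as soon as it contains spaces or characters such as & and =, which would otherwise be read as part of the query string itself.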

The API

The API to use the Google web search engine without an API key is very simple. It is composed of 8 interfaces and 8 corresponding implementations. The interfaces are listed alphabetically below.

  • com.vang.jee.google.intf.AnchorLink
  • com.vang.jee.google.intf.GoogleCookieGenerator
  • com.vang.jee.google.intf.GoogleSearch
  • com.vang.jee.google.intf.GoogleSearchResult
  • com.vang.jee.google.intf.GoogleSearchResultSet
  • com.vang.jee.google.intf.GoogleSearchResultSetFactory
  • com.vang.jee.google.intf.PagingLink
  • com.vang.jee.google.intf.RelatedTermLink

The interfaces AnchorLink, PagingLink, and RelatedTermLink specify the types of links that come back from a Google web search. An AnchorLink specifies a generic link (the URL and the title) and is extended by the PagingLink, RelatedTermLink, and GoogleSearchResult interfaces. A PagingLink represents a link to one of the result pages (usually found at the bottom of a Google web search result page). A RelatedTermLink represents a link to a new search query (search term, related term, similar query, etc.) that may be like the current one (also usually found at the bottom of a Google web search result page).

The interfaces GoogleSearch, GoogleSearchResult, and GoogleSearchResultSet define the actual search and result contracts. GoogleSearch specifies how a search should be performed. A GoogleSearchResult is the actual link that a user may click on. It also has additional getters and setters for fields such as the link to the cached copy of the webpage, the link to similar pages, and the snippet that Google provides with each link. A GoogleSearchResultSet holds a set of GoogleSearchResults, along with the paging links and the related term links. You may think of a GoogleSearchResultSet as a web search result page. The GoogleSearchResultSetFactory defines how a GoogleSearchResultSet is constructed.
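To visualize how these interfaces relate to one another, here is a rough sketch of the hierarchy. It is not the actual source. Only getLink, hasNext, getNextPagingLink, and getResultSet appear in the usage example in this article; every other method name below is an assumption made for illustration, the setters are left out, and GoogleSearch and GoogleCookieGenerator are omitted because their methods are not described here. LinkHierarchyDemo and the link values in it are made up.

```java
import java.util.List;

// Generic link: a URL and a title. getTitle() is an assumed name.
interface AnchorLink {
    String getLink();
    String getTitle();
}

// Link to another page of results, and link to a related query.
interface PagingLink extends AnchorLink { }
interface RelatedTermLink extends AnchorLink { }

// A single clickable result with its extra fields (getter names assumed).
interface GoogleSearchResult extends AnchorLink {
    String getSnippet();
    String getCachedLink();
    String getSimilarPagesLink();
}

// One web search result page (the three list getters are assumed names).
interface GoogleSearchResultSet {
    List<GoogleSearchResult> getResults();
    List<PagingLink> getPagingLinks();
    List<RelatedTermLink> getRelatedTermLinks();
    boolean hasNext();
    PagingLink getNextPagingLink();
}

// Builds a result set from a query string (the throws clause is an assumption).
interface GoogleSearchResultSetFactory {
    GoogleSearchResultSet getResultSet(String query) throws Exception;
}

// Hypothetical demo: any kind of link can be handled as an AnchorLink.
final class LinkHierarchyDemo {
    static String describe(AnchorLink link) {
        return link.getTitle() + " -> " + link.getLink();
    }

    public static void main(String[] args) {
        PagingLink secondPage = new PagingLink() {
            public String getLink() { return "q=test&num=10&start=10"; } // made-up value
            public String getTitle() { return "2"; }
        };
        System.out.println(describe(secondPage));
    }
}
```

Because everything that can be clicked extends AnchorLink, code that only needs a URL and a title can treat search results, paging links, and related term links the same way.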

The interface GoogleCookieGenerator is a necessary evil. It defines how to generate Google cookies. This interface is needed because if we simply hammered away at the Google web search engine, it would think we are crawlers, spiders, robots, or other evil, automated programs trying to abuse it (how come Google can crawl us, but it won’t allow us to crawl it? reciprocity please). By generating valid cookies, Google seems to be forgiving and won’t confront and challenge us with a crazy CAPTCHA to identify us as human. This interface is the whole reason why we can search the web through Google without an API key.

As stated, there is a corresponding implementation for each of these interfaces. The code uses JTidy to parse HTML and Apache’s HttpClient to handle HTTP requests and responses.

A Simple Example showing Use

Using the API described in this article is very simple. An example (with comments) is shown below.

public static void main(String[] args) throws Exception {
    //just specify the query string part;
    //limit each web search result page to 10 results/links per page
    String query = "q=test&num=10";

    //create a GoogleSearchResultSetFactory
    GoogleSearchResultSetFactory rsFactory = new JTidyResultSetFactory();
    //get a GoogleSearchResultSet
    GoogleSearchResultSet resultSet = rsFactory.getResultSet(query);
    //print out the GoogleSearchResultSet to the console
    System.out.println(resultSet);

    //paging is simple: you simply call the hasNext method on the
    //GoogleSearchResultSet object, and can continue to move
    //to the next page of the web search result
    while (resultSet.hasNext()) {
        PagingLink pagingLink = resultSet.getNextPagingLink();
        query = pagingLink.getLink();
        resultSet = rsFactory.getResultSet(query);
        System.out.println(resultSet);
    }
}
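The paging loop above keeps requesting pages for as long as hasNext returns true. If you only want the first few pages, a bounded variant may be friendlier to Google. The sketch below is one way to write it. To keep it runnable on its own, it declares minimal stand-ins for the three interfaces, limited to the methods used in the example above, and swaps the real factory for a fake in-memory one. BoundedPaging, fetchPages, fakeFactory, and the "page=N" query format are all hypothetical names of mine.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-ins, limited to the methods used in the article's example.
interface PagingLink {
    String getLink();
}

interface GoogleSearchResultSet {
    boolean hasNext();
    PagingLink getNextPagingLink();
}

interface GoogleSearchResultSetFactory {
    GoogleSearchResultSet getResultSet(String query) throws Exception;
}

final class BoundedPaging {
    // Fetches the first page, then follows "next" links until maxPages pages are collected.
    static List<GoogleSearchResultSet> fetchPages(
            GoogleSearchResultSetFactory factory, String query, int maxPages) throws Exception {
        List<GoogleSearchResultSet> pages = new ArrayList<>();
        if (maxPages <= 0) {
            return pages;
        }
        GoogleSearchResultSet current = factory.getResultSet(query);
        pages.add(current);
        while (pages.size() < maxPages && current.hasNext()) {
            current = factory.getResultSet(current.getNextPagingLink().getLink());
            pages.add(current);
        }
        return pages;
    }

    // Hypothetical in-memory factory: pretends a search has totalPages pages
    // and understands made-up queries of the form "page=N".
    static GoogleSearchResultSetFactory fakeFactory(int totalPages) {
        return query -> {
            int page = Integer.parseInt(query.substring(query.indexOf('=') + 1));
            return new GoogleSearchResultSet() {
                public boolean hasNext() { return page < totalPages; }
                public PagingLink getNextPagingLink() { return () -> "page=" + (page + 1); }
                @Override public String toString() { return "page " + page; }
            };
        };
    }

    public static void main(String[] args) throws Exception {
        // prints [page 1, page 2, page 3]
        System.out.println(fetchPages(fakeFactory(5), "page=1", 3));
    }
}
```

With the real API, you would pass the JTidyResultSetFactory from the example above instead of the fake factory, and a query string such as "q=test&num=10".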

How to Get the API

You may download the binary and source distribution from the following URLs.

Future Direction

There are a lot of improvements that can be made to this API. One could be to extend the API to cover the other types of Google searches besides web searching (e.g. video, images, news, etc.). Another could be to extend the API to include other search engines (e.g. Bing, Yahoo!, Ask, etc.). How sweet would it be to compare the results between these different search engines, or to merge them?

Summary

In summary, we can use Google’s web search capabilities without an API key from Google using this open source Java API (no name at the moment). The use is simple and programmer friendly. There are many improvements that can be made including extensions to different search types and search engines.

Happy programming!

References

Tony. Disentangling the Google Search Query String. Wordcat. October 15, 2009. http://www.wordcat.co.uk/articles/disentangling-the-google-search-query-string/80/
