Part 1 Caught in a Web Scraping Maze: Google Search API

Today I sat down to see what I could figure out about web scraping. I’ll warn you now: I tried several routes through this maze, but I haven’t yet found a good way out. 😦 For now, I want to document what I found (and, mostly, what I didn’t).

I started by searching for “scraping Google search results using Python” and quickly found myself in the middle of a conundrum. According to many of the results, scraping Google with anything other than Google’s own API is against the TOS (Terms of Service), so people recommended using Google’s API to collect search results. So I checked out the Google Custom Search page.

Google Search API

First, it asks you to set up your custom search engine. It appears that the main use for this API is to put a custom search box in your own webpage, with the engine searching only a specified set of pages (such as your own website). Not exactly what I was looking for: I want to search Google, not my own pages!

Second, there is a section in the documentation that mentions getting an API key and downloading data as JSON (sound familiar?). I thought at first that this was only available if you opt for the paid service, but in reviewing the process to write this post, I discovered the API page where you can execute requests. So I input my request and, with bated breath, I hit Execute Request.

Google Search API

Wham bam! My request generated an error: “Need to provide a Custom Engine ID.” So it doesn’t look like I can use this API to collect the search results I find from Google.

Error: Need to provide Custom Search ID
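
For reference, here is a rough sketch of what a request with both pieces might look like in Python. As far as I can tell from the documentation, the JSON API takes the API key as `key`, the custom engine ID as `cx`, and the search terms as `q`. The function names and the placeholder values are my own, and I haven’t gotten a real engine ID to try it with, so treat this as an illustration rather than a working recipe.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint of the Custom Search JSON API.
API_URL = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, api_key, engine_id):
    """Return the request URL for one search.

    api_key and engine_id are placeholders: both have to come from your
    own Google account and custom search engine setup.
    """
    params = {"key": api_key, "cx": engine_id, "q": query}
    return API_URL + "?" + urlencode(params)


def run_search(query, api_key, engine_id):
    """Send the request and return the parsed JSON response."""
    with urlopen(build_search_url(query, api_key, engine_id)) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical values; replace them with real ones before calling run_search.
    print(build_search_url("web scraping", "YOUR_API_KEY", "YOUR_ENGINE_ID"))
```

Leaving out the `cx` value is presumably what triggered the error above, and even with it the engine would only search the pages it was set up for.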

Interestingly enough, when I re-executed the request to take a picture of it (I wanted a different time stamp), it said I made too many requests. I’m pretty sure I made maybe three. 😐 Ah well, it’s okay Google. You keep your secrets. But be warned, someday I may return with better tools to unlock what you’re hiding.

Error: Daily Limit Exceeded

As I searched for more information on the Google API, I came across an article explaining that the main reason Google and other search engines prohibit scraping programs is that a bot can take up a lot of resources if it performs hundreds or thousands of searches in a short amount of time. So the article suggests building in breaks between searches. Alright, I can do that… that feels morally acceptable… So I turned to Python to show me the way.
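
The “breaks between searches” idea can be sketched roughly like this. The function name, the pause lengths, and the `search` callable are my own assumptions, not something taken from the article; `search` stands in for whatever function actually performs one query.

```python
import random
import time


def search_politely(queries, search, min_pause=5.0, max_pause=10.0, sleep=time.sleep):
    """Run search(query) for each query, pausing between calls.

    The pause lengths are arbitrary guesses; pick values that suit the
    service you are querying. The sleep argument exists so the pause can
    be swapped out when testing.
    """
    results = []
    for i, query in enumerate(queries):
        if i > 0:
            # Wait a random interval so requests are not sent back to back.
            sleep(random.uniform(min_pause, max_pause))
        results.append(search(query))
    return results
```

Calling `search_politely(["python scraping", "google api"], my_search)` would run the two searches with one pause of 5 to 10 seconds between them, where `my_search` is a hypothetical function of your own.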
