Today I sat down to see what I could figure out about web scraping. I’ll warn you now: I tried several routes to wind my way through this maze, but I haven’t yet found a good way out. 😦 But for now, I wanted to document what I found (mostly didn’t find).
I started by searching for “scraping Google search results using Python” and quickly found myself in the middle of a conundrum. According to many of the results, scraping Google with a non-Google tool or API is actually against Google’s Terms of Service (TOS), so people recommended using Google’s own API to collect search results. So I checked out the Google Custom Search page.
First, it asks you to set up your custom search engine. It turns out the main use for this API, it appears, is to embed a custom search box/engine in your own webpage, where the engine searches within a specified set of pages (within your own website). Not exactly what I was looking for: I want to search Google, not my own pages!
Second, there is a section in the documentation that mentions getting an API key and downloading data as JSON (sound familiar?). At first I thought that service was only available if you opt for the paid tier, but in reviewing this process to write this blog, I discovered the API page where you can execute requests. So I typed in my request and, with bated breath, hit Execute Request.
Wham bam! My request generated an error: “Need to provide a Custom Engine ID.” So it doesn’t look like I can use this API to collect the search results I find from Google.
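For the record, here’s a rough sketch of what a request to the Custom Search JSON API looks like. The `cx` parameter is the Custom Engine ID the error was complaining about; the key and engine ID values below are hypothetical placeholders, not real credentials:

```python
from urllib.parse import urlencode

# Base endpoint of Google's Custom Search JSON API
BASE_URL = "https://www.googleapis.com/customsearch/v1"

def build_search_url(api_key, engine_id, query):
    """Build a Custom Search request URL.

    api_key   -- an API key from the Google Cloud console
    engine_id -- the Custom Engine ID ("cx") the error was asking for
    query     -- the search terms
    """
    params = {"key": api_key, "cx": engine_id, "q": query}
    return f"{BASE_URL}?{urlencode(params)}"

# Hypothetical values -- you'd substitute your own credentials.
url = build_search_url("MY_API_KEY", "MY_ENGINE_ID", "web scraping")
print(url)
```

Fetching that URL (with real credentials) returns the results as JSON, which is presumably what the “download data as JSON” section of the docs was on about.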
Interestingly enough, when I re-executed the request to take a picture of it (I wanted a different time stamp), it said I made too many requests. I’m pretty sure I made maybe three. 😐 Ah well, it’s okay Google. You keep your secrets. But be warned, someday I may return with better tools to unlock what you’re hiding.
As I searched for more information on the Google API, I came across this article, which explained that the main reason Google and other search engines prohibit scraping programs is that a bot can eat up a lot of resources if it performs many (hundreds or thousands of) searches in a short amount of time. So they suggest building in breaks between searches. Alright, I can do that… that feels morally acceptable… So I turned to Python to show me the way.
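The “build in breaks” idea is easy enough to sketch in Python. This is just one way to do it (my own assumption of what “breaks” should look like, with a bit of random jitter so the requests don’t land on a perfectly regular beat):

```python
import random
import time

def polite_sleep(base_seconds=2.0, jitter_seconds=1.0):
    """Pause between searches so we don't hammer the server.

    Sleeps for base_seconds plus a random jitter, and returns the
    number of seconds actually slept.
    """
    delay = base_seconds + random.uniform(0, jitter_seconds)
    time.sleep(delay)
    return delay

# Hypothetical queries; the actual search call would go inside the loop.
queries = ["web scraping", "google custom search api", "python requests"]
for q in queries:
    # ...run the search for q here...
    slept = polite_sleep(base_seconds=0.1, jitter_seconds=0.05)
    print(f"searched {q!r}, then slept {slept:.2f}s")
```

Tuning `base_seconds` up (seconds, not milliseconds) keeps a long batch of searches spread out over time, which is the whole point.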