At about this point, I started to think, if all these people are creating their own web scrapers, why can’t I? How hard can it be to pull some links off a page anyway….
So I went back to Google and inspected the elements on the page to see if I could identify the URLs of the search results. Using the Inspect Element tool in Chrome, I found the tags tied to the URLs of the search results.
They are deeply embedded in a div within a div within a div within… you get the point.
Alright, it’s getting late so I’m going to try and cut to the chase.
Since I knew that scraping Google search results was different from scraping html content (with rvest), I started by Googling “scrape google results R”, and this result about httr came up. I installed the httr package, then ran the example script. Cue drumroll!
…aaaand error. xpathSApply does not exist! Some searching revealed that it’s a function in the XML package, and since I don’t work much with XML data, this was a good chance to get my feet wet with it.
So I installed the XML package and tried again. Any luck this time?
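For reference, a minimal sketch of the kind of httr-plus-XML script I was running (the query and the XPath here are placeholders, not the exact example I copied):

```r
library(httr)
library(XML)

# Fetch a Google results page (the query is a placeholder)
resp <- GET("https://www.google.com/search?q=teamwork")

# Parse the returned HTML, then pull every <a> href via XPath
doc   <- htmlParse(content(resp, as = "text"), asText = TRUE)
links <- xpathSApply(doc, "//a/@href")
head(links)
```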
Mmmm… sort of? At least it pulled a URL, just not really the right one. I tried running their exact script, but it still didn’t yield usable URLs. (I just realized that my script didn’t put “+” between the words of my search, but when I added it just now, I got the same results as below.)
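On the “+” question: spaces in a query string do need encoding. Base R’s URLencode does it with %20, or you can join the words with “+” by hand (the search terms below are placeholders):

```r
query <- "teamwork in america"  # placeholder search terms

# Percent-encode the spaces...
URLencode(query)       # "teamwork%20in%20america"

# ...or join the words with "+" the way Google's URLs do
gsub(" ", "+", query)  # "teamwork+in+america"
```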
Okay, so then I turned to rvest to see where it could get me. I tried a number of things like referencing the HTML nodes, then CSS ones, and even XML ones. Here are the links I used to guide my quest out of the web scraping maze: the rvest documentation, a web scraping with R tutorial (CSS), a Stack Overflow thread on diving into nodes, and even a really handy-looking site (from Stanford, might I add) for once the URLs are gathered (pin that for later). When I dove in, this is what I found.
First, pulling the html document of the Google Search Results revealed this:
I could tell there were some differences in the output and in what I saw through the Inspect Elements, but at first glance, this output looked fairly reasonable, so I moved forward.
As a first test, I looked to see what I would get if I pulled the html text out of the <a> tags.
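That test looked something like the following. Note that html_text() gives the visible link text, while the URLs themselves live in the href attribute (the query here is a placeholder, and the selectors are generic, not Google-specific):

```r
library(rvest)

page <- read_html("https://www.google.com/search?q=teamwork")

# The visible text inside each <a> tag...
page %>% html_nodes("a") %>% html_text() %>% head()

# ...and the href attribute, where the URLs actually live
page %>% html_nodes("a") %>% html_attr("href") %>% head()
```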
Hm, interesting. So it seems the only links I can access are the standard ones that appear on every Google search page.
Then I tried a bunch of different calls to see what kinds of selectors html_nodes takes (can it take a bare class name? …seems the answer is no).
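(A note for anyone trying this: html_nodes expects CSS selectors, so a bare class name won’t match anything, but the same name prefixed with a dot should. The class names below are made up for illustration.)

```r
library(rvest)

page <- read_html("https://example.com")

page %>% html_nodes("g")      # looks for a <g> TAG, not a class -- empty
page %>% html_nodes(".g")     # class selector: any element with class "g"
page %>% html_nodes("div.g")  # narrower: only <div> elements with class "g"
```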
Nada. Alright let’s try a different approach. I tested one of the examples described in the rvest documentation, pulling data from the A-Team site on boxofficemojo.
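That documentation example is roughly the following (reproduced from memory, so treat the selectors as approximate):

```r
library(rvest)

# The A-Team page on boxofficemojo, as in the rvest documentation
ateam <- read_html("http://www.boxofficemojo.com/movies/?id=ateam.htm")

# Drill down through nested nodes
ateam %>% html_nodes("center") %>% html_nodes("td")
```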
Woo hoo! I love it when a plan comes together… at least in this case. It looks like the script works, and the source of my woes is the Google Search Results page in particular.
I tried calling the divs from the Google Search Results page, but the results were odd. Some of the first-level divs were present in the output, but others that showed up in the output I couldn’t find through Inspect Elements.
And then when I looked for the id of the first-level div that (eventually) contains the div with the URLs, it wasn’t in the output at all. (Below is the first-level div containing the URLs.)
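Two equivalent ways to go after a div by its id are the CSS “#id” form and an explicit XPath call (the id below is a placeholder, not the actual one from Inspect Elements):

```r
library(rvest)

page <- read_html("https://www.google.com/search?q=teamwork")

page %>% html_nodes("#search")                      # CSS id selector
page %>% html_nodes(xpath = "//div[@id='search']")  # same thing in XPath
```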
I tried using XML-specific calls, but encountered similar results.
And even when I drilled down into those divs, the output went down… the first one? I’m not sure which branch it actually followed.
So I’m really not sure why I can’t drill down to the data I want. It feels like something is blocking my way, like Google is doing something special to hide the key info on this page. But I have no idea.
I think what I’ll try next (another day) is to download the page source and scrape the plain text file. I should be able to at least do that, even though it means going in and downloading the page source for about 10 pages of search results for my project. But maybe I could also write a Python script to pull the page source for me? Most of the time, the purpose of these scraping programs is to track daily or by-the-minute (or by-the-second!) changes in pages on the web. My goal, though, is to take a snapshot of what discussions of teamwork look like in America and Korea, so capturing the data is a one-time thing, and that kind of solution could suit my purposes. But for now that’s future Alyssa’s problem!
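For the record, the download-the-source plan only needs base R plus rvest; something like this, with the URL and filename as placeholders:

```r
library(rvest)

# Save the page source to disk once...
download.file("https://www.google.com/search?q=teamwork",
              destfile = "results_page1.html", quiet = TRUE)

# ...then scrape the local copy as many times as needed
page <- read_html("results_page1.html")
page %>% html_nodes("a") %>% html_attr("href") %>% head()
```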