Category Archives: Troubleshooting

Part 3 Caught in a Web Scraping Maze: httr and rvest in R

At about this point, I started to think, if all these people are creating their own web scrapers, why can’t I? How hard can it be to pull some links off a page anyway….

So I went back to Google and inspected the elements on the page to see if I could identify the URLs of the search results. Using Chrome’s Inspect Element tool, I found the tags tied to those URLs.

Inspect element

They are deeply embedded: div within div within div within… you get the point.

Alright, it’s getting late so I’m going to try and cut to the chase.

Since I knew that scraping Google search results was different from scraping ordinary HTML content (with rvest), I started by Googling “scrape google results R”, and this result about httr came up. I installed the httr package, then ran the example script. Cue drumroll!

httr script

…aaaand error. xpathSApply does not exist! Some searching revealed that it’s a function in the XML package, and since I don’t work much with XML data, this was a good chance to get my feet wet with it.

So I installed the XML package and tried again. Any luck this time?

httr script (with XML)
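
For the record, the combined script looked roughly like this. This is a reconstruction from memory, not the exact example I found; the search term and the XPath are my own:

library(httr)
library(XML)   # provides htmlParse() and xpathSApply()

# Fetch the search results page and parse it
resp <- GET("https://www.google.com/search", query = list(q = "teamwork"))
doc  <- htmlParse(content(resp, as = "text"), asText = TRUE)

# Pull the href attribute off every link on the page
xpathSApply(doc, "//a/@href")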

Mmmm… sort of? At least it pulled a URL, but not really the right one. I tried running their exact script, but it still didn’t yield usable URLs. (I just realized that in my script, I didn’t put “+” between the words of my search. But when I did it just now, I got the same results as below).

httr reproduction

Okay, so then I turned to rvest to see where it could get me. I tried a number of things, like referencing the HTML nodes, then CSS ones, and even XML ones. Here are the links I used to guide my quest out of the web scraping maze: the rvest documentation, a web scraping with R tutorial (CSS), a Stack Overflow thread on diving into nodes, and even a really handy-looking site (from Stanford, might I add) for once the URLs are gathered (pin that for later). When I dove in, this is what I found.

First, pulling the html document of the Google Search Results revealed this:

html(google search results)

I could tell there were some differences between this output and what I saw through Inspect Element, but at first glance the output looked fairly reasonable, so I moved forward.

As a first test, I looked to see what I would get if I pulled the text out of the <a> tags.

html_nodes("a") %>% html_text()

html_nodes(“a”) %>% html_text()
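
In rvest terms, what I was doing looked roughly like this (a sketch; the query is my stand-in):

library(rvest)

# Roughly what I was running; "teamwork" stands in for my search, and rvest's
# html() has since been renamed read_html() in newer versions of the package
teamwork <- html("https://www.google.com/search?q=teamwork")

# The text of every <a> tag on the page
teamwork %>% html_nodes("a") %>% html_text()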

Hm, interesting. So it seems the links I can access this way are just the standard ones that appear on every Google search page, not the search results themselves.

Then I tried a bunch of different calls to see what kinds of selectors html_nodes() takes (can it take a class name? …at the time it seemed the answer was no).

character(0)
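
(A later discovery, for the record: html_nodes() does take class names; they just have to be written as CSS selectors with a leading dot. Something like this, where the "r" class is only my assumption about Google's markup at the time:)

library(rvest)

teamwork <- html("https://www.google.com/search?q=teamwork")   # same page as above

# html_nodes() uses CSS selectors, so a class name needs a leading dot;
# a bare name like html_nodes("r") is read as a tag name and comes back empty
teamwork %>% html_nodes(".r")        # any element with class "r"
teamwork %>% html_nodes("h3.r a")    # my guess at how result titles were marked up then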

So, at the time: nada. Alright, let’s try a different approach. I tested one of the examples described in the rvest documentation, pulling data from the A-Team page on Box Office Mojo.

A-Team html_nodes("center")

A-Team html_nodes("center") %>% html_nodes("td")
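
For reference, this is essentially the documentation example I was reproducing:

library(rvest)

# The example from the rvest docs (html() is read_html() in current rvest)
ateam <- html("http://www.boxofficemojo.com/movies/?id=ateam.htm")
ateam %>% html_nodes("center") %>% html_nodes("td")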

Woo hoo! I love it when a plan comes together… at least in this case, and at least it looks like the script works and the source of my woes is the Google search results page in particular.

I tried calling the divs from the Google search results page, but the results were odd. Some of the first-level divs were present in the output, but the output also contained divs I couldn’t find through Inspect Element.

teamwork %>% html_nodes("div")

And when I looked for the id of the first-level div that (eventually) contains the div with the URLs, it wasn’t in the output. (Below is that first-level div.)

Inspect element: First level divs
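
A quick way to make that comparison (a sketch, again with my stand-in query) is to list the id of every div rvest can see:

library(rvest)

teamwork <- html("https://www.google.com/search?q=teamwork")   # the search page again

# List the id attribute of every div rvest can see, to compare against Inspect Element
teamwork %>% html_nodes("div") %>% html_attr("id")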

I tried using XML-specific calls, but encountered similar results.

xpathSApply(teamwork, "//div")

Even when I drilled down through those divs, it seemed to follow… the first one? I’m honestly not sure which branch it went down.

xpathSApply(teamwork, "//div//div//div")
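
The same drill-down, written out as a self-contained sketch with the page parsed through the XML package (the query is again my stand-in):

library(httr)
library(XML)

# Re-parse the page with the XML package so the xpath* functions can be used on it
# (back then rvest was built on XML, so the same teamwork object worked directly)
resp <- GET("https://www.google.com/search", query = list(q = "teamwork"))
doc  <- htmlParse(content(resp, as = "text"), asText = TRUE)

# Drill down through nested divs and pull each matching div's id (NULL if it has none)
xpathSApply(doc, "//div//div//div", xmlGetAttr, "id")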

So I’m really not sure why I can’t drill down to the data I want. It feels like something is blocking my way: maybe Google serves different HTML to a script than to my browser, or it is doing something else special to hide the key info on this page. But I have no idea.

I think what I’ll try next (another day) is to download the page source and scrape the saved file. I should at least be able to do that. It does mean I’ll have to go in and download the page source of about 10 pages of search results for my project, but maybe I could also write a Python script to pull the page source for me? Most of the time, the purpose of these scraping programs is to track daily or by-the-minute (or by-the-second!) changes in pages on the web. My goal, though, is to take a snapshot of what discussions of teamwork look like in America and Korea, so capturing the data is a one-time thing and that kind of solution could suit my purposes. But for now that’s future Alyssa’s problem!
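
The saved-page idea would look something like this (just a sketch, with a made-up file name):

library(rvest)

# Hypothetical follow-up: parse a manually saved copy of the results page instead of
# fetching it live (the file name here is made up)
saved <- html("google_results_page1.html")
saved %>% html_nodes("a") %>% html_attr("href")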

Part 2 Caught in a Web Scraping Maze: xgoogle Python module

While investigating web scraping methods in Python, I came across this Stack Overflow discussion that used the Python module xgoogle to scrape Google search results while also building in a wait time between searches. It looked like a promising method, so I tried it out.

Unfortunately, as is the case with many young programmers, at every step I ran into trouble.

trouble

First, I tried to import the module into IPython. That was silly; I hadn’t installed it yet! So then I tried to install it using commands in the Terminal (similar to what I would do in R). But unlike R (it seems), Python had no way of knowing where to pull the installation files from or what I was asking it to do, so it kept telling me, “Sorry, I have no idea what you’re trying to do.” Eventually I did manage to set it up (described later in this post), but first I have to share something very strange that happened.

In the process of trying to install xgoogle, I noticed something in my Terminal that horrified me. Instead of the prompt showing my machine as “Alyssa-Fus-Macbook-Air”, it was showing “jims-iphone”. What the!? Who in the world is this Jim, and what is his iPhone doing on my computer!?

Jim's iPhone infiltrating my laptop

I’ll admit, I panicked. I’m very new to working in the Terminal and I was convinced that somehow I had been hacked. But like any calm and composed programmer, I swallowed my fear and turned to the one thing that could save my computer from inevitable doom: Google.

After starting with several vague and terrified search terms (“WHO IS INVADING MY COMPUTER AND HOW DO I STOP THEM?! …just kidding I didn’t search anything that histrionic), I finally whittled down my search to something specific: “terminal name different from user name”.

The links I found helped me investigate the problem and, in the end, solve it. First I looked into how far this “name change” went. Was I somehow accessing Jim’s iPhone, or was this a superficial name change and I was still accessing my own computer and files? So I changed directories to my Desktop and checked what was in it. (I suppose I could have just looked in the directory itself, but I wanted to see what would happen if I went “deeper” into “Jim’s iPhone”.) This helped me confirm that even though my laptop had been taken over by the evil iPhone of Jim, at least everything seemed to be where it was supposed to be.

Jim can take my laptop's name, but he can't take my laptop's contents!

So then I checked to see if my computer name was still the same, or if Jim had taken that too.

Sorry Jimmy Jim, you can take my name, but not my name name!

Okay, so I’m starting to calm down. Then I found this article discussing this problem, and I focused on the second response about computer sharing. I looked at my Sharing Preferences and was shocked to see Jim had infiltrated at this level.

Jim just wants to share

Why, Jim, why?! What did I ever do to you…

So at this point I’m wondering: when I installed all those pandas and numpy modules and Anaconda whatsits, did I accidentally download something that changed my HostName to Jim? Or maybe, since I’m always connecting to foreign networks (in cafes, in the library, in hotels), is that where I picked up a little friend named Jim?

This result suggests the latter explanation is most likely. “In short, the Mac will pick up a host name from the DHCP server. This does not affect your computer’s name as you have assigned it. This will only affect what you see at the command prompt.” Ah haaaa that’s plausible and no cause for alarm (I’m sure). (What’s the purpose of this though?)

And then this result gave a suggestion for how to change the HostName. I have read several cautionary tales about “sudo”, which runs a command with administrator privileges (I think), but this seemed harmless enough. So I ran the command, and once I restarted my Terminal, everything was right as rain. Whew!

sudo scutil --set HostName Alyssa-Fus-Macbook-Air

Right as rain

Alright! Now that that snafu has been fixed, let’s return to the real task at hand: installing xgoogle.

Eventually, after several failed attempts, I found this Stack Overflow answer and used it to install the module successfully. I just had to download all the scripts from the xgoogle GitHub page, put the folder in my directory, change into that folder, and run the script. And it worked beautifully!

Installing xgoogle

Alright alright alright! Let’s put this puppy into action.

I ran the script straight (without the wait time) to see what I would get.

GoogleSearch results

Unfortunately, what I got was… 0 results. 0 results! How is that possible?

This response on GitHub suggested it was a bug that needed a patch, available here as well as in the comments. …but I’ll admit, I had no idea how to apply the patch. I decided to just run the patch code directly, figuring it would replace the old function, and then I could rerun my code. But the patch code kept getting stuck.

When I copy-pasted from the comments (made by the creator of xgoogle himself), I got an Indented Block error:

Indented block error

So then I tried the RAW Paste Data from the pastebin site linked above:

Surprise! Unexpected indent

Another indent error. I tried pasting the code into TextWrangler to double-check the indents, then reran it. This time, there was no error for the patch code (omg, maybe it’s working?!). I held my breath and ran the GoogleSearch script again… still 0 results. Bah. Dejection…

P.S. I just checked my AFP address, and now it looks like a perfectly normal random set of numbers. Whew! Good-bye, Jim’s iPhone! It was fun, but I’m happy you are no longer on my computer.

Part 1 Caught in a Web Scraping Maze: Google Search API

Today I sat down to see what I could figure out about web scraping. I’ll warn you now: I tried several routes to wind my way through this maze, but I haven’t yet found a good way out. 😦 For now, I wanted to document what I found (and mostly didn’t find).

I started by searching for “scraping Google search results using Python” and quickly found myself in the middle of a conundrum. According to many of the results, scraping Google with a non-Google tool is actually against the Terms of Service (TOS), so people recommended using Google’s own API to collect searches. So I checked out the Google Custom Search page.

Google Search API

First, it asks you to set up your own custom search engine. It turns out the main use for this API (it appears) is to put a custom search box on your own webpage, where the engine searches within a specified set of pages (i.e., within your own website). Not exactly what I was looking for: I want to search Google, not my own pages!

Second, there is a section in the documentation that mentions getting an API key and downloading data as JSON (sound familiar?). I thought at first that this was only available with the paid service, but in reviewing the process to write this blog, I discovered the API page where you can execute requests. So I put in my request and, with bated breath, I hit Execute Request.

Google Search API

Wham bam! My request generated an error: “Need to provide a Custom Engine ID.” So it doesn’t look like I can use this API to collect the search results I find from Google.

Error: Need to provide Custom Search ID
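
For what it’s worth, the request itself looks simple enough once you have both pieces; roughly this, in R with httr (the key and engine ID are placeholders, and cx is the Custom Engine ID the error was asking for):

library(httr)

# Placeholders: key is an API key from the Google Developers Console, and cx is the
# custom search engine ID the error above was complaining about
resp <- GET("https://www.googleapis.com/customsearch/v1",
            query = list(key = "MY_API_KEY", cx = "MY_ENGINE_ID", q = "teamwork"))
content(resp, as = "parsed")   # the results come back as JSON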

Interestingly enough, when I re-executed the request so I could take a screenshot of it (I wanted a different time stamp), it said I had made too many requests. I’m pretty sure I made maybe three. 😐 Ah well, it’s okay, Google. You keep your secrets. But be warned, someday I may return with better tools to unlock what you’re hiding.

Error: Daily Limit Exceeded

As I searched for more information on the Google API, I came across this article, which explained that the main reason Google and other search engines prohibit scraping programs is that a bot can take up a lot of resources if it performs many (hundreds or thousands of) searches in a short amount of time. So they suggest building in breaks between searches. Alright, I can do that… that feels morally acceptable… So I turned to Python to show me the way.
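
The pacing idea itself is language-agnostic; in R it would look roughly like this (the queries and the 30-second pause are just examples):

library(httr)

queries <- c("teamwork", "teamwork definition", "teamwork examples")   # stand-in searches
pages <- vector("list", length(queries))

for (i in seq_along(queries)) {
  pages[[i]] <- GET("https://www.google.com/search", query = list(q = queries[i]))
  Sys.sleep(30)   # pause between requests so the script isn't hammering the server
}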