I am working on a project examining how conceptualizations of teamwork differ in the US compared to Korea (I hope no one is really paying attention to this. The danger of scooping is always out there in our suspicious academic world. **shifty eyes**). The method I am using to test this question is to collect Google Search Results for key terms about teamwork and compare how people are talking about teamwork in the US versus Korea. I won’t go over our specific hypotheses here (you’ll just have to wait for the publication!), but I’ll talk about the process.
Normally we would do this project by hand. I would have a team of undergraduate research assistants type in the search term and then meticulously document each URL and article title. I’m not even really sure how we would go about capturing the main text. I suppose, as with the URL and title, it would be a straight, never-ending task of copy-paste. But with the ability to scrape web data, why not take advantage of those tools?
Continuing from my previous post, now that rvest is installed and ready to go, I wanted to take it for a spin.
First I searched in Google for the phrase “why teamwork is important”. Here is a screenshot of those results:
Now to start pulling the data from these articles. I opened the first link, that looked like this:
In R, I created a variable, teamwork, and assigned it the URL to this article. Then I opened teamwork to see what was inside…
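In case future me (or anyone following along) wants the actual commands, here’s a minimal sketch of that step. The URL below is just a placeholder, not the real article:

```r
# Minimal sketch of this step; the URL is a placeholder, not the actual article
library(rvest)

# Parse the article's HTML into an R object
teamwork <- read_html("https://example.com/why-teamwork-is-important")

# Printing the object shows the parsed HTML document
teamwork
```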
Neato-bandito! It’s the full HTML file.
Then I followed the example from the Lego Movie scraping and created a variable called paragraph that scraped any data within “p” tags – basically the main body of text. When I opened paragraph, I found…
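The “p”-tag scrape looks roughly like this, following the rvest README pattern (this is a sketch that assumes teamwork already holds the parsed page from read_html()):

```r
# Sketch of the "p"-tag scrape; assumes `teamwork` already holds
# the parsed page from read_html()
library(rvest)

paragraph <- teamwork %>%
  html_nodes("p") %>%  # select every <p> element on the page
  html_text()          # strip the tags, keeping just the text

# A character vector, one entry per paragraph
paragraph
```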
Awesome sauce! It pulled the text in the main body of the article. From here it looks like it would be fairly easy to pull the article title (not included here, which makes sense – it’s probably inside a heading tag). It does look like scraping the “p” tags pulls in other text that isn’t exactly what I’m looking for, but that can be a problem I figure out another day.
Looking ahead, I want to automate pulling the links from the Google Search Results page, then use a for loop to scrape the data from each of the articles. Another thing to figure out would be how to do this in Korean. But for now, this is a very promising start, a little window into what is possible. HOW EXCITING IS THAT!?
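The plan above might sketch out something like the code below. To be clear, this is not working code: Google’s results markup changes often (and Google may block scrapers), so the selector here is a placeholder I’d have to work out by inspecting the actual results page, and the search URL is just an illustration:

```r
# Rough sketch of the plan, not working code. The selector and search URL
# are placeholders; Google's real result markup needs to be inspected
# (and scraping it may be blocked or disallowed).
library(rvest)

results_page <- read_html("https://www.google.com/search?q=why+teamwork+is+important")

links <- results_page %>%
  html_nodes("a") %>%   # placeholder selector; real result links need something narrower
  html_attr("href")

# Then loop over the links and scrape each article's <p> text
articles <- list()
for (link in links) {
  page <- read_html(link)
  articles[[link]] <- page %>% html_nodes("p") %>% html_text()
}
```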
Oh, another thing I want to do is go back to the documentation of rvest and read up on all the components of the code. The main thing is that I don’t really know what %>% means…
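As a down payment on that reading: %>% is the “pipe” operator from the magrittr package, which rvest loads for you. It passes the left-hand value in as the first argument of the right-hand function, so nested calls can be written as a left-to-right chain. A tiny illustration:

```r
# %>% comes from the magrittr package (rvest re-exports it).
# x %>% f() is the same as f(x), so these two lines do the same thing:
library(magrittr)

sum(c(1, 2, 3))       # 6
c(1, 2, 3) %>% sum()  # also 6
```

That’s why the scraping code reads as a chain: parse the page, then select the nodes, then extract the text.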
Oh also I forgot! I tested this script on a second web article just to get a sense of what it would be like to automate the process across multiple articles. Here are those results:
It also worked! Still a lot of extra fluff, but it’s still pretty cool how easy it is to accomplish Step 1 for this project. Woo hoo!