Web Scraping: A Test on Teamwork

I am working on a project examining how conceptualizations of teamwork differ in the US compared to Korea (I hope no one is really paying attention to this. The danger of scooping is always out there in our suspicious academic world. **shifty eyes**). My method is to collect Google search results for key terms about teamwork and compare how people talk about teamwork in the US versus Korea. I won’t go over our specific hypotheses here (you’ll just have to wait for the publication!), but I’ll talk about the process.

Normally we would do this project by hand. I would have a team of undergraduate research assistants typing in the search term and then meticulously documenting each URL and article title. I’m not even really sure how we would go about capturing the main text. I suppose, as with the URL and title, it would be a never-ending task of copy-paste. But with the ability to scrape web data, why not take advantage of those tools?

Why Not Zoidberg?

Continuing from my previous post, now that rvest is installed and ready to go, I wanted to take it for a spin.

First I searched in Google for the phrase “why teamwork is important”. Here is a screenshot of those results:

Why is teamwork important

Now to start pulling the data from these articles. I opened the first link, which looked like this:

Happy Manager - Why Is Teamwork Important?

In R, I created a variable, teamwork, and assigned it the URL to this article. Then I opened teamwork to see what was inside…

teamwork = url("why-is-teamwork-important/")

Neato-bandito! It’s the full HTML file.

Then I followed the example from the Lego Movie scraping and created a variable called paragraph that scraped any data within “p” tags, which is basically the main body of text. When I opened paragraph, I found…

Scraping web data with html_nodes() and html_text()

Awesome sauce! It pulled the text in the main body of the article. From here it looks like it would be fairly easy to pull the article title (it’s not included here, which makes sense, since it’s probably inside a header tag). Scraping the “p” tag does pull in other text that isn’t exactly what I’m looking for, but that can be a problem I figure out another day.
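Since the screenshots don't carry the code as text, here is a minimal sketch of the step above. It is not the exact script from the screenshot: the tiny HTML page is made up so the example runs without a network connection, and the idea that the title sits in an "h1" tag is only my guess. With a real article you would pass its URL to read_html() instead (older versions of rvest called this function html()).

```r
library(rvest)

# Hypothetical stand-in for an article page; a real run would use
# read_html("<article URL>") instead of an inline string
page <- read_html('<html><body>
  <h1>Why Is Teamwork Important?</h1>
  <p>Teamwork matters because it combines different strengths.</p>
  <p>Share this article</p>
</body></html>')

# Everything inside <p> tags: the main text plus any extra fluff
paragraph <- page %>% html_nodes("p") %>% html_text()

# Assumption: the article title is in the first <h1> tag
title <- page %>% html_node("h1") %>% html_text()

print(title)
print(paragraph)
```

Notice that the "Share this article" line comes along for the ride, which is the same kind of extra text I saw in the real results.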

Looking ahead, I want to automate pulling the links from the Google Search Results page, then use a for loop to scrape the data from each of the articles. Another thing to figure out would be how to do this in Korean. But for now, this is a very promising start, a little window into what is possible. HOW EXCITING IS THAT!?
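I haven't written the loop yet, but a rough sketch of what I have in mind might look like the following. Everything here is an assumption for illustration: the pages vector holds two made-up HTML snippets standing in for the articles (in practice each element would be a URL pulled from the search results), and the "h1" selector for the title is a guess that may not hold on every site.

```r
library(rvest)

# Hypothetical pages standing in for the search results; in practice
# each element would be an article URL
pages <- c(
  '<html><body><h1>Article one</h1><p>First body text.</p></body></html>',
  '<html><body><h1>Article two</h1><p>Second body text.</p><p>More text.</p></body></html>'
)

# One slot per article, filled in by the loop
results <- vector("list", length(pages))

for (i in seq_along(pages)) {
  doc <- read_html(pages[i])
  results[[i]] <- data.frame(
    title = doc %>% html_node("h1") %>% html_text(),
    # Collapse all <p> text into one string per article
    text = paste(doc %>% html_nodes("p") %>% html_text(), collapse = " "),
    stringsAsFactors = FALSE
  )
}

# Stack the per-article rows into one data frame
articles <- do.call(rbind, results)
print(articles)
```

Keeping one row per article should make it easier to compare the US and Korean results later, though the real pages will need more cleanup than this.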

Oh, another thing I want to do is go back to the rvest documentation and read up on all the components of the code. The main thing, I think, is that I don’t really know what %>% means…
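A note for that reading list: %>% is the pipe operator, which rvest borrows from the magrittr package. It passes whatever is on its left as the first argument to the function on its right, so the two versions below should give the same result. This is a minimal sketch on a made-up one-line page, not code from my script.

```r
library(rvest)  # rvest re-exports %>% from magrittr

# Hypothetical one-paragraph page, just for illustration
page <- read_html("<p>teamwork</p>")

# Piped version: read left to right
piped <- page %>% html_nodes("p") %>% html_text()

# Same calls written as nested functions: read inside out
nested <- html_text(html_nodes(page, "p"))

print(identical(piped, nested))
```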

Oh also I forgot! I tested this script on a second web article just to get a sense of what it would be like to automate the process across multiple articles. Here are those results:

Test 2 of Scraping Teamwork Articles

It also worked! There’s still a lot of extra fluff, but it’s pretty cool how easy it is to accomplish Step 1 for this project. Woo hoo!

