Category Archives: Python

Part 2 Caught in a Web Scraping Maze: xgoogle Python module

During my investigation of web scraping methods in Python, I came across this Stack Overflow discussion that used the Python module xgoogle to scrape Google search results while building in a wait time between searches. It looked like a promising method, so I tried it out.
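The "wait time between searches" idea can be sketched without touching xgoogle at all. This is only a rough illustration: search_with_wait, fake_search, and the wait bounds are made-up names and numbers for this example, not part of xgoogle or the discussion I found.

```python
import random
import time


def search_with_wait(queries, search_fn, min_wait=5.0, max_wait=15.0, sleep=time.sleep):
    """Run search_fn on each query, pausing a random interval between calls.

    search_fn is whatever actually performs the search (hypothetical here);
    sleep is injectable so the pause can be skipped when testing.
    """
    results = {}
    for i, query in enumerate(queries):
        if i > 0:
            # Wait before every search except the first one
            sleep(random.uniform(min_wait, max_wait))
        results[query] = search_fn(query)
    return results


def fake_search(query):
    """Stand-in for a real search call; returns a canned result list."""
    return ["result for " + query]


# Demo with zero-length waits so it finishes immediately
demo = search_with_wait(["python", "pandas"], fake_search, min_wait=0.0, max_wait=0.0)
print(demo)
```

Randomizing the pause (instead of always waiting the same number of seconds) is a common choice for this kind of loop, but the right bounds are a guess on my part.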

Unfortunately, as is the case with many young programmers, at every step I ran into trouble.

First, I tried to import the module into IPython. That was silly; I hadn’t installed it yet! So then I tried to install it using commands in the Terminal (similar to what I would do in R). But unlike R (it seems), Python has no way of knowing where to pull the installation files or what I’m asking it. So it kept telling me, “Sorry I have no idea what you’re trying to do.” Eventually I did manage to set it up (described later in this post), but first I have to share something very strange that happened.

In the process of trying to install xgoogle, I noticed something in my Terminal that horrified me. Instead of the Terminal reading my Home Directory as “Alyssa-Fus-Macbook-Air” it was reading as “jims-iphone”. What the!? Who in the world is this Jim, and what is his iPhone doing on my computer!?

Jim’s iPhone infiltrating my laptop

I’ll admit, I panicked. I’m very new to working in the Terminal and I was convinced that somehow I had been hacked. But like any calm and composed programmer, I swallowed my fear and turned to the one thing that could save my computer from inevitable doom: Google.

After starting with several vague and terrified search terms (“WHO IS INVADING MY COMPUTER AND HOW DO I STOP THEM?!” …just kidding, I didn’t search anything that histrionic), I finally whittled down my search to something specific: “terminal name different from user name”.

The links I found helped me to investigate the problem and in the end, solve it. First I looked into how far this “name change” went. Was I somehow accessing Jim’s iPhone or was this a superficial name change and I was still accessing my own computer and files? So I changed the directory to my Desktop, and I checked what was in it. (I suppose I could have just looked in the directory itself, but I wanted to see what would happen if I went “deeper” into “Jim’s iPhone”). This helped me confirm that though my laptop was taken over by the evil iPhone of Jim, at least everything seemed to be where it was supposed to be.

Jim can take my laptop’s name, but he can’t take my laptop’s contents!

So then I checked to see if my computer name was still the same, or if Jim had taken that too.

Sorry Jimmy Jim, you can take my name, but not my name name!

Okay, so I’m starting to calm down. Then I found this article discussing this problem, and I focused on the second response about computer sharing. I looked at my Sharing Preferences and was shocked to see Jim had infiltrated at this level.

Jim just wants to share

Why, Jim, why?! What did I ever do to you…

So at this point I’m wondering, when I installed all those pandas and numpy modules and anaconda whatsits did I accidentally download something that changed my HostName to Jim? Or maybe since I’m always connecting to foreign networks (in cafes, in the library, in hotels), is that where I picked up a little friend named Jim?

This result suggests the latter explanation is most likely. “In short, the Mac will pick up a host name from the DHCP server. This does not affect your computer’s name as you have assigned it. This will only affect what you see at the command prompt.” Ah haaaa that’s plausible and no cause for alarm (I’m sure). (What’s the purpose of this though?)
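If you want to see the host name from inside Python instead of the Terminal prompt, here is a tiny sketch using only the standard library. I'm assuming the prompt shows the host name up to the first dot, which is the usual default for bash; your shell may be set up differently.

```python
import socket

# The host name the operating system currently reports
# (on my laptop this is the value that DHCP had changed)
hostname = socket.gethostname()

# Many shell prompts only show the part before the first dot
short_name = hostname.split(".")[0]

print("Full host name:", hostname)
print("Short host name:", short_name)
```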

And then this result gave a suggestion for how to change the HostName. I have read several cautionary tales about using “sudo”, which runs a command with administrator (root) privileges, but this seemed harmless enough. So I ran the command, and once I restarted my Terminal, everything was right as rain. Whew!

sudo scutil --set HostName Alyssa-Fus-Macbook-Air

Right as rain

Alright! Now that that snafu has been fixed, let’s return to the real task at hand: installing xgoogle.

Eventually, after several failed attempts, I found this Stack Overflow answer and used it to successfully install the module. I just had to download all the scripts from the xgoogle GitHub page, put the folder in my directory, change my directory to that folder, then run the script. And it worked beautifully!

Installing xgoogle

Alright alright alright! Let’s put this puppy into action.

I ran the script straight (without the wait time) to see what I would get.

GoogleSearch results

Unfortunately what I got were… 0 results. 0 results! How is that possible?

This response on GitHub suggested it was a bug and needed a patch, available here as well as in the comments. …but I’ll admit, I had no idea how to apply the patch. I decided to run the patch code straight, because I figured this would replace the old function, and then I could rerun my code. But the patch code kept getting stuck.

When I copy-pasted from the comments (made by the creator of xgoogle himself), I got an Indented Block error:

Indented block error

So then I tried the RAW Paste Data from the pastebin site linked above:

Surprise! Unexpected indent

Another indent error. I tried pasting them into TextWrangler to double-check the indents and reran the code. This time, there was no error for the patch code (omg, maybe it’s working?!) — I held my breath and ran the GoogleSearch script again…. still 0 results. Bah. Dejection…
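For what it's worth, here is a small sketch of one common way pasted code ends up with indent errors: every line carries extra leading whitespace from wherever it was copied. I can't say for sure that this was my problem with the patch. The pasted snippet and the compiles helper below are made up for illustration.

```python
import textwrap

# A pasted block where every line carries four extra leading spaces
pasted = """
    def greet(name):
        return "hello " + name
"""


def compiles(source):
    """Return True if Python can compile the source, False on an indent error."""
    try:
        compile(source, "<pasted>", "exec")
        return True
    except IndentationError as err:
        # Covers both "unexpected indent" and "expected an indented block"
        print("IndentationError:", err.msg)
        return False


print(compiles(pasted))                   # the raw paste
print(compiles(textwrap.dedent(pasted)))  # after stripping the common indent
```

textwrap.dedent removes whitespace that is common to every line, so the relative indentation inside the function stays intact.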

PS. I just checked my afp address and now it looks like a perfectly normal random set of numbers. Whew! Good-bye Jim’s iPhone! It was fun, but I’m happy you are no longer on my computer.

Working with Nested Dictionaries in Python

In my first few posts, I described how to pull data from an API, convert JSON data for Python, and combine data into a table. The data I used was basic (short) user data from League of Legends including summoner ID, name, and profile icon. Very simple and not all that interesting.

Now I want to pull real game data to analyze trends in game play (and what predicts a win!).

First, I went to LoL’s API Reference page and selected the match history: https://developer.riotgames.com/api/methods#!/966/3312

I retrieved a summoner ID to enter into the summonerId field (luckily I had a handy dandy table where my IDs were nicely listed). This set of data gives you the last 10 matches the person has played, and game data associated with those games like number of assists, champion level achieved, number of deaths, which items were bought, damage taken. Everything!

When you execute the request, you get a page that looks like this (remember, you have to insert your own API key into the URL):

Match History from API

More JSON data! With much more complicated nested dictionaries…

I imported the data into Python (using the same steps I mentioned in my last post) and tried to use the same DataFrame call. This is the result I got:

pd.DataFrame(data)

O.O All the keys within “matches” are in a single column instead of distributed across columns.

I found a JSON normalize function that seemed to do partly what I wanted (reference the web page discussing this option here), which got me to this:

json_normalize(data['matches'])

Close, but no cigar. Some of the columns still contain nested data.

Here was another promising suggestion: http://pandas.pydata.org/pandas-docs/stable/io.html#normalization. But my different attempts still didn’t work. (I’ll have to figure out why later.)

json_normalize(data, ['matches', 'participantIdentities']) got me the data within participantId, but the player data was still a nested dictionary.

json_normalize(data, ['matches', 'participantIdentities'])

json_normalize(data, 'matches', ['matches', 'participantIdentities']) generated an error, even though as far as I can tell, it matches the example.

json_normalize(data, 'matches', ['matches', 'participantIdentities'])

json_normalize(data, ['matches', 'participantIdentities', ['player']]) got me the indexes I needed for the data, but no data!

json_normalize(data, ['matches', 'participantIdentities', ['player']])
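Here is a minimal sketch of how the record path and meta arguments seem to fit together, using a tiny made-up sample shaped roughly like the match history response. The field names (matchId, summonerName) and values are assumptions for illustration, and pd.json_normalize is the spelling in newer pandas versions (older ones import json_normalize from pandas.io.json).

```python
import pandas as pd

# Made-up sample shaped like the match history response (values are illustrative)
data = {
    "matches": [
        {
            "matchId": 1,
            "participantIdentities": [
                {"participantId": 0,
                 "player": {"summonerId": 39017501, "summonerName": "kallyope"}}
            ],
        },
        {
            "matchId": 2,
            "participantIdentities": [
                {"participantId": 0,
                 "player": {"summonerId": 39017501, "summonerName": "kallyope"}}
            ],
        },
    ]
}

# record_path names the nested list to unpack into rows;
# meta names fields of each match to repeat on every unpacked row.
identities = pd.json_normalize(
    data["matches"],
    record_path="participantIdentities",
    meta=["matchId"],
)

print(identities.columns.tolist())
print(identities)
```

Starting from data['matches'] (the list) instead of data (the outer dict) keeps the paths short. In this sketch the nested player dictionary comes out as dotted column names such as player.summonerName.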

So then I decided to take a different approach and figure out how to even call the nested data. This method actually generated something usable!

First I tried to call the ‘participants’ value/key from the ‘matches’ key.

data['matches']['participants']

Nope! Error! It doesn’t like that I used a string (even though I’ve seen plenty of examples where it looked as though they called the data through the variable name, and it worked fine. For example, in the post from the link above, the person wrote:

for result in data['results']:
    result[u'lat']=result[u'location'][u'lat']
    result[u'lng']=result[u'location'][u'lng']
    del result[u'location']

which made me think I could call data['matches']['participants']).

But then I started to think more about what this for loop was doing and what it was looping over: the items within data['results'], which is a list. That is also what the error, “list indices must be integers, not str”, suggests. I also just noticed that the output from data['matches'] lists integer indexes, which should have been a clear giveaway for how to call the data. Silly me! So then I tried data['matches'][0].

data['matches'][0]

Hoorah! It seems to have worked! It called the details from the first match. I swear, when I tried calling the first index from data[‘matches’] before, it didn’t work, but maybe I was trying data[0] instead.

data[0] – But why doesn’t this work? Add this to the list of things to figure out later!
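A small sketch of what is probably going on, using a made-up miniature of the match data (the matchId, assists, and deaths values are invented): the outer object is a dict, so it only accepts its keys, while data['matches'] is a list, so it only accepts integer positions.

```python
# Made-up miniature of the match history data
data = {
    "matches": [
        {"matchId": 101,
         "participants": [{"stats": {"assists": 7, "deaths": 3}}]}
    ]
}

# data["matches"] is a list, so a string index raises TypeError
try:
    data["matches"]["participants"]
except TypeError as err:
    list_error = str(err)
    print("TypeError:", list_error)

# data itself is a dict whose only key is "matches", so 0 is not a valid key
try:
    data[0]
except KeyError as err:
    key_error = err
    print("KeyError:", key_error)

# Alternate keys (for dicts) and positions (for lists) on the way down
first_stats = data["matches"][0]["participants"][0]["stats"]
print(first_stats["assists"])
```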

So then I kept playing around with the call and eventually got this:

data['matches'][0]['participants'][0]['stats']

and this:

data['matches'][0]['participants'][0]['stats']['assists']

Sweet!! I’m learning something new about calling data from dictionaries and nested dictionaries. This is pretty kickass awesome!

I also figured out one clue I could use to tell whether I need to call the data through the index or through the key from the dictionary.

When I call data[‘matches’], this is what I get:

data['matches']

Here the output starts with a square bracket, [, compared to the output for data['matches'][0]['participants'][0]['stats'] (pictured above), which starts with a curly brace, {. I’m not sure exactly what that means… maybe the square brackets mean the object is an array and therefore needs to be called by index, whereas ['stats'] is a straight dictionary so the specific keys can be called. Either way, ['participants'] also starts with a square bracket, [, so I used a [0] to call that dictionary, even though it was the only index in ['participants']. EDIT: I reached the part in the Python Codecademy course where they call a list value from a key, and they used the index. So what I called an array here is what Python calls a list. Either way, I’m excited that I figured out something that I’m also learning!
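Instead of squinting at the first bracket of the output, you can ask Python directly what kind of object you have. This is just a sketch; describe is a made-up helper name and the sample data is invented.

```python
def describe(obj):
    """Say whether obj should be accessed by key (dict) or by position (list)."""
    if isinstance(obj, dict):
        return "dict with keys " + str(sorted(obj))
    if isinstance(obj, list):
        return "list of length " + str(len(obj))
    return type(obj).__name__


# Made-up miniature of the match history data
data = {"matches": [{"participants": [{"stats": {"assists": 7}}]}]}

print(describe(data))
print(describe(data["matches"]))
print(describe(data["matches"][0]["participants"][0]["stats"]))
```

A dict prints with curly braces and takes keys; a list prints with square brackets and takes integer positions.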

Okay! So now I have a way of pulling data from the nested dictionary, ‘matches’. After that it was smooth sailing to create a for loop to combine match history data together (at this point, the stats).

Creating data frame

Awesome!! I’m going to stop here because I have other things I need to work on, but it’s not a bad start! I still need to add the match ID to this data frame, as well as all the other match data in 'matches'. It shouldn’t be too hard to do, but I’ll have to play around with it another time.
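Here is one rough way the loop could carry the match ID along with the stats. This is a sketch, not the code from my screenshot: stats_rows is a made-up helper, the sample values are invented, and I'm assuming each match stores its ID under a 'matchId' key.

```python
def stats_rows(data):
    """Collect one flat dict per participant, tagged with its match ID."""
    rows = []
    for match in data["matches"]:
        for participant in match["participants"]:
            row = dict(participant["stats"])      # copy so the source data is untouched
            row["matchId"] = match.get("matchId")  # assumed key name; None if missing
            rows.append(row)
    return rows


# Made-up miniature of the match history data
sample = {
    "matches": [
        {"matchId": 101, "participants": [{"stats": {"assists": 7, "deaths": 3}}]},
        {"matchId": 102, "participants": [{"stats": {"assists": 2, "deaths": 5}}]},
    ]
}

rows = stats_rows(sample)
print(rows)
```

A list of flat dicts like this can be handed straight to pd.DataFrame(rows) to get one row per match.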

It’s funny, when I look back at what I wrote, it seems completely obvious about what I should have done, but it wasn’t at the time. I hope I don’t sound too noob and incompetent (I’m sure half the terms I use aren’t right), but for now I’m happy with what I’ve accomplished!

Who are My Friends in League of Legends? Compiling User Data

For my first foray into the Python, API, JSON world, I decided to use League of Legends (LoL) data and see if I could call it, manipulate it, and make some use of it!

My goal for this task was to pull League of Legends data on specific users (using the API) and assemble that information (formatted with JSON) into a table, in this case, a data frame using Python.

In my other post, I described the general process of accessing the League of Legends API, so I won’t go over that here. Instead, I’ll talk about the specific steps I took to create a Python object containing a list of selected friends and their associated general user info from LoL. So the following script assumes you know how to pull data off of the League of Legends API.

Import the libraries you’ll need to import data from a URL (requests), read JSON data (json), and create a data frame (pandas). Importing pandas as pd allows for easy reference to functions in pandas.

import requests
import pandas as pd
import json

Create a list of the names you wish to pull (your League of Legends friends!).

names = ["kallyope", "kallykallyope", "zunger", "dowentz", "dead peon"]

If you want to add anyone to your names list, use the .append() call.

names.append("grumpyII")

Create an empty data frame for your friends.

friends = pd.DataFrame()

The pd lets Python know that you are pulling the call .DataFrame() from the library pandas (pd).

Create a for loop to pull each name from your names list that does the following:
(1) Gets the URL associated with that name (which calls the League of Legends profile information of the friend from their API)
(2) Reads the URL into Python
(3) Unloads the JSON data into a readable format in Python
(4) Creates a data frame for that name’s data (that is transposed, since the default will make the keys the index of the data frame, and the values the first column. EDIT: I just learned, this is only the case for nested dict objects, which this one is.)
(5) Appends that name’s data to the overall data frame, friends.

<key> is a specific API key that League of Legends generates for you. You should replace this with your own API key. Don’t share this with anyone!

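for name in names:
    # Build the request URL for this summoner name
    url = "https://na.api.pvp.net/api/lol/na/v1.4/summoner/by-name/" + str(name) + "?api_key=<key>"
    resp = requests.get(url)
    data = json.loads(resp.text)
    # Transpose so each summoner becomes one row
    user = pd.DataFrame(data).T
    # pd.concat also works on pandas versions where DataFrame.append has been removed
    friends = pd.concat([friends, user])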

Print the data frame friends to see what it looks like!

print(friends)

Data frame of users in list “names”

Awesome!! Now I want to figure out how to pull down different kinds of data (e.g., win rate, items bought, time of deaths in game, time of CS in game if the API provides that information), manipulate the data, and run analyses on the data (Am I more likely to win games where my CS is high? Am I less likely to die when my CS is high? Am I more likely to win games when the jungler ganks the lanes?).

I also want to figure out how to link up this data to the web and create an interactive webpage. For example, it would be cool to create a webpage based on this table I just made where you could input your friends’ names in a field, and the page will create a table that grows as you add more friends to it. It would be neat!

Learning League of Legends API

To create a data science project, there is one crucial component any aspiring data scientist must have: data!

One source of abundant data comes from the massively popular online game, League of Legends, which luckily enough supports an API that makes it extremely easy to access user and game data.

Here is the website for the Riot Games API (Riot Games is the company that created and manages LoL): https://developer.riotgames.com/

Riot Games API

Riot Games API homepage

To use LoL data, you must have a League of Legends account. It’s free to sign up and free to play!

When you first log in, the site assigns you an API key. Since this is my first time using an API, I had no idea what this meant. I learned (through many questions) that this key is what allows you to send requests to the API for data. There is a limit to the number of requests that can be made at any given time (500 requests every 10 minutes, 10 requests every 10 seconds) per key. For development, this amount should be fine. But if you make your application public and you have multiple users on your site (making multiple requests), then you may need more keys, possibly even a production key. Every ping to the site counts as a request. For example, if you ask the site to send you kallyope’s profile information (that’s me!), that’s one request. Say you see I’m in a game, and you want that game’s information–that’s another request. You can see the other players that are in that game, and you want to know who they are. That’s another nine requests. So as you can see, the requests can add up very quickly!
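To make those limits concrete, here is a rough sketch of a client-side counter that tracks both windows (10 requests per 10 seconds, 500 per 10 minutes). RateLimiter is a made-up class for illustration, not something the Riot API provides, and the real service may count requests differently than this does.

```python
import time
from collections import deque


class RateLimiter:
    """Track request times and report how long to wait before the next one."""

    def __init__(self, limits=((10, 10.0), (500, 600.0)), clock=time.monotonic):
        self.limits = limits   # pairs of (max requests, window in seconds)
        self.clock = clock     # injectable so the sketch can be tested without waiting
        self.sent = deque()    # timestamps of past requests, oldest first

    def record(self):
        """Call this each time a request is actually sent."""
        self.sent.append(self.clock())

    def wait_time(self):
        """Seconds to wait so that one more request stays inside every limit."""
        now = self.clock()
        longest = max(window for _, window in self.limits)
        # Forget requests older than the longest window
        while self.sent and now - self.sent[0] >= longest:
            self.sent.popleft()
        delay = 0.0
        for max_requests, window in self.limits:
            recent = [t for t in self.sent if now - t < window]
            if len(recent) >= max_requests:
                # Wait until enough old requests have aged out of this window
                delay = max(delay, recent[-max_requests] + window - now)
        return delay


limiter = RateLimiter()
print(limiter.wait_time())  # nothing sent yet, so no wait is needed
```

In a real loop you would call time.sleep(limiter.wait_time()) before each request and limiter.record() right after sending it.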

Okay, so now you have an API key and you want to start pulling data off the site. How do you do that?

You have to access the API Reference through the API Documentation (in the top menu). Or you can just click on this link here: https://developer.riotgames.com/api/methods#!/960

(SIDENOTE: WOW! So as I was writing up this post, I started poking around more parts of the site. And my mind is blowing up with all the cool information you can access, for example about champions. To read more about the awesomeness that is currently making me see stars, check my blog for a future post.)

I started with the Summoner tabs, which is user data. I clicked the first “Get” set, which allows you to look up users based on their username.

Look up by summoner username

At the bottom of each section is an Execute Request button. For some of the fields you don’t have to specify the information you are looking for (e.g., champion info). For others, you have to provide some information to guide the API to what you are seeking. For the summoner name, I typed in my username, kallyope.

Looking up kallyope

When I click Execute Request, I gain access to this link: https://na.api.pvp.net/api/lol/na/v1.4/summoner/by-name/kallyope?api_key=<key> You should replace <key> with your own API key. The link sends me to a plain text page with the information on it that I requested, in this case, information about the user kallyope.

{"kallyope":{"id":39017501,"name":"kallyope","profileIconId":607,"summonerLevel":30,"revisionDate":1426917635000}}

The data is stored as JSON. It looks a lot like a dict in Python, doesn’t it? The outer key “kallyope” is paired with a value that is itself a set of key-value pairs: “id”, “name”, and so on. For example, the key “id” is paired with kallyope’s id, 39017501.
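The resemblance to dicts is more than skin deep: Python's json module turns this text into a real nested dict. A quick sketch using the response above, pasted in as a string:

```python
import json

# The response text from the API, as a plain string
raw = ('{"kallyope":{"id":39017501,"name":"kallyope","profileIconId":607,'
       '"summonerLevel":30,"revisionDate":1426917635000}}')

data = json.loads(raw)       # JSON text -> Python dict
summoner = data["kallyope"]  # the inner dict of key-value pairs

print(type(data).__name__)
print(sorted(summoner))
print(summoner["id"])
```

Note that the id comes back as a number, not as text, because it has no quotes around it in the JSON.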

One thing that is awesome about this link is I can just replace my username in the URL with any username that I know, and it will give me that user’s information. For example, when I look up my friend, zunger, here is his information.

https://na.api.pvp.net/api/lol/na/v1.4/summoner/by-name/zunger?api_key=<key>
{"zunger":{"id":33059657,"name":"Zunger","profileIconId":20,"summonerLevel":30,"revisionDate":1426921735000}}
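Swapping the username in the URL is easy to wrap in a small function. This is a sketch assuming Python 3; summoner_url is a made-up helper, and "<key>" is still the placeholder for your own API key. I've added quote() so that names with spaces end up as valid URLs.

```python
from urllib.parse import quote

BASE_URL = "https://na.api.pvp.net/api/lol/na/v1.4/summoner/by-name/"


def summoner_url(name, api_key="<key>"):
    """Build the by-name lookup URL for one summoner."""
    # quote() percent-encodes characters such as spaces in the name
    return BASE_URL + quote(name) + "?api_key=" + api_key


print(summoner_url("zunger"))
print(summoner_url("dead peon"))
```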

And that is how you can access League of Legends API in a nutshell!

You may be asking yourself, okay, that’s nice, I can see this information on a webpage, but what can I DO with it!? I ask myself that same question. One easy option is to throw this information into Python and create a table that organizes the information. See this post for my first attempt to wrangle online data in Python!

Learning Python

Python is a simple and powerful language that can be used for data analysis, which is why it is one of the most important tools a data scientist should know. The primary statistical and plotting libraries include numpy, pandas, scikit-learn, and matplotlib. (I’m excited to learn pandas; it sounds cute. ^^ Oh and extremely useful for statistical analysis, of course.)

When I first started learning Python, it was through the Codecademy course. While the language seemed pretty easy to pick up, I was frustrated because I wasn’t sure how it connected to data analysis.

So then I picked up the book Python for Data Analysis by Wes McKinney. I had good luck with Practical Data Science with R, so I was optimistic about this book. These books, however, rely on some previous knowledge of the language, which, in the case of Python, I had very little. For example, from the start, the book discussed how data frames in Python are similar to lists of dicts and tuples, but I had no idea what tuples were (it sounded like a mix of triples and doubles to me, what does that even mean!?), and I only vaguely remembered what dicts were. Even the Appendix that covered Python basics at the end of the book was a little beyond me.

Python for Data Analysis Cover

Then I found the Google Python class. I watched the first day videos and some of the second day videos. From the very beginning, the instructor was using dicts and tuples, and now I finally understand what they are! (Tuples are immutable lists created with parentheses; dicts are made up of key-value pairs.)
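A tiny sketch of the difference, with made-up values: a tuple refuses to be changed after it is created, while a dict lets you look up and update values by key.

```python
# A tuple is created with parentheses and cannot be modified afterwards
point = (3, 4)
try:
    point[0] = 10
except TypeError as err:
    tuple_error = str(err)
    print("TypeError:", tuple_error)

# A dict maps keys to values, and its values can be updated by key
summoner = {"name": "kallyope", "summonerLevel": 30}
summoner["summonerLevel"] = 31
print(summoner["summonerLevel"])
```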

I also learned that Python is already installed on the Mac OS and can be easily accessed from the Terminal. (In the Python book, they start with installing XCode for Mac, and every version I tried to install was incompatible with my OS, Mountain Lion. In the end I did install XCode command line tools, which I think has worked? I discuss this a little later in the post.)

I also downloaded some free books on Python, including A Byte of Python, Non-Programmer’s Tutorial for Python 2.6, and Python Programming. It was useful to read different approaches to Python. A Byte of Python is more prose-y, but I was dissatisfied with its explanations: they would define terms using the same words as the terms. For example, they say a literal constant is “called a literal because it is literal – you use its value literally.” O_O

I liked the Python Programming book better. It was more matter of fact and just went through different terms and ideas one at a time. It almost felt like a more detailed dictionary, but I liked it.

After trying the Google class and reading through the beginning of these books, terms and concepts like lists, dicts, tuples, indexes, object calls started to take form in my head. In fact, I returned to the Python book a few days ago, and when I reread the introductory examples, they actually made sense! I was over the moon. It was actually an amazing experience to go from reading text that seemed like gibberish to find that when I revisited them, I could understand their meaning and picture what the authors were describing in Python. It was awesome!

Armed with this new knowledge, I turned to installing the libraries pandas and numpy. I went through several suggestions for installing these packages, such as using Anaconda and/or Miniconda, suggested here. It didn’t seem to work though.

Eventually I think what worked was installing the XCode command line tools (not sure where it went on my computer), and then installing pandas with a command to the Terminal. At first I thought it wasn’t working because the test the Python book recommended resulted in an error (to create a plot). It turned out I had just spelled the command wrong (arange, not arrange, sigh). Once I fixed that error, it worked! I got a plot! It was a very exciting day (though I can’t wait until I can actually plot my real data).

plot(arange(10))

Testing pandas with plot(arange(10))

After installing the libraries, I looked into setting up my Python environment. IPython was useful, but I was doing everything in the Terminal. I wanted a setup where I could record my scripts and run line by line.

I tried using PyCharm, but I was having trouble getting it to work with IPython. I tried sending code to the Terminal from TextWrangler, but it wasn’t connecting. I revisited the Python book and saw they recommended going from a text editor to the Terminal by copy-pasting. They had some tips about how to paste whole blocks of text (if you just paste multiple lines straight, they enter the Terminal line by line). This is now the method I’m using.

Text editor + IPython in Terminal

So things are working out pretty well! I’m revisiting the Python course on Codecademy to remind myself of what they taught. I’m kind of amazed at how much they cover of what I am currently learning. And now I can put what they are teaching into context and see how I would use it for data analysis. It’s cool!

So I feel like I’ve gotten a foot in the door and that I’m setting up the foundation for figuring out what I can do in Python. I’ve been doing a mix of studying, learning, reading, Googling, and just trying stuff out. It’s been fun! Check out my future posts for updates on the specific tasks I’ve been doing/learning!