Part 5 – Power Query API – Refreshing Data Indeed!

After a number of hours spent experimenting and searching I’ve fallen a little short on this post ladies and gents! I was looking a simple way for the non-programming public to refresh data in an excel sheet on a regular basis. In an Ideal world I would have liked to be able to give the Indeed! Excel file to anyone and say follow these steps to accumulate data for a job search. Alas; the simplicity I was looking for I could not find. So what I will cover in this post is the least technical way I could come up with to refresh a Power Query worksheet. Then I’ll go over how I imagine most reports or data marts will access Power Query’s accumulating data. As always the files we have been working with can be downloaded here.

First a few links to technical solutions I discovered along my journey!

An Excel and VBS Solution

Kim Greenlee has a great blog from all the way back in 2006 that details setting up a visual basic script to open, refresh, save and close an excel file. The script he is using is contained in the article; in addition he provides a link to another blog on scheduling the script as a task in windows task scheduler. I think this method would be ideal if you’re coming from an excel background or are looking for something a little more robust than the hack non-technical solution; but not as technical as PowerShell or SSIS.

A PowerShell Solution

Brad Saide’s blog post detailing his difficulties and preferences for refreshing excel files through PowerShell will lead you to a TechNet discussion on PowerShell and a how to refresh Excel COM objects with PS. There are plenty of code examples in the TechNet discussion and more than enough to get you started refreshing your file or folder of files. If you’re in IT or are a DBA and already familiar with PowerShell this might be the ideal solution for you.

A SQL Server Integration Services Solution

A modified version of the code found in Matt Masson’s blog post would be my preferred method to refresh Power Query excel sheets; if for some reason I don’t have access to Power BI or SharePoint 2010+. He provides some sample code that could be easily modified in SSIS to refresh a folder of Excel Power Query files. I’ll be using this at home / work in lieu of Power BI for the time being; I’ll do a walk through at some point involving setting this up in SQL Server 2012 so stay tuned!

The most non-technical solution I could come up with!

Excel affords us the ability to refresh data at intervals when the workbook is open. In the case of a direct call to Power Query this can’t be a background query which means at interval we will be locked out of interacting with the data.

If you head to the data tab and select “Connections”…

A pop up will come up with all the connections you have made this far. Select “ParameterizedQuery(1)” in this case and go to properties. (I actually had two identically named queries in here which is why there is a (1) associated with the query we selected, I removed that connection).

In the connection properties we have the “Usage” tab and the “Refresh Control” menu. I selected an interval of 5 minutes here and then selected “Ok”.

If you simply leave the workbook open at this point your data will accumulate every five minutes from Indeed based on your parameters. I’ve done this for a few days at a time and it’s an ok solution (let’s be clear though this is an inelegant hack, but it works). I have a tendency to close these workbooks if I see them open though; just force of habit I suppose. Instead of retraining myself to leave these workbooks open I ended up creating a secondary user on my system. On a windows 8 machine you can hit WinKey+W then search on “Create an Account” or go to Control Panel\User Accounts and Family Safety\User Accounts\Manage Accounts\” and hit the link “Add a New User In PC Settings” in the bottom left hand corner of the screen.

This user is just my “Refresh Excel” user; the sole purpose of creating this user is to log in to that account; open the excel work books and let them refresh. If you switch users to your normal account at this point they will run until you restart your computer; continually refreshing.

This hack solution has a number of flaws:

  • If you restart your computer you will have to go back in and open up the workbooks to have them continue refreshing.
  • The Excel files are not saved; Excel will intermittently save on its own and when you open the file up again Excel will pop up with a screen asking you which version you would like to use; so you select the latest and save the file and your back to your last save point.
  • Working with a file that refreshes every five minutes; is annoying and cumbersome as you will be continually locked out of actions as it refreshes.

Reporting with another Power Query sheet

At least with the refresh you can do something about the lock out from a data refresh via Power Query. I took the worksheet from Part 4 of the series and placed in a folder called “C:\Temp”. I then logged in with my “Refresh Excel” user and opened the file. Then I switched back over to my normal user; so we now have a file we are going to work with in “C:\Temp”; refreshing and accumulating Indeed! Job data every 5 minutes. The next step to working with the data is to tap into the “Parameterized Query Results” tab from another Excel sheet so we can make some reports without ever having to deal with the 5 minute refresh or inadvertently altering our Power Query data store for Indeed data.

Create a new Excel file; then go over to the power query tab and select “From File” -> “From Excel”

Select the file we are auto refreshing from the “C:\Temp” directory

Once the file loads up we will be presented with the now familiar Power Query window. This is where we will have to go through a few steps to clean up our data again. Start off by renaming the query and then right click the table cell (uppermost left cell in the header) and select “Use First Row as Headers”

In my data set I’ve noticed some strange behavior in my “Posting_Date” column. There are some values which are not dates…the way I know they are not dates is that they don’t contain the standard “/” between day, month and year. The same goes for my time column; some of the data in “Posting_Time” is not coming through as I would expect. The Indeed API is not returning proper date / time data 100% of the time or something on my end has messed up the query; either way it needs to be fixed for reporting. Luckily in Power Query we can filter these values out. Select the little down arrow next to our “Posting_Date” column and go to “Text Filters” -> “Contains”.

In the new window type in “/” (no parenthesis) for your date columns; then click “Ok”

Repeat the step above for the “Load_Date” as well; then for the “Posting_Time” and “Load_Time” columns filter on “:”; your data should be in good shape at this point. This is just filtering for the error we noticed; there could still be bad data coming through from Indeed on other columns but we can deal with these as they arise.

Now we should be able to change our data types out to “Date” or “Time” respectively and not run into an error. So right click on the respective column and change out the type using “Change Type”.

You’ll notice if you’re following along that we picked up an extra grouping of columns; for my data store worksheet I now have columns 20 to 115 which are all null and just happened to be pulled along when we queried excel. So highlight all the columns we don’t want and right click on the column header; then click “Remove Columns”. Alternately you can highlight the columns you want to keep and select “Remove Other Columns”.

I also took this opportunity to change the salary field to a “Number” type. After that I think the Indeed! Job Power Query is in good shape so press “Done” and take a look at your data!

I think in a lot of ways this is how people will interact with the data from Power Query. There will be a data store somewhere accumulating interesting data points, be it locally, on your network or via Power BI in the cloud. People will hook into that read only data for reporting purposes or data marts via Power Query or other traditional providers. One file to accumulate (Produce) and then multiple files or programs to report (Consume) on the data is the most logical architecture for Power Query data. While I disagree with Microsoft that there will eventually be a position open in company’s called “Data Steward”; I do believe someone will be in charge of procuring and monitoring data stores of disparate data (such as IT or DBA’s).

I took one last step on my Indeed Job sheet; I went back into the “Connection Properties” and set the “Refresh data when opening the file” to true. I also noticed something interesting; because we are pulling from Excel we have the opportunity to do a background refresh. This means we could technically look at refreshing our new sheet every 10 or 20 minutes without the annoying lockout we had when querying the API’s directly.

Well that’s it for Part 5 in the series; I’m going to let my data accumulate over the next few days and come back with the final part in the series; the reporting feature via Power View and Power Map. It will be nice to visualize our little job search and see where we can get the most money; how frequently jobs are available and perhaps which companies have a lower / higher turnover rate. If you have any questions or comments please let me know!

2 thoughts on “Part 5 – Power Query API – Refreshing Data Indeed!

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s