R is both an environment and a language. It is a free programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and analyzing data, but it is also notorious for being hard to learn. With the right guide, though, you can learn R in no time, given a proper explanation of techniques such as the use of probability distributions, hypothesis testing, creating linear models, linear regression, logistic regression, anomaly models, and so on. Why is R worth learning? R is very versatile, and it covers a whole range of statistical techniques and algorithms for data mining as well as analysis. It’s basically an integrated suite of software, packages and data that allows you to perform statistical analysis and data mining, and to create complex graphs.
Join Jagannath Rajagopal, Statistical Forecasting Analyst, as he reviews topics in increasing order of difficulty, starting with Data/Object Types and Operations, Importing into R, and Loops and Conditions. His Introduction to R course will get you to learn R in no time with almost 90 videos and 140+ exercise questions, over 10 chapters. Interested in learning more? Below is a transcript from one of his introductory chapters that defines R and its functions.
Hello everyone! In this chapter, we’re going to learn a little about the R language. We’re going to see some of its characteristics. We’re going to hear about some of the areas in the real world where R is being used for research, data analysis, and data mining. Finally, we’re going to review some of the resources available to us, both from the R project and elsewhere on the internet.
Okay, so what is R? R is both an environment and a language. It’s an environment in the sense that it’s an integrated suite of software, packages and data that basically enable the R language. The R language in turn allows you to perform statistical analysis and data mining; techniques such as the use of probability distributions, hypothesis testing, creating linear models, linear regression, logistic regression, anomaly models, and so on. R is very versatile and it covers a whole range of statistical techniques and algorithms when it comes to data mining as well as analysis.
R allows you to create complex graphs – you can generate simple one-dimensional graphs or you can create more sophisticated two- or higher-dimensional plots. You may also want to check out the R project website at R-project.org. It’s got a more comprehensive description of the R language as well as a formal definition.
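As a rough sketch of what that can look like at the console, the following draws a one-dimensional histogram and a two-dimensional scatter plot with base graphics. The data are simulated purely for illustration, and the variable names are made up for this example.

```r
# Simulated data, for illustration only
set.seed(1)
x <- rnorm(100)            # 100 draws from a standard normal distribution
y <- 2 * x + rnorm(100)    # a noisy linear function of x

# One-dimensional view: the distribution of x
h <- hist(x, main = "Histogram of x")

# Two-dimensional view: y plotted against x
plot(x, y, main = "y versus x")
```

Both `hist()` and `plot()` come with the graphics package that is loaded by default, so nothing extra needs to be installed to try this.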
So what are some of the characteristics of R? R is an expression language. It allows you to perform spot analysis as well as create programs. What does that mean? Let’s say, for example, there’s a data set of 25 numbers, and you want to get a feel for some of the statistics around the numbers – so the average and the spread. You can go about this in two ways using R. If you want to just get a one-time feel for the average and the spread, you can take the data set, assign that to a vector in R, and ask R to compute the mean and the standard deviation. Alternatively, if you want a more repeatable approach, where you want to pass multiple data sets to a function or a program and have it spit out the statistics I mentioned before, again and again, then you want to create a program.
In that case, the program can be defined to take the vector as input and output the mean and the standard deviation. R allows you to do both. R is highly extensible. Its extensibility comes from two facets. Firstly, there are the statistical packages, both those from the R project and those contributed by users. There’s a whole host of algorithms, functions, and techniques available to us, and that makes it highly extensible. Secondly, the functions themselves can be called within other functions, and the way functions have been defined and designed in R, you can use the parameters of one function within another function if they are nested. That, along with all the packages that we have, makes R very powerful and capable of performing a broad range of tasks and techniques.
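Here is a minimal sketch of the two approaches described above. The 25 numbers are randomly generated stand-ins, and `summarize_spread` is a hypothetical name chosen for this example. The `...` argument is one way (an assumption about what is meant by using one function's parameters inside another) to hand extra arguments on to the nested calls.

```r
# Spot analysis: put 25 numbers in a vector and ask for the statistics
set.seed(42)
values <- round(runif(25, min = 0, max = 100))
mean(values)
sd(values)

# Repeatable approach: a function that takes a vector and returns both statistics.
# Anything supplied through `...` is handed on to mean() and sd(), e.g. na.rm = TRUE.
summarize_spread <- function(x, ...) {
  c(mean = mean(x, ...), sd = sd(x, ...))
}

summarize_spread(values)
summarize_spread(c(values, NA), na.rm = TRUE)  # missing value is dropped by both calls
```

The function can now be called on any number of data sets, which is the point of writing a program rather than repeating the one-off commands.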
R has several publicly available sets of real world data. Now, this is useful for me especially when I’m learning the language. When I’m creating routines and functions in scripts and I want some data to test, these data sets come in very handy.
Let’s talk a little more about R’s extensibility. The base functionality of the R language is enabled through packages such as base, stats, graphics, et cetera. When you download and install the language on your computer, you will find that these packages come automatically installed and loaded. CRAN, the Comprehensive R Archive Network, makes available to you several packages like MASS and lattice. When you download and set up R on your computer, these packages come installed but they don’t come loaded. These are optional. When you want the functionality associated with these, you can go and turn them on and you’ll have it available.
There’s a whole host of contributed packages. I mentioned that in the previous slide. These are available to you. They’re not installed when you download R and set it up on your computer, so you’ll have to go into R and install them. Once you install them, you’ll also have to load them if you want to use them. The full list of all of this is available on the CRAN R project website.
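The three tiers of packages can be sketched at the console roughly as follows. This assumes a standard R installation where the recommended package MASS is present; `somepackage` is a placeholder rather than a real package name, so those lines are commented out.

```r
# Packages attached in the current session (base, stats, graphics, ...)
search()

# MASS ships with a standard R installation but is not attached until requested
library(MASS)
"package:MASS" %in% search()

# A contributed package must be installed once from CRAN and then loaded.
# "somepackage" is a placeholder name, so these lines are left commented out.
# install.packages("somepackage")
# library(somepackage)
```

Installing is a one-time step per machine, while `library()` has to be called again in each new R session.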
I mentioned a couple of times how powerful R is, and the range of techniques and functions that it has to offer. The flip side of that is that the documentation associated with some of these may not be as good as for others. Specifically for me, when I’m looking at the contributed packages, it is not always clear what the packages are, what they do, in what situations I would use them, and how I would use them. These things, to some extent, you’ll have to feel your way around. If you’re an intro user, and even to some extent an advanced user, you will find that you can still do a lot of your work using packages that come with the software. If you want to use some of these other packages, then you’ll have to feel your way around them.
The other thing I want to say is that there is a lot of chatter on social media, in blogs, in the cloud, about top R packages and so on. Do feel free to Google that. Later, I’m also going to show a discussion thread about packages and which ones are popular.
Let’s quickly go into the R software and look at packages. When I load R, this window comes up. If I want to look at what packages I already have, I go to the packages menu and look at package manager. As I mentioned before, there’s a whole host of packages here. Some of these are already loaded: things like datasets, graphics, methods, stats, et cetera. These are your base packages and these enable the basic R language.
I can extend that by loading packages such as MASS. MASS, for example, is not a base package but it does come when you install R. It comes installed but you have to load it if you want to use it. Now to the contributed packages – if I go to package installer, I can find a whole series of package repositories, and if I get the list of packages in there, as you can see, there are a lot of packages listed here. The list just goes on and on. Like I mentioned before, it’s not exactly clear to me, just looking at this, what a particular package does, why it should be used, and how I would go about inputting and outputting data from it.
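For anyone working at the console rather than through the menus, something like the following gives a comparable view of what is installed. This is only a sketch: `pkgs` and `base_pkgs` are names made up here, and the repository query is commented out because it needs a network connection and a configured CRAN mirror.

```r
# One row per installed package; Priority is "base", "recommended", or NA
pkgs <- installed.packages()[, c("Package", "Version", "Priority")]
head(pkgs)

# Names of the packages that make up base R
base_pkgs <- rownames(pkgs)[which(pkgs[, "Priority"] == "base")]
base_pkgs

# Count of packages offered by the configured repository (needs network access)
# nrow(available.packages())
```

Packages whose priority is missing are the contributed ones that were installed separately.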
That’s R packages. I also mentioned that social media thread, so hold on a second. This is a discussion on LinkedIn. There was a question asked about what the best packages are. You could have a look through this. The URL is right here. You can also find it when you Google “best R packages.” The listing comes up. Have a look. There are lots of comments about the popular packages, and you can get a good feel for that from here.
Next – data. The R environment provides a whole host of data sets of publicly available real world data. Some examples are listed here. In addition, if you want to get other data sets, you can go and comb through all the databases out there, universities, gambling institutions, census boards, et cetera. Some of them are listed here along with the urls. Feel free to go and have a look.
I want to quickly show you the data sets already available that come installed when you download R and set it up on your computer. I’m going to show you some examples there.
So back to R. If I click on the data manager, it pulls up the data sets. As an example, here is one of my favorites. This is data concerning flowers, with their petal and sepal lengths. Further down the course, when we’re looking into statistics and some graphics, I’m going to use this repeatedly.
When I click on a data set, the description of the data set is down below. Data format, columns, et cetera, all of that is explained in there as well.
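The same information can be reached from the console. The sketch below assumes the flower data set mentioned above is the built-in `iris` data from the datasets package, which matches the description but is not named in the transcript.

```r
# List the data sets that ship in the datasets package
data(package = "datasets")

# The iris data: flower measurements for three species
head(iris)                   # first six rows
str(iris)                    # column names and types
summary(iris$Petal.Length)   # quick statistics for one column

# The help page describes the data format and columns
# (commented out because it opens a viewer)
# ?iris
```

Because the datasets package is loaded by default, `iris` can be used directly without importing anything.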
Next! Some applications so far. I have some key words here, and these are areas in which R is used. But really, if data needs to be analyzed, R usually finds a home there. I have about three or four examples of studies that have been conducted. I want to just run them quickly by you to show you the kind of things that people are doing out there.
First one – this is an example of R being used in genetics. The author is trying to show a conundrum known as gene suppression in certain species of animal. So what he’s done is use a bar plot to show that in these species that particular gene is selectively suppressed compared to the other genes, without going into too much detail.
Another example – this one is from astronomy and nuclear physics. Dark matter – this is an example of a study where the author has plotted the effect of dark matter on galaxies. Again, without going into too much detail – as you can see in the plots, the little white specks are the galaxies, and the clusters behind them, with the blue coloring, represent dark matter.
This study is about code ratability. It looks at ratings given by students of different levels, and the result is a box plot that the author created to show the ratings that students of different levels gave to the same code snippets.
Like I said, highly diverse. I showed you three areas totally unrelated but R essentially finds a home in all these different areas.
Now for R resources. My favorite blog is R-bloggers.com. Let’s quickly go and have a look. R-bloggers.com is a blog aggregator. It’s an aggregate of all the R blogs out there. The list of contributing blogs is also available to you. Like any good blog aggregator, it lists posts from various blogs in reverse chronological order. So I can go and look at this. It really gives you a feel for the different kinds of things that people are working on using R.
So LinkedIn groups – I have a couple of names there. The LinkedIn thread that I showed earlier regarding top R packages is from the R Project for Statistical Computing group. Go ahead, have a look, join these groups, participate, contribute.
Finally, R resources. So the standard R project – lots and lots of information there. As for the R manuals, I would encourage you to download and use them. They serve as a good foundation and a reference for any point in the future when you want to go and learn more. You should go ahead and get those documents.
That brings us to the end of this chapter. Questions? Comments? Use the forum. See you next time!