CITS4009 - R Language Notes

https://r4ds.had.co.nz/index.html
library("swirl")
swirl()

1.Basic Building Blocks
2.Workspace and Files

getwd()  # Determine which directory your R session is using as its current working directory
ls()  # List all the objects in your local workspace
dir()  # List all the files in your working directory; list.files() does the same
args(list.files)  # args() on a function name is a handy way to see what arguments the function can take
old.dir<-getwd()  # Assign the value of the current working directory to a variable called "old.dir"
dir.create("testdir")  # Create a directory called "testdir" in the current working directory
setwd("testdir")  # Set your working directory to "testdir"
file.create("mytest.R")  # Create a file called "mytest.R" in your working directory
list.files()
file.exists("mytest.R")  # Check whether "mytest.R" exists in the working directory
file.info("mytest.R")  # Access information about the file "mytest.R"
file.rename("mytest.R","mytest2.R")
file.copy("mytest2.R","mytest3.R")
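file.path("mytest3.R")  # Give the relative path to the file "mytest3.R"
file.path("folder1","folder2")  # file.path() builds file and directory paths that do not depend on the operating system running the R code; passing "folder1" and "folder2" creates a platform-independent path name
dir.create(file.path("testdir2", "testdir3"), recursive = TRUE)  # Create a directory called "testdir2" with a subdirectory called "testdir3" in the current working directory
setwd(old.dir)  # Go back to your original working directory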
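The same file functions can be tried without changing the real working directory. A minimal sketch, assuming a throwaway folder under tempdir(); the names scratch, f1 and f2 are made up for illustration:

```r
# Work inside a throwaway directory so the real workspace is untouched
scratch <- file.path(tempdir(), "testdir_demo")   # hypothetical folder name
dir.create(scratch, recursive = TRUE, showWarnings = FALSE)

f1 <- file.path(scratch, "mytest.R")
f2 <- file.path(scratch, "mytest2.R")

created    <- file.create(f1)       # TRUE if the file was created
renamed    <- file.rename(f1, f2)   # mytest.R becomes mytest2.R
exists_old <- file.exists(f1)       # the old name is gone after the rename
exists_new <- file.exists(f2)       # the new name is present

unlink(scratch, recursive = TRUE)   # remove the throwaway directory
```

Building every path with file.path() keeps the snippet independent of the operating system.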
3.Sequences of Numbers
1:20
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
pi:10
[1] 3.141593 4.141593 5.141593 6.141593 7.141593 8.141593 9.141593
15:1
[1] 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
?`:`  # Open the help page for the colon operator; the backtick key is above Tab
seq(1,20)  # This gives us the same output as 1:20
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
seq(0,10,by=0.5)  # A vector of numbers ranging from 0 to 10, incremented by 0.5
[1] 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0
my_seq<-seq(5,10,length=30)  # 30 evenly spaced numbers from 5 to 10
length(my_seq)
[1] 30
1:length(my_seq)  # seq(along.with = my_seq) and seq_along(my_seq) give the same result
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
seq(along.with = my_seq)
seq_along(my_seq)
rep(0,times=40)  # A vector that contains 40 zeros
rep(c(0,1,2),times=10)  # A vector containing 10 repetitions of the vector (0, 1, 2)
[1] 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2
rep(c(0,1,2),each=10)  # Each element repeated 10 times in turn
[1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2

4.Vectors (the simplest and most common data structure in R)
Vectors come in two different flavors: atomic vectors and lists. An atomic vector contains exactly one data type, whereas a list may contain multiple data types. We'll explore atomic vectors further before we get to lists.
num_vect<-c(0.5,55,-10,6)  # Create a numeric vector num_vect that contains the values 0.5, 55, -10, and 6
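The difference between an atomic vector and a list is easiest to see when the elements have mixed types. A small sketch; the names atomic and mixed are made up for illustration:

```r
# c() builds an atomic vector: mixing numbers and text coerces everything to character
atomic <- c(0.5, 55, "a")
class(atomic)

# list() keeps each element's own type
mixed <- list(0.5, 55, "a")
sapply(mixed, class)
```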
tf<-num_vect<1  # Create a variable called tf that gets the result of num_vect < 1 (a logical vector)
my_char<-c("My","name","is")  # my_char is a character vector of length 3
paste(my_char)  # Without collapse, paste() returns the three elements unchanged
paste(my_char,collapse=" ")  # Join the elements of my_char into one continuous string, separated by single spaces
my_name<-c(my_char,"Yuchen")  # Add (or 'concatenate') your name to the end of my_char
paste(1:3, c("X", "Y", "Z"), sep = "")  # Join the integer vector 1:3 with the character vector c("X", "Y", "Z"), element by element
[1] "1X" "2Y" "3Z"
paste(LETTERS, 1:4, sep = "-")  # LETTERS is a predefined variable in R containing a character vector of all 26 letters in the English alphabet; 1:4 is recycled to match its length
 [1] "A-1" "B-2" "C-3" "D-4" "E-1" "F-2" "G-3" "H-4"
 [9] "I-1" "J-2" "K-3" "L-4" "M-1" "N-2" "O-3" "P-4"
[17] "Q-1" "R-2" "S-3" "T-4" "U-1" "V-2" "W-3" "X-4"
[25] "Y-1" "Z-2"

6.Subsetting Vectors
How to extract elements from a vector based on conditions that we specify.
x[1:10]  # View the first ten elements of x
x[is.na(x)]  # A vector of all NAs
y <- x[!is.na(x)]  # All non-missing elements of x
y[y > 0]  # A vector of all the positive elements of y
x[x>0]  # NA > 0 evaluates to NA, so the NAs come back along with the positive values
 [1] NA 0.80810962 0.09681212 NA
 [5] NA 0.25269741 0.80461156 0.22822151
 [9] NA 0.23587581 NA NA
[13] NA NA NA 0.68959613
[17] NA NA NA NA
[21] NA NA NA 0.07462134
[25] NA NA NA NA
[29] 0.07794159 0.44153844 0.26593743
x[!is.na(x)&x>0]  # Only the positive, non-missing values
 [1] 0.80810962 0.09681212 0.25269741 0.80461156
 [5] 0.22822151 0.23587581 0.68959613 0.07462134
 [9] 0.07794159 0.44153844 0.26593743
x[c(3, 5, 7)]  # Subset the 3rd, 5th, and 7th elements of x
x[0]  # Index 0 returns an empty vector
numeric(0)
x[3000]  # An index past the end of the vector returns NA
[1] NA
x[c(-2,-10)]  # All elements of x EXCEPT the 2nd and 10th; x[-c(2, 10)] is equivalent
 [1] -0.99389344 NA 0.80810962 0.09681212
 [5] NA -1.38296312 NA -0.89231747
 [9] 0.80461156 -1.66807572 -0.90054390 0.22822151
[13] -0.03553009 NA 0.23587581 NA
[17] NA -1.28852117 NA NA
[21] NA 0.68959613 NA NA
[25] NA NA NA NA
[29] NA 0.07462134 NA NA
[33] NA NA 0.07794159 0.44153844
[37] 0.26593743 -0.64443232
vect <- c(foo = 11, bar = 2, norf = NA)
vect  # Each element has a name
 foo  bar norf
  11    2   NA
names(vect)  # Get the names of vect
[1] "foo" "bar" "norf"
vect2<-c(11,2,NA)  # Create an unnamed vector vect2
names(vect2) <- c("foo", "bar", "norf")  # Add the names attribute to vect2
identical(vect,vect2)  # Check that vect and vect2 are the same
[1] TRUE
vect["bar"]  # We want the element named "bar"
bar
  2
vect[c("foo", "bar")]  # Specify a vector of names
foo bar
 11   2

7.Matrices and Data Frames
Both represent 'rectangular' data types, meaning that they are used to store tabular data, with rows and columns.
my_vector<-1:20  # Use either c(1, 2, 3) or 1:3; you don't need the c() function when using :
my_vector
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
dim(my_vector)  # Since my_vector is a vector, it doesn't have a dim attribute (so it's just NULL)
NULL
length(my_vector)
dim(my_vector) <- c(4, 5)  # The dim() function allows you to get OR set the dim attribute of an R object; here the value c(4, 5) is assigned to the dim attribute of my_vector
dim(my_vector)
[1] 4 5
my_vector
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20
attributes(my_vector)  # Show all of the attributes of the my_vector object
$dim
[1] 4 5
class(my_vector)  # Confirm it's actually a matrix
[1] "matrix" "array"
my_matrix<-my_vector
my_matrix2 <- matrix(1:20, nrow=4, ncol=5)  # Create a matrix containing the same numbers (1-20) and dimensions (4 rows, 5 columns)
identical(my_matrix,my_matrix2)  # Confirm that my_matrix and my_matrix2 are actually identical
[1] TRUE
patients<-c("Bill", "Gina", "Kelly", "Sean")
cbind(patients, my_matrix)  # Add the names of our patients to the matrix of numbers
A matrix can only contain one class of data. When we combine a character vector with a numeric matrix, R is forced to 'coerce' the numbers to characters, which is why they appear in double quotes. This is called 'implicit coercion'.
my_data <- data.frame(patients, my_matrix)
my_data
  patients X1 X2 X3 X4 X5
1     Bill  1  5  9 13 17
2     Gina  2  6 10 14 18
3    Kelly  3  7 11 15 19
4     Sean  4  8 12 16 20
The data.frame() function allowed us to store our character vector of names right alongside our matrix of numbers.
class(my_data)
[1] "data.frame"
cnames<-c("patient", "age", "weight", "bp", "rating", "test")  # Create a character vector called cnames that contains the new column names
colnames(my_data) <- cnames  # Use the colnames() function to set the colnames attribute for our data frame
my_data
patient age weight bp rating test
1 Bill 1 5 9 13 17
2 Gina 2 6 10 14 18
3 Kelly 3 7 11 15 19
4 Sean 4 8 12 16 20
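The implicit coercion in cbind() and the way data.frame() avoids it can be checked directly. A reduced sketch with two patients; the names m, as_matrix and as_frame are made up for illustration:

```r
patients <- c("Bill", "Gina")
m <- matrix(1:4, nrow = 2)

as_matrix <- cbind(patients, m)        # a matrix holds one type, so the numbers become character
as_frame  <- data.frame(patients, m)   # each column keeps its own type

typeof(as_matrix)        # storage type of the combined matrix
sapply(as_frame, class)  # class of each data frame column
```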

8.Logic
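& compares the corresponding elements of two vectors one by one, while && works on single values only. Older versions of R silently used just the first element of each vector; R 4.3.0 and later raise an error if either side has length greater than one.
&& short-circuits: if the left operand is FALSE, the right operand is not evaluated.

xor(TRUE,FALSE) returns TRUE; xor() is TRUE when exactly one of its arguments is TRUE, and FALSE otherwise.
any(condition)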
The any() function will return TRUE if one or more of the elements in the logical vector is TRUE. The all() function will return TRUE if every element in the logical vector is TRUE.
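The operators and helpers above can be compared side by side. A minimal sketch; the vectors a and b are made up for illustration:

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

elementwise <- a & b     # compares position by position
single <- a[1] && b[1]   # && takes one value on each side

# The right-hand side is never evaluated because the left side is FALSE
short <- FALSE && stop("never evaluated")

xor(TRUE, FALSE)   # exactly one argument is TRUE
xor(TRUE, TRUE)    # both are TRUE
any(a)             # at least one element is TRUE
all(a)             # every element is TRUE?
```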

9.Functions
10.lapply and sapply

unique(c(3,4,5,5,5,6,6))  # Remove all duplicate elements
[1] 3 4 5 6
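lapply() always returns a list, while sapply() simplifies the result to a vector when it can. A small sketch that applies unique() to each element of a made-up list called cols:

```r
cols <- list(a = c(3, 4, 5, 5), b = c(6, 6), c = c(1, 2, 3))

as_list <- lapply(cols, unique)                        # a list of unique values per element
counts  <- sapply(cols, function(v) length(unique(v))) # a named vector of counts
```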
11.vapply and tapply
tapply(flags$population, flags$red, summary)  # Look at a summary of population values (in round millions) for countries with and without the color red on their flag
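The flags data comes with the swirl lesson, so here is a sketch of the same idea on the built-in mtcars data instead (an assumption made only so the example runs anywhere): tapply() splits one vector by the groups in another, and vapply() works like sapply() but requires the result type to be stated.

```r
data(mtcars)

# Mean miles per gallon for each number of cylinders
mean_mpg <- tapply(mtcars$mpg, mtcars$cyl, mean)

# Column means, with the expected result type declared as one number per column
col_means <- vapply(mtcars[c("mpg", "hp")], mean, numeric(1))
```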
12.Looking at Data
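class(XXX)  # Class of the object XXX
[1] "data.frame"
dim(XXX)  # Dimensions of XXX
[1] (number of rows) (number of columns)
nrow(XXX)
[1] (number of rows)
ncol(XXX)
[1] (number of columns)
object.size(XXX)  # Memory occupied by the object XXX
745944 bytes
names(XXX)  # A character vector of column names
head(XXX)  # Preview the top of the dataset (first six rows by default)
head(XXX,10)  # First 10 rows
tail(plants,15)  # Last 15 rows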
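The same inspection steps can be run on any data frame. A sketch using the built-in iris data in place of XXX (the choice of iris is an assumption for illustration):

```r
data(iris)

class(iris)          # kind of object
dim(iris)            # rows, then columns
n_rows <- nrow(iris)
n_cols <- ncol(iris)
names(iris)          # column names
top    <- head(iris)       # first six rows by default
bottom <- tail(iris, 15)   # last 15 rows
```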
15. Base Graphics
data(cars)  # Load the included data frame cars
plot(x = cars$speed, y = cars$dist)
plot(x = cars$speed, y = cars$dist, xlab = "Speed", ylab = "Stopping Distance")
plot(cars, main = "My Plot")  # Plot title "My Plot" (above the plot)
plot(cars, sub = "My Plot Subtitle")  # Subtitle "My Plot Subtitle" (below the plot)
plot(cars,col=2)  # Plotted points are colored red
plot(cars,xlim=c(10,15))  # Limit the x-axis to 10 through 15
plot(cars,pch=2)  # Plot cars using triangles
data(mtcars)
boxplot(formula = mpg ~ cyl,data=mtcars)
boxplot(formula, data = NULL, ..., subset, na.action = NULL,
xlab = mklab(y_var = horizontal),
ylab = mklab(y_var =!horizontal),
add = FALSE, ann = !add, horizontal = FALSE,
drop = FALSE, sep = ".", lex.order = FALSE)
boxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE,
notch = FALSE, outline = TRUE, names, plot = TRUE,
border = par("fg"), col = "lightgray", log = "",
pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5),
ann = !add, horizontal = FALSE, add = FALSE, at = NULL)
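boxplot() also returns the numbers behind the picture, which helps when reading the plot. A sketch using plot = FALSE from the usage above so that no graphics device is needed; the name bp is made up for illustration:

```r
data(mtcars)

# Compute the box statistics for mpg grouped by cyl without drawing anything
bp <- boxplot(mpg ~ cyl, data = mtcars, plot = FALSE)

bp$names   # group labels
bp$n       # number of observations per group
bp$stats   # one column per group: lower whisker, lower hinge, median, upper hinge, upper whisker
```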
Week2
x<-rnorm(100)  # Generate 100 random numbers from the standard normal distribution
x<-rnorm(100,3,4)  # Generate 100 random numbers with mean 3 and standard deviation 4
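A quick check that the second and third arguments of rnorm() really are the mean and standard deviation. A sketch with a larger sample and a fixed seed; the seed value and sample size are arbitrary choices:

```r
set.seed(1)                           # arbitrary seed, only for reproducibility
x <- rnorm(10000, mean = 3, sd = 4)   # same call shape as rnorm(100, 3, 4), more draws

mean(x)   # close to 3
sd(x)     # close to 4
```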

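getwd()  # Show the working directory
setwd("path")  # Set the working directory
custdata<-read.table('custdata.tsv',header=T,sep='\t')  # Import the file
mpg[,c(1,2,5)]  # Select columns 1, 2 and 5
str(custdata)  # Show which variables the data frame contains and the type of each
summary(custdata)  # Descriptive statistics: minimum, maximum, quartiles and mean for numeric variables, and frequency counts for factors and logical vectors
summary(custdata$income)
sd(custdata$income)  # Standard deviation of income (sd = standard deviation)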
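hist(x, ...) draws a histogram of the data. Main arguments:
x: a vector of the values to plot
breaks: can be given in several ways:

a vector giving the breakpoints
a single number giving the number of intervals; R works out the interval widths itself and treats the number as a suggestion
freq: logical, default TRUE, so the y-axis shows the count in each interval; FALSE shows densities, scaled so that the total area of the histogram is one (this equals count / total only when the interval width is 1)
probability: logical, the opposite of freq; TRUE means densities, FALSE means counts
labels: labels = c("A", "B", "C") sets the labels shown above each bar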
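The link between counts and densities can be checked on the object that hist() returns. A sketch with made-up data and plot = FALSE so that nothing is drawn:

```r
x <- c(1, 2, 2, 3, 3, 3, 8, 9)

# Two intervals with explicit breakpoints: (0, 5] and (5, 10]
h <- hist(x, breaks = c(0, 5, 10), plot = FALSE)

h$counts                                 # what freq = TRUE would plot
h$density                                # what freq = FALSE would plot
area <- sum(h$density * diff(h$breaks))  # densities times widths add up to one
```

Here each density is the count divided by (total number of values times interval width), which is why the bars of a density histogram have total area one.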

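hist(custdata$age); title('Distribution of age',xlab='age')  # The added title and axis label overlap the ones hist() has already drawn
hist(custdata$age,main='Distribution of age',xlab='age')  # Better: set the title 'Distribution of age' and the x-axis label 'age' in the call itself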
ggplot(data=custdata)+geom_histogram(mapping=aes(x=age),binwidth=5,fill="grey")+geom_density(aes(x=age))
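data: the data frame that holds the data to plot; it becomes the plotting environment, so you no longer need lots of $ to extract vectors from the data.frame
aes: any setting that follows the data values, and so has to be specified element by element, must be written inside aes()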

Week3

gender<-factor(gender)
levels(gender)  # The distinct values (levels) of gender
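A small sketch of what factor() does to a character vector; the values in gender are made up for illustration:

```r
gender <- c("F", "M", "M", "F", "M")
gender <- factor(gender)

levels(gender)               # the distinct values, sorted alphabetically
codes <- as.integer(gender)  # each element stored as the index of its level
table(gender)                # count per level
```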

WVPlots

Week4

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = class))
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, alpha = class))  # Transparency
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, shape = class))  # Shape

pairs(~displ+cty+hwy,data = mpg)

p+annotate("text",x=4,y=40,label="XXX")  # p is an existing plot; adds the text XXX at position (4, 40)

library(gridExtra)
grid.arrange(p1,p2,ncol=2)

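a %in% b  # Whether each element of a is in b
mean_atar <- aggregate(df$atar, list(df$degree), mean)  # Mean of atar for each degree
aggregate(x, by, FUN)  # x is the data object to collapse, by is a list of grouping variables, FUN is the function to apply
merge(df, mean_atar, by.x="degree", by.y="Group.1")  # Merge the data frames df and mean_atar; the join column is called degree in df and Group.1 in mean_atar, hence by.x and by.y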
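The aggregate-then-merge pattern can be tried on a toy data frame. A sketch with made-up degrees and ATAR values; the renamed column mean_atar is also an assumption added for readability:

```r
# Hypothetical data: two degrees, two students each
df <- data.frame(
  degree = c("CS", "CS", "Math", "Math"),
  atar   = c(90, 80, 95, 85)
)

in_df <- "CS" %in% df$degree   # membership test

# aggregate() names its result columns Group.1 (the group) and x (the value)
mean_atar <- aggregate(df$atar, list(df$degree), mean)
names(mean_atar)[2] <- "mean_atar"

# Attach each degree's mean to every student row of that degree
merged <- merge(df, mean_atar, by.x = "degree", by.y = "Group.1")
```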