R | Mark Goldberg

Dplyr - reference

Models - reference

Linear regression, Test linear model, One-way ANOVA

Linear regression

lm(y ~ x1 + x2 + x3)     # multiple linear regression
lm(log(y) ~ x)           # log-transformed response
lm(sqrt(y) ~ x)          # square-root-transformed response
lm(y ~ log(x))           # log-transformed predictor (field)
lm(log(y) ~ log(x))      # both response and predictor are log-transformed
lm(y ~ .)                # use all other fields as predictors
lm(y ~ x + 0)            # forced zero intercept
lm(y ~ x*k)              # main effects of x and k plus their interaction
lm(y ~ x + k + x:k)      # the same model as x*k, written out term by term
lm(y ~ (x + k + .
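A few of the formula variants above can be tried on R's built-in `mtcars` data. This is only a sketch: the data set and the choice of `mpg`, `wt` and `hp` as stand-ins for `y`, `x` and `k` are assumptions made for illustration and do not come from the original post.

```r
# Assumption: mtcars columns stand in for y (mpg), x (wt) and k (hp).
fit_add  <- lm(mpg ~ wt + hp, data = mtcars)          # multiple linear regression
fit_star <- lm(mpg ~ wt * hp, data = mtcars)          # expands to wt + hp + wt:hp
fit_long <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)  # the same model, written out
fit_zero <- lm(mpg ~ wt + 0, data = mtcars)           # intercept forced to zero

# The two interaction formulas yield identical coefficients.
print(names(coef(fit_star)))
print(all.equal(coef(fit_star), coef(fit_long)))
```

Comparing `fit_star` and `fit_long` is a quick way to confirm that `x*k` is shorthand for `x + k + x:k`.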

R graphic - reference

R tidyverse package - reference

ANOVA

One-way ANOVA, Sources

One-way ANOVA

variance = SS/df, where SS is the sum of squares and df is the degrees of freedom.

$$SS = \sum_{i = 1}^{n} (x_{i} - \mu)^{2}$$

where $\mu$ is the sample mean and $n$ is the sample size.

$$var(x) = \frac{1}{n - 1} \sum_{i = 1}^{n} (x_{i} - \mu)^{2}$$

Here df = n - 1, because the mean is estimated from the same sample.

3 groups of students with scores (1-100):

a = c(82,93,61,74,69,70,53)
b = c(71,62,85,94,78,66,71)
c = c(64,73,87,91,56,78,87)

SST = SSE + SSC = W + B, where
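The decomposition can be checked numerically on the three groups of scores. The sketch below assumes that W (SSE) is the within-group sum of squares and B (SSC) is the between-group sum of squares; the variable names are illustrative and not taken from the original post.

```r
groups <- list(
  a = c(82, 93, 61, 74, 69, 70, 53),
  b = c(71, 62, 85, 94, 78, 66, 71),
  c = c(64, 73, 87, 91, 56, 78, 87)
)
scores <- unlist(groups)
grand_mean <- mean(scores)

# Total sum of squares: every score against the grand mean.
sst <- sum((scores - grand_mean)^2)

# W (assumed to be SSE): each score against its own group mean.
ss_within <- sum(sapply(groups, function(g) sum((g - mean(g))^2)))

# B (assumed to be SSC): each group mean against the grand mean, weighted by group size.
ss_between <- sum(sapply(groups, function(g) length(g) * (mean(g) - grand_mean)^2))

print(c(SST = sst, W = ss_within, B = ss_between))

# Cross-check against R's built-in one-way ANOVA.
df_scores <- data.frame(
  score = scores,
  group = factor(rep(names(groups), lengths(groups)))
)
print(summary(aov(score ~ group, data = df_scores)))
```

The "Sum Sq" column of the `aov` summary should show the same between-group and within-group values.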

Determining the optimal number of clusters

Elbow method, Average silhouette method, Gap statistic method, Using NbClust, Sources

Elbow method

The basic idea is to minimize the total intra-cluster variation, or total within-cluster sum of squares (WSS). A plot of WSS against the number of clusters shows how WSS falls as the number of clusters increases. The optimal number of clusters is the point at which adding another cluster no longer reduces the total WSS by much. The optimal number of clusters can be defined as follows:
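A minimal sketch of the elbow method with base R follows. It assumes k-means clustering on the scaled numeric columns of the built-in `iris` data and a range of 1 to 10 clusters; these choices are illustrative and not taken from the original post.

```r
set.seed(42)                      # kmeans() starts from random centers
x <- scale(iris[, 1:4])           # assumption: iris as stand-in data

ks <- 1:10
# Total within-cluster sum of squares for each candidate number of clusters.
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares (WSS)")
```

The "elbow" is the value of k after which the curve flattens, so reading it off the plot remains a judgment call.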

How to split data into train and test subsets?

Here you will learn approaches to splitting your data into train and test subsets for modeling.
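One common approach is a simple random split with base R, sketched below. The built-in `mtcars` data and the 80/20 ratio are assumptions made for illustration.

```r
set.seed(123)                              # make the random split reproducible
n <- nrow(mtcars)                          # assumption: mtcars as stand-in data

# Draw 80% of the row indices for training; the remaining rows form the test set.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

print(c(train = nrow(train), test = nrow(test)))
```

Indexing with `-train_idx` guarantees that no row appears in both subsets.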

Hypothesis testing

Hypothesis testing.

Simple Markov process

Here, we will consider a simple example of a Markov process with an implementation in R. The following example is taken from Bodo Winter's website. A Markov process is characterized by (1) a finite set of states and (2) fixed transition probabilities between the states. Let's consider an example. Assume you have a classroom with students who can be either in the state alert or in the state bored. At any given time point, there's a certain probability of an alert student becoming bored (say 0.
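The two-state process described above can be sketched in a few lines of R. The transition probabilities below are hypothetical placeholders, since the excerpt is cut off before it gives the actual values; the starting distribution and the number of steps are assumptions as well.

```r
# Hypothetical transition probabilities; rows are the current state,
# columns are the next state, and each row sums to 1.
P <- matrix(c(0.80, 0.20,
              0.25, 0.75),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("alert", "bored"), c("alert", "bored")))

# Assumption: every student starts out alert.
state <- matrix(c(1, 0), nrow = 1, dimnames = list(NULL, c("alert", "bored")))

# Propagate the distribution over states for 20 time steps.
for (step in 1:20) {
  state <- state %*% P
}
print(state)
```

After enough steps the distribution stops changing, which is the stationary distribution implied by `P`.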

Spline model

Practical example showing how to generate a data set from a given function, split the data, build a spline model on the train data, and use the test data to find the optimal parameters of the model.
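The workflow described above might look like the following sketch. It assumes `sin(x)` plus Gaussian noise as the generating function, a natural cubic spline from the `splines` package fitted with `lm()`, and the spline degrees of freedom as the parameter to tune; none of these specifics come from the original post.

```r
library(splines)
set.seed(1)

# Assumption: sin(x) with Gaussian noise is the generating function.
x <- sort(runif(200, 0, 10))
dat <- data.frame(x = x, y = sin(x) + rnorm(200, sd = 0.3))

# Random split: 150 rows for training, 50 for testing.
train_idx <- sample(nrow(dat), 150)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Fit a natural spline for each candidate df and score it on the test set.
dfs <- 2:12
test_mse <- sapply(dfs, function(d) {
  fit <- lm(y ~ ns(x, df = d), data = train)
  mean((test$y - predict(fit, newdata = test))^2)
})

best_df <- dfs[which.min(test_mse)]
print(best_df)
```

Choosing the parameter on the test set, as the description suggests, means the reported test error is optimistic; a separate validation set or cross-validation would avoid that.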