Commit


Commit MetaInfo

Revision: 99d97bde6db7c8945ce8b535b76b07d3a321c020 (tree)
Time: 2018-04-26 05:38:19
Author: Lorenzo Isella <lorenzo.isella@gmai...>
Committer: Lorenzo Isella

Log Message

I did some extra work on the presentation.

Change Summary

Incremental Difference

diff -r b781333e314d -r 99d97bde6db7 latex-documents/R-intro-code.Rnw
--- a/latex-documents/R-intro-code.Rnw Wed Apr 25 22:37:39 2018 +0200
+++ b/latex-documents/R-intro-code.Rnw Wed Apr 25 22:38:19 2018 +0200
@@ -132,7 +132,7 @@
 % Compile with Rscript -e "library(knitr); knit('./R-intro-code.Rnw')"


-\title{Introduction to R}
+\title{Introduction to R (with Hands on Applications!)}
 \framesubtitle{A researcher's perspective}
 \author{ {Lorenzo Isella}}
 \institute{DG TRADE, G2, Chief Economist Team}
@@ -171,7 +171,7 @@


 \begin{frame}
-\frametitle{Dulcis in Fundo}
+\frametitle{What to Expect from this Training}
 % \framesubtitle{Test Frame}
 % \subt{An optional subtitle}
 On the other hand, by the end of this training you will
@@ -191,14 +191,15 @@

 \frametitle{Overview of the Training}
 \begin{itemize}
-
- \item We will go through the basics of R (data types, structures,
- functions, etc...)
- \item we will also cover in some detail some fairly advanced topics
- and recently added libraries (tidyverse)
- \item R is evolving rapidly and an introductory training delivered
- 3 years ago would be out of sync with the reality of active users
- in 2018.
+ \item Philosophy of the training: your goal is to get better,
+ faster and more productive at data analysis.
+ \item you are not interested in the 6 different kinds of atomic
+ vectors in R.
+\item So we will go head over heels on the basics and
+ \item plunge into the tidyverse. Tidyverse is a collection of tools
+ for powerful and expressive data analysis and visualisation.
+ \item we will barely scratch the surface of many topics, but you
+ will have an idea of the state-of-the art R for data mining.

 \end{itemize}

@@ -299,7 +300,7 @@
 and computing the average distance (dist = mean(distance, na.rm =
 TRUE)) and arrival delay (delay = mean(arr{\verb|_|}delay, na.rm = TRUE)).

-<< highlight=T, eval=TRUE,message=F>>=
+<< highlight=T, eval=TRUE,message=F >>=
 library(nycflights13)
 library(tidyverse)
 by_tailnum <- group_by(flights, tailnum)
@@ -322,7 +323,9 @@
 \item logical: TRUE, FALSE
 \item complex: 1+4i (complex numbers with real and imaginary parts)
 \end{itemize}
-You can also create your own, but we will not discuss this in these notes.
+You can also create your own data types and/or, but we will not discuss this
+in these notes. Later on, we will meet the tibbles -- the tidyverse
+reinterpretation of the basic R data frames.
 \end{frame}

 \begin{frame}[fragile]
@@ -390,6 +393,7 @@
 \frametitle{Lists 1/3}
 A list generalises the idea of a vector. It can hold items of
 different types. The name tag is optional
+\vspace*{-0.2cm}

 << eval=TRUE, highlight=F>>=
 Lst <- list(name="Fred", wife="Mary",
@@ -496,23 +500,74 @@
 ``\verb|<-|'' to assign values.
 See for instance
 << eval=TRUE, highlight=T>>=
-x <- seq(5)
-x
+x <- c(1,2,3)
 x[2] <- -4
 x
+#and sometimes the puzzling
+y =2
+y= y +7 # new y = old y +7
+y
 @


 \end{frame}

+
 \begin{frame}[fragile]
-\frametitle{Functions in R 1/}
+\frametitle{Mutability of Data Structures -- Small Caveat}
+We saw that ``='' can be used to assign a value. Instead, ``==''
+is a \underline{logical} operator that checks if
+two values/objects are identical.
+See for instance
+<< eval=TRUE, highlight=T>>=
+x = 2
+x
+x == 2
+x == 3
+@


 \end{frame}


 \begin{frame}[fragile]
+\frametitle{Functions in R 1/2}
+A function is defined by an assignment of the form
+<< eval=F, highlight=T >>=
+name <- function(arg_1, arg_2, …) expression
+@
+The expression is an R expression that uses the arguments, arg\verb|_|i, to calculate a value. The value of the expression is the value returned for the function.
+mean(), sum(), cumsum(), c(), are examples of R in-built functions we have
+already met.
+\end{frame}
+
+\begin{frame}[fragile]
+\frametitle{Functions in R 2/2}
+Example functions of one and two variables.
+
+<< eval=T, highlight=T >>=
+
+double <- function(x){ x*2}
+
+double_and_triple <- function(x,y) {c(x*2, y*3) }
+
+a <-7
+b <- 5
+double(a)
+double_and_triple(a,b)
+@
+
+\end{frame}
+
+
+% \begin{frame}[fragile]
+% \frametitle{Functions in R 3/2}
+% A technical remark: functions do \underline{not} modify their own arguments
+
+% \end{frame}
+
+
+\begin{frame}[fragile]
 \frametitle{Data Input and Output in R}
 \begin{itemize}
 \item R provides a number of facilities to import external data in different
@@ -521,21 +576,133 @@
 input/output by Excel. For importing and manipulating data, I recommend the
 tidyverse library.
 \end{itemize}
-If you just use base R, you can do
-<< eval=F, highlight=T>>=
-mydata<-read.csv(''filename.csv'')
-@
-
-or with the tidyverse library
-
 << eval=F, highlight=T>>=
 library(tidyverse)
-mydata<-read_csv(''filename.csv'')
+# read data
+mydata<-read_csv("filename.csv")
+# write data
+write_csv(mydata, "my_output_data.csv")
 @

 \end{frame}


+
+\begin{frame}[fragile]
+\frametitle{Long Computations in R}
+
+R is a functional language, which means that your code often contains a lot of parenthesis, ( and ). When you have complex code, this often will mean that you will have to nest those parentheses together. This makes your R code hard to read and understand.
+<< eval=T, highlight=T>>=
+## generate some arbitrary data
+x<-c(1e4, 1.1e4, 2.3e4, 1.8e4,7e4,4.1e4)
+# Compute the logarithm of `x`, return suitably
+# lagged and iterated differences,
+# compute the exponential function
+# and round the result
+round(exp(diff(log(x))), 1)
+@
+
+\end{frame}
+
+
+% \begin{frame}[fragile]
+% \frametitle{Long Computations in R}
+% Computations can often result in expressions which are hard to read.
+% << eval=T, highlight=T>>=
+% ## generate some arbitrary data
+% x<-c(1e4, 1.1e4, 2.3e4, 1.8e4,7e4,4.1e4)
+% # Compute the logarithm of `x`, return suitably
+% # lagged and iterated differences,
+% # compute the exponential function
+% # and round the result
+% round(exp(diff(log(x))), 1)
+% @
+% Wouldn't it be nice to have a way to express these operations which is
+% easy to read and understand?
+
+
+% \end{frame}
+
+\begin{frame}[fragile]
+\frametitle{Enters the Pipe Operator}
+The pipe operator \verb|%>%| has two fundamental properties
+\begin{enumerate}
+
+\item Function $f(x)$ can be rewritten as $x$ \verb|%>%| $f$
+<< eval=T, highlight=F >>=
+x <- 10
+ # Compute the logarithm of `x`
+log(x)
+x %>% log()
+@
+\item Function $f(x, y)$ can be rewritten as $x$ \verb|%>%| $f(y)$
+
+<< eval=T, highlight=F >>=
+# Round pi
+round(pi, 6)
+pi %>% round(6)
+@
+
+ \end{enumerate}
+
+\end{frame}
+
+\begin{frame}[fragile]
+\frametitle{Why was This Invented at All?}
+
+The pipe operator \verb|%>%| provides you with a number of benefits
+\begin{enumerate}
+\item You'll structure the sequence of your data operations from left to right, as apposed to from inside and out;
+\item You'll avoid nested function calls;
+\item You'll minimize the need for local variables and function definitions; And
+\item You'll make it easy to add steps anywhere in the sequence of operations.
+\end{enumerate}
+<< eval=F, highlight=T >>=
+log(sin(sqrt(x))) # becomes
+x %>% sqrt() %>%
+ sin() %>%
+ log() #much easier to follow!
+@
+
+
+\end{frame}
+
+
+\begin{frame}[fragile]
+\frametitle{Application to the Previous Example}
+This sounds very abstract, but let us see \verb|%>%| in action
+<< eval=T, highlight=T>>=
+library(tidyverse)
+x<-c(1e4, 1.1e4, 2.3e4, 1.8e4,7e4,4.1e4)
+x %>% log() %>%
+ diff() %>%
+ exp() %>%
+ round(1)
+@
+Now you finally understand what is going on. Cleaner code is easier to
+share and extend.
+
+\end{frame}
+
+
+\begin{frame}[fragile]
+\frametitle{Modify a Sequence of Computations}
+Now that the operations are laid out as a sequence, it is much easier to modify them whenever we need to. For instance
+<< eval=T, highlight=T>>=
+# Compute the logarithm of `x`, return suitably
+# lagged and iterated differences,
+# compute the mean
+# and round the result with two digits
+library(tidyverse)
+x %>% log() %>%
+ diff() %>%
+ mean() %>%
+ round(2)
+@
+\end{frame}
+
+
+
 \begin{frame}[fragile]
 \frametitle{Tidyverse and R}
 \begin{itemize}
@@ -549,68 +716,254 @@

 \end{frame}

+
 \begin{frame}[fragile]
-\frametitle{Long Computations in R}
-Computations can often result in expressions which are hard to read.
-<< eval=T, highlight=T>>=
-## generate some arbitrary data
-x<-c(1e4, 1.1e4, 2.3e4, 1.8e4,7e4,4.1e4)
-# Compute the logarithm of `x`, return suitably
-# lagged and iterated differences,
-# compute the exponential function
-# and round the result
-round(exp(diff(log(x))), 1)
+\frametitle{dplyr -- Data Manipulation 1/2}
+dplyr (part of the tidyverse family) is a \underline{grammar of data manipulation}.
+When working with data you must
+\begin{itemize}
+\item Figure out what you want to do.
+\item Describe those tasks in the form of a computer program.
+\item Execute the program.
+\end{itemize}
+The dplyr package makes these steps fast and easy
+\begin{itemize}
+ \item By constraining your options, it helps you think about your data manipulation challenges.
+\item It provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code.
+\item It uses efficient backends, so you spend less time waiting for the computer.
+\end{itemize}
+
+ % filter() to select cases based on their values.
+ % arrange() to reorder the cases.
+ % select() and rename() to select variables based on their names.
+ % mutate() and transmute() to add new variables that are functions of existing variables.
+ % summarise() to condense multiple values to a single value.
+ % sample_n() and sample_frac() to take random samples.
+
+\end{frame}
+
+\begin{frame}[fragile]
+\frametitle{dplyr -- Data Manipulation 2/2}
+dplyr is a grammar because it provides verbs that help you solve the most common data manipulation challenges:
+\begin{itemize}
+\item mutate() adds new variables that are functions of existing variables
+\item select() picks variables based on their names.
+\item filter() picks cases based on their values.
+\item summarise() reduces multiple values down to a single summary.
+\item arrange() changes the ordering of the rows.
+\item group\verb|_|by() which allows you to perform any operation ``by group''.
+\end{itemize}
+This works beautifully with the pipe operator.
+
+\end{frame}
+
+\begin{frame}[fragile]
+\frametitle{Example with Balance of Payment Data}
+<< highlight=T, eval=TRUE,message=F, warning=F >>=
+library(tidyverse)
+df<-read_csv("bop_flow2.csv") %>%
+ {.$Value=as.numeric(.$Value)
+ .}
 @
-Wouldn't it be nice to have a way to express these operations which is
-easy to read and understand?
+
+Let us glimpse at the resulting table (only a few lines are shown)
+
+% df<-read_csv("bop_flow2.csv",col_types = cols(Value = "i"))
+
+
+
+\begin{table}[ht]
+\centering
+\scalebox{0.7}{
+\begin{tabular}{rlllll}
+ \hline
+TIME & GEO & CURRENCY & NACE\_R2 & STK\_FLOW & STK\_FLOW\_LABEL \\
+ \hline
+2016 & EU28 & Million euro & TOTAL & ASS & Assets \\
+ 2016 & EU28 & Million euro & TOTAL & ASS & Assets \\
+ 2016 & EU28 & Million euro & TOTAL & ASS & Assets \\
+ \hline
+\end{tabular}
+}
+\end{table}
+
+\begin{table}[ht]
+\centering
+\scalebox{0.7}{
+\begin{tabular}{rllllr}
+ \hline
+TIME & ENTITY & FDI\_ITEM & FDI\_ITEM\_LABEL & PARTNER & Value \\
+ \hline
+2016 & TOTAL & DO\_\_D\_\_F & Direct investment abroad (DIA) & CH & NA \\
+ 2016 & TOTAL & DO\_\_D\_\_F & Direct investment abroad (DIA) & TR & NA \\
+ 2016 & TOTAL & DO\_\_D\_\_F & Direct investment abroad (DIA) & RU & NA \\
+ \hline
+\end{tabular}
+}
+\end{table}
+
+
+
+


 \end{frame}

 \begin{frame}[fragile]
-\frametitle{Enters the Pipe Operator}
-The pipe operator \verb|%>%| is used to transfer the results of a
-computation in a pipeline.
-<< eval=T, highlight=T>>=
+\frametitle{dplyr Verbs in Action 1/4}
+In 2015, how many million euros did the EU28 (GEO) invest
+(FDI\verb|_|ITEM is DO\verb|_|\verb|_|D\verb|_|\verb|_|F; ENTITY is TOTAL) in manufacture
+(NACE\verb|_|R2 is C) in Japan (PARTNER is JP) as outward net foreign
+direct investment (STK\verb|_|FLOW is NO)?
+<< highlight=T, eval=TRUE,message=F >>=
 library(tidyverse)
-x %>% log() %>%
- diff() %>%
- exp() %>%
- round(1)
+manu_JP <- df %>%filter(TIME==2015, GEO=="EU28",
+ STK_FLOW=="NO",FDI_ITEM=="DO__D__F",
+ ENTITY=="TOTAL",PARTNER=="JP", NACE_R2=="C") %>%
+ select(TIME, GEO, PARTNER, NACE_R2, Value)
+manu_JP
 @
-This leads to cleaner code, easier to understand and debug.
+\end{frame}
+
+
+
+\begin{frame}[fragile]
+\frametitle{dplyr Verbs in Action 2/4}
+And the total FDI to the US for all years
+<< highlight=T, eval=TRUE,message=F >>=
+library(tidyverse)
+FDI_US <- df %>%filter( GEO=="EU28",
+STK_FLOW=="NO",FDI_ITEM=="DO__D__F",
+ENTITY=="TOTAL",PARTNER =="US",NACE_R2=="FDI") %>%
+select(TIME, GEO, PARTNER, NACE_R2, Value)
+FDI_US
+@
+\end{frame}
+
+
+\begin{frame}[fragile]
+\frametitle{dplyr Verbs in Action 3/4}
+And if you want the average FDI to the US along the years
+<< highlight=T, eval=TRUE,message=F >>=
+library(tidyverse)
+FDI_US_mean <- df %>%filter( GEO=="EU28",
+STK_FLOW=="NO",FDI_ITEM=="DO__D__F",
+ENTITY=="TOTAL",PARTNER =="US", NACE_R2=="FDI")%>%
+select(TIME, GEO, PARTNER, NACE_R2, Value) %>%
+summarise(mean_FDI_to_US=mean(Value))
+FDI_US_mean
+@
+\end{frame}
+
+
+\begin{frame}[fragile]
+\frametitle{dplyr Verbs in Action 4/4}
+Now you want to do the same for US and India in one go
+\vspace*{-0.2cm}
+<< highlight=T, eval=TRUE,message=F >>=
+library(tidyverse)
+FDI_US_IN <- df %>%filter( GEO=="EU28",
+STK_FLOW=="NO",FDI_ITEM=="DO__D__F",
+ENTITY=="TOTAL",PARTNER %in% c("US", "IN"),
+NACE_R2=="FDI")%>%
+select(TIME, GEO, PARTNER, NACE_R2, Value) %>%
+group_by(PARTNER) %>%
+summarise(mean_FDI=mean(Value))
+FDI_US_IN
+@
+\end{frame}
+
+
+\begin{frame}[fragile]
+\frametitle{dplyr -- Final Thoughts}
+\begin{itemize}
+ \item we barely scratched the surface of dplyr
+ \item but we have already seen filter, selection of columns and
+ computing statistics on groups of variables
+ \item thanks to the pipe operator, most of the code that you write
+ is reusable and readable
+ \item you do not worry about cells, indexes etc..., but you think
+ more about the questions you want to pose to your data.
+\end{itemize}
+\end{frame}
+
+\begin{frame}[fragile]
+\frametitle{Tidy Data}
+The tidyverse is named after the tidy data format. In tidy data
+\begin{enumerate}
+\item Each variable forms a column.
+\item Each observation forms a row.
+\item Each type of observational unit forms a table.
+\end{enumerate}
+
+Tidy data makes it easy for an analyst or a computer to extract needed
+variables because it provides a standard way of structuring a
+dataset. You do not need different strategies to extract different variables.
+The FDI flow data set was cast in a tidy format.
+
+Every time you have a data set with the year on the horizontal axis,
+you are sure that the data set is messy (not tidy).

 \end{frame}

 \begin{frame}[fragile]
-\frametitle{Reasons to use the pipe operator}
-
+\frametitle{Tidying Messy Datasets}
+Real data sets are often messy in every conceivable way, e.g.
+\begin{itemize}
+\item Column headers are values, not variable names.

-\begin{enumerate}
-\item You'll structure the sequence of your data operations from left to right, as apposed to from inside and out;
-\item You'll avoid nested function calls;
-\item You'll minimize the need for local variables and function definitions; And
-\item You'll make it easy to add steps anywhere in the sequence of operations.
-\end{enumerate}
+\item Multiple variables are stored in one column.
+
+\item Variables are stored in both rows and columns.
+
+\item Multiple types of observational units are stored in the same table.
+
+\item A single observational unit is stored in multiple tables.
+\end{itemize}
+Tidying messy data sets is in itself a large topic; we'll focus only
+on one example in the following.
 \end{frame}

 \begin{frame}[fragile]
-\frametitle{Enters the Pipe Operator Again}
-It is straightforward to modify the previous sequence of operations
-<< eval=T, highlight=T>>=
-# Compute the logarithm of `x`, return suitably
-# lagged and iterated differences,
-# compute the mean
-# and round the result with two digits
+\frametitle{Column headers are values, not variable names}
+This is one of the most common cases. See for instance some data about
+income and religion in the US
+<< highlight=T, eval=TRUE,message=F >>=
 library(tidyverse)
-x %>% log() %>%
- diff() %>%
- mean() %>%
- round(2)
+pew <-read_csv("income_religion.csv")
 @
+\begin{table}[ht]
+\centering
+\scalebox{0.7}{
+\begin{tabular}{lrrrrrr}
+ \hline
+religion & $<$\$10k & \$10-20k & \$20-30k & \$30-40k & \$40-50k & \$$>$50k \\
+ \hline
+Agnostic & 27 & 34 & 60 & 81 & 76 & 137 \\
+ Atheist & 12 & 27 & 37 & 52 & 35 & 70 \\
+ Buddhist & 27 & 21 & 30 & 34 & 33 & 58 \\
+ Catholic & 418 & 617 & 732 & 670 & 638 & 1116 \\
+ Don’t know & 15 & 14 & 15 & 11 & 10 & 35 \\
+ Evangelical & 575 & 869 & 1064 & 982 & 881 & 1486 \\
+ Hindu & 1 & 9 & 7 & 9 & 11 & 34 \\
+ Historically black & 228 & 244 & 236 & 238 & 197 & 223 \\
+ Jehovah's withnesses & 20 & 27 & 24 & 24 & 21 & 30 \\
+ Jewish & 19 & 19 & 25 & 25 & 30 & 95 \\
+ \hline
+\end{tabular}
+}
+\end{table}
+
+This dataset has three variables, religion, income class and frequency.
+Religion and income class are non-variable columns, whereas the
+frequency is the only value.
 \end{frame}

+\begin{frame}[fragile]
+\frametitle{Tidying the Data Set 1/}
+To tidy the pew data set, we need to \underline{gather} the non-variable columns into a two-column key-value pair.

+\end{frame}

 \begin{frame}[fragile]
 \frametitle{Linear Models in R}