# Processing data with the tidyverse package
## R + Social Sciences
### Pablo Tiscornia

---

class: inverse, middle, center

# What is [Tidyverse](https://www.tidyverse.org/)?

***

---
# Tidyverse

.pull-left[
#### `Tidyverse` is a collection of R packages designed for what is known as "data science".
 
#### They share the same usage philosophy, so they work in harmony with one another.
]

.pull-right[
<img src="data:image/png;base64,#../img/tidyverse.png" width="781" style="display: block; margin: auto;" />
]

---

# Why tidyverse?

---
# __Why tidyverse?__

- ### Designed to be read and written by and for human beings

- ### Functions designed not for one specific task but for a whole workflow

- ### Its community, built on the principles of open source and collaborative work

---
# __Installation and use__

* Only once (per computer):
```r
install.packages("tidyverse")
```

* At the start of every R or RStudio session:
```r
library(tidyverse)
```
 
--

_There is no need to install each package separately:_

```r
install.packages("dplyr")
install.packages("tidyr")
install.packages("ggplot2")
```

---
# Roadmap

### Introducing the `dplyr` and `tidyr` packages

## ✔️ dplyr

☑️️ `select()`   ☑️️ `filter()`

☑️️ `mutate()`   ☑️️ `rename()`

☑️️ `arrange()`  ☑️️ `summarise()`

☑️️ `group_by()`

## ✔️ tidyr
    
☑️ `pivot_longer()` ☑️ `pivot_wider()`

<br>

## ✔️ magrittr

☑️ `%>%` (_the pipe_)

***

```r
library(eph)
b_eph_ind <- get_microdata(year = 2019, trimester = 3, type = "individual")
```

---
class: middle, center, inverse
  
  
  
# THE PIPE <img src="../img/pipe.png" alt="The pipe operator" width="7%">

***

_<p style="color:grey;" align="center">A way of writing</p>_

---
# THE PIPE

```r
base_de_datos %>%
  funcion1() %>%
  funcion2() %>%
  funcion3()
```
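The pattern can be checked on a small case. The sketch below uses an invented numeric vector (`valores` is hypothetical, not part of the EPH data) and assumes only that `magrittr` is installed. It shows that a piped chain returns the same value as the equivalent nested call.

```r
library(magrittr)  # provides %>%; library(tidyverse) also makes it available

# Toy vector, invented for this illustration
valores <- c(4, 9, 16, 20)

# Nested call: read from the inside out
anidado <- round(sqrt(sum(valores)), digits = 1)

# Piped call: read from top to bottom; x %>% f(y) is evaluated as f(x, y)
con_pipe <- valores %>%
  sum() %>%
  sqrt() %>%
  round(digits = 1)

identical(anidado, con_pipe)
```

Both forms run the same functions in the same order; the pipe only changes how the steps are written.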

---
# THE PIPE

### **Without THE PIPE:**

```r
# Step2(Step1(base_de_datos$variable))

prop.table(table(b_eph_ind$CH04))
```

```

        1         2 
0.4818711 0.5181289 
```

### **With THE PIPE:**

```r
b_eph_ind$CH04 %>%   # base_de_datos$variable
  table() %>%        # Step 1
  prop.table()       # Step 2
```

```
.
        1         2 
0.4818711 0.5181289 
```

---
# magrittr - a way of writing

### **Case:** I want the relative distribution of cases by sex:

#### Functions:

`table()` - `prop.table()` - `round()`
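A minimal sketch of how these three functions could be chained. The vector `sexo` is an invented stand-in for `b_eph_ind$CH04` (1 = Varón, 2 = Mujer), so the proportions below are not EPH results.

```r
library(magrittr)

# Toy stand-in for b_eph_ind$CH04; the values are invented
sexo <- c(1, 2, 2, 1, 2)

distribucion <- sexo %>%
  table() %>%            # absolute frequencies per category
  prop.table() %>%       # relative frequencies (they add up to 1)
  round(digits = 2)      # keep two decimals

distribucion
```

Adding or dropping a step means adding or dropping one line, without re-nesting parentheses.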

---
class: middle, center, inverse

---
# dplyr

## Functions of the dplyr package:

<br>

| __Function__  | __Action__ |
| :---          | ---:   |
| `select()`    | *selects or discards variables*|
| `filter()`    | *selects rows*|
| `mutate()`    | *creates / edits variables*|
| `rename()`    | *renames variables*|
| `group_by()`  | *splits the data by a variable*|
| `summarize()` | *generates a summary table*|

---

# __select()__

_<p style="color:grey;" align="center">Chooses or discards columns of a dataset</p>_

---
# select()

### The function follows this pattern:

```r
base_de_datos %>% 
* select(id, nombre)
```
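A small, self-contained sketch of the pattern. The data frame `personas` and its columns are invented for illustration. It shows both uses of `select()`: keeping the listed columns and discarding one with a minus sign.

```r
library(dplyr)

# Toy data frame; names and values are invented
personas <- data.frame(
  id     = 1:3,
  nombre = c("Ana", "Luis", "Marta"),
  edad   = c(34, 51, 27)
)

# Keep only the listed columns
elegidas <- personas %>%
  select(id, nombre)

# Discard a column by prefixing its name with a minus sign
sin_edad <- personas %>%
  select(-edad)

colnames(elegidas)
colnames(sin_edad)
```

Both calls return the same two columns here; which form is more convenient depends on whether the keep list or the discard list is shorter.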

---
# **Case**

### - **Indicator 1:** *Main labor market rates for the agglomerates of CABA and Partidos del GBA*

### - **Indicator 2:** *Indicator 1 by the __sex__ and __age__ of the people.*

According to the [record design](https://www.indec.gob.ar/ftp/cuadros/menusuperior/eph/EPH_registro_t318.pdf), the working variables are:

- **Agglomerate of residence** = `AGLOMERADO`

- **Activity status** = `ESTADO`

- **Sex** = `CH04`

- **Age** = `CH06`

- **Weighting factor** = `PONDERA`

---
# **Caso**

### Working libraries and data import:

```r
library(tidyverse)
library(eph)

b_eph_ind <- read.table("entradas/usu_individual_t119.txt",
                        header = TRUE, sep = ";")
```

---
# select() - by variable name

### I select the columns I want from the dataset:

```r
b_eph_ind_seleccion <- b_eph_ind %>% 
  select(ESTADO, CH04, CH06, PONDERA)
```

### Check the operation:

```r
colnames(b_eph_ind_seleccion)
```

```
[1] "ESTADO"  "CH04"    "CH06"    "PONDERA"
```

---
# select() - by column position

```r
b_eph_ind_seleccion <- b_eph_ind %>% 
  select(10, 12, 14, 28)
```

### Check the selection:

```r
colnames(b_eph_ind_seleccion)
```

```
[1] "PONDERA" "CH04"    "CH06"    "ESTADO" 
```

---

.panel1-select_1-auto[

```r
*b_eph_ind
```
]
 
.panel2-select_1-auto[

```
# A tibble: 57,229 x 177
   CODUSU    ANO4 TRIMESTRE NRO_HOGAR COMPONENTE   H15 REGION MAS_500 AGLOMERADO
   <fct>    <int>     <int>     <int>      <int> <int>  <int> <fct>        <int>
 1 TQRMNOQ~  2019         3         1          1     1     43 S                2
 2 TQRMNOQ~  2019         3         1          2     1     43 S                2
 3 TQRMNOQ~  2019         3         1          3     1     43 S                2
 4 TQRMNOQ~  2019         3         1          4     1     43 S                2
 5 TQRMNOQ~  2019         3         1          2     1     43 S                2
 6 TQRMNOQ~  2019         3         1          3     0     43 S                2
 7 TQRMNOQ~  2019         3         1          4     0     43 S                2
 8 TQRMNOQ~  2019         3         1          5     0     43 S                2
 9 TQRMNOS~  2019         3         1          1     1     43 S                2
10 TQRMNOS~  2019         3         1          2     1     43 S                2
# ... with 57,219 more rows, and 168 more variables: PONDERA <int>, CH03 <int>,
#   CH04 <int>, CH05 <fct>, CH06 <int>, CH07 <int>, CH08 <int>, CH09 <int>,
#   CH10 <int>, CH11 <int>, CH12 <int>, CH13 <int>, CH14 <chr>, CH15 <int>,
#   CH15_COD <int>, CH16 <int>, CH16_COD <int>, NIVEL_ED <int>, ESTADO <int>,
#   CAT_OCUP <int>, CAT_INAC <int>, IMPUTA <int>, PP02C1 <int>, PP02C2 <int>,
#   PP02C3 <int>, PP02C4 <int>, PP02C5 <int>, PP02C6 <int>, PP02C7 <int>,
#   PP02C8 <int>, PP02E <int>, PP02H <int>, PP02I <int>, PP03C <int>, ...
```
]

---
count: false
 
# Another way to select
.panel1-select_1-auto[

```r
b_eph_ind %>%
* select(12:16)
```
]
 
.panel2-select_1-auto[

```
# A tibble: 57,229 x 5
    CH04 CH05        CH06  CH07  CH08
   <int> <fct>      <int> <int> <int>
 1     1 12/04/1963    56     2     1
 2     2 24/09/1972    46     2     1
 3     2 14/09/1998    20     1     1
 4     1 11/04/2007    12     5     1
 5     2 03/03/1981    38     2     4
 6     2 17/12/2011     7     5     4
 7     1 10/12/2013     5     5     4
 8     1 27/02/2016     3     5     4
 9     1 15/07/1965    54     3     4
10     1 19/08/2000    19     5     4
# ... with 57,219 more rows
```
]

---
class: inverse, middle, center

## One more.

---

.panel1-select_2-auto[

```r
*b_eph_ind
```
]
 
.panel2-select_2-auto[
]

---
count: false
 
# Another way to select
.panel1-select_2-auto[

```r
b_eph_ind %>%
* select(CH03:CH10)
```
]
 
.panel2-select_2-auto[

```
# A tibble: 57,229 x 8
    CH03  CH04 CH05        CH06  CH07  CH08  CH09  CH10
   <int> <int> <fct>      <int> <int> <int> <int> <int>
 1     1     1 12/04/1963    56     2     1     1     2
 2     2     2 24/09/1972    46     2     1     1     2
 3     3     2 14/09/1998    20     1     1     1     2
 4     3     1 11/04/2007    12     5     1     1     1
 5     2     2 03/03/1981    38     2     4     1     2
 6     3     2 17/12/2011     7     5     4     1     1
 7     3     1 10/12/2013     5     5     4     2     1
 8     3     1 27/02/2016     3     5     4     2     1
 9     1     1 15/07/1965    54     3     4     1     2
10     3     1 19/08/2000    19     5     4     1     1
# ... with 57,219 more rows
```
]

---
class: inverse, middle, center

## One more.

---

.panel1-select_3-auto[

```r
*b_eph_ind
```
]
 
.panel2-select_3-auto[
]

---
count: false
 
# Another way to select
.panel1-select_3-auto[

```r
b_eph_ind %>%
* select(starts_with("CH"))
```
]
 
.panel2-select_3-auto[

```
# A tibble: 57,229 x 16
    CH03  CH04 CH05   CH06  CH07  CH08  CH09  CH10  CH11  CH12  CH13 CH14   CH15
   <int> <int> <fct> <int> <int> <int> <int> <int> <int> <int> <int> <chr> <int>
 1     1     1 12/0~    56     2     1     1     2     0     4     1 <NA>      1
 2     2     2 24/0~    46     2     1     1     2     0     4     2 3         1
 3     3     2 14/0~    20     1     1     1     2     0     7     2 1         1
 4     3     1 11/0~    12     5     1     1     1     2     4     2 0         1
 5     2     2 03/0~    38     2     4     1     2     0     4     2 2         4
 6     3     2 17/1~     7     5     4     1     1     1     2     2 1         1
 7     3     1 10/1~     5     5     4     2     1     1     1     2 4         1
 8     3     1 27/0~     3     5     4     2     1     1     1     2 0         1
 9     1     1 15/0~    54     3     4     1     2     0     2     1 <NA>      3
10     3     1 19/0~    19     5     4     1     1     1     4     2 5         1
# ... with 57,219 more rows, and 3 more variables: CH15_COD <int>, CH16 <int>,
#   CH16_COD <int>
```
]

---
class: inverse, middle, center

## One more!

---

.panel1-select_4-auto[

```r
*b_eph_ind
```
]
 
.panel2-select_4-auto[
]

---
count: false
 
# Another way to select
.panel1-select_4-auto[

```r
b_eph_ind %>%
* select(ends_with("_COD"))
```
]
 
.panel2-select_4-auto[

```
# A tibble: 57,229 x 6
   CH15_COD CH16_COD PP04B_COD PP04D_COD PP11B_COD PP11D_COD
      <int>    <int> <chr>     <chr>     <chr>     <chr>    
 1       NA       NA 8401      34323     <NA>      <NA>     
 2       NA       NA 9700      55314     <NA>      <NA>     
 3       NA       NA 1009      20333     <NA>      <NA>     
 4       NA       NA <NA>      <NA>      <NA>      <NA>     
 5      202       NA 4803      30113     <NA>      <NA>     
 6       NA       NA <NA>      <NA>      <NA>      <NA>     
 7       NA       NA <NA>      <NA>      <NA>      <NA>     
 8       NA       NA <NA>      <NA>      <NA>      <NA>     
 9       22       NA 1009      30113     <NA>      <NA>     
10       NA       NA 1009      30314     <NA>      <NA>     
# ... with 57,219 more rows
```
]

---
class: inverse, middle, center

## One more.

---

.panel1-select_5-auto[

```r
*b_eph_ind
```
]
 
.panel2-select_5-auto[
]

---
count: false
 
# Another way to select
.panel1-select_5-auto[

```r
b_eph_ind %>%
* select(contains("03"))
```
]
 
.panel2-select_5-auto[

```
# A tibble: 57,229 x 7
    CH03 PP03C PP03D PP03G PP03H PP03I PP03J
   <int> <int> <int> <int> <int> <int> <int>
 1     1     0     0     2     0     2     2
 2     2     2     2     2     0     2     1
 3     3     1     0     1     1     1     1
 4     3    NA    NA    NA    NA    NA    NA
 5     2     1     0     2     0     2     2
 6     3    NA    NA    NA    NA    NA    NA
 7     3    NA    NA    NA    NA    NA    NA
 8     3    NA    NA    NA    NA    NA    NA
 9     1     1     0     2     0     2     2
10     3     1     0     2     0     2     2
# ... with 57,219 more rows
```
]

---
class: inverse, middle, center

# _PRACTICE_

---
class: inverse, middle

## Practice

1) Create an object into which we import the EPH dataset (remember to take the file extension into account)

2) Create another object selecting 3 columns of interest by name

3) Create another object selecting 3 columns of interest by position

4) Rewrite the following code in the "step by step (with pipes)" style

```r
base_ejercicio <- select(b_eph_ind, ESTADO, CH04, CAT_OCUP)
```

---
class: inverse, middle, center

# filter()

***

_<p style="color:grey;" align="center">Selects the cases (rows) based on a condition</p>_

---
# filter()

### The function follows this pattern:

```r
base_de_datos %>% 
  filter(condicion)
```

---
# filter()

- ### For example:

```r
base %>% 
  filter(Edad > 65)
```

---
# filter()

### To solve the proposed **indicator**, we restrict the universe to **people aged 14 or older**

---

.panel1-filter-auto[

```r
*b_eph_ind
```
]
 
.panel2-filter-auto[
]

---
count: false
 
# filter()
.panel1-filter-auto[

```r
b_eph_ind %>%
* select(AGLOMERADO, CH04, CH06, ESTADO, PONDERA)
```
]
 
.panel2-filter-auto[

```
# A tibble: 57,229 x 5
   AGLOMERADO  CH04  CH06 ESTADO PONDERA
        <int> <int> <int>  <int>   <int>
 1          2     1    56      1     547
 2          2     2    46      1     547
 3          2     2    20      1     547
 4          2     1    12      3     547
 5          2     2    38      1     584
 6          2     2     7      4     584
 7          2     1     5      4     584
 8          2     1     3      4     584
 9          2     1    54      1     584
10          2     1    19      1     584
# ... with 57,219 more rows
```
]

---
count: false
 
# filter()
.panel1-filter-auto[

```r
b_eph_ind %>%
  select(AGLOMERADO, CH04, CH06, ESTADO, PONDERA) %>%
* filter(CH06 >= 14)
```
]
 
.panel2-filter-auto[

```
# A tibble: 45,344 x 5
   AGLOMERADO  CH04  CH06 ESTADO PONDERA
        <int> <int> <int>  <int>   <int>
 1          2     1    56      1     547
 2          2     2    46      1     547
 3          2     2    20      1     547
 4          2     2    38      1     584
 5          2     1    54      1     584
 6          2     1    19      1     584
 7          2     2    44      1     815
 8          2     1    16      3     815
 9          2     1    31      1     815
10          2     1    58      1     563
# ... with 45,334 more rows
```
]

---
# filter()

#### Operators for filtering:

<br>

|Condition |Action                     |
| :---     | :---                      |
| `==`     | *equal to*                |
| `%in%`   | *is one of*               |
| `!=`     | *not equal to*            |
| `>`      | *greater than*            |
| `<`      | *less than*               |
| `>=`     | *greater than or equal to*|
| `<=`     | *less than or equal to*   |

| Operator | Description |
| :---     | :---               |
| `&`      | *and* - when both conditions are met   |
| &#124;   | *or* - when either condition is met   |
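The operators can be tried out on a toy table. The data frame `personas` is invented for illustration; the codes 32 and 33 are only borrowed from the agglomerate example used in these slides.

```r
library(dplyr)

# Toy data; values are invented
personas <- data.frame(
  aglomerado = c(32, 33, 2, 32, 13),
  edad       = c(12, 45, 67, 30, 14)
)

# & : both conditions must hold
y_ambas <- personas %>%
  filter(aglomerado == 32 & edad >= 14)

# | : at least one condition must hold
o_alguna <- personas %>%
  filter(aglomerado == 32 | aglomerado == 33)

# %in% : shorthand for several | comparisons on the same column
en_lista <- personas %>%
  filter(aglomerado %in% c(32, 33))

# != : every row except those with the given value
distinto <- personas %>%
  filter(aglomerado != 32)

nrow(y_ambas)
nrow(o_alguna)
nrow(en_lista)
nrow(distinto)
```

The `|` and `%in%` versions select the same rows; `%in%` tends to read better as the list of accepted values grows.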

---
# filter()

### **Case:** I need to restrict the universe to the population living in the _Ciudad Autónoma de Buenos Aires_ __or__ in the _Partidos del Gran Buenos Aires_.

- Check the categories of the variable:

```r
unique(b_eph_ind$AGLOMERADO)
```

```
 [1]  2  3  4  5  6  7  9 10 12 13 14 15 17 18 19 20 22 23 25 26 27 29 30 31 32
[26] 33 34 36 38 91 93
```

- Look up the corresponding codes in the record design.

---

.panel1-filter_1-auto[

```r
*b_eph_ind
```
]
 
.panel2-filter_1-auto[
]

---
count: false
 
# filter()
.panel1-filter_1-auto[

```r
b_eph_ind %>%
* select(AGLOMERADO, CH04, CH06, ESTADO, PONDERA)
```
]
 
.panel2-filter_1-auto[
]

---
count: false
 
# filter()
.panel1-filter_1-auto[

```r
b_eph_ind %>%
  select(AGLOMERADO, CH04, CH06, ESTADO, PONDERA) %>%
* filter(AGLOMERADO == 32 | AGLOMERADO == 33)
```
]
 
.panel2-filter_1-auto[

```
# A tibble: 10,097 x 5
   AGLOMERADO  CH04  CH06 ESTADO PONDERA
        <int> <int> <int>  <int>   <int>
 1         32     2    49      1    1031
 2         32     1     9      4    1031
 3         32     2    81      3    1031
 4         32     1    72      1    1234
 5         32     2    73      1    1234
 6         32     1    28      3    1234
 7         32     2    69      3     640
 8         32     2    87      3    1923
 9         32     1    40      1    2424
10         32     2    41      1    2424
# ... with 10,087 more rows
```
]

---

.panel1-filter_2-auto[

```r
*b_eph_ind
```
]
 
.panel2-filter_2-auto[
]

---
count: false
 
# filter()
.panel1-filter_2-auto[

```r
b_eph_ind %>%
* select(AGLOMERADO, CH04, CH06, ESTADO, PONDERA)
```
]
 
.panel2-filter_2-auto[
]

---
count: false
 
# filter()
.panel1-filter_2-auto[

```r
b_eph_ind %>%
  select(AGLOMERADO, CH04, CH06, ESTADO, PONDERA) %>%
* filter(AGLOMERADO %in% c(32, 33))
```
]
 
.panel2-filter_2-auto[
]

---
class: inverse, middle, center

# _PRACTICE_

---
class: inverse, middle

# Practice

- Starting from the EPH dataset, create a new object that **contains** the variables __AGLOMERADO__ and __CH06__ and **filter** for the population aged _18 or older_ living in the agglomerates of _Neuquén_ or _Río Negro_

- Check that the operations succeeded (_hint: functions such as **unique()**, **table()** or **colnames()** can help_)

---
class: inverse, middle, center

# _mutate()_

_<p style="color:grey;" align="center">Creates / edits variables (columns)</p>_

---
# mutate()

- ### In base R:
```r
base_de_datos$var_nueva <- base_de_datos$var_1 + base_de_datos$var_2
```

<br>

- ### In `tidyverse`:

```r
base_de_datos %>% 
   mutate(var_nueva = var_1 + var_2)
```
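To make the comparison concrete, here is a sketch on an invented data frame (`ingresos`, with hypothetical columns `var_1` and `var_2`). It suggests that both styles yield the same new column, and that `mutate()` returns a new object rather than modifying the original.

```r
library(dplyr)

# Toy data; values are invented
ingresos <- data.frame(
  var_1 = c(100, 250, 0),
  var_2 = c(50, 0, 75)
)

# Base R: assign the new column with $
ingresos_base <- ingresos
ingresos_base$var_nueva <- ingresos_base$var_1 + ingresos_base$var_2

# dplyr: mutate() returns a new data frame; `ingresos` itself is left untouched
ingresos_tidy <- ingresos %>%
  mutate(var_nueva = var_1 + var_2)

identical(ingresos_base$var_nueva, ingresos_tidy$var_nueva)
```

With `mutate()`, the result has to be assigned (`<-`) to be kept; otherwise it is only printed.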

---
# mutate()

### **Indicator:** Sum of income from the main and secondary occupation(s)

---

.panel1-mutate_1-auto[

```r
*b_eph_ind
```
]
 
.panel2-mutate_1-auto[
]

---
count: false
 
# mutate()
.panel1-mutate_1-auto[

```r
b_eph_ind %>%
* select(P21, TOT_P12)
```
]
 
.panel2-mutate_1-auto[

```
# A tibble: 57,229 x 2
     P21 TOT_P12
   <int>   <int>
 1 28000     700
 2  9500    3600
 3    -9       0
 4     0       0
 5    -9       0
 6     0       0
 7     0       0
 8     0       0
 9    -9       0
10     0       0
# ... with 57,219 more rows
```
]

---
count: false
 
# mutate()
.panel1-mutate_1-auto[

```r
b_eph_ind %>%
  select(P21, TOT_P12) %>%
* mutate(ingreso_ocup_tot = P21 + TOT_P12)
```
]
 
.panel2-mutate_1-auto[

```
# A tibble: 57,229 x 3
     P21 TOT_P12 ingreso_ocup_tot
   <int>   <int>            <int>
 1 28000     700            28700
 2  9500    3600            13100
 3    -9       0               -9
 4     0       0                0
 5    -9       0               -9
 6     0       0                0
 7     0       0                0
 8     0       0                0
 9    -9       0               -9
10     0       0                0
# ... with 57,219 more rows
```
]

---
# mutate() - case_when()

### Companion function: `case_when()`, mostly used to recode variables
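One behaviour worth knowing before recoding: rows that match none of the conditions get `NA`, which is one way an `<NA>` category can end up in a later grouped table. The sketch below uses invented ages (the `-1` is only a stand-in for a value outside every range) and shows a catch-all branch written as `TRUE ~ ...`. The label `"Sin dato"` is an assumption made for this example.

```r
library(dplyr)

# Toy ages; -1 stands in for a value that falls outside every range
edades <- data.frame(CH06 = c(5, 25, 70, -1))

recodificada <- edades %>%
  mutate(
    # Without a fallback, unmatched rows become NA
    rango_sin_resto = case_when(CH06 %in% 0:18  ~ "0 a 18",
                                CH06 %in% 19:59 ~ "19 a 59",
                                CH06 >= 60      ~ "60 o más"),
    # TRUE ~ ... acts as a catch-all and has to come last
    rango_con_resto = case_when(CH06 %in% 0:18  ~ "0 a 18",
                                CH06 %in% 19:59 ~ "19 a 59",
                                CH06 >= 60      ~ "60 o más",
                                TRUE            ~ "Sin dato")
  )

recodificada
```

Conditions are evaluated in order and the first match wins, so more specific conditions should be listed before more general ones.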

---

.panel1-mutate_2-auto[

```r
*b_eph_ind
```
]
 
.panel2-mutate_2-auto[
]

---
count: false
 
# Recoding with mutate() and case_when()
.panel1-mutate_2-auto[

```r
b_eph_ind %>%
* select(CH04, CH06)
```
]
 
.panel2-mutate_2-auto[

```
# A tibble: 57,229 x 2
    CH04  CH06
   <int> <int>
 1     1    56
 2     2    46
 3     2    20
 4     1    12
 5     2    38
 6     2     7
 7     1     5
 8     1     3
 9     1    54
10     1    19
# ... with 57,219 more rows
```
]

---
count: false
 
# Recoding with mutate() and case_when()
.panel1-mutate_2-auto[

```r
b_eph_ind %>%
  select(CH04, CH06) %>%
* mutate(sexo = case_when(CH04 == 1 ~ "Varón",
*                         CH04 == 2 ~ "Mujer"))
```
]
 
.panel2-mutate_2-auto[

```
# A tibble: 57,229 x 3
    CH04  CH06 sexo 
   <int> <int> <chr>
 1     1    56 Varón
 2     2    46 Mujer
 3     2    20 Mujer
 4     1    12 Varón
 5     2    38 Mujer
 6     2     7 Mujer
 7     1     5 Varón
 8     1     3 Varón
 9     1    54 Varón
10     1    19 Varón
# ... with 57,219 more rows
```
]

---

.panel1-mutate_3-auto[

```r
*b_eph_ind
```
]
 
.panel2-mutate_3-auto[
]

---
count: false
 
# Recoding with mutate() and case_when()
.panel1-mutate_3-auto[

```r
b_eph_ind %>%
* select(CH06)
```
]
 
.panel2-mutate_3-auto[

```
# A tibble: 57,229 x 1
    CH06
   <int>
 1    56
 2    46
 3    20
 4    12
 5    38
 6     7
 7     5
 8     3
 9    54
10    19
# ... with 57,219 more rows
```
]

---
count: false
 
# Recoding with mutate() and case_when()
.panel1-mutate_3-auto[

```r
b_eph_ind %>%
  select(CH06) %>%
* mutate(edad_rango = case_when(CH06 %in% c(0:18) ~  "0 a 18",
*                               CH06 %in% c(19:29) ~ "19 a 29",
*                               CH06 %in% c(30:39) ~ "30 a 39",
*                               CH06 %in% c(40:49) ~ "40 a 49",
*                               CH06 %in% c(50:59) ~ "50 a 59",
*                               CH06 >= 60 ~ "60 o más"))
```
]
 
.panel2-mutate_3-auto[

```
# A tibble: 57,229 x 2
    CH06 edad_rango
   <int> <chr>     
 1    56 50 a 59   
 2    46 40 a 49   
 3    20 19 a 29   
 4    12 0 a 18    
 5    38 30 a 39   
 6     7 0 a 18    
 7     5 0 a 18    
 8     3 0 a 18    
 9    54 50 a 59   
10    19 19 a 29   
# ... with 57,219 more rows
```
]

---
class: inverse, middle, center

# _PRACTICE_

***

---
class: inverse

# Practice

1) Create a new variable with the labels corresponding to the values of **CAT_OCUP**:

```
1 --> Patrón
2 --> Cuenta propia
3 --> Obrero o empleado
4 --> Trabajador familiar sin remuneración
9 --> Ns./Nr.
```

2) Recode the income variable P21 into 5 ranges.

---
class: inverse, middle, center

# _summarise()_

_<p style="color:grey;" align="center">Summarises the information in a new table</p>_

---
# summarise()

#### **Case:**

- **Indicator 1:** I want to know how many employed people there are

- **Indicator 2:** I want to know the mean income from the main occupation
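Before running this on the full dataset, the logic of a weighted summary can be checked by hand on a few rows. The sketch below uses an invented sample (`muestra`) where each row represents `PONDERA` people. It uses base R's `weighted.mean()` as a stand-in for `questionr::wtd.mean()`, so that only `dplyr` is needed.

```r
library(dplyr)

# Toy sample: each row stands for PONDERA people (values are invented)
muestra <- data.frame(
  ESTADO  = c(1, 1, 3, 2),
  P21     = c(1000, 3000, 0, 0),
  PONDERA = c(10, 30, 40, 20)
)

resumen <- muestra %>%
  summarise(
    cant_pob_tot  = sum(PONDERA),               # expanded population
    cant_ocupados = sum(PONDERA[ESTADO == 1]),  # weights of rows with ESTADO == 1 only
    # Base R weighted mean, used here in place of questionr::wtd.mean()
    ingr_media    = weighted.mean(P21, w = PONDERA)
  )

resumen
```

The result is a one-row table: `summarise()` collapses all rows into one value per expression.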

---

.panel1-summarise_1-auto[

```r
*b_eph_ind
```
]
 
.panel2-summarise_1-auto[
]

---
count: false
 
# _summarise()_
.panel1-summarise_1-auto[

```r
b_eph_ind %>%
* select(ESTADO, P21, PONDERA)
```
]
 
.panel2-summarise_1-auto[

```
# A tibble: 57,229 x 3
   ESTADO   P21 PONDERA
    <int> <int>   <int>
 1      1 28000     547
 2      1  9500     547
 3      1    -9     547
 4      3     0     547
 5      1    -9     584
 6      4     0     584
 7      4     0     584
 8      4     0     584
 9      1    -9     584
10      1     0     584
# ... with 57,219 more rows
```
]

---
count: false
 
# _summarise()_
.panel1-summarise_1-auto[

```r
b_eph_ind %>%
  select(ESTADO, P21, PONDERA) %>%
* summarise(cant_pob_tot = sum(PONDERA),
*           cant_ocupados = sum(PONDERA[ESTADO == 1]),
*           min_ingr_oc_princ = min(P21),
*           max_ingr_oc_princ = max(P21),
*           ingr_oc_princ_media = questionr::wtd.mean(x = P21,
*                                                     weights = PONDERA))
```
]
 
.panel2-summarise_1-auto[

```
# A tibble: 1 x 5
  cant_pob_tot cant_ocupados min_ingr_oc_princ max_ingr_oc_pri~ ingr_oc_princ_m~
         <int>         <int>             <int>            <int>            <dbl>
1     27989128      11933503                -9           540000            8269.
```
]

---

.panel1-summarise_2-auto[

```r
*library(questionr)
```
]
 
.panel2-summarise_2-auto[

]

---
count: false
 
# _summarise()_
.panel1-summarise_2-auto[

```r
library(questionr)

*b_eph_ind
```
]
 
.panel2-summarise_2-auto[
]

---
count: false
 
# _summarise()_
.panel1-summarise_2-auto[

```r
library(questionr)

b_eph_ind %>%
* select(ESTADO, P21, PONDERA)
```
]
 
.panel2-summarise_2-auto[
]

---
count: false
 
# _summarise()_
.panel1-summarise_2-auto[

```r
library(questionr)

b_eph_ind %>%
  select(ESTADO, P21, PONDERA) %>%
* summarise(cant_pob_tot = sum(PONDERA),
*           cant_ocupados = sum(PONDERA[ESTADO == 1]),
*           min_ingr_oc_princ = min(P21),
*           max_ingr_oc_princ = max(P21),
*           ingr_oc_princ_media = wtd.mean(x = P21,  # questionr package
*                                          weights = PONDERA))
```
]
 
.panel2-summarise_2-auto[
]

---
class: inverse, middle, center

# _group_by()_

***

_<p style="color:grey;" align="center">Applies an operation to the population, split into groups</p>_

---
# group_by()

```r
base_de_datos %>% 
        group_by(variable_de_corte) #<<
```
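On its own, `group_by()` only marks the groups; the effect shows up in the step that follows it. A minimal sketch on an invented sample (`muestra`), pairing it with `summarise()`:

```r
library(dplyr)

# Toy sample; values are invented
muestra <- data.frame(
  CH04    = c(1, 1, 2, 2, 2),
  ESTADO  = c(1, 3, 1, 1, 3),
  PONDERA = c(10, 20, 30, 40, 50)
)

por_sexo <- muestra %>%
  group_by(CH04) %>%                            # one group per value of CH04
  summarise(
    cant_pob_tot  = sum(PONDERA),               # computed within each group
    cant_ocupados = sum(PONDERA[ESTADO == 1])
  )

por_sexo
```

The same `summarise()` call without `group_by()` would return a single row; with it, the result has one row per group.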

---

.panel1-group_by_1-auto[

```r
*library(questionr)
```
]
 
.panel2-group_by_1-auto[

]

---
count: false
 
# _group_by()_
.panel1-group_by_1-auto[

```r
library(questionr)

*b_eph_ind
```
]
 
.panel2-group_by_1-auto[
]

---
count: false
 
# _group_by()_
.panel1-group_by_1-auto[

```r
library(questionr)

b_eph_ind %>%
* group_by(CH04)
```
]
 
.panel2-group_by_1-auto[

```
# A tibble: 57,229 x 177
# Groups:   CH04 [2]
   CODUSU    ANO4 TRIMESTRE NRO_HOGAR COMPONENTE   H15 REGION MAS_500 AGLOMERADO
   <fct>    <int>     <int>     <int>      <int> <int>  <int> <fct>        <int>
 1 TQRMNOQ~  2019         3         1          1     1     43 S                2
 2 TQRMNOQ~  2019         3         1          2     1     43 S                2
 3 TQRMNOQ~  2019         3         1          3     1     43 S                2
 4 TQRMNOQ~  2019         3         1          4     1     43 S                2
 5 TQRMNOQ~  2019         3         1          2     1     43 S                2
 6 TQRMNOQ~  2019         3         1          3     0     43 S                2
 7 TQRMNOQ~  2019         3         1          4     0     43 S                2
 8 TQRMNOQ~  2019         3         1          5     0     43 S                2
 9 TQRMNOS~  2019         3         1          1     1     43 S                2
10 TQRMNOS~  2019         3         1          2     1     43 S                2
# ... with 57,219 more rows, and 168 more variables: PONDERA <int>, CH03 <int>,
#   CH04 <int>, CH05 <fct>, CH06 <int>, CH07 <int>, CH08 <int>, CH09 <int>,
#   CH10 <int>, CH11 <int>, CH12 <int>, CH13 <int>, CH14 <chr>, CH15 <int>,
#   CH15_COD <int>, CH16 <int>, CH16_COD <int>, NIVEL_ED <int>, ESTADO <int>,
#   CAT_OCUP <int>, CAT_INAC <int>, IMPUTA <int>, PP02C1 <int>, PP02C2 <int>,
#   PP02C3 <int>, PP02C4 <int>, PP02C5 <int>, PP02C6 <int>, PP02C7 <int>,
#   PP02C8 <int>, PP02E <int>, PP02H <int>, PP02I <int>, PP03C <int>, ...
```
]

---
count: false
 
# _group_by()_
.panel1-group_by_1-auto[

```r
library(questionr)

b_eph_ind %>%
  group_by(CH04) %>%
* summarise(cant_pob_tot = sum(PONDERA),
*           cant_ocupados = sum(PONDERA[ESTADO == 1]),
*           min_ingr_oc_princ = min(P21),
*           max_ingr_oc_princ = max(P21),
*           ingr_oc_princ_media = wtd.mean(x = P21,  # questionr package
*                                          weights = PONDERA))
```
]
 
.panel2-group_by_1-auto[

```
# A tibble: 2 x 6
   CH04 cant_pob_tot cant_ocupados min_ingr_oc_princ max_ingr_oc_princ
  <int>        <int>         <int>             <int>             <int>
1     1     13528065       6793308                -9            540000
2     2     14461063       5140195                -9            300000
# ... with 1 more variable: ingr_oc_princ_media <dbl>
```
]

---
# Step by step

---
# **Case**

### - **Indicator 1:** *Main labor market rates for the agglomerates of CABA and Partidos del GBA*

### - **Indicator 2:** *Indicator 1 by the __sex__ and __age__ of the people.*

According to the [record design](https://www.indec.gob.ar/ftp/cuadros/menusuperior/eph/EPH_registro_t318.pdf), the working variables are:

- **Agglomerate of residence** = `AGLOMERADO`

- **Activity status** = `ESTADO`

- **Sex** = `CH04`

- **Age** = `CH06`

- **Weighting factor** = `PONDERA`

---

.panel1-group_by_2-auto[

```r
*b_eph_ind
```
]
 
.panel2-group_by_2-auto[
]

---
count: false
 
# _group_by()_
.panel1-group_by_2-auto[

```r
b_eph_ind %>%
* select(AGLOMERADO, CH04, CH06, ESTADO, P21, PONDERA)
```
]
 
.panel2-group_by_2-auto[

```
# A tibble: 57,229 x 6
   AGLOMERADO  CH04  CH06 ESTADO   P21 PONDERA
        <int> <int> <int>  <int> <int>   <int>
 1          2     1    56      1 28000     547
 2          2     2    46      1  9500     547
 3          2     2    20      1    -9     547
 4          2     1    12      3     0     547
 5          2     2    38      1    -9     584
 6          2     2     7      4     0     584
 7          2     1     5      4     0     584
 8          2     1     3      4     0     584
 9          2     1    54      1    -9     584
10          2     1    19      1     0     584
# ... with 57,219 more rows
```
]

---
count: false
 
# _group_by()_
.panel1-group_by_2-auto[

```r
b_eph_ind %>%
  select(AGLOMERADO, CH04, CH06, ESTADO, P21, PONDERA) %>%
* mutate(edad_rango = case_when(CH06 %in% c(0:18) ~  "0 a 18",
*                               CH06 %in% c(19:29) ~ "19 a 29",
*                               CH06 %in% c(30:39) ~ "30 a 39",
*                               CH06 %in% c(40:49) ~ "40 a 49",
*                               CH06 %in% c(50:59) ~ "50 a 59",
*                               CH06 >= 60 ~ "60 o más"),
*        sexo = case_when(CH04 == 1 ~ "Varón",
*                         CH04 == 2 ~ "Mujer"))
```
]
 
.panel2-group_by_2-auto[

```
# A tibble: 57,229 x 8
   AGLOMERADO  CH04  CH06 ESTADO   P21 PONDERA edad_rango sexo 
        <int> <int> <int>  <int> <int>   <int> <chr>      <chr>
 1          2     1    56      1 28000     547 50 a 59    Varón
 2          2     2    46      1  9500     547 40 a 49    Mujer
 3          2     2    20      1    -9     547 19 a 29    Mujer
 4          2     1    12      3     0     547 0 a 18     Varón
 5          2     2    38      1    -9     584 30 a 39    Mujer
 6          2     2     7      4     0     584 0 a 18     Mujer
 7          2     1     5      4     0     584 0 a 18     Varón
 8          2     1     3      4     0     584 0 a 18     Varón
 9          2     1    54      1    -9     584 50 a 59    Varón
10          2     1    19      1     0     584 19 a 29    Varón
# ... with 57,219 more rows
```
]

---
count: false
 
# _group_by()_
.panel2-group_by_2-auto[

```
# A tibble: 10,097 x 8
   AGLOMERADO  CH04  CH06 ESTADO   P21 PONDERA edad_rango sexo 
        <int> <int> <int>  <int> <int>   <int> <chr>      <chr>
 1         32     2    49      1 30000    1031 40 a 49    Mujer
 2         32     1     9      4     0    1031 0 a 18     Varón
 3         32     2    81      3     0    1031 60 o más   Mujer
 4         32     1    72      1     0    1234 60 o más   Varón
 5         32     2    73      1 20000    1234 60 o más   Mujer
 6         32     1    28      3     0    1234 19 a 29    Varón
 7         32     2    69      3     0     640 60 o más   Mujer
 8         32     2    87      3     0    1923 60 o más   Mujer
 9         32     1    40      1    -9    2424 40 a 49    Varón
10         32     2    41      1    -9    2424 40 a 49    Mujer
# ... with 10,087 more rows
```
]

---
count: false
 
# _group_by()_
.panel2-group_by_2-auto[

```
# A tibble: 10,097 x 8
# Groups:   sexo, edad_rango [14]
   AGLOMERADO  CH04  CH06 ESTADO   P21 PONDERA edad_rango sexo 
        <int> <int> <int>  <int> <int>   <int> <chr>      <chr>
 1         32     2    49      1 30000    1031 40 a 49    Mujer
 2         32     1     9      4     0    1031 0 a 18     Varón
 3         32     2    81      3     0    1031 60 o más   Mujer
 4         32     1    72      1     0    1234 60 o más   Varón
 5         32     2    73      1 20000    1234 60 o más   Mujer
 6         32     1    28      3     0    1234 19 a 29    Varón
 7         32     2    69      3     0     640 60 o más   Mujer
 8         32     2    87      3     0    1923 60 o más   Mujer
 9         32     1    40      1    -9    2424 40 a 49    Varón
10         32     2    41      1    -9    2424 40 a 49    Mujer
# ... with 10,087 more rows
```
]

---
count: false
 
# _group_by()_
.panel1-group_by_2-auto[

```r
b_eph_ind %>%
  select(AGLOMERADO, CH04, CH06, ESTADO, P21, PONDERA) %>%
  mutate(edad_rango = case_when(CH06 %in% c(0:18) ~  "0 a 18",
                                CH06 %in% c(19:29) ~ "19 a 29",
                                CH06 %in% c(30:39) ~ "30 a 39",
                                CH06 %in% c(40:49) ~ "40 a 49",
                                CH06 %in% c(50:59) ~ "50 a 59",
                                CH06 >= 60 ~ "60 o más"),
         sexo = case_when(CH04 == 1 ~ "Varón",
                          CH04 == 2 ~ "Mujer")) %>%
  filter(AGLOMERADO %in% c(32, 33)) %>%
  group_by(sexo, edad_rango) %>%
* summarise(cant_pob_tot = sum(PONDERA),
*           cant_ocupados = sum(PONDERA[ESTADO == 1]),
*           min_ingr_oc_princ = min(P21),
*           max_ingr_oc_princ = max(P21),
*           ingr_oc_princ_media = wtd.mean(x = P21,  # questionr package
*                                          weights = PONDERA))
```
]
 
.panel2-group_by_2-auto[

```
# A tibble: 14 x 7
# Groups:   sexo [2]
   sexo  edad_rango cant_pob_tot cant_ocupados min_ingr_oc_pri~ max_ingr_oc_pri~
   <chr> <chr>             <int>         <int>            <int>            <int>
 1 Mujer 0 a 18          1946718         17926               -9            15000
 2 Mujer 19 a 29         1192959        517320               -9            60000
 3 Mujer 30 a 39         1039620        637976               -9           103000
 4 Mujer 40 a 49         1076082        766799               -9           300000
 5 Mujer 50 a 59          817229        511513               -9           200000
 6 Mujer 60 o más        1692597        320630               -9            75000
 7 Mujer <NA>              67672             0                0                0
 8 Varón 0 a 18          2113559         47708               -9            64008
 9 Varón 19 a 29         1252010        808136               -9            85000
10 Varón 30 a 39          975293        858522               -9           122000
11 Varón 40 a 49         1017797        895313               -9           260000
12 Varón 50 a 59          772758        671746               -9           500000
13 Varón 60 o más        1229724        491218               -9           300000
14 Varón <NA>              88090             0                0                0
# ... with 1 more variable: ingr_oc_princ_media <dbl>
```
]
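
---
# _group_by()_

A minimal sketch of the same grouped summary on a hypothetical toy table (the `toy` data, its values, and the name `resumen` are made up for illustration). It swaps `questionr::wtd.mean()` for base R's `weighted.mean()` so that it only needs dplyr. It also assumes, as the `-9` minimums in the output above suggest, that `-9` in `P21` is a non-response code rather than a real amount, so those rows are left out of the income statistics.

```r
library(dplyr)

# Hypothetical toy data that mimics the EPH columns used above
toy <- tibble(
  sexo    = c("Mujer", "Mujer", "Mujer", "Varón", "Varón", "Varón"),
  ESTADO  = c(1, 1, 3, 1, 1, 1),
  P21     = c(20000, -9, 0, 30000, 10000, -9),
  PONDERA = c(100, 200, 150, 100, 300, 50)
)

resumen <- toy %>%
  group_by(sexo) %>%
  summarise(
    cant_pob_tot  = sum(PONDERA),
    cant_ocupados = sum(PONDERA[ESTADO == 1]),
    # Assumption: -9 is a non-response code, so keep only positive amounts
    min_ingr_oc_princ   = min(P21[P21 > 0]),
    ingr_oc_princ_media = weighted.mean(P21[P21 > 0], PONDERA[P21 > 0])
  )

resumen
```

With these made-up values the result has one row per `sexo`, and the minimum is a real amount instead of `-9`.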

---
class: middle, center, inverse
  
<img src="data:image/png;base64,#../img/logo tidyr.png" width="30%" style="display: block; margin: auto;" />

---
# Functions in the tidyr package:

| __Function__     | __Action__ |
| :---             | ---:       |
| `pivot_longer()` | *Turns several columns into rows*|
| `pivot_wider()`  | *Turns several rows into columns*|

---
# Data structure

<br>


---
class: inverse, middle, center

# _pivot_longer()_

***

_<p style="color:grey;" align:"center">Reshapes the data by stacking several columns into one: from wide to long</p>_

---
count: false
 
# _pivot_longer()_
.panel1-pivot_longer_1-auto[

```r
*b_eph_ind
```
]
 
.panel2-pivot_longer_1-auto[

]

---
count: false
 
# _pivot_longer()_
.panel1-pivot_longer_1-auto[

```r
b_eph_ind %>%
* group_by(CH04)
```
]
 
.panel2-pivot_longer_1-auto[

]

---
count: false
 
# _pivot_longer()_
.panel1-pivot_longer_1-auto[

```r
b_eph_ind %>%
  group_by(CH04) %>%
* summarise(cant_pob_tot = sum(PONDERA),
*           cant_ocupados = sum(PONDERA[ESTADO == 1]),
*           min_ingr_oc_princ = min(P21),
*           max_ingr_oc_princ = max(P21),
*           ingr_oc_princ_media = wtd.mean(x = P21,  # from the questionr package
*                                          weights = PONDERA))
```
]
 
.panel2-pivot_longer_1-auto[

]

---
count: false
 
# _pivot_longer()_
.panel1-pivot_longer_1-auto[

```r
b_eph_ind %>%
  group_by(CH04) %>%
  summarise(cant_pob_tot = sum(PONDERA),
            cant_ocupados = sum(PONDERA[ESTADO == 1]),
            min_ingr_oc_princ = min(P21),
            max_ingr_oc_princ = max(P21),
            ingr_oc_princ_media = wtd.mean(x = P21,  # from the questionr package
                                           weights = PONDERA)) %>%
* select(CH04, cant_ocupados, ingr_oc_princ_media)
```
]
 
.panel2-pivot_longer_1-auto[

```
# A tibble: 2 x 3
   CH04 cant_ocupados ingr_oc_princ_media
  <int>         <int>               <dbl>
1     1       6793308              10805.
2     2       5140195               5896.
```
]

---
count: false
 
# _pivot_longer()_
.panel1-pivot_longer_1-auto[

```r
b_eph_ind %>%
  group_by(CH04) %>%
  summarise(cant_pob_tot = sum(PONDERA),
            cant_ocupados = sum(PONDERA[ESTADO == 1]),
            min_ingr_oc_princ = min(P21),
            max_ingr_oc_princ = max(P21),
            ingr_oc_princ_media = wtd.mean(x = P21,  # from the questionr package
                                           weights = PONDERA)) %>%
  select(CH04, cant_ocupados, ingr_oc_princ_media) %>%
* pivot_longer(cols = c(cant_ocupados, ingr_oc_princ_media),  #<<
*              names_to = "variable",
*              values_to = "valor")
```
]
 
.panel2-pivot_longer_1-auto[

```
# A tibble: 4 x 3
   CH04 variable               valor
  <int> <chr>                  <dbl>
1     1 cant_ocupados       6793308 
2     1 ingr_oc_princ_media   10805.
3     2 cant_ocupados       5140195 
4     2 ingr_oc_princ_media    5896.
```
]
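
---
# _pivot_longer()_

A small sketch of why `valor` is printed as `<dbl>` above: when an integer column and a double column are stacked into one, the values end up sharing a single type. The table `resumen` and its numbers are hypothetical, and saving the result as `base_largo` is an assumption about how the next section's input might be created.

```r
library(dplyr)
library(tidyr)

# Hypothetical summary table with the same shape as the one above
resumen <- tibble(
  CH04                = c(1L, 2L),
  cant_ocupados       = c(300L, 450L),      # integer column
  ingr_oc_princ_media = c(20000.5, 15000)   # double column
)

base_largo <- resumen %>%
  pivot_longer(cols = -CH04,                # every column except the key
               names_to = "variable",
               values_to = "valor")

base_largo
typeof(base_largo$valor)  # integers and doubles are stored together as double
```

Selecting with `cols = -CH04` is equivalent here to naming both columns explicitly, and it keeps working if more summary columns are added.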

---
class: inverse, middle, center

# _pivot_wider()_

***

_<p style="color:grey;" align:"center">Reshapes the data by spreading the rows of one variable into several columns: from long to wide</p>_

---
count: false
 
# _pivot_wider()_
.panel1-pivot_wider_1-auto[

```r
*base_largo
```
]
 
.panel2-pivot_wider_1-auto[

]

---
count: false
 
# _pivot_wider()_
.panel1-pivot_wider_1-auto[

```r
base_largo %>%
* pivot_wider(names_from = "variable",  #<<
*             values_from = "valor")
```
]
 
.panel2-pivot_wider_1-auto[

```
# A tibble: 2 x 3
   CH04 cant_ocupados ingr_oc_princ_media
  <int>         <dbl>               <dbl>
1     1       6793308              10805.
2     2       5140195               5896.
```
]
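
---
# _pivot_wider()_

A hedged sketch of the round trip on a hypothetical long table. The slides do not show how `base_largo` is built; here it is assumed to be a saved `pivot_longer()` result, and the values are made up. The final `mutate()` is an optional extra that restores the integer type, which the output above shows becoming `<dbl>` after the round trip.

```r
library(dplyr)
library(tidyr)

# Hypothetical long table (assumed shape of base_largo)
base_largo <- tibble(
  CH04     = c(1L, 1L, 2L, 2L),
  variable = c("cant_ocupados", "ingr_oc_princ_media",
               "cant_ocupados", "ingr_oc_princ_media"),
  valor    = c(300, 20000.5, 450, 15000)
)

base_ancho <- base_largo %>%
  pivot_wider(names_from = variable,
              values_from = valor) %>%
  # Optional: counts came back as double, so convert them to integer again
  mutate(cant_ocupados = as.integer(cant_ocupados))

base_ancho
```

Each distinct value of `variable` becomes a column, and each `CH04` value becomes a single row.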