# Data wrangling tricks from the R4DS slack

Every now and then there is a question on the R4DS slack that produces a large number of replies. This case was particularly interesting because the user that had the issue deleted all the original content, the question, the clarifications and so on.

I thought it would be interesting to document this, because it may be useful to other people.

The question was posted with regards to some genes data, and the task was to filter out some of the genes that have missing values. But generally speaking the question can be summarized like this:

How to filter a dataset in a such way that all observations from a group are dropped if the group doesn’t meet certain criteria?

These are our sample data:

``````dat <- data.frame(
groups = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
features = c("c", "d", "e", "e", "e", "d", "d", "f", "g"),
col1 = c(1, 2, 5, NA, 5, NA, 6, 7, NA),
col2 = c(1, 2, 4, 4, NA, 5, 6, NA, 7),
col3 = c(1, 3, 4, NA, 5, NA, 4, 7, 8)
)

dat``````
``````##   groups features col1 col2 col3
## 1      a        c    1    1    1
## 2      a        d    2    2    3
## 3      a        e    5    4    4
## 4      b        e   NA    4   NA
## 5      b        e    5   NA    5
## 6      b        d   NA    5   NA
## 7      c        d    6    6    4
## 8      c        f    7   NA    7
## 9      c        g   NA    7    8``````

We want to keep only `group a` since it is the only one where none of the observations have missing data.

There is a case of complete observations in `group c` as well, but it is only one (on row 7), so we don’t want to keep that group.

A simple pipeline to do this would be:

``````library(dplyr)
library(tidyr)

dat |>
drop_na() |>
group_by(groups) |>
filter(n() == 3)``````
``````## # A tibble: 3 × 5
## # Groups:   groups 
##   groups features  col1  col2  col3
##   <chr>  <chr>    <dbl> <dbl> <dbl>
## 1 a      c            1     1     1
## 2 a      d            2     2     3
## 3 a      e            5     4     4``````

After providing this solution, I thought about the different scenarios where a similar recipe may be needed. For example, what if the data is too big and, `data.table` would be preferred? Or what if the pipeline should be part of a bigger project using `python` for example?

## More R

So here are the possible solutions with other libraries. First `data.table`.

``````library(data.table)

dat_dt <- setDT(dat)

dat_grp <-
dat_dt[complete.cases(dat_dt),][, .N, by = groups][N == 3]

dt_result <- dat_dt[dat_grp, on = "groups"]

dt_result``````
``````##    groups features col1 col2 col3 N
## 1:      a        c    1    1    1 3
## 2:      a        d    2    2    3 3
## 3:      a        e    5    4    4 3``````

Most of the time I find `data.table` unreadable, and luckily, a `tidytable` solution is also possible.

``````dat |>
tidytable::drop_na() |>
tidytable::group_by(groups) |>
tidytable::filter(dplyr::n() == 3)``````
``````## # A tidytable: 3 × 5
## # Groups:      groups
##   groups features  col1  col2  col3
##   <chr>  <chr>    <dbl> <dbl> <dbl>
## 1 a      c            1     1     1
## 2 a      d            2     2     3
## 3 a      e            5     4     4``````

## Python

On to `python`. Here is a simple chain in `pandas` that does the same.

``````import pandas as pd

dat_py = r.dat

(dat_py.dropna(axis = 0, how = "any")
.groupby("groups")
.filter(lambda x: len(x) == 3))``````
``````##   groups features  col1  col2  col3
## 0      a        c   1.0   1.0   1.0
## 1      a        d   2.0   2.0   3.0
## 2      a        e   5.0   4.0   4.0``````

The interesting bit is that `drop_na()` and `dropna()` behave differently. In tidyr, the function assumes that it is checking all of the columns by default, while in pandas arguments have to be called to check if any of the columns have missing values.

## SQL

And, how about some `SQL`? It is possible, of course, that your data may be in a database and not in a flat file. Here is a `sqlite` solution.

``````library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

dbWriteTable(con, "dat", dat)

query <- "SELECT t1.* FROM dat AS t1
INNER JOIN (
SELECT groups, COUNT(*) as cnt
FROM dat
WHERE col1 IS NOT NULL AND col2 IS NOT NUll AND col3 IS NOT NULL
GROUP BY groups
) AS t2
ON t1.groups = t2.groups
WHERE t2.cnt = 3;"

res <- dbSendQuery(con, query)

dbFetch(res)``````
``````##   groups features col1 col2 col3
## 1      a        c    1    1    1
## 2      a        d    2    2    3
## 3      a        e    5    4    4``````

# Summary

This was a quick post that provides answers with different tools to the question: How to filter a dataset in a such way that all observations from a group are dropped if the group doesn’t meet certain criteria?

We saw how to do it in `tidyverse` and in `data.table`. What is missing is `base R`. Sorry about that! However there are solutions in `pandas` and `SQL`. Hopefully something can be useful to somebody.