Data wrangling tricks from the R4DS slack
Every now and then there is a question on the R4DS slack that produces a large number of replies. This case was particularly interesting because the user that had the issue deleted all the original content, the question, the clarifications and so on.
I thought it would be interesting to document this, because it may be useful to other people.
The question was posted with regards to some genes data, and the task was to filter out some of the genes that have missing values. But generally speaking the question can be summarized like this:
How to filter a dataset in a such way that all observations from a group are dropped if the group doesn’t meet certain criteria?
These are our sample data:
dat <- data.frame(
groups = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
features = c("c", "d", "e", "e", "e", "d", "d", "f", "g"),
col1 = c(1, 2, 5, NA, 5, NA, 6, 7, NA),
col2 = c(1, 2, 4, 4, NA, 5, 6, NA, 7),
col3 = c(1, 3, 4, NA, 5, NA, 4, 7, 8)
)
dat
## groups features col1 col2 col3
## 1 a c 1 1 1
## 2 a d 2 2 3
## 3 a e 5 4 4
## 4 b e NA 4 NA
## 5 b e 5 NA 5
## 6 b d NA 5 NA
## 7 c d 6 6 4
## 8 c f 7 NA 7
## 9 c g NA 7 8
We want to keep only group a
since it is the only one where none of the
observations have missing data.
There is a case of complete observations in group c
as well, but it is only
one (on row 7), so we don’t want to keep that group.
A simple pipeline to do this would be:
library(dplyr)
library(tidyr)
dat |>
drop_na() |>
group_by(groups) |>
filter(n() == 3)
## # A tibble: 3 × 5
## # Groups: groups [1]
## groups features col1 col2 col3
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 a c 1 1 1
## 2 a d 2 2 3
## 3 a e 5 4 4
After providing this solution, I thought about the different scenarios where a similar recipe may be needed. For example, what if the data is too big and, data.table
would be preferred? Or what if the pipeline should be part of a bigger project using python
for example?
More R
So here are the possible solutions with other libraries. First data.table
.
library(data.table)
dat_dt <- setDT(dat)
dat_grp <-
dat_dt[complete.cases(dat_dt),][, .N, by = groups][N == 3]
dt_result <- dat_dt[dat_grp, on = "groups"]
dt_result
## groups features col1 col2 col3 N
## 1: a c 1 1 1 3
## 2: a d 2 2 3 3
## 3: a e 5 4 4 3
Most of the time I find data.table
unreadable, and luckily, a tidytable
solution is also possible.
dat |>
tidytable::drop_na() |>
tidytable::group_by(groups) |>
tidytable::filter(dplyr::n() == 3)
## # A tidytable: 3 × 5
## # Groups: groups
## groups features col1 col2 col3
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 a c 1 1 1
## 2 a d 2 2 3
## 3 a e 5 4 4
Python
On to python
. Here is a simple chain in pandas
that does the same.
import pandas as pd
dat_py = r.dat
(dat_py.dropna(axis = 0, how = "any")
.groupby("groups")
.filter(lambda x: len(x) == 3))
## groups features col1 col2 col3
## 0 a c 1.0 1.0 1.0
## 1 a d 2.0 2.0 3.0
## 2 a e 5.0 4.0 4.0
The interesting bit is that drop_na()
and dropna()
behave
differently. In
tidyr, the
function assumes that it is checking all of the columns by default,
while in
pandas
arguments have to be called to check if any of the columns have missing
values.
SQL
And, how about some SQL
? It is possible, of course, that your data may be in a database and not in a flat file. Here is a sqlite
solution.
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "dat", dat)
query <- "SELECT t1.* FROM dat AS t1
INNER JOIN (
SELECT groups, COUNT(*) as cnt
FROM dat
WHERE col1 IS NOT NULL AND col2 IS NOT NUll AND col3 IS NOT NULL
GROUP BY groups
) AS t2
ON t1.groups = t2.groups
WHERE t2.cnt = 3;"
res <- dbSendQuery(con, query)
dbFetch(res)
## groups features col1 col2 col3
## 1 a c 1 1 1
## 2 a d 2 2 3
## 3 a e 5 4 4
Summary
This was a quick post that provides answers with different tools to the question: How to filter a dataset in a such way that all observations from a group are dropped if the group doesn’t meet certain criteria?
We saw how to do it in tidyverse
and in data.table
. What is missing is base R
. Sorry about that! However there are solutions in pandas
and SQL
. Hopefully something can be useful to somebody.