Introduction to the rgnoisefilt package

The rgnoisefilt package contains filtering techniques to remove noisy samples in regression datasets. It adapts classic and recent filtering techniques for use in regression problems, and it also incorporates methods specifically designed for regression data. In order to do this, it uses approaches proposed in the specialized literature, such as Martín et al. (2021) and Arnaiz-González et al. (2016).

Installation

The rgnoisefilt package can be installed in R from CRAN servers using the command:

install.packages("rgnoisefilt")

This command installs all the dependencies of the package as well as all the regression algorithms necessary for the operation of the noise filters. In order to access all the functions of the package, it is necessary to use the R command:

library(rgnoisefilt)

Documentation

All the information corresponding to each noise filter can be consulted from the CRAN website. Additionally, the help() command can be used. For example, in order to check the documentation of the regIPF noise filter, we can use:

help(regIPF)
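A few other base R ways to reach the same documentation are sketched below. This assumes the package is already installed and attached; nothing here is specific to rgnoisefilt beyond the package and function names already mentioned above.

```r
library(rgnoisefilt)

# Shorthand for help(regIPF)
?regIPF

# Index of all help topics shipped with the package
help(package = "rgnoisefilt")

# Names of the objects exported by the attached package
ls("package:rgnoisefilt")
```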

Usage of regression noise filters

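To process noisy regression data, each noise filter in the rgnoisefilt package provides two standard ways of use: a default method, which receives the input attributes (x) and the output variable (y) separately, and a formula method, which receives a formula together with the data frame (data) that contains the variables.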

An example of how to use these two methods to filter the rock dataset with the regCNN noise filter is shown below:

data(rock)
head(rock)
#>   area    peri     shape perm
#> 1 4990 2791.90 0.0903296  6.3
#> 2 7002 3892.60 0.1486220  6.3
#> 3 7558 3930.66 0.1833120  6.3
#> 4 7352 3869.32 0.1170630  6.3
#> 5 7943 3948.54 0.1224170 17.1
#> 6 7979 4010.15 0.1670450 17.1
# Using the default method:
set.seed(9)
out.def <- regCNN(x = rock[,-ncol(rock)], y = rock[,ncol(rock)])
# Using the formula method:
set.seed(9)
out.frm <- regCNN(formula = perm ~ ., data = rock)
# Check the match of noisy indices:
all(out.def$idnoise == out.frm$idnoise)
#> [1] TRUE

Note that the $ operator is used to access the elements returned by the filter in the objects out.def and out.frm.
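As a minimal sketch of what can be done with those elements, the clean samples can be reassembled into a data frame and passed to any regression method. The choice of lm() and the names out, clean and fit are illustrative assumptions, not part of the package.

```r
library(rgnoisefilt)

data(rock)
set.seed(9)
out <- regCNN(formula = perm ~ ., data = rock)

# Rebuild a data frame from the clean inputs and the clean output values
clean <- data.frame(out$xclean, perm = out$yclean)

# Fit an ordinary linear model (base R) on the filtered data only
fit <- lm(perm ~ ., data = clean)
summary(fit)
```

Any other learner could be used in place of lm(); the point is only that xclean and yclean together form the filtered dataset.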

Output values

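All regression noise filters return an object of class rfdata, which unifies the output of the methods included in the rgnoisefilt package. An rfdata object is a list containing the most relevant information about the noise filtering process: the clean samples (xclean, yclean) with their number and indices (numclean, idclean), the noisy samples (xnoise, ynoise) with their number and indices (numnoise, idnoise), the name of the filter (filter), the parameters used (param) and the original call (call).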

As an example, the structure of the rfdata object returned using the regCNN noise filter is shown below:

str(out.def)
#> List of 11
#>  $ xclean  :'data.frame':    39 obs. of  3 variables:
#>   ..$ area : int [1:39] 4990 7002 7558 7352 7943 7979 9333 8209 8393 6425 ...
#>   ..$ peri : num [1:39] 2792 3893 3931 3869 3949 ...
#>   ..$ shape: num [1:39] 0.0903 0.1486 0.1833 0.1171 0.1224 ...
#>  $ yclean  : num [1:39] 6.3 6.3 6.3 6.3 17.1 17.1 17.1 17.1 119 119 ...
#>  $ numclean: int 39
#>  $ idclean : num [1:39] 1 2 3 4 5 6 7 8 9 10 ...
#>  $ xnoise  :'data.frame':    9 obs. of  3 variables:
#>   ..$ area : int [1:9] 3469 1468 3524 5267 5048 1016 5605 8793 5514
#>   ..$ peri : num [1:9] 1377 476 1189 1645 942 ...
#>   ..$ shape: num [1:9] 0.177 0.439 0.164 0.254 0.329 ...
#>  $ ynoise  : num [1:9] 100 100 100 100 1300 1300 1300 1300 580
#>  $ numnoise: int 9
#>  $ idnoise : int [1:9] 37 38 39 40 41 42 43 44 47
#>  $ filter  : chr "Condensed Nearest Neighbors"
#>  $ param   :List of 1
#>   ..$ t: num 0.2
#>  $ call    : language regCNN(x = rock[, -ncol(rock)], y = rock[, ncol(rock)])
#>  - attr(*, "class")= chr "rfdata"
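The index elements can be used to look up the affected rows in the original data. The sketch below assumes, as the output above suggests, that idnoise and idclean are row positions in the dataset passed to the filter; the names noisy_rows and clean_rows are illustrative.

```r
library(rgnoisefilt)

data(rock)
set.seed(9)
out <- regCNN(x = rock[, -ncol(rock)], y = rock[, ncol(rock)])

# Rows of the original data that the filter flagged as noisy
noisy_rows <- rock[out$idnoise, ]

# Rows of the original data that the filter kept
clean_rows <- rock[out$idclean, ]

# The two subsets should have the sizes reported by the filter
stopifnot(nrow(noisy_rows) == out$numnoise,
          nrow(clean_rows) == out$numclean)

noisy_rows
```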

In order to display the results of the rfdata class in a friendly way in the R console, two specific print and summary functions are implemented. The print function presents the basic information of the noise filtering process:

print(out.def)
#> 
#> ## Noise filter: 
#> Condensed Nearest Neighbors
#> 
#> ## Parameters:
#> - t = 0.2
#> 
#> ## Number of noisy and clean samples:
#> - Noisy samples: 9/48 (18.75%)
#> - Clean samples: 39/48 (81.25%)
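The counts and percentages in this output can also be recomputed from the elements of the returned object, which is convenient when they are needed programmatically. This is a sketch of the arithmetic only, not of how print is implemented; the names total, pct_noise and pct_clean are illustrative.

```r
library(rgnoisefilt)

data(rock)
set.seed(9)
out <- regCNN(x = rock[, -ncol(rock)], y = rock[, ncol(rock)])

# Total number of samples processed by the filter
total <- out$numnoise + out$numclean

# Share of noisy and clean samples, in percent
pct_noise <- 100 * out$numnoise / total
pct_clean <- 100 * out$numclean / total

cat(sprintf("Noisy samples: %d/%d (%.2f%%)\n",
            as.integer(out$numnoise), as.integer(total), pct_noise))
cat(sprintf("Clean samples: %d/%d (%.2f%%)\n",
            as.integer(out$numclean), as.integer(total), pct_clean))
```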

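The information offered by print consists of the name of the noise filter, the parameters used, and the number and percentage of noisy and clean samples.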

On the other hand, the summary function displays the information of the dataset processed with the noise filter along with other additional details. This function can be called by typing the following R command:

summary(out.frm, showid = TRUE)
#> 
#> ########################################################
#>  Noise filtering process: Summary
#> ########################################################
#> 
#> ## Original call:
#> regCNN(formula = perm ~ ., data = rock)
#> 
#> ## Noise filter: 
#> Condensed Nearest Neighbors
#> 
#> ## Parameters:
#> - t = 0.2
#> 
#> ## Number of noisy and clean samples:
#> - Noisy samples: 9/48 (18.75%)
#> - Clean samples: 39/48 (81.25%)
#> 
#> ## Indices of noisy samples:
#> 37, 38, 39, 40, 41, 42, 43, 44, 47

The information offered by this function is as follows: