16 Text Cleaning

Text responses and other character data are often difficult to work with in their raw form. This is the case with qualitative data and sometimes even with the raw output of your data collection tool (Qualtrics, Google Forms, etc.). What follows are some common cleaning procedures. Most of the examples use functions from the stringr package, so load it first with library(stringr).

16.1 Remove Text From Strings

Say your data output includes “Response:” before each response. You would want the variable to contain only the actual response values. You can use str_remove_all() from the stringr package to remove every match of a specified pattern from a string. Here, the pattern to remove is “Response:”.

library(stringr) # provides str_remove_all() and str_remove()

x = "Response:Apple Juice"
x
#> [1] "Response:Apple Juice"
x = str_remove_all(x, "Response:")
x
#> [1] "Apple Juice"
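
The same call works on a whole column of responses at once, because str_remove_all() is vectorized over its input. A minimal sketch, assuming a hypothetical character vector named responses (not part of the original data):

```r
library(stringr)

# Hypothetical vector of raw responses, made up for illustration
responses = c("Response:Apple Juice", "Response:Water", "Response:Milk")

# The pattern is removed from every element of the vector
cleaned = str_remove_all(responses, "Response:")
cleaned
#> [1] "Apple Juice" "Water"       "Milk"
```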

16.2 Escaping Special Characters

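In many programming languages, dealing with special characters is difficult. str_remove_all() interprets its pattern as a regular expression, in which characters such as {, [, ( and . have a special meaning. The language has trouble deciding whether you are trying to use that special meaning or just refer to the literal character.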

Consider the example below:

x = '{Response:"Apple Juice"}'
x
#> [1] "{Response:\"Apple Juice\"}"
str_remove_all(x, '{')
#> Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement), : Syntax error in regex pattern. (U_REGEX_RULE_SYNTAX, context=`{`)

Instead, you have to do what is called “escaping” a special character. One way to do so in R is to wrap the character in square brackets, which creates a character class that contains only that character.

x = '{Response:"Apple Juice"}'
x
#> [1] "{Response:\"Apple Juice\"}"
str_remove_all(x, '[{]')
#> [1] "Response:\"Apple Juice\"}"

Now this raises the question: what do you do when you want to escape a square bracket itself? You can do so with a double backslash.

x = '[Response:"Apple Juice"]'
x
#> [1] "[Response:\"Apple Juice\"]"
str_remove_all(x, '\\[')
#> [1] "Response:\"Apple Juice\"]"

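Note: A double backslash (\\) can be used to escape any special character, not just brackets. Two backslashes are needed because R itself treats a backslash inside a string as an escape character, so the regular expression only receives one.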
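
As a quick check of that note, the curly brace from the earlier example can be escaped the same way. The sketch below also shows fixed(), a stringr helper that treats the whole pattern as literal text. It is offered here as an alternative, not as the method this chapter relies on.

```r
library(stringr)

x = '{Response:"Apple Juice"}'

# Escape the brace with a double backslash instead of square brackets
escaped = str_remove_all(x, '\\{')
escaped
#> [1] "Response:\"Apple Juice\"}"

# fixed() matches the pattern literally, so no escaping is needed
literal = str_remove_all(x, fixed('{'))
literal
#> [1] "Response:\"Apple Juice\"}"
```

fixed() is convenient when the text to remove contains several special characters, but it cannot be combined with regular expression features such as | in the same pattern.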

16.3 Removing Multiple Strings at Once

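You can remove multiple strings at once by using paste() with collapse='|' to join all the characters or strings you want removed into a single pattern. In a regular expression, | means “or”, so the combined pattern matches any one of the pieces.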

x = 'Response:{"Apple Juice"}'

str_remove_all(x, paste(c('Response', '[:]', '[{]', '["]', '[}]'), 
                        collapse='|'))
#> [1] "Apple Juice"
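
To see what paste() actually hands to str_remove_all(), it can help to build the pattern separately and print it. This is the same example split into two steps. The variable names pattern and cleaned are only for illustration.

```r
library(stringr)

x = 'Response:{"Apple Juice"}'

# paste() joins the pieces with "|" into one regular expression
pattern = paste(c('Response', '[:]', '[{]', '["]', '[}]'), collapse='|')
pattern
#> [1] "Response|[:]|[{]|[\"]|[}]"

# The combined pattern removes every piece in one call
cleaned = str_remove_all(x, pattern)
cleaned
#> [1] "Apple Juice"
```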

16.4 Removing the First Instance

Sometimes, you do not want EVERY instance of a string removed. In the example below, the string “Response” is also part of the participant’s actual response. That second instance should not be removed!

x = 'Response:{"Apple Juice Response"}'

str_remove_all(x, paste(c('Response', '[:]', '[{]', '["]', '[}]'), 
                        collapse='|'))
#> [1] "Apple Juice "

If we removed things as we had been before, we’d lose part of their response! Instead, we can remove only the first instance of a string by using str_remove().

x = 'Response:{"Apple Juice Response"}'

x = str_remove_all(x, paste(c('[:]', '[{]', '["]', '[}]'), 
                            collapse='|'))
x
#> [1] "ResponseApple Juice Response"
str_remove(x, "Response")
#> [1] "Apple Juice Response"

This first gets rid of all the other text strings that are to be removed by using str_remove_all() as before. Then, to deal with the leading “Response”, str_remove() is used, so only the first instance is removed and the one inside the participant’s answer is kept.
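
Removing the first instance is only safe when every response really starts with the prefix. One possible safeguard, sketched here with a hypothetical two-element vector, is to anchor the pattern with ^ so that it matches only at the start of the string:

```r
library(stringr)

# Hypothetical data: the second element has no leading "Response"
x = c("ResponseApple Juice Response", "Apple Juice Response")

# Without an anchor, the first match is removed wherever it occurs
unanchored = str_remove(x, "Response")
unanchored
#> [1] "Apple Juice Response" "Apple Juice "

# With ^, only a "Response" at the very start is removed
anchored = str_remove(x, "^Response")
anchored
#> [1] "Apple Juice Response" "Apple Juice Response"
```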

16.5 Replace Parts of a Response

Sometimes you will want to replace part of a response with something else rather than just remove it. Whereas the str_remove() function simply removes the matched string, base R’s gsub() substitutes every match with something else that you specify!

gsub() takes the form:
gsub(pattern, replacement, x)
where pattern is the string to replace, replacement is what to replace it with, and x is where to look.

x = "foo:bar"

str_remove(x, ":") # Not what you want
#> [1] "foobar"
gsub(":", " ", x) # What you want!
#> [1] "foo bar"
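
gsub() replaces every match, which mirrors str_remove_all() rather than str_remove(). The sketch below compares it with base R’s sub(), which replaces only the first match, and with the stringr functions str_replace() and str_replace_all(). The string y is made up for illustration.

```r
library(stringr)

y = "a:b:c"

# Base R: sub() replaces the first match, gsub() replaces all of them
first_base = sub(":", " ", y)
first_base
#> [1] "a b:c"
all_base = gsub(":", " ", y)
all_base
#> [1] "a b c"

# stringr equivalents, with the string as the first argument
first_str = str_replace(y, ":", " ")
first_str
#> [1] "a b:c"
all_str = str_replace_all(y, ":", " ")
all_str
#> [1] "a b c"
```

Note the argument order: the stringr functions take the string first and the pattern second, while sub() and gsub() take the string last.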