What's the best way to strip punctuation from a string in Python?
First, is there a convenient way to get all punctuation characters? It turns out there is, via the built-in string module.
import string
string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
One approach would be to use a list comprehension.
PUNCTUATION = set(string.punctuation)
def remove_punctuation_set(s):
    return "".join(c for c in s if c not in PUNCTUATION)
TEST_STRING = "Hello world! How are you? I'm good, hope you are too!"
remove_punctuation_set(TEST_STRING)
'Hello world How are you Im good hope you are too'
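The set here is a deliberate choice: membership tests against a plain string also work, but average-case lookup in a set is O(1) versus a linear scan for a string. A minimal sketch (helper names are hypothetical) showing both give identical results:

```python
import string

PUNCT_STR = string.punctuation       # linear scan per membership test
PUNCT_SET = set(string.punctuation)  # O(1) average-case membership test

def remove_with_str(s):
    return "".join(c for c in s if c not in PUNCT_STR)

def remove_with_set(s):
    return "".join(c for c in s if c not in PUNCT_SET)

print(remove_with_set("Hello, world!"))  # -> Hello world
```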
Next, we can look at regular expressions.
re.sub replaces matches with a given replacement string. Here, we'd replace punctuation characters with the empty string. Naively, we might write re.sub(f"[{string.punctuation}]", "", TEST_STRING), however this has a few issues:
- If we're applying the pattern repeatedly, it's faster to compile it once up front
- Many punctuation characters are regex special characters which need to be escaped
Luckily, there's the re.escape function, which escapes every regex special character in a given string.
import re
re.escape(string.punctuation)
'!"\\#\\$%\\&\'\\(\\)\\*\\+,\\-\\./:;<=>\\?@\\[\\\\\\]\\^_`\\{\\|\\}\\~'
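To see why escaping matters, here's a small sketch using the substring "a-z" rather than the full punctuation set: inside square brackets, an unescaped "-" silently becomes a character range instead of a literal.

```python
import re

chars = "a-z"  # we want to remove the literal characters 'a', '-' and 'z'
unescaped = re.sub(f"[{chars}]", "", "abcz-")           # "[a-z]" is a range: strips a through z
escaped = re.sub(f"[{re.escape(chars)}]", "", "abcz-")  # "[a\-z]" matches only 'a', '-', 'z'
print(unescaped)  # -> -
print(escaped)    # -> bc
```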
We can now create our pre-compiled regex. Square brackets in a regex define a character class: the pattern matches any single character that appears inside the brackets.
PATTERN = re.compile(f"[{re.escape(string.punctuation)}]")
def remove_punctuation_regex(s):
    return PATTERN.sub("", s)
remove_punctuation_regex(TEST_STRING)
'Hello world How are you Im good hope you are too'
Finally, there's a little-known approach using the translate method on strings. This transforms a string using a mapping (e.g. a dictionary) between ordinals (Python calls these mappings "translation tables").
What do we mean by ordinals? An ordinal is the integer value assigned to a character in Unicode, also known as its code point. We can get the ordinal for a character using the built-in ord function.
ord("a"), ord("b"), ord("c")
(97, 98, 99)
We can also go from ordinals back to characters using the built-in chr function.
chr(97), chr(98), chr(99)
('a', 'b', 'c')
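The two functions are inverses of each other, and they work for any Unicode code point, not just ASCII:

```python
# chr and ord round-trip exactly
assert chr(ord("a")) == "a"

# they also cover non-ASCII characters
print(ord("€"))   # -> 8364 (U+20AC)
print(chr(8364))  # -> €
```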
An example of using translate:
"hebbo worbd!".translate(
    {
        ord("b"): ord("l"),
        ord("!"): None,  # mapping to None removes the character
    }
)
'hello world'
So, all we have to do is create a mapping from the ordinal of every punctuation character to None, like so:
PUNCTUATION_MAPPING = {ord(c): None for c in string.punctuation}
def remove_punctuation_translate(s):
    return s.translate(PUNCTUATION_MAPPING)
remove_punctuation_translate(TEST_STRING)
'Hello world How are you Im good hope you are too'
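As a side note, translation-table values aren't limited to single ordinals and None: a value can also be a replacement string of any length, so one character can expand into several. A quick sketch:

```python
# a table value may be a whole string, not just one ordinal
table = {ord("!"): "?!", ord(" "): None}
print("wow! nice!".translate(table))  # -> wow?!nice?!
```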
There's also a helpful function under the str namespace that creates the mappings (aka translation tables) for us:
table = str.maketrans("b", "l", "!")
"hebbo worbd".translate(table)
'hello world'
The first argument of maketrans is a string of characters to replace: the i'th character of the first argument is replaced by the i'th character of the second argument. The optional third argument is a string of characters to remove.
By passing empty strings as the first and second arguments and string.punctuation as the third, we can replicate our PUNCTUATION_MAPPING dictionary.
assert str.maketrans("", "", string.punctuation) == PUNCTUATION_MAPPING
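maketrans can also be called with a single dict argument mapping characters (or ordinals) to their replacements, which it converts into an ordinal-keyed table. Rewriting the earlier translate example this way:

```python
# dict form: keys are single characters, values are replacements (None removes)
table = str.maketrans({"b": "l", "!": None})
print("hebbo worbd!".translate(table))  # -> hello world
```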
We can now benchmark each of our approaches using timeit.
import timeit
n = 1_000_000
set_time = timeit.timeit(
    "remove_punctuation_set(TEST_STRING)",
    globals=globals(),
    number=n,
)
regex_time = timeit.timeit(
    "remove_punctuation_regex(TEST_STRING)",
    globals=globals(),
    number=n,
)
translate_time = timeit.timeit(
    "remove_punctuation_translate(TEST_STRING)",
    globals=globals(),
    number=n,
)
The results:
print(f"set : {set_time}")
print(f"regex : {regex_time}")
print(f"translate: {translate_time}")
set : 2.3104199539998262
regex : 0.9609262190001573
translate: 0.8181181009999818
Conclusion: either regex or translate is fine, with translate being slightly faster.
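One caveat worth noting: string.punctuation covers ASCII punctuation only, so non-ASCII punctuation such as curly quotes survives all three approaches. A small illustration:

```python
import string

s = "don’t stop"  # U+2019 curly apostrophe, not in string.punctuation
table = str.maketrans("", "", string.punctuation)
print(s.translate(table))  # -> don’t stop (unchanged)
```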