What's the best way to strip punctuation from a string in Python?

First, is there a convenient way to get all punctuation characters? It turns out there is: the standard library's string module exposes them as string.punctuation.

In [1]:
import string

string.punctuation
Out[1]:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

One approach would be to use a generator expression that keeps only the characters not found in a set of punctuation.

In [2]:
PUNCTUATION = set(string.punctuation)
In [3]:
def remove_punctuation_set(s):
    return "".join(c for c in s if c not in PUNCTUATION)
In [4]:
TEST_STRING = "Hello world! How are you? I'm good, hope you are too!"

remove_punctuation_set(TEST_STRING)
Out[4]:
'Hello world How are you Im good hope you are too'

Next, we can look at regular expressions.

re.sub replaces every match of a pattern with a given string. Here, we'd replace each punctuation character with an empty string.

A naive first attempt would be something like re.sub(f"[{string.punctuation}]", "", TEST_STRING). However, this has a couple of issues:

  • Compiling the pattern once and reusing it is faster than passing the pattern string on every call
  • Several punctuation characters (such as \, ], ^ and -) have a special meaning inside a regex and need to be escaped

Luckily, there's the re.escape function which escapes a given string.

In [5]:
import re

re.escape(string.punctuation)
Out[5]:
'!"\\#\\$%\\&\'\\(\\)\\*\\+,\\-\\./:;<=>\\?@\\[\\\\\\]\\^_`\\{\\|\\}\\~'

We can now create our pre-compiled regex. Square brackets in a regex define a character class, which matches any single character listed inside the brackets.

In [6]:
PATTERN = re.compile(f"[{re.escape(string.punctuation)}]")
In [7]:
def remove_punctuation_regex(s):
    return PATTERN.sub("", s)
In [8]:
remove_punctuation_regex(TEST_STRING)
Out[8]:
'Hello world How are you Im good hope you are too'
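To see why the escaping matters, here is a small sketch comparing the naive pattern with the escaped one. The names NAIVE, ESCAPED and sample are made up for this illustration. On the input below, the unescaped pattern happens to compile, but it leaves backslashes in place because the backslash inside the character class is read as an escape for the following bracket.

```python
import re
import string

# Naive pattern: punctuation dropped into a character class unescaped.
NAIVE = re.compile(f"[{string.punctuation}]")
# Escaped pattern, built with re.escape as in the cells above.
ESCAPED = re.compile(f"[{re.escape(string.punctuation)}]")

sample = r"C:\temp\file.txt"  # contains backslashes, a colon and a dot

naive_result = NAIVE.sub("", sample)      # the backslashes survive
escaped_result = ESCAPED.sub("", sample)  # the backslashes are removed

print(naive_result)
print(escaped_result)
```

The colon and the dot are stripped in both cases; only the backslash handling differs.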

Finally, there's a little-known approach using the translate method on strings. This transforms a string using a mapping (e.g. a dictionary) whose keys are ordinals and whose values are ordinals, strings, or None. Python refers to these mappings as "translation tables".

What do we mean by ordinals? An ordinal is the integer code point that Unicode assigns to a character. We can get the ordinal for a character using the built-in ord function.

In [9]:
ord("a"), ord("b"), ord("c")
Out[9]:
(97, 98, 99)

We can also go from ordinals back into characters using the built-in chr function.

In [10]:
chr(97), chr(98), chr(99)
Out[10]:
('a', 'b', 'c')

An example of using translate:

In [11]:
"hebbo worbd!".translate(
    {
        ord("b"): ord("l"),
        ord("!"): None,  # mapping to None removes the character
    }
)
Out[11]:
'hello world'

So, all we have to do is create a mapping from every punctuation character to None, like so:

In [12]:
PUNCTUATION_MAPPING = {ord(c): None for c in string.punctuation}


def remove_punctuation_translate(s):
    return s.translate(PUNCTUATION_MAPPING)
In [13]:
remove_punctuation_translate(TEST_STRING)
Out[13]:
'Hello world How are you Im good hope you are too'

There's also a helpful static method on str, maketrans, that creates the mappings (aka translation tables) for us:

In [14]:
table = str.maketrans("b", "l", "!")

"hebbo worbd!".translate(table)  # the trailing "!" exercises the third argument
Out[14]:
'hello world'

The first argument of maketrans contains the characters you want to replace, and the second contains their replacements: the i-th character of the first argument is replaced by the i-th character of the second, so both strings must have the same length. The optional third argument lists the characters we want to remove.
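As a slightly larger illustration of the positional pairing, the sketch below maps three characters at once and deletes two others. The names demo_table, result and length_error are invented for this example.

```python
# Pairwise mapping: a -> x, b -> y, c -> z; "!" and "?" are deleted.
demo_table = str.maketrans("abc", "xyz", "!?")
result = "cab!?".translate(demo_table)

print(demo_table)  # a dict keyed by ordinals
print(result)

# The first two arguments must be the same length, otherwise
# maketrans raises a ValueError.
try:
    str.maketrans("ab", "x")
    length_error = None
except ValueError as exc:
    length_error = exc

print(repr(length_error))
```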

By having the first and second argument be empty strings and using string.punctuation as the third argument we can replicate our PUNCTUATION_MAPPING dictionary.

In [15]:
assert str.maketrans("", "", string.punctuation) == PUNCTUATION_MAPPING
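Before timing anything, it may be worth confirming that the three approaches agree on an input containing every punctuation character. The sketch below redefines the helpers under hypothetical names (strip_set, strip_regex, strip_translate) so that it runs on its own.

```python
import re
import string

PUNCT_SET = set(string.punctuation)
PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_set(s):
    return "".join(c for c in s if c not in PUNCT_SET)


def strip_regex(s):
    return PUNCT_RE.sub("", s)


def strip_translate(s):
    return s.translate(PUNCT_TABLE)


# Every punctuation character sandwiched between two letters.
check_input = "a" + string.punctuation + "b"
outputs = {f.__name__: f(check_input) for f in (strip_set, strip_regex, strip_translate)}

print(outputs)
assert len(set(outputs.values())) == 1  # all three agree
```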

We can now benchmark each of our approaches using timeit.

In [16]:
import timeit

n = 1_000_000

set_time = timeit.timeit(
    "remove_punctuation_set(TEST_STRING)",
    globals=globals(),
    number=n,
)

regex_time = timeit.timeit(
    "remove_punctuation_regex(TEST_STRING)",
    globals=globals(),
    number=n,
)

translate_time = timeit.timeit(
    "remove_punctuation_translate(TEST_STRING)",
    globals=globals(),
    number=n,
)

The results:

In [17]:
print(f"set      : {set_time}")
print(f"regex    : {regex_time}")
print(f"translate: {translate_time}")
set      : 2.3104199539998262
regex    : 0.9609262190001573
translate: 0.8181181009999818
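A single timeit run can be affected by whatever else the machine is doing. One common way to reduce that noise, sketched here under the assumption that the minimum of several runs is the most representative figure, is timeit.repeat. The names PUNCT_TABLE, strip_translate, runs and best are illustrative, and the iteration counts are kept small so the cell finishes quickly.

```python
import string
import timeit

PUNCT_TABLE = str.maketrans("", "", string.punctuation)
text = "Hello world! How are you? I'm good, hope you are too!"


def strip_translate(s):
    return s.translate(PUNCT_TABLE)


# Five independent runs of 100,000 calls each.
runs = timeit.repeat(lambda: strip_translate(text), number=100_000, repeat=5)
best = min(runs)  # the run least disturbed by other activity

print(f"translate (best of {len(runs)}): {best:.4f}s")
```

The same pattern applies to the set and regex versions; absolute numbers will vary by machine and Python version.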

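Conclusion: in this benchmark, both regex and translate were more than twice as fast as the set-based approach. Either is fine, with translate being slightly faster (roughly 15% here).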