Python split string on whitespace and punctuation

I'm trying to split a string up into words and punctuation, adding the punctuation to the list produced by the split.

For instance:

>>> c = "help, me"
>>> print(c.split())
['help,', 'me']

What I really want the list to look like is:

['help', ',', 'me']

So, I want the string split at whitespace with the punctuation split from the words.

I've tried to parse the string first and then run the split:

>>> separatedPunctuation = ""
>>> for character in c:
...     if character in ".,;!?":
...         outputCharacter = " %s" % character
...     else:
...         outputCharacter = character
...     separatedPunctuation += outputCharacter
...
>>> print(separatedPunctuation)
help , me
>>> print(separatedPunctuation.split())
['help', ',', 'me']

This produces the result I want, but is painfully slow on large files.

Is there a way to do this more efficiently?

Fionnuala

asked Dec 14, 2008 at 23:30

This is more or less the way to do it:

>>> import re
>>> re.findall(r"[\w']+|[.,!?;]", "Hello, I'm a string!")
['Hello', ',', "I'm", 'a', 'string', '!']

The trick is not to think about where to split the string, but about what to include in the tokens.

Caveats:

  • The underscore (_) is considered an inner-word character. Replace \w, if you don't want that.
  • Single quotes used as quotation marks are not split off, because ' is treated as part of a word.
  • Put any additional punctuation marks you want to use in the right half of the regular expression.
  • Anything not explicitly mentioned in the regular expression is silently dropped.
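The caveats above are easy to check directly. Here is a minimal sketch; the `tokenize` helper and its `punctuation` parameter are names made up for illustration and are not part of the answer:

```python
import re

def tokenize(text, punctuation=".,!?;"):
    # Hypothetical helper: a token is either a run of word characters and
    # apostrophes, or one single character taken from `punctuation`.
    pattern = r"[\w']+|[" + re.escape(punctuation) + "]"
    return re.findall(pattern, text)

print(tokenize("Hello, I'm a string!"))
# ['Hello', ',', "I'm", 'a', 'string', '!']

# The underscore counts as a word character, and the colon is silently
# dropped because it is not listed in `punctuation`.
print(tokenize("snake_case: done"))
# ['snake_case', 'done']

# Listing the colon keeps it as its own token.
print(tokenize("snake_case: done", punctuation=".,!?;:"))
# ['snake_case', ':', 'done']
```

Passing the punctuation set through re.escape() keeps characters such as ] or ^ from being read as regex syntax inside the character class.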

answered Dec 15, 2008 at 1:53

Here is a Unicode-aware version:

re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

The first alternative catches sequences of word characters (as defined by Unicode, so "résumé" won't turn into ['r', 'sum']); the second catches individual non-word characters, ignoring whitespace.

Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.
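On Python 3, str patterns are Unicode-aware by default, so the re.UNICODE flag is redundant there (it is still accepted). A short sketch of the behaviour described above, using a made-up sample sentence:

```python
import re

text = "Mon résumé, c'est fini!"

# \w+ grabs whole words, accented letters included; [^\w\s] grabs any single
# character that is neither a word character nor whitespace.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['Mon', 'résumé', ',', 'c', "'", 'est', 'fini', '!']
```

Because the second alternative is a negated class, no punctuation needs to be listed by hand and nothing except whitespace is dropped.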

answered Jan 19, 2012 at 17:58

LaC

If you are going to work in English (or some other common languages), you can use NLTK (there are many other tools to do this such as FreeLing).

import nltk
nltk.download('punkt')
sentence = "help, me"
nltk.word_tokenize(sentence)

sh37211

answered Nov 8, 2018 at 16:16

Here's my entry.

I have my doubts as to how well this will hold up in the sense of efficiency, or if it catches all cases (note the "!!!" grouped together; this may or may not be a good thing).

>>> import re
>>> s = "Helo, my name is Joe! and i live!!! in a button; factory:"
>>> # Split on runs of non-word characters, keeping them via the capture group,
>>> # then strip whitespace and drop the pieces that end up empty.
>>> l = [item.strip() for item in re.split(r"(\W+)", s) if item.strip()]
>>> l
['Helo', ',', 'my', 'name', 'is', 'Joe', '!', 'and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']

One obvious optimization would be to compile the regex beforehand (using re.compile) if you're going to be doing this on a line-by-line basis.
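A sketch of that precompiled variant follows. The names TOKEN_SPLIT and tokenize_line are invented for illustration, and the gain is likely modest, since the re module already caches recently used patterns:

```python
import re

# Compile once, outside the per-line loop.
TOKEN_SPLIT = re.compile(r"(\W+)")

def tokenize_line(line):
    # Hypothetical helper: split on runs of non-word characters (kept thanks
    # to the capture group), strip whitespace, and drop empty pieces.
    pieces = (piece.strip() for piece in TOKEN_SPLIT.split(line))
    return [piece for piece in pieces if piece]

lines = ["Helo, my name is Joe!", "and i live!!! in a button; factory:"]
for line in lines:
    print(tokenize_line(line))
# ['Helo', ',', 'my', 'name', 'is', 'Joe', '!']
# ['and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']
```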

answered Dec 15, 2008 at 1:30

Here's a minor update to your implementation. If you're trying to do anything more detailed, I suggest looking into the NLTK that le dorfier suggested.

This might only be a little faster since ''.join() is used in place of +=, which is known to be faster.

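import string

d = "Hello, I'm a string!"

result = []
word = ''

for char in d:
    if char not in string.whitespace:
        if char not in string.ascii_letters + "'":
            # Punctuation: flush the word collected so far, then keep the mark.
            if word:
                result.append(word)
            result.append(char)
            word = ''
        else:
            word = ''.join([word, char])
    else:
        # Whitespace ends the current word.
        if word:
            result.append(word)
            word = ''

# Flush a trailing word when the string does not end in punctuation.
if word:
    result.append(word)

print(result)
# ['Hello', ',', "I'm", 'a', 'string', '!']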

answered Dec 15, 2008 at 1:05

monkut

This worked for me

import re

i = 'Sandra went to the hallway.!!'
l = re.split(r'(\W+?)', i)
print(l)

empty = ['', ' ']
l = [el for el in l if el not in empty]
print(l)

Output:
['Sandra', ' ', 'went', ' ', 'to', ' ', 'the', ' ', 'hallway', '.', '', '!', '', '!', '']
['Sandra', 'went', 'to', 'the', 'hallway', '.', '!', '!']

answered Apr 21, 2020 at 8:41

Malgo

I think you can find all the help you can imagine in the NLTK, especially since you are using Python. There's a good comprehensive discussion of this issue in the tutorial.

answered Dec 15, 2008 at 0:34

dkretz

I came up with a way to tokenize all words and \W+ patterns using \b which doesn't need hardcoding:

>>> import re
>>> sentence = 'Hello, world!'
>>> tokens = [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', sentence)]
>>> tokens
['Hello', ',', 'world', '!']

Here .*?\S.*? matches the shortest run of characters that contains at least one non-space character, and $ is added so that the last token is matched even when it is a punctuation symbol.

Note the following though -- this will group punctuation that consists of more than one symbol:

>>> print([t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', '"Oh no", she said')])
['Oh', 'no', '",', 'she', 'said']

Of course, you can find and split such groups with:

>>> for token in [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', '"You can", she said')]:
...     print(re.findall(r'(?:\w+|\W)', token))

['You']
['can']
['"', ',']
['she']
['said']

answered Apr 15, 2014 at 19:11

FrauHahnhen

Try this:

string_big = "One of Python's coolest features is the string format operator  This operator is unique to strings"
my_list = []
x = len(string_big)
position_of_space = 0
while position_of_space < x:
    # Scan forward to the next space (or to the end of the string).
    i = position_of_space
    while i < x and string_big[i] != ' ':
        i += 1
    # Skip the empty token produced by two consecutive spaces.
    if i > position_of_space:
        my_list.append(string_big[position_of_space:i])
    position_of_space = i + 1

print(my_list)

Aurasphere

answered Apr 18, 2017 at 9:03

Have you tried using a regex?

http://docs.python.org/library/re.html#re-syntax


By the way, why do you need the "," as a separate element? You already know that one follows each piece of text, i.e.

[0]

","

[1]

","

So if you want to add the ",", you can just do it after each iteration when you use the array.

answered Dec 14, 2008 at 23:34

Filip Ekberg

In case you are not allowed to import anything, use this:

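def split_words(word):
    # Pad each mark with spaces on both sides so that split() isolates it.
    word = word.replace(",", " , ")
    word = word.replace(".", " . ")
    return word.split()

print(split_words("Hello,there"))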

Kosuke Sakai

answered Nov 27, 2019 at 9:14

How do you split a string by punctuation and spaces in Python?

Use the re.findall() method to split a string into words and punctuation, e.g. result = re.findall(r"[\w'\"]+|[,.!?]", my_str). Strictly speaking, findall() does not split anything: it returns a list of all non-overlapping matches, so words and punctuation marks come back as separate items and the whitespace between them is left out.

How do you split a string with spaces and commas?

To split a string by space or comma, split on the regular expression [, ]+ (written /[, ]+/ as a JavaScript literal; in Python, pass r"[, ]+" to re.split()). The string is split on each run of spaces or commas, and the substrings are returned as a list.
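In Python the same pattern can be passed to re.split(). A minimal sketch with a made-up input string:

```python
import re

my_str = "apples, pears,plums  figs"

# One or more commas and/or spaces count as a single separator.
parts = re.split(r"[, ]+", my_str)
print(parts)
# ['apples', 'pears', 'plums', 'figs']

# A separator at the very start (or end) leaves an empty string behind.
print(re.split(r"[, ]+", ", a"))
# ['', 'a']
```

Unlike the findall() approach used in the answers, this discards the commas instead of keeping them as tokens.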

How do you split a string in whitespace in Python?

The split() method splits a string into a list. You can specify the separator; the default separator is any whitespace. Note: when maxsplit is specified, the list will contain at most that number of elements plus one.
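A few quick calls illustrate these rules; the sample strings are invented for the example:

```python
text = "  one two   three four "

# No separator: split on runs of whitespace, ignoring leading and trailing space.
default_split = text.split()
print(default_split)
# ['one', 'two', 'three', 'four']

# maxsplit=2 performs at most two splits, so at most three items come back;
# the remainder is returned as-is, trailing space included.
limited_split = text.split(maxsplit=2)
print(limited_split)
# ['one', 'two', 'three four ']

# An explicit separator does not merge adjacent delimiters.
comma_split = "a,b,,c".split(",")
print(comma_split)
# ['a', 'b', '', 'c']
```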

Can I split a string by two delimiters Python?

To split a string with multiple delimiters in Python, use the re.split() method. The re.split() function splits the string by each occurrence of the pattern.
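As a small sketch, with a made-up record string, a character class lists the delimiters, and wrapping it in a capture group makes re.split() return the delimiters as well:

```python
import re

record = "name=Joe;city=Oslo;job=button maker"

# Split on either ';' or '='.
fields = re.split(r"[;=]", record)
print(fields)
# ['name', 'Joe', 'city', 'Oslo', 'job', 'button maker']

# With a capture group the delimiters are kept as their own items.
with_delims = re.split(r"([;=])", "a=1;b=2")
print(with_delims)
# ['a', '=', '1', ';', 'b', '=', '2']
```

The capture-group form is the one that matters for the original question, where the punctuation has to stay in the result.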