Python split string on whitespace and punctuation

I'm trying to split a string up into words and punctuation, adding the punctuation to the list produced by the split.

For instance:

>>> c = "help, me"
>>> print c.split()
['help,', 'me']

What I really want the list to look like is:

['help', ',', 'me']

So, I want the string split at whitespace with the punctuation split from the words.

I've tried to parse the string first and then run the split:

>>> separatedPunctuation = ''
>>> for character in c:
...     if character in ".,;!?":
...         outputCharacter = " %s" % character
...     else:
...         outputCharacter = character
...     separatedPunctuation += outputCharacter
>>> print separatedPunctuation
help , me
>>> print separatedPunctuation.split()
['help', ',', 'me']

This produces the result I want, but is painfully slow on large files.

Is there a way to do this more efficiently?

Fionnuala


asked Dec 14, 2008 at 23:30


This is more or less the way to do it:

>>> import re
>>> re.findall(r"[\w']+|[.,!?;]", "Hello, I'm a string!")
['Hello', ',', "I'm", 'a', 'string', '!']

The trick is not to think about where to split the string, but what to include in the tokens.

Caveats:

  • The underscore (_) is considered an inner-word character. Replace \w if you don't want that.
  • This will not work with (single) quotes in the string.
  • Put any additional punctuation marks you want to use in the right half of the regular expression.
  • Anything not explicitly mentioned in the re is silently dropped.
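The first and third caveats can be worked around by widening both halves of the pattern. A sketch (my extension, not the answer's original pattern): hyphens stay inside words, and a few more marks join the punctuation class.

```python
import re

# Extended version of the answer's pattern (an assumed variant):
# hyphens and apostrophes are inner-word characters, and double
# quotes, colons, and parentheses count as punctuation tokens.
pattern = re.compile(r"[\w'-]+|[.,!?;:\"()]")

print(pattern.findall("Hello, I'm a well-known string!"))
# ['Hello', ',', "I'm", 'a', 'well-known', 'string', '!']
```

Anything still unmentioned in either half (e.g. `@` or `#`) is silently dropped, exactly as the last caveat warns.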

answered Dec 15, 2008 at 1:53


Here is a Unicode-aware version:

re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

The first alternative catches sequences of word characters (as defined by Unicode, so "résumé" won't turn into ['r', 'sum']); the second catches individual non-word characters, ignoring whitespace.

Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.
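A quick sketch of that pattern in action (in Python 3, str patterns are already Unicode-aware, so re.UNICODE is redundant but harmless):

```python
import re

# \w+ matches runs of word characters (Unicode-aware, so "résumé"
# stays whole); [^\w\s] matches each non-word, non-space character.
tokens = re.findall(r"\w+|[^\w\s]", "I'm reading a résumé!", re.UNICODE)
print(tokens)  # ['I', "'", 'm', 'reading', 'a', 'résumé', '!']
```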

answered Jan 19, 2012 at 17:58

LaC


If you are going to work in English (or some other common languages), you can use NLTK (there are many other tools to do this such as FreeLing).

import nltk
nltk.download('punkt')
sentence = "help, me"
nltk.word_tokenize(sentence)

sh37211


answered Nov 8, 2018 at 16:16


Here's my entry.

I have my doubts as to how well this will hold up in the sense of efficiency, or if it catches all cases (note the "!!!" grouped together; this may or may not be a good thing).

>>> import re
>>> import string
>>> s = "Helo, my name is Joe! and i live!!! in a button; factory:"
>>> l = [item for item in map(string.strip, re.split(r"(\W+)", s)) if len(item) > 0]
>>> l
['Helo', ',', 'my', 'name', 'is', 'Joe', '!', 'and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']
>>>

One obvious optimization would be to compile the regex beforehand (using re.compile) if you're going to be doing this on a line-by-line basis.
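A sketch of that optimization, assuming the same capturing-group pattern: the compiled object is built once and reused per line.

```python
import re

# Compile once; the capturing group (\W+) keeps the delimiters
# in re.split's output so punctuation survives the split.
splitter = re.compile(r"(\W+)")

for line in ["Helo, my name is Joe!", "and i live!!! in a button; factory:"]:
    # Strip surrounding whitespace from each piece and drop empties.
    parts = [p.strip() for p in splitter.split(line) if p.strip()]
    print(parts)
```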

answered Dec 15, 2008 at 1:30


Here's a minor update to your implementation. If you're trying to do anything more detailed, I suggest looking into the NLTK that le dorfier suggested.

This might only be a little faster, since ''.join() is used in place of +=.

import string

d = "Hello, I'm a string!"

result = []
word = ''

for char in d:
    if char not in string.whitespace:
        if char not in string.ascii_letters + "'":
            if word:
                result.append(word)
            result.append(char)
            word = ''
        else:
            word = ''.join((word, char))
    else:
        if word:
            result.append(word)
            word = ''

if word:  # flush a trailing word
    result.append(word)

print result
['Hello', ',', "I'm", 'a', 'string', '!']
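The usual way to avoid repeated string concatenation is to buffer characters in a list and join once per word. A Python 3 sketch of the same loop (an assumed variant, not the answer's original code):

```python
import string

def tokenize(text):
    # Same logic as the loop above, but characters are buffered in a
    # list and joined once per completed word.
    result, word = [], []
    for char in text:
        if char in string.whitespace:
            if word:
                result.append(''.join(word))
                word = []
        elif char in string.ascii_letters + "'":
            word.append(char)
        else:
            if word:
                result.append(''.join(word))
                word = []
            result.append(char)
    if word:  # flush a trailing word
        result.append(''.join(word))
    return result

print(tokenize("Hello, I'm a string!"))
# ['Hello', ',', "I'm", 'a', 'string', '!']
```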

answered Dec 15, 2008 at 1:05

monkut



This worked for me:

import re

i = 'Sandra went to the hallway.!!'
l = re.split(r'(\W+?)', i)
print(l)

empty = ['', ' ']
l = [el for el in l if el not in empty]
print(l)

Output:
['Sandra', ' ', 'went', ' ', 'to', ' ', 'the', ' ', 'hallway', '.', '', '!', '', '!', '']
['Sandra', 'went', 'to', 'the', 'hallway', '.', '!', '!']

answered Apr 21, 2020 at 8:41

Malgo


I think you can find all the help you can imagine in the NLTK, especially since you are using Python. There's a good comprehensive discussion of this issue in the tutorial.

answered Dec 15, 2008 at 0:34

dkretzdkretz

37.1k13 gold badges79 silver badges136 bronze badges

I came up with a way to tokenize all words and \W+ patterns using \b, which doesn't need hardcoding:

>>> import re
>>> sentence = 'Hello, world!'
>>> tokens = [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', sentence)]
>>> tokens
['Hello', ',', 'world', '!']

Here .*?\S.*? is a pattern matching anything that is not a space, and $ is added to match the last token in the string if it's a punctuation symbol.

Note the following though -- this will group punctuation that consists of more than one symbol:

>>> print [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', '"Oh no", she said')]
['Oh', 'no', '",', 'she', 'said']

Of course, you can find and split such groups with:

>>> for token in [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', '"You can", she said')]:
...     print re.findall(r'(?:\w+|\W)', token)

['You']
['can']
['"', ',']
['she']
['said']

answered Apr 15, 2014 at 19:11

FrauHahnhen


Try this:

string_big = "One of Python's coolest features is the string format operator  This operator is unique to strings"
my_list = []
x = len(string_big)
position_of_space = 0
while position_of_space < x:
    for i in range(position_of_space, x):
        if string_big[i] == ' ':
            break
        else:
            continue
    print string_big[position_of_space:(i + 1)]
    my_list.append(string_big[position_of_space:(i + 1)])
    position_of_space = i + 1

print my_list

Aurasphere


answered Apr 18, 2017 at 9:03

Have you tried using a regex?

//docs.python.org/library/re.html#re-syntax

By the way, why do you need the "," in the second one? You will know that it comes after each piece of text, i.e.

[0]

","

[1]

","

So if you want to add the "," you can just do it after each iteration when you use the array.
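One concrete form the regex suggestion could take (a sketch, not from this answer): re.split with a capturing group around the punctuation class keeps the marks in the output, while uncaptured whitespace separators are filtered away.

```python
import re

# Split on captured punctuation or on whitespace; captured marks come
# back as list items, the whitespace branch yields ''/None entries
# which the comprehension filters out.
parts = [p for p in re.split(r"([.,;!?])|\s+", "help, me") if p]
print(parts)  # ['help', ',', 'me']
```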

answered Dec 14, 2008 at 23:34

Filip Ekberg


In case you are not allowed to import anything, use this!

word = "Hello,there"
word = word.replace(",", " ,")
word = word.replace(".", " .")
print(word.split())

Kosuke Sakai


answered Nov 27, 2019 at 9:14
