I'm trying to split a string up into words and punctuation, adding the punctuation to the list produced by the split.
For instance:
>>> c = "help, me"
>>> print(c.split())
['help,', 'me']
What I really want the list to look like is:
['help', ',', 'me']
So, I want the string split at whitespace with the punctuation split from the words.
I've tried to parse the string first and then run the split:
>>> separatedPunctuation = ""
>>> for character in c:
...     if character in ".,;!?":
...         outputCharacter = " %s" % character
...     else:
...         outputCharacter = character
...     separatedPunctuation += outputCharacter
...
>>> print(separatedPunctuation)
help , me
>>> print(separatedPunctuation.split())
['help', ',', 'me']
This produces the result I want, but is painfully slow on large files.
Is there a way to do this more efficiently?
Fionnuala
asked Dec 14, 2008 at 23:30
This is more or less the way to do it:
>>> import re
>>> re.findall(r"[\w']+|[.,!?;]", "Hello, I'm a string!")
['Hello', ',', "I'm", 'a', 'string', '!']
The trick is not to think about where to split the string, but about what to include in the tokens.
Caveats:
- The underscore (_) is considered an inner-word character. Replace \w if you don't want that.
- This will not work with (single) quotes in the string.
- Put any additional punctuation marks you want to use in the right half of the regular expression.
- Anything not explicitly mentioned in the regular expression is silently dropped.
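The caveats above are easy to check. A minimal sketch (the names `TOKEN_RE` and `tokenize` are only illustrative, not part of the answer) showing that the underscore stays inside a word and that characters outside both alternatives simply vanish:

```python
import re

# Illustrative names; the pattern itself is the one from the answer.
TOKEN_RE = re.compile(r"[\w']+|[.,!?;]")

def tokenize(text):
    """Return words (apostrophes included) and the listed punctuation marks."""
    return TOKEN_RE.findall(text)

print(tokenize("Hello, I'm a string!"))
# ['Hello', ',', "I'm", 'a', 'string', '!']

# ':' '(' ')' and '#' are in neither alternative, so they are dropped.
print(tokenize("snake_case: (dropped) #too"))
# ['snake_case', 'dropped', 'too']
```

To keep a mark such as ':' you would add it to the second character class.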
answered Dec 15, 2008 at 1:53
Here is a Unicode-aware version:
re.findall(r"\w+|[^\w\s]", text, re.UNICODE)
The first alternative catches sequences of word characters (as defined by Unicode, so "résumé" won't turn into ['r', 'sum']); the second catches individual non-word characters, ignoring whitespace.
Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.
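For what it's worth, here is a small sketch of the same pattern on Python 3, where str patterns are Unicode-aware by default and the re.UNICODE flag is only needed on Python 2. The function name `tokenize_unicode` is just an illustration:

```python
import re

def tokenize_unicode(text):
    """Split into runs of word characters and single non-word, non-space characters."""
    # On Python 3, \w already matches accented letters in str patterns.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize_unicode("Mon résumé, s'il vous plaît!"))
# ['Mon', 'résumé', ',', 's', "'", 'il', 'vous', 'plaît', '!']
```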
answered Jan 19, 2012 at 17:58
LaC
If you are going to work in English (or some other common languages), you can use NLTK (there are many other tools for this, such as FreeLing).
import nltk
nltk.download('punkt')
sentence = "help, me"
nltk.word_tokenize(sentence)
sh37211
answered Nov 8, 2018 at 16:16
Here's my entry.
I have my doubts as to how well this will hold up in the sense of efficiency, or if it catches all cases (note the "!!!" grouped together; this may or may not be a good thing).
>>> import re
>>> s = "Helo, my name is Joe! and i live!!! in a button; factory:"
>>> l = [item.strip() for item in re.split(r"(\W+)", s) if item.strip()]
>>> l
['Helo', ',', 'my', 'name', 'is', 'Joe', '!', 'and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']
>>>
One obvious optimization would be to compile the regex beforehand (using re.compile) if you're going to be doing this on a line-by-line basis.
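A rough sketch of that precompiling suggestion, assuming a capturing-group split on runs of non-word characters. The names `SPLIT_RE` and `tokenize_lines` are made up for illustration:

```python
import re

# Compile once, outside the loop. The capturing group makes re.split
# return the separators (punctuation and whitespace) as well.
SPLIT_RE = re.compile(r"(\W+)")

def tokenize_lines(lines):
    """Yield a list of tokens for each line in an iterable of strings."""
    for line in lines:
        parts = (part.strip() for part in SPLIT_RE.split(line))
        # Drop the pieces that were only whitespace (or empty).
        yield [part for part in parts if part]

sample = ["Helo, my name is Joe!", "and i live!!! in a button; factory:"]
for tokens in tokenize_lines(sample):
    print(tokens)
# ['Helo', ',', 'my', 'name', 'is', 'Joe', '!']
# ['and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']
```

Since `tokenize_lines` is a generator, it can be fed an open file object directly without reading the whole file into memory.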
answered Dec 15, 2008 at 1:30
Here's a minor update to your implementation. If you're trying to do anything more detailed, I suggest looking into the NLTK that le dorfier suggested.
This might only be a little faster, since ''.join() is used in place of +=, which is known to be faster.
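import string
d = "Hello, I'm a string!"
result = []
word = ''
for char in d:
    if char not in string.whitespace:
        if char not in string.ascii_letters + "'":
            # Punctuation: flush the current word, then keep the mark itself.
            if word:
                result.append(word)
            result.append(char)
            word = ''
        else:
            word = ''.join([word, char])
    else:
        # Whitespace ends the current word.
        if word:
            result.append(word)
            word = ''
# Flush a word that runs to the end of the string.
if word:
    result.append(word)
print(result)
['Hello', ',', "I'm", 'a', 'string', '!']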
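If join is meant to avoid repeated string concatenation, one variation is to collect the characters of each word in a list and join only once per word. This is just a sketch of that idea, not the answer's code, and `tokenize_chars` is an illustrative name:

```python
import string

def tokenize_chars(text):
    """Split text into words (ASCII letters and apostrophes) and single marks."""
    result = []
    chars = []  # characters of the word currently being built
    for char in text:
        if char in string.ascii_letters or char == "'":
            chars.append(char)
            continue
        # Any other character ends the current word.
        if chars:
            result.append("".join(chars))
            chars = []
        # Keep punctuation as its own token; skip whitespace.
        if char not in string.whitespace:
            result.append(char)
    if chars:  # flush a word that runs to the end of the text
        result.append("".join(chars))
    return result

print(tokenize_chars("Hello, I'm a string"))
# ['Hello', ',', "I'm", 'a', 'string']
```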
answered Dec 15, 2008 at 1:05
monkut
This worked for me:
import re
i = 'Sandra went to the hallway.!!'
l = re.split(r'(\W+?)', i)
print(l)
empty = ['', ' ']
l = [el for el in l if el not in empty]
print(l)
Output:
['Sandra', ' ', 'went', ' ', 'to', ' ', 'the', ' ', 'hallway', '.', '', '!', '', '!', '']
['Sandra', 'went', 'to', 'the', 'hallway', '.', '!', '!']
answered Apr 21, 2020 at 8:41
Malgo
I think you can find all the help you can imagine in the NLTK, especially since you are using Python. There's a good, comprehensive discussion of this issue in the tutorial.
answered Dec 15, 2008 at 0:34
dkretz
I came up with a way to tokenize all words and \W+ patterns using \b which doesn't need hardcoding:
>>> import re
>>> sentence = 'Hello, world!'
>>> tokens = [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', sentence)]
>>> tokens
['Hello', ',', 'world', '!']
Here .*?\S.*? matches a run containing at least one non-space character, and $ is added to match the last token in the string if it is a punctuation symbol.
Note the following though -- this will group punctuation that consists of more than one symbol:
>>> print([t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', '"Oh no", she said')])
['Oh', 'no', '",', 'she', 'said']
Of course, you can find and split such groups with:
>>> for token in [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', '"You can", she said')]:
...     print(re.findall(r'(?:\w+|\W)', token))
...
['You']
['can']
['"', ',']
['she']
['said']
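Putting the two passes together gives a single flat list. A minimal sketch, where `CHUNK_RE`, `PIECE_RE` and `tokenize` are names I made up for illustration:

```python
import re

# First pass: chunks that start at a word boundary.
CHUNK_RE = re.compile(r"\b.*?\S.*?(?:\b|$)")
# Second pass: split a chunk into words and single non-word characters.
PIECE_RE = re.compile(r"\w+|\W")

def tokenize(sentence):
    """Return a flat list of words and individual punctuation marks."""
    tokens = []
    for chunk in CHUNK_RE.findall(sentence):
        tokens.extend(PIECE_RE.findall(chunk.strip()))
    return tokens

print(tokenize('"You can", she said'))
# ['You', 'can', '"', ',', 'she', 'said']
```

Note that the opening quote is missing from the result: there is no \b before a punctuation mark at the very start of the string, so the first pass never picks it up.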
answered Apr 15, 2014 at 19:11
FrauHahnhen
Try this:
string_big = "One of Python's coolest features is the string format operator This operator is unique to strings"
my_list = []
x = len(string_big)
position_of_space = 0
while position_of_space < x:
    # Scan forward to the next space (or to the end of the string).
    for i in range(position_of_space, x):
        if string_big[i] == ' ':
            break
    # Each slice keeps its trailing space.
    print(string_big[position_of_space:i + 1])
    my_list.append(string_big[position_of_space:i + 1])
    position_of_space = i + 1
print(my_list)
Aurasphere
answered Apr 18, 2017 at 9:03
Have you tried using a regex?
//docs.python.org/library/re.html#re-syntax
By the way, why do you need the "," as the second element? You already know it comes after each piece of text, i.e.
[0]
","
[1]
","
So if you want to add the ",", you can just do it after each iteration when you use the array.
answered Dec 14, 2008 at 23:34
Filip Ekberg
In case you are not allowed to import anything, use this!
word = "Hello,there"
# Pad both sides so a mark with no following space is still separated.
word = word.replace(",", " , ")
word = word.replace(".", " . ")
print(word.split())
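The same idea extends to any set of marks. A minimal sketch, with `split_punctuation` and its `marks` parameter as illustrative names. It pads both sides of each mark so that input with no space after the punctuation, such as "Hello,there", still separates:

```python
def split_punctuation(text, marks=".,;!?"):
    """Pad each punctuation mark with spaces, then split on whitespace."""
    for mark in marks:
        text = text.replace(mark, " " + mark + " ")
    return text.split()

print(split_punctuation("Hello,there. How are you?"))
# ['Hello', ',', 'there', '.', 'How', 'are', 'you', '?']
```

Each replace call makes a full pass over the string, so for a long list of marks on large files a regex is probably the better choice.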
Kosuke Sakai
answered Nov 27, 2019 at 9:14