python - Preserve empty lines with NLTK's Punkt Tokenizer
I'm using NLTK's Punkt sentence tokenizer to split a file into a list of sentences, and would like to preserve the empty lines within the file:
from nltk import data

tokenizer = data.load('tokenizers/punkt/english.pickle')
s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"
sentences = tokenizer.tokenize(s)
print(sentences)
I would like this to print:
['That was a very loud beep.\n\n', "I don't even know\n if this is working.", 'Mark?\n\n', 'Mark are you there?\n\n\n']
But the content that's actually printed shows that the trailing empty lines have been removed from the first and third sentences:
['That was a very loud beep.', "I don't even know\n if this is working.", 'Mark?', 'Mark are you there?\n\n\n']
Other tokenizers in NLTK have a blanklines='keep' parameter, but I don't see any such option in the case of the Punkt tokenizer. It's possible I'm missing something simple. Is there a way to retain these trailing empty lines using the Punkt sentence tokenizer? I'd be grateful for any insights others can offer!
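For comparison, this is the kind of behavior I mean; a small sketch using LineTokenizer, which does accept that parameter (it splits on lines rather than sentences, so it's not a substitute for Punkt):

from nltk.tokenize import LineTokenizer

s = "That was a very loud beep.\n\n I don't even know\n if this is working.\n"
# blanklines='keep' retains empty lines as '' entries instead of dropping them
print(LineTokenizer(blanklines='keep').tokenize(s))
# ['That was a very loud beep.', '', " I don't even know", ' if this is working.']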
The problem
Sadly, you can't make the tokenizer keep the blank lines, not the way it is written.
Starting from the tokenize() method and following the function calls through span_tokenize() and _slices_from_text(), you can see there is a condition
if match.group('next_tok'):
that is designed to ensure the tokenizer skips whitespace until the next possible sentence-starting token occurs. Looking at the regex this condition refers to, we end up at _period_context_fmt, where we see that the next_tok named group is preceded by \s+, so blank lines will never be captured.
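You can watch this happen directly; here is a minimal sketch (assuming the punkt model has been downloaded) that prints the character spans Punkt assigns to each sentence, and the blank lines fall into the gaps between consecutive spans:

from nltk import data

tokenizer = data.load('tokenizers/punkt/english.pickle')
s = "That was a very loud beep.\n\n I don't even know\n if this is working."

# Each span ends at the sentence boundary; the whitespace before the next
# sentence is skipped, landing in the gap between consecutive spans.
for start, end in tokenizer.span_tokenize(s):
    print((start, end), repr(s[start:end]))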
The solution
Break it down, change the part you don't like, and reassemble a custom solution.
Now, this regex lives in the PunktLanguageVars class, which is itself used to initialize the PunktSentenceTokenizer class. We have to derive a custom class from PunktLanguageVars and fix the regex to be the way we want it.
The fix we want is to include trailing newlines at the end of a sentence, so I suggest replacing _period_context_fmt, going from this:
_period_context_fmt = r"""
    \S*                          # some word material
    %(SentEndChars)s             # a potential sentence ending
    (?=(?P<after_tok>
        %(NonWord)s              # either other punctuation
        |
        \s+(?P<next_tok>\S+)     # or whitespace and some other token
    ))"""
to this:
_period_context_fmt = r"""
    \S*                          # some word material
    %(SentEndChars)s             # a potential sentence ending
    \s*                          # <-- THIS is what changed
    (?=(?P<after_tok>
        %(NonWord)s              # either other punctuation
        |
        (?P<next_tok>\S+)        # <-- normally you would have \s+ here
    ))"""
Now the tokenizer, using this regex instead of the old one, will include 0 or more \s characters after the end of a sentence.
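To see the effect of the change in isolation, here is a small sketch using plain re, with simplified stand-ins for the %(SentEndChars)s and %(NonWord)s placeholders (PunktLanguageVars substitutes the real values; the character classes below are illustrative, not NLTK's actual definitions):

import re

OLD = re.compile(r"""
    \S*                       # some word material
    [.?!]                     # a potential sentence ending
    (?=(?P<after_tok>
        [)"\]]                # either other punctuation (simplified)
        |
        \s+(?P<next_tok>\S+)  # or whitespace and some other token
    ))""", re.VERBOSE)

NEW = re.compile(r"""
    \S*                       # some word material
    [.?!]                     # a potential sentence ending
    \s*                       # <-- the change: consume trailing whitespace
    (?=(?P<after_tok>
        [)"\]]                # either other punctuation (simplified)
        |
        (?P<next_tok>\S+)     # <-- the \s+ no longer sits in the lookahead
    ))""", re.VERBOSE)

s = "That was a very loud beep.\n\n Mark?"
print(repr(OLD.search(s).group()))  # 'beep.'       -- blank lines skipped
print(repr(NEW.search(s).group()))  # 'beep.\n\n '  -- blank lines included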
The whole script
import nltk.tokenize.punkt as pkt

class CustomLanguageVars(pkt.PunktLanguageVars):

    _period_context_fmt = r"""
        \S*                          # some word material
        %(SentEndChars)s             # a potential sentence ending
        \s*                          # <-- THIS is what changed
        (?=(?P<after_tok>
            %(NonWord)s              # either other punctuation
            |
            (?P<next_tok>\S+)        # <-- normally you would have \s+ here
        ))"""

custom_tknzr = pkt.PunktSentenceTokenizer(lang_vars=CustomLanguageVars())

s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"

print(custom_tknzr.tokenize(s))
This outputs:
['That was a very loud beep.\n\n ', "I don't even know\n if this is working. ", 'Mark?\n\n ', 'Mark are you there?\n\n\n']
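Since the trailing whitespace is now preserved, the pieces concatenate back to the original string, which is an easy sanity check that nothing was dropped:

# The sentences partition the input: joining them restores it exactly.
assert "".join(custom_tknzr.tokenize(s)) == s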