python - Preserve empty lines with NLTK's Punkt Tokenizer


I'm using NLTK's Punkt sentence tokenizer to split a file into a list of sentences, and I would like to preserve the empty lines within the file:

    from nltk import data

    # Load the pretrained English Punkt sentence tokenizer
    tokenizer = data.load('tokenizers/punkt/english.pickle')
    s = "That loud beep.\n\n I don't know\n if working. Mark?\n\n Mark there?\n\n\n"
    sentences = tokenizer.tokenize(s)
    print(sentences)

I would like this to print:

    ['That loud beep.\n\n', "I don't know\n if working.", 'Mark?\n\n', 'Mark there?\n\n\n']

But the content that's actually printed shows that the trailing empty lines have been removed from the first and third sentences:

    ['That loud beep.', "I don't know\n if working.", 'Mark?', 'Mark there?\n\n\n']

Other tokenizers in NLTK have a blanklines='keep' parameter, but I don't see any such option in the case of the Punkt tokenizer. It's possible I'm missing something simple. Is there a way to retain these trailing empty lines using the Punkt sentence tokenizer? I'd be grateful for any insights others can offer!

The problem

Sadly, you can't make the tokenizer keep the blank lines, not with the way it is written.

Starting from the tokenizer's source and following the function calls through span_tokenize() and _slices_from_text(), you can see there is a condition

    if match.group('next_tok'):

that is designed to ensure the tokenizer skips whitespace until the next possible sentence-starting token occurs. Looking for the regex this refers to, we end up at _period_context_fmt, where we see that the next_tok named group is preceded by \s+, so blank lines won't be captured.
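To see where the whitespace goes, here is a minimal sketch using only Python's re module. The values substituted for SentEndChars and NonWord are simplified stand-ins chosen for illustration, not NLTK's exact definitions. It shows that the match ends right after the period, while next_tok starts only after the blank lines, so the whitespace in between belongs to neither.

```python
import re

# Stand-ins for the values NLTK substitutes into the template. These are
# simplified assumptions for illustration, not the library's exact definitions.
SENT_END_CHARS = r"[.?!]"
NON_WORD = r"[?!)\"';:]"

original = re.compile(
    r"""
    \S*                          # some word material
    %(SentEndChars)s             # a potential sentence ending
    (?=(?P<after_tok>
        %(NonWord)s              # either other punctuation
        |
        \s+(?P<next_tok>\S+)     # or whitespace and some other token
    ))"""
    % {"SentEndChars": SENT_END_CHARS, "NonWord": NON_WORD},
    re.VERBOSE,
)

text = "That loud beep.\n\n I don't know"
m = original.search(text)

# The lookahead consumes nothing, so the match stops at the period.
print(repr(m.group()))            # 'beep.'
print(repr(m.group("next_tok")))  # 'I'

# Whitespace between the end of the match and the next token is skipped.
skipped = text[m.end():m.start("next_tok")]
print(repr(skipped))              # '\n\n '
```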

The solution

Break it down, change the part you don't like, and reassemble your custom solution.

Now, this regex lives in the PunktLanguageVars class, which is itself used to initialize the PunktSentenceTokenizer class. We just have to derive a custom class from PunktLanguageVars and fix the regex the way we want it to be.

The fix we want is to include trailing newlines at the end of a sentence, so I suggest replacing _period_context_fmt, going from this:

    _period_context_fmt = r"""
        \S*                          # some word material
        %(SentEndChars)s             # a potential sentence ending
        (?=(?P<after_tok>
            %(NonWord)s              # either other punctuation
            |
            \s+(?P<next_tok>\S+)     # or whitespace and some other token
        ))"""

to this:

    _period_context_fmt = r"""
        \S*                          # some word material
        %(SentEndChars)s             # a potential sentence ending
        \s*                          #  <-- this is what changed
        (?=(?P<after_tok>
            %(NonWord)s              # either other punctuation
            |
            (?P<next_tok>\S+)        #  <-- normally there would be \s+ before this group
        ))"""

Now a tokenizer using this regex instead of the older one will include zero or more \s characters after the end of a sentence.
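The effect of moving the whitespace out of the lookahead can be checked without NLTK. The sketch below compiles both templates with Python's re module and compares what each one matches. The substitutions for SentEndChars and NonWord are simplified assumptions, and first_match is a hypothetical helper written for this illustration.

```python
import re

# Simplified stand-ins (assumptions, not NLTK's exact definitions).
SUBS = {"SentEndChars": r"[.?!]", "NonWord": r"[?!)\"';:]"}

ORIGINAL_FMT = r"""
    \S*
    %(SentEndChars)s
    (?=(?P<after_tok>
        %(NonWord)s
        |
        \s+(?P<next_tok>\S+)
    ))"""

PATCHED_FMT = r"""
    \S*
    %(SentEndChars)s
    \s*                          # trailing whitespace is now consumed
    (?=(?P<after_tok>
        %(NonWord)s
        |
        (?P<next_tok>\S+)
    ))"""


def first_match(fmt, text):
    """Compile the template in verbose mode and return the first matched text."""
    return re.compile(fmt % SUBS, re.VERBOSE).search(text).group()


text = "That loud beep.\n\n I don't know"
before = first_match(ORIGINAL_FMT, text)
after = first_match(PATCHED_FMT, text)

print(repr(before))  # 'beep.'
print(repr(after))   # 'beep.\n\n '
```

With the patched template the blank lines are part of the match itself, which is why they end up attached to the sentence.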

The whole script

    import nltk.tokenize.punkt as pkt

    class CustomLanguageVars(pkt.PunktLanguageVars):

        _period_context_fmt = r"""
            \S*                          # some word material
            %(SentEndChars)s             # a potential sentence ending
            \s*                          #  <-- this is what changed
            (?=(?P<after_tok>
                %(NonWord)s              # either other punctuation
                |
                (?P<next_tok>\S+)        #  <-- normally there would be \s+ before this group
            ))"""

    # Build a tokenizer that uses the patched regex
    custom_tknzr = pkt.PunktSentenceTokenizer(lang_vars=CustomLanguageVars())

    s = "That loud beep.\n\n I don't know\n if working. Mark?\n\n Mark there?\n\n\n"

    print(custom_tknzr.tokenize(s))

This outputs:

    ['That loud beep.\n\n ', "I don't know\n if working. ", 'Mark?\n\n ', 'Mark there?\n\n\n']
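Each chunk in this output also keeps the spaces that preceded the next sentence, which is slightly more than the question asked for. One possible post-processing step (a sketch, not part of the answer above) is to strip trailing spaces and tabs while leaving the newlines alone. The list below is copied from the output shown above rather than produced by running NLTK.

```python
# Output of the custom tokenizer, copied from above
sentences = [
    'That loud beep.\n\n ',
    "I don't know\n if working. ",
    'Mark?\n\n ',
    'Mark there?\n\n\n',
]
s = "That loud beep.\n\n I don't know\n if working. Mark?\n\n Mark there?\n\n\n"

# The patched tokenizer drops nothing: joining the chunks gives back the input.
assert "".join(sentences) == s

# Strip trailing spaces and tabs only, so the blank lines survive.
trimmed = [sent.rstrip(" \t") for sent in sentences]
print(trimmed)
# ['That loud beep.\n\n', "I don't know\n if working.", 'Mark?\n\n', 'Mark there?\n\n\n']
```

After trimming, the chunks no longer join back into the exact input, so keep the untrimmed list if the text has to be reconstructed later.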
