Extracting text from an HTML file using Python
I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.
I'd like something more robust than regular expressions, which may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect &#39; in the HTML source to be converted to an apostrophe in the text, just as if I'd pasted the browser content into Notepad.
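To make the two requirements concrete (skip JavaScript, decode entities), here is a minimal sketch using only the standard library's html.parser. It is not what the question or any answer used; the names TextExtractor and html_to_text and the choice of block-level tags are assumptions made for illustration, and the result only approximates what a browser copy-paste would give.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}
    # Assumed set of tags that should start a new line in the output.
    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &#39; and &amp;.
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self._parts.append("\n")

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._parts.append(data)

    def get_text(self):
        text = "".join(self._parts)
        # Collapse runs of whitespace and drop empty lines.
        lines = (" ".join(line.split()) for line in text.splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.get_text()


sample = (
    "<html><head><script>var x = 1;</script></head>"
    "<body><p>It&#39;s a &amp; b</p><p>Second</p></body></html>"
)
print(html_to_text(sample))
```

The parser is tolerant of unclosed tags, but it does not know about CSS-hidden elements or the page title, so treat it as a starting point.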
Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text; it produces Markdown, which would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.
html2text is a Python program that does a pretty good job at this.
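For reference, a short sketch of calling html2text from Python. It assumes the third-party html2text package is installed (pip install html2text); the helper name html_to_markdown is hypothetical, and the guarded import is there only so the snippet still loads when the package is absent. Note that the result is Markdown rather than strictly plain text, so emphasis markers such as ** remain in the output.

```python
try:
    import html2text  # third-party package, assumed to be installed
except ImportError:
    html2text = None  # lets the sketch load even without the package


def html_to_markdown(html):
    """Hypothetical helper: return Markdown text, or None if html2text is missing."""
    if html2text is None:
        return None
    converter = html2text.HTML2Text()
    converter.body_width = 0       # do not hard-wrap long lines
    converter.ignore_links = True  # do not emit Markdown link syntax
    return converter.handle(html)


sample = "<p>It&#39;s <b>bold</b></p><script>var x = 1;</script>"
print(html_to_markdown(sample))
```

If true plain text is needed, the remaining Markdown markers would still have to be stripped in a separate step.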