Parse HTML - Get Clean Text from HTML

DATASERVICER May 16, 2022 Python

Get Clean Text from HTML:

The Python function below extracts plain text from HTML. You can use the htmllaundry package as well, but writing the function yourself lets you customize which text is kept to suit your requirements.

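## html parser
import re

from bs4 import BeautifulSoup
from bs4.element import Comment

def parse_html(content):
    # Parse with Python's built-in HTML parser
    soup = BeautifulSoup(content, 'html.parser')
    # Collect every text node in the document (comments included)
    text_elements = soup.find_all(string=True)
    # Text whose direct parent is one of these tags is dropped
    tag_blacklist = ['style','script','head','title','meta','[document]','img']
    clean_text = []
    for element in text_elements:
        # Skip text under blacklisted tags and HTML comments
        if element.parent.name in tag_blacklist or isinstance(element, Comment):
            continue
        text_ = element.strip()
        # Ignore whitespace-only nodes so the joined text has no runs of spaces
        if text_:
            clean_text.append(text_)
    result_text = " ".join(clean_text)
    # Remove carriage returns and newlines left inside text nodes
    result_text = re.sub(r'[\r\n]', '', result_text)
    # Remove any tag-like fragments that survived as plain text
    tag_remove_pattern = re.compile(r'<[^>]+>')
    result_text = tag_remove_pattern.sub('', result_text)
    # Remove backslashes
    result_text = re.sub(r'\\', '', result_text)
    return result_text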

The function ignores text whose parent is one of the tags listed in tag_blacklist = ['style','script','head','title','meta','[document]','img']. HTML comments are skipped as well, using the Comment class imported from bs4.element.
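
To see the same skip-list idea without any third-party package, here is a minimal sketch that swaps BeautifulSoup for html.parser from the Python standard library. The names TextExtractor, extract_text and VOID_TAGS are hypothetical and are not part of the function above. One difference to keep in mind: this sketch drops everything nested inside a skipped tag, while the function above only checks the direct parent of each text node.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"img", "meta", "br", "hr", "input", "link"}


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside the given tags."""

    def __init__(self, skip_tags):
        super().__init__()
        self.skip_tags = set(skip_tags) - VOID_TAGS
        self.skip_depth = 0   # > 0 while we are inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)

    # handle_comment is left at its default (a no-op), so comments are dropped.


def extract_text(html, skip_tags=("style", "script", "head", "title")):
    """Return the visible text of `html` as one space-joined string."""
    parser = TextExtractor(skip_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Passing a different skip_tags tuple is how you would customize what gets kept, for example adding "nav" or "footer" to leave out page chrome.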

After calling requests.get on a URL, pass response.content as the argument to the function above.
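
The fetch-then-parse step could be wrapped roughly as follows. This is only a sketch under a few assumptions: fetch_clean_text is a hypothetical helper name, the parse argument stands for the function above (or any callable that accepts the response bytes), the optional get argument exists only so the helper can be tried offline with a stub, and the 10-second timeout is an arbitrary choice.

```python
def fetch_clean_text(url, parse, get=None, timeout=10):
    """Download `url` and return parse(response.content)."""
    if get is None:
        # Imported lazily so the helper can be exercised offline with a stub.
        import requests
        get = requests.get
    response = get(url, timeout=timeout)
    # Raise on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    # .content holds the raw bytes of the response body.
    return parse(response.content)
```

With the function above in scope, a call would look like fetch_clean_text("https://example.com", parse_html), where the URL is just a placeholder.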

Keywords: get clean text from html, html text extraction, pure text from html using, get text from html using python, how to get text from html using python
