Welcome to HTML DOM Parser
htmldom parses an HTML file and provides methods for iterating over
and searching the parse tree, in a way similar to jQuery.
Language Requirement: Python 3.2.x
Platforms Available: Linux, Windows
Download
- You can download the latest version from sourceforge.net: HTML DOM Parser
- For Windows, you can download it from the Python Package Index: HTML DOM Parser
Getting Started
- Contents
-
Installing the library:
- Download the source code from the links mentioned above.
- Extract the files and go to the htmldom-2.0 directory.
- Execute sudo python setup.py install (the interpreter must be version 3.x).
-
Creating HTML DOM Object:
- Open your python interpreter.
    from htmldom import htmldom
    dom = htmldom.HtmlDom()
    # or
    dom = htmldom.HtmlDom( "http://www.example.com" )
The above code creates an HtmlDom object. HtmlDom takes an optional parameter, the URL of the page. If it is not provided, you can create elements dynamically.
    dom = dom.createDom( "<html></html>" )
    # or, if you have provided the URL, a plain createDom() call will suffice
    dom = dom.createDom()
Once the HtmlDom object is created, you need to call its createDom method. This parses the HTML data and constructs the parse tree, which can then be used for searching and manipulating the HTML data. The only restriction the library imposes is that the data, whether HTML or XML, must have a root element.
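The root-element restriction can be checked before a string is handed to createDom. The following is a minimal sketch that uses only Python's standard html.parser module, not htmldom itself. The names RootCounter and has_single_root are hypothetical, introduced here for illustration, and the sketch assumes reasonably well-formed markup (every non-void tag is closed).

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class RootCounter(HTMLParser):
    """Counts the elements that appear at the top level of a markup string."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.roots = 0

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            self.roots += 1
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def has_single_root(markup):
    """Return True if the markup has exactly one top-level element."""
    counter = RootCounter()
    counter.feed(markup)
    counter.close()
    return counter.roots == 1


print(has_single_root("<html><p>text</p></html>"))  # True
print(has_single_root("<p>one</p><p>two</p>"))      # False
```

A fragment that fails this check could be wrapped in a single container element before parsing.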
-
Searching HTML Elements from parse tree:
-
You can search the parse tree using CSS2 selector expressions or the methods provided by HtmlDom object and HtmlNodeList object.
The selector expressions supported by this library are given below:
Selector expression    Meaning
*                      Universal selector; matches any element.
E                      Matches any E element.
E F                    Matches any F element that is a descendant of an E element.
E > F                  Matches any F element that is a child of an E element.
E + F                  Matches any F element immediately preceded by a sibling element E.
E[foo]                 Matches any E element with the "foo" attribute set (whatever the value).
E[foo=value]           Matches any E element whose "foo" attribute value is exactly equal to "value".
E[foo~=value]          Matches any E element whose "foo" attribute value is a list of space-separated values, one of which is exactly equal to "value".
E.dummy                Matches any E element whose class attribute has the value "dummy".
E#dummy                Matches any E element whose id attribute has the value "dummy".
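To make the difference between the combinators concrete, here is a small sketch that spells out what "E F", "E > F", "E + F" and "E[foo~=value]" mean. It does not use htmldom; it uses the standard xml.etree.ElementTree module on a well-formed sample, so it only illustrates the CSS2 semantics and says nothing about how htmldom evaluates selectors internally.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<html>"
    "<div id='one'><p>para<strong>s1</strong></p></div>"
    "<div id='two' class='a dummy'><p>para<strong>s2</strong></p></div>"
    "<h4 id='four'><strong>s3</strong></h4>"
    "</html>"
)

# "div strong": any strong nested at any depth inside a div.
descendant = [s for d in root.iter("div") for s in d.iter("strong")]

# "div > strong": strong must be a direct child of a div (none here).
child = [s for d in root.iter("div") for s in d.findall("strong")]

# "div + div": a div whose immediately preceding sibling is a div.
adjacent = []
for parent in root.iter():
    kids = list(parent)
    for prev, cur in zip(kids, kids[1:]):
        if prev.tag == "div" and cur.tag == "div":
            adjacent.append(cur)

# "div[class~=dummy]": class is a space-separated list containing "dummy".
word_match = [d for d in root.iter("div")
              if "dummy" in d.get("class", "").split()]

print([s.text for s in descendant])       # ['s1', 's2']
print(child)                              # []
print([d.get("id") for d in adjacent])    # ['two']
print([d.get("id") for d in word_match])  # ['two']
```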
You can query the elements using the "find" method of the HtmlDom object. This function takes a CSS selector as a parameter
and returns an HtmlNodeList object containing the matched nodes.
    # Create a dom instance
    from htmldom import htmldom
    dom = htmldom.HtmlDom().createDom( """<html>
        <div id='one'><p>This is paragraph<strong>strong Element</strong></p></div>
        <div id='two'><p>This is paragraph<strong>strong Element</strong></p></div>
        <p id='three'><p>This is paragraph<strong>strong Element</strong></p></p>
        <h4 id='four'><p>This is paragraph<strong>strong Element</strong></p></h4>
        </html>""" )
    # Getting p elements from the html data
    p = dom.find( "p" )
    # You can print html content using the "html" method of the HtmlNodeList object
    print( p.html() )
    # Getting all elements
    all_elements = dom.find( "*" )
    # Getting sibling elements using '+'
    sibling = dom.find( "div + div" )
    # Getting descendant elements
    desc = dom.find( "div p strong" )
    # Getting child elements using '>'
    child = dom.find( "div > p > strong" )
    # Selecting elements through attributes
    elem = dom.find( "div[id=one]" )
    # or
    elem = dom.find( "[id]" )
    # or
    elem = dom.find( "div[id] p" )
    # or
    elem = dom.find( "div#one" )
    # If 'one' were a class then,
    elem = dom.find( "div.one" )
-
Searching through HtmlDom and HtmlNodeList objects methods:
- HtmlDom_instance.find( selector = 'css selector expression' )
This function takes a CSS selector expression and returns an HtmlNodeList object containing the selected nodes.
Examples:
    from htmldom import htmldom
    dom = htmldom.HtmlDom( "http://www.example.com" ).createDom()
    # Find all the links present on a page and print their "href" values
    a = dom.find( "a" )
    for link in a:
        print( link.attr( "href" ) )
- HtmlNodeList_instance.children( selector = None, all_children = False )
This function returns all the direct children of the nodes in the current set. It takes an optional selector parameter; if given, the returned set is filtered according to the selector.
If all_children = True is passed, the returned set will also contain the text nodes. Returns an HtmlNodeList object.
Examples:
    # Using the dom instance from the above code snippet
    div = dom.find( "div" )
    # Gets all the children
    chldrn = div.children()
    # or, select only those children which have class 'dummy'
    chldrn = div.children( ".dummy" )
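The distinction between element children and text nodes that all_children controls can be seen in any DOM. The sketch below uses the standard xml.dom.minidom module rather than htmldom, purely to show what "also contain the text nodes" means; the sample markup is made up for the illustration.

```python
from xml.dom import minidom

doc = minidom.parseString("<div>intro<p>first</p>middle<b>bold</b></div>")
div = doc.documentElement

# Analogous to children(): element nodes only.
elements = [n for n in div.childNodes if n.nodeType == n.ELEMENT_NODE]

# Analogous to children( all_children = True ): text nodes are kept too.
everything = list(div.childNodes)

print([n.tagName for n in elements])     # ['p', 'b']
print([n.nodeName for n in everything])  # ['#text', 'p', '#text', 'b']
```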
- HtmlNodeList_instance.html( data = None )
This function is used to get the "html" of the current element set. It takes an optional "data" parameter in string form, which can be used to replace the innerHTML of the current element set. If data is given, it returns an HtmlNodeList object; otherwise it returns a string.
Examples:
    dom = htmldom.HtmlDom().createDom( """<html>
        <div id='one'><p>This is paragraph<strong>strong Element</strong></p></div>
        <div id='two'><p>This is paragraph<strong>strong Element</strong></p></div>
        <p id='three'><p>This is paragraph<strong>strong Element</strong></p></p>
        <h4 id='four'><p>This is paragraph<strong>strong Element</strong></p></h4>
        </html>""" )
    # Get the first div's html
    div = dom.find( "div" ).first().html()
    # div=<div id='one'><p>This is paragraph<strong>strong Element</strong></p></div>
    # Replace the content of every "div" in the set with a "b" tag:
    dom.find( "div" ).html( "<b>b Element</b>" )
- HtmlNodeList_instance.text( data = None )
This function is used to get the "text" content of the current element set. It takes an optional parameter in string form, which can be used to replace the innerText of the current element set. If data is given, it returns an HtmlNodeList object; otherwise it returns a string.
Examples:
    # Using the dom instance from the above code snippet
    dom.find( "div" ).first().text( "div contents replaced" )
- HtmlNodeList_instance.attr( attrName, val = False )
This function can be used to query the attributes of a particular element. It takes an optional parameter "val", which can be used to change the specified attribute's value or to add a new attribute if it does not exist. When val is given, it returns an HtmlNodeList object; when only attrName is given, it returns the attribute's value, as in the first example below.
Examples:
    # Using the dom instance from the above code snippet
    dom.find( "div" ).first().attr( "id" )  # returns "one"
    # Adding a new attribute
    dom.find( "div" ).first().attr( "class", "dummy" )
- HtmlNodeList_instance.removeAttr( attrName )
This function can be used to remove an attribute from an element. Returns HtmlNodeList object.
Examples:
    # Using the dom instance from the above code snippet
    dom.find( "div" ).first().removeAttr( "id" )
- HtmlNodeList_instance.filter( selector )
This function can be used to get specific elements from the current set as specified by the "selector" expression. Returns an HtmlNodeList object.
Examples:
    # Using the dom instance from the above code snippet
    # Gets only the div which has an id attribute with value "one"
    div_one = dom.find( "div" ).filter( "[id=one]" )
- HtmlNodeList_instance._not( selector )
This function can be used to remove specific elements from the current set as specified by the "selector" expression.
Examples:
    # Using the dom instance from the above code snippet
    # Remove div#one from the current set of divs
    div_not_one = dom.find( "div" )._not( "[id=one]" )
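"filter" and "_not" split a set into two complementary halves: the elements that match the selector and the elements that do not. A minimal sketch of that idea follows, using the standard xml.etree.ElementTree module instead of htmldom; matches_id is a hypothetical stand-in for the "[id=value]" selector.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<html><div id='one'/><div id='two'/><div/></html>")
divs = root.findall("div")


def matches_id(elem, value):
    """Stand-in for the selector "[id=value]"."""
    return elem.get("id") == value


kept = [d for d in divs if matches_id(d, "one")]         # like filter
dropped = [d for d in divs if not matches_id(d, "one")]  # like _not

print([d.get("id") for d in kept])     # ['one']
print([d.get("id") for d in dropped])  # ['two', None]
```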
- HtmlNodeList_instance.eq( index )
This function is used to get the nth element from the current set (n must be less than the length of the current set).
Since HtmlNodeList implements the __getitem__ method, you can also index the set using list index syntax and slice it as you would a list. Returns an HtmlNodeList object.
Examples:
    # Using the dom instance from the above code snippet
    div = dom.find( "div" ).eq( 0 )
    # Using list index syntax
    div = dom.find( "div" )[0]
    # Slicing
    div = dom.find( "div" )[1:]
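How a single __getitem__ can serve both indexing and slicing while always returning the wrapper type can be sketched as follows. NodeSet is a hypothetical class written for this illustration; it is not htmldom's HtmlNodeList, and the strings stand in for real nodes.

```python
class NodeSet:
    """Hypothetical list-like wrapper; not htmldom's HtmlNodeList."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def __len__(self):
        return len(self.nodes)

    def __getitem__(self, index):
        # A slice already yields a list; a plain index yields one item,
        # which is wrapped so that both cases return a NodeSet.
        if isinstance(index, slice):
            return NodeSet(self.nodes[index])
        return NodeSet([self.nodes[index]])

    def eq(self, index):
        return self[index]

    def first(self):
        return self[0]

    def last(self):
        return self[-1]


divs = NodeSet(["div#one", "div#two", "div#three"])
print(divs.eq(0).nodes)   # ['div#one']
print(divs[1:].nodes)     # ['div#two', 'div#three']
print(divs.last().nodes)  # ['div#three']
```

Returning the wrapper even for a single index keeps method chaining possible after eq, first and last.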
- HtmlNodeList_instance.first()
Selects the first element in the set. Returns an HtmlNodeList object.
Examples:
    # Using the dom instance from the above code snippet.
    div_first = dom.find( "div" ).first()
- HtmlNodeList_instance.last()
Selects the last element in the set. Returns an HtmlNodeList object.
Examples:
    # Using the dom instance from the above code snippet.
    div_last = dom.find( "div" ).last()
- HtmlNodeList_instance.has( selector )
This function can be used to select those elements which contain elements specified by the selector expression. Returns an HtmlNodeList object.
Examples:
    # Using the dom instance from the above code snippet.
    # Find all "div" elements which contain "strong" element(s) as descendants.
    div = dom.find( "div" ).has( "strong" )
- HtmlNodeList_instance._is( selector )
This function can be used to check whether an element specified by the selector exists in the current set. Returns True if it exists, otherwise False.
Example:
    # Using the dom instance from the above code snippet
    strong = dom.find( "div" ).children().children()
    if strong._is( "strong" ):
        print( "strong element is in the set" )
    else:
        print( "strong element is not in the set" )
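"has" and "_is" are easy to confuse: the first keeps elements that have a matching descendant, the second asks whether the set itself holds a matching element. The sketch below illustrates both ideas with the standard xml.etree.ElementTree module, not htmldom, on a made-up sample.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<html>"
    "<div id='one'><p><strong>s</strong></p></div>"
    "<div id='two'><p>plain</p></div>"
    "</html>"
)
divs = root.findall("div")

# Like has( "strong" ): keep elements that have a matching descendant.
with_strong = [d for d in divs if d.find(".//strong") is not None]

# Like _is( "div" ) and _is( "strong" ): does any element of the set match?
any_div = any(d.tag == "div" for d in divs)
any_strong = any(d.tag == "strong" for d in divs)

print([d.get("id") for d in with_strong])  # ['one']
print(any_div, any_strong)                 # True False
```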
-
i. HtmlNodeList_instance.next( selector = None )
ii. HtmlNodeList_instance.nextAll( selector = None )
iii. HtmlNodeList_instance.nextUntil( selector )
These functions can be used to select the "next sibling elements" of the elements in the current set. "next" selects the immediate next sibling of each element in the current set.
"nextAll" selects all the next sibling elements of the current set. Both functions take an optional selector expression to filter the result set.
"nextUntil" selects all the next sibling elements until the element specified by the selector expression is encountered. All three return an HtmlNodeList object.
Examples:
    dom = htmldom.HtmlDom().createDom( """<html>
        <div id='one'><p>This is paragraph<strong>strong Element</strong></p></div>
        <div id='two'><p>This is paragraph<strong>strong Element</strong></p></div>
        <p id='three'><p>This is paragraph<strong>strong Element</strong></p></p>
        <h4 id='four'><p>This is paragraph<strong>strong Element</strong></p></h4>
        </html>""" )
    # Gets the next sibling element of each div element
    nxt = dom.find( "div" ).next()
    # nxt = [ div#two, p#three ]
    # Filtering the result set.
    nxt = dom.find( "div" ).next( "p#three" )
    # nxt = [ p#three ]
    # Getting all the next elements of div
    next_all = dom.find( "div" ).nextAll()
    # next_all = [ div#two, p#three, h4#four ]
    # Filtering the result set.
    next_all = dom.find( "div" ).nextAll( "h4#four" )
    # next_all = [ h4#four ]
    # Getting the next sibling elements of div#one until h4
    next_until = dom.find( "div#one" ).nextUntil( "h4" )
    # next_until = [ div#two, p#three ]
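The three sibling traversals differ only in how far they walk along the list of siblings. The sketch below shows that walk with the standard xml.etree.ElementTree module rather than htmldom; next_one, next_all and next_until are hypothetical helper names, and the stop condition is simplified to a tag name instead of a full selector.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<html><div id='one'/><div id='two'/><p id='three'/><h4 id='four'/></html>"
)
siblings = list(root)  # children of html, in document order


def next_all(elem):
    """Every sibling that follows elem."""
    return siblings[siblings.index(elem) + 1:]


def next_one(elem):
    """Only the immediately following sibling, if any."""
    return next_all(elem)[:1]


def next_until(elem, stop_tag):
    """Following siblings up to, but excluding, the first stop_tag."""
    result = []
    for sib in next_all(elem):
        if sib.tag == stop_tag:
            break
        result.append(sib)
    return result


def ids(elems):
    return [e.get("id") for e in elems]


one = siblings[0]
print(ids(next_one(one)))          # ['two']
print(ids(next_all(one)))          # ['two', 'three', 'four']
print(ids(next_until(one, "h4")))  # ['two', 'three']
```

The "prev" family is the mirror image: the same walk over the siblings that precede the element.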
-
i. HtmlNodeList_instance.prev( selector = None )
ii. HtmlNodeList_instance.prevAll( selector = None )
iii. HtmlNodeList_instance.prevUntil( selector )
These functions can be used to select the "previous sibling elements" of the elements in the current set. "prev" selects the immediate previous sibling of each element in the current set.
"prevAll" selects all the previous sibling elements of the current set. Both functions take an optional selector expression to filter the result set.
"prevUntil" selects all the previous sibling elements until the element specified by the selector expression is encountered. All three return an HtmlNodeList object.
Examples:
    dom = htmldom.HtmlDom().createDom( """<html>
        <div id='one'><p>This is paragraph<strong>strong Element</strong></p></div>
        <div id='two'><p>This is paragraph<strong>strong Element</strong></p></div>
        <p id='three'><p>This is paragraph<strong>strong Element</strong></p></p>
        <h4 id='four'><p>This is paragraph<strong>strong Element</strong></p></h4>
        </html>""" )
    # Gets the previous sibling elements of the div elements.
    prev = dom.find( "div" ).prev()
    # prev = [ div#one ]
    # Filtering the result set.
    prev = dom.find( "div" ).prev( "p#three" )
    # prev = []
    # Getting all the previous elements of h4.
    prev_all = dom.find( "h4" ).prevAll()
    # prev_all = [ div#two, p#three, div#one ]
    # Filtering the result set.
    prev_all = dom.find( "h4" ).prevAll( "#one" )
    # prev_all = [ div#one ]
    # Getting the previous sibling elements until div#one.
    prevs = dom.find( "h4" ).prevUntil( "div#one" )
    # prevs = [ div#two, p#three ]
- HtmlNodeList_instance.siblings( selector = None )
This function is used to get all the next and previous sibling elements. It takes an optional selector expression, which can be used to filter the result set. Returns an HtmlNodeList object.
Examples:
    # Using the dom instance from the above code snippet.
    siblings = dom.find( "div#two" ).siblings()
    # siblings = [ div#one, p#three, h4#four ]
    # Filtering the result set.
    siblings = dom.find( "div#two" ).siblings( "#three" )
    # siblings = [ p#three ]
-
i. HtmlNodeList_instance.parent( selector = None )
ii. HtmlNodeList_instance.parents( selector = None )
iii. HtmlNodeList_instance.parentsUntil( selector )
These functions can be used to select the "parent elements" of the elements in the current set. "parent" selects the immediate parent of each element in the current set.
"parents" selects all the ancestors of the elements in the current set. Both functions take an optional selector expression to filter the result set.
"parentsUntil" selects all the parent elements until the element specified by the selector expression is encountered. All three return an HtmlNodeList object.
Examples:
    dom = htmldom.HtmlDom().createDom( """<html>
        <div id='one'><p id="five">This is paragraph<strong>strong Element</strong></p></div>
        <div id='two'><p id="six">This is paragraph<strong>strong Element</strong></p></div>
        <p id='three'><p id="seven">This is paragraph<strong>strong Element</strong></p></p>
        <h4 id='four'><p id="eight">This is paragraph<strong>strong Element</strong></p></h4>
        </html>""" )
    # Gets the parent elements of the strong elements.
    parent = dom.find( "strong" ).parent()
    # parent = [ p#five, p#six, p#seven, p#eight ]
    # Filtering the result set.
    parent = dom.find( "strong" ).parent( "p#seven" )
    # parent = [ p#seven ]
    # Getting all the parent elements of strong
    parents = dom.find( "strong" ).parents()
    # parents = [ div#two, p#three, div#one, p#five, p#six, p#seven, p#eight, html ]
    # Filtering the result set.
    parents = dom.find( "strong" ).parents( "#one" )
    # parents = [ div#one ]
    # Getting parent elements until div#one.
    parent = dom.find( "strong" ).first().parentsUntil( "div#one" )
    # parent = [ p#five ]
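Walking upward through ancestors follows the same pattern as walking along siblings. The sketch below uses the standard xml.etree.ElementTree module, not htmldom. Because ElementTree elements hold no reference to their parent, it first builds a child-to-parent map; parents and parents_until are hypothetical helper names, and the stop condition is simplified to a tag name.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<html><div id='one'><p id='five'><strong>s</strong></p></div></html>"
)

# ElementTree keeps no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def parents(elem):
    """All ancestors of elem, nearest first."""
    chain = []
    while elem in parent_of:
        elem = parent_of[elem]
        chain.append(elem)
    return chain


def parents_until(elem, stop_tag):
    """Ancestors of elem up to, but excluding, the first stop_tag."""
    chain = []
    for ancestor in parents(elem):
        if ancestor.tag == stop_tag:
            break
        chain.append(ancestor)
    return chain


strong = root.find(".//strong")
print([e.tag for e in parents(strong)[:1]])           # ['p'] (immediate parent)
print([e.tag for e in parents(strong)])               # ['p', 'div', 'html']
print([e.tag for e in parents_until(strong, "div")])  # ['p']
```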
- HtmlNodeList_instance.add( selector )
This function is used to add new elements to the current set. Returns HtmlNodeList object which is the union of current elements set and the elements matched by the selector expression.
Examples:
    # Using the dom instance from the above code snippet.
    # First find all the strong elements.
    elems = dom.find( "strong" )
    # Then add the p#three element to the set.
    elems.add( "p#three" )
- HtmlNodeList_instance.andSelf()
This function can be used to add the previous set of elements to the current set. Returns the newly modified HtmlNodeList object.
Examples:
    # Using the dom instance from the above code snippet.
    elems = dom.find( "p" ).prev().andSelf()
    # elems = [ div#two, p#three ]
- HtmlNodeList_instance.end()
This function can be used to return to the previously matched set of elements.
Examples:
    # Using the dom instance from the above code snippet.
    # First selects the "html" element, then finds "p",
    # adds a text node to it, then reverts to the set containing "html"
    print( dom.find( "html" ).find( "p" ).append( "This is a paragraph" ).end().html() )
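One common way to support "end" and "andSelf" in a chainable API is for every derived set to remember the set it came from. The sketch below shows that design with a hypothetical ChainSet class over plain numbers; it is an assumption about how such chaining can be built, not a description of htmldom's internals.

```python
class ChainSet:
    """Hypothetical chainable set; not htmldom's actual implementation."""

    def __init__(self, items, previous=None):
        self.items = list(items)
        self.previous = previous  # the set this one was derived from

    def derive(self, items):
        """Create a new set that remembers self as its predecessor."""
        return ChainSet(items, previous=self)

    def keep(self, predicate):
        return self.derive(x for x in self.items if predicate(x))

    def end(self):
        """Step back to the set the current one was derived from."""
        return self.previous if self.previous is not None else self

    def and_self(self):
        """Union of the current set and its predecessor."""
        merged = list(self.items)
        if self.previous is not None:
            merged += [x for x in self.previous.items if x not in merged]
        return self.derive(merged)


numbers = ChainSet([1, 2, 3, 4])
evens = numbers.keep(lambda n: n % 2 == 0)
print(evens.items)             # [2, 4]
print(evens.end().items)       # [1, 2, 3, 4]
print(evens.and_self().items)  # [2, 4, 1, 3]
```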
- HtmlNodeList_instance.find( selector )
This function gets the descendants of each element in the current set of matched elements, filtered by the selector.
Examples:
    # Using the dom instance from the above code snippet.
    # Gets the "p" elements nested inside the "html" element
    p = dom.find( "html" ).find( "p" )
- HtmlNodeList_instance.contains( regex )
This function returns all the nodes whose text nodes contain the pattern specified by regex.
- HtmlNodeList_instance.add( selector )
This function adds new elements specified by the selector parameter to the current set.
Examples:
    # Using the dom instance from the above code snippet.
    # First select the p elements
    p = dom.find( "p" )
    # Then add the "strong" elements to it.
    p_added = p.add( "strong" )
-
Modifying Parse Tree
- HtmlDom_instance.createDom( raw_html )
This function can be used to create a DOM tree from raw HTML. The only restriction it imposes is that the string passed must have a root element. Once the tree is constructed, you can use all the functions mentioned above on it. It returns an HtmlDom object.
Examples:
    dom = htmldom.HtmlDom().createDom( """<html>
        <div id='one'><p id="five">This is paragraph<strong>strong Element</strong></p></div>
        <div id='two'><p id="six">This is paragraph<strong>strong Element</strong></p></div>
        <p id='three'><p id="seven">This is paragraph<strong>strong Element</strong></p></p>
        <h4 id='four'><p id="eight">This is paragraph<strong>strong Element</strong></p></h4>
        </html>""" )
    # Getting the strong element
    strong = dom.find( "html div#one strong" )
- i. HtmlNodeList_instance.append( nodes )
ii. HtmlNodeList_instance.prepend( nodes )
iii. HtmlNodeList_instance.after( nodes )
iv. HtmlNodeList_instance.before( nodes )
Here HtmlNodeList_instance is the target and nodes are the source.
v. HtmlNodeList_instance.appendTo( nodes, context = None )
vi. HtmlNodeList_instance.prependTo( nodes, context = None )
vii. HtmlNodeList_instance.insertAfter( nodes, context = None )
viii. HtmlNodeList_instance.insertBefore( nodes, context = None )
Here HtmlNodeList_instance is the source and nodes are the target.
i. The "append" function appends nodes (at the end) to the elements of the current set.
ii. The "prepend" function prepends nodes (at the beginning) to the elements of the current set.
With these two functions, the nodes are added as children of the elements of the current set.
iii. The "after" function attaches nodes after the elements of the current set.
iv. The "before" function attaches nodes before the elements of the current set.
With these two functions, the nodes are attached as siblings of the elements of the current set.
v. "appendTo", "prependTo", "insertAfter" and "insertBefore" are similar to the above functions; the only difference is that HtmlNodeList_instance is the source and nodes is the target. The context parameter is required when you are moving nodes from one parse tree to another (context is required for searching nodes).
context must be an instance of HtmlDom; otherwise the context of HtmlNodeList_instance will be used.
All functions (i - viii) take either an HtmlNodeList instance or raw html (in this case the html does not need to have a root element). Nodes passed are removed from their previous position and attached at the new position.
Examples:
    # Using the dom instance from the above code snippet.
    dom.find( "div#one" ).append( "<b>b Element</b>" )
    # or
    dom.find( "div#one" ).prepend( "<b>b Element</b>" )
    # or
    dom.find( "div#one" ).after( "<b>b Element</b>" )
    # or
    dom.find( "div#one" ).before( "<b>b Element</b>" )
    # Print its html to see the effect
    print( dom.find( "div#one" ).html() )
    # or you can pass an HtmlNodeList object.
    dom.find( "div#one" ).append( dom.find( "div#two" ) )
    # or
    dom.find( "div#one" ).prepend( dom.find( "div#two" ) )
    # or
    dom.find( "div#one" ).after( dom.find( "div#two" ) )
    # or
    dom.find( "div#one" ).before( dom.find( "div#two" ) )
    # Print its html to see the effect
    print( dom.find( "div#one" ).html() )
    # Here "div#one" will be appended to "div#two"
    dom.find( "div#one" ).appendTo( dom.find( "div#two" ) )
    # Here "div#one" will be prepended to "div#two"
    dom.find( "div#one" ).prependTo( dom.find( "div#two" ) )
    # Here "div#one" will be attached as the next sibling of "div#two"
    dom.find( "div#one" ).insertAfter( dom.find( "div#two" ) )
    # Here "div#one" will be attached as the previous sibling of "div#two"
    dom.find( "div#one" ).insertBefore( dom.find( "div#two" ) )
    # Print its html to see the effect
    print( dom.find( "div#two" ).html() )
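The statement that passed nodes are removed from their previous position and attached at the new one describes a move, not a copy. The sketch below reproduces that move with the standard xml.etree.ElementTree module rather than htmldom, on a reduced sample; the htmldom calls named in the comments are only analogies.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<html><div id='one'/><div id='two'/><p id='three'/></html>"
)
one, two, three = list(root)

# Like dom.find( "div#one" ).append( dom.find( "p#three" ) ):
# detach the node from its old parent, then attach it as the last child.
root.remove(three)
one.append(three)

# Like dom.find( "div#one" ).before( dom.find( "div#two" ) ):
# detach, then re-insert as the sibling just before the target.
root.remove(two)
root.insert(list(root).index(one), two)

print([e.get("id") for e in root])  # ['two', 'one']
print([e.get("id") for e in one])   # ['three']
```

"prepend" and "after" differ only in the insertion index: position 0 among the target's children, or one past the target among its siblings.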