lxml Cheatsheet
Get Data
lxml Version
Get from a string:
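from lxml import etree
xml_root = etree.fromstring(data)  # root Element, used in the ElementTree examples
lxmldata = etree.fromstring(data)  # same parse, under the name used in the XPath examples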
or from a file:
with open(opffile, 'rb') as f:  # binary mode lets the parser honor the file's encoding declaration
    tree = etree.parse(f)
xml_root = tree.getroot()
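A minimal sketch of both entry points, using a made-up sample document and a temporary file (the names data, path and file_root are illustrative, not from the original notes). It also shows a common pitfall: lxml typically rejects a str that still carries an encoding declaration, so bytes input is the safer default.

```python
import os
import tempfile

from lxml import etree

# Hypothetical sample document; bytes input is safe with an XML declaration.
data = b"""<?xml version="1.0" encoding="UTF-8"?>
<library>
  <book id="1"><title>Python Programming</title></book>
</library>"""

# From a string: fromstring() returns the root Element directly.
xml_root = etree.fromstring(data)

# A str with an encoding declaration is usually refused with ValueError.
try:
    etree.fromstring(data.decode("utf-8"))
    str_with_declaration_ok = True
except ValueError:
    str_with_declaration_ok = False

# From a file: parse() returns an ElementTree; getroot() gives the root Element.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")
    with open(path, "wb") as f:
        f.write(data)
    with open(path, "rb") as f:
        tree = etree.parse(f)
file_root = tree.getroot()

print(xml_root.tag, file_root.tag, str_with_declaration_ok)
```

Keep the tree object around if you plan to write the document back later, since tree.write() is a method of the ElementTree, not of the root Element.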
Soup Version
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'xml')  # the 'xml' parser requires lxml to be installed
Target element from data
variable = lxmldata.xpath(INSERT_COMMAND)
Send Data
Serialize XML
xml_string = etree.tostring(xml_root, pretty_print=True).decode()
with open(opffile, 'w', encoding='utf-8') as f:
    f.write(xml_string)
Serialize HTML
html_string = etree.tostring(html_root, pretty_print=True, method="html").decode()
print(html_string)
Or, preferably, write the XML tree directly to the file:
with open(opffile, 'wb') as f:
    tree.write(f, pretty_print=True, xml_declaration=True, encoding='UTF-8')
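The serialization options above can be compared side by side. This is a sketch under assumptions: the sample document and the temporary output path are invented for illustration. It shows that tostring() returns bytes by default, returns str when encoding="unicode" is passed, and that tree.write() can emit the XML declaration for you.

```python
import os
import tempfile

from lxml import etree

# Hypothetical sample document.
xml_root = etree.fromstring(
    "<library><book id='1'><title>Python Programming</title></book></library>"
)
tree = etree.ElementTree(xml_root)

# Default: bytes, which is why the cheatsheet calls .decode().
as_bytes = etree.tostring(xml_root, pretty_print=True)

# encoding="unicode" returns str directly, so no .decode() is needed.
as_text = etree.tostring(xml_root, pretty_print=True, encoding="unicode")

# Writing the tree directly, with declaration and explicit encoding.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.xml")
    tree.write(path, pretty_print=True, xml_declaration=True, encoding="UTF-8")
    with open(path, "rb") as f:
        written = f.read()

print(type(as_bytes).__name__, type(as_text).__name__)
print(written.decode("utf-8"))
```

Writing through tree.write() keeps the declared encoding and the actual bytes on disk in agreement, which is easy to get wrong when decoding to str and re-encoding by hand.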
XPath, ElementTree or CSS Selectors
lxml offers several ways to extract elements: XPath expressions, the ElementTree find methods, and CSS selectors.
XPath
- / : Selects from the root node.
- // : Selects nodes anywhere in the document.
- . : Refers to the current node.
- .. : Selects the parent of the current node.
- @ : Selects an attribute.
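Each of the five building blocks listed above can be tried on a small tree. The document below is a made-up sample, and the variable names are illustrative only.

```python
from lxml import etree

# Hypothetical sample document.
root = etree.fromstring(
    "<library>"
    "<book id='1'><title>Python Programming</title></book>"
    "<book id='2'><title>Mastering XML</title></book>"
    "</library>"
)

from_root = root.xpath("/library/book")  # '/'  absolute path from the document root
anywhere = root.xpath("//title")         # '//' matches at any depth
first_title = anywhere[0]
same_node = first_title.xpath(".")       # '.'  the context node itself
parent = first_title.xpath("..")         # '..' the parent of the context node
ids = root.xpath("//book/@id")           # '@'  attribute values

print(len(from_root), len(anywhere), same_node[0].tag, parent[0].tag, ids)
```

Note that xpath() always returns a list for node selections, even when only one node matches.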
Extract all 'title' elements from the XML
titles = lxmldata.xpath('//title/text()')
print(titles) # Output: ['Python Programming', 'Mastering XML']
Extract the title of the book with id="1"
book_title = lxmldata.xpath('//book[@id="1"]/title/text()')
print(book_title) # Output: ['Python Programming']
Get all 'book' ids
book_ids = lxmldata.xpath('//book/@id')
print(book_ids) # Output: ['1', '2']
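The output comments above imply a sample document that the notes never show. The version below is a reconstruction from those outputs, so treat the element layout as an assumption. The last query is an extra: lxml lets you pass XPath variables as keyword arguments, which avoids building expressions by string concatenation (book_id is an illustrative name).

```python
from lxml import etree

# Assumed sample data, reconstructed from the outputs quoted in the cheatsheet.
data = b"""<library>
  <book id="1"><title>Python Programming</title><author>John Smith</author></book>
  <book id="2"><title>Mastering XML</title><author>Jane Doe</author></book>
</library>"""
lxmldata = etree.fromstring(data)

titles = lxmldata.xpath('//title/text()')
book_title = lxmldata.xpath('//book[@id="1"]/title/text()')
book_ids = lxmldata.xpath('//book/@id')

# XPath variable: $book_id is bound from the keyword argument.
wanted = lxmldata.xpath('//book[@id=$book_id]/title/text()', book_id="2")

print(titles)      # ['Python Programming', 'Mastering XML']
print(book_title)  # ['Python Programming']
print(book_ids)    # ['1', '2']
print(wanted)      # ['Mastering XML']
```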
ElementTree
Find the first 'book' element
first_book = xml_root.find('book')
print(first_book.find('title').text) # Output: Python Programming
Find all 'book' elements
all_books = xml_root.findall('book')
for book in all_books:
    print(book.find('author').text) # Output: John Smith, then Jane Doe (one per line)
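One detail worth seeing in action: find() and findall() with a bare tag name look only at direct children, so nested elements need a './/' prefix. The sketch below uses an assumed sample document (same shape as the outputs above suggest) and also shows findtext() with a default for missing elements.

```python
from lxml import etree

# Assumed sample data matching the outputs quoted in the cheatsheet.
xml_root = etree.fromstring(
    "<library>"
    "<book id='1'><title>Python Programming</title><author>John Smith</author></book>"
    "<book id='2'><title>Mastering XML</title><author>Jane Doe</author></book>"
    "</library>"
)

first_book = xml_root.find('book')         # direct child of the root
direct_title = xml_root.find('title')      # None: title is not a direct child
nested_title = xml_root.find('.//title')   # './/' searches all descendants

# findtext() returns the text of the first match, or the default if none.
authors = [book.findtext('author') for book in xml_root.findall('book')]
missing = first_book.findtext('publisher', default='unknown')

print(direct_title, nested_title.text, authors, missing)
```

Checking for None before touching .text avoids the AttributeError that find() otherwise causes when nothing matches.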
CSS Selectors
from lxml.cssselect import CSSSelector
Select all 'p' elements with class 'highlight'
sel = CSSSelector('p.highlight')
highlighted_elements = sel(html_root)
Select element by ID
header = CSSSelector('#main')(html_root)
Select elements by attribute
links = CSSSelector('a[href]')(html_root)
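CSSSelector depends on the separate cssselect package and works by translating each selector to XPath. If that package is not available, the three selections above can be written as XPath directly. The sketch below is one hand-written equivalent, not the exact expressions cssselect generates, and the HTML snippet is a made-up sample.

```python
from lxml import html

# Hypothetical sample page.
html_root = html.fromstring(
    "<html><body>"
    "<div id='main'>"
    "<p class='highlight note'>First</p><p>Second</p>"
    "<a href='https://example.com'>link</a><a name='anchor'>no href</a>"
    "</div>"
    "</body></html>"
)

# p.highlight: class holds space-separated tokens, so match a whole token.
highlighted = html_root.xpath(
    "//p[contains(concat(' ', normalize-space(@class), ' '), ' highlight ')]"
)

# #main: any element whose id attribute equals 'main'.
header = html_root.xpath("//*[@id='main']")

# a[href]: anchors that have an href attribute at all.
links = html_root.xpath("//a[@href]")

print([p.text for p in highlighted], header[0].tag, [a.get('href') for a in links])
```

Like CSSSelector, these calls return lists, so header is a one-element list rather than a single element.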