Pythonic HTML Parsing for Humans™

Requests-HTML: HTML Parsing for Humans™

This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.

When using this library you automatically get:

  • Full JavaScript support!
  • CSS Selectors (a.k.a. jQuery-style, thanks to PyQuery).
  • XPath Selectors, for the faint of heart.
  • Mocked user-agent (like a real web browser).
  • Automatic following of redirects.
  • Connection-pooling and cookie persistence.
  • The Requests experience you know and love, with magical parsing abilities.
  • Async support.

Tutorial & Usage

Make a GET request to 'python.org', using Requests:

>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('https://python.org/')

Try async and get some sites at the same time:

>>> from requests_html import AsyncHTMLSession
>>> asession = AsyncHTMLSession()
>>> async def get_pythonorg():
...     r = await asession.get('https://python.org/')
...     return r
...
>>> async def get_reddit():
...     r = await asession.get('https://reddit.com/')
...     return r
...
>>> async def get_google():
...     r = await asession.get('https://google.com/')
...     return r
...
>>> results = asession.run(get_pythonorg, get_reddit, get_google)
>>> results # check the requests all returned a 200 (success) code
[<Response [200]>, <Response [200]>, <Response [200]>]
>>> # Each item in the results list is a response object and can be interacted with as such
>>> for result in results:
...     print(result.html.url)
...
https://www.python.org/
https://www.google.com/
https://www.reddit.com/

Note that the results list is ordered by when each response was returned, not by the order in which the coroutines were passed to the run method. The example shows this: the printed URLs are in a different order from the call.
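
Because the order is not guaranteed, one option is to key each result by something it carries with it (for a real response, its URL) instead of relying on its position. The sketch below uses only the standard library; fake_fetch is a hypothetical stand-in that sleeps instead of calling asession.get, so it shows the pattern rather than the library's internals.

```python
import asyncio

async def fake_fetch(name, delay):
    # Hypothetical stand-in for `await asession.get(...)`: sleeps, then
    # returns a (key, payload) pair so the caller never depends on order.
    await asyncio.sleep(delay)
    return name, f"<response from {name}>"

async def main():
    tasks = [
        asyncio.create_task(fake_fetch("python.org", 0.03)),
        asyncio.create_task(fake_fetch("reddit.com", 0.01)),
        asyncio.create_task(fake_fetch("google.com", 0.02)),
    ]
    # asyncio.wait returns a *set* of finished tasks, so their order is arbitrary.
    done, _ = await asyncio.wait(tasks)
    # Build a dict keyed by site name; lookups no longer depend on ordering.
    return dict(task.result() for task in done)

by_site = asyncio.run(main())
print(by_site["reddit.com"])
```

With real responses, the same idea would be a dict keyed by result.html.url.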

Grab a set of all links on the page, as-is (anchors excluded):

>>> r.html.links
{'//docs.python.org/3/tutorial/', '/about/apps/', 'https://github.com/python/pythondotorg/issues', '/accounts/login/', '/dev/peps/', '/about/legal/', '//docs.python.org/3/tutorial/introduction.html#lists', '/download/alternatives', 'http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html', '/download/other/', '/downloads/windows/', 'https://mail.python.org/mailman/listinfo/python-dev', '/doc/av', 'https://devguide.python.org/', '/about/success/#engineering', 'https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event', 'https://www.openstack.org', '/about/gettingstarted/', 'http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html', '/success-stories/industrial-light-magic-runs-python/', 'http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator', '/', 'http://pyfound.blogspot.com/', '/events/python-events/past/', '/downloads/release/python-2714/', 'https://wiki.python.org/moin/PythonBooks', 'http://plus.google.com/+Python', 'https://wiki.python.org/moin/', 'https://status.python.org/', '/community/workshops/', '/community/lists/', 'http://buildbot.net/', '/community/awards', 'http://twitter.com/ThePSF', 'https://docs.python.org/3/license.html', '/psf/donations/', 'http://wiki.python.org/moin/Languages', '/dev/', '/events/python-user-group/', 'https://wiki.qt.io/PySide', '/community/sigs/', 'https://wiki.gnome.org/Projects/PyGObject', 'http://www.ansible.com', 'http://www.saltstack.com', 'http://planetpython.org/', '/events/python-events', '/about/help/', '/events/python-user-group/past/', '/about/success/', '/psf-landing/', '/about/apps', '/about/', 'http://www.wxpython.org/', '/events/python-user-group/665/', 'https://www.python.org/psf/codeofconduct/', '/dev/peps/peps.rss', '/downloads/source/', '/psf/sponsorship/sponsors/', 'http://bottlepy.org', 'http://roundup.sourceforge.net/', 'http://pandas.pydata.org/', 'http://brochure.getpython.info/', 
'https://bugs.python.org/', '/community/merchandise/', 'http://tornadoweb.org', '/events/python-user-group/650/', 'http://flask.pocoo.org/', '/downloads/release/python-364/', '/events/python-user-group/660/', '/events/python-user-group/638/', '/psf/', '/doc/', 'http://blog.python.org', '/events/python-events/604/', '/about/success/#government', 'http://python.org/dev/peps/', 'https://docs.python.org', 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', '/users/membership/', '/about/success/#arts', 'https://wiki.python.org/moin/Python2orPython3', '/downloads/', '/jobs/', 'http://trac.edgewall.org/', 'http://feedproxy.google.com/~r/PythonInsider/~3/wh73_1A-N7Q/python-355rc1-and-python-348rc1-are-now.html', '/privacy/', 'https://pypi.python.org/', 'http://www.riverbankcomputing.co.uk/software/pyqt/intro', 'http://www.scipy.org', '/community/forums/', '/about/success/#scientific', '/about/success/#software-development', '/shell/', '/accounts/signup/', 'http://www.facebook.com/pythonlang?fref=ts', '/community/', 'https://kivy.org/', '/about/quotes/', 'http://www.web2py.com/', '/community/logos/', '/community/diversity/', '/events/calendars/', 'https://wiki.python.org/moin/BeginnersGuide', '/success-stories/', '/doc/essays/', '/dev/core-mentorship/', 'http://ipython.org', '/events/', '//docs.python.org/3/tutorial/controlflow.html', '/about/success/#education', '/blogs/', '/community/irc/', 'http://pycon.blogspot.com/', '//jobs.python.org', 'http://www.pylonsproject.org/', 'http://www.djangoproject.com/', '/downloads/mac-osx/', '/about/success/#business', 'http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html', 'http://wiki.python.org/moin/TkInter', 'https://docs.python.org/faq/', '//docs.python.org/3/tutorial/controlflow.html#defining-functions'}

Grab a set of all links on the page, in absolute form (anchors excluded):

>>> r.html.absolute_links
{'https://github.com/python/pythondotorg/issues', 'https://docs.python.org/3/tutorial/', 'https://www.python.org/about/success/', 'http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html', 'https://www.python.org/dev/peps/', 'https://mail.python.org/mailman/listinfo/python-dev', 'https://www.python.org/doc/', 'https://www.python.org/', 'https://www.python.org/about/', 'https://www.python.org/events/python-events/past/', 'https://devguide.python.org/', 'https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event', 'https://www.openstack.org', 'http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html', 'https://docs.python.org/3/tutorial/introduction.html#lists', 'http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator', 'http://pyfound.blogspot.com/', 'https://wiki.python.org/moin/PythonBooks', 'http://plus.google.com/+Python', 'https://wiki.python.org/moin/', 'https://www.python.org/events/python-events', 'https://status.python.org/', 'https://www.python.org/about/apps', 'https://www.python.org/downloads/release/python-2714/', 'https://www.python.org/psf/donations/', 'http://buildbot.net/', 'http://twitter.com/ThePSF', 'https://docs.python.org/3/license.html', 'http://wiki.python.org/moin/Languages', 'https://docs.python.org/faq/', 'https://jobs.python.org', 'https://www.python.org/about/success/#software-development', 'https://www.python.org/about/success/#education', 'https://www.python.org/community/logos/', 'https://www.python.org/doc/av', 'https://wiki.qt.io/PySide', 'https://www.python.org/events/python-user-group/660/', 'https://wiki.gnome.org/Projects/PyGObject', 'http://www.ansible.com', 'http://www.saltstack.com', 'https://www.python.org/dev/peps/peps.rss', 'http://planetpython.org/', 'https://www.python.org/events/python-user-group/past/', 'https://docs.python.org/3/tutorial/controlflow.html#defining-functions', 
'https://www.python.org/community/diversity/', 'https://docs.python.org/3/tutorial/controlflow.html', 'https://www.python.org/community/awards', 'https://www.python.org/events/python-user-group/638/', 'https://www.python.org/about/legal/', 'https://www.python.org/dev/', 'https://www.python.org/download/alternatives', 'https://www.python.org/downloads/', 'https://www.python.org/community/lists/', 'http://www.wxpython.org/', 'https://www.python.org/about/success/#government', 'https://www.python.org/psf/', 'https://www.python.org/psf/codeofconduct/', 'http://bottlepy.org', 'http://roundup.sourceforge.net/', 'http://pandas.pydata.org/', 'http://brochure.getpython.info/', 'https://www.python.org/downloads/source/', 'https://bugs.python.org/', 'https://www.python.org/downloads/mac-osx/', 'https://www.python.org/about/help/', 'http://tornadoweb.org', 'http://flask.pocoo.org/', 'https://www.python.org/users/membership/', 'http://blog.python.org', 'https://www.python.org/privacy/', 'https://www.python.org/about/gettingstarted/', 'http://python.org/dev/peps/', 'https://www.python.org/about/apps/', 'https://docs.python.org', 'https://www.python.org/success-stories/', 'https://www.python.org/community/forums/', 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', 'https://www.python.org/community/merchandise/', 'https://www.python.org/about/success/#arts', 'https://wiki.python.org/moin/Python2orPython3', 'http://trac.edgewall.org/', 'http://feedproxy.google.com/~r/PythonInsider/~3/wh73_1A-N7Q/python-355rc1-and-python-348rc1-are-now.html', 'https://pypi.python.org/', 'https://www.python.org/events/python-user-group/650/', 'http://www.riverbankcomputing.co.uk/software/pyqt/intro', 'https://www.python.org/about/quotes/', 'https://www.python.org/downloads/windows/', 'https://www.python.org/events/calendars/', 'http://www.scipy.org', 'https://www.python.org/community/workshops/', 'https://www.python.org/blogs/', 
'https://www.python.org/accounts/signup/', 'https://www.python.org/events/', 'https://kivy.org/', 'http://www.facebook.com/pythonlang?fref=ts', 'http://www.web2py.com/', 'https://www.python.org/psf/sponsorship/sponsors/', 'https://www.python.org/community/', 'https://www.python.org/download/other/', 'https://www.python.org/psf-landing/', 'https://www.python.org/events/python-user-group/665/', 'https://wiki.python.org/moin/BeginnersGuide', 'https://www.python.org/accounts/login/', 'https://www.python.org/downloads/release/python-364/', 'https://www.python.org/dev/core-mentorship/', 'https://www.python.org/about/success/#business', 'https://www.python.org/community/sigs/', 'https://www.python.org/events/python-user-group/', 'http://ipython.org', 'https://www.python.org/shell/', 'https://www.python.org/community/irc/', 'https://www.python.org/about/success/#engineering', 'http://www.pylonsproject.org/', 'http://pycon.blogspot.com/', 'https://www.python.org/about/success/#scientific', 'https://www.python.org/doc/essays/', 'http://www.djangoproject.com/', 'https://www.python.org/success-stories/industrial-light-magic-runs-python/', 'http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html', 'http://wiki.python.org/moin/TkInter', 'https://www.python.org/jobs/', 'https://www.python.org/events/python-events/604/'}
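
To see what "absolute form" means for the relative and protocol-relative links above, here is a small standard-library sketch. It resolves a few of the raw links against the page URL with urllib.parse.urljoin; this is an illustration of the idea, not necessarily how the library computes absolute_links internally.

```python
from urllib.parse import urljoin

# The page the links were found on (taken from the example above).
base = "https://www.python.org/"

raw_links = [
    "/about/apps/",                   # site-relative path
    "//docs.python.org/3/tutorial/",  # protocol-relative URL
    "https://status.python.org/",     # already absolute
]

# urljoin fills in whatever the link is missing (scheme and/or host).
absolute = {urljoin(base, link) for link in raw_links}
print(sorted(absolute))
```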

Select an element with a CSS Selector:

>>> about = r.html.find('#about', first=True)

Grab an element's text contents:

>>> print(about.text)
About
Applications
Quotes
Getting Started
Help
Python Brochure

Introspect an Element's attributes:

>>> about.attrs
{'id': 'about', 'class': ('tier-1', 'element-1'), 'aria-haspopup': 'true'}

Render out an Element's HTML:

>>> about.html
'<li aria-haspopup="true" class="tier-1 element-1 " id="about">\n<a class="" href="/about/" title="">About</a>\n<ul aria-hidden="true" class="subnav menu" role="menu">\n<li class="tier-2 element-1" role="treeitem"><a href="/about/apps/" title="">Applications</a></li>\n<li class="tier-2 element-2" role="treeitem"><a href="/about/quotes/" title="">Quotes</a></li>\n<li class="tier-2 element-3" role="treeitem"><a href="/about/gettingstarted/" title="">Getting Started</a></li>\n<li class="tier-2 element-4" role="treeitem"><a href="/about/help/" title="">Help</a></li>\n<li class="tier-2 element-5" role="treeitem"><a href="http://brochure.getpython.info/" title="">Python Brochure</a></li>\n</ul>\n</li>'

Select Elements within Elements:

>>> about.find('a')
[<Element 'a' href='/about/' title='' class=''>, <Element 'a' href='/about/apps/' title=''>, <Element 'a' href='/about/quotes/' title=''>, <Element 'a' href='/about/gettingstarted/' title=''>, <Element 'a' href='/about/help/' title=''>, <Element 'a' href='http://brochure.getpython.info/' title=''>]

Search for links within an element:

>>> about.absolute_links
{'http://brochure.getpython.info/', 'https://www.python.org/about/gettingstarted/', 'https://www.python.org/about/', 'https://www.python.org/about/quotes/', 'https://www.python.org/about/help/', 'https://www.python.org/about/apps/'}

Search for text on the page:

>>> r.html.search('Python is a {} language')[0]
'programming'
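
The {} in the template acts as a placeholder that captures whatever text sits between the literal parts. As a rough mental model only, the same behaviour can be approximated with a regular expression; search_template below is a hypothetical helper written for this illustration and is not part of the library.

```python
import re

def search_template(template, text):
    # Hypothetical helper: treat each "{}" as a non-greedy capture group
    # and everything else in the template as literal text.
    literal_parts = [re.escape(part) for part in template.split("{}")]
    pattern = "(.+?)".join(literal_parts)
    match = re.search(pattern, text)
    return match.groups() if match else None

page_text = "Python is a programming language that lets you work quickly"
found = search_template("Python is a {} language", page_text)
print(found[0])
```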

More complex CSS Selector example (copied from Chrome dev tools):

>>> r = session.get('https://github.com/')
>>> sel = 'body > div.application-main > div.jumbotron.jumbotron-codelines > div > div > div.col-md-7.text-center.text-md-left > p'
>>> print(r.html.find(sel, first=True).text)
GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside millions of other developers.

XPath is also supported:

>>> r.html.xpath('/html/body/div[1]/a')
[<Element 'a' class=('px-2', 'py-4', 'show-on-focus', 'js-skip-to-content') href='#start-of-content' tabindex='1'>]

JavaScript Support

Let's grab some text that's rendered by JavaScript. Until 2020, the Python 2.7 countdown clock (https://pythonclock.org) will serve as a good test page:

>>> r = session.get('https://pythonclock.org')

Let's try to see the dynamically rendered code (the countdown clock). To do that quickly at first, we'll search between the last text we see before it ('Python 2.7 will retire in...') and the first text we see after it ('Enable Guido Mode').

>>> r.html.search('Python 2.7 will retire in...{}Enable Guido Mode')[0]
'</h1>\n        </div>\n        <div class="python-27-clock"></div>\n        <div class="center">\n            <div class="guido-button-block">\n                <button class="js-guido-mode guido-button">'

Notice the clock is missing. The render() method takes the response and renders the dynamic content just like a web browser would.

>>> r.html.render()
>>> r.html.search('Python 2.7 will retire in...{}Enable Guido Mode')[0]
'</h1>\n        </div>\n        <div class="python-27-clock is-countdown"><span class="countdown-row countdown-show6"><span class="countdown-section"><span class="countdown-amount">1</span><span class="countdown-period">Year</span></span><span class="countdown-section"><span class="countdown-amount">2</span><span class="countdown-period">Months</span></span><span class="countdown-section"><span class="countdown-amount">28</span><span class="countdown-period">Days</span></span><span class="countdown-section"><span class="countdown-amount">16</span><span class="countdown-period">Hours</span></span><span class="countdown-section"><span class="countdown-amount">52</span><span class="countdown-period">Minutes</span></span><span class="countdown-section"><span class="countdown-amount">46</span><span class="countdown-period">Seconds</span></span></span></div>\n        <div class="center">\n            <div class="guido-button-block">\n                <button class="js-guido-mode guido-button">'

Let's clean it up a bit. This step is not required; it just makes the returned HTML easier to read, so we can see what to target to extract the information we need.

>>> from pprint import pprint
>>> pprint(r.html.search('Python 2.7 will retire in...{}Enable')[0])
('</h1>\n'
'        </div>\n'
'        <div class="python-27-clock is-countdown"><span class="countdown-row '
'countdown-show6"><span class="countdown-section"><span '
'class="countdown-amount">1</span><span '
'class="countdown-period">Year</span></span><span '
'class="countdown-section"><span class="countdown-amount">2</span><span '
'class="countdown-period">Months</span></span><span '
'class="countdown-section"><span class="countdown-amount">28</span><span '
'class="countdown-period">Days</span></span><span '
'class="countdown-section"><span class="countdown-amount">16</span><span '
'class="countdown-period">Hours</span></span><span '
'class="countdown-section"><span class="countdown-amount">52</span><span '
'class="countdown-period">Minutes</span></span><span '
'class="countdown-section"><span class="countdown-amount">46</span><span '
'class="countdown-period">Seconds</span></span></span></div>\n'
'        <div class="center">\n'
'            <div class="guido-button-block">\n'
'                <button class="js-guido-mode guido-button">')

The rendered html has all the same methods and attributes as above. Let's extract just the data that we want out of the clock into something easy to use elsewhere and introspect like a dictionary.

>>> periods = [element.text for element in r.html.find('.countdown-period')]
>>> amounts = [element.text for element in r.html.find('.countdown-amount')]
>>> countdown_data = dict(zip(periods, amounts))
>>> countdown_data
{'Year': '1', 'Months': '2', 'Days': '5', 'Hours': '23', 'Minutes': '34', 'Seconds': '37'}
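
The pairing step can be tried without a network connection. This sketch hard-codes the two lists with the values from the output above (in the real session they come from r.html.find) and additionally converts the amounts to integers, which is an optional extra not shown in the original example.

```python
# Values copied from the example output above; in a live session these
# lists come from the .countdown-period and .countdown-amount elements.
periods = ['Year', 'Months', 'Days', 'Hours', 'Minutes', 'Seconds']
amounts = ['1', '2', '5', '23', '34', '37']

# zip pairs the items positionally; dict turns the pairs into a mapping.
countdown_data = dict(zip(periods, amounts))

# Optional: the scraped text is always str, so convert if you need numbers.
countdown_ints = {period: int(amount) for period, amount in countdown_data.items()}
print(countdown_ints['Days'])
```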

You can also do this asynchronously:

>>> async def get_pyclock():
...     r = await asession.get('https://pythonclock.org/')
...     await r.html.arender()
...     return r
...
>>> results = asession.run(get_pyclock, get_pyclock, get_pyclock)

The rest of the code works the same way as the synchronous version, except that results is a list of response objects. The same steps shown above can be applied to each one to extract the data you want.

Note, the first time you ever run the render() method, it will download Chromium into your home directory (e.g. ~/.pyppeteer/). This only happens once.

Using without Requests

You can also use this library without Requests:

>>> from requests_html import HTML
>>> doc = """<a href='https://httpbin.org'>"""
>>> html = HTML(html=doc)
>>> html.links
{'https://httpbin.org'}

Installation

$ pipenv install requests-html
✨🍰✨

Only Python 3.6 and above is supported.

Comments
  • UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 89: invalid continuation byte

    Hi @kennethreitz, first, thanks for the great library.

    I am running into the problem reported in #78: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 89: invalid continuation byte.

    • pip install -U git+https://github.com/kennethreitz/requests-html
    • Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40) [MSC v.1900 64 bit (AMD64)]
    from requests_html import HTMLSession 
    session = HTMLSession()
    r = session.get('http://www.nm-n-tax.gov.cn/nmgsj/ssxc/msdt/list_1.shtml')
    r.html.render()
    

    d:\python36\lib\site-packages\pyppeteer\launcher.py in launch(self)
        127             raise BrowserError('Unexpectedly chrome process closed with '
        128                                f'return code: {self.proc.returncode}')
    --> 129         msg = self.proc.stdout.readline().decode()
        130         if not msg:
        131             continue

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 89: invalid continuation byte

  • Getting a http.client.BadStatusLine error after calling render()

    I basically just followed the example in the documentation:

    session = HTMLSession()

    r = session.get('https://python.org/')

    After running this

    r.html.render()

    I'm getting this error

    File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
      return opener.open(url, data, timeout)
    File "/usr/lib/python3.6/urllib/request.py", line 526, in open
      response = self._open(req, data)
    File "/usr/lib/python3.6/urllib/request.py", line 544, in _open
      '_open', req)
    File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
      result = func(*args)
    File "/usr/lib/python3.6/urllib/request.py", line 1346, in http_open
      return self.do_open(http.client.HTTPConnection, req)
    File "/usr/lib/python3.6/urllib/request.py", line 1321, in do_open
      r = h.getresponse()
    File "/usr/lib/python3.6/http/client.py", line 1346, in getresponse
      response.begin()
    File "/usr/lib/python3.6/http/client.py", line 307, in begin
      version, status, reason = self._read_status()
    File "/usr/lib/python3.6/http/client.py", line 289, in _read_status
      raise BadStatusLine(line)
    http.client.BadStatusLine: GET /json/version HTTP/1.1

    r.html.html prints the entire DOM but I'm not sure why I would get a http.client.BadStatusLine error.

    Is this the right way to do this, or am I missing something here?

    I'm currently using Python 3.6.9

    Thanks

  • Scraper throws error instead of pulling values from a webpage

    I've written a script in Python to get the price of the last trade from a JavaScript-rendered webpage. I can get the content if I use Selenium. My goal here is to avoid a browser simulator, because the latest release of Requests-HTML is supposed to be able to parse JavaScript-rendered content. However, I have not been able to make it work.

    import requests_html
    
    with requests_html.HTMLSession() as session:
        r = session.get('https://www.gdax.com/trade/LTC-EUR')
        js = r.html.render()
        item = js.find('.MarketInfo_market-num_1lAXs',first=True).text
        print(item)
    

    When I execute the script I get the following error (partial traceback):

    Traceback (most recent call last):
      File "C:\Users\ar\AppData\Local\Programs\Python\Python35-32\new_line_one.py", line 27, in <module>
        item = js.find('.MarketInfo_market-num_1lAXs',first=True).text
    AttributeError: 'NoneType' object has no attribute 'find'
    Error in atexit._run_exitfuncs:
    Traceback (most recent call last):
      File "C:\Users\ar\AppData\Local\Programs\Python\Python35-32\lib\shutil.py", line 381, in _rmtree_unsafe
        os.unlink(fullname)
    PermissionError: [WinError 5] Access is denied:
    
  • Render w/o request doesn't execute inline JS

    This lib looks great, thanks :)... Just a note, I was expecting:

    doc = """<a href='https://httpbin.org'>"""
    html = HTML(html=doc)
    html.render()
    html.html
    

    to output: <a href='https://httpbin.org'>

    Instead I get the content from example.org, which is the default url.

    How can I set the html content and then render it? I can't seem to pass it to:

    doc = """<a href='https://httpbin.org'>"""
    html = HTML(html=doc)
    html.render(script=doc)
    html.html
    

    either, as I get an:

    BrowserError: Evaluation failed: SyntaxError: Unexpected token <
    pageFunction:
    <a href='https://httpbin.org'>
    

    I could set the url to the local file and patch it in, but that solution seems lacking.

  • decode error

    from requests_html import HTML
    from pyquery import PyQuery
    
    default_encoding = 'gbk'
    test_html = "<html><body><p>Hello World!--你好世界</p></body></html>".encode(default_encoding)
    
    element = HTML(url='http://example.com/hello_world', html=test_html, default_encoding=default_encoding)
    print(element.text)
    
    print(PyQuery(test_html)('html').text())
    print(PyQuery(test_html.decode(default_encoding))('html').text())
    
    

    output:

    C:\Users\what\PycharmProjects\untitled\venv\Scripts\python.exe C:/Users/what/PycharmProjects/requests-html/BUG.py
    Hello World!--ÄãºÃÊÀ½ç
    Hello World!--ÄãºÃÊÀ½ç
    Hello World!--你好世界
    
    Process finished with exit code 0
    

    So the html at https://github.com/kennethreitz/requests-html/blob/master/requests_html.py#L319 should be decoded.
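
The garbled output in this report can be reproduced with the standard library alone. The sketch below assumes the GBK bytes end up being interpreted as Latin-1 somewhere along the way; that is a guess that matches the output shown, not something confirmed from the library source. Decoding the bytes with the codec they were encoded in gives the expected text.

```python
text = "Hello World!--你好世界"

# The report encodes the document as GBK before handing it over as bytes.
raw = text.encode("gbk")

# Assumption: the bytes are read back as Latin-1, one character per byte.
garbled = raw.decode("latin-1")

# Decoding with the original codec recovers the text.
fixed = raw.decode("gbk")

print(garbled)
print(fixed)
```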

  • Django Support?

    Can anyone point me to a way to use this in a Django view? Is it currently possible? I've had success with this framework on the command line but haven't been able to get it working within Django.

    def render_javascript(url):
        session = HTMLSession()
        response = session.get(url)
        session.close()
        return response.html.render()
    

    Gives me RuntimeError: There is no current event loop in thread 'Thread-1'. and

    def render_javascript(url):
        session = AsyncHTMLSession()
        response = await session.get(url)
        await session.close()
        return await response.html.arender()
    

    Gives me 'coroutine' object has no attribute 'get' (general lack of support for async views in Django)

    I've tried a bunch of stuff suggested for Flask in similar issues: https://github.com/psf/requests-html/issues/155 https://github.com/psf/requests-html/issues/326 https://github.com/psf/requests-html/issues/293 ...but still no luck with Django.

    I'm hoping this is a common enough use case that someone can advise or point me to an example of working code.

    Thank you

  • pyppeteer.errors.BrowserError: Failed to connect to browser port: http://127.0.0.1:58331/json/version

    I use PyCharm to connect to Ubuntu remotely and am using the requests-html library for the first time. When I call r.html.render(), I get an error saying it can't connect to the browser port. Why does this happen, and is there a solution?

  • Every time I call r.html.render(), I get the error "This event loop is already running"

    I wrote code like this:

    from requests_html import HTMLSession
    session = HTMLSession()
    r = session.get(url)
    r.html.links
    

    I used this to get data from a website and found that the page needs to load JavaScript, so I wrote the following:

    r.html.render()
    

    It gave the following message:

    RuntimeError: This event loop is already running

    I checked the HTML source and it had not changed, so I tried again and again, but it reported the same error each time, and the Chromium process it started stopped responding. This code was run in a Jupyter notebook. OS: Mac OS X 10.12.6, Python: 3.6.2.

    I don't know what happened or how to resolve it.
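
A likely cause, stated with some caution: a Jupyter kernel already runs an asyncio event loop, and a blocking run_until_complete call on a loop that is already running raises exactly this RuntimeError. The standard-library sketch below reproduces the error and shows that awaiting the coroutine works instead; work is a hypothetical stand-in for the rendering coroutine.

```python
import asyncio

async def work():
    # Hypothetical stand-in for the coroutine that does the rendering.
    return "done"

async def inside_running_loop():
    loop = asyncio.get_running_loop()
    message = None
    coro = work()
    try:
        # Blocking on the loop from code that the loop itself is running
        # is not allowed and raises RuntimeError.
        loop.run_until_complete(coro)
    except RuntimeError as exc:
        message = str(exc)
    finally:
        coro.close()  # the coroutine never started; close it cleanly
    # Awaiting is the supported way to run a coroutine from async code.
    result = await work()
    return message, result

message, result = asyncio.run(inside_running_loop())
print(message)
print(result)
```

In requests-html terms, this corresponds to using AsyncHTMLSession and awaiting arender() when a loop is already running.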

  • Can't find the element that is visible in page

    Hi, I ran into a problem when trying to find an element on a page:

    html.find("#productDetails_detailBullets_sections1")
    

    This returns an empty list, but the element is visible on the page :(

  • error while using html.render

    from requests_html import HTMLSession

    session = HTMLSession()
    r = session.get('https://python.org/')
    r.html.render()
    r.html.search('Python 2 will retire in only {months} months!')['months']
    

    This produces the following error:

    File "/anaconda3/anaconda/lib/python3.6/site-packages/requests_html.py", line 559, in render
        content, result, page = loop.run_until_complete(_async_render(url=self.url, script=script, sleep=sleep, wait=wait, content=self.html, reload=reload, scrolldown=scrolldown, timeout=timeout))
      File "/anaconda3/anaconda/lib/python3.6/asyncio/base_events.py", line 467, in run_until_complete
        return future.result()
      File "/anaconda3/anaconda/lib/python3.6/site-packages/requests_html.py", line 517, in _async_render
        page = await browser.newPage()
    AttributeError: 'coroutine' object has no attribute 'newPage'
    sys:1: RuntimeWarning: coroutine 'launch' was never awaited
    
  • Getting error pyppeteer.errors.BrowserError: Unexpectedly chrome process closed with return code: 1


    r = session.get('http://python-requests.org/')

    r.html.render()

    r.html.render()
    [W:pyppeteer.chromium_downloader] start chromium download. Download may take a few minutes.
    [W:pyppeteer.chromium_downloader] chromium download done.
    [W:pyppeteer.chromium_downloader] chromium extracted to: /home/hadn/.pyppeteer/local-chromium/533271
    Traceback (most recent call last):
      File "", line 1, in
      File "/mnt/website/flask_miguel/venv/lib64/python3.6/site-packages/requests_html.py", line 282, in render
        content, result = loop.run_until_complete(_async_render(url=self.url, script=script, sleep=sleep, scrolldown=scrolldown))
      File "/usr/lib64/python3.6/asyncio/base_events.py", line 467, in run_until_complete
        return future.result()
      File "/mnt/website/flask_miguel/venv/lib64/python3.6/site-packages/requests_html.py", line 250, in _async_render
        browser = pyppeteer.launch(headless=True)
      File "/mnt/website/flask_miguel/venv/lib64/python3.6/site-packages/pyppeteer/launcher.py", line 146, in launch
        return Launcher(options, **kwargs).launch()
      File "/mnt/website/flask_miguel/venv/lib64/python3.6/site-packages/pyppeteer/launcher.py", line 111, in launch
        raise BrowserError('Unexpectedly chrome process closed with '
    pyppeteer.errors.BrowserError: Unexpectedly chrome process closed with return code: 1

    My pyppeteer version is 0.0.10.

  • Is there an option to disable the chromium download?

    More of a request: when using this in a container, I find it more efficient to download Chromium separately when running the compose file. Is there any way to disable the library's Chromium download? It seems to slow down the initialization of my API.

  • typo in the docs

    In the docs, in the example that fetches three websites using asyncio, you declare a session named "asession" but at the end you use "session".

    (To find that page, search for "reddit" across the files.)

  • Requests-HTML documentation site has incorrect python version support listed

    This site, https://requests.readthedocs.io/projects/requests-html/en/latest/, lists the Python version support as "Only Python 3.6 is supported". This is obviously wrong and someone should change it.

  • RuntimeError: There is no current event loop in thread 'Dummy-5'.

    I'm using requests-html inside a Celery task, but I get this error: "RuntimeError: There is no current event loop in thread 'Dummy-5'."

    I fixed it like this:

    class HTMLSession(BaseSession):
    
        def __init__(self, **kwargs):
            super(HTMLSession, self).__init__(**kwargs)
    
        @property
        def browser(self):
            if not hasattr(self, "_browser"):
                try:
                    self.loop = asyncio.get_event_loop()
                except RuntimeError as ex:
                    if "There is no current event loop in thread" in str(ex):
                        self.loop = asyncio.new_event_loop()
                        asyncio.set_event_loop(self.loop)
                        self.loop = asyncio.get_event_loop()
                if self.loop.is_running():
                    raise RuntimeError("Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead.")
                self._browser = self.loop.run_until_complete(super().browser)
    
            return self._browser
    

    It works fine, but I'm having problems closing the Celery worker. Could it be because I'm running the Celery worker with gevent? Does anyone know how gevent and asyncio work together?
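
The core of the workaround above is that a thread other than the main one has no event loop until one is created for it. The standard-library sketch below shows that step in isolation; fake_render and worker are hypothetical names, and no browser or Celery is involved, so it says nothing about the gevent interaction.

```python
import asyncio
import threading

async def fake_render():
    # Hypothetical stand-in for the coroutine that drives the browser.
    await asyncio.sleep(0)
    return "rendered"

def worker(out):
    # A worker thread starts without an event loop, so create one and
    # register it as this thread's current loop.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        out.append(loop.run_until_complete(fake_render()))
    finally:
        # Close the loop so the thread does not leave resources behind.
        loop.close()

results = []
thread = threading.Thread(target=worker, args=(results,))
thread.start()
thread.join()
print(results)
```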

  • Pages containing paginated tables confusing parser

    Hi: I am trying to parse a table's contents on a page containing one normal table (the one I am interested in) and other, paginated tables. I think I can get hold of the correct table element, but when I try to access its rows, find returns rows from other tables as well. Beautiful Soup seems to give the correct result.

    Here is the code to reproduce the problem:

    from requests_html import HTMLSession
    
    url = "https://www.meds-sdmm.dfo-mpo.gc.ca/isdm-gdsi/twl-mne/inventory-inventaire/sd-ds-eng.asp?no=9422&user=isdm-gdsi&region=MEDS"
    
    html = HTMLSession().get(url).html
    
    # get the first table and its rows
    t = html.find("table", first=True)
    
    print("--- rows parsed by requests_html --- ")
    for row in t.find("tbody > tr"):
        print(row.html)
    
    print(t.attrs)
    
    
    # using beautiful soup
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html.html, "html.parser")
    
    print("--- rows parsed by bs4 --- ")
    for row in soup.table.find_all("tr"):
        print(row)
    

    I am using requests-html 0.10.0 with Python 3.9.

    Thanks
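
For the paginated-tables report above, one way to cross-check which rows belong to the first table, independently of both requests-html and Beautiful Soup, is a small standard-library parser. This is only a sketch under assumptions: the class name `FirstTableRows` and the sample markup are invented for illustration (they are not taken from the page in the report), and the input is assumed to have properly closed table tags.

```python
from html.parser import HTMLParser


class FirstTableRows(HTMLParser):
    """Collect cell text for rows that sit directly in the first <table>."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # current <table> nesting depth
        self.done = False     # True once the first table has been closed
        self.in_cell = False  # True while inside a <td>/<th> of the first table
        self.rows = []        # one list of cell strings per row

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "table":
            self.depth += 1
        elif self.depth == 1 and tag == "tr":
            self.rows.append([])
        elif self.depth == 1 and tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if self.done:
            return
        if tag == "table":
            self.depth -= 1
            if self.depth == 0:
                self.done = True
        elif self.depth == 1 and tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        # Text inside nested tables (depth > 1) is ignored.
        if self.in_cell and self.depth == 1 and not self.done:
            self.rows[-1][-1] += data.strip()


# Hypothetical sample: two sibling tables, only the first is of interest.
sample = """
<table id="first"><tbody>
<tr><td>a</td><td>b</td></tr>
<tr><td>c</td><td>d</td></tr>
</tbody></table>
<table id="second"><tbody><tr><td>x</td></tr></tbody></table>
"""

parser = FirstTableRows()
parser.feed(sample)
rows = parser.rows
print(rows)  # [['a', 'b'], ['c', 'd']]
```

Feeding the downloaded page source into such a parser gives a third opinion on the first table's row count, which may help tell a selector-scoping problem apart from malformed markup.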

A HTML-code compiler-thing that lets you reuse HTML code.

RHTML RHTML stands for Reusable-Hyper-Text-Markup-Language, and is pronounced "Rech-tee-em-el" despite how its abbreviation is. As the name stands, RH

Nov 15, 2021
This project takes a special TXT file as input, divides its content into a list of HTML objects, and then creates an HTML file from them.

This project takes a special TXT file as input, divides its content into a list of HTML objects, and then creates an HTML file from them.

Jan 10, 2022
Lektor-html-pretify - Lektor plugin to pretify the HTML DOM using Beautiful Soup

html-pretify Lektor plugin to pretify the HTML DOM using Beautiful Soup. How doe

Jan 26, 2022
Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes

Bleach Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes. Bleach can also linkify text safely, appl

Sep 22, 2022
Safely add untrusted strings to HTML/XML markup.

MarkupSafe MarkupSafe implements a text object that escapes characters so it is safe to use in HTML and XML. Characters that have special meanings are

Sep 12, 2022
A library for converting HTML into PDFs using ReportLab

XHTML2PDF The current release of xhtml2pdf is xhtml2pdf 0.2.5. Release Notes can be found here: Release Notes As with all open-source software, its us

Sep 22, 2022
Generate HTML using Python 3 with an API that follows the DOM standard specification.

Generate HTML using Python 3 with an API that follows the DOM standard specification. A JavaScript API and tons of cool features. Can be used as a fast prototyping tool.

Sep 13, 2022
A python HTML builder library.

PyML A python HTML builder library. Goals Fully functional html builder similar to the javascript node manipulation. Implement an html parser that ret

Jul 4, 2022
Modded MD conversion to HTML

MDPortal A module to convert an md-esque lang to html Basically I ruined md in an attempt to convert it to html Overview Here is a demo file from parse

Nov 27, 2021
Dominate is a Python library for creating and manipulating HTML documents using an elegant DOM API

Dominate Dominate is a Python library for creating and manipulating HTML documents using an elegant DOM API. It allows you to write HTML pages in pure

Sep 20, 2022
Pythonic HTML Parsing for Humans™

Requests-HTML: HTML Parsing for Humans™ This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible. When us

Sep 26, 2022
NYCT-GTFS - Real-time NYC subway data parsing for humans

NYCT-GTFS - Real-time NYC subway data parsing for humans This python library provides a human-friendly, native python interface for dealing with the N

Aug 10, 2022
Implementation of fast algorithms for Maximum Spanning Tree (MST) parsing that includes fast ArcMax+Reweighting+Tarjan algorithm for single-root dependency parsing.

Fast MST Algorithm Implementation of fast algorithms for (Maximum Spanning Tree) MST parsing that includes fast ArcMax+Reweighting+Tarjan algorithm fo

Feb 26, 2022
Course-parsing - Parsing Course Info for NIT Kurukshetra

Parsing Course Info for NIT Kurukshetra Overview This repository houses code for

Feb 3, 2022
Standards-compliant library for parsing and serializing HTML documents and fragments in Python

html5lib html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all majo

Sep 15, 2022
Parser manager for parsing DOC, DOCX, PDF or HTML files

Parser manager Description Parser gets PDF, DOC, DOCX or HTML file via API and saves parsed data to the database. Implemented in Ruby 3.0.1 using Acti

Dec 4, 2021
A markdown lexer and parser which gives the programmer atomic control over markdown parsing to html.

A markdown lexer and parser which gives the programmer atomic control over markdown parsing to html.

Aug 13, 2022
Use minify-html, the extremely fast HTML + JS + CSS minifier, with Django.

django-minify-html Use minify-html, the extremely fast HTML + JS + CSS minifier, with Django. Requirements Python 3.8 to 3.10 supported. Django 2.2 to

Sep 5, 2022