What is the quickest method for retrieving li data using selenium?

Question

Greetings! Your attention to this post is greatly appreciated.

I recently tried to collect the comments on a particular news article. Of the roughly 11,000 comments attached to it, I managed to retrieve about 6,000. The full list of comments is available through this link: (Don't worry that it's in Korean; the page is easy to navigate).

Please note that this link leads to the mobile version of the page, and the following code needs to be used to reveal the entire comment thread:

driver.find_element_by_xpath("//span[@class='u_cbox_page_more']").click()

The problem is that my approach to extracting the data was very slow. It ran for more than an hour before I finally had to stop it. Here is the code I used:

content = []
name = []
r_time = []

comment_list = driver.find_elements_by_xpath("//ul[@class='u_cbox_list']/li")
              
for comment in comment_list:
    try:
        con = comment.find_element_by_xpath(".//span[@class='u_cbox_contents']").text
        content.append(con)
    except NoSuchElementException:
        continue

    name.append(comment.find_element_by_xpath(".//span[@class='u_cbox_nick']").text)        
    r_time.append(comment.find_element_by_xpath(".//span[@class='u_cbox_date']").text)

I have many more news articles to scrape, and waiting this long for each crawl is not feasible. There must be a more efficient way to get the data. I looked into JavaScript but couldn't find a Selenium-compatible solution written in Python, and my knowledge of JavaScript is limited.

If there exists an alternative approach and someone could furnish me with a working example, I am eager to learn and adapt swiftly. Any guidance or assistance provided would be immensely appreciated.

Thank you for dedicating your time and expertise to aid in this endeavor. Your invaluable support is anticipated and warmly welcomed.

javascript python-3.x selenium selenium-webdriver

Answer 1

I have managed to reduce the time it takes to retrieve the comments from this page to around 17 minutes: 11 minutes clicking the "show more" link and 6 minutes fetching the data.

Code:

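from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get('https://n.news.naver.com/mnews/article/comment/023/0003390153?sid=102')

content = []
name = []
r_time = []

# Wait until the "show more" link is clickable; needed before clicking it via JS.
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CLASS_NAME, "u_cbox_page_more")))

# Keep clicking "show more" until no visible link is left. At that point
# querySelector returns null, the script raises, and the loop ends.
while True:
    try:
        driver.execute_script("document.querySelector(\".u_cbox_paginate[style=''] .u_cbox_page_more\").click(); window.scrollTo(0,document.body.scrollHeight);")
        # Alternative ways to click, compared in the table below:
        # WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CLASS_NAME, "u_cbox_page_more"))).click()
        # WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".u_cbox_page_more"))).click()
        # WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//span[@class='u_cbox_page_more']"))).click()
    except WebDriverException:
        break

comment_list = driver.find_elements(By.XPATH, "//ul[@class='u_cbox_list']/li")

for comment in comment_list:
    try:
        # Read the text through JS instead of find_element(...).text
        con = driver.execute_script("return arguments[0].querySelector('.u_cbox_contents').innerText;", comment)
        content.append(con)
    except WebDriverException:
        # No content span in this comment: skip it.
        continue

    name.append(driver.execute_script("return arguments[0].querySelector('.u_cbox_nick').innerText;", comment))
    r_time.append(driver.execute_script("return arguments[0].querySelector('.u_cbox_date').innerText;", comment))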
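The fetching loop above still makes up to three execute_script calls per comment, and each call is a separate round trip to the browser. A possible further step, not something I have timed, is to collect everything in one script call and return a list of objects. The sketch below assumes the same class names as above and that the CSS selector `ul.u_cbox_list > li` matches the same items as the XPath; `EXTRACT_JS` and `fetch_comments` are hypothetical names chosen for illustration.

```python
# JavaScript run inside the page: build one plain object per comment.
# A missing span yields null instead of raising, so nothing throws in the browser.
EXTRACT_JS = """
return Array.from(document.querySelectorAll('ul.u_cbox_list > li')).map(function (li) {
    var text = function (selector) {
        var node = li.querySelector(selector);
        return node ? node.innerText : null;
    };
    return {
        content: text('.u_cbox_contents'),
        name: text('.u_cbox_nick'),
        time: text('.u_cbox_date')
    };
});
"""


def fetch_comments(driver):
    """Return (content, name, r_time) lists using a single execute_script call.

    `driver` is any object with a Selenium-style execute_script method.
    """
    rows = driver.execute_script(EXTRACT_JS)
    content, name, r_time = [], [], []
    for row in rows:
        # Skip comments without a content span, as the loop above does.
        if row["content"] is None:
            continue
        content.append(row["content"])
        name.append(row["name"])
        r_time.append(row["time"])
    return content, name, r_time
```

After the "show more" loop has finished, this would be called as `content, name, r_time = fetch_comments(driver)`. One difference from the loop above: a comment that has content but no nickname or date gets `None` in that position instead of raising an error.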

Bonus. The code above contains 4 different ways to click the "show more" link until all comments are displayed (one active, three commented out). I timed each of them:

|--------------|---------|
| Click method | Time, s |
|--------------|---------|
| JS           |   656.9 |
| class name   |   728.1 |
| css          |   736.5 |
| xpath        |   774.3 |
|--------------|---------|
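To repeat a comparison like this on another article, each phase can be wrapped in a small timer. This is only a sketch of one way to do it, not necessarily how the numbers above were measured; `timed` and `timings` are hypothetical names.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the with-block in results[label], in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so a failed run still shows its duration.
        results[label] = time.perf_counter() - start


timings = {}
with timed("example", timings):
    sum(range(1000))  # stand-in for the "show more" loop or the fetching loop
```

In the script above, the `while True` loop would go inside one `with timed("click", timings):` block and the fetching loop inside another, and `timings` would then hold one entry per phase.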
