I'm using Selenium WebDriver to scrape content from a website that, unfortunately, has no API. The site uses AJAX to load content dynamically as the user scrolls down the page. To get at that content, I scroll down with JavaScript and then try to fetch the elements with findElements().
As for the setup: the page consists of various nested elements, among them one div with the class "GridItems" (no name or id). Inside that div are many child elements with the class "Item" (again no name or id, just the class attribute). My goal is to grab every element with the "Item" class inside that container div. About 25 items are present when the page first loads, and more load as the user scrolls further.
I'm running into two main problems. The first is knowing when to stop scrolling, i.e. detecting that I've reached the bottom of the page; I haven't found a reliable stopping condition. I considered document.body.scrollHeight, but it only reports the height of the content loaded so far, not the total height once everything has loaded. Another suggestion was to test whether some element at the bottom of the page is visible/clickable, but if it isn't, that may simply mean it hasn't loaded yet rather than that it's unreachable. Even with a Wait, a timeout wouldn't tell me which of the two happened.
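For concreteness, the most promising stopping condition I've come across is to compare document.body.scrollHeight before and after each scroll: once a scroll (plus a pause for the AJAX requests) no longer increases it, nothing more is loading. A minimal sketch of that check, assuming the same JavascriptExecutor js as in my code below:

    // Scroll until the page height stops growing, i.e. the real bottom.
    static void scrollToBottom(JavascriptExecutor js) throws InterruptedException {
        long lastHeight = (Long) js.executeScript("return document.body.scrollHeight");
        while (true) {
            js.executeScript("window.scrollTo(0, document.body.scrollHeight)");
            Thread.sleep(2000); // crude pause to let the AJAX requests finish
            long newHeight = (Long) js.executeScript("return document.body.scrollHeight");
            if (newHeight == lastHeight) {
                break; // height unchanged after a scroll: no more content is coming
            }
            lastHeight = newHeight;
        }
    }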
The second problem is that as I scroll, newly loaded elements push older ones out of the DOM. So if I simply scroll to the bottom and then call findElements(), I miss the items that were displaced along the way. Currently I handle this as follows:
    import java.util.ArrayList;
    import java.util.List;
    import org.openqa.selenium.By;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.support.ui.ExpectedConditions;
    import org.openqa.selenium.support.ui.WebDriverWait;

    // driver is the WebDriver instance; js is (JavascriptExecutor) driver
    int numitems = 135; // expected total number of items
    List<WebElement> newitems;
    List<WebElement> allitems = new ArrayList<WebElement>(50);
    do {
        // scroll to the bottom of the currently loaded content three times,
        // giving the AJAX calls a chance to append new items in between
        for (int i = 0; i < 3; i++) {
            js.executeScript("window.scrollTo(0, document.body.offsetHeight)");
        }
        // make sure the container div is present before proceeding
        WebElement cont = (new WebDriverWait(driver, 100))
                .until(ExpectedConditions.presenceOfElementLocated(By.className("GridItems")));
        // retrieve all Items currently inside the div
        newitems = cont.findElements(By.className("Item"));
        // append this round's items to the running list
        allitems.addAll(newitems);
        // continue until the list reaches the expected item count
    } while (numitems > allitems.size());
In other words: scroll a few times, grab whatever elements are currently loaded, append them to a list, and repeat until the list reaches the expected number of items.
The obvious flaw is that an unpredictable number of items loads with each scroll, so successive iterations re-collect elements that are already in allitems. Since the elements have no unique identifier short of extracting their content, deduplication is difficult. I can also lose items outright if the scrolling doesn't cover the whole content area in exact steps. And because I only process the list at the end, my references to the earlier elements may have gone stale by then.
I could instead process each item immediately after collecting it, which would sidestep the staleness problem at the cost of more complicated code. Doing so would also let me verify the content and spot duplicates, though it still wouldn't guarantee complete coverage.
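If I go that route, the idea would be something like the following, where extractContent() and process() are placeholders for my actual extraction and storage logic, and I'm assuming the extracted content is distinctive enough to act as a key in a HashSet:

    import java.util.HashSet;
    import java.util.Set;

    Set<String> seen = new HashSet<String>();
    for (WebElement item : newitems) {
        // extractContent() stands in for whatever getText()/getAttribute()
        // calls pull out the data I actually care about
        String content = extractContent(item);
        // Set.add() returns false for duplicates, so each unique item
        // is processed exactly once, no matter how often it is re-collected
        if (seen.add(content)) {
            process(content);
        }
    }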
If anyone can suggest a better strategy, or spot something crucial I've overlooked, I'd be grateful. The existing Stack Overflow questions about AJAX-loaded content deal with different problems; mine is specifically about how to extract all of the content efficiently. Intuitively it feels like there should be a cleaner way to do this - is there one?
Apologies for the long write-up; I wanted to be clear. Any input is much appreciated.
Many thanks, bsg
Edit:
Note that the accepted answer only partially covers my question. For the remaining parts: scrolling one screen at a time and collecting the newly loaded elements after each scroll solved the data-loss problem. After every scroll I processed all of the currently loaded elements and saved their content, and a HashSet kept the duplicates out. I stopped scrolling once I reached the bottom of the page, which I verified with the methods from the accepted answer. Hope this helps someone.
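For anyone who finds this later, here is a rough sketch of what that final approach looks like; extractContent() and saveContent() are placeholders for your own logic, and the enclosing method needs to handle InterruptedException because of Thread.sleep():

    Set<String> seenContent = new HashSet<String>();
    long lastHeight = (Long) js.executeScript("return document.body.scrollHeight");
    while (true) {
        // process everything currently in the DOM before it can be pushed out
        WebElement cont = driver.findElement(By.className("GridItems"));
        for (WebElement item : cont.findElements(By.className("Item"))) {
            String content = extractContent(item); // placeholder extraction
            if (seenContent.add(content)) {        // HashSet filters duplicates
                saveContent(content);              // placeholder persistence
            }
        }
        // scroll exactly one viewport so no items are skipped over
        js.executeScript("window.scrollBy(0, window.innerHeight)");
        Thread.sleep(1500); // let the AJAX requests finish loading
        long newHeight = (Long) js.executeScript("return document.body.scrollHeight");
        long position = (Long) js.executeScript(
                "return Math.ceil(window.pageYOffset + window.innerHeight)");
        // stop once the height has stopped growing and we are at the end of it
        if (newHeight == lastHeight && position >= newHeight) {
            break;
        }
        lastHeight = newHeight;
    }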