How do I clear the cache for a URL that has already been crawled?

Currently, I run a crawler that is triggered by an Express.js request.

However, whenever I make the same request again, the crawler runs once more but reports that all routes have already been handled. I even deleted the './storage' folder.

I have gone through the documentation multiple times but still can't figure out how to successfully execute the purgeDefaultStorages() function.

Is there a way for me to completely "reset" the crawler so that there are no cached results?

import express from 'express'
import { PlaywrightCrawler, purgeDefaultStorages, enqueueLinks, Configuration } from 'crawlee';

const app = express();

let crawler

let run = async () => {
    const config = new Configuration({ 'persistStorage': false, persistStorage: false }); //already tried with and without quotes.
    Configuration.set('persistStorage', false) //added this direct configuration as a test too.
     crawler = new PlaywrightCrawler({
        launchContext: {
            launchOptions: {
                headless: true,
            },
        },

    }, config);
    crawler.router.addDefaultHandler(async ({ request, page, enqueueLinks }) => {
        console.log(`Title of ${request.loadedUrl} ': img: ${request.id}`);
        await enqueueLinks({
            strategy: 'same-domain'
        });
    });

    await crawler.run(['http://localhost:8088/']);

    try {
        await config.getStorageClient().purge()
        await config.getStorageClient().teardown() //also tried including this just in case.
        console.log('purging')
    } catch (e) {
        console.log(e)
    }
}

app.get('/', async (req, res) => {
    try {
        await run();
        res.status(200)
    } catch (e) {
        res.status(500)
    }

});

const PORT = process.env.PORT || 8889;
app.listen(PORT, () => {
    console.log(
        `The container started successfully and is listening for HTTP requests on ${PORT}`
    );
});

Answer №1

Use purgeOnStart together with purgeDefaultStorages, or set the corresponding environment variables instead:

Set the CRAWLEE_PURGE_ON_START environment variable to 0 or false.
Set the CRAWLEE_PERSIST_STORAGE environment variable to 0 or false.

Following these steps should resolve the issue you are currently facing. Feel free to refer to this helpful link for more information.
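
A minimal sketch of the environment-variable route described in this answer, assuming the variables are set inside the Node.js process before the crawler is constructed. The helper name disableCrawleeStorageDefaults is made up for illustration; only the two variable names and their values come from the answer itself.

```javascript
// Hypothetical helper (not part of Crawlee): writes the two settings from
// this answer into an environment object, process.env by default.
// Environment variables are always strings, so 'false' is stored as text.
function disableCrawleeStorageDefaults(env = process.env) {
    env.CRAWLEE_PURGE_ON_START = 'false';
    env.CRAWLEE_PERSIST_STORAGE = 'false';
    return env;
}

// Assumption: call this before the crawler is created.
disableCrawleeStorageDefaults();
```

The same two values could equally be set in the shell or in the container configuration, which keeps them out of the application code.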

Answer №2

Another option is to give each request a custom unique key.

By default, Crawlee derives a request's unique key (uniqueKey) from its URL. This means that if you add the same URL to the queue again, the request is treated as already handled and is not crawled a second time.

For instance, if you generate a UUID for every session, you can initiate the crawl in this manner:

await crawler.run([
  { url: "http://localhost:8088", uniqueKey: `http://localhost:8088:${uuid}` },
]);

When enqueuing links, utilize transformRequestFunction:

await enqueueLinks({
    strategy: 'same-domain',
    transformRequestFunction: (request) => {
        request.uniqueKey = `${request.url}:${uuid}`;
        return request;
    }
});

Find more information about unique keys at this link.
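
The two snippets above can share one small helper. The following is a rough sketch, not part of Crawlee: runScopedKey, toRunScopedRequest and makeRunScopedTransform are hypothetical names, and the run ID is assumed to be any string that differs on every Express request (for example a UUID).

```javascript
// Hypothetical helpers (not part of Crawlee): build unique keys that are
// scoped to one crawl run, so the same URL counts as new on the next run.

// Combine a URL with a per-run identifier, as in the snippets above.
function runScopedKey(url, runId) {
    return `${url}:${runId}`;
}

// Build a start request in the { url, uniqueKey } shape passed to crawler.run().
function toRunScopedRequest(url, runId) {
    return { url, uniqueKey: runScopedKey(url, runId) };
}

// Return a function that can be used as transformRequestFunction in enqueueLinks().
function makeRunScopedTransform(runId) {
    return (request) => {
        request.uniqueKey = runScopedKey(request.url, runId);
        return request;
    };
}

// Example with a fixed run ID; in practice, generate a fresh one per request.
const runId = 'run-1';
const start = toRunScopedRequest('http://localhost:8088', runId);
const transform = makeRunScopedTransform(runId);

console.log(start.uniqueKey); // http://localhost:8088:run-1
console.log(transform({ url: 'http://localhost:8088/about' }).uniqueKey); // http://localhost:8088/about:run-1
```

Keeping the key format in one function means the start URL and the enqueued links cannot drift apart, which would otherwise let some pages be skipped as duplicates while others are crawled again.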
