What is the process for clearing the cache of a crawling URL?

Currently, I am operating a crawler that gets triggered through an expressjs call.

However, whenever I make the same request again, the crawler runs once more but reports that all routes have already been handled. I even went as far as deleting the './storage' folder, with no effect.

I have gone through the documentation multiple times but still can't figure out how to successfully execute the purgeDefaultStorages() function.

Is there a way for me to completely "reset" the crawler so that there are no cached results?

import express from 'express'
import { PlaywrightCrawler, purgeDefaultStorages, enqueueLinks, Configuration } from 'crawlee';

const app = express();

let crawler

let run = async () => {
    const config = new Configuration({ 'persistStorage': false, persistStorage: false }); //already tried with and without quotes.
    Configuration.set('persistStorage', false) //added this direct configuration as a test too.
     crawler = new PlaywrightCrawler({
        launchContext: {
            launchOptions: {
                headless: true,
            },
        },

    }, config);
    crawler.router.addDefaultHandler(async ({ request, page, enqueueLinks }) => {
        console.log(`Title of ${request.loadedUrl} ': img: ${request.id}`);
        await enqueueLinks({
            strategy: 'same-domain'
        });
    });

    await crawler.run(['http://localhost:8088/']);

    try {
        await config.getStorageClient().purge()
        await config.getStorageClient().teardown() //also tried including this just in case.
        console.log('purging')
    } catch (e) {
        console.log(e)
    }
}

app.get('/', async (req, res) => {
    try {
        await run();
        res.status(200).end() // end() actually sends the response; status() alone leaves the request hanging
    } catch (e) {
        res.status(500).end()
    }
});

const PORT = process.env.PORT || 8889;
app.listen(PORT, () => {
    console.log(
        `The container started successfully and is listening for HTTP requests on ${PORT}`
    );
});

Answer №1

Use purgeOnStart together with purgeDefaultStorages(), or set the corresponding environment variables:

CRAWLEE_PURGE_ON_START controls whether the default storages are purged before a run. Purging is on by default, and setting the variable to 0 or false turns it off.
CRAWLEE_PERSIST_STORAGE set to 0 or false stops Crawlee from persisting storage to disk.

Refer to this link for more information.
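
For what it's worth, here is a minimal sketch of setting those two variables from inside the Node process rather than in the shell. applyCrawleeStorageEnv is a hypothetical helper, not part of Crawlee, and it assumes Crawlee reads CRAWLEE_PERSIST_STORAGE and CRAWLEE_PURGE_ON_START from process.env when it resolves its configuration, so it should run before any crawler is created. Which value you want for purgeOnStart depends on whether the storages should be wiped before each run.

```javascript
// Hypothetical helper (not part of Crawlee): maps two boolean settings onto
// the environment variables named in the answer above.
function applyCrawleeStorageEnv({ persistStorage, purgeOnStart }, env = process.env) {
    // Environment variables are strings; 'false' (or '0') switches an option off.
    env.CRAWLEE_PERSIST_STORAGE = persistStorage ? 'true' : 'false';
    env.CRAWLEE_PURGE_ON_START = purgeOnStart ? 'true' : 'false';
    return env;
}

// Assumption: no on-disk persistence, and purging left enabled for a clean start.
applyCrawleeStorageEnv({ persistStorage: false, purgeOnStart: true });

console.log(process.env.CRAWLEE_PERSIST_STORAGE, process.env.CRAWLEE_PURGE_ON_START);
```

Keep in mind that this only changes configuration. If the crawler is started several times inside one long-running Express process, the request queue may still remember handled URLs, which is what the unique-key approach in the next answer works around.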

Answer №2

Another option is to give every request a custom unique key.

By default, Crawlee derives a request's unique key from its URL, and the request queue deduplicates by that key. This means that if you add the same URL to the queue again, it won't be crawled a second time.

For instance, if you generate a UUID for every session, you can initiate the crawl in this manner:

await crawler.run([
  { url: "http://localhost:8088", uniqueKey: `http://localhost:8088:${uuid}` },
]);

When enqueuing links, utilize transformRequestFunction:

await enqueueLinks({
    strategy: 'same-domain',
    transformRequestFunction: (request) => {
        request.uniqueKey = `${request.url}:${uuid}`;
        return request;
    }
});

Find more information about unique keys at this link.
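
To keep the two snippets above consistent, the key-building logic can live in one place. This is only a sketch: createRunKeys is a hypothetical helper, not a Crawlee API, and runId is whatever per-run value you generate (for example with randomUUID() from node:crypto, once per Express request). The fixed strings 'run-1' and 'run-2' below are placeholders for illustration.

```javascript
// Hypothetical helper (not part of Crawlee): suffixes every unique key with a
// per-run ID so URLs handled in an earlier run are not treated as duplicates.
function createRunKeys(runId) {
    return {
        // Builds a start request suitable for crawler.run([...]).
        startRequest(url) {
            return { url, uniqueKey: `${url}:${runId}` };
        },
        // Intended to be passed as transformRequestFunction to enqueueLinks({...}).
        transformRequest(request) {
            request.uniqueKey = `${request.url}:${runId}`;
            return request;
        },
    };
}

// Two runs over the same URL end up with different unique keys.
const firstRun = createRunKeys('run-1');
const secondRun = createRunKeys('run-2');
console.log(firstRun.startRequest('http://localhost:8088/').uniqueKey);
console.log(secondRun.startRequest('http://localhost:8088/').uniqueKey);
```

Inside the route handler you would create one set of keys per call, then use keys.startRequest(...) in crawler.run([...]) and keys.transformRequest as the transformRequestFunction, so every request in that run shares the same suffix.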
