I am trying to load a complex PDF of roughly 600 pages, containing tables and figures, using the Unstructured API through LangChain.js in a Next.js project. With the fast strategy it partially works but misses some crucial data. With the hi_res strategy the request fails with a timeout error. I have tried a wide range of values for the timeout setting, but the error persists. I am willing to wait for the process to complete, so any assistance would be greatly appreciated.
ERROR:
error TypeError: fetch failed
    at Object.fetch (node:internal/deps/undici/undici:11576:11)
    at UnstructuredLoader._partition (e:/Web-Development/Developing/Nextjs/projects/gpt4-pdf/node_modules/langchain/dist/document_loaders/fs/unstructured.js:139:26)
    at UnstructuredLoader.load (e:/Web-Development/Developing/Nextjs/projects/gpt4-pdf/node_modules/langchain/dist/document_loaders/fs/unstructured.js:154:26)
    at UnstructuredDirectoryLoader.load (e:/Web-Development/Developing/Nextjs/projects/gpt4-pdf/node_modules/langchain/dist/document_loaders/fs/directory.js:80:40)
    at run (e:\Web-Development\Developing\Nextjs\projects\gpt4-pdf\scripts\ingest.ts:48:21)
    at <anonymous> (e:\Web-Development\Developing\Nextjs\projects\gpt4-pdf\scripts\ingest.ts:78:3) {
  cause: HeadersTimeoutError: Headers Timeout Error
      at Timeout.onParserTimeout [as callback] (node:internal/deps/undici/undici:9748:32)
      at Timeout.onTimeout [as _onTimeout] (node:internal/deps/undici/undici:8047:17)
      at listOnTimeout (node:internal/timers:573:17)
      at process.processTimers (node:internal/timers:514:7) {
    code: 'UND_ERR_HEADERS_TIMEOUT'
  }
}
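Note that in the trace above the timeout is not on the top-level error: the TypeError only says "fetch failed", and the UND_ERR_HEADERS_TIMEOUT code sits on the nested cause raised by Node's built-in undici fetch. A minimal sketch of how that specific failure could be told apart from other fetch errors, assuming the error shape shown in the trace (isHeadersTimeout is a hypothetical helper, not part of LangChain or undici):

```typescript
// Hypothetical helper: walks the `cause` chain of an error and reports
// whether any link carries undici's headers-timeout code from the trace.
function isHeadersTimeout(err: unknown): boolean {
  let current: unknown = err;
  // Bound the walk so a cyclic `cause` chain cannot loop forever.
  for (let depth = 0; depth < 10 && current != null; depth++) {
    if (typeof current !== "object") return false;
    const { code, cause } = current as { code?: unknown; cause?: unknown };
    if (code === "UND_ERR_HEADERS_TIMEOUT") return true;
    current = cause;
  }
  return false;
}
```

Wrapping the load() call in a try/catch and passing the caught error to this helper would confirm whether a given run failed on the headers timeout or on something else.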
The code causing the error:
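import { UnstructuredDirectoryLoader } from "langchain/document_loaders/fs/unstructured";

const options = {
  apiKey: process.env.UNSTRUCTURED_API_KEY,
  strategy: "hi_res",
  timeout: 10000, // tried various values from 10000 to 10000000
};
// filePath is the directory containing the PDF (defined elsewhere in ingest.ts)
const unstructuredLoader = new UnstructuredDirectoryLoader(filePath, options);
const rawDocs = await unstructuredLoader.load();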