An algorithm designed to identify matching song lyrics based on a keyword or its fragments

Question

I am working with a large text file of more than 852,000 lines. It contains song verses, each preceded by a number on its own line, such as 1., 134-20., or 1231. (the number always ends with a period). A verse may have four or more lines. Some lines also contain variants, which I want to ignore for now.

This is the code I've been struggling with; it has not produced satisfactory results so far:

$.ajax({url: "LD.txt", dataType: 'text', success: function (data) {
    //var lines = data.match(/(.*)\r\n(^[A-Z].*)+/mg);
    var lines = data.match(/(.*)(^[A-Z].*)+/mg);
    for (var i = 0; i < 50 /*lines.length*/; i++) {
        var line = lines[i].replace("\r\n", "");
        console.log(i + " " + line);
    }
}});

Here is an excerpt from the UTF-8 text file:

/* 1970  #1.#  PAR DZIESMĀM UN DZIEDAŠANU
#1. Dziesmas un dziedašana vispāriga tautas manta un cilvēka mūža pavadoņi.
1.Dziesmas visai Latvijai kopeja manta. */

15.
Dziesmiņ' mana, kā dziedama,
Ne ta mana pamanita;
Vecā māte pamācija,
Aizkrāsnē tupedama.
#279a.

16.
Māci, māte, man' dziedāt,
Mā...

The JavaScript solution I'm aiming for should search the text for words typed into an input field. For example, searching for the exact word dziedama should display the verse number (which may be several lines above the match) followed by the verse, with the matched word in bold:

15. Dziesmiņ' mana, kā <b>dziedama</b>, Ne ta mana pamanita; Vecā māte pamācija, Aizkrāsnē tupedama.

If the query ends with an asterisk, as in dzie*, every word beginning with that fragment should match, and the whole matched word should be shown in bold:

15. <b>Dziesmiņ'</b> mana, kā <b>dziedama</b>, Ne ta mana pamanita; Vecā māte pamācija, Aizkrāsnē tupedama.
16. Māci, māte, man' <b>dziedāt</b>, Māc' ar vienu Dieva <b>dziesmu</b>, Ko <b>dziedās</b> dvēselite, Pie Dieviņa aizgājuse.
...

The search should also accept an asterisk at the beginning, as in *esmu, which should match words such as dziesmu, iesmu, Dievadziesmu, etc.; the asterisk stands for any number of characters.

If the query contains a question mark, as in dzied?, the search should return verses containing words such as dziedu, dziedi, etc.; the question mark stands for exactly one character.

If the query is enclosed in double quotes, as in "vienu Dieva", the search should match that exact sequence of words in a verse.

The search should handle text rich in diacritics and also offer an option to match with the diacritics normalized away.

Thank you for your assistance!

javascript regex full-text-search wildcard regex-lookarounds

Answer 1

Alright, let's look at the regex needed to match an entire verse that starts with a number on its own line and contains the word xxxxx:

^[0-9]+\.$(?:.(?!^[0-9]+\.$))+\b(xxxxx)\b.*?(?=^[0-9]+\.$)

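with flags gmsu (g: find every match; m: ^ and $ match at line boundaries; s: . also matches line breaks; u: Unicode mode)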

Breaking it down:

^[0-9]+\.$ matches a line consisting only of a number followed by a period
(?:.(?!^[0-9]+\.$))+ matches one or more characters, none of which is immediately followed by another verse-number line
\b(xxxxx)\b ensures xxxxx is matched as a whole word
.*?(?=^[0-9]+\.$) grabs the shortest string up to the next verse-number line
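
As a rough illustration of how this first pattern behaves, here is a small sketch that runs it against part of the excerpt from the question, with dziedama substituted for xxxxx. The sample string and the variable names are mine, and LF line endings are assumed.

```javascript
// Sample text copied from the excerpt in the question (LF line endings assumed).
const text = [
  "15.",
  "Dziesmiņ' mana, kā dziedama,",
  "Ne ta mana pamanita;",
  "Vecā māte pamācija,",
  "Aizkrāsnē tupedama.",
  "#279a.",
  "",
  "16.",
  "Māci, māte, man' dziedāt,",
  "",
].join("\n");

// The pattern from the answer with "dziedama" substituted for xxxxx.
const verseWithWord =
  /^[0-9]+\.$(?:.(?!^[0-9]+\.$))+\b(dziedama)\b.*?(?=^[0-9]+\.$)/gmsu;

// matchAll needs the g flag; each match is one whole verse.
const matches = [...text.matchAll(verseWithWord)];
for (const m of matches) {
  console.log(m[0].trim()); // the verse, from its number up to the next number
  console.log(m[1]);        // the captured search word
}
```

Only verse 15 is reported here. The match runs from the line 15. up to, but not including, the line 16.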

However, \b is a problem here. In JavaScript it is defined in terms of the ASCII-only \w, even with the u flag, so letters with diacritics such as ā are treated as non-word characters.

According to this source, the Unicode-aware equivalents are [^\p{L}\p{N}\p{M}\p{Pc}] in place of \W and [\p{L}\p{N}\p{M}\p{Pc}] in place of \w.

Using these Unicode patterns in our look-arounds instead of \b, the updated regex would be:

^[0-9]+\.$(?:.(?!^[0-9]+\.$))+(?<=^|[^\p{L}\p{N}\p{M}\p{Pc}])(xxxxx)(?=$|[^\p{L}\p{N}\p{M}\p{Pc}]).*?(?=^[0-9]+\.$)

with flags gmsu
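
To make the difference concrete, the following sketch compares \b with the Unicode look-arounds on a single line from the excerpt. The helper name wholeWord is hypothetical, and the fragment passed to it is assumed to contain no regex syntax characters.

```javascript
// Unicode-aware "non-word" class named in the answer.
const NON_WORD = "[^\\p{L}\\p{N}\\p{M}\\p{Pc}]";

// Hypothetical helper: wraps a plain fragment in the look-arounds from the answer.
function wholeWord(fragment) {
  return new RegExp(`(?<=^|${NON_WORD})${fragment}(?=$|${NON_WORD})`, "mu");
}

// "Māci, māte, man' dziedāt," with ā written as the precomposed code point U+0101.
const line = "M\u0101ci, m\u0101te, man' dzied\u0101t,";

// \b sees a boundary between "d" and "ā", so "dzied" wrongly counts as a whole word.
const asciiBoundary = /\bdzied\b/u.test(line);

// The look-arounds treat "ā" as a letter, so "dzied" alone is rejected...
const unicodeBoundary = wholeWord("dzied").test(line);

// ...while the complete word is still found.
const fullWord = wholeWord("dzied\u0101t").test(line);

console.log(asciiBoundary, unicodeBoundary, fullWord);
```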

To support the wildcard characters * and ?, preprocess the user input as follows:

Take the user input
Escape all regex-special characters with a backslash (\)
Replace each escaped \? with [\p{L}\p{N}\p{M}\p{Pc}] (exactly one word character)
Replace each escaped \* with [\p{L}\p{N}\p{M}\p{Pc}]+ (one or more word characters)
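
The preprocessing steps could be sketched roughly like this. The function name wildcardToPattern is hypothetical, and the set of escaped characters is the usual list of regex syntax characters rather than anything specified in the answer.

```javascript
// Unicode-aware "word character" class from the answer.
const WORD_CHAR = "[\\p{L}\\p{N}\\p{M}\\p{Pc}]";

// Hypothetical helper that follows the listed steps.
function wildcardToPattern(input) {
  return input
    // Escape every regex syntax character so it matches literally.
    .replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
    // An escaped "?" becomes exactly one word character.
    .replace(/\\\?/g, () => WORD_CHAR)
    // An escaped "*" becomes one or more word characters.
    .replace(/\\\*/g, () => WORD_CHAR + "+");
}

const pattern = wildcardToPattern("dzie*");
const re = new RegExp(`^${pattern}$`, "u");

console.log(pattern);
console.log(re.test("dziesmu"), re.test("dzie"));
```

Because * is turned into a + quantifier, dzie* matches dziesmu but not the bare fragment dzie. If zero extra characters should also be accepted, the + would have to become a *.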

Substitute this adjusted input for xxxxx in the following modified regex:

^[0-9]+\.$(?:.(?!^[0-9]+\.$))+(?<=^|[^\p{L}\p{N}\p{M}\p{Pc}])xxxxx(?=$|[^\p{L}\p{N}\p{M}\p{Pc}]).*?(?=^[0-9]+\.$)

with flags gmsu
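
Putting it together, a minimal sketch of a search function built on this pattern might look as follows. The name findVerses, the non-capturing group around the fragment, the flattening of each verse onto one line, and the skipping of # cross-reference lines are all my assumptions, chosen to resemble the output format requested in the question.

```javascript
// Unicode-aware "non-word" class and the verse-number line, as in the answer.
const NON_WORD = "[^\\p{L}\\p{N}\\p{M}\\p{Pc}]";
const VERSE_NO = "^[0-9]+\\.$";

// Hypothetical helper. `fragment` is assumed to be already escaped and
// wildcard-expanded, so it can be spliced into the pattern as-is.
function findVerses(text, fragment) {
  const word = `(?<=^|${NON_WORD})(?:${fragment})(?=$|${NON_WORD})`;
  const verse = new RegExp(
    `${VERSE_NO}(?:.(?!${VERSE_NO}))+${word}.*?(?=${VERSE_NO})`,
    "gmsu"
  );
  const highlight = new RegExp(word, "gmu");

  return [...text.matchAll(verse)].map((m) =>
    m[0]
      .split(/\r?\n/)
      .map((line) => line.trim())
      // Assumption: blank lines and "#..." cross-reference lines are not shown.
      .filter((line) => line !== "" && !line.startsWith("#"))
      .join(" ")
      // Wrap every whole-word hit in <b>...</b>.
      .replace(highlight, "<b>$&</b>")
  );
}

// Sample text copied from the excerpt in the question (LF line endings assumed).
const sample = [
  "15.",
  "Dziesmiņ' mana, kā dziedama,",
  "Ne ta mana pamanita;",
  "Vecā māte pamācija,",
  "Aizkrāsnē tupedama.",
  "#279a.",
  "",
  "16.",
  "Māci, māte, man' dziedāt,",
  "",
].join("\n");

const results = findVerses(sample, "dziedama");
console.log(results);
```

Note that the pattern ends with a look-ahead for the next verse number, so the very last verse in a file is never reported unless something like a trailing number line is appended. The flags also do not include i, so matching is case-sensitive.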

To illustrate, the linked Regex Example applies the pattern to the word dziedās.

