Currently, I am extracting key/value pairs from the body text of incoming emails.
Here is an example of an email body:
First Name: John
Last Name: Smith
Email : [email protected]
Comments = Just a test comment that
may span multiple lines.
I attempted to use a RegEx pattern ([\w\d\s]+)\s?[=|:]\s?(.+)
in multiline mode. While this works for most emails, it fails when there is a line break within the value. My knowledge of RegEx is limited, so I seek further guidance.
Another approach I have taken involves parsing each line individually to locate key/value pairs, merging lines into the previous value if no pair is found. This method is written in Scala.
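import scala.collection.mutable

val lines = text.split("\\r?\\n").toList
// index in linesBuffer of the most recent key/value pair; -1 until one is found
var lastLabelled: Int = -1
val linesBuffer = mutable.ListBuffer[(String, String)]()
// only parse lines until the first blank line
// null_? is a custom helper that checks for empty strings and nulls
// splitAt(delimiter) is a custom helper too (not the standard String.splitAt):
// it yields List((key, value)) for a labelled line and Nil otherwise
lines.takeWhile(!_.null_?).foreach { line =>
  line.splitAt(delimiter) match {
    // continuation line: append it to the most recent value, if there is one
    case Nil if line.nonEmpty && lastLabelled >= 0 =>
      val l = linesBuffer(lastLabelled)
      linesBuffer(lastLabelled) = (l._1, l._2 + "\n" + line)
    // labelled line: remember its index and store the pair
    case pair :: Nil =>
      lastLabelled = linesBuffer.length
      linesBuffer += pair
    case _ => // skip this line
  }
}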
I would like to use RegEx so that I can store the parser in the database and customize it at runtime, with a different parser for each sender.
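To make the idea concrete, here is a rough sketch of what I have in mind. Everything in it is illustrative: the `SenderParsers` object, the sender addresses, and both patterns are made up, and the `Map` only stands in for the database table that would hold one pattern string per sender.

```scala
import scala.util.matching.Regex

object SenderParsers {
  // Hypothetical stand-in for a database table: sender address -> pattern source.
  // Both entries are invented for illustration.
  val storedPatterns: Map[String, String] = Map(
    "forms@example.com"  -> """(?m)^([\w ]+?)[ \t]*[:=][ \t]*(.+)$""",
    "orders@example.org" -> """(?m)^([\w ]+?)[ \t]*->[ \t]*(.+)$"""
  )

  // Compile the stored pattern for a sender at runtime; None if nothing is stored.
  def parserFor(sender: String): Option[Regex] =
    storedPatterns.get(sender).map(_.r)

  // Apply the sender's pattern; group 1 is taken as the key, group 2 as the value.
  def parse(sender: String, body: String): List[(String, String)] =
    parserFor(sender).toList.flatMap { re =>
      re.findAllMatchIn(body).map(m => (m.group(1).trim, m.group(2).trim)).toList
    }
}
```

A call such as `SenderParsers.parse("forms@example.com", body)` would then pick the pattern by sender. A malformed pattern string throws when it is compiled, so a real version would probably want to validate patterns when they are saved.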
Is it possible to modify my RegEx so that it recognizes values containing newlines? Or should I abandon RegEx in favor of JavaScript? I already have a JavaScript parser: I can store the JS in the DB, and it does everything I want from the RegEx parser.
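For the first question, one direction that might work is a negative lookahead: the value takes the rest of its own line plus every following non-blank line that does not itself start with something like `key:` or `key=`. This is only a sketch under assumptions. The names `MultilineKeyValue`, `Pair`, and `parse` are made up, keys are assumed to consist of word characters and spaces, and a continuation line that happens to begin with `word:` would be misread as a new key.

```scala
object MultilineKeyValue {
  // Group 1: key at the start of a line, followed by ':' or '='.
  // Group 2: the rest of that line, plus any following non-blank lines
  // that do not themselves look like "key:" or "key=" (negative lookahead).
  val Pair =
    """(?m)^([\w ]+?)[ \t]*[:=][ \t]*(.*(?:\r?\n(?![\w ]+[ \t]*[:=]).+)*)""".r

  // Collect every (key, value) match, trimming surrounding whitespace.
  def parse(text: String): List[(String, String)] =
    Pair.findAllMatchIn(text).map(m => (m.group(1).trim, m.group(2).trim)).toList
}
```

Because continuation lines must be non-empty (`.+`), a value ends at the first blank line, which roughly mirrors the `takeWhile` in the Scala version. Text after the blank line is still scanned for further pairs, though, so the two approaches are not equivalent.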