I am extracting metadata from various websites with Cheerio. A selector such as
$('meta[property="article:published_time"]').attr('content')
works for most sites, but some pages do not define this meta property even though the publication date still appears elsewhere in the HTML.
For instance, this particular page has no explicit published_time meta property, yet the information is present within the file:
{"@context":"http://schema.org","@type":"NewsArticle","mainEntityOfPage":"https://news.yahoo.com/venezuela-deploys-soldiers-face-guyana-175722970.html","headline":"Venezuela Deploys Troops to East Caribbean Coast, Citing Guyana Threat","datePublished":"2023-12-28T19:53:10.000Z","dateModified":"2023-12-28T19:53:10.000Z","keywords":["Nicolas Maduro","Venezuela","Bloomberg","Guyana","Essequibo","Exxon Mobil Corp"],"description":"(Bloomberg) -- Venezuela has decided to deploy more than 5,000 soldiers on its eastern Caribbean coast after neighboring Guyana received a warship from the...","publisher":{"@type":"Organization","name":"Yahoo News","logo":{"@type":"ImageObject","url":"https://s.yimg.com/rz/p/yahoo_news_en-US_h_p_news_2.png","width":310,"height":50},"url":"https://news.yahoo.com/"},"author":{"@type":"Person","name":"Andreina Itriago Acosta","url":"","jobTitle":""},"creator":{"@type":"Person","name":"Andreina Itriago Acosta","url":"","jobTitle":""},"provider":{"@type":"Organization","name":"Bloomberg","url":"https://www.bloomberg.com/","logo":{"@type":"ImageObject","width":339,"height":100,"url":"https://s.yimg.com/cv/apiv2/hlogos/bloomberg_Light.png"}},"image":{"@type":"ImageObject","url":"https://s.yimg.com/ny/api/res/1.2/hs3Vjof2BqloeagLdsvfDw--/YXBwaWQ9aGlnaGxhbmRlcjt3PTEyMDA7aD0xMjAy/https://media.zenfs.com/en/bloomberg_politics_602/2db14d66c52bec70cb0ec6d0553968c6","width":1200,"height":1202},"thumbnailUrl":"https://s.yimg.com/ny/api/res/1.2/hs3Vjof2BqloeagLdsvfDw--/YXBwaWQ9aGlnaGxhbmRlcjt3PTEyMDA7aD0xMjAy/https://media.zenfs.com/en/bloomberg_politics_602/2db14d66c52bec70cb0ec6d0553968c6"}
Within this object, the "datePublished" field is available. How can I access this property using Cheerio?
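
One possible approach, as a minimal sketch: schema.org objects like the one above are usually embedded in a <script type="application/ld+json"> tag, so the idea is to select those tags with Cheerio, parse their contents with JSON.parse, and read datePublished. That the page uses this script type is an assumption, since the excerpt does not show the surrounding tag. The function names findDatePublished and getDatePublished are hypothetical helpers, not part of Cheerio.

```javascript
// Pure helper: given the raw text of each JSON-LD block,
// return the first string "datePublished" value found, or null.
function findDatePublished(jsonLdTexts) {
  for (const text of jsonLdTexts) {
    let data;
    try {
      data = JSON.parse(text);
    } catch (err) {
      continue; // skip blocks that are not valid JSON
    }
    // A block may hold one object, an array of objects, or an "@graph" array.
    const queue = Array.isArray(data) ? [...data] : [data];
    while (queue.length > 0) {
      const node = queue.shift();
      if (node === null || typeof node !== 'object') continue;
      if (typeof node.datePublished === 'string') return node.datePublished;
      if (Array.isArray(node['@graph'])) queue.push(...node['@graph']);
    }
  }
  return null;
}

// Cheerio wrapper (assumes the data sits in <script type="application/ld+json">).
function getDatePublished(html) {
  const cheerio = require('cheerio'); // loaded lazily; the helper above needs no dependency
  const $ = cheerio.load(html);
  const texts = $('script[type="application/ld+json"]')
    .map((i, el) => $(el).html()) // raw contents of each script tag
    .get();
  return findDatePublished(texts);
}
```

Calling getDatePublished(html) with the fetched page source should then return the datePublished string if the object lives in such a script tag, and null otherwise, so it can serve as a fallback when the article:published_time meta property is missing.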