DataWeave For Web Scraping

In an earlier post, I described how you could use DataWeave by MuleSoft to replace a reserved keyword in Salesforce.

Today, I’ll illustrate how you can use a DataWeave script to scrape a web page for data that you can upload to your Salesforce org.

Sometimes, a service or site does not provide an adequate API to extract data from it. In certain circumstances, running a simple data scrape is a good solution, especially if you are retrieving a list of records.

First, go to your website and run your query. In this instance, let’s assume we have a list of news articles. Copy the source of the results page and save it as a static string, something like this:

String bodyText = 
'  <main>'+
'    <div class="search-results main-search">'+
'      <article>'+
'        <header>'+
'          <a href=""><h2>A Headline</h2></a>'+
'          <p class="date">December 13, 2023</p>'+
'        </header>'+
'        <p>Details of Article</p>'+
'      </article>'+
'      <article>'+
'        <header>'+
'          <a href=""><h2>Another Headline</h2></a>'+
'          <p class="date">January 12, 2024</p>'+
'        </header>'+
'        <p>Details of Another Article</p>'+
'      </article>'+
'    </div>'+
'  </main>';

This means you can now test without performing a callout.

Now isolate the part of the page you want:

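// Find the opening tag and the end of the closing tag ('</main>' is 7 characters long).
Integer start = bodyText.indexOf('<main>');
Integer finish = bodyText.indexOf('</main>') + 7;

// mid() takes a start index and a length.
String partial = bodyText.mid(start, finish - start);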

Now, you have a chunk of text that you can pass into your function.
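
The snippet above assumes both tags are present. As a minimal sketch of a more defensive version, the class below returns null when either marker is missing, since indexOf returns -1 in that case. The class and method names (HtmlSlicer, extractBetween) are hypothetical and only here for illustration.

```apex
public class HtmlSlicer {
    // Hypothetical helper: returns the text from openTag through closeTag,
    // or null when either marker cannot be found.
    public static String extractBetween(String html, String openTag, String closeTag) {
        if (html == null) {
            return null;
        }
        Integer start = html.indexOf(openTag);
        Integer finish = html.indexOf(closeTag);
        if (start == -1 || finish == -1 || finish < start) {
            // indexOf returns -1 when the marker is absent.
            return null;
        }
        // substring takes a start index and an exclusive end index.
        return html.substring(start, finish + closeTag.length());
    }
}
```

You could then call HtmlSlicer.extractBetween(bodyText, '<main>', '</main>') and skip the DataWeave step when the result is null.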

From here, you can design your DataWeave script to extract the data. In this case, the script looks like this:

%dw 2.0
input incomingHtml application/xml
output application/json duplicateKeyAsArray=true, writeAttributes=true

fun replaceStr(val) = (val replace ('\t') with('')) replace '\n' with ('')
---
incomingHtml.main.div.*article map (record) -> {
   title: replaceStr(record.header.a.h2 as String),
   href: record.header.a.@href,
   date: replaceStr(record.header.p),
   body: replaceStr(record.p)
}

This pulls each article out and maps the contents into a title, href, date, and body.

The replaceStr function strips tab and newline characters from each value.

The href attribute is interesting because it demonstrates that we can pull attributes from the HTML string. To do this, you need to use the writeAttributes flag, which preserves attributes as they are fed into the script.

From here, you can invoke the script like this:

DataWeave.Script script = new DataWeaveScriptResource.parseSearchHtml();
DataWeave.Result result = script.execute(new Map<String, Object>{ 'incomingHtml' => partial });
String output = result.getValueAsString();

This returns a JSON array with one object per article, each carrying the title, href, date, and body keys defined in the script.


Now, you can render this any way you like!
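
If you want to work with the result in Apex rather than as a raw string, one option is to deserialize it. The sketch below assumes the script output is a JSON array of objects, as the mapping in the script suggests. The ArticleParser class and its toRows method are hypothetical names.

```apex
public class ArticleParser {
    // Hypothetical helper: turns the script's JSON output into a list of maps,
    // one per article, keyed by title, href, date, and body.
    public static List<Map<String, Object>> toRows(String jsonOutput) {
        List<Map<String, Object>> rows = new List<Map<String, Object>>();
        // deserializeUntyped returns a List<Object> for a JSON array.
        Object parsed = JSON.deserializeUntyped(jsonOutput);
        if (parsed instanceof List<Object>) {
            for (Object item : (List<Object>) parsed) {
                rows.add((Map<String, Object>) item);
            }
        }
        return rows;
    }
}
```

For example, you could loop over ArticleParser.toRows(output) and read row.get('title') for each entry. If the script returns something other than an array, this sketch simply returns an empty list.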

The only thing you need to do now is replace your static text with the body of the HTTP response:

HttpResponse res = http.send(req);
String bodyText = res.getBody();

You are done! Don’t forget to add some error checking here, of course. From here, you can upload this data into your Salesforce org.
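
As a rough sketch of what that error checking might look like, the class below wraps the callout and throws when the status code is not 200. PageFetcher, FetchException, and the pageUrl parameter are all hypothetical. The target host also needs to be allowed for callouts in your org, for example through a Remote Site Setting or a Named Credential.

```apex
public class PageFetcher {
    // Custom exception for non-200 responses (hypothetical name).
    public class FetchException extends Exception {}

    // pageUrl is a placeholder for the results page you want to scrape.
    public static String fetchBody(String pageUrl) {
        HttpRequest req = new HttpRequest();
        req.setEndpoint(pageUrl);
        req.setMethod('GET');
        req.setTimeout(20000); // timeout in milliseconds

        // send() can throw a CalloutException, for example on a timeout.
        HttpResponse res = new Http().send(req);
        if (res.getStatusCode() != 200) {
            throw new FetchException(
                'Unexpected status ' + res.getStatusCode() + ' from ' + pageUrl
            );
        }
        return res.getBody();
    }
}
```

The returned string can then take the place of the static bodyText used for testing.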

Further Customizing Your Salesforce Capabilities

