Brunson: DOMDocument encoding issue when scraping a page

I'm scraping a Google Play page to retrieve the app name.
The problem is that for some applications the name comes back with
unreadable characters.
$div2 = $div->getElementsByTagName("div");
if ($div2->length)
{
    $gpAppName = DOMinnerHTML($div2->item(0));
    $counter++;
    if (checkIfMaxedOutAndReturn($counter)) {
        buildObjAndReturn($gpIcon, $gpBg, $gpAppName, $gpBtnLink);
    }
}

function DOMinnerHTML($element)
{
    $innerHTML = "";
    $children = $element->childNodes;
    foreach ($children as $child)
    {
        $tmp_dom = new DOMDocument('1.0', 'UTF-8');
        $tmp_dom->appendChild($tmp_dom->importNode($child, true));
        $innerHTML .= trim($tmp_dom->saveHTML());
    }
    return $innerHTML;
}
When scraping the page
https://play.google.com/store/apps/details?id=com.vascogames.TransportTruck,
the code above should return the app name, "Truck Driver – Cargo delivery",
but it returns "Truck Driver â Cargo delivery" instead.
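
One likely cause, assuming the page is loaded with DOMDocument::loadHTML() and no charset hint (the loading code is not shown above), is that the parser reads the UTF-8 bytes of the dash as a single-byte encoding. The sketch below is only an illustration under that assumption: the $html string is a hypothetical stand-in for the fetched page, and the variable names are not from the original code.

```php
<?php
// Hypothetical stand-in for the fetched page: a UTF-8 fragment whose
// inner <div> holds the app name with an en dash (bytes E2 80 93).
$html = "<div class=\"title\"><div>Truck Driver \xE2\x80\x93 Cargo delivery</div></div>";

// Collect parser warnings instead of printing them.
libxml_use_internal_errors(true);

// Without a charset hint, many libxml2 builds decode the input as a
// single-byte encoding, which splits the dash into several characters.
$naive = new DOMDocument();
$naive->loadHTML($html);
$garbled = $naive->getElementsByTagName('div')->item(1)->textContent;

// Prepending an XML encoding declaration tells the parser that the
// input is UTF-8.
$fixed = new DOMDocument();
$fixed->loadHTML('<?xml encoding="UTF-8">' . $html);
$name = $fixed->getElementsByTagName('div')->item(1)->textContent;

echo $garbled, PHP_EOL;
echo $name, PHP_EOL;
```

If only the text of the name is needed, reading textContent as above also avoids the round trip through saveHTML() in DOMinnerHTML(). Whether this matches the real page depends on how the HTML is fetched and loaded, so treat it as a starting point for debugging.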

Brunson

Tuesday, 13 August 2013
