Monday, April 2, 2018

Help needed: Program or configure a NEW web crawler

Hi guys,

This is an update to a thread I opened a bit over 3 years ago. The ae911truth.org website has a new design, and that has changed the way it displays the list of petition signers, which I wish to read out and download into a table format from time to time.

So I need someone familiar with some easy web-crawling and XML-reading techniques who can program or customize a crawler for me to read out data records from a website.

Here is the new site:
http://www.ae911truth.org/signatures/#/AE/

This page has almost 3000 links to personal profiles; each link has a local href like this:
http://www.ae911truth.org/signatures...eLafayetteCAUS

Previously, those links referred to .txt files which contained XML, and the job was to parse the XML. Now, the content seems to be hidden behind Cloudflare somehow.

These petition signature records are based on a database with field names contained in the XML files. Now, I struggle to see the markup text that my browser (Chrome on Win7) parses.

What I want to get, with the help of a script that I can run on my amateur Win7 notebook with free amateur tools, is a table in a form like CSV/spreadsheet similar to this:
Code:

url|first_name|middle_name|last_name|title|degree|city|state|country|occupation_status|tech_biography|statement_911|photo|license_info
xml/supporters/U/KenGorskiEl-PasoTXUS.xml.txt|Ken||Gorski||B Architecture Professional Degree, University of Kansas, 1972|El Paso|TX|US|Degreed + Licensed|I'm a licensed architect and AIA member.|I am supportive of the intent for a complete investigation of the 9/11. Questionable structural and architectural explanations have heretofore been provided to the public.||6477 TX

(The same info must go into the same column every time. I believe that all .txt files contain tags for every data item, so it would suffice to output just the data without headers, provided you still output an empty field, keeping its "|" delimiter, when a tag contains no CDATA.)
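
To make the target format concrete, here is a minimal Python sketch of the per-record step. It assumes each record can still be obtained as XML whose tag names match the column names in the header line above. The function name `record_to_row`, the lookup of tags anywhere below the root element, and the replacement of stray "|" characters with "/" are my own assumptions, not something the site confirms.

```python
import xml.etree.ElementTree as ET

# Column order copied from the header line of the sample table (minus "url").
FIELDS = [
    "first_name", "middle_name", "last_name", "title", "degree",
    "city", "state", "country", "occupation_status",
    "tech_biography", "statement_911", "photo", "license_info",
]


def record_to_row(url, xml_text):
    """Turn one signer record (XML text) into one pipe-delimited line."""
    root = ET.fromstring(xml_text)
    values = [url]
    for name in FIELDS:
        # A missing or empty tag becomes an empty field, so the column
        # count stays the same for every record. CDATA is read as plain text.
        text = root.findtext(f".//{name}", default="") or ""
        # Collapse line breaks and repeated whitespace to keep one record per line.
        text = " ".join(text.split())
        # Assumption: a literal "|" inside the data would shift the columns,
        # so it is swapped for "/" here.
        values.append(text.replace("|", "/"))
    return "|".join(values)
```

Calling `record_to_row(url, xml_text)` once per profile and writing each returned line to a file would give the table; every line then has the same 14 fields, whether or not a given tag was filled in.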

And the same then for
http://www.ae911truth.org/signatures/#/General/A/
http://www.ae911truth.org/signatures/#/General/B/
etc.
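
In case it helps, here is a small Python sketch that lists the index pages I mean. It assumes the "General" sections simply run from A to Z, which I have not checked. It also shows that everything after the "#" is a URL fragment: the browser never sends that part to the server, so a plain download of any of these addresses would request the same `/signatures/` page, and the per-letter lists are presumably filled in by a script afterwards.

```python
import string
from urllib.parse import urlsplit

BASE = "http://www.ae911truth.org/signatures/"


def index_urls():
    """Return the A/E index page plus one General page per letter (A-Z assumed)."""
    urls = [BASE + "#/AE/"]
    urls += [f"{BASE}#/General/{letter}/" for letter in string.ascii_uppercase]
    return urls


if __name__ == "__main__":
    for url in index_urls():
        parts = urlsplit(url)
        # parts.path is what the server sees; parts.fragment stays in the browser.
        print(parts.path, parts.fragment)
```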


Am I making sense?

HELP!

Thx ;)


via International Skeptics Forum https://ift.tt/2GtpzDE
