Branje izbrane vsebine spletne strani z uporabo Python Web Scraping

Branje izbrane vsebine spletne strani z uporabo Python Web Scraping
Predpogoj: Prenos datotek v Pythonu Spletno strganje z BeautifulSoup Vsi vemo, da je Python zelo enostaven programski jezik, toda kul je veliko število odprtokodnih knjižnic, napisanih zanj. Zahteve so ena najpogosteje uporabljenih knjižnic. Omogoča nam, da odpremo katero koli spletno stran HTTP/HTTPS in nam omogoči, da počnemo vse, kar običajno počnemo v spletu, lahko pa tudi shrani seje, npr. piškotek. Kot vsi vemo, je spletna stran le delček kode HTML, ki jo spletni strežnik pošlje našemu brskalniku, ki se pretvori v čudovito stran. Zdaj potrebujemo mehanizem za pridobitev izvorne kode HTML, tj. iskanje določenih oznak s paketom, imenovanim BeautifulSoup. Namestitev:
pip3 install requests  
pip3 install beautifulsoup4  

Vzemimo primer z branjem spletnega mesta z novicami Hindustan Times

Kodo lahko razdelimo na tri dele.
  • Zahtevanje spletne strani
  • Pregledovanje oznak
  • Natisnite ustrezno vsebino
Koraki:
    Zahtevanje spletne strani: Najprej vidimo besedilo novice z desnim klikom, da vidimo izvorno kodo Branje izbrane vsebine spletne strani z uporabo Python Web Scraping Pregledovanje oznak: Ugotoviti moramo, v katerem telesu izvorne kode je razdelek z novicami, ki ga želimo zavreči. To je pod uli.e neurejen seznam 'searchNews', ki vsebuje razdelek z novicami. Branje izbrane vsebine spletne strani z uporabo Python Web Scraping Opomba Besedilo novice je prisotno v besedilnem delu oznake sidra. Natančno opazovanje nam da idejo, da so vse novice v oznakah seznama li neurejene oznake. Branje izbrane vsebine spletne strani z uporabo Python Web Scraping Natisnite ustrezno vsebino: The content is printed with the help of code given below. Python
       import   requests   from   bs4   import   BeautifulSoup   def   news  ():   # the target we want to open    url  =  'http://www.hindustantimes.com/top-news'   #open with GET method   resp  =  requests  .  get  (  url  )   #http_respone 200 means OK status   if   resp  .  status_code  ==  200  :   print  (  'Successfully opened the web page'  )   print  (  'The news are as follow :-  n  '  )   # we need a parserPython built-in HTML parser is enough .   soup  =  BeautifulSoup  (  resp  .  text    'html.parser'  )   # l is the list which contains all the text i.e news    l  =  soup  .  find  (  'ul'  {  'class'  :  'searchNews'  })   #now we want to print only the text part of the anchor.   #find all the elements of a i.e anchor   for   i   in   l  .  findAll  (  'a'  ):   print  (  i  .  text  )   else  :   print  (  'Error'  )   news  ()   

    Izhod

    Successfully opened the web page The news are as follow :- Govt extends toll tax suspension use of old notes for utility bills extended till Nov 14 Modi Abe seal historic civil nuclear pact: What it means for India Rahul queues up at bank says it is to show solidarity with common man IS kills over 60 in Mosul victims dressed in orange and marked 'traitors' Rock On 2 review: Farhan Akhtar Arjun Rampal's band hasn't lost its magic Rumours of shortage in salt supply spark panic among consumers in UP Worrying truth: India ranks first in pneumonia diarrhoea deaths among kids To hell with romance here's why being single is the coolest way to be India vs England: Cheteshwar Pujara Murali Vijay make merry with tons in Rajkot Akshay-Bhumi SRK-Alia Ajay-Parineeti: Age difference doesn't matter anymore Currency ban: Only one-third have bank access; NE backward regions worst hit Nepal's central bank halts transactions with Rs 500 Rs 1000 Indian notes Political upheaval in Punjab after SC tells it to share Sutlej water Let's not kid ourselves with Trump what we have seen is what we will get Want to colour your hair? Try rose gold the hottest hair trend this winter  

Reference

Ustvari kviz