Python Basics 7

์จ๋ฐ 2023. 2. 26. 18:17

✍🏻 What I learned

 

I learned how to fetch data not only with selenium but also through an API, and I got to know more about working with DataFrames.

 

 

 

 


 

 

 

Introduction

 

This post is written using 7-Eleven convenience stores as the example. Please keep that in mind :)

 

 

 

 

 

webdriver Options

 

 

When opening a Chrome window with webdriver, you can have it run in the background (headless) instead of appearing on screen.

 

 

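from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without opening a visible window
driver = webdriver.Chrome(options=options)  # the options only take effect when passed to the driver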

 

With ChromeOptions set this way, the script runs fine even though no Chrome page is displayed.

 

 

 

Fetching nationwide 7-Eleven store information by selenium

 

Let's fetch the store information for 7-Eleven convenience stores across the country.

 

 

Clicking the 'Find a Store' button on the main page shows that the web page has to be accessed with the GET method.

 

 

 

To get from the main page to where the data is, the 'Find a Store' button has to be clicked, so we use selenium to automate the click.

 

 

from selenium import webdriver
from selenium.webdriver.common.by import By
import time 
from bs4 import BeautifulSoup as BS
import requests
import re
import pandas as pd


driver = webdriver.Chrome()

url = "https://www.7-eleven.co.kr/"
driver.get(url)  # open the main page with a GET request

# CSS selector of the 'Find a Store' button
target_store = "#header > div > div > div.head_util > a.util_store.store_open"
# Click the 'Find a Store' button automatically
driver.find_element(By.CSS_SELECTOR, target_store).click()
time.sleep(3)  # wait for the page to finish loading

 

That completes the steps described above. Next, we need to collect all the city names for each region.

 

 

 

For that, we use selenium again to get the information.

 

 

city_dict = {}  # stored as a dictionary

for x in range(2, 19):  # one iteration per city/province option
    driver.find_element(By.CSS_SELECTOR, f"#storeLaySido > option:nth-child({x})").click()
    time.sleep(2)

    # Get the city/province name
    city = driver.find_element(By.CSS_SELECTOR, f"#storeLaySido > option:nth-child({x})").text

    # key: city/province, value: list of districts (gu/gun)
    district_select = BS(driver.page_source).find("select", id="storeLayGu")
    city_dict[city] = [child.text for child in district_select][3::2]


# Check that everything was stored correctly
for key, value in city_dict.items():
    print(key, value)
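The `[3::2]` slice in the code above is easy to misread. As a small sketch, assuming the children of the district `<select>` alternate between whitespace strings and option texts and that the first option is a placeholder (the list below is made-up stand-in data, not the real page), plain list slicing behaves like this:

```python
# Hypothetical stand-in for the child texts of the district <select>:
# whitespace nodes alternate with option texts, and the first option
# is assumed to be a placeholder entry.
children = ["\n", "placeholder", "\n", "District A", "\n", "District B", "\n"]

# Start at index 3 (the first real option) and take every second item,
# which skips the whitespace nodes in between.
districts = children[3::2]
print(districts)  # ['District A', 'District B']
```

If the page layout changes, the starting index would probably need to change as well, so it is worth printing the result once to check it.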

 

 

Accessing elements with CSS selector syntax is convenient, so it is worth learning well.

 

 

 

 

Fetching nationwide 7-Eleven store information by API

 

This time, let's use the API that 7-Eleven provides.

 

 

from selenium import webdriver
from selenium.webdriver.common.by import By
import time 
from bs4 import BeautifulSoup as BS
import requests
import re
import pandas as pd


seven_url = "https://www.7-eleven.co.kr/util/storeLayerPop.asp"
payload = {"storeLaySido": "서울",    # city/province: Seoul
           "storeLayGu": "구로구",    # district: Guro-gu
           "hiddentext": "none"}
r = requests.post(seven_url, data=payload)

 

 

def api_seven(page):
    seven = BS(page)
    seven_total = []

    for temp in seven.find("div", class_="list_stroe").find_all("li"):
        seven_dict = {}

        # Services offered by the store (taken from the alt text of the icons)
        seven_dict['offeringService'] = [x['alt'] for x in temp.find_all("img")]

        # Store name (strip removes the surrounding whitespace)
        seven_dict['shopName'] = temp.find("span").text.strip()

        try:
            # Store address, with runs of whitespace collapsed into single spaces
            seven_dict['address'] = " ".join(temp.find_all("span")[-2].text.split())
        except IndexError:
            # Fewer than two <span> tags: give up on this page
            return []

        # If that span was empty, fall back to the third-from-last span
        if len(seven_dict['address']) < 2:
            seven_dict['address'] = " ".join(temp.find_all("span")[-3].text.split())

        # Latitude and longitude, parsed from the arguments inside the link's href
        _, lat, lon = re.findall(r"(?<=\().+(?=\))", temp.find('a')['href'])[0].split(",")
        seven_dict['longs'] = lon
        seven_dict['lat'] = lat

        # Add this store to seven_total
        seven_total.append(seven_dict)

    return pd.DataFrame(seven_total)

 

This function collects each store's services, name, address, latitude, and longitude in seven_total and turns the result into a DataFrame.
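The two string tricks inside api_seven can be tried on their own. The following is only a sketch: the href and the address below are invented sample values, and the real link format on the site may differ.

```python
import re

# Hypothetical href; the real link format on the site may differ.
href = "javascript:showMap('sample store',37.4954,126.8874);"

# (?<=\() and (?=\)) are lookarounds: the match is whatever sits between
# the first "(" and the last ")", without the parentheses themselves.
inside = re.findall(r"(?<=\().+(?=\))", href)[0]
_, lat, lon = inside.split(",")
print(lat, lon)  # 37.4954 126.8874

# " ".join(text.split()) collapses runs of whitespace into single spaces.
raw_address = "  Sample-ro 1,\n\t  Guro-gu   Seoul "
address = " ".join(raw_address.split())
print(address)  # Sample-ro 1, Guro-gu Seoul
```

Note that the three-way unpacking only works when the text inside the parentheses contains exactly two commas, so a store name with a comma in it would raise a ValueError.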

 

 

total = []

for key, value in city_dict.items():  # city/province (key), districts (value)
    payload['storeLaySido'] = key

    for x in value:
        payload['storeLayGu'] = x

        # Append the DataFrame built for this district to total
        total.append(api_seven(requests.post(seven_url, data=payload).text))

 

The nationwide 7-Eleven store information fetched through the API is now stored in the total list, one DataFrame per district.
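To see which requests the nested loop produces without actually calling the site, the payloads can be built first and inspected. This is a sketch under assumptions: the shortened city_dict uses romanized placeholder names (the real keys and values are the Korean names read from the page), and base_payload and payloads are illustrative names.

```python
# Hypothetical, shortened stand-in for city_dict (city/province -> districts).
city_dict = {"Seoul": ["Guro-gu", "Mapo-gu"], "Busan": ["Jung-gu"]}

base_payload = {"storeLaySido": "", "storeLayGu": "", "hiddentext": "none"}

payloads = []
for sido, gu_list in city_dict.items():
    for gu in gu_list:
        # Build a fresh dict per request instead of mutating a shared one.
        payloads.append({**base_payload, "storeLaySido": sido, "storeLayGu": gu})

print(len(payloads))  # 3
print(payloads[0])
```

Each entry could then be passed as the data argument of requests.post, one request per district.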

 

 

 

 

pandas concat

 

The concat() function provided by pandas combines several DataFrames into a new DataFrame (without modifying the originals).

 

 

total = pd.concat([x for x in total if isinstance(x, pd.DataFrame)])
total

 

If everything in total was combined correctly, the output is displayed as a single DataFrame.
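A small sketch with made-up data shows what the filter and concat() do. The empty list stands for the value api_seven returns when it gives up on a page; the names results, frames, and combined are illustrative.

```python
import pandas as pd

# Made-up results: two DataFrames plus an empty list, which is what
# api_seven returns when it gives up on a page.
results = [
    pd.DataFrame({"shopName": ["A", "B"]}),
    [],
    pd.DataFrame({"shopName": ["C"]}),
]

# Keep only the DataFrames, then stack them row-wise.
frames = [x for x in results if isinstance(x, pd.DataFrame)]
combined = pd.concat(frames)

print(combined["shopName"].tolist())  # ['A', 'B', 'C']
print(combined.index.tolist())        # [0, 1, 0]
print(len(results[0]))                # 2, the inputs are left unchanged
```

The repeated index values in the combined result are the reason the index gets reset afterwards.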

 

 

 

 

Resetting the index of a DataFrame

 

In Python, you can discard a DataFrame's existing index and replace it with a fresh default one (0, 1, 2, ...).

 

total.reset_index(drop=True, inplace=True)

 

With the pandas method reset_index, you can choose whether the old index is dropped instead of being kept as a column (drop) and whether the original DataFrame is modified (inplace).
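The effect of the two parameters can be checked on a tiny DataFrame. This is a minimal sketch with made-up data whose index repeats, as it does after concatenating several DataFrames.

```python
import pandas as pd

df = pd.DataFrame({"shopName": ["A", "B", "C"]}, index=[0, 1, 0])

# Without drop=True the old index is kept as a new column named "index".
kept = df.reset_index()
# With drop=True the old index is discarded.
dropped = df.reset_index(drop=True)

print(kept.columns.tolist())   # ['index', 'shopName']
print(dropped.index.tolist())  # [0, 1, 2]

# inplace=True modifies df itself and returns None.
result = df.reset_index(drop=True, inplace=True)
print(result, df.index.tolist())  # None [0, 1, 2]
```

When the index is only being renumbered right after concatenation, pd.concat(frames, ignore_index=True) gives the same result in one step.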

 

 

 

 

 

Removing duplicates from a DataFrame

 

In Python, rows of a DataFrame that have duplicated values in a specific column can be removed.

 

total.drop_duplicates('shopName')

 

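The pandas method drop_duplicates() returns a new DataFrame in which rows with a duplicated value in the given column are removed, keeping the first occurrence by default. The original DataFrame is left unchanged unless the result is assigned back.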
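A short sketch with made-up rows makes the default behavior visible: the first row for each shopName is kept and the original DataFrame is not touched.

```python
import pandas as pd

df = pd.DataFrame({
    "shopName": ["A", "B", "A"],
    "address": ["addr 1", "addr 2", "addr 3"],
})

# Rows are compared on shopName only; the first "A" row is kept.
deduped = df.drop_duplicates(subset="shopName")

print(deduped["address"].tolist())  # ['addr 1', 'addr 2']
print(len(df))                      # 3, df itself is unchanged
```

If two different stores could share a name, passing subset=["shopName", "address"] would be a more cautious choice, since only rows matching on both columns are then treated as duplicates.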

 

 

 

 

 

Finding unique values in a DataFrame

 

In Python, you can find just the unique values in a specific column of a DataFrame.

 

total['shopName'].unique()  # append .size (an attribute, no parentheses) to get the number of unique values

 

This is used to find out which distinct values exist, with the duplicates left out.
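As a quick sketch with made-up names, unique() returns the distinct values in order of first appearance, and the count can be read either from its size attribute or with nunique().

```python
import pandas as pd

names = pd.Series(["A", "B", "A", "C"], name="shopName")

unique_names = names.unique()  # distinct values, in order of first appearance
print(list(unique_names))      # ['A', 'B', 'C']
print(unique_names.size)       # 3 (size is an attribute, not a method)
print(names.nunique())         # 3
```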

 

 

 

 

Saving a DataFrame to Excel

 

In Python, a DataFrame can be saved as an Excel file.

 

total.to_excel("./세븐일레븐_전국_현황.xlsx")

 

The pandas method to_excel() writes the DataFrame to an Excel file. Writing .xlsx files requires an Excel engine such as openpyxl to be installed.

 

 

 

 

 
