ABC Bootcamp: Day 15 [Web Data Collection and Crawling]

July 26, 2023 2 min read

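ABC Bootcamp [Web Data Collection and Crawling]

Today we will crawl a music chart and visualize the results.

Let's start by crawling the Melon chart Top 100.

Crawling the Melon Top 100 chart

1) Importing the required libraries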

import requests
from bs4 import BeautifulSoup

2) Melon chart URL

url = "https://www.melon.com/chart/index.htm"

Changing the headers

headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"}
res = requests.get(url, headers=headers)
res
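
To see what the header change actually does, here is a small sketch that compares the User-Agent that requests would send by default with the one set above. It only prepares the request and never sends it, so nothing hits the network. The shortened User-Agent string is just for the example.

```python
import requests

# The User-Agent that requests sends when no headers are passed,
# e.g. "python-requests/<version>".
default_ua = requests.utils.default_headers()["User-Agent"]

# Build (but do not send) the same GET request to inspect its headers.
url = "https://www.melon.com/chart/index.htm"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"}  # shortened for the example
prepared = requests.Request("GET", url, headers=headers).prepare()

print("default:", default_ua)
print("sent   :", prepared.headers["User-Agent"])
```

Some sites respond differently to the default value, which is presumably why a browser-like string is used here.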

Using a parser

html = res.text
print(repr(html[:100]))  # peek at the first 100 characters of the response
soup = BeautifulSoup(html, "html.parser")

Output:

'<!DOCTYPE html>\r\n<html lang="ko">\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n<head>\r\n\t\t\r\n\t\r\n\t\r\n\t\r\n\t\r\n\t\r\n\t\r\n\t\r\n\t\r\n\r\n\t'

3) Selecting by attribute name

wrap_tag_list = soup.select("tr[data-song-no]")   # 3. select <tr> tags that have a data-song-no attribute
len(wrap_tag_list)
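
The selector above can be tried offline on a tiny piece of HTML. The markup below is hypothetical and heavily simplified (the real chart page is far larger), but it shows how "tr[data-song-no]" keeps only the rows that carry the attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for illustration only.
sample_html = """
<table>
  <tr><th>Rank</th><th>Title</th></tr>
  <tr data-song-no="111"><td>1</td><td>Song A</td></tr>
  <tr data-song-no="222"><td>2</td><td>Song B</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Only <tr> tags that have a data-song-no attribute match,
# so the header row is skipped.
rows = sample_soup.select("tr[data-song-no]")
song_numbers = [row["data-song-no"] for row in rows]
print(song_numbers)  # ['111', '222']
```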

4) Extracting the song list / renaming and adding columns

song_list = []

for wrap_tag in wrap_tag_list:
    song_no = wrap_tag["data-song-no"]

    # wrap_tag.select_one("[href^=playSong]")  # means href.startswith("playSong")
    # wrap_tag.select_one("[href$=playSong]")  # means href.endswith("playSong")
    song_title = wrap_tag.select_one("[href*=playSong]").text  # means "playSong" in href
    artist_name = wrap_tag.select_one("[href*=goArtistDetail]").text
    album_name = wrap_tag.select_one("[href*=goAlbumDetail]")["title"]

    cover_image_url = wrap_tag.select_one("[onerror*=defaultAlbumImg]")["src"]

    song_list.append({
        "곡일련번호": song_no,                # song serial number
        "앨범명": album_name,                 # album name
        "곡명": song_title,                   # song title
        "가수명": artist_name,                # artist name
        "커버이미지_주소": cover_image_url,   # cover image URL
    })

    # print(song_no, album_name, song_title, artist_name, cover_image_url)

len(song_list)   # 100
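
The three attribute operators from the comments above (^=, $=, *=) are easier to compare side by side. This is a minimal sketch on a hypothetical row: the href values only imitate the javascript: style links, and the names are made up.

```python
from bs4 import BeautifulSoup

# Hypothetical row for illustration; not copied from the real page.
row_html = """
<tr data-song-no="111">
  <td><a href="javascript:melon.play.playSong('1000',111);">Song A</a></td>
  <td><a href="javascript:melon.link.goArtistDetail('222');">Artist A</a></td>
  <td><a href="javascript:melon.link.goAlbumDetail('333');" title="Album A">cover</a></td>
</tr>
"""
row = BeautifulSoup(row_html, "html.parser")

contains = row.select_one("[href*=playSong]")        # "playSong" in href
starts = row.select_one("[href^=playSong]")          # href.startswith("playSong")
ends = row.select_one("[href$=playSong]")            # href.endswith("playSong")
starts_js = row.select_one('[href^="javascript:"]')  # href.startswith("javascript:")

print(contains.text)   # Song A
print(starts, ends)    # None None
print(starts_js.text)  # Song A
```

Note that select_one returns None when nothing matches, so calling .text on a failed match raises an AttributeError. With hrefs like these, only the *= form finds the link.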

5) Checking the DataFrame

import pandas as pd

df = pd.DataFrame(song_list).set_index("곡일련번호")
df["순위"] = range(1, df.shape[0]+1)  # 순위 = rank, following the chart order
print(df.shape)
df.head()

6) Exploring the like counts

url = "https://www.melon.com/commonlike/getSongLike.json"   # endpoint for like counts
params = {
    "contsIds": "36599950,36617841",
}

res = requests.get(url, params=params, headers=headers)
res
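
To check how the params dict ends up in the URL, the request can be prepared without being sent. This sketch uses the same url and params as above and decodes the query string again with the standard library.

```python
import requests
from urllib.parse import urlsplit, parse_qs

url = "https://www.melon.com/commonlike/getSongLike.json"
params = {"contsIds": "36599950,36617841"}

# Prepare the request without sending it, then inspect the final URL.
prepared = requests.Request("GET", url, params=params).prepare()
print(prepared.url)

# Decode the query string back into a dict to confirm the round trip.
query = parse_qs(urlsplit(prepared.url).query)
print(query)
```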

likes_dict = {}

for song in res.json()['contsLike']:
    print(song["CONTSID"], song["SUMMCNT"])
    likes_dict[song["CONTSID"]] = song["SUMMCNT"]

likes_dict

dict comprehension

# dict comprehension
{
    song["CONTSID"]: song["SUMMCNT"]
    for song in res.json()['contsLike']
}
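
The loop and the dict comprehension above build the same dictionary. Here is an offline sketch that confirms it, using a hypothetical payload shaped like the fields used above (the like counts are made up).

```python
import json

# Hypothetical payload; only the field names follow the code above.
sample_text = (
    '{"contsLike": ['
    '{"CONTSID": 36599950, "SUMMCNT": 120}, '
    '{"CONTSID": 36617841, "SUMMCNT": 95}]}'
)
payload = json.loads(sample_text)

# Loop version
likes_loop = {}
for song in payload["contsLike"]:
    likes_loop[song["CONTSID"]] = song["SUMMCNT"]

# Equivalent dict comprehension
likes_comp = {song["CONTSID"]: song["SUMMCNT"] for song in payload["contsLike"]}

print(likes_comp)
print(likes_loop == likes_comp)  # True
```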

list comprehension

numbers = [1,2,3,4,5]

[number ** 2 for number in numbers]  # list comprehension -> [1, 4, 9, 16, 25]
[number ** 2
 for number in numbers
 if number % 2 == 0]  # list comprehension -> [4, 16]

set comprehension

# { number ** 2 for number in numbers }   # set
{ number % 3 for number in numbers } # set comprehension -> {0, 1, 2}
{ number: number % 3 for number in numbers }  # dict comprehension -> {1: 1, 2: 2, 3: 0, 4: 1, 5: 2}
(number ** 2 for number in numbers)   # there is no tuple comprehension; this is a generator expression
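
Since parentheses give a generator rather than a tuple, here is a short sketch of how the two differ. To get a real tuple, pass the generator to tuple().

```python
numbers = [1, 2, 3, 4, 5]

squares_gen = (number ** 2 for number in numbers)         # generator expression, evaluated lazily
squares_tuple = tuple(number ** 2 for number in numbers)  # a real tuple

print(type(squares_gen).__name__)  # generator
print(squares_tuple)               # (1, 4, 9, 16, 25)
print(sum(squares_gen))            # 55
print(sum(squares_gen))            # 0, the generator is already exhausted
```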

7) Requesting the like counts

conts_ids = ",".join(df.index)  # the index holds the song numbers as strings

url = "https://www.melon.com/commonlike/getSongLike.json"   # endpoint for like counts
params = {
    "contsIds": conts_ids,
}

res = requests.get(url, params=params, headers=headers)
res

# dict comprehension
likes_dict = {
    str(song["CONTSID"]): int(song["SUMMCNT"])
    for song in res.json()['contsLike']
}
len(likes_dict)

df["좋아요수"] = pd.Series(likes_dict)  # 좋아요수 = like count; matched to df by index label
print(df.shape)
df.head()
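
The like counts are attached by index label, not by position. A minimal sketch with a hypothetical three-row frame shows this: the dict is in a different order and one song is missing, yet each count lands on the right row.

```python
import pandas as pd

# Hypothetical miniature of df: the index holds song numbers as strings.
mini_df = pd.DataFrame(
    {"곡명": ["Song A", "Song B", "Song C"]},
    index=pd.Index(["111", "222", "333"], name="곡일련번호"),
)

# Hypothetical like counts, out of order and with one song missing.
mini_likes = {"333": 30, "111": 10}

# A Series built from the dict is matched to the frame by index label.
mini_df["좋아요수"] = pd.Series(mini_likes)
print(mini_df)
```

The row without a match gets NaN. The keys also have to be the same type as the index, which is presumably why the code above converts CONTSID with str().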

Song count Series by artist

song_count_series = df.groupby("가수명").size().sort_values(ascending=False)
song_count_series

Setting the system default font (for Korean text)

import matplotlib.pyplot as plt
import platform

if platform.system() == "Darwin":
    plt.rc("font", family="AppleGothic")  # default Korean-capable font on macOS

elif platform.system() == "Windows":
    plt.rc("font", family="Malgun Gothic")  # default Korean-capable font on Windows

8) Visualizing the song counts as a bar chart

song_count_series.plot(kind="bar", figsize=(20, 5))

9) Folding artists with only one song into Others

mask = song_count_series > 1  # boolean mask
chart_series = song_count_series[mask]
chart_series

mask = song_count_series == 1
chart_series["Others"] = song_count_series[mask].sum()   # every value is 1, so sum() and count() give the same total
chart_series
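
The same masking step can be checked on a small hypothetical Series. The artist names and counts below are made up. The point is that the total number of songs stays the same after the single-song artists are folded into one entry.

```python
import pandas as pd

# Hypothetical song counts per artist.
counts = pd.Series({"Artist A": 3, "Artist B": 2, "Artist C": 1, "Artist D": 1, "Artist E": 1})

multi = counts[counts > 1]                   # artists with more than one song
multi["Others"] = counts[counts == 1].sum()  # one combined entry for the rest

print(multi)
print(multi.sum() == counts.sum())  # True
```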

10) Visualizing as a pie chart

chart_series.plot(kind="pie")
