Image credit: [**saberiii**](https://www.pixiv.net/member_illust.php?mode=medium&illust_id=39758588)

Douban Neighbor Analysis

Mon 19 Aug 2019 12:21:58 · 3057 words


Introduction


This is one of my Python crawler practice projects. It crawls and retrieves data about a target user's Douban friends ("neighbors" in Douban parlance) and generates a corresponding portrait of that user's circle.

Data Collection


Simulated user login

Take https://www.douban.com/people/zeqingg/rev_contacts as an example. This page lists all the friends of the target user; zeqingg is the target user's uid.

Note, however, that you must be logged in to a Douban account to access this page; otherwise the request is redirected to the login page and the crawl fails.

Log in to your Douban account in a browser (e.g. Google Chrome) and open the page. Use the Network tool in Developer Tools to capture the Cookie and User-Agent headers sent to the server (see the figure below). In the Python program, use requests to attach the Cookie to each request to simulate a logged-in session, and add the User-Agent header to mimic a real browser and avoid being blocked; this retrieves the static HTML of the target page.

Tutorial-1

import requests
from bs4 import BeautifulSoup

from settings import user_agent
from settings import cookie
from settings import target_user

session = requests.Session()

url = 'https://www.douban.com/people/' + target_user + '/rev_contacts'
headers = {
    'User-Agent': user_agent,
    'Cookie': cookie,
}

response = session.get(url=url, headers=headers)
if response.status_code != 200:
    print('fail, please check cookie and uid')
soup = BeautifulSoup(response.text, 'lxml')
# print(soup)

print('success')

The captured Cookie is not permanent and eventually expires. When it becomes invalid, you will need to log in again to obtain a new one.
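Before a long crawl it can be worth checking the cookie first. Here is a minimal sketch (assuming, as described above, that unauthenticated requests are redirected to the login page, so the final url no longer matches the one requested):

def cookie_is_valid(session, headers, test_url):
    # When the cookie has expired, requests follows the redirect to the
    # login page, so response.url no longer starts with test_url
    response = session.get(url=test_url, headers=headers)
    return response.status_code == 200 and response.url.startswith(test_url)

# e.g. cookie_is_valid(session, headers, url)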

Get the target user's neighbor list

Each page displays at most 70 friends, and which friend the listing starts from is controlled by the start parameter of the request. Therefore, you must first obtain the target user's total number of friends.

Inspecting several profile pages shows that each user's friend count appears in a fixed place (as shown below); the corresponding css selector is #db-usr-profile > div.info > h1. Use BeautifulSoup to parse the previously fetched page, then use the select function to extract that element and read off the number of friends.

Tutorial-2

# Get the total number of friends of the user from the crawled page

# css selector for the h1 element that contains the count
num = soup.select('#db-usr-profile > div.info > h1')
num = num[0].string
# The count sits just before the closing parenthesis, so scan
# backwards over the digits starting from the penultimate character
length = len(num) - 2
while '0' <= num[length] <= '9':
    length -= 1
num = int(num[length + 1:len(num) - 1])
print(num)

Once you have the number of friends, you can iterate through all the pages. On each page, every friend entry carries a link to that friend's home page; the css selector for these urls is #content > div > div.article > dl > dt > a. Each url has the form https://www.douban.com/people/zeqingg/, from which the uid is easily extracted.

# Crawl all the neighbor list pages, get the uid of every neighbor, save to local

from settings import uid_file

with open(uid_file, 'w') as file:
    # Display up to 70 neighbors per page
    for i in range(0, num, 70):
        current_url = url + '?start=' + str(i)
        # print(current_url)

        response = session.get(url=current_url, headers=headers)
        soup = BeautifulSoup(response.text, 'lxml')
        peoples = soup.select('#content > div > div.article > dl > dt > a')
        for people in peoples:
            # The href looks like https://www.douban.com/people/<uid>/;
            # the prefix is 30 characters long, so slice it and the trailing slash off
            uid = people['href'][30:-1]
            print(uid)
            file.write(uid + '\n')
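
As an aside, the fixed-offset slice above is brittle if the link format ever changes; here is a sketch of a more defensive alternative (a hypothetical helper, not part of the original project):

from urllib.parse import urlparse

def extract_uid(href):
    # The path looks like '/people/<uid>/'; take the last non-empty component
    return urlparse(href).path.rstrip('/').split('/')[-1]

# extract_uid('https://www.douban.com/people/zeqingg/') returns 'zeqingg'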

Get neighbor data

Each user's desktop home page does contain the relevant data, but its page structure is complicated to parse and the data it exposes is limited, so it is not used here.

The mobile version of Douban's books/movies/music archive page carries more detailed information, which supports more in-depth data analysis.

Take https://m.douban.com/people/zeqingg/subject_profile as an example; there is plenty of information available for analysis. However, after fetching the page, I found that the data I had seen in the browser was not there.

Further analysis shows that this is a dynamic page: all the data is requested after the page loads. Using the Network tool in Developer Tools again and looking at the items in the XHR category reveals the real request url, https://m.douban.com/rexxar/api/v2/user/zeqingg/archives_summary?for_mobile=1&ck=ykqX (as shown below). Requesting this url does not require a simulated login. The returned data is in json format and is parsed with the loads() function of Python's json module.

Tutorial-3
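
As a quick check, here is a minimal sketch of hitting this endpoint directly (the ck value is session-specific, so substitute the one captured from your own browser):

import requests
from json import loads

from settings import user_agent

api = 'https://m.douban.com/rexxar/api/v2/user/zeqingg/archives_summary?for_mobile=1&ck=ykqX'
decoded = loads(requests.get(api, headers={'User-Agent': user_agent}).text)
# Inspect what the response contains; the user item is described below
print(decoded['user'].keys())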

What we need is the account information, which corresponds to the user item in the json data. It contains many useful fields, such as birthday (birthday), gender (gender), location (loc), registration time (reg_time), number of broadcasts (statuses_count), and so on. We save all of them to a csv file.

One thing that can't be ignored is that some users leave fields blank or choose not to disclose them, so the corresponding item is absent from the json. To keep the Python program from raising errors, wrap each access in a try statement and write an empty string when the field doesn't exist.
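An alternative to the many try blocks in the code below would be a small helper (a hypothetical refactor, not the project's original code) that walks nested keys and falls back to an empty string:

def safe_get(data, *keys, default=''):
    # Walk nested dict/list lookups; bail out with default as soon as one is missing
    for key in keys:
        try:
            data = data[key]
        except (KeyError, IndexError, TypeError):
            return default
    return data

# e.g. safe_get(ndecoded, 'user', 'loc', 'name') replaces one whole try/except block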

Fetching the viewing information works the same way as the account information, taking the mobile page https://m.douban.com/people/zeqingg/movie_charts as an example. This is also a dynamic page, so applying the same method as in the previous step shows that the real request url is https://m.douban.com/rexxar/api/v2/user/zeqingg/collection_stats?type=movie&for_mobile=1&ck=ykqX (as shown below).

Tutorial-4

Useful fields in the returned json include the number of films watched (total_collections), total viewing time (total_spent), spending at the cinema (total_cost), average weekly viewing time (weekly_avg), the most-watched regions (countries), the most-watched genres (genres), and so on. Again, write them to the csv file.

Note that Douban has an anti-crawler mechanism that limits the number of visits per unit time; making many requests in a short window will get you blocked. So use the sleep() function of Python's time module to add a 2-second delay after each request, reducing the chance of being blocked.
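A fixed delay works, but a randomized one looks less mechanical to rate limiters; here is a sketch of an optional variation (my own suggestion, not part of the original code):

from random import uniform
from time import sleep

def polite_sleep(base=2.0, jitter=1.0):
    # Sleep between base and base + jitter seconds so requests
    # do not arrive at perfectly regular intervals
    sleep(base + uniform(0, jitter))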

from json import loads


def get_movie_info(nuid=''):
    nurl = 'https://m.douban.com/rexxar/api/v2/user/' + nuid + '/collection_stats?type=movie&for_mobile=1&ck=5Kvd'
    nreferer = 'https://m.douban.com/people/' + nuid + '/movie_charts'
    nheaders = {
        'Referer': nreferer,
        'User-Agent': user_agent,
    }
    nresponse = session.get(url=nurl, headers=nheaders)
    # The returned data is in json format, parsed using loads
    ndecoded = loads(nresponse.text)
    # print(ndecoded)

    nrow = []
    # Number of films watched; absent fields get an empty string
    try:
        nrow.append(ndecoded['total_collections'])
    except (KeyError, IndexError, TypeError):
        nrow.append('')
    # Total viewing time
    try:
        nrow.append(int(ndecoded['total_spent']))
    except (KeyError, IndexError, TypeError, ValueError):
        nrow.append('')
    # Spending at the cinema
    try:
        nrow.append(int(ndecoded['total_cost']))
    except (KeyError, IndexError, TypeError, ValueError):
        nrow.append('')
    # Average weekly viewing time
    try:
        nrow.append(round(ndecoded['weekly_avg'], 1))
    except (KeyError, IndexError, TypeError):
        nrow.append('')
    # The two most-watched regions
    try:
        nrow.append(ndecoded['countries'][0]['name'])
    except (KeyError, IndexError, TypeError):
        nrow.append('')
    try:
        nrow.append(ndecoded['countries'][1]['name'])
    except (KeyError, IndexError, TypeError):
        nrow.append('')
    # The three most-watched genres
    try:
        nrow.append(ndecoded['genres'][0]['name'])
    except (KeyError, IndexError, TypeError):
        nrow.append('')
    try:
        nrow.append(ndecoded['genres'][1]['name'])
    except (KeyError, IndexError, TypeError):
        nrow.append('')
    try:
        nrow.append(ndecoded['genres'][2]['name'])
    except (KeyError, IndexError, TypeError):
        nrow.append('')
    # print(nrow)

    return nrow


def get_user_info(nuid=''):
    nurl = 'https://m.douban.com/rexxar/api/v2/user/' + nuid + '/archives_summary?for_mobile=1&ck=5Kvd'
    nreferer = 'https://m.douban.com/people/' + nuid + '/subject_profile'
    nheaders = {
        'Referer': nreferer,
        'User-Agent': user_agent,
    }
    nresponse = session.get(url=nurl, headers=nheaders)
    # The returned data is in json format, parsed using loads
    ndecoded = loads(nresponse.text)
    # print(ndecoded)

    nrow = []
    # User's location
    try:
        nrow.append(ndecoded['user']['loc']['name'])
    except (KeyError, IndexError, TypeError):
        nrow.append('')
    # Number of broadcasts
    try:
        nrow.append(ndecoded['user']['statuses_count'])
    except (KeyError, IndexError, TypeError):
        nrow.append('')
    # Registration year (first four characters of reg_time)
    try:
        nrow.append(ndecoded['user']['reg_time'][:4])
    except (KeyError, IndexError, TypeError):
        nrow.append('')
    # Gender
    try:
        nrow.append(ndecoded['user']['gender'])
    except (KeyError, IndexError, TypeError):
        nrow.append('')
    # print(nrow)

    return nrow


def get_info(nuid=''):
    nrow = []
    nrow += get_user_info(nuid)
    nrow += get_movie_info(nuid)
    print(nrow)
    return nrow


import csv
from time import sleep

from settings import csv_title
from settings import dataset_file

with open(uid_file) as infile:
    with open(dataset_file, 'w', encoding='utf-8', newline='') as outfile:
        csv_file = csv.writer(outfile, dialect='excel')
        csv_file.writerow(csv_title)
        for line in infile:
            uid = line.strip()
            # print(uid)

            csv_file.writerow(get_info(uid))
            sleep(2)

Data Analysis


The data visualization in this section uses Plotly as the example; a pyecharts version is also provided in the project code.

Draw a neighbor distribution map

import plotly.plotly as py  # in newer Plotly releases: import chart_studio.plotly as py
import plotly.graph_objs as go

from settings import loc_lat
from settings import loc_lon
from settings import mapbox_access_token

loc = []
num = []
lat = []
lon = []

with open(dataset_file, 'r', encoding='utf-8') as file:
    csv_file = csv.reader(file)
    for line in csv_file:
        # Skip blank lines (deactivated users), rows without location data, and the header row
        if len(line) == 0 or line[0] == '' or line == csv_title:
            continue
        # No latitude and longitude data

        if loc_lat.get(line[0]) is None:
            continue
        try:
            # This region is already in the loc array; increment its count
            index = loc.index(line[0])
            num[index] += 1
        except ValueError:
            # Otherwise add it as a new region
            loc.append(line[0])
            num.append(1)
            lat.append(loc_lat[line[0]])
            lon.append(loc_lon[line[0]])

# print(loc)

# print(num)            


# Text displayed when hovering

text = []
for i in range(len(loc)):
    text.append(str(loc[i]) + '   ' + str(num[i]))

data = [
    go.Scattermapbox(
        lat=lat,
        lon=lon,
        mode='markers',
        marker=go.scattermapbox.Marker(
            # Marker size
            size=9
        ),
        text=text,
    )
]

layout = go.Layout(
    autosize=True,
    hovermode='closest',
    height=800,
    title='友邻地区分布',
    mapbox=go.layout.Mapbox(
        # Must have the correct access token to use

        accesstoken=mapbox_access_token,
        bearing=0,
        center=go.layout.mapbox.Center(
            lat=34,
            lon=108
        ),
        pitch=0,
        zoom=3.5,
    ),
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='neighbor_distribution_map')


picture-1

Draw a neighbor broadcast pyramid chart

import numpy as np

from math import ceil
from bisect import bisect_left

from settings import status_range

# Count of male neighbors per broadcast-count interval
male_status_num = np.array(list(0 for _ in status_range))
# Count of female neighbors (stored as negative values so the bars
# extend in the opposite direction of the pyramid chart)
female_status_num = np.array(list(0 for _ in status_range))

with open(dataset_file, 'r', encoding='utf-8') as file:
    csv_file = csv.reader(file)
    for line in csv_file:
        # Skip blank lines (deactivated users), rows without data, and the header row
        if len(line) == 0 or line[1] == '' or line == csv_title:
            continue
        # The neighbor is a male

        if line[3] == 'M':
            # Query which interval the number of broadcasts of the friend is in

            index = bisect_left(status_range, int(line[1]))
            male_status_num[index - 1] += 1
        # The neighbor is a female

        elif line[3] == 'F':
            index = bisect_left(status_range, int(line[1]))
            female_status_num[index - 1] -= 1

# print(male_status_num)

# print(female_status_num)


# Largest count in either direction (female counts are stored as negatives,
# so the largest magnitude is -min)
length = max(max(male_status_num), -min(female_status_num))
# print(length)


# The x-axis boundary is set to a multiple of 30

boundary = 30 * ceil(length / 30)
# print(boundary)


# The interval displayed by the y-axis

label = []
for index in range(1, len(status_range)):
    label.append('{} - {}'.format(str(status_range[index - 1]), str(status_range[index])))
label.append(str(status_range[-1]) + ' +')
# print(label)


layout = go.Layout(title='友邻广播',
                   yaxis=go.layout.YAxis(title='广播数量'),
                   xaxis=go.layout.XAxis(
                       range=[-boundary, boundary],
                       # Tick positions used internally (negative on the female side)
                       tickvals=list(val for val in range(20 - boundary, boundary, 20)),
                       # Tick labels shown to the reader (absolute values)
                       ticktext=list(abs(text) for text in range(20 - boundary, boundary, 20)),
                       title='人数'),
                   barmode='overlay',
                   bargap=0.1)

data = [go.Bar(y=label,
               x=male_status_num,
               orientation='h',
               name='',
               hoverinfo='x',
               marker=dict(color='lightskyblue'),
               opacity=0.8
               ),
        go.Bar(y=label,
               x=female_status_num,
               orientation='h',
               name='',
               text=-1 * female_status_num.astype('int'),
               hoverinfo='text',
               marker=dict(color='gold'),
               opacity=0.8
               )]

py.iplot(dict(data=data, layout=layout), filename='status_pyramid_chart')

picture-2

Draw a registration time chart

from settings import reg_year_range

reg_year_num = np.array(list(0 for _ in reg_year_range))

with open(dataset_file, 'r', encoding='utf-8') as file:
    csv_file = csv.reader(file)
    for line in csv_file:
        # Skip blank lines (deactivated users), rows without data, and the header row
        if len(line) == 0 or line[2] == '' or line == csv_title:
            continue
        reg_year_num[reg_year_range.index(int(line[2]))] += 1

# print(reg_year_num)


trace = go.Pie(
    labels=reg_year_range,
    values=reg_year_num,
    textinfo='label',
    marker=dict(line=dict(color='black', width=1))
)

py.iplot([trace], filename='reg_year_pie_chart')

picture-3

Draw a viewing data chart

import cufflinks as cf
import pandas as pd

cf.set_config_file(offline=False, world_readable=True)

df = pd.read_csv(dataset_file).dropna()

# x axis: viewing time
# y axis: spending
# bubble size: number of films watched

df.iplot(kind='bubble', x=csv_title[5], y=csv_title[6], size=csv_title[4], text=csv_title[4],
         xTitle='观看时间', yTitle='消费', colorscale='blues', filename='movie_bubble_chart')

picture-4

Draw a neighbor movie genre distribution chart

from settings import genre_range

genre_num = np.array(list(0 for _ in genre_range))

with open(dataset_file, 'r', encoding='utf-8') as file:
    csv_file = csv.reader(file)
    for line in csv_file:
        # Skip blank lines (deactivated users) and the header row
        if len(line) == 0 or line == csv_title:
            continue
        # Count the three most-watched genres of each neighbor
        if line[10] != '':
            genre_num[genre_range.index(line[10])] += 1
        if line[11] != '':
            genre_num[genre_range.index(line[11])] += 1
        if line[12] != '':
            genre_num[genre_range.index(line[12])] += 1

# print(genre_num)

            
num = []
label = []
# Pick the six genres most watched across all neighbors
for i in range(6):
    index = np.argmax(genre_num)
    label.append(genre_range[index])
    num.append(genre_num[index])
    # Zero out the winner so argmax finds the next most common genre
    genre_num[index] = 0

num.reverse()
label.reverse()

# print(num)

# print(label)


data = [go.Bar(
    x=num,
    y=label,
    text=num,
    textposition='auto',
    orientation='h',
    marker=dict(color='gold'),
    opacity=0.8
)]

py.iplot(data, filename='genre_horizontal_bar_chart')

picture-5

Draw a neighbor movie region distribution chart

from settings import country_range

total = 0
country_num = np.array(list(0 for _ in country_range))

with open(dataset_file, 'r', encoding='utf-8') as file:
    csv_file = csv.reader(file)
    for line in csv_file:
        # Skip blank lines (deactivated users) and the header row
        if len(line) == 0 or line == csv_title:
            continue
        # Count the two most-watched regions of each neighbor
        if line[8] != '':
            country_num[country_range.index(line[8])] += 1
            total += 1
        if line[9] != '':
            country_num[country_range.index(line[9])] += 1
            total += 1

# print(country_num)


# Pie chart's x coordinate

domain_x = ([0, 0.24], [0.38, 0.62], [0.76, 1], [0, 0.24], [0.38, 0.62], [0.76, 1])
# Pie chart's y coordinate

domain_y = ([0.6, 1], [0.6, 1], [0.6, 1], [0, 0.4], [0, 0.4], [0, 0.4])
colors = ('lightskyblue', 'lightcoral', 'lightgreen', 'lightskyblue', 'lightcoral', 'lightgreen')
# Text's x coordinate

x = (0.09, 0.5, 0.91, 0.09, 0.5, 0.91)
# Text's y coordinate

y = (0.84, 0.84, 0.84, 0.16, 0.16, 0.16)

# Drawing data

data = []
# The text displayed in the center of the pie chart

annotations = []
# Pick the six most-watched regions across all neighbors
for i in range(6):
    index = np.argmax(country_num)
    num = country_num[index]
    country_num[index] = 0

    data.append({
        'labels': [country_range[index], '其他'],
        'values': [num, total - num],
        'type': 'pie',
        'marker': {'colors': [colors[i], 'whitesmoke']},
        'domain': {'x': domain_x[i], 'y': domain_y[i]},
        'hoverinfo': 'label+percent',
        'hole': .75,
    })

    annotations.append({
        'font': {'size': 16},
        'showarrow': False,
        'text': country_range[index],
        'x': x[i],
        'y': y[i]
    })

fig = {
    'data': data,
    'layout': {
        'title': '友邻常看电影地区分布图',
        'grid': {'rows': 2, 'columns': 3},
        'annotations': annotations
    }
}

py.iplot(fig, filename='country_pie_chart')

picture-6

Summary


This post is adapted directly from my project report (partly to make up the word count); additional content not included here can be found in the repository.


Last modified on 2020-02-26 00:41:21

YXL Nakajiri

Original link: https://www.yxl76.net/en/post/douban-neighbor-analysis/

When reprinting, please credit the original source with a link.