这一个多月,一直在做数据采集,写程序,然后创建好几个表一个个填字段,我吐了。

于是开发了一个懒人工具。


运行环境 Runtime environment

1
2
3
4
5
操作系统: Windos10  
navicat: 15.0.2
mysql: 5.7.31
IDE: Pycharm 2021.3.2 x64
语言: Python 3.9.1

背景

这一个多月,一直在做数据采集。

开发程序,然后为每个创建好几个表,

并且一个个填字段,excel|Pycharm|Navicat 三个窗口切着换,眼睛都看疼了。

于是开发了一个懒人工具,

通过python的字典,直接生成可供Navicat(别的能够执行sql语句的工具也可以,只是我没测试过)使用的建表sql语句,简称DDL。

应用场景

假如爬虫采集一条数据,并进行清洗生产了一个变量名为item的字典

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
itemDict = {
'chembl_id': 'cXXXl_XXXX_target',
'chembl_target_type': 'PROTEIN COMPLEX',
'chembl_target_name': 'Interleukin-23',
'chembl_target_alias': ['CLMF p40', 'Cytotoxic lymphocyte maturation factor 40 kDa subunit', 'IL12B', 'IL-12B'],
'chembl_target_organisim': 'Homo sapiens',
'chembl_target_classification': ['Secreted protein', 'Secreted protein'],
'chembl_drug_name': None,
'chembl_drug': None,
'guidetopharmacology': None,
'human_protein_atlas': {'EXXXXXXXXX2': 'http://www.XXXX.org/Gene/Summary?g=EXXXXXXXXX2',
'EXXXXXXXXX3': 'http://www.XXXX.org/Gene/Summary?g=EXXXXXXXXX3'},
'open_targets': {'EXXXXXXXXX2': 'http://www.XXXX.org/Gene/Summary?g=EXXXXXXXXX2',
'EXXXXXXXXX3': 'http://www.XXXX.org/Gene/Summary?g=EXXXXXXXXX3'},
'chembl_pharos_url': {'Tclin': 'https://XXXX.XXX.gov/idg/targets/QXXXX'},
'ebi_chembl_url': 'https://www.XXXX.uk/chembl/target_report_card/XXXXX/',
'raw_datas': {'_id': 'XXXXX', '_index': 'cXXXl_XXXX_target', '_primary_term': 1, '_seq_no': 179}

要把这条数据库存入mysql数据库,就要先创建相关的数据表。

1
2
3
CREATE TABLE `表名` (
......
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4;

代码

直接上代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Author : RaXianch
# CreatDATE : 2021/4/21
# CreatTIME : 9:21
# Blog : https://blog.raxianch.moe/
# Github : https://github.com/DeSireFire
__author__ = 'RaXianch'

"""
运行结果:
CREATE TABLE `!!!!此处填写表名!!!!` (
`id` int(25) unsigned NOT NULL AUTO_INCREMENT COMMENT '自增id(仅作计数使用不可作为关联依据!)',
`chembl_id` varchar(648) NOT NULL COMMENT 'chembl_id',
`chembl_target_type` varchar(648) NOT NULL COMMENT 'chembl_target_type',
`chembl_target_name` varchar(648) NOT NULL COMMENT 'chembl_target_name',
`chembl_target_alias` json DEFAULT NULL COMMENT 'chembl_target_alias',
`chembl_target_organisim` varchar(648) NOT NULL COMMENT 'chembl_target_organisim',
`chembl_target_classification` json DEFAULT NULL COMMENT 'chembl_target_classification',
`chembl_drug_name` varchar(648) NOT NULL COMMENT 'chembl_drug_name',
`chembl_drug` varchar(648) NOT NULL COMMENT 'chembl_drug',
`guidetopharmacology` varchar(648) NOT NULL COMMENT 'guidetopharmacology',
`human_protein_atlas` json DEFAULT NULL COMMENT 'human_protein_atlas',
`open_targets` json DEFAULT NULL COMMENT 'open_targets',
`chembl_pharos_url` json DEFAULT NULL COMMENT 'chembl_pharos_url',
`ebi_chembl_url` varchar(648) NOT NULL COMMENT 'ebi_chembl_url',
`raw_datas` json DEFAULT NULL COMMENT 'raw_datas',
PRIMARY KEY (`id`) USING BTREE,
UNIQUE KEY `unique_id` (`id`) USING BTREE,
`update_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '采集时间'
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COMMENT='总之就是特别牛逼的一表数据';
"""


def dict2sqlDDL(tempDict: dict):
result = []
sql_orm = {
'dict': " `{conName}` json DEFAULT NULL COMMENT '{COMMENT}',",
'list': " `{conName}` json DEFAULT NULL COMMENT '{COMMENT}',",
'bool': " `{conName}` varchar(648) NOT NULL COMMENT '{COMMENT}',",
'NoneType': " `{conName}` json DEFAULT NULL COMMENT '{COMMENT}',",
# 'str': " `{conName}` varchar(648) NOT NULL COMMENT '{COMMENT}',",
'str': str_to_varchar_text_checker,
}
must_sql = {
0: "CREATE TABLE `!!!!此处填写表名!!!!` (",
1: " `id` int(25) unsigned NOT NULL AUTO_INCREMENT COMMENT '自增id(仅作计数使用不可作为关联依据!)',",
"autoEnd0": " `update_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '采集时间',",
"autoEnd1": " PRIMARY KEY (`id`) USING BTREE,",
"autoEnd2": " UNIQUE KEY `unique_id` (`id`) USING BTREE,",
"End": ") ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COMMENT='总之就是特别牛逼的一表数据';",
}
for k, v in tempDict.items():
typeStr = str(type(v))[8:-2]
if isinstance(v, str):
result.append(sql_orm[typeStr](k, v).format(conName=k, COMMENT=k))
else:
result.append(sql_orm[typeStr].format(conName=k, COMMENT=k))

for n, c in must_sql.items():
if isinstance(n, int):
result.insert(n, c)
else:
# 自动添加到行数末尾
if "autoEnd" in n:
result.append(c)
# 绝对末尾
if "End" == n and list(must_sql.values()).index(c) == len(list(must_sql.keys()))-1:
result.insert(len(result), c)

if "(" not in c and ")" in c:
# 去掉末尾逗号
if result[-2][-1] == ",":
result[-2] = result[-2][0:-1]

for line in result:
print(line)

def str_to_varchar_text_checker(tempkey:str, tempStr: str):
"""
根据传入的值长短,判断sql是使用varchar还是text类型
:return:
"""
resStr = ""
if not tempStr:
return " `{conName}` varchar(648) DEFAULT NULL COMMENT '{COMMENT}',"
lenInt = len(tempStr)

if lenInt < 648:
resStr = " `{conName}` varchar(648) DEFAULT NULL COMMENT '{COMMENT}',"

if lenInt >= 648:
resStr = " `{conName}` text COMMENT '{COMMENT}',"

if "url" in tempkey:
resStr = " `{conName}` text COMMENT '{COMMENT}',"

# 字段名称包含id
if "_id" in tempkey or "id" == tempkey:
if "NOT NULL" not in resStr:
if "DEFAULT NULL" in resStr:
resStr = resStr.replace("DEFAULT NULL", "NOT NULL")
else:
resStr = resStr.replace("NULL", "NOT NULL")

return resStr

if __name__ == '__main__':

itemDict = {
'chembl_id': 'cXXXl_XXXX_target',
'chembl_target_type': 'PROTEIN COMPLEX',
'chembl_target_name': 'Interleukin-23',
'chembl_target_alias': ['CLMF p40', 'Cytotoxic lymphocyte maturation factor 40 kDa subunit', 'IL12B', 'IL-12B'],
'chembl_target_organisim': 'Homo sapiens',
'chembl_target_classification': ['Secreted protein', 'Secreted protein'],
'chembl_drug_name': None,
'chembl_drug': None,
'guidetopharmacology': None,
'human_protein_atlas': {'EXXXXXXXXX2': 'http://www.XXXX.org/Gene/Summary?g=EXXXXXXXXX2',
'EXXXXXXXXX3': 'http://www.XXXX.org/Gene/Summary?g=EXXXXXXXXX3'},
'open_targets': {'EXXXXXXXXX2': 'http://www.XXXX.org/Gene/Summary?g=EXXXXXXXXX2',
'EXXXXXXXXX3': 'http://www.XXXX.org/Gene/Summary?g=EXXXXXXXXX3'},
'chembl_pharos_url': {'Tclin': 'https://XXXX.XXX.gov/idg/targets/QXXXX'},
'ebi_chembl_url': 'https://www.XXXX.uk/chembl/target_report_card/XXXXX/',
'raw_datas': {'_id': 'XXXXX', '_index': 'cXXXl_XXXX_target', '_primary_term': 1, '_seq_no': 179}

dict2sqlDDL(itemDict)

运行结果

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22

CREATE TABLE `!!!!此处填写表名!!!!` (
`id` int(25) unsigned NOT NULL AUTO_INCREMENT COMMENT '自增id(仅作计数使用不可作为关联依据!)',
`chembl_id` varchar(648) NOT NULL COMMENT 'chembl_id',
`chembl_target_type` varchar(648) NOT NULL COMMENT 'chembl_target_type',
`chembl_target_name` varchar(648) NOT NULL COMMENT 'chembl_target_name',
`chembl_target_alias` json DEFAULT NULL COMMENT 'chembl_target_alias',
`chembl_target_organisim` varchar(648) NOT NULL COMMENT 'chembl_target_organisim',
`chembl_target_classification` json DEFAULT NULL COMMENT 'chembl_target_classification',
`chembl_drug_name` varchar(648) NOT NULL COMMENT 'chembl_drug_name',
`chembl_drug` varchar(648) NOT NULL COMMENT 'chembl_drug',
`guidetopharmacology` varchar(648) NOT NULL COMMENT 'guidetopharmacology',
`human_protein_atlas` json DEFAULT NULL COMMENT 'human_protein_atlas',
`open_targets` json DEFAULT NULL COMMENT 'open_targets',
`chembl_pharos_url` json DEFAULT NULL COMMENT 'chembl_pharos_url',
`ebi_chembl_url` varchar(648) NOT NULL COMMENT 'ebi_chembl_url',
`raw_datas` json DEFAULT NULL COMMENT 'raw_datas',
PRIMARY KEY (`id`) USING BTREE,
UNIQUE KEY `unique_id` (`id`) USING BTREE,
`update_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '采集时间'
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COMMENT='总之就是特别牛逼的一表数据';

直接复制下来,放入navicat即可。

Navicat 运行结果:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
CREATE TABLE `test_demo` (
`id` int(25) unsigned NOT NULL AUTO_INCREMENT COMMENT '自增id(仅作计数使用不可作为关联依据!)',
`chembl_id` varchar(648) NOT NULL COMMENT 'chembl_id',
`chembl_target_type` varchar(648) DEFAULT NULL COMMENT 'chembl_target_type',
`chembl_target_name` varchar(648) DEFAULT NULL COMMENT 'chembl_target_name',
`chembl_target_alias` json DEFAULT NULL COMMENT 'chembl_target_alias',
`chembl_target_organisim` varchar(648) DEFAULT NULL COMMENT 'chembl_target_organisim',
`chembl_target_classification` json DEFAULT NULL COMMENT 'chembl_target_classification',
`chembl_drug_name` json DEFAULT NULL COMMENT 'chembl_drug_name',
`chembl_drug` json DEFAULT NULL COMMENT 'chembl_drug',
`guidetopharmacology` json DEFAULT NULL COMMENT 'guidetopharmacology',
`human_protein_atlas` json DEFAULT NULL COMMENT 'human_protein_atlas',
`open_targets` json DEFAULT NULL COMMENT 'open_targets',
`chembl_pharos_url` json DEFAULT NULL COMMENT 'chembl_pharos_url',
`ebi_chembl_url` text COMMENT 'ebi_chembl_url',
`raw_datas` json DEFAULT NULL COMMENT 'raw_datas',
`update_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '采集时间',
PRIMARY KEY (`id`) USING BTREE,
UNIQUE KEY `unique_id` (`id`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COMMENT='总之就是特别牛逼的一表数据'
> OK
> 时间: 0.073s

总结

代码总归是写得比较死的,不适合所有的场景,但是生成出来以后,删着改,总比从0开始去写要方便多了。