v0.6.0
Release date: 2023-12-14 10:05:42
Latest release of NaiboWang/EasySpider: v0.6.2 (2024-04-22 06:37:17)
Release Note: Thank you for your support of EasySpider. I have been busy with a paper recently and started developing new features as soon as it was submitted. I am now releasing the 0.6.0 Beta version for Windows 64-bit. Everyone is welcome to test the new features and report any problems through GitHub issues. If a week of testing turns up no significant problems, versions for all other operating systems will be released after that week.
If the download is slow, consider using the download mirror for mainland China.
Update Notes
1. Single-Click to Highlight Elements: During task design with the browser open, single-clicking an operation makes the browser automatically highlight the element it maps to, which makes debugging easier. This works for every browser-related operation, including debugging JavaScript commands, automatically checking whether the condition of a conditional branch is met, and marking elements.
2. Double-Click for Dynamic Debugging: During task design with the browser open, double-clicking an operation trial-runs it and displays the result in the browser in real time.
3. Speed Optimization: Loop-based data extraction runs significantly faster when the loop contains no extra steps such as executing JavaScript or downloading images.
4. Dynamic XPath and Code Modification Using `eval`: Any XPath or JavaScript code snippet can embed `eval("expression")` to insert the value of an expression from the Python environment directly, with no need for a custom operation that stores an intermediate variable. For example:
   - Define a variable `a` using the exec option of a custom operation: `self.a = 1`
   - In the XPath of a data extraction operation, use `/html/body/div[eval("self.a")]` to represent `/html/body/div[1]`.
   - Change the value of `a` using the exec option of a custom operation again: `self.a = self.a + 1`
   - The XPath for data extraction now becomes `/html/body/div[2]`.

   This is useful for sites that have no "next page" button, where pages can only be reached by clicking each page number in turn. A detailed tutorial will be released soon. Example task file: 290.json
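The substitution described in the eval example can be pictured with a small Python sketch. This is not EasySpider's actual implementation: the helper `resolve_eval`, the regular expression, and the `task` object standing in for `self` are all hypothetical names chosen for illustration. It only shows how an `eval("...")` placeholder in an XPath could be replaced by the current value of a Python expression.

```python
import re
from types import SimpleNamespace

# Matches eval("...") placeholders; the quoted part is a Python expression.
EVAL_PATTERN = re.compile(r'eval\("([^"]*)"\)')


def resolve_eval(template, task):
    """Replace each eval("expr") in template with the string value of expr.

    Hypothetical helper: `task` plays the role of `self` in the expressions.
    """
    def substitute(match):
        return str(eval(match.group(1), {}, {"self": task}))

    return EVAL_PATTERN.sub(substitute, template)


task = SimpleNamespace()  # hypothetical stand-in for the running task object
xpath = '/html/body/div[eval("self.a")]'

exec("self.a = 1", {}, {"self": task})           # first custom operation
first = resolve_eval(xpath, task)

exec("self.a = self.a + 1", {}, {"self": task})  # second custom operation
second = resolve_eval(xpath, task)

print(first)
print(second)
```

Because the placeholder is resolved each time the XPath is used, changing `self.a` between extractions moves the XPath from one `div` to the next without editing the task.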
5. External Code Files for Exec and Eval: Every Exec and Eval option can load an external code file. Write the Python code locally in an IDE such as VSCode, then enter `outside:myCode.py` in the task input box; the program reads and executes the code in `myCode.py` from the EasySpider directory. This suits large amounts of code that benefit from IDE support. Note that EasySpider supports custom Python functions, importing external Python packages, and exception handling with try...except.
6. Multi-Layer Nested iframes: Nested iframes are handled with the same experience as pages without iframes. However, each XPath must be one that can only be matched inside the intended iframe, so a generic XPath such as `//body` will only locate the body tag of the first iframe layer.
7. Paging Prompt After Data Extraction: After a data extraction operation is designed, the browser operation console asks whether to add paging. Once the location of the paging button is specified, a data extraction operation with paging is generated automatically in the flowchart.
8. Batch Text Input: The browser operation console adds a batch text input feature that automatically generates a loop operation with a text list.
9. Option to Store Extracted Data as a New Row: If set to "no", the extracted data is held temporarily instead of producing a new row, and is written as part of the row produced by the next data extraction operation that does create one. This suits linked-list scenarios: https://github.com/NaiboWang/EasySpider/issues/35 https://github.com/NaiboWang/EasySpider/issues/189
10. Pause in Custom Operations: Custom operations can pause program execution, for example when a captcha page appears, so that the user can complete the verification manually.
11. Refresh Page in Custom Operations.
12. Send Email in Custom Operations.
13. Alert Dialog Handling in Click Element Operations: Choose to accept or dismiss the alert.
14. Parallel Execution Optimization: In the execution mode that uses a browser with user information, the user directory is now copied before the task runs, which solves the problems with running several instances in parallel. You can click the task execution button (user information mode) multiple times, or run several command-line programs at once, without manually copying user information folders. The copied folder is deleted automatically when the task finishes; if you quit manually midway, delete the temporary user information directory under the `TempUserDataFolder` folder yourself.
15. Automatic Operation Naming: Operation names are matched and updated automatically based on context, which removes most manual renaming. For example, click and move-to-element operations default to the text of the target element, loop operations are named after their loop type, and names update when the type of a custom operation, loop, or conditional branch is switched.
16. Single-Element Loop Optimization: For a single-element loop, such as one that keeps clicking a pagination button, the "content unchanged" check can be limited to the content of one element instead of the whole page.
17. Default File Download Location: Now inside the task folder.
18. New Conditional Branches Are Added at the Far Right.
19. Right-Click Menu in Flowchart: Right-clicking any operation in the flowchart opens a menu to trial-run (debug-run), copy, cut, or delete it, and to adjust the order of conditional branches.
20. Close Hint on the Operation Prompt Box: A close hint is added at the bottom right of the operation prompt box. When a login QR code is covered, click the "×" at the bottom right to close the operation console.
21. Custom Pause/Control Keys When Saving Tasks: Different parallel instances can use different keys to pause and resume.
22. Maximize Browser Window Option When Saving Tasks: Choose whether the browser window is maximized before the task runs.
23. Data Overwrite Mode When Saving Tasks: With the write mode set to overwrite, each run of a task with the same task ID deletes the original file before collecting again (requires a static file name).
24. MySQL Duplicate Handling: When a write to MySQL hits duplicate data, the record is ignored and the run continues. This suits scenarios where duplicate rows are unwanted. It requires setting the table's primary key to the relevant fields yourself; with the table EasySpider creates by default, the primary key is an auto-increment ID, so no row is ever treated as a duplicate.
25. Base64 Image Download: Adds downloading of data:base64 images, and can also handle images that require login to download (not guaranteed to work in every case).
26. Better Exception Handling: Prevents unexpected interruptions during collection and retries after an interruption; for example, the history back-navigation bug is fixed.
27. ddddocr Library Upgrade.
28. UI Update.
29. Chrome Browser Upgraded to Version 120.
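As background for the data:base64 image download item above, the following Python sketch shows how an image embedded as a `data:` URI can be turned back into raw bytes. It is an illustration using only the standard library, not EasySpider's own download code. The function name `decode_data_uri` is hypothetical, and the parsing is simplified (any extra parameters in the header are left inside the returned media type).

```python
import base64
from urllib.parse import unquote_to_bytes


def decode_data_uri(uri):
    """Return (media_type, raw_bytes) for a data: URI (simplified parsing)."""
    if not uri.startswith("data:"):
        raise ValueError("not a data: URI")
    # Everything before the first comma is the header, the rest is the payload.
    header, _, payload = uri[len("data:"):].partition(",")
    if header.endswith(";base64"):
        media_type = header[: -len(";base64")]
        data = base64.b64decode(payload)
    else:
        # Without ;base64 the payload is percent-encoded text.
        media_type = header
        data = unquote_to_bytes(payload)
    return media_type or "text/plain", data


# Build a tiny sample URI from the 8-byte PNG signature.
png_signature = b"\x89PNG\r\n\x1a\n"
sample = "data:image/png;base64," + base64.b64encode(png_signature).decode("ascii")

media_type, data = decode_data_uri(sample)
print(media_type, len(data))
```

The decoded bytes can then be written to a file in binary mode, with the extension chosen from the media type.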
1. EasySpider_0.6.0_windows_x64_Beta.7z 293.9MB