CodeQL Pitfall Notes

Published on March 30, 2020

Installation

Getting started: https://help.semmle.com/codeql/codeql-cli/procedures/get-started.html
Download: https://github.com/github/codeql-cli-binaries/releases
License: https://securitylab.github.com/tools/codeql/license

The license says:

Further, except (and only to the extent) permitted by applicable law or applicable third-party license, you will not (and have no right to):

work around any technical limitations in the Software that only allow you to use it in certain ways; reverse engineer, decompile or disassemble the Software; remove, minimize, block, or modify any notices of GitHub or its suppliers in the Software; use the Software in any way that is against the law; or share, publish, distribute or lend the Software, provide or make available the Software as a hosted solution (whether on a standalone basis or combined, incorporated or integrated with other software or services) for others to use, or transfer the Software or these Terms to any third party.

GitHub CodeQL can only be used on codebases that are released under an OSI-approved open source license, or to perform academic research. It can’t be used to generate CodeQL databases for or during automated analysis, continuous integration or continuous delivery, whether as part of normal software engineering processes or otherwise. For these uses, contact the sales team.

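In short: the main program is closed source, but the detection rules are open source, and so are some of the extractors. That said, it is written in Java and not obfuscated, so in practice it is not far from open source.
It can be used freely for academic research or for analyzing open source projects; commercial use and the like require buying a commercial license.

The main program is tools/codeql.jar, with no obfuscation at all, which is pretty decent of them.
The package ships with the runtime and the extractors for the supported languages (it even bundles a Java runtime), so it works out of the box.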

Testing

Start with a small test case:

# $ cat flask-example/app.py
import flask
import subprocess

app = flask.Flask(__name__)

@app.route('/')
def index():
    return subprocess.check_output(flask.request.args.get('c', 'ls'))

app.run()

Create the database:

/home/rmb122/repos/codeql/codeql database create codeql-database --language=python --source-root flask-example

And then, to nobody's surprise, it errored out, lol

[2020-03-30 17:45:00] [build] [INFO] [8] Extracted module _lzma in 22ms
[2020-03-30 17:45:01] [build] [INFO] [3] Extracted file /usr/lib/python3.8/site-packages/cryptography/hazmat/backends/openssl/dsa.py in 765ms
[2020-03-30 17:45:01] [build] [INFO] [5] Extracted file /usr/lib/python3.8/site-packages/cryptography/hazmat/primitives/serialization/ssh.py in 404ms
[2020-03-30 17:45:01] [build] [INFO] [1] Extracted file /usr/lib/python3.8/site-packages/cryptography/hazmat/backends/openssl/encode_asn1.py in 1572ms
[2020-03-30 17:45:01] [build] [ERROR] [1] Failed to extract module _decimal: libmpdec.so.2: cannot open shared object file: No such file or directory
[2020-03-30 17:45:01] [build] [TRACEBACK] [1] "semmle/worker.py", line 220, in _extract_loop
[2020-03-30 17:45:01] [build] [TRACEBACK] [1] "semmle/extractors/super_extractor.py", line 47, in process
[2020-03-30 17:45:01] [build] [TRACEBACK] [1] "semmle/extractors/builtin_extractor.py", line 14, in process
[2020-03-30 17:45:01] [build] [INFO] [5] Extracted module _io in 50ms
[2020-03-30 17:45:01] [build] [INFO] [8] Extracted file /usr/lib/python3.8/site-packages/cryptography/hazmat/primitives/asymmetric/dh.py in 373ms
[2020-03-30 17:45:01] [build] [INFO] [1] Extracted file /usr/lib/python3.8/site-packages/werkzeug/wrappers/cors.py in 85ms
[2020-03-30 17:45:01] [build] [INFO] [7] Extracted file /usr/lib/python3.8/argparse.py in 5850ms
[2020-03-30 17:45:01] [build] [INFO] [2] Extracted file /usr/lib/python3.8/tarfile.py in 5602ms
[2020-03-30 17:45:01] [build] [INFO] [3] Extracted file /usr/lib/python3.8/site-packages/werkzeug/debug/repr.py in 611ms
[2020-03-30 17:45:01] [build] [INFO] [4] Extracted file /usr/lib/python3.8/site-packages/jinja2/parser.py in 2626ms
[2020-03-30 17:45:03] [build] [ERROR] Process 6 timed out
[2020-03-30 17:45:04] [build-err] Traceback (most recent call last):
[2020-03-30 17:45:04] [build-err]   File "/home/rmb122/repos/codeql/python/tools/index.py", line 19, in <module>
[2020-03-30 17:45:04] [build-err]     buildtools.index.main()
[2020-03-30 17:45:04] [build-err]   File "/home/rmb122/repos/codeql/python/tools/python3src.zip/buildtools/index.py", line 110, in main
[2020-03-30 17:45:04] [build-err]   File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
[2020-03-30 17:45:04] [build-err]     raise CalledProcessError(retcode, cmd)
[2020-03-30 17:45:04] [build-err] subprocess.CalledProcessError: Command '['python3', '-S', '/home/rmb122/repos/codeql/python/tools/python_tracer.py', '-v', '-z', 'all', '-c', '/home/rmb122/temp/codeql/codeql-database/working/trap_cache', '-p', '/usr/lib/python3.8/site-packages', '-R', '/home/rmb122/temp/codeql/flask-example']' returned non-zero exit status 1.
[2020-03-30 17:45:04] [ERROR] Spawned process exited abnormally (code 1; tried to run: [/home/rmb122/repos/codeql/python/tools/autobuild.sh])
A fatal error occurred: Exit status 1 from command: [/home/rmb122/repos/codeql/python/tools/autobuild.sh]

The problem is here:
Failed to extract module _decimal: libmpdec.so.2: cannot open shared object file: No such file or directory
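
In a long extractor log the failing line is easy to miss. A minimal Python sketch that keeps only the [ERROR] entries from a log in the format shown above; the helper name error_lines is made up for illustration and is not part of CodeQL:

```python
def error_lines(log_text):
    """Return the message part of every line that carries an [ERROR] tag."""
    marker = " [ERROR] "
    return [
        line.split(marker, 1)[1]  # keep only the text after the tag
        for line in log_text.splitlines()
        if marker in line
    ]
```

Feeding it the output of the failed database creation would leave just the _decimal failure, the timeout, and the final "Spawned process exited abnormally" line.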

Running the import in the REPL confirms it:

$ python
Python 3.8.2 (default, Feb 26 2020, 22:21:03)
[GCC 9.2.1 20200130] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import _decimal
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: libmpdec.so.2: cannot open shared object file: No such file or directory
>>>

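Surprisingly, the library was simply not installed; pacman -S mpdecimal takes care of it.
The extractor apparently only scans the libraries that are actually used rather than everything installed, so it is quite fast. Thumbs up.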
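
To catch this kind of breakage before starting a database build, the modules can be imported up front. A small sketch using only the standard library; the function name unimportable is hypothetical, and which modules to check is up to you:

```python
import importlib

def unimportable(module_names):
    """Try to import each module; return {name: error message} for failures."""
    failures = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError as exc:
            # Covers both a missing module and a missing shared library.
            failures[name] = str(exc)
    return failures
```

On the machine above, unimportable(["_decimal"]) would have reported the missing libmpdec.so.2 without waiting for the extractor to fail.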

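Now to actually write some QL.
Tutorials:
https://help.semmle.com/QL/learn-ql/
https://help.semmle.com/QL/ql-handbook/
https://help.semmle.com/qldoc/

The official rule repository:
https://github.com/Semmle/ql.git
Worth studying.

The official QL libraries come with a set of predefined data structures that become available with a plain import python. .qll files are libraries, and .ql files are the actual queries.
To run a query you also need a qlpack.yml, which plays roughly the same role as pom.xml: it declares the pack name and its dependencies.
https://help.semmle.com/codeql/codeql-cli/reference/qlpack-overview.html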

name: com-rmb122-test
version: 0.0.1
libraryPathDependencies: codeql-python
extractor: python

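The path to the ql repo has to be set in $HOME/.config/codeql/config, otherwise its libraries cannot be imported. This kind of manual configuration is a pain =.= It feels like linking a pile of libraries when compiling C++ by hand.
The documentation is, honestly, rather messy too, but that is understandable given how recently GitHub acquired it. Hopefully it gets more convenient later on.

--search-path <path to ql repo>

Be careful not to write --search-path=<path to ql repo>, or it will not be recognized… That one cost me a lot of time.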
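
Since this mistake is silent, a quick check of the config file can help. The sketch below flags any option line written in the --option=value form; it encodes only the behavior observed in this post with the CLI of that time, and the function name and the /opt/ql path in the usage note are hypothetical:

```python
def suspicious_config_lines(config_text):
    """Return (line number, line) pairs that use the '--option=value' form."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.strip()
        # Only the first token matters: '--search-path=/x' versus '--search-path /x'.
        if stripped.startswith("--") and "=" in stripped.split()[0]:
            hits.append((number, stripped))
    return hits
```

For example, suspicious_config_lines("--search-path=/opt/ql") flags the line, while "--search-path /opt/ql" passes.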

Then save this query as test.ql

import python

from Function f
where f.getName().matches("index")
select f, "This is a function called get..."

and run it with

/home/rmb122/repos/codeql/codeql query run test.ql -d ../codeql-database/

That is all it takes. The VS Code extension works as well and is more convenient. The query finds the index function we wrote.
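
Because the database has to be recreated and the query rerun over and over, it can help to script the two invocations. A minimal sketch that uses only the subcommands and flags that appear in this post; the helper names are made up, and the codeql argument is whatever path your binary lives at:

```python
import subprocess

def create_database_cmd(codeql, database, source_root, language="python"):
    """Build the 'codeql database create' command line used earlier."""
    return [codeql, "database", "create", database,
            "--language=" + language, "--source-root", source_root]

def run_query_cmd(codeql, query, database):
    """Build the 'codeql query run' command line used above."""
    return [codeql, "query", "run", query, "-d", database]

def run(cmd):
    """Run a command and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(cmd, check=True)
```

Calling run(create_database_cmd(...)) followed by run(run_query_cmd(...)) reproduces the manual steps in one go.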

A simple taint demo

Following the bundled examples, it is not hard to come up with:

import python
import semmle.python.security.TaintTracking
import semmle.python.web.flask.Request

class SystemCommandExecution extends TaintTracking::Configuration {
    SystemCommandExecution() { this = "SystemCommandExecution Tracking" }

    override predicate isSource(DataFlow::Node src, TaintKind kind) {
        exists(FlaskRequestArgs taintSrc |
            src.asCfgNode() = taintSrc
            // and taintSrc.isSourceOf(kind)
            // The example calls .get(), whose result is not of the dict-of-strings kind,
            // so the condition above has to stay commented out.
        )
    }

    override predicate isSink(DataFlow::Node sink, TaintKind kind) {
        exists(CallNode call |
            call.getFunction().pointsTo(Value::named("subprocess.check_output")) and
            call.getArg(0) = sink.asCfgNode()
        )
    }
}

from SystemCommandExecution config, DataFlow::Node src, DataFlow::Node sink
where config.hasSimpleFlow(src, sink)
select sink, src

Now try it on a beefed-up version of the example:

import flask
import subprocess
from subprocess import check_output
from flask import request

app = flask.Flask(__name__)

@app.route('/index')
def index():
    return subprocess.check_output(flask.request.args.get('c', 'ls'))

@app.route('/index2')
def index2():
    tmp = flask.request.args.get('c', 'ls')
    tmp = tmp.split('|')
    return subprocess.check_output(tmp)

@app.route('/index3')
def index3():
    tmp = flask.request.args.get('c', 'ls')
    tmp = tmp.split('|')
    return check_output(tmp)

@app.route('/index4')
def index4():
    tmp = request.args.get('c', 'ls')
    tmp = tmp.split('|')
    return subprocess.check_output(tmp)

@app.route('/index5')
def index5():
    tmp = flask.request.args.get('c', 'ls')
    tmp = tmp + "i"
    return subprocess.check_output(tmp)

@app.route('/index6')
def index6():
    tmp = request.args.get('c', 'ls')
    tmp = tmp + "i"
    return subprocess.check_output(tmp)

@app.route('/index7')
def index7():
    tmp = request.args.get('c', 'ls')
    tmp = tmp + "i"
    return check_output(tmp)

app.run()

The result: index and index5 through index7 are detected, while index2 through index4 are not. The likely reason is that the result of split is not tainted.
Replace isSink with

override predicate isSink(DataFlow::Node sink, TaintKind kind) {
    "1" = "1"
}

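With every node accepted as a sink, the return value of tmp = tmp.split('|') still does not show up in the results, which confirms the guess.

After this round of testing, the biggest problem seems to be that writing and testing queries is too laborious: even for a tiny change, compiling and running takes 1 to 2 minutes. The same goes for the database, which can only be regenerated from scratch because there is no update option, so a lot of time is wasted.
The other issue is probably that the built-in taint tracking is not powerful enough. Fortunately there is an official interface for this (DataFlowExtension), so the next step is to figure out how to add extra taint steps; split, at the very least, surely has to go in.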
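
To make the idea of an extra taint step concrete, here is a toy model in Python. It is not how CodeQL works internally and does not use the DataFlowExtension API; it only assumes, based on the results above, that concatenation carries taint by default while split does not. All names in it are made up for illustration:

```python
def tainted_vars(statements, sources, extra_steps=()):
    """Propagate taint through a straight-line list of statements.

    statements:  (target, operation, argument) triples, executed in order.
    sources:     operations whose result is always tainted.
    extra_steps: additional operations that pass taint from argument to target.
    """
    propagating = {"assign", "concat"} | set(extra_steps)
    tainted = set()
    for target, operation, argument in statements:
        if operation in sources:
            tainted.add(target)
        elif operation in propagating and argument in tainted:
            tainted.add(target)
        else:
            # Overwriting a variable through an untracked operation drops its taint.
            tainted.discard(target)
    return tainted

# index2: tmp = request.args.get(...); tmp = tmp.split('|')
index2 = [("tmp", "request.args.get", None), ("tmp", "split", "tmp")]
# index5: tmp = request.args.get(...); tmp = tmp + "i"
index5 = [("tmp", "request.args.get", None), ("tmp", "concat", "tmp")]
```

With sources={"request.args.get"}, index5 keeps tmp tainted while index2 loses it; passing extra_steps={"split"} restores the taint for index2, which is the effect an extra taint step for split would need to have in the real query.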