Generating and downloading a TSV from a website using Python

How to generate and download a TSV from a website using Python

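I have this website and want to write a script that produces the same output as clicking "Export" -> "Generate TSV" -> waiting for generation -> "Download". The end goal is to use a list of 1,700 proteins stored in a .txt file (so take one protein, in this case 'Q9BXF6', and put it into the URL: https://www.ebi.ac.uk/interpro/protein/UniProt/Q9BXF6/entry/InterPro/#table) and then download all the results into .tsv files.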

I tried inspecting the "Export" button, but the source code does not show up (or I don't know where to look). I also tried:

    private String SaveImage(String imageURL) {


        Bitmap bitmap = null;
        try {
            // Download Image from URL
            URL testUrl = new URL(imageURL);
            URLConnection urlConnection = testUrl.openConnection();
            HttpURLConnection httpURLConnection = (HttpURLConnection) urlConnection;
            InputStream is = httpURLConnection.getInputStream();
            // Decode Bitmap
            BitmapFactory.Options options = new BitmapFactory.Options();
            options.inJustDecodeBounds = true;

            BitmapFactory.decodeStream(is,null,options);

            Boolean scaleByHeight = Math.abs(options.outHeight - 300) >= Math.abs(options.outWidth - 300);

            if(options.outHeight * options.outWidth * 2 >= 200*200*2){
                // Load,scaling to smallest power of 2 that'll get it <= desired dimensions
                double sampleSize = scaleByHeight
                        ? options.outHeight / 300
                        : options.outWidth / 300;
                options.inSampleSize =
                        (int)Math.pow(2d,Math.floor(
                                Math.log(sampleSize)/Math.log(2d)));
            }

            // Do the actual decoding
            options.inJustDecodeBounds = false;

            is.close();
            is = httpURLConnection.getInputStream();
            bitmap = BitmapFactory.decodeStream(is,options);
            is.close();

            String root = getApplicationContext().getFilesDir().toString();
            File myDir = new File(root + "/saved_images");
            myDir.mkdirs();
            Random generator = new Random();
            int n = 100000;
            n = generator.nextInt(n);
            String fname = "Image-" + n + ".png";
            File file = new File(myDir,fname);
            if (file.exists()) file.delete();
            FileOutputStream out = new FileOutputStream(file);
            bitmap.compress(Bitmap.CompressFormat.JPEG,70,out); //here
            out.flush();
            out.close();

            return getApplicationContext().getFilesDir().toString() + "/saved_images/" + "Image-" + n + ".png";

        } catch (IOException e) {
            e.printStackTrace();
        }

        return null;
    }

to locate what I need, but it outputs a lot of characters that I can't really make sense of. I also tried downloading the whole page, with:

import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.ebi.ac.uk/interpro/protein/UniProt/Q9BXF6/entry/InterPro/#table')
soup = BeautifulSoup(r.content, 'html.parser')

or

import urllib.request
myurl = 'https://www.ebi.ac.uk/interpro/protein/UniProt/Q9BXF6/entry/InterPro/#table'
with urllib.request.urlopen(myurl) as f:
    html = f.read().decode('utf-8')

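It seems that everything is written somewhere else and only referenced from the page, and everything I tried output something useless. I know nothing about HTML and am really new to Python (I have only used R).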

Solution

For the first problem, you can use the URL from the following element to retrieve the protein values you need for the next problem.

href="blob:https://www.ebi.ac.uk/806960aa-720c-4958-9392-f242adee627b"

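The URL is set in the href attribute, and you can use it to make a request that downloads the file. You can also find this element by right-clicking the TSV download button and choosing Inspect Element, where you will see the href attribute.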

After that, download the file, for example with:

import urllib.request

url = 'https://www.ebi.ac.uk/806960aa-720c-4958-9392-f242adee627b'
urllib.request.urlretrieve(url,'/Users/abc/Downloads/file.tsv') # any dir to save

with open("/Users/abc/Downloads/file.tsv") as file_in:
    for line in file_in:
        # Make the calls for your second problem here.
        fields = line.rstrip("\n").split("\t")  # one TSV row as a list
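To repeat the download for every protein in the .txt list, the loop could be organised roughly as in the sketch below. It is only a sketch under stated assumptions: the names `read_accessions`, `page_url`, `download_all` and `url_for` are hypothetical, and the list file is assumed to hold one UniProt accession per line. Because a `blob:` URL is normally only resolvable inside the browser session that created it, the sketch does not hard-code a download URL. Instead it takes `url_for`, a function you supply that maps an accession to a URL that actually serves the TSV.

```python
import os
import urllib.request

# Page URL pattern taken from the question; only the accession changes.
PAGE_TEMPLATE = "https://www.ebi.ac.uk/interpro/protein/UniProt/{acc}/entry/InterPro/#table"


def read_accessions(path):
    """Return the non-empty, stripped lines of the protein list file."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def page_url(accession):
    """Build the InterPro page URL for one UniProt accession."""
    return PAGE_TEMPLATE.format(acc=accession)


def download_all(accessions, url_for, out_dir, fetch=urllib.request.urlretrieve):
    """Save url_for(accession) as out_dir/<accession>.tsv for each accession.

    url_for is an assumption: a caller-supplied function returning a URL
    that serves the TSV directly. fetch can be swapped out for testing.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for acc in accessions:
        target = os.path.join(out_dir, acc + ".tsv")
        fetch(url_for(acc), target)
        saved.append(target)
    return saved
```

With a suitable `url_for`, a call such as `download_all(read_accessions("proteins.txt"), url_for, "tsv_out")` would write one file per protein; `proteins.txt` and `tsv_out` are placeholder names.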

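You can also use a web automation tool (such as Selenium) to solve this elegantly. If you are interested in that approach, look into it; it is not difficult.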
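A rough idea of what the Selenium route might look like is sketched below. Everything specific to the page is an assumption: the button labels ("Export", "Generate", "Download") are guessed from the click sequence described in the question, `text_xpath` and `export_tsv` are hypothetical helper names, and the sketch assumes Chrome with a matching driver is available. The file would land in the browser's default download folder.

```python
import time


def text_xpath(label):
    """XPath for a link or button whose visible text contains label."""
    return f"//*[self::a or self::button][contains(normalize-space(.), '{label}')]"


def export_tsv(url, timeout=60):
    """Open url in Chrome and click through the assumed export buttons."""
    # Imported lazily so text_xpath works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        wait = WebDriverWait(driver, timeout)
        # Assumed labels; each wait blocks until the element is clickable,
        # which also covers the "wait for generation" step.
        for label in ("Export", "Generate", "Download"):
            locator = (By.XPATH, text_xpath(label))
            wait.until(EC.element_to_be_clickable(locator)).click()
        time.sleep(2)  # give the browser a moment to write the file
    finally:
        driver.quit()
```

If the real labels or element types differ, only `text_xpath` and the label tuple should need adjusting.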