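I have the following Django models. I am not sure what the best way is to save these interrelated objects when a Scrapy pipeline writes the spider's scraped data to the database through Django. It seems a Scrapy pipeline is built to handle only one "kind" of item.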
models.py
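    from django.db import models

    class Parent(models.Model):
        # CharField needs a max_length; 80 is a placeholder
        field1 = models.CharField(max_length=80)

    class ParentX(models.Model):
        field2 = models.CharField(max_length=80)
        # on_delete is mandatory since Django 2.0
        parent = models.OneToOneField(
            Parent, on_delete=models.CASCADE, related_name='extra_properties')

    class Child(models.Model):
        field3 = models.CharField(max_length=80)
        parent = models.ForeignKey(
            Parent, on_delete=models.CASCADE, related_name='childs')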
items.py
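    # Uses DjangoItem: https://github.com/scrapy-plugins/scrapy-djangoitem
    from scrapy_djangoitem import DjangoItem

    # "myapp" is a placeholder; import from wherever models.py lives
    from myapp.models import Parent, ParentX, Child

    class ParentItem(DjangoItem):
        django_model = Parent

    class ParentXItem(DjangoItem):
        django_model = ParentX

    class ChildItem(DjangoItem):
        django_model = Child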
spiders.py
    import scrapy

    # "myproject" is a placeholder for the Scrapy project package
    from myproject.items import ParentItem, ParentXItem, ChildItem

    class MySpider(scrapy.Spider):
        name = "myspider"
        # must match the host in start_urls, or follow-up requests are filtered
        allowed_domains = ["example.com"]
        start_urls = [
            # this page has ids of several Parent objects whose full
            # details are in their individual pages
            "http://www.example.com",
        ]

        def parse(self, response):
            parent_object_ids = []  # list from scraping the ids of the parent objects
            for parent_id in parent_object_ids:
                url = "http://www.example.com/%s" % parent_id
                yield scrapy.Request(url, callback=self.parse_detail)

        def parse_detail(self, response):
            p = ParentItem()
            px = ParentXItem()
            c1 = ChildItem()
            c2 = ChildItem()
            # populate p, px and c1, c2 with various data from the response.body
            yield p
            yield px
            yield c1
            yield c2
            # ... etc. for c3, c4
pipelines.py (not sure what to do here)
    class ScrapytestPipeline(object):
        def process_item(self, item, spider):
            # This is where storage to the database typically happens.
            # Now, I don't know whether the item is a ParentItem,
            # a ParentXItem or a ChildItem.
            # Ideally, I want to first create the Parent obj, then the
            # ParentX obj (and point p.extra_properties = px), and then
            # the child objects: c1.parent = p, c2.parent = p
            # But I am not sure how to have the pipeline do this
            # sequentially, whatever order the items arrive in.
            return item
Solution
If you want to do it sequentially, I suppose that storing one item inside another and unpacking it in the pipeline might work.
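The nesting idea can be sketched without Scrapy or Django. In this minimal sketch the bundle is a plain dict, and `unpack_bundle`, `fake_save` and the keys `parent`, `extra` and `children` are hypothetical names chosen for illustration; in a real pipeline the save step would be whatever persists a Django model.

```python
from typing import Any, Callable, Dict, List

SaveFn = Callable[[str, Dict[str, Any]], Any]


def unpack_bundle(bundle: Dict[str, Any], save: SaveFn) -> List[Any]:
    """Save a nested bundle parent-first so related rows can point at it."""
    saved = []
    # 1. The parent has to exist before anything can reference it.
    parent = save("Parent", dict(bundle["parent"]))
    saved.append(parent)
    # 2. The one-to-one extension row points at the saved parent.
    saved.append(save("ParentX", dict(bundle["extra"], parent=parent)))
    # 3. Each child points at the same saved parent.
    for child in bundle["children"]:
        saved.append(save("Child", dict(child, parent=parent)))
    return saved


# Usage with an in-memory stand-in for the database.
rows: List[Dict[str, Any]] = []


def fake_save(kind: str, data: Dict[str, Any]) -> Dict[str, Any]:
    row = {"kind": kind, **data}
    rows.append(row)
    return row


bundle = {
    "parent": {"field1": "p"},
    "extra": {"field2": "x"},
    "children": [{"field3": "c1"}, {"field3": "c2"}],
}
unpack_bundle(bundle, fake_save)
```

Because the spider yields a single bundle per detail page, the pipeline decides the save order itself and no longer depends on the order in which separate items arrive.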
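I think it is easier to relate the objects before saving them to the database.
In spiders.py, when you "populate p, px and c1, c2 with various data from the response.body", you can also populate a "fake" primary key constructed from the object's data.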
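One possible way to build such a "fake" key is to hash a few values that identify the object, so the same page always yields the same key. This is only a sketch: the helper name `make_primary` and the choice of SHA-1 over the detail-page URL are assumptions, not something the original answer specifies.

```python
import hashlib


def make_primary(kind: str, *parts: str) -> str:
    """Build a deterministic key from the item kind and identifying values."""
    # Normalise the parts so trivial whitespace/case differences do not
    # produce a different key for the same object.
    normalised = [p.strip().lower() for p in parts]
    raw = "|".join([kind] + normalised)
    # A SHA-1 hex digest is 40 characters long.
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


# Usage: the parent key is derived from its detail-page URL, and a child
# key from the parent key plus a value that is unique within that parent.
parent_key = make_primary("parent", "http://www.example.com/42")
child_key = make_primary("child", parent_key, "first child")
```

The spider would then store such a key on each item it yields, and the pipeline can use it to look up an existing row.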
Then you can save the data, updating the model if the object has already been scraped, in a single pipeline:
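    from scrapy.exceptions import DropItem

    class ItemPersistencePipeline(object):
        def process_item(self, item, spider):
            try:
                item_model = item_to_model(item)
            except TypeError:
                # Not a DjangoItem: pass it through untouched
                return item

            model, created = get_or_create(item_model)

            try:
                update_model(model, item_model)
            except Exception as e:
                # process_item is expected to return an item or raise
                # DropItem, so do not return the exception itself
                raise DropItem("Could not save item: %s" % e)

            return item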
And of course the helper functions:
    from django.forms.models import model_to_dict

    def item_to_model(item):
        model_class = getattr(item, 'django_model', None)
        if not model_class:
            raise TypeError("Item is not a `DjangoItem` or is misconfigured")
        return item.instance

    def get_or_create(model):
        model_class = type(model)
        created = False
        try:
            # We have no unique identifier at the moment;
            # use model.primary for now
            obj = model_class.objects.get(primary=model.primary)
        except model_class.DoesNotExist:
            created = True
            obj = model  # DjangoItem created a model for us.
        return (obj, created)

    def update_model(destination, source, commit=True):
        pk = destination.pk
        # Note: model_to_dict yields raw key values for relation fields,
        # so ForeignKey/OneToOne fields may need separate handling
        source_dict = model_to_dict(source)
        for (key, value) in source_dict.items():
            setattr(destination, key, value)
        setattr(destination, 'pk', pk)
        if commit:
            destination.save()
        return destination
From: How to update DjangoItem in Scrapy
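You should also define a "primary" field in the Django models, so that the pipeline can check whether a newly scraped item already exists in the database.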
models.py
    from django.db import models

    class Parent(models.Model):
        field1 = models.CharField(max_length=80)
        # primary_key=True
        primary = models.CharField(max_length=80)

    class ParentX(models.Model):
        field2 = models.CharField(max_length=80)
        # on_delete is mandatory since Django 2.0
        parent = models.OneToOneField(
            Parent, on_delete=models.CASCADE, related_name='extra_properties')
        primary = models.CharField(max_length=80)

    class Child(models.Model):
        field3 = models.CharField(max_length=80)
        parent = models.ForeignKey(
            Parent, on_delete=models.CASCADE, related_name='childs')
        primary = models.CharField(max_length=80)